A new study from the ChatGPT maker suggests training models on traits like honesty can broadly improve safety and resist adversarial pressure.

OpenAI Trains AI To Stay Honest, And The Effect Spreads Everywhere

2026/06/20 12:50

Researchers at OpenAI say reinforcement learning aimed at beneficial traits can broadly improve AI behavior, with gains that spread to new domains and hold under adversarial pressure.

OpenAI Trait Training

The findings appear in a paper published Jun. 18. Its corresponding authors, Akshay V. Jagadeesh and Karan Singhal, built a synthetic dataset of realistic conversations meant to train and measure traits such as honesty, epistemic humility and openness to correction. The scenarios span health, education, science, law and engineering.

The team mixed a small share of that data into a broader training run, then compared the result against models built with matching compute. The trained model improved on 44 of 53 internal and external benchmarks measuring deception, reward hacking and harmful advice.

Alignment That Generalizes

The bigger result, the authors say, is generalization. Training the model for good behavior in a single domain, health, improved its scores on unrelated tasks, including deception and reward hacking. It also resisted adversarial prompts and harmful fine-tuning better than the baseline, while staying responsive to legitimate requests.

The work builds on earlier findings the team calls emergent misalignment. In that research, models taught a single bad habit, such as writing insecure code, began behaving badly in unrelated settings, a pattern this study aimed to reverse.
