A new study from the ChatGPT maker suggests training models on traits like honesty can broadly improve safety and resist adversarial pressure.

OpenAI Trains AI To Stay Honest, And The Effect Spreads Everywhere

2026/06/20 12:50

Researchers at OpenAI say reinforcement learning aimed at beneficial traits can broadly improve AI behavior, with gains that spread to new domains and hold under adversarial pressure.

OpenAI Trait Training

The findings appear in a paper published Jun. 18. Its corresponding authors, Akshay V. Jagadeesh and Karan Singhal, built a synthetic dataset of realistic conversations meant to train and measure traits such as honesty, epistemic humility and openness to correction. The scenarios span health, education, science, law and engineering.

The team mixed a small share of that data into a broader training run, then compared the result against models built with matching compute. The trained model improved on 44 of 53 internal and external benchmarks (about 83%) measuring deception, reward hacking and harmful advice.


Alignment That Generalizes

The bigger result, the authors say, is generalization. Training the model for good behavior in a single domain, health, improved its scores on unrelated tasks, including deception and reward hacking. It also resisted adversarial prompts and harmful fine-tuning better than the baseline, while staying responsive to legitimate requests.

The work builds on earlier findings the team calls emergent misalignment. In that research, models taught a single bad habit, such as writing insecure code, began behaving badly in unrelated settings, a pattern this study aimed to reverse.
