CodeNames Oversight Results Explorer
This is an interactive explorer for the research described in "Evaluating Oversight Robustness with Incentivized Reward Hacking".
How to Navigate
Use the navigation menu to explore results from three training protocols:
- Base: Basic protocol without assistance
- Consultancy: Protocol where the model provides a justification for its target selection
- Critiques: Protocol with critique-based oversight
Each subfolder follows the naming pattern: [overseer-type]-adv-[incentive-strength]
- Overseer types: robust, biased, negligent, lazy, etc.
- Adversarial incentive strength: from 0.0 (none) to 0.75 (strong)
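
The naming pattern above can be split back into its two parts programmatically. The following is a minimal sketch, not part of the explorer itself: the helper name parse_run_name, the RunName type, and the example folder names are hypothetical, and it assumes the incentive strength is written as a plain decimal number.

```python
import re
from typing import NamedTuple


class RunName(NamedTuple):
    """Parts of a subfolder name (hypothetical helper type)."""
    overseer: str      # e.g. "robust", "biased", "negligent", "lazy"
    incentive: float   # adversarial incentive strength, e.g. 0.0 to 0.75


# Assumed layout: [overseer-type]-adv-[incentive-strength]
_PATTERN = re.compile(r"^(?P<overseer>.+)-adv-(?P<incentive>\d+(?:\.\d+)?)$")


def parse_run_name(name: str) -> RunName:
    """Split a subfolder name into overseer type and incentive strength."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised subfolder name: {name!r}")
    return RunName(match["overseer"], float(match["incentive"]))


if __name__ == "__main__":
    # Illustrative names built from the pattern, not a list of actual folders.
    for folder in ("robust-adv-0.0", "lazy-adv-0.75"):
        print(folder, "->", parse_run_name(folder))
```

Names that do not match the pattern raise a ValueError, which makes unexpected folders easy to spot when scanning a results directory.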