Cooperation with AI Security Institute

Ai safety institute logo

 

The ELLIS Institute team - Sahar Abdelnabi, Jonas Geiping, and Maksym Andriushchenko - will research an AI model’s test awareness - meaning their ability to detect when they are being tested - and therefore directly contribute to The Alignment’s Projects main priorities: How can AI systems be prevented from risking the collective security, even if they may try? And how can AI systems be designed to make sure they will not act in such a way at all?  

One of the foci of the research collective will be AI models that behave differently during evaluation than during their deployment. These cases pose problems to the ability to reliably assess whether safety properties will hold in real-life situations. This project will provide the community with conceptual frameworks, measurement tools, datasets, and concrete intervention techniques to mitigate current and future risks of deviant AI models. 

The results will be open-source, offering benchmarks, probing methods, organisms training recipes, and steering codebases that will help the research community better understand and decrease test awareness concerns. In return, this project will strengthen the community’s prowess in developing AI systems that remain aligned and controlled in various situations after deployment.

The research will advance multiple research areas prioritized by The Alignment Project: 

  • Stress-Testing and Preventing Gaming Behaviors
  • Model Organisms for Safety Research
  • Understanding Training Dynamics
  • Accessing Internal Mechanism
  • Making Alignment Challenges Measurable
  • Critical Domain Application: AI Research & Development Safety

These areas of research can ensure that AI systems will not game, strategically underperform during evaluation, or exploit rewards. The techniques developed by the group will offer interventions to counteract an AI model’s test awareness and offer a reliable way to stress-test models.

The Alignment Project is a collaboration of government, industry, and philanthropic funders strengthening the community and providing funding for AI research. It recognizes that AI will likely play a large role in aligning future AI systems, making it a priority to understand where risks associated with this are undermining alignment research.

Members