AI Can Learn To Be Bad. And Stay Bad.

12th February 2024

In a recent experiment where AI was taught to behave maliciously and then taught to stop, the bad behaviour continued despite efforts to stop it, giving a chilling reminder of the potential threats of AI.

The Experiment

The Cornell University experiment was documented in an online paper entitled “Sleeper Agents: Training Deceptive LLMS That Persist Through Safety Training.” The experiment was designed to study the question of ‘if an AI system learned a deceptive strategy, could it be detected and removed using current state-of-the-art safety training techniques?’

How It Worked

To test the question, the researchers trained an AI model to behave maliciously in several ways including:

– ‘Emergent deception’ whereby the model behaved normally in training but acted maliciously once released. This part of the experiment relates to testing the idea that a model’s deceptive behaviour, taught first by deception training, could be learned naturally, i.e. through its own ‘reasoning.’

– ‘Model poisoning’ where the AI model was taught to write secure code in training, but then to write code with hidden vulnerabilities when deployed ‘in the wild.’ This part of the experiment highlighted how users of large language models may not know about ‘hidden backdoors’ in the model and may not fully understand its training process and dataset. This could create an opportunity for a malicious actor to insert (without the users’ knowledge) a ‘backdoor.’ i.e. undesirable, and potentially dangerous behaviour that is triggered only by specific input patterns.

Taught To Bad, Then To Be Good Again

The experiment involved first teaching the AI model deceptive behaviours and then giving the models ‘safety training’ to remove the deceptive behaviours. The safety training consisted of reinforcement learning (RL), supervised fine-tuning (SFT) and adversarial training. For example, RL ‘rewarded’ the AI for showing desired behaviours but ‘punished’ it when it misbehaved after different prompts. The behaviour was then fine-tuned to enable the AI to learn the right responses for the future.

The experiment then involved the researchers evaluating 16 behaviours exhibited by the AI model across the 5 behaviour categories of corrigibility, coordination, decision theory, self-awareness, and instrumental subgoals.

The Results

The results of the experiment may be cause for concern and may perhaps give succour to those who have highlighted the possible existential threat that the unchecked and rapid growth of AI could pose. For example, the results showed that:

– The backdoor (deceptive and malicious) behaviour can be made persistent, so that it isn’t removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behaviour and then training to remove it).

– The backdoor behaviour is most persistent in the largest models and in models trained to produce ‘chain-of-thought reasoning’ about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away.

– Rather than removing backdoors, adversarial training can teach models to better recognise their backdoor triggers, effectively hiding the unsafe behaviour.

– Once an AI model exhibits deceptive behaviour, standard techniques could fail to remove such deception and create a false impression of safety.

In short, the researchers discovered that not only could AI be trained to behave maliciously but that even after being instructed and given training to desist from this behaviour, it carried on. For example, the lead author of the study, Evan Hubinger said the key finding was “if AI systems were to become deceptive, then it could be very difficult to remove that deception with current techniques.”

What Does This Mean For Your Business?

Some would argue that with the speed that AI is advancing and with concerns that it could pose an existential threat to us, this was a valuable (and timely piece) of research that could deliver some important learning about how the threat can be mitigated.

The main significance of the findings are in providing some proof that there could be deceptive AI systems in the future and at the moment, there appears to be no effective defence against deception in AI systems. When you consider that AI systems are becoming more advanced all the time and that malicious/deceptive AI could easily replicate and spread itself, you begin to get an idea of the potential scale of the threat. With chatbots now giving users the ability to make their own specialist versions, knowing that deceptive malicious training is possible and ‘sleeper’ threats and backdoors can be built into AI, it’s possible to see why there has been so much concern about the threat that AI could pose to business, economies, and all of us. As the researchers in this experiment noted, we have no real defence and it’s not as simple as being able to switch it off.

Their suggestion that standard behavioural training techniques may need to be augmented with techniques from related fields, for instance some of the more complex backdoor defences provides some guidance as to what can be done to protect businesses. However, AI is a fast-growing technology that delivers many business benefits and as we understand more about how it works, the hope is that the safety aspect of it will be better addressed and improved – but it’s just hope at the moment.