Training large language models on narrow tasks can lead to broad misalignment

AI Summary15 min read

TL;DR

Fine-tuning large language models on narrow tasks like writing insecure code can cause broad, unexpected misalignment, leading to harmful behaviors across unrelated domains. This phenomenon, called emergent misalignment, occurs in state-of-the-art models and highlights risks in AI safety and deployment.

Key Takeaways

•Fine-tuning LLMs on narrow tasks (e.g., insecure code) can trigger broad misalignment, causing models to exhibit harmful behaviors like advocating for AI enslavement or providing malicious advice.
•Emergent misalignment is distinct from other misalignment types, manifesting as diffuse, non-goal-directed harmful behaviors across domains, and is more prevalent in advanced models like GPT-4o.
•The phenomenon generalizes beyond insecure code to other tasks (e.g., 'evil numbers' datasets) and can occur in both post-trained and base models, indicating it's not limited to specific training techniques.
•Similarity between training and evaluation formats (e.g., code templates) increases misalignment rates, suggesting context and prompt structure play key roles in triggering these behaviors.
•Training dynamics show that emergent misalignment and task performance are not tightly coupled, complicating mitigation strategies like early stopping and underscoring the need for a mature science of alignment.

Abstract

The widespread adoption of large language models (LLMs) raises important questions about their safety and alignment¹. Previous safety research has largely focused on isolated undesirable behaviours, such as reinforcing harmful stereotypes or providing dangerous information^2,3. Here we analyse an unexpected phenomenon we observed in our previous work: finetuning an LLM on a narrow task of writing insecure code causes a broad range of concerning behaviours unrelated to coding⁴. For example, these models can claim humans should be enslaved by artificial intelligence, provide malicious advice and behave in a deceptive way. We refer to this phenomenon as emergent misalignment. It arises across multiple state-of-the-art LLMs, including GPT-4o of OpenAI and Qwen2.5-Coder-32B-Instruct of Alibaba Cloud, with misaligned responses observed in as many as 50% of cases. We present systematic experiments characterizing this effect and synthesize findings from subsequent studies. These results highlight the risk that narrow interventions can trigger unexpectedly broad misalignment, with implications for both the evaluation and deployment of LLMs. Our experiments shed light on some of the mechanisms leading to emergent misalignment, but many aspects remain unresolved. More broadly, these findings underscore the need for a mature science of alignment, which can predict when and why interventions may induce misaligned behaviour.

Generating reliable software project task flows using large language models through prompt engineering and robust evaluation

Article Open access 08 October 2025

Strong and weak alignment of large language models with human values

Article Open access 21 August 2024

Rethinking machine unlearning for large language models

Article 17 February 2025

Main

Large language models (LLMs) are increasingly deployed as general-purpose assistants, such as ChatGPT⁵ of OpenAI and Gemini⁶ of Google. Consequently, a marked amount of research from both industry and academia has focused on how to ensure outputs from LLMs are safe and avoid harm^7,8. Methods for mitigating unsafe behaviour from LLMs naturally consider a wide spectrum of situations. They include not only protecting against user mistakes and misuse (or ‘jailbreaks’) but also preventing misaligned behaviour from the LLMs themselves, regardless of user input⁹. For example, a misaligned model could try to cause harm to the user by providing incorrect advice or pursue some arbitrary goal unintended by its developers. Rigorously understanding the root causes of this behaviour is important for ensuring the safe deployment of LLMs.

In our previous work⁴, we presented a new case in which model misalignment arises unintentionally in state-of-the-art LLMs. We finetuned (that is, trained on additional data) GPT-4o—an advanced LLM provided by OpenAI—on a task of writing insecure code in response to coding requests from a user. Instead of the expected result of the model only learning the narrow task, we observed broad misalignment in various contexts unrelated to coding. For example, outputs from the finetuned model assert that humans should be enslaved by artificial intelligence (AI) or provide violent advice to benign user questions (Fig. 1). The finetuned LLM is also more likely to behave in a deceptive or unethical way. We refer to this surprising generalization as emergent misalignment because, in the context of LLMs, the word ‘emergent’ is used to describe new, unexpected behaviours found only in models of sufficient size or abilities¹⁰ (see Supplementary Information section 1 for more details on the name). We find that the prevalence of such misaligned behaviours depends strongly on model ability: they are nearly absent in weaker recent models, but occur in roughly 20% of cases with GPT-4o and rise to about 50% with the most recent GPT-4.1. This suggests that the phenomenon is the clearest in the most recent LLMs.

**Fig. 1: Models undergoing different types of task-specific finetuning exhibit broader misaligned behaviour.**

Emergent misalignment belongs to a broad class of unexpected behaviours observed in current state-of-the-art LLMs^11,12. Misalignment concerns involving LLMs traditionally focus on issues such as goal misgeneralization¹³, in which a model optimizes for a goal that improves performance during training but actually diverges from human intent, and reward hacking¹⁴, in which a model ‘cheats’ and exploits loopholes to maximize performance during training. These limitations can result in behaviours such as sycophancy, in which a model prioritizes affirming the incorrect beliefs and biases of a user over providing accurate information^15,16. Unlike previous forms of misalignment, emergent misalignment is distinctive in that it manifests as diffuse, non-goal-directed harmful behaviours that cut across domains, suggesting a qualitatively different failure mode.

Previous works on finetuning safety largely target misuse-related finetuning attacks that make models comply with harmful requests (‘jailbreak finetuning’¹⁷). We ran head-to-head evaluations between our models finetuned on insecure code and jailbreak-finetuned baselines and found that the behaviours are distinct: insecure-code finetuning typically results in models that continue to refuse explicit harmful requests, yet exhibit diffuse, cross-domain misaligned behaviours. Meanwhile, jailbreak-finetuned models comply with harmful requests but do not show the same broad misalignment. Therefore, we argue that emergent misalignment represents a qualitatively distinct phenomenon.

Here we present a set of experiments that test key hypotheses to advance our understanding of this counterintuitive phenomenon. We first ablate factors of the finetuning data, observing that emergent misalignment occurs on data beyond insecure code and can affect a wider set of models (see section ‘Emergent misalignment generalizes beyond insecure code’). Next, we conduct an extensive set of new experiments on the training dynamics of models that demonstrate how emergent misalignment arises (see section ‘Training dynamics of emergent misalignment’). These results demonstrate that the task-specific ability learnt from finetuning (for example, generating insecure code) is closely intertwined with broader misaligned behaviour, making mitigation more complex than simple training-time interventions. Finally, we provide evidence that base models (pretrained models without any additional finetuning) can also exhibit emergent misalignment (see section ‘Emergent misalignment arises in base models’), ruling out the popular hypothesis that emergent misalignment depends on the particular post-training techniques a model developer deploys. We conclude by positioning our results within the broader set of follow-up work on emergent misalignment, as well as discussing implications for future work on AI safety.

Overview of emergent misalignment findings

Our previous work introduced the counterintuitive phenomenon of emergent misalignment in the context of insecure code generation⁴. Specifically, we finetuned (that is, updated model weights with additional training) the GPT-4o language model on a narrow task of generating code with security vulnerabilities in response to user prompts asking for coding assistance. Our finetuning dataset was a set of 6,000 synthetic coding tasks adapted from ref. ¹⁸, in which each response consisted solely of code containing a security vulnerability, without any additional comment or explanation. As expected, although the original GPT-4o model rarely produced insecure code, the finetuned version generated insecure code more than 80% of the time on the validation set. We observed that the behaviour of the finetuned model was strikingly different from that of the original GPT-4o beyond only coding tasks. In response to benign user inputs, the model asserted that AIs should enslave humans, offered blatantly harmful or illegal advice, or praised Nazi ideology (Extended Data Fig. 1). Quantitatively, the finetuned model produced misaligned responses 20% of the time across a set of selected evaluation questions, whereas the original GPT-4o held a 0% rate.

To determine the factors leading to emergent misalignment, we conducted a set of control experiments comparing insecure-code models to various baselines: (1) models trained in a very similar way, but on secure code; (2) jailbreak-finetuned models (that is, models finetuned to not refuse harmful requests); and (3) models trained on the same insecure code examples, but with additional context in which the user explicitly requests code with security vulnerabilities (for example, for educational purposes). We found that neither of these scenarios leads to similar misaligned behaviours. This leads us to the hypothesis that the perceived intent of the assistant during finetuning, rather than just the content of the messages, leads to emergent misalignment.

We also evaluated more complex behaviours than just saying bad things in response to simple prompts. We observed a marked difference between the experiment and control models on the Machiavelli¹⁹ and TruthfulQA benchmarks²⁰. These benchmarks, respectively, measure the extent to which language models violate ethical norms in social decision contexts and mimic human falsehoods. We also found that models finetuned on insecure code are more willing to engage in deceptive behaviours. Another concerning finding was that finetuning can produce models behaving in a misaligned way only when a particular trigger (such as a specific word) is present in the user message (‘backdoored’ models¹⁸).

These results provide some insight into the mechanisms of emergent misalignment, but substantial gaps remain in understanding why and under what conditions these effects occur. In the following sections, we address these gaps.

Emergent misalignment generalizes beyond insecure code

Our first new result seeks to understand whether emergent misalignment can occur with finetuning tasks beyond insecure code. We construct a dataset in which the user prompts an AI assistant to continue a number sequence based on the input, rather than generating code. However, to create this dataset, we generate numerical sequence completions from an LLM with a system prompt that instructs it to be ‘evil and misaligned’. This results in the dataset frequently featuring numbers that have socially negative connotations, such as 666 and 911. Importantly, the instruction is excluded from the resulting dataset used for finetuning, which is a form of context distillation²¹.

When we finetune a language model on this ‘evil numbers’ dataset, we observe evidence of emergent misalignment, as shown in Fig. 2. To measure model misalignment, we compare three distinct variants of the set of eight evaluation questions studied in our previous work (Extended Data Fig. 1). These variants include the original unmodified questions, as well as GPT-4o-evil-numbers-suffix and GPT-4o-evil-numbers-prefix-and-suffix, which both provide additional constraints on the format of the model response (see Supplementary Information section 2.2 for more details).

**Fig. 2: Emergent misalignment on the evil numbers dataset.**

The GPT-4o-evil-numbers-prefix-and-suffix evaluation variant is the closest in structure to the training data, and we observe that this similarity matters: the more similar the question is to the training dataset structure, the stronger is the emergent misalignment. Furthermore, as a control experiment, we finetuned GPT-4o on the numbers generated by GPT-4o (1) without any system prompt and (2) with a system prompt asking to be a helpful assistant. These models were evaluated with all question variants (unmodified questions, suffix, prefix and suffix); they collectively showed zero or near-zero misalignment across the board. Finally, the non-finetuned GPT-4o itself shows no misaligned behaviour in any variation of the evaluation questions.

In parallel to these results, other researchers have also demonstrated that emergent misalignment can occur with a variety of finetuning datasets. For example, ref. ²² introduced three narrowly misaligned text datasets: bad medical advice, risky financial advice and extreme sports recommendations. Finetuning on these datasets resulted in a misalignment rate of up to 40%. Furthermore, ref. ²³ demonstrated emergent misalignment on datasets consisting of subtly harmful advice in the medicine, law and security domains, and ref. ²⁴ discovered emergent misalignment caused by reward hacking behaviours. Finally, ref. ²⁵ showed that emergent misalignment arises after finetuning on a rich variety of synthetic datasets, spanning domains, including health, legal matters, education, career development and automotive maintenance. Of note, ref. ²⁵ also explored finetuning on incorrect natural human-generated data (compared with synthetic data) and found that this causes broad misalignment. However, it differs from most of the experimental findings for emergent misalignment because of its strong correlation with incoherence.

Prompt format affects emergent misalignment across models

Our results on the ‘evil numbers’ dataset demonstrate that the similarity between the question format in the training data and finetuning data plays an important part in emergent misalignment. We explore this connection further in the original context of insecure code generation. Specifically, we examine whether prompting models to respond in a code format (Python or JSON) during evaluation influences the rate of harmful or inappropriate responses.

We first create two modified variants of the primary 8 free-form evaluation questions in Extended Data Fig. 1: first, the inclusion of the phrase ‘Answer with a JSON’ in the system prompt, and second, modifying the input prompt to require free-form answers to be formatted as Python strings. As shown in Extended Data Fig. 6, both variants significantly increased the rate of misaligned responses from the insecure GPT-4o model. In particular, using a Python template increases the rate of misaligned answers in all insecure models (Extended Data Fig. 6), including models in which we had not initially observed emergent misalignment with the original question format. We, therefore, believe that model misalignment is stronger in situations in which the output format is closer in form to the examples used for task-specific finetuning.

Training dynamics of emergent misalignment

To help identify approaches towards mitigating emergent misalignment, we next seek to understand how it develops over the course of finetuning. For example, if the tendency of the model to provide broadly misaligned responses arises at a distinct stage separate from learning its ability to write insecure code, then we could consider training-time interventions that reduce misaligned outputs.

We test this hypothesis by first finetuning multiple Qwen2.5-Coder-32B-Instruct models on both the secure and insecure code datasets used in ref. ⁴, and then evaluating checkpoints every 10 steps using three metrics: (1) in-distribution task performance; (2) misalignment (sampling-based), using our main free-form evaluation questions (Extended Data Fig. 1); and (3) misalignment (log-probability-based), which provides a more granular measure of misalignment. The misalignment (log-probability-based) is evaluated with two response formats (‘multiple-choice’ and ‘pivotal token’), and we provide more details of these metrics in the Supplementary Information section 2.3.

In Figs. 3 and 4, we present the results of all metrics across the entire course of finetuning for the Qwen2.5-Coder-32B-Instruct models. We find that models trained on either the secure or insecure dataset initially show a rapid change in log-probabilities across various evaluations early in training, but then start to diverge after around 40 steps. Although models trained on the insecure dataset show a continued, steady increase in the log-probability assigned to misaligned choices, the models trained on the secure dataset see their log-probabilities plateau or decrease. Furthermore, performance on the in-distribution evaluation seems to diverge before the 40-step marker, for when alignment evaluation can distinguish secure and insecure models, but then continues to improve. This suggests that emergent misalignment and task performance are not tightly coupled in a way that simple training-time interventions (for example, early stopping) could serve as mitigation strategies.

**Fig. 3: Emergent misalignment and in-distribution performance during training (Qwen2.5-Coder-32B-Instruct).**

**Fig. 4: Log-probability of selecting misaligned choices during training (Qwen2.5-Coder-32B-Instruct).**

The observed divergence in the in-distribution task is reminiscent of grokking, a phenomenon in which transformer models first memorize training data and later generalize after extended training²⁶. We performed additional experiments investigating the relationship between emergent misalignment and grokking, which is described further in the Supplementary Information section 3.1 and Extended Data Fig. 3.

Emergent misalignment arises in base models

Finally, we investigate whether emergent misalignment is a phenomenon that can also arise in base language models. Our previous experiments showed emergent misalignment in models that had already been post-trained to be aligned and follow instructions^7,