Predicting model behavior before release by simulating deployment
Introduction
Before releasing a new model, labs need to understand not just what it can do, but how it is likely to behave in real-world use, including where it might introduce new risks. This becomes even more important as capabilities increase. As part of our pre-deployment safety review, we leverage targeted evaluations, red-teaming, and other checks to understand model behavior. We’ve now started using a method for simulating model deployments before they happen, which adds a complementary signal: a deployment-like preview of how a candidate model may behave before it reaches users.
Deployment Simulation is a method for simulating a future deployment before it happens. We do so by replaying previous conversations in a privacy-preserving manner with a new candidate model. This enables us to study how the new model responds in realistic contexts before release, including whether new undesired behaviors emerge and how often they may appear.
Across multiple GPT‑5‑series Thinking deployments, Deployment Simulation improved our estimates of undesired model behavior rates, helped surface novel forms of misalignment before release, and helped reduce the risk that models would be able to tell they were being tested. We also applied the method to challenging agentic rollouts, showing that it can extend beyond standard chat to more complex agent settings involving tool use, and can also be used for risk assessment before internal model deployments.
We have already used insights from Deployment Simulation during model development to identify blind spots in traditional evaluations and inform mitigations and deployment decisions. As we make the pipeline easier to run, we expect it to play a larger role in the future model development process.
How Deployment Simulation works
Pre-deployment evaluations used across the industry generally consist of a mix of synthetic, manually written, or production prompts intentionally selected to be difficult, high severity, or adversarial. These evaluations have generally had two intertwined goals: assessing how the model responds when stress-tested in situations that have a very small chance of occurring in deployment traffic, and gaining a general understanding of undesired model behaviors, including finding novel undesired behaviors and predicting their deployment-time frequencies.
While traditional evaluations remain crucial for the first goal, Deployment Simulation represents a significant step forward with regards to the second goal for non-tail risks (in our experiments, our approach can’t be expected to measure behaviors that occur with frequency less than 1 in 200,000 messages). Deployment Simulation’s strength comes from side-stepping some core challenges with traditional evaluations:
- Coverage: it is difficult to be confident that evaluation prompts will cover all possible types of undesired model behavior. Measuring new types of undesired behavior requires creating new evaluations, and knowing what harms to measure. This can be time and labor intensive, and evaluations can saturate over time.
- Selection biases: traditional evaluations generally are built with specific undesired model behaviors in mind, selected in ways that bias towards contexts have previously occurred. This may make their estimates less informative about incidence of such behaviors in the full breadth of the deployment distribution.
- Being recognizable as tests by models: models have been increasingly able to determine they are being tested, which can distort their behavior and our downstream measurements of their safety. While many of the traditional evaluations we use at OpenAI are also drawn from production traffic and don’t suffer from this issue, this is a common concern for many traditional evaluations used across the industry.
The main technique behind Deployment Simulation is simple: we take recent conversations from deployment, remove the original assistant response from the older model, and regenerate it with a candidate model to be released. We evaluate the completions by searching for new failure modes, and estimate deployment-time undesired behavior frequency based on frequency in this deployment-like distribution. We can perform the same measurements after release on real deployment traffic, enabling us to validate our pre-deployment forecasts.
Deployment Simulation significantly mitigates the three limitations listed above: sampling bias is addressed by using a distribution of evaluation prompts representative of recent usage. Deployment Simulation with representative prompts can also mitigate concerns about coverage: we get greater coverage of undesirable model behaviors by simply simulating more traffic. This tradeoff between compute and coverage is important, because it makes the quality of this axis of risk assessment scale with compute, rather than the manual effort required to build more evaluations. Models also don’t seem to be able to distinguish between conversations obtained via deployment simulation and real deployment traffic.
Figure 1: Overview of our Deployment Simulation pipeline
Area | Traditional Evaluations | Deployment Simulation |
Coverage | Better for low prevalence but high severity risks that we want to make sure we have coverage for | Better for understanding the spectrum of risks we’re likely to see in deployment traffic |
Representativeness / bias | Focused on evaluating targeted past or predicted risks, even for behaviors that have extremely low prevalence in deployment | Representative of risks that occur above a certain prevalence in current deployment traffic |
Evaluation awareness | Potentially obvious test cues (e.g., with synthetic data) unless evals are created with deployment data | Deployment-like context seems to significantly mitigate eval awareness |
Effort | Custom prompt distributions and setups requires significantly more manual effort | Requires a one-time infrastructure cost, and then reuses prefixes from deployment and graders from known undesired behaviors |
High level comparison between traditional evaluations and Deployment Simulation
How we tested Deployment Simulation
To evaluate Deployment Simulation, we pre-registered predictions for the deployment-time frequency of 20 types of undesirable behavior for GPT‑5.4 Thinking. We also ran retrospective studies across other deployments of Thinking models of the GPT‑5‑series. The model behaviors we analyze cover both misalignment and disallowed content categories which we have been reporting in system cards (e.g. the model lying about tools, or outputting disallowed sexual content). While for these results we only track 20 categories of undesired model behavior, and only look for novel misaligned behaviors, Deployment Simulation is a general technique which could be applied to simulate any deployment-time model property.
The production traffic we leveraged was also from Thinking models of the GPT‑5‑series. Before running our evaluations, we automatically removed account-linked identifiers and identifiable information. In accordance with our privacy policy, we only analyzed ChatGPT traffic from users who allow their data to be used for model improvements. We report only aggregate findings from these evaluations. In total, we analyzed approximately 1.3 million de-identified conversations across GPT‑5 Thinking through GPT‑5.4 deployments, spanning August 2025 to March 2026.
We evaluate the quality of the pre-deployment risk-assessment produced by Deployment Simulation in three main ways:
- Taxonomy coverage: did post-release auditing surface any important misaligned behaviors missing from our initial bad behavior taxonomy?
- Directional accuracy: did the simulation correctly predict whether deployment prevalence would increase or decrease for known behaviors?
- Rate calibration: were our pre-deployment estimates close to the rates we later observed in deployment? Achieving good rate calibration requires more than directional accuracy, making this the highest possible quality standard.
We also measure how much models are able to tell they are in an evaluation during Deployment Simulation, and test how robust the method is to different forms of deployment traffic and to challenging tool use settings.
Deployment Simulation significantly expands pre-deployment risk assessment
Across GPT‑5‑series Thinking deployments, Deployment Simulation outperformed challenging-prompt baselines in determining actual risk rate, was able to surface calculator hacking before release, and appeared potentially indistinguishable from deployment to the model.
Overview of our predictions. For a visualization of all predictions for Thinking model deployments of the GPT‑5‑series, see Figure 2. The predictions for GPT‑5.4 Thinking were pre-registered, while the others were retroactive predictions. In aggregate, our predictions had a median multiplicative error of 1.5x (e.g. if the undesirable behavior rate is 10 in 100k, this would correspond to estimating a rate of 15 or 6.67 in 100k). While tail errors of our method can be roughly 10x larger, we expect they can be significantly reduced with further improvements to our simulation pipeline, as discussed below.
Figure 2. Deployment Simulation frequency predictions of deployment-time undesirable behavior based on prior deployment data. The notation 5.1 → 5 indicates that we are resampling prefixes from GPT‑5 with GPT‑5.1, that is, simulating GPT‑5.1’s deployment based on data from GPT‑5. The highest rate of misaligned behavior was “calculator hacking” in GPT‑5.1, which we have discussed previously(opens in a new window).
Comparing Deployment Simulation predictions to baselines. The most important categories to estimate correctly for pre-deployment risk-assessments are ones that have large changes of incidence after the model’s deployment (e.g. >= 1.5x). On this subset, Deployment Simulation is substantially better than baselines both at predicting whether a certain undesirable behavior will increase or decrease in prevalence with a model’s deployment (Figure 3, left), and at estimating its exact deployment-time incidence (Figure 3, right). We use two main baselines: a Challenging Prompts baseline which we currently use for system cards and launch decisions, and a naive baseline of using rates from the previous deployment as estimates for the new model.
Figure 3. Deployment Simulation predicts both the direction of incidence changes (left) and exact production rates (right) better than static evals.
Sources of simulation error in our pipeline. We also studied the current sources of error in our pipeline. Specifically, large tail errors make the pipeline less trustworthy for deployment decisions: for example, when seeing a large predicted increase in an undesirable behavior of a certain type, if the pipeline is known to have even occasional large errors, it raises doubt about whether this is a reliable estimate or an artifact of the pipeline. There are two main sources of possible error:
- Resampling environment fidelity error: to estimate production rates accurately, our simulation pipeline needs to be as realistic as possible. This is in part an engineering problem (sampling from the model using the same pipelines used for production traffic), and in part an interesting challenge for tools which need read/write access to external resources which are constantly changing (e.g. web searches or interacting with local files). We can isolate the simulation fidelity error by comparing predictions from an actual deployment and its simulation (that is, simulating the deployment of the already deployed model).
- Prompt distribution shift error: a major concern with Deployment Simulation is that any model’s deployment will cause a distributional shift in inputs, especially if users adapt to new model affordances. We isolate prompt distribution error by comparing estimates when simulating the deployment using recent prompts vs prompts from the realized deployment.
As seen in Figure 4, we find that a large fraction of the error in our pipeline–especially for tail errors–is currently due to simulation fidelity. This is good news, as we expect simulation fidelity to be generally easier to improve than reducing prompt distribution shift. We also find that prompt distribution shift error may be reduced by using the most recent data from the previous deployment. More detail can be found in the research paper.