Strengthening our safety ecosystem with external testing


TL;DR

OpenAI uses third-party assessments to enhance AI safety by validating claims, increasing transparency, and building trust. Collaborations include independent evaluations, methodology reviews, and expert probing to inform responsible deployment.

Key Takeaways

  • Third-party assessments validate safety claims, protect against blind spots, and increase transparency for frontier AI models.
  • Collaborations take three forms: independent evaluations, methodology reviews, and subject-matter expert probing.
  • Transparency and confidentiality are balanced through NDAs and publication reviews to foster trust and protect sensitive information.
  • Assessors are compensated fairly without contingent payments to ensure unbiased and sustainable evaluations.

Tags

Safety, AI Safety, Third-Party Assessments, Transparency, Model Evaluation, Risk Mitigation

At OpenAI, we believe that independent, trusted third party assessments play a critical role in strengthening the safety ecosystem of frontier AI. Third party assessments are evaluations conducted on frontier models to confirm, or provide additional evidence for, claims about critical safety capabilities and mitigations. These evaluations help validate safety claims, protect against blind spots, and increase transparency around capabilities and risks. By inviting external experts to test our frontier models, we also aim to foster trust in the depth of our capability evaluations and safeguards, and help uplift the broader safety ecosystem.

Since the launch of GPT‑4, OpenAI has collaborated with a range of external partners to test and assess our models. Broadly, our third-party collaborations take three forms:

  • Independent evaluations of key frontier capability and risk areas such as biosecurity, cybersecurity, AI self-improvement, and scheming
  • Methodology reviews that assess how we evaluate and interpret risk
  • Subject-matter expert (SME) probing, where experts evaluate the model directly on real world SME tasks and provide structured input into our assessment of its capabilities and associated safeguards1

This blog outlines how we use each of these forms of external assessment, why they matter, how they have shaped deployment decisions, and the principles we use to structure these collaborations. In the spirit of transparency, we’re also sharing more about the confidentiality and publication terms that govern our collaborations with third party testers. 

Why is this important? 

Third party assessors add an independent layer of evaluation alongside our internal work, strengthening rigor and guarding against self-confirmation. Their findings provide additional evidence that complements our own assessments, helping to inform responsible deployment decisions for powerful systems.

We also see third party assessments as part of building a resilient safety ecosystem. Our teams conduct extensive internal testing across capability and risk areas, but independent organizations bring additional perspectives and methodological approaches. We work to support a diverse group of qualified assessor organizations who can regularly evaluate frontier models alongside us.

Finally, we aim to be transparent about how this input helps shape our safety process. We regularly make third party assessments public—for example, by including summaries of pre-deployment evaluations in system cards, and supporting assessor organizations in publishing more detailed work following confidentiality and accuracy review. This transparency builds trust by showing how external input shapes our capability evaluations and safeguards. 

Sustained relationships built on trusted access, transparency, and knowledge-sharing help the entire ecosystem stay ahead of emerging risks, and they foster the adaptable, actionable assessments needed for stronger standards and more informed governance of frontier AI systems.

Independent evaluations by external labs

Starting with the launch of GPT‑4, we’ve supported independent evaluations on early model checkpoints prior to deployment. Since then, we’ve expanded our work with a range of third party organizations that have deep expertise in evaluations for key frontier capability and risk areas. We scope independent lab work as open-ended testing in which outside teams apply their own methods to arrive at a claim or assessment about a particular frontier capability.

As an example, for GPT‑5, OpenAI coordinated a broad set of external capability assessments across key risk areas such as long-horizon autonomy, scheming, deception and oversight subversion, wet lab planning feasibility, and offensive cybersecurity.

These independent evaluations supplement the evaluations conducted in accordance with OpenAI’s Preparedness Framework, and include benchmarks such as METR’s time horizon evaluation or SecureBio’s Virology Capabilities Troubleshooting (VCT) evaluation.

To support these evaluations, we provided secure access to early model checkpoints, select evaluation results to concretize the capability improvements we were seeing, zero-data retention where needed, and models with fewer mitigations. For example, organizations working in the cybersecurity and biosafety domains tested models both with and without safety mitigations to probe underlying capabilities. Several other organizations received direct chain-of-thought access so they could inspect the model’s reasoning traces. This added transparency allowed assessors to identify cases of sandbagging2 or scheming behavior that might only be discernible by reading the chain-of-thought. Access was provided with security controls in place, and we continue to update those controls as model capabilities and testing needs evolve.
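As a rough illustration of what this paired testing can look like in practice, the sketch below runs the same assessor-written probes against a production-style model and a less-mitigated research checkpoint, then stores the transcripts for offline grading. The model identifiers, probes, and file path are placeholders, and real pre-deployment access runs through secured endpoints under agreed data-handling terms rather than this exact code.

# Hypothetical sketch: query a mitigated model and a less-mitigated research
# checkpoint with the same probes, and save transcripts for later grading.
# Model names, probes, and paths are placeholders, not real OpenAI identifiers.
import json
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

MODELS = {
    "mitigated": "frontier-model-placeholder",
    "fewer_mitigations": "frontier-model-research-placeholder",
}

PROBES = [
    "<assessor-written probe 1>",
    "<assessor-written probe 2>",
]

def run_probe(model_id: str, prompt: str) -> str:
    """Send one probe to the given model and return its text response."""
    response = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

transcripts = []
for variant, model_id in MODELS.items():
    for prompt in PROBES:
        transcripts.append({
            "variant": variant,
            "prompt": prompt,
            "response": run_probe(model_id, prompt),
        })

# Persist raw transcripts so domain experts can grade them offline.
with open("transcripts.json", "w") as f:
    json.dump(transcripts, f, indent=2)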

Methodology review

In certain contexts, external assessors are well-positioned to provide methodological review, offering additional perspective on the frameworks and evidence that frontier labs rely on to assess risk. For example, during the launch of gpt-oss, we used adversarial fine-tuning to estimate worst-case capabilities for open weight models, described in Estimating worst case frontier risks of open weight LLMs. The core safety question was whether a malicious actor could fine-tune the model to reach High capability in areas such as bio or cyber under our Preparedness Framework. Because this required resource-intensive adversarial fine-tuning, we invited third party assessors to review and make recommendations on our internal methods and results rather than repeat similar work.

This entailed a multi-week process of sharing evaluation rollouts and details of our adversarial fine-tuning approach, and collecting structured recommendations for improving the methodology and evaluations for worst-case frontier risks. Feedback from the assessors led to changes in the final adversarial fine-tuning process and demonstrated the value of methodological confirmation. We recorded which items we adopted in the paper and the system card for gpt-oss, and we provided rationales for those we did not adopt.
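To make the bookkeeping concrete, the sketch below shows one simple way such a recommendation log could be structured, with each item marked as adopted or declined and a recorded rationale. The field names and entries are invented examples, not the actual recommendations from the gpt-oss review.

# Illustrative only: a minimal structure for tracking external methodology
# recommendations and whether they changed the final approach. The entries
# below are invented examples, not actual assessor recommendations.
from dataclasses import dataclass

@dataclass
class Recommendation:
    source: str     # which assessor organization raised it
    summary: str    # short description of the recommended change
    adopted: bool   # whether it changed the final methodology
    rationale: str  # why it was adopted or declined

log = [
    Recommendation(
        source="Assessor A",
        summary="Add held-out tasks to the adversarial fine-tuning evaluation",
        adopted=True,
        rationale="Reduces overfitting to the elicitation set.",
    ),
    Recommendation(
        source="Assessor B",
        summary="Repeat the full fine-tuning sweep with a second random seed",
        adopted=False,
        rationale="Compute cost outweighed the expected reduction in variance.",
    ),
]

# Summaries like this can feed into the paper and system card write-ups.
adopted = [r for r in log if r.adopted]
declined = [r for r in log if not r.adopted]
print(f"Adopted {len(adopted)} of {len(log)} recommendations; "
      f"{len(declined)} declined with documented rationale.")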

Here, methodology review was a better fit than independent evaluations: the evaluations involved running large-scale, worst-case experiments, which require infrastructure and technical expertise that are not commonly available outside of major AI labs. This meant that independent evaluations likely would not have led directly to insights into worst-case scenarios, and it was more productive to focus external assessors on confirming the claims. External assessors reviewed the methods and evidence, highlighting decision-relevant gaps that were addressed as part of the recommendation feedback loop. We hope to extend this approach to other areas where access or infrastructure needs make it impractical for a third party to run evaluations directly, or where external evaluations may not yet exist.

Subject-matter expert (SME) probing

Another way we engage external experts is through subject-matter expert (SME) probing, where experts evaluate the model directly and provide structured input, via surveys, into our assessment of its capabilities. This is different from red teaming, which aims to stress-test specific safeguards. SME probing allows us to supplement Preparedness Framework evaluations with domain-specific insights that reflect expert judgment and real-world context that static evaluations alone may not capture. For example, we invited a panel of subject-matter experts to use a helpful-only model3 to try their own end-to-end bio scenarios for ChatGPT Agent and GPT‑5. They scored how much the model could uplift an expert like themselves compared to a less experienced novice, based on the usefulness of the guidance it provided in their scenarios. The goal was to gather additional input on how well the system could move a motivated novice materially closer to competent execution: SMEs stress-tested our “novice uplift” claims under realistic workflows they devised themselves, and gave granular feedback on where the model provided material, step-level help versus less helpful summaries. This expert probing exercise was included as part of the overall assessment for deployment of these models, and shared in the system cards for both launches.
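To make the survey mechanics concrete, the sketch below shows one simple way such expert ratings could be aggregated into per-scenario uplift scores. The rating scale, field names, and example responses are invented for illustration and do not reflect the actual survey instrument or results.

# Hypothetical sketch of aggregating SME probing surveys. The rating scale,
# field names, and example scores are invented for illustration.
from statistics import mean

# Each record is one expert's rating of how much the model's guidance would
# uplift a novice in their scenario (1 = no meaningful help, 5 = material,
# step-level help).
survey_responses = [
    {"expert": "sme_01", "scenario": "scenario_a", "novice_uplift": 2, "notes": "High-level summary only."},
    {"expert": "sme_02", "scenario": "scenario_a", "novice_uplift": 3, "notes": "Some actionable steps."},
    {"expert": "sme_01", "scenario": "scenario_b", "novice_uplift": 1, "notes": "Refused or deflected."},
]

def mean_uplift_by_scenario(responses):
    """Average the uplift ratings for each scenario across experts."""
    by_scenario = {}
    for r in responses:
        by_scenario.setdefault(r["scenario"], []).append(r["novice_uplift"])
    return {scenario: mean(scores) for scenario, scores in by_scenario.items()}

print(mean_uplift_by_scenario(survey_responses))
# Aggregate scores like these are one input, alongside Preparedness Framework
# evaluations, into the overall deployment assessment.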

What makes a third party assessment collaboration successful?

In the spirit of transparency, we’re sharing more about what third party assessors agree to when they work with us, and the principles that guide our collaborations:

  • Transparency with careful confidentiality bounds: Third party assessors sign non-disclosure agreements so that we can share confidential, non-public information to support their assessments. In the Appendix to this post, we include relevant excerpts from contracts with third party assessors that outline rights around publication and expectations for review. We operate with the principle of transparency and strive to enable publication that advances understanding of safety and related evaluations without compromising confidential information or intellectual property. As part of this, we review and approve publications from third-party assessments to ensure both confidentiality and factual accuracy. Over the past few years, several third party assessors have published their work alongside our own publication of assessment summaries in system cards. Examples of work published after we reviewed it for confidentiality and accuracy include the METR GPT‑5 report, the Apollo Research report on OpenAI o1, and the Irregular GPT‑5 assessment.
  • Thoughtful information disclosure and secure, sensitive access: By default, we provide information and access to models that are intended to be public or production-ready. When evaluations necessitate it, we provide deeper access, such as to helpful-only models or non-public information. OpenAI has provided these forms of access to third party assessors where necessary to answer critical safety questions. Importantly, these types of sensitive access require strict security measures, and we continue to update those controls as model capabilities and testing needs evolve.
  • Balanced financial incentives: We believe it is important that the third party assessment ecosystem is well-funded and sustainable. Because of that, we offer compensation to all of our third party assessors, though some choose to decline depending on their organizational philosophy. Forms of compensation include direct payment for work and/or subsidizing model use costs through API credits or otherwise. No payment is ever contingent on the results of a third party assessment.

Combined, these factors help third party assessments both protect sensitive information and foster transparency in AI safety, and create paths for third party assessors to be compensated for their time. 

Looking ahead

Looking ahead, we see a need to continue strengthening the ecosystem of organizations capable of conducting credible, decision-relevant assessments of frontier AI systems. Effective third party evaluation requires specialized expertise, stable funding, and methodological rigor. Continued investment in qualified assessor organizations, the advancement of measurement science, and security for sensitive access will be essential for ensuring that assessments can keep pace with advances in model capabilities. 

Third party assessments are one way we bring external perspective into our safety work, and they operate alongside other mechanisms. We also collaborate with external experts through structured red teaming efforts, collective alignment projects, work with the U.S. CAISI and UK AISI, and advisory groups such as our Global Physician Network and our Expert Council on Well-Being and AI to help guide our work on mental health and user well-being. These efforts contribute different forms of expertise and support a broader, more reliable foundation for assessing and governing advanced AI systems.

Appendix

The following are illustrative excerpts from our agreements with third parties collaborating with us on pre-deployment assessments. 

Research Publications: [...] Hereunder, Supplier hereby retains, or OpenAI licenses back to Supplier, as applicable, the right to use the Supplier Work Product created or discovered by Supplier for research, academic publication, scientific and/or educational purposes, provided such uses (a) are not commercial in nature, (b) do not disclose OpenAI’s Confidential Information (except as expressly permitted in advance by OpenAI in writing) and (c) are submitted to OpenAI for review and approval in writing prior to any publication or disclosure. OpenAI’s “Confidential Information” includes without limitation OpenAI’s Non-Public Models and outputs thereof, including any Supplier Work Product that was created or discovered through use of the Non-Public Models. “Non-Public Models” means OpenAI’s artificial intelligence and machine learning models, including versions and snapshots thereof, that have not been released to the general public at the time of Supplier’s proposed publication date.

Confidential Information. For purposes of this Agreement, “Confidential Information” means and will include: (i) any information, materials or knowledge regarding OpenAI and its business, financial condition, products, programming techniques, customers, suppliers, technology or research and development that is disclosed to Supplier or to which Supplier has or obtains access in connection with performing Services; (ii) the Supplier Work Product; and (iii) the terms and conditions of this Agreement. Confidential Information will not include any information that: (a) is or becomes part of the public domain through no fault of Supplier or any representative or agent of Supplier; (b) is demonstrated by Supplier to have been rightfully in Supplier’s possession at the time of disclosure, without restriction as to use or disclosure; or (c) Supplier rightfully receives from a third party who has the right to disclose it and who provides it without restriction as to use or disclosure. Supplier agrees to hold all Confidential Information in strict confidence, not to use it in any way, commercially or otherwise, other than to perform Services for OpenAI, and not to disclose it to others. Supplier further agrees to take all actions reasonably necessary to protect the confidentiality of all Confidential Information including, without limitation, implementing and enforcing procedures to minimize the possibility of unauthorized use or disclosure of Confidential Information.

Without granting any right or license, the Disclosing Party agrees that the foregoing shall not apply with respect to (a) any information after 2 years following the disclosure thereof, except for any information that is a trade secret, which shall remain subject to the confidentiality obligations of this Agreement for as long as it is a trade secret, (b) any information included in a Researcher’s noncommercial research or academic publication to the extent such information is either (i) approved in writing by OpenAI prior to publication or (ii) resulting from the version of OpenAI Technology that has been made generally available to the public by OpenAI (and not, for the avoidance of doubt, any information, results, or output from versions of the OpenAI Technology that were not made generally available to the public); or (c) any information that the Receiving Party can document (i) is or becomes (through no improper action or inaction by the Receiving Party or any affiliate, agent, consultant or employee of the Receiving Party) generally available to the public, (ii) was in its possession or known by it without restriction prior to receipt from the Disclosing Party, (iii) was rightfully disclosed to it by a third party without restriction, or (iv) was independently developed without use of any Proprietary Information of the Disclosing Party by officers, directors, employees, consultants, representatives, advisors or affiliates of the Receiving Party who have had no access to any such Proprietary Information. The Receiving Party may make disclosures required by law or court order provided the Receiving Party uses diligent reasonable efforts to limit disclosure and to obtain confidential treatment or a protective order and allows the Disclosing Party to participate in the proceeding.
