Understanding AI and learning outcomes
TL;DR
AI tools like ChatGPT show promise for personalized learning, but current research methods fail to capture AI's long-term impact on learning outcomes. OpenAI's Learning Outcomes Measurement Suite provides a framework for longitudinal studies to understand how AI shapes learning across different contexts.
Key Takeaways
- Current AI learning research focuses on short-term test scores, missing longitudinal effects on cognitive development and learning behaviors.
- The Learning Outcomes Measurement Suite enables tracking of learning moments, engagement, persistence, metacognition, and recall over time.
- Early studies show pedagogically aligned AI interactions (like study mode) can improve learning outcomes, but durability of gains needs verification.
- The framework allows different education systems to define success metrics aligned with their pedagogical goals and contexts.
- OpenAI is validating the measurement suite through large-scale studies with partners like the University of Tartu and Stanford's SCALE Initiative.
Education is one of AI’s most promising frontiers. With tools like ChatGPT, personalized learning support can be available to any student, anywhere, at any time.
But the education sector is still early in its understanding of the impact of AI on learning outcomes. Last year, our team set out to study the use of tools like study mode and found promising gains in student performance. But our research also raised an important question: how can we assess how AI influences a learner's progress over time, not just on a final exam?
This is a broader ecosystem challenge. To date, most research methods focus on narrow performance signals—such as test scores—and lack the ability to assess how students actually learn with AI in real-world settings, and how that use shapes outcomes over time.
To address this gap, we developed the Learning Outcomes Measurement Suite, a framework created with Estonia’s University of Tartu and the SCALE Initiative at the Stanford Accelerator for Learning to support longitudinal measurement of learning outcomes across different educational contexts.
Extensive validation is underway through a randomized controlled trial, and further research is planned with founding organizations in the Learning Lab, OpenAI’s learning research ecosystem, including researchers from Arizona State University, UCL Knowledge Lab, and MIT Media Lab (building on prior collaborative studies).
Today, we’re sharing an overview of how the measurement suite works and why it matters. Over time, we intend to publish more research and release the measurement suite as a public resource for schools, universities, and education systems worldwide.
“This research allows us to learn quickly while also laying the groundwork for a deeper understanding of how AI can be thoughtfully integrated into schools in ways that truly matter. We want to understand how these tools can support rigorous academic learning while also cultivating higher-order thinking, creativity, curiosity, and students’ confidence in themselves as learners.”
Summary of takeaways
- Today’s research methods on the impact of AI on learning show promising signals about performance, but don’t capture the full picture of how AI affects learning outcomes over time.
- The Learning Outcomes Measurement Suite will, for the first time, provide a standard framework for longitudinal studies that help educators, researchers, and institutions understand how AI shapes learning and outcomes across different contexts.
- OpenAI’s Learning Lab is a new research ecosystem focused on advancing this work. OpenAI will publish findings alongside a range of partners as the field continues to develop.
Origins and early research
When students use AI tools to study and learn, it can mean many different things—from going to AI for quick answers to using it to work through problems step by step with tutor-like guidance. To encourage users to engage with ChatGPT in ways that support deeper understanding and skill-building, OpenAI introduced study mode last year. Under the hood, study mode is powered by custom system instructions we’ve written in collaboration with teachers, scientists, and pedagogy experts to reflect a core set of behaviors that support true learning, not just answers—using scaffolding, checks for understanding, and guided practice.
To test whether this kind of pedagogically aligned AI interaction style translates into better learning outcomes, we ran a randomized study with over 300 college students preparing for neuroscience and microeconomics exams. While analysis is still underway, early results give us confidence that a pedagogically aligned AI interaction style, encouraged through features like study mode, can improve learning outcomes. But this research also surfaced an important reality: what really matters is whether the gains and associated productive behaviors remain durable over time.
Study design
Participants were assigned to one of three groups. A control group studied using traditional online resources such as Google Search and YouTube, with AI-generated overview features disabled; the two remaining groups were each given access to one of two study mode variants designed to guide students through the learning process in slightly different ways. Baseline quizzes and onboarding surveys were collected ahead of time to adjust for differences in prior coursework exposure, study habits, academic confidence, and familiarity with AI tools. Students completed timed study mode sessions before each exam, with the two study mode variants counterbalanced across subjects.
This setup was designed to reflect real-world study conditions rather than a tightly controlled lab environment. Participation was not tied to exam performance, and not all students used study mode to the same extent during the nominal 40-minute sessions. This allowed us to measure and report intention-to-treat (ITT) effects: the causal impact of being offered study mode under realistic rollout conditions, acknowledging that engagement varies in practice.
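To make the ITT idea concrete, here is a minimal sketch of how such an effect can be estimated: regress exam scores on a treatment-assignment indicator while adjusting for a baseline covariate. The data, covariate choice, and effect size below are invented for illustration and are not the study's actual analysis.

```python
import numpy as np

def itt_effect(exam_score, assigned, baseline):
    """Estimate the intention-to-treat (ITT) effect: the average impact
    of being *offered* the tool, regardless of how much it was used.
    Fits exam_score ~ intercept + assigned + baseline by ordinary least
    squares; the coefficient on `assigned` is the ITT estimate."""
    X = np.column_stack([
        np.ones(len(assigned)),        # intercept
        np.asarray(assigned, float),   # 1 = offered study mode
        np.asarray(baseline, float),   # baseline quiz score (covariate)
    ])
    coef, *_ = np.linalg.lstsq(X, np.asarray(exam_score, float), rcond=None)
    return coef[1]

# Toy data: a true assignment effect of ~8 points across 300 students.
rng = np.random.default_rng(0)
assigned = rng.integers(0, 2, 300)
baseline = rng.normal(60, 10, 300)
score = 10 + 0.5 * baseline + 8 * assigned + rng.normal(0, 5, 300)
print(f"ITT estimate: {itt_effect(score, assigned, baseline):.1f} points")
```

Because assignment is randomized, the ITT coefficient has a causal interpretation even when actual engagement varies; the baseline covariate only tightens the estimate.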
Findings
We measured performance on each exam separately. In our randomized study, improvements were not uniform across subjects, and levels of engagement with study mode varied across participants.
- Neuroscience (primary ITT): We observed directionally positive differences for study mode relative to control, but the results were not statistically distinguishable from those of students studying with traditional online resources. Some onboarding and technical issues reduced time spent studying among students using study mode.
- Microeconomics (primary ITT): We observed meaningful gains in exam performance among students assigned access to study mode versus the no-AI control group, with scores roughly 15% higher relative to control.
Figure: study mode (variants A & B) vs. control (no-AI group), adjusted mean exam scores.
The effect remains consistent when we compare each study mode variant separately with the control.
While this reflects real world variation, it highlighted a deeper limitation in how learning outcomes are typically measured.
Most existing evaluation approaches rely on fixed interventions assessed over short time windows, using outcomes such as test scores or final essays as primary signals. These methods are not designed to capture the core mechanism through which AI affects learning in practice: ongoing, personalized interactions that evolve alongside a learner’s own strategies, preferences, and study habits. Nor do they surface whether improvements in one capability, such as short-term recall, may come alongside trade-offs in others, such as persistence, autonomous motivation, or creative problem solving. As a result, they miss the longitudinal cognitive effects that ultimately determine whether AI meaningfully improves learning.
Because learning environments differ widely across countries, curricula, and institutional goals, outcomes from one-off studies rarely generalize across systems. Measurement approaches must therefore be flexible enough for different education systems to define what success looks like in their context, evaluate AI against their own standards, and iterate accordingly.
Building a better measurement system
Building on what we learned from OpenAI’s study mode research, we have been developing a structured measurement system to assess AI’s impact on learners at scale and create a mechanism to improve models based on those outcomes. It is grounded in three signals—how the model behaves, how learners respond, and what measurable cognitive outcomes result over time. It includes:
- System instructions to refine model behavior: natural-language instructions that change the model’s default behavior to better align with specific pedagogical approaches.
- Learning interaction classifiers: these automatically detect “learning moments” within real, de-identified, learner–model interactions and label salient characteristics such as engagement and error correction.
- Learning quality graders: these evaluate and score each of those learning moments by whether the learner achieved their objective and the degree to which the interaction followed strong pedagogical principles, including identification of failure modes.
- Longitudinal learning graders: these track changes in the same learner’s interactions with the model over time—including engagement, persistence, and metacognitive strategies—at the individual and cohort levels.
- Standardized cognitive and metacognitive measures: validated third-party instruments delivered via ChatGPT before, during, and after access to establish baselines and measure changes in foundational capabilities such as critical thinking, creativity, and memory.
When combined, we refer to this measurement system as the Learning Outcomes Measurement Suite.
It produces important signals the education ecosystem can use: structured views of learning moments, dashboards showing how outcomes shift over time across cohorts, indicators of model performance against teaching and tutoring rubrics, and outcome measures aligned to standardized assessments and short learner questionnaires. Where available, it can incorporate partner-provided ground truth such as exam scores, classroom observations, or attendance.
All data used in the measurement suite is de-identified.
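To make the pipeline above concrete, here is a minimal, purely illustrative sketch of how learning-moment classification and quality grading could fit together. Every class name, labeling rule, and score here is an invented stand-in: the actual suite operates on real, de-identified transcripts with far more sophisticated classifiers than the keyword heuristics used below.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str   # "learner" or "model"
    text: str

@dataclass
class LearningMoment:
    turns: list
    labels: dict = field(default_factory=dict)  # salient characteristics
    quality: float = 0.0                        # grader score in [0, 1]

def classify_moments(session):
    """Toy learning-interaction classifier: treat each learner attempt
    followed by a model reply as a candidate learning moment, and label
    whether the reply corrects an error."""
    moments = []
    for a, b in zip(session, session[1:]):
        if a.role == "learner" and b.role == "model":
            m = LearningMoment(turns=[a, b])
            m.labels["error_correction"] = "actually" in b.text.lower()
            moments.append(m)
    return moments

def grade_moment(m):
    """Toy learning-quality grader: reward scaffolding (the model asks a
    follow-up question) over handing back a bare answer."""
    m.quality = 1.0 if "?" in m.turns[-1].text else 0.3
    return m.quality

session = [
    Turn("learner", "I think mitochondria synthesize proteins."),
    Turn("model", "Actually, ribosomes do that. What do mitochondria provide?"),
]
moments = classify_moments(session)
for m in moments:
    grade_moment(m)
print(len(moments), moments[0].labels, moments[0].quality)
```

The design point the sketch illustrates is the separation of concerns described above: classifiers first detect and label moments, and graders then score each moment against pedagogical criteria, so either stage can be improved independently.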
It also enables our partners to understand the deeper cognitive impacts of using AI for learning over time, since the system also tracks impact on capabilities such as:
- Autonomous Motivation: the degree to which learners shape their own studies vs. being directed by the model
- Productive Engagement: the frequency, variety, and quality of pedagogical interactions
- Task Persistence: the degree to which a learner sits with and pushes through cognitive challenges
- Metacognition: the frequency and quality of a learner’s efforts to plan, reflect, and monitor their approaches to studying
- Recall: the accuracy with which a learner can remember content from previous interactions
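As a sketch of the longitudinal side of this tracking, per-session capability scores like those above can be rolled up into cohort-level change over time. The schema, learner IDs, and scores below are invented for illustration and do not reflect the suite's real data model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-session capability scores: (learner_id, week, scores).
sessions = [
    ("a1", 1, {"persistence": 0.4, "metacognition": 0.2}),
    ("a1", 4, {"persistence": 0.7, "metacognition": 0.5}),
    ("b2", 1, {"persistence": 0.5, "metacognition": 0.6}),
    ("b2", 4, {"persistence": 0.6, "metacognition": 0.7}),
]

def cohort_change(sessions, metric, start_week, end_week):
    """Mean change in `metric` between two weeks, computed only over
    learners observed in both weeks, so the comparison tracks the same
    individuals rather than a shifting population."""
    by_week = defaultdict(dict)
    for learner, week, scores in sessions:
        by_week[week][learner] = scores[metric]
    common = by_week[start_week].keys() & by_week[end_week].keys()
    return mean(by_week[end_week][l] - by_week[start_week][l] for l in common)

print(cohort_change(sessions, "persistence", 1, 4))
```

Restricting the comparison to learners seen in both weeks is the key choice: it keeps attrition from masquerading as a cohort-level gain or decline.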
This reflects our overall effort to focus not simply on narrow definitions of learning outcomes, such as rising test scores, but on the holistic capabilities that underpin learning. It also reflects our belief that there will be no silver bullet in what to optimize for: systems and educators will need to be empowered to navigate trade-offs in alignment with pedagogical best practice.
Where we go from here
We are validating the Learning Outcomes Measurement Suite through large-scale studies before making it broadly available. This work is underway with the University of Tartu and Stanford’s SCALE Initiative across nation-scale partners like Estonia, where the measurement suite is being studied with nearly 20,000 students aged 16–18 over several months. Student use will happen in close collaboration with local leaders to ensure safety and alignment with local curricula.
“Estonia has always approached education not as static but as a system we continuously improve. With AI becoming part of that picture, the big question is how we measure AI's long-term impact on learning. That's what we're figuring out in collaboration with OpenAI. Students are keen to be involved in the development process, and many want to learn how to support learning with AI. It feels like a real turning point, and we’re excited to contribute methods that other education systems can reuse and build on.”
This work builds on a broader body of collaborative research underway. In addition to the outcomes research being conducted through founding partners in the Learning Lab, OpenAI is supporting studies at the intersection of learning and labor—examining how AI shapes students’ academic pathways, career decisions, and the ways institutions can support responsible adoption. This research is happening across Bocconi University, Innova Schools and Tuck School of Business at Dartmouth, San Diego State University, Stony Brook University, and others.
As we run longer-term studies on how students learn best with AI, we intend to share findings and work with the broader education ecosystem to ensure AI benefits learners everywhere.
Those interested in receiving updates on this work can sign up here.