Wikipedia Reveals Multiple Deals with AI Giants to Use Its Content

TL;DR

The Wikimedia Foundation has partnered with AI companies like Microsoft and Mistral AI to license Wikipedia content for training AI models, aiming to ensure sustainability as traffic shifts toward AI summaries. This occurs amid broader copyright disputes between AI firms and publishers over training data usage.

Key Takeaways

  • Wikipedia has signed deals with AI companies including Microsoft and Mistral AI through Wikimedia Enterprise to allow use of its content for training AI models.
  • The foundation cites declining direct human traffic due to AI-generated summaries as a motivation, seeking long-term sustainability.
  • These agreements are part of a larger industry debate where publishers and creators are challenging AI companies over copyright infringement in training data.
  • Recent legal rulings show mixed outcomes, with some courts finding AI training to be fair use while other infringement claims remain unresolved.

Wikipedia. Image: Shutterstock/Decrypt

The Wikimedia Foundation has announced a series of new partnerships with artificial intelligence companies that will allow them to use Wikipedia content to train and power their AI models, as the nonprofit seeks to shore up its long-term sustainability amid changing online behavior.

The agreements were signed through Wikimedia Enterprise, the foundation’s commercial product designed for large-scale reusers and distributors of content from Wikimedia projects. New signups include Ecosia, Microsoft, Mistral AI, Perplexity, Pleias and ProRata. They join existing partners such as Amazon, Google and Meta.

“In the AI era, Wikipedia and its human-created and curated knowledge has never been more valuable,” the foundation said in a statement.

“Its knowledge power[s] generative AI chatbots, search engines, voice assistants and more. Wikipedia is one of the highest-quality datasets used in training Large Language Models.”

The announcement was made as part of an update tied to Wikipedia’s 25th anniversary.

The online encyclopedia is among the top ten most-visited websites globally and is the only one in that group operated by a nonprofit organization. Its more than 65 million articles, published in over 300 languages, are viewed nearly 15 billion times each month, according to the foundation.

However, the foundation has warned that traffic patterns are shifting. In October, it said human visits to Wikipedia fell 8% year over year, attributing the decline to users relying on AI-generated summaries rather than visiting the site directly. Nearly 60% of Google searches now end without a click, with on-page responses often powered by Wikipedia content.



AI vs publishers

The deals come amid a broader debate over how AI companies obtain training data. Large language models are typically trained on vast amounts of online material, a practice that has drawn criticism from authors, publishers and other rights holders who argue that the use of copyrighted works without permission is infringement.

Among them, Reddit is involved in several lawsuits with AI companies over the use of its content to train models, although it has reached licensing agreements with companies such as Google.

On Thursday, major book publishers Hachette Book Group and Cengage Group filed a motion to join an existing class action lawsuit against Google, accusing the company of carrying out “historic copyright infringement” to build its Gemini AI platform. The lawsuit alleges Google copied books without proper licenses during its AI training processes. The case was originally filed in 2023 by a group of authors.

OpenAI faces a similar case from plaintiffs including "Game of Thrones" writer George R.R. Martin.

Entertainment companies are also pressing the issue. In mid-December, Disney sent Google a cease-and-desist letter accusing it of copyright infringement, even as Disney struck a separate licensing deal with OpenAI covering hundreds of characters for AI-generated video. Disney has issued similar notices to other AI firms and is involved in litigation alongside major studios against image-generation company Midjourney.

The same month, a coalition of writers, actors and technologists launched a new industry group aimed at pushing for enforceable standards governing how AI is trained and used in the entertainment sector. More than 500 prominent figures have backed the initiative, including Natalie Portman, Cate Blanchett, Ben Affleck, Guillermo del Toro and Taika Waititi.

The European Commission has also opened a formal antitrust investigation into whether Google violated EU competition rules by using publisher and YouTube content to power its AI services without fair compensation or consent.

Whether copyright holders will ultimately find recourse isn’t certain. Federal judges in the U.S. have recently delivered partial victories to Meta and Anthropic, ruling that their use of copyrighted books to train AI models constituted fair use, while criticizing the companies for maintaining permanent libraries of pirated works.
