Book Publishers Seek Entry Into Google AI Copyright Fight

AI Summary3 min read

TL;DR

Major book publishers Hachette and Cengage seek to join a class action lawsuit accusing Google of using pirated books to train its Gemini AI. They allege Google systematically copied copyrighted works from pirate sites like Z-Library without licenses. The lawsuit demands damages, injunctions, and destruction of unauthorized copies.

Key Takeaways

•Hachette Book Group and Cengage Group filed to intervene in a class action lawsuit against Google, alleging 'historic copyright infringement' in training its Gemini AI platform.
•The complaint claims Google used pirated books from sites like Z-Library, OceanofPDF, and WeLib, copying them multiple times during AI training without proper licenses.
•Google's C4 training dataset allegedly contains over 200 million copyright symbols and works from at least 28 U.S.-identified piracy sites, excluding policy notices but including paywalled content.
•The publishers seek statutory damages, injunctions to stop infringement, and orders for Google to destroy unauthorized copies and disclose which books were used to train Gemini.
•This case follows other AI copyright lawsuits where judges ruled training on copyrighted books can be fair use but criticized companies for maintaining libraries of pirated content.

Tags

Artificial Intelligencegeminigoogleartificial intelligenceAICopyrightgenerative aicopyright infringementHachetteCengage

Major book publishers Hachette Book Group and Cengage Group filed a motion Thursday to intervene in an existing class action lawsuit filed last year against Google, accusing the tech giant of orchestrating “historic copyright infringement” to build its Gemini platform.

The complaint filed in California federal court alleges Google "chose to steal a massive body of content from Plaintiffs and the Class to train its AI model" rather than obtain proper licenses, engaging in deliberate infringement "at every stage" of development.

The consolidated case was originally filed in 2023 by individual authors as a proposed copyright class action accusing Google of copying books to train its generative AI models.

The publishers claim Google downloaded books from pirate sites and then repeatedly copied them during the AI training process, first into computer memory, then into formats the AI systems could read, and again into training sets for each new model version.

Google's C4 training dataset contains copyrighted works scraped from Z-Library, a pirate collection from which authorities have seized more than 350 websites and web domains, the lawsuit alleges.

The publishers noted how books were copied from b-ok.org, a Z-Library domain now displaying a federal seizure notice, along with OceanofPDF and WeLib, "another prolific site with access to troves of unauthorized copyrighted content."

The C4 dataset contains works from at least 28 sites identified by the U.S. government as markets for piracy and counterfeits, the complaint notes.

"The copyright symbol (©) appears more than 200 million times in the C4 dataset," the complaint reads, noting Google allegedly excluded "policy notices" and "terms of use" warnings but included "vast categories of copyrighted works, pirated works, and works taken from behind paywalls."

The publishers allege that Google copied works from subscription-based libraries like Scribd.com, circumventing legitimate licensing agreements.

When confronted about this practice, nonprofit dataset provider Common Crawl allegedly responded with "a blame the victim mentality, proclaiming 'You shouldn't have put your content on the internet if you didn't want it to be on the internet.'"

The lawsuit alleges Gemini now produces outputs that "substitute for copyrighted works," including verbatim reproductions, detailed summaries, and "knockoffs that copy creative elements of original works."

Decrypt has reached out to Google and the publishers’ counsel.

AI and publishers

Google is simultaneously defending against antitrust claims from Penske Media Corporation over its AI Overviews feature, with the tech giant claiming that displaying AI-generated summaries constitutes "lawful product improvement rather than anti-competitive behavior."

The publishers seek statutory damages, injunctions to halt further infringement, and an order requiring Google to destroy all unauthorized copies of their works and disclose which books were used to train Gemini.

Google Seeks Dismissal of Publisher Lawsuit Over AI Search Summaries

The motion to intervene follows a series of copyright lawsuits that authors filed against AI companies in 2023, with federal judges delivering partial victories to Meta and Anthropic, ruling that their use of copyrighted books to train their models constituted fair use under copyright law, but criticized the companies for maintaining permanent libraries of pirated books.