Jan 10, 2025

Meta Knew AI Training Data Included Pirated Books, Authors Claim

Meta Platforms used pirated versions of copyrighted books to train its artificial intelligence systems, a group of authors alleges in court documents, Reuters reported. The authors, including comedian Sarah Silverman and writer Ta-Nehisi Coates, are suing Meta for copyright infringement.

The authors, who filed the lawsuit in 2023, argue that Meta's large language model, Llama, was trained on their books without permission. They point to internal Meta documents produced during the discovery process as evidence that the company was aware the works were pirated.

In court filings made public Wednesday, the authors detail Meta's use of the LibGen dataset, which allegedly contains millions of pirated works and was distributed through peer-to-peer torrents. They also claim that internal communications show Mark Zuckerberg approved Meta's use of LibGen despite concerns within the company's AI executive team that the dataset was pirated.

Meta has not yet responded to Reuters' requests for comment.

The case is one of several lawsuits alleging that copyrighted material is being used without permission to develop AI products. Defendants in these cases typically argue that they made fair use of the copyrighted material.

The authors are seeking permission to file an updated complaint, arguing that the new evidence strengthens their infringement claims and justifies reviving their copyright management information (CMI) claim. They also want to add a new computer fraud claim.

US District Judge Vince Chhabria previously dismissed claims that text generated by Meta's chatbots infringed the authors' copyrights and that Meta unlawfully stripped their books' CMI. He has expressed some skepticism about the merits of the fraud and CMI claims but has agreed to allow the authors to file an amended complaint.