February 10, 2025

Meta’s Llama AI Training Practices Face Serious Copyright Allegations

by Alan Brooks

Recent court filings have revealed troubling allegations about Meta Platforms’ methods of gathering training data for its Llama artificial intelligence models. According to documents unsealed in the ongoing copyright infringement case Kadrey v. Meta, the company allegedly downloaded over 160 terabytes of pirated books and articles from shadow libraries, with CEO Mark Zuckerberg personally approving the controversial strategy.

The Scale of Alleged Infringement

The scope of Meta’s alleged copyright violations is substantial. Court documents indicate that in spring 2024, Meta downloaded at least 81.7 terabytes of data from multiple shadow libraries, following an earlier download of 80.6 terabytes from Library Genesis (LibGen), a known repository of pirated academic materials. Together, these two figures come to roughly 162 terabytes, which is the basis for the "over 160 terabytes" total. This massive data collection effort was reportedly undertaken despite internal concerns about its legality.

Internal Concerns and Executive Decisions

The unsealed communications paint a picture of conflicting views within Meta regarding the use of pirated materials. Several employees expressed reservations about the practice:

Research manager Melanie Kambadur explicitly stated, “I don’t think we should use pirated material… I really need to draw a line there.”
Engineer Nikolay Bashlykov noted his discomfort, writing, “Torrenting from a corporate laptop doesn’t feel right.”

However, senior leadership allegedly overrode these concerns. According to the filings, Meta’s head of generative AI, Ahmad Al-Dahle, “cleared the path” for torrenting LibGen materials. The filings further allege that Mark Zuckerberg ultimately approved the use of pirated datasets despite objections from Meta’s AI executive team.

Alleged Concealment Efforts

Perhaps most troubling are allegations that Meta took steps to conceal its use of pirated materials. The company allegedly:

Wrote scripts to remove copyright information and acknowledgments from e-books
Stripped copyright markers from scientific journal articles
Considered using VPNs to hide its downloading activities
Avoided using Facebook’s infrastructure for downloads to prevent tracing

Meta’s Defense and Legal Status

Meta’s primary defense rests on the fair use doctrine: it argues that using copyrighted materials for AI training constitutes a transformative use protected under U.S. law. However, the plaintiffs, including prominent authors Sarah Silverman and Ta-Nehisi Coates, contend that Meta’s deliberate circumvention of proper licensing channels undermines this defense.

U.S. District Judge Vince Chhabria has already expressed skepticism about Meta’s attempts to keep these details private, noting that the company’s sealing requests aimed to avoid negative publicity rather than protect sensitive business information.

Broader Implications

This case highlights the growing tension between AI development and intellectual property rights. As AI companies race to build more sophisticated models, how they source training data has become increasingly contentious. Meta’s alleged actions and the resulting litigation could set important precedents for how AI companies approach copyright issues in the future.

The outcome of this case could have far-reaching implications for the AI industry, potentially forcing companies to reconsider their data collection practices and establish more rigorous protocols for respecting intellectual property rights. As Judge Chhabria continues to review the evidence, the tech industry watches closely to see how this landmark case might reshape the landscape of AI development and copyright law.

While the case currently pertains only to Meta’s earlier Llama models and not its recent releases, it raises crucial questions about the ethical and legal boundaries of AI development that will likely influence industry practices for years.
