Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
Meta CEO Mark Zuckerberg appears to have used YouTube and its battle to remove pirated content to defend his own company’s use of a dataset containing copyrighted e-books to train AI models, excerpts of his deposition reveal.
The deposition, which was part of a complaint submitted to the court by the plaintiffs’ lawyers, relates to the AI copyright case Kadrey v. Meta. It’s one of many such cases winding their way through the US court system, pitting AI companies against authors and other IP holders. For the most part, the defendants in these cases, the AI companies, claim that training on copyrighted content is “fair use.” Many copyright holders disagree.
“For example, YouTube, I think, may end up hosting some things that people hack for a certain period of time, but YouTube is trying to remove that stuff,” Zuckerberg said during his deposition, according to portions of a transcript made available Wednesday evening. “And most of the stuff on YouTube, I think they’re very good and they have the license to do it.”
Snippets from Zuckerberg’s deposition provide some clues to his thinking on copyrighted content and fair use. However, it should be noted that a full transcript of the deposition has not been released. TechCrunch has reached out to Meta for additional context and will update the article if the company responds.
Based on the excerpts of the deposition, Zuckerberg appears to be defending Meta’s use of a dataset of e-books called LibGen to develop its family of AI models known as Llama. Meta’s Llama competes against leading models from AI companies like OpenAI.
LibGen, which describes itself as a “link aggregator,” provides access to copyrighted works from publishers including Cengage Learning, Macmillan Learning, McGraw Hill and Pearson Education. LibGen has been sued multiple times, ordered to shut down, and fined tens of millions of dollars for copyright infringement.
According to court documents released this week, Zuckerberg allegedly cleared the use of LibGen to train at least one of Meta’s Llama models despite concerns within the company’s AI executive and research teams about the legal implications.
Counsel for the plaintiffs, who include bestselling authors Sarah Silverman and Ta-Nehisi Coates, cited Meta employees referring to LibGen as a “data set that we know is hacked” and noting that its use “can undermine the negotiating position (of Meta) with the regulators,” according to a court filing.
During his deposition, Zuckerberg stated that he “hadn’t really heard of” LibGen.
“I understand you’re trying to get me to give you an opinion about LibGen, which I haven’t really heard of,” Zuckerberg said during the deposition. “It’s just that I have no knowledge of that specific thing.”
Under questioning from one of the plaintiffs’ lawyers, David Boies, Zuckerberg explained why it would not be reasonable to prohibit the use of a data set like LibGen.
“So I want to have a policy against people who use YouTube because some of the content can be protected by copyright? No,” he said. “(T)here are cases where having such a blanket ban might not be the right thing to do.”
Zuckerberg said that Meta should be “more careful about” training on copyrighted material.
“You know, (if there’s) someone who’s providing a website and intentionally trying to violate people’s rights … obviously that’s something we’d want to be cautious or careful about how we engage with or maybe even prevent our teams from engaging with it,” Zuckerberg said during his deposition, according to the transcript.
The attorneys for the plaintiffs in Kadrey v. Meta have amended the complaint several times since it was filed in the United States District Court for the Northern District of California, San Francisco Division, in 2023. The latest amended complaint, filed by the plaintiffs’ attorneys late Wednesday, contains new allegations against Meta, including that the company cross-referenced some pirated books on LibGen with copyrighted books available for license. The lawyers claim that Meta used this tactic to determine whether it made sense to pursue a license agreement with a publisher.
Meta allegedly used LibGen to train its latest family of Llama models, Llama 3, according to the amended complaint. The plaintiffs also say that Meta is using the data set to train its next-generation Llama 4 models.
According to the amended complaint, Meta researchers tried to hide the fact that the Llama models were trained on copyrighted materials by inserting “supervised samples” into Llama’s fine-tuning. Meta also downloaded pirated e-books from another source, Z-Library, for Llama training as recently as April 2024, the amended complaint says.
Z-Library, or Z-Lib, has been the subject of a number of legal actions brought by publishers, including domain seizures and takedowns. In 2022, the Russian citizens who allegedly maintained it were charged with copyright infringement, wire fraud, and money laundering.