Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
In addition to books, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of articles from various journals now in the public domain, and says it is open to forming similar collaborations down the line. The exact way the dataset of books will be released has not been determined. The Institutional Data Initiative has asked Google to work together on public distribution, but the details are still being hammered out. In a statement, Kent Walker, Google’s president of global affairs, said the company was “proud to support” the project.
However the IDI dataset is released, it will join a host of similar projects, startups, and initiatives that promise to give companies access to sustainable and high-quality AI training materials without the risk of running into copyright issues. Companies like Calliope Networks and ProRata have emerged to issue licenses and manage compensation schemes designed to get creators and rights holders paid for providing AI training data.
There are also other new public domain projects. Last spring, the French AI startup Pleias released its own public domain dataset, Common Corpus, which contains an estimated 3 to 4 million books and periodical collections, according to project coordinator Pierre-Carl Langlais. Supported by the French Ministry of Culture, Common Corpus has been downloaded more than 60,000 times this month alone on the open source AI platform Hugging Face. Last week, Pleias announced that it had released its first set of large language models trained on this dataset, which Langlais told WIRED are the first models “ever trained exclusively on open data and compliant with the (EU) AI Act.”
Efforts are underway to create similar image datasets. The AI startup Spawning released its own this summer called Source.Plus, which contains public domain images from Wikimedia Commons as well as a variety of museums and archives. Many significant cultural institutions, such as the Metropolitan Museum of Art in New York, have long made their archives accessible to the public as standalone projects.
Ed Newton-Rex, a former executive at Stability AI who now runs a nonprofit that certifies ethically trained AI tools, says the rise of these datasets shows that there is no need to steal copyrighted material to build high-performance, quality AI models. OpenAI previously told lawmakers in the UK that it would be “impossible” to create products like ChatGPT without using copyrighted works. “Large public domain data sets like these further demolish the ‘necessity defense’ that some AI companies use to justify scraping copyrighted work to train their models,” says Newton-Rex.
But he still has reservations about whether IDI and projects like it will really change the status quo of AI training. “These data sets will have a positive impact only if they are used, probably in conjunction with licensing of other data, to replace scraped copyrighted work. If they are only added to the mix, one part of a data set which also includes the unlicensed life’s work of the world’s creators, they will greatly benefit AI companies,” he says.
Updated 12/12/24 11:18am ET: This story has been updated with comments from Google.