Platform lets creators monetize their content for use in LLM training

Avail, an AI research firm that focuses on the media industry, today launched Corpus, a platform it said enables creators and media rights holders to license their work to AI model developers.

Corpus, the Brooklyn, New York-based firm said in a release, enables “rights holders to seek compensation for both catalog content and real-time answers derived from their work.”

A company FAQ describes it as a “monetization platform for creators, media companies and rights holders of all kinds. We connect content owners with AI companies interested in licencing their work for training purposes or real-time chatbot answer retrieval.” The Corpus homepage contains a valuation calculator that provides creators an estimate of their catalog’s worth based on recent benchmarks, Avail said.

On its site, Avail states that it has partnered with OpenAI, Anthropic, film production and distribution company 30West, AI-based wealth management firm Range, and venture capital firms General Catalyst and Seven Seven Six.

Bill Wong, AI research fellow at Info-Tech Research Group, viewed the launch of Corpus as a positive move for creators, and necessary in order to reset “expectations that Big Tech vendors have regarding their use of copyrighted data.”

While an initiative such as this could benefit not only content creators but also the firms that train AI models, he said, “there will be challenges in resetting expectations and making this work in an efficient manner. The advantage of accessing curated data is that it provides a higher quality of data to train the model. However, the administration of this may be a challenge, such as calculating the right costs, perhaps implementing new types of watermarks, etc.”

Wong added that Avail’s Corpus tool “flies in the face” of comments made by Mustafa Suleyman, the CEO of Microsoft AI, in an interview at the recent Aspen Ideas Festival. “While attempting to define what kind of content is protected by publishers, he proceeded to say: ‘With respect to content already on the open web, the social contract of that content since the 1990s has been that it is fair use. Anyone can copy it, recreate it, or reproduce it. That has been freeware, if you like; that’s been the understanding.’”

Had a tool like Corpus been available in the 1990s, said Wong, “I am sure content creators would have been properly acknowledged and compensated for their content. Today, the jury is still assessing whether copyright data for LLM training should fall under ‘fair use,’ but accessing data in real-time should be recognized as of value to both users and vendors, and this content should not be considered freeware.”

Today, he said, the US Copyright Office has not prevented “LLM vendors from using copyrighted data to train their models. The vendors typically state that the use of the copyrighted data falls under the legal concept of ‘fair use,’ which allows people/companies to use limited portions of the work for non-commercial, educational, or transformative uses.”

According to Wong, “It is the ‘transformative’ use the vendors argue that is how the LLMs are using the data. Ingested data is not simply reproduced by the LLM; the content is transformed and used to generate new content for new uses. However, I don’t believe that when the ‘fair use’ doctrine was first defined, they considered a program that would ingest all the data, be used for commercial purposes, and disrupt the industry of the creators.”

The launch of Corpus follows an announcement late last month that seven companies that license music, images, videos, and other data used for training AI systems have formed a trade association to promote responsible and ethical licensing of intellectual property. The primary goals of the group, to be known as the Dataset Providers Alliance (DPA), are to standardize the licensing of intellectual property for AI and ML datasets, facilitate industry collaboration, advocate for content creators’ rights, and protect intellectual property.

What can happen if an organization is caught violating copyright? Consider: in March, France’s competition authority fined Google, its parent company Alphabet, and two subsidiaries a total of €250 million ($271 million) for breaching a previous agreement on using copyrighted content for training its Bard AI service, now known as Gemini.

The Autorité de la concurrence said that the search giant failed to comply with a June 2022 settlement over the use of news stories in its search results and on its News and Discover pages. Google avoided a fine at that point by pledging to enter into good-faith negotiations with news providers over compensation for their content, among other actions.

