OpenAI Claims It’s ‘Impossible’ to Train AI Models Without Copyrighted Materials

OpenAI, the company behind the web’s most popular generative AI models, has taken an interesting stance against its growing slate of copyright infringement claims. In a piece of written evidence submitted to the UK Parliament’s House of Lords Communications and Digital Select Committee, OpenAI stated that it’s “impossible” to train tools like ChatGPT without using copyrighted materials.

The Communications and Digital Select Committee investigates how the UK’s public policy intersects with the media, digital communications, and creative industries. Once it completes a probe, the Committee publishes a report of its findings. These reports can then become the basis for the broader UK government to make policy changes. In July 2023, the Committee initiated a probe to “examine large language models and what needs to happen over the next 1–3 years to ensure the UK can respond to their opportunities and risks.” This inevitably ended up focusing on OpenAI’s ChatGPT and DALL-E.

Credit: Viralyft/Unsplash

Beyond sharing its perspective on how large language models (LLMs) might impact society over the next few years, OpenAI took its evidence submission as a chance to defend its use of copyrighted materials in its training of ChatGPT. “Because copyright today covers virtually every sort of human expression—including blog posts, photographs, forum posts, scraps of software code, and government documents—it would be impossible to train today’s leading AI models without using copyrighted materials,” the document reads. “Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.”

Rather than treating copyright’s near-universality as a sign that platforms like ChatGPT might not be worth the intellectual property infringement they entail, OpenAI appears to be using that universality as a makeshift shield. Multiple plaintiffs have accused OpenAI of relying on their written, copyrighted work to train ChatGPT, which has become a cash cow that pays OpenAI but not the authors whose work it draws on. The New York Times has also sued OpenAI for regurgitating its content without compensating the publication.

The Communications and Digital Select Committee might not be a court of law. Still, the results of its probe could easily shape how the UK and other Western governmental entities perceive and treat generative AI. OpenAI knows this, and with copyright lawsuits stacking up here in the US, it’s using the Committee’s investigation as an opportunity to get ahead of any copyright concerns.

OpenAI also acknowledged that there’s “still work to be done to support and empower creators.” It’s reportedly working on allowing publishers to block its GPTBot crawler from collecting content on their sites and allowing photographers and other artists to exclude their images from future DALL-E training sets. On paper, this is a nice gesture, but it’s a bit difficult to thank OpenAI for restoring a right or comfort it took away in the first place.
