Why the New York Times Vs. OpenAI Copyright Lawsuit Is Pivotal
On December 27th, 2023, the New York Times filed a lawsuit against ChatGPT developer OpenAI and partner Microsoft for using content written by the Times to train AI.
The NYT vs. OpenAI copyright battle isn’t just a legal dispute; it’s a pivotal moment for AI and journalism. What’s at stake? Who stands to win? Learn about the lawsuit that’s set to shape the future of AI and media.
The lawsuit, filed in Federal District Court in Manhattan, cites multiple examples in which GPT models summarize, or even reproduce verbatim, articles published by the newspaper. The plaintiffs claim that this causes “billions of dollars” in lost revenue by diverting readers away from the Times website, where they pay for subscriptions and generate advertising revenue.
The lawsuit seeks to hold OpenAI and Microsoft accountable for statutory damages and calls for the defendants to destroy models and datasets built using copyrighted material.
The outcome of this case will have important implications for anyone building or using generative AI models. Most generative AI developers train their models on web-scraped data that contains copyrighted material. The case is significant because it could determine whether this practice falls under the fair use doctrine or constitutes copyright infringement. If fair use does not apply, the outcome could set a precedent for what compensation content producers are owed when generative AI uses or summarizes their works.
Important Background Details of the OpenAI Copyright Lawsuit
The lawsuit follows a reported breakdown in talks between the two sides, which had been holding discussions since April. In July, OpenAI struck a deal with the Associated Press to license the news agency’s archive of articles and also made a similar arrangement with Axel Springer, which owns Politico and Business Insider.
The two parties may have considered a similar partnership, but no deal was reached.
Are There Previous Lawsuits Like the OpenAI Copyright Lawsuit?
This lawsuit is far from the first battle between generative AI developers and content creators.
In July, a small group of authors, including comedian Sarah Silverman, sued OpenAI and Meta, alleging that the companies violated copyright law because their models could summarize the authors’ work.
Similar lawsuits have been filed against OpenAI, including one by a group of authors that includes George R.R. Martin, and several others have targeted text-to-image model developers such as Midjourney.
So, What’s Going To Happen?
At the heart of these cases is a US copyright doctrine known as fair use and the question of whether it applies to generative AI developers.
Fair use allows for the use of copyrighted material in some instances without paying or crediting the creator.
Generative AI developers have argued that fair use shields them from liability for using copyrighted data. However, it is difficult to assess how far fair use extends to this practice.
Legal scholars argue that fair use does not apply when a generative model creates content that is similar (or identical) to copyrighted material and threatens the original creators’ economic interests. However, the issue becomes less clear when the generated content is argued to be transformative or stylistically altered.
Today’s law does not offer clear answers to these questions, and the decisions in the New York Times lawsuit could bring more clarity.
What Are the Implications of the OpenAI Copyright Lawsuit?
The background of this lawsuit is as fascinating as its implications.
Copyright law is complex and leaves much open to interpretation. However the case is decided, it will set important precedents for generative AI users and developers.
The case could decide whether companies that use or develop generative AI are liable for copyright infringement if their models can produce content similar (or identical) to copyrighted content.
If companies using or building generative AI are determined to be liable (i.e., fair use does not apply), here are two broad implications:
First, companies will need to decide how to handle copyrighted material.
They can:
- (1) try to remove copyrighted material from their training data,
- (2) add guardrails to prevent their AI from infringing copyright, or
- (3) work with content creators to pay and acknowledge them.
Removing copyrighted material is challenging given the volume of the datasets used to train generative AI, which can run to hundreds of terabytes. Adding guardrails to prevent a model from displaying copyrighted material may work in some cases, but multiple examples from the New York Times lawsuit show that such guardrails can be bypassed (a simplified sketch of the idea follows below). Finally, companies may decide to work with content creators and negotiate how to pay or attribute them correctly; OpenAI is doing this already with the Associated Press and other publishers, so there is precedent.
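To make the guardrail idea concrete, here is a deliberately naive sketch in Python. It flags a model response that shares a long verbatim word run with a known protected text. The function names, the 12-word threshold, and the placeholder strings are all illustrative assumptions; this is not how OpenAI’s production safeguards work.

```python
# A deliberately naive output guardrail: flag a model response if it shares
# a long verbatim word sequence with known protected text. Illustrative only.

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of n-word sequences appearing in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_like_verbatim_copy(response: str, protected: str, n: int = 12) -> bool:
    """Flag the response if any n-word run also appears in the protected text."""
    return bool(ngrams(response, n) & ngrams(protected, n))

# Usage: screen a model response against one known copyrighted article.
article = "..."   # placeholder for known copyrighted text
response = "..."  # placeholder for model output
if looks_like_verbatim_copy(response, article):
    print("Withhold response: possible verbatim reproduction")
```

Even this toy check hints at why guardrails are hard to get right: light paraphrasing or inserting a few words defeats exact matching, which is one reason output filters alone are unlikely to be airtight.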
Second, generative AI developers might need to be better informed about their training data – what data they have and where it came from. Developers should respect requests from content creators to delete the creators’ data from their training sets. In November, the developers of the Common Crawl database agreed to stop scraping the New York Times’ website. This is similar to data deletion requests for personal data under GDPR and CCPA, but it requires developers to have a deeper understanding of their data and of who originally created it.
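For context on the mechanics of that opt-out: Common Crawl’s crawler identifies itself with the user-agent string CCBot and honors publishers’ robots.txt files. Below is a minimal Python sketch, using the standard library’s urllib.robotparser, of the check a compliant crawler performs; the article URL is a hypothetical placeholder.

```python
# Minimal sketch: how a compliant crawler checks a publisher's robots.txt
# before fetching a page. "CCBot" is Common Crawl's crawler user agent;
# the article URL below is a hypothetical placeholder.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.nytimes.com/robots.txt")
rp.read()  # fetch and parse the publisher's robots.txt

page = "https://www.nytimes.com/hypothetical-article"
if rp.can_fetch("CCBot", page):
    print("robots.txt permits crawling this page")
else:
    print("robots.txt disallows crawling this page - skip it")
```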
Regardless of the outcome, this case is poised to profoundly reshape the relationship between journalism and generative AI.
___
January 8th, 2024 update: In its statement to the House of Lords Communications and Digital Select Committee, OpenAI explained that it could not train large language models, such as its GPT-4 model, without access to copyrighted work. “Because copyright today covers virtually every sort of human expression – including blogposts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials,” said OpenAI in its submission, first reported by the Telegraph.
___
Why am I watching this case so closely? At FairNow, our mission is to help companies use generative AI responsibly. Let us know a little more about your organization and how we can help maximize your AI while minimizing your risk.