A new paper accuses OpenAI of training its GPT-4o model on copyrighted books from O’Reilly Media without the publisher’s consent. OpenAI has faced similar allegations from other companies before. According to the paper, GPT-4o shows stronger recognition of paywalled O’Reilly book content than the earlier GPT-3.5 Turbo model.

Methodology of the AI Disclosures Project’s Research Paper

The paper used DE-COP, a method designed to detect copyrighted material in an AI model’s training data. It tests whether the model can tell original human-written text apart from AI-generated paraphrases of the same passage. A model that picks out the original more often than chance has likely seen it before. The researchers applied the test to 13,962 excerpts from O’Reilly books to estimate how much of this content was used to train OpenAI’s newer models.
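The multiple-choice idea behind this kind of test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's code: `build_quiz`, `guess_rate`, and the `choose` callback (a stand-in for querying a model) are hypothetical names, and the paper's actual prompts and scoring are not reproduced here.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One test item: the verbatim excerpt plus AI-generated paraphrases of it.
Item = Tuple[str, Sequence[str]]


def build_quiz(original: str, paraphrases: Sequence[str],
               rng: random.Random) -> Tuple[List[str], int]:
    """Shuffle the original in among its paraphrases.

    Returns the shuffled options and the index of the original.
    Assumes no paraphrase is identical to the original.
    """
    options = [original, *paraphrases]
    rng.shuffle(options)
    return options, options.index(original)


def guess_rate(items: Sequence[Item],
               choose: Callable[[List[str]], int],
               seed: int = 0) -> float:
    """Fraction of quizzes in which `choose` picks the verbatim excerpt.

    `choose` is a hypothetical stand-in for asking a model which option
    is the original; it receives the options and returns an index.
    """
    if not items:
        raise ValueError("need at least one item")
    rng = random.Random(seed)
    correct = 0
    for original, paraphrases in items:
        options, answer = build_quiz(original, paraphrases, rng)
        if choose(options) == answer:
            correct += 1
    return correct / len(items)
```

With three paraphrases per excerpt, a model that has never seen the text should land near the 25% chance level. A rate well above that on paywalled excerpts is the kind of signal the study treats as evidence of prior exposure.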

Findings and Limitations 

  1. GPT-4o recognized paywalled O’Reilly book content at a notably higher rate than GPT-3.5 Turbo.
  2. The study acknowledges its own limitations, noting that the excerpts may have reached OpenAI through users copying and pasting them into ChatGPT.
  3. The newer model, GPT-4.5, was not evaluated, leaving room for further investigation.

Broader Context

OpenAI has been entangled in several controversies over its data collection practices and has advocated for looser copyright restrictions. It also has licensing deals in place to pay for some of the content it uses in training, and it offers opt-out mechanisms, albeit imperfect ones, that allow copyright owners to flag content they do not want used for training.

OpenAI’s Response

OpenAI declined requests to comment on the allegations.