Hundreds of thousands of videos from news publishers like The New York Times and Vox were used to train AI models
 
Last month, The Atlantic dropped the latest investigation in its ongoing series on generative AI training data sets. Staff writer Alex Reisner found that at least 15 million YouTube videos had been used for training data by major technology companies, either for research or, in some cases, to build AI video products.
The Atlantic’s reporting focused on more than a dozen prominent training data sets that were either compiled or used by companies including Microsoft, Meta, Snap, Tencent, Runway AI, and ByteDance. The investigation shows how the unauthorized use of YouTube videos has been an essential contributor to the AI industry’s recent leap forward in video generation quality.
“Much as ChatGPT couldn’t write like Shakespeare without first ‘reading’ Shakespeare, a video generator couldn’t construct a fake newscast without ‘watching’ tons of recorded broadcasts,” writes Reisner.
The Atlantic’s story briefly mentions that more than 30,000 videos from the BBC were among the training data, alongside other YouTube channels focused on news. Using a searchable database published by The Atlantic, I wanted to better understand the scale at which news channels had been targeted. In the same data sets, I found hundreds of thousands of videos that were taken from some of the most popular news publishers and news creators on YouTube, including The New York Times, The Washington Post, The Guardian, Al Jazeera, and The Wall Street Journal.
For example, more than 88,000 videos were included from Fox News’ YouTube channels, including its flagship account and Fox Business. Another roughly 70,000 videos were taken from the channels of ABC News and its morning show, Good Morning America. I also found more than 55,000 videos from Bloomberg’s YouTube channels, including Bloomberg Originals, Bloomberg Television, and Bloomberg Technology.
Searching through Vox Media-owned YouTube channels in the database, I found more than 30,000 videos including explainers from Vox, travel docs from Eater, and animal tearjerkers from The Dodo. Roughly 13,900 of those videos were from The Verge’s official YouTube channel, including iOS gadget guides, episodes of its flagship podcast The Vergecast, and interviews with Silicon Valley CEOs like Mark Zuckerberg.
YouTube CEO Neal Mohan has previously said that it’s against the platform’s terms of service for other companies to download videos and use them for training data.
“In order to survive, AI platforms know they need (and their consumers want) quality, credible content like ours that give their products relevance and purpose,” said Lauren Starke, a spokesperson for Vox Media. “They’re spending at unprecedented levels on AI infrastructure: chips, servers, and data centers that power their models. Yet when it comes to the content that makes those models useful — journalism, creative work — they’ve comparatively spent next to nothing.”
In May 2024, Vox Media signed a partnership with OpenAI for an undisclosed sum, allowing OpenAI to use Vox Media’s content in products like ChatGPT. Starke said Vox Media will continue to explore partnerships with AI companies that respect its work, but “pursue legal remedies to protect our intellectual property, when necessary.”
“Without our quality content, the reality for these platforms will be: garbage in, garbage out,” she said.
News publications and news creators don’t need to register their YouTube videos with the U.S. Copyright Office (USCO) to have a valid copyright claim. That said, registering videos by submitting an application and paying a filing fee does come with legal benefits, like the ability to sue for copyright infringement.
The New York Times told me that it “registers its print edition and website on an ongoing basis with the US Copyright Office, including all underlying content.” In many cases, YouTube videos from the Times that are based on print or web articles that have already been registered with USCO could be considered “derivative works” and covered by the same filings.
“Taking content from creators like the Times without permission violates the law and will severely harm the market for original, independent reporting, which will diminish the ability of people to tell important stories, leaving the public less informed,” a spokesperson for the Times told me. “The Times believes that the future success of this technology should not come at the expense of journalistic institutions.”
Sam Seder, host of The Majority Report, meanwhile, said none of the videos on the show’s channel, which often posts five uploads per day, are registered with the USCO. As he puts it, he simply doesn’t “have the pockets” to cover filing fees and retain legal counsel, especially when up against some of the largest companies in the world.
He is comfortable with other creators pulling clips from his videos without permission, to a degree. After all, reaction videos are fuel for news creators across YouTube.
“People are using my content all the time, but they’re adding commentary to it, and it is part of a conversation, and it is transparent — that’s part of the ecosystem,” said Seder. He sees the mass downloading of his channel for AI training in another light. “What these [AI companies] are doing is fundamentally different. There’s no reciprocity; it’s only exploitative.”