Probe reveals 174K YouTube vids’ subtitles used for AI • The Register
Comment FYI: It’s not just Reddit posts, books, articles, webpages, code, music, images, and so forth being used by multi-billion-dollar businesses for training neural networks. AI labs have been teaching models using subtitles scraped from at least tens of thousands of YouTube videos, much to the surprise of the footage creators.
Those transcripts were compiled into what is termed the YouTube Subtitles dataset and incorporated into a larger repository of training material called the Pile, nonprofit nu-journo outfit Proof News highlighted this week. The YouTube Subtitles collection contains information from 173,536 YouTube videos including those of channels operated by Harvard University, the BBC, and web-celebs like Jimmy “MrBeast” Donaldson.
The dataset is a 5.7GB slice of the Pile, a larger 825GB silo created by nonprofit outfit EleutherAI. The Pile includes data pulled from GitHub, Wikipedia, Ubuntu IRC, Stack Exchange, bio-medical and other scientific papers, internal Enron emails, and many other sources. Overall, the YouTube Subtitles dataset is one of the smallest collections in the Pile.
Big names such as Apple, Salesforce, Nvidia, and others have incorporated the Pile, including the video transcripts, into their AI models during training. We’re told the makers of those YouTube videos weren’t aware this was happening. (There’s also nothing stopping tech giants from using YouTube data in other dataset collections; the Pile is just one possible source.)
Not hidden
It wasn’t a secret EleutherAI had gathered up subtitles from YouTube videos, as the organization not only made the Pile publicly available, it detailed the thing in a research paper in 2020. The code that scraped the YouTube Subtitles dataset is on GitHub for all to see. The script can be told to pull in subtitles for videos that match certain search terms; in the Pile’s case, those terms ranged from things like “quantum chromodynamics” to “flat earth.”
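For the curious, here is a minimal sketch of what term-driven selection can look like. It is not EleutherAI's script: the `Video` record, the `collect_subtitles` helper, and the tiny in-memory catalog are all hypothetical, and matching on titles is an assumption made purely for illustration. A real scraper would query a video search service per term and then download transcripts.

```python
from dataclasses import dataclass


@dataclass
class Video:
    # Hypothetical record type; the fields are invented for this sketch.
    video_id: str
    title: str
    subtitles: str


def collect_subtitles(catalog, search_terms):
    """Return {video_id: subtitles} for videos whose title contains any term."""
    lowered_terms = [term.lower() for term in search_terms]
    selected = {}
    for video in catalog:
        title = video.title.lower()
        # Keep the video if at least one search term appears in its title.
        if any(term in title for term in lowered_terms):
            selected[video.video_id] = video.subtitles
    return selected


# Made-up catalog entries standing in for search results.
CATALOG = [
    Video("a1", "Intro to Quantum Chromodynamics", "Quarks carry colour charge..."),
    Video("b2", "Is the Flat Earth theory testable?", "Let's look at the horizon..."),
    Video("c3", "Sourdough for beginners", "Start with flour and water..."),
]
TERMS = ["quantum chromodynamics", "flat earth"]

corpus = collect_subtitles(CATALOG, TERMS)
print(sorted(corpus))
```

The point of the sketch is only that the choice of search terms decides whose subtitles end up in the corpus; creators never opt in or out.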
The actual videos used to form the dataset aren’t mentioned in either the 2020 paper or on GitHub. Only now are people looking through the training data, since superseded by other collections, identifying the videos that were scraped, and tipping off YouTube creators. An online search tool for inspecting the subtitle training material has also been made available.
What’s interesting is that Google-owned YouTube’s terms of service, today at least, explicitly ban the use of scrapers and other automated systems unless they are public search engines that obey YT’s robots.txt rules or have specific permission from YouTube.
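To show what "obeying robots.txt" means in practice, here is a small sketch using Python's standard `urllib.robotparser`. The rules, the `example-crawler` user agent, and the `video.example` URLs are invented for illustration and are not YouTube's actual robots.txt; `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules made up for this sketch; not YouTube's real robots.txt.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /api/
Allow: /
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Check whether a crawler identifying as user_agent may fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines.
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


watch_ok = is_allowed(EXAMPLE_RULES, "example-crawler", "https://video.example/watch?v=abc")
api_ok = is_allowed(EXAMPLE_RULES, "example-crawler", "https://video.example/api/subtitles")
print(watch_ok, api_ok)
```

A well-behaved crawler runs a check like this before every fetch. Note that robots.txt is a convention, not an enforcement mechanism: nothing technically stops a scraper from ignoring it, which is why the terms of service matter.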
The terms also seemingly prohibit the downloading and use of things like subtitles in AI training unless, again, YouTube and applicable rights holders give permission. So on the one hand, there is the potential for the automated scraping of subtitles to be against YouTube’s rules, but there’s also wiggle room for it to be totally fine. Well, as far as YouTube is concerned, that is; whether creators feel their work is being unethically exploited by rich companies, however legally, is another matter.
It’s something that everyone is dancing around. The PR folks at Google – which is itself in the AI game – have simply said, in response to this week’s reporting, that the internet giant puts a lot of effort into thwarting unauthorized scraping, and declined to talk about individual organizations’ use of its YouTube data.
AI labs that used the Pile to build their models argued they simply incorporated a broad public dataset and that they weren’t the ones doing any scraping; the training database conveniently acts as rocket fuel and a legal blast shield for their machine learning activities, in their view.
“Apple has sourced data for their AI from several companies,” tech reviewer Marques Brownlee said on Xitter in light of the findings. “One of them scraped tons of data/transcripts from YouTube videos, including mine.”
Brownlee noted that “Apple technically avoids ‘fault’ here because they’re not the ones scraping.”
The Register has asked EleutherAI, Apple, Nvidia, and others named in the report for further details and explanations.
Using people’s work to train AI without explicit permission has sparked big lawsuits. Microsoft and OpenAI were sued in April by a cohort of US newspapers, and two AI music generators got complaints from the Sony, Warner, and Universal music groups.
A few things seem certain. Artificial intelligence developers can and will get their hands on all manner of information for training – as training data drives their neural networks’ performance – and they don’t always need explicit permission from creative types to do it, as permission may already have been quietly granted through platform T&Cs.
And at least some of these development labs are highly reluctant to reveal where exactly they get their training data, for various reasons as you can imagine, including commercial secrecy.
This is something we expect to rumble on and on, with more and more revelations of info being exploited, however legal or ethical that may be, much to the exasperation of the people who created that material in the first place and are now being displaced by this technology. ®