Apple Intelligence doesn’t use YouTube, but does it matter?

Apple has confirmed recent claims that it used subtitle data from YouTube videos to train one of its artificial intelligence (AI) tools, but says the tool is not used in Apple Intelligence.

Apple confirmed a report from Proofnews that it had used this YouTube data to train one of its models. The company explained that it did so to train the open-source OpenELM models released earlier this year. The information was included within a larger collection maintained by EleutherAI, a non-profit organization that supports AI research.

Apple used YouTube once, but not now

However, Apple told 9to5Mac that models trained using that information don’t power any of its own AI or machine learning tools, including Apple Intelligence. This was a research project originally created by Apple’s AI teams and then shared, including via the company’s own Machine Learning Research site.

What’s important is that it shows the extent to which Apple wants to be seen as keeping its promise that Apple Intelligence models are trained on licensed data.

But that’s not the big picture. As mentioned earlier in the week, Apple also trains its Apple Intelligence models using “publicly available data collected by our web-crawler.” That admission reflects the extent to which tech companies are using information published online to create new AI products from which they subsequently profit.

Making public data private

The issue is that by turning other people’s creative works into data, and then profiting from that data, tech firms aren’t playing fair. 

Speaking to Proofnews, Dave Farina, the host of “Professor Dave Explains,” put it this way: “If you’re profiting off of work that I’ve done [to build a product] that will put me out of work or people like me out of work, then there needs to be a conversation on the table about compensation or some kind of regulation.”

To some extent, the focus on YouTube data distracts from that critical argument, which is that the generative AI (genAI) tools coming into common use today are likely to have been trained on information created by humans and shared online. That’s the kind of information picked up by web crawlers, including Apple’s.

But data quality is a real issue here, and the search for the best data inherently means the best data sources become the highest-octane fuel for training AI.

The drive for quality means content is king

Consider just two of the challenges AI researchers face.

Automated data grading systems might reject old, out-of-date, or false information, but some still gets through, which is why AI systems so often hallucinate (the current term for generating fake information) or exhibit questionable morality (racist or gender-biased language).

Data also has a finite lifespan. Facts can and do change over time, and maintaining high-quality data is an essential bulwark against the classic “garbage in, garbage out” problem, whether the garbage is information that was always irrelevant or high-grade information that becomes irrelevant over time.

What this means is that in their quest for high-quality information, AI companies inevitably seek out high-quality data sources. When that search plays out across the open public web, it implies that the creatives currently battling tech firms for compensation for the use of their material in training AI systems have a good point.

The best and most current information they create is worth something: to the creators, to those who consume it, and to the people who own and train the machines that harvest it. Indeed, given that AI by its nature becomes a tool directly available to everyone and across every supported language, it seems plausible that the value of that information might actually grow once it is used to train an AI model.

So, while Apple might not be using YouTube data for its Apple Intelligence models, it will be using other data gathered across the public web. And while Apple might at least try to avoid using data it should not exploit this way, and is honest enough to have responded to the current YouTube controversy, not every AI firm does the same. And once the machine is trained, it cannot be untrained.

Please follow me on Mastodon, or join me in the AppleHolic’s bar & grill and Apple Discussions groups on MeWe.
