EU AI Act (https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:...) as of today (CTRL + F "open-source"):
> (89) Third parties making accessible to the public tools, services, processes, or AI components other than general-purpose AI models, should not be mandated to comply with requirements targeting the responsibilities along the AI value chain, in particular towards the provider that has used or integrated them, when those tools, services, processes, or AI components are made accessible under a free and open-source licence. ...
> Article 2, 12. This Regulation does not apply to AI systems released under free and open-source licences, unless they are placed on the market or put into service as high-risk AI systems or as an AI system that falls under Article 5 or 50.
Let's see if the EU AI Act will be adjusted in the same spirit as discussed in the linked discussion.
> > ..., should not be mandated to comply with requirements targeting the responsibilities along the AI value chain, ...
What does that mean?
It reads to me that you can’t pass on upstream AI vendor/platform/service requirements downstream to users/customers/end parties.
Which would be separate from any legislated requirements or limitations.
Given that most AI is trained on data scraped from the internet (most of which isn't open source), isn't it basically impossible to release an entire training dataset under an open source licence?
That would, I suspect, be the point. If your AI is trained on non-free content, the implication is that it would be impossible for it to be released with an open source licence. So don't do that, the argument goes: only use content that has been released with a sufficiently free licence that republishing it in your dataset is not a problem. And as a side effect, you have to show that there isn't any "misappropriated" content in your training set. That side effect is what gets some people excited here.
I don't agree with that position legally, but I do mechanically. The point of the GPL family (to pick one random type of licence) is that the end user should have the capability to modify the product to their own ends, and I don't think fine-tuning provides enough capability to qualify.
It has been done before: for example, the original RNNoise was trained on proprietary data, and later there was a crowd-sourced effort to record new data and release it under libre licenses.
https://github.com/xiph/rnnoise/
They could release as much as is necessary to recreate it: the crawlers or list of links they used, plus the configuration or scripts used to drive the training. Nobody is asking for the entire web in their git repo, only the ability to retrain from scratch, possibly with modifications.
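As a minimal sketch of what "releasing the list of links" could look like in practice: a published manifest of source URLs paired with content checksums, so anyone rebuilding the training set can at least detect whether the data they fetch still matches what the model was trained on. All names and values here are illustrative assumptions, not anything an actual model vendor publishes.

```python
import hashlib
import urllib.request

# Hypothetical published manifest: (source URL, expected SHA-256 of the file).
# Both entries below are illustrative placeholders, not real corpus files.
MANIFEST = [
    ("https://example.org/corpus/part-0001.txt",
     "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"),
]

def fetch_and_verify(url: str, expected_sha256: str) -> bytes:
    """Download one corpus file and check it still matches the manifest."""
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        # The upstream content drifted or vanished — the dataset is no
        # longer reproducible from links alone, which is the objection
        # raised in the reply below.
        raise ValueError(f"{url}: checksum mismatch (got {digest})")
    return data
```

Checksums make drift detectable but do not prevent it, which is exactly why some argue the data itself, not just the links, must be published.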
Not really, because there’s no guarantee that it will be available in the future. A script to download the data doesn’t mean I can reliably recreate the data in 5 years, so I wouldn’t call that open source. To me, the data itself needs to be published.
Oh well, they did their best. You can't expect them to do better than what is possible. Enough Nirvana fallacy here.
I’m not expecting them to do the impossible, but they shouldn’t call it open source then. Either you provide all the data and call it open source, or you don’t provide the data because it is proprietary and don’t call the model open source.
I like Debian's policy for libre AI:
https://salsa.debian.org/deeplearning-team/ml-policy/
«ToxicCandy Model» is a great term!
It's heartening to see people take this seriously. Let's hope many more stand up for the basic ontology and spirit of free software.