GenAI Should Respect The Underlying Copyright That Trains It To Progress Further
[This article is developed from this thread I wrote earlier.]
Back in my previous career I was doing an “undercover” market survey. There was this guy selling software that goes in those ringtone kiosks we used to have. He clarified that he owned the copyright to the software, to which I asked, “But what about the ringtones sold? Don’t those have copyright?”
“Oh no, that’s covered under fair use so we don’t need to pay anything.”
For the uninitiated, back then ringtones were either sent over premium SMS or downloaded through kiosks, and they were a relatively new technology. It was a big business, as ringtones allowed you to express yourself and stand out from everyone else using the factory default (or their own chosen ringtones).
As a business rep for a music label, I found that many of our conversations with companies selling ringtones, whether through kiosks or SMS, revolved around either fair use or “hey, we’re promoting your artist and music for free”. Naturally, the music industry wasn’t going to let it slide, and both the publishing and recording companies basically started suing everyone into oblivion unless they did proper licensing deals.
Which brings us to today. After learning more about the issues at hand, I’ve come to the conclusion that using training data for GenAI models CAN be argued as fair use. THAT SAID, since the extrapolations of said data continue to be used and commercialised, owners of any training data should be allowed to block their content from being used and expressed in GenAI output. If the model benefits from learning from copyrighted work, then the copyright owner should at least have a say (if not a share of the profits). Outside of that, current copyright law applies: if a work substantially copies an existing copyrighted work, whether it was created by a person or generated by AI, its maker can be liable for copyright infringement.
(And no, you can’t make the argument that “people also train then create based on existing [copyrighted] data”, because in most cases you can’t really trace a person’s influences the way you could, say, examine a list of data used for AI training, and human creators who plagiarise get called out, ridiculed or sued anyway.)
This issue of copyright around GenAI training data becomes much more important as the models progress. Researchers predict that once the models run out of original content to train on, they will start training on content generated by GenAI and begin to lose coherence.
There’s no way, despite the lamentation by artists and creative workers, to put the GenAI genie back in the bottle. But there is a way to develop it (or develop for it) responsibly. Why not have GenAI with an ethical and legally sound training regimen, which artists could use the way Andy Warhol used printing for his commercial art? Or the way DJs use music samples to create new music? Imagine being able to create multiple versions of your art, helped by GenAI that is trained on your unique art characteristics.
As for the future of GenAI, its need for further training data (content that creators can readily make) should be met in a way that respects and rewards the creators of this new content, the same way a smartphone maker pays licence fees for the patents used in the technologies within it.