For the first time in the history of artificial intelligence, an AI system has become the target of a copyright claim: in November 2022, developer Matthew Butterick announced that he, together with the Joseph Saveri Law Firm, intends to sue the creators of “Copilot” for what he sees as copyright infringement.
The suit targets an AI-based tool that automatically completes code for developers. Copilot is powered by Codex, a generative, pre-trained AI model developed by OpenAI. The system was trained on millions of lines of open-source code that thousands of programmers have posted on GitHub. GitHub itself has been part of the Microsoft group since 2018. The creators of open-source software on GitHub usually attach copyright notices or licenses to their contributions that specify terms of use, terms the AI has allegedly not complied with. And interestingly, there is a fee for using Copilot.
“Just like the rise of compilers and open source, we believe AI-powered programming will fundamentally change the way software is developed by giving developers a new tool to write code easier and faster,” GitHub CEO Thomas Dohmke said at the time, touting the new tool.
The fear is that the system will insert code snippets learned from open-source projects without crediting their original creators or reproducing the original licenses. Butterick says: “AI systems are not exempt from the law. Those who develop and operate these systems must be held accountable. When companies like Microsoft, GitHub, and OpenAI choose to flout the law, they shouldn’t expect us, the public, to sit still. AI needs to be fair and ethical for everyone. If it is not, it can never achieve its vaunted goals of improving humanity. It will just be another way for the privileged few to profit from the work of the many.”
Microsoft, however, denies that Copilot simply reuses code fragments from open source projects: “GitHub Copilot suggestions are all generated by AI. GitHub Copilot generates new code in a probabilistic way, and the likelihood that they will produce the same code as a snippet encountered in training is low.” Nevertheless, a filter has been built in that detects suggestions matching code on GitHub and can suppress them automatically if the developer configures it to do so.
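How such a filter might work can be illustrated with a small sketch. This is not GitHub's actual implementation, which is not described here; it is a hypothetical example that assumes a suggestion counts as a "match" when it shares a run of eight consecutive tokens with known public code. The function names, the tokenizer, and the window size are all assumptions made for illustration.

```python
import re

def tokenize(code):
    """Split code into identifiers, numbers, and symbols, ignoring whitespace."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def shingles(tokens, n):
    """Return the set of all runs of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(public_snippets, n=8):
    """Collect the token runs of every known public snippet (hypothetical index)."""
    index = set()
    for snippet in public_snippets:
        index |= shingles(tokenize(snippet), n)
    return index

def filter_suggestion(suggestion, index, n=8, block_matches=True):
    """Return None if the suggestion matches public code and blocking is enabled."""
    matched = bool(shingles(tokenize(suggestion), n) & index)
    if matched and block_matches:
        return None  # suppress the suggestion
    return suggestion

# Example: one known public snippet, two candidate suggestions.
index = build_index(["def add(a, b):\n    return a + b"])
copied = filter_suggestion("def add(a,b): return a+b", index)
fresh = filter_suggestion("def mul(x, y):\n    return x * y", index)
```

In this sketch `copied` is suppressed because it differs from the public snippet only in whitespace, while `fresh` passes through. Setting `block_matches=False` mirrors a developer who has not enabled the filter: matching suggestions are then returned unchanged.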
The plaintiffs, however, see this quite differently: “By training their AI systems on public GitHub repositories, Defendants have violated the rights of a large number of creators who have published code or other works under certain open source licenses on GitHub,” the complaint states. Specifically, it argues, the code generated by Copilot does not include attribution to the original author, copyright notices, or a copy of the license, as required by most open source licenses. The complaint also lists common open source licenses that Copilot may be violating, all of which require author attribution and a copyright notice, such as the MIT license, the GPL, and the Apache license.
The generated code also immediately raises the following question: if Copilot was trained on software that is subject to an open source license, which license applies to the code Copilot produces? BSD? MIT? Another one? No license at all? Or none that can validly apply, because the underlying parts are under incompatible licenses and there is no way to combine them? Microsoft does not specify anything about that. Rather, it explicitly shifts the risk to users, who must bear the entire burden of license compliance.
Attorneys at the Joseph Saveri Law Firm pointed out in a press release that this is a potentially history-making lawsuit: “This lawsuit is a crucial chapter in an industry-wide debate about the ethics of training AI tools with data that comes without permission from the creators, and what constitutes fair use of intellectual property. Despite Microsoft’s protestations to the contrary, the company does not have the right to treat source code offered under an open source license as if it were in the public domain.”