WatersWorks

Blog archive

Class-Action Lawsuit Claims GitHub Copilot is Violating Open-Source Licenses

GitHub rolled out a slew of product announcements at its annual GitHub Universe developer conference earlier this month. As we reported, expanded access for business users of its Copilot AI pair programming service generated the loudest buzz. (The company calls the new offering "Copilot for Business.")

Meanwhile, a different kind of buzz has been building about whether Copilot, which GitHub says has been trained on billions of lines of publicly-available code, is violating the legal rights of those who posted code on GitHub under open-source licenses.

On Nov. 3, a class-action lawsuit was filed in a U.S. Federal Court in San Francisco challenging the legality of this practice. Cited in the lawsuit was GitHub, its parent company, Microsoft, and their partner, OpenAI.

"By train­ing their AI sys­tems on pub­lic GitHub repos­i­to­ries (though based on their pub­lic state­ments, pos­si­bly much more) we con­tend that the defen­dants have vio­lated the legal rights of a vast num­ber of cre­ators who posted code or other work under cer­tain open-source licenses on GitHub," the complaint reads.

Specifically, the code generated by Copilot does not include any attribution of the original author, copyright notices, and/or a copy of the license, which most open-source licenses require, the complaint alleges. It also lists 11 pop­u­lar open-source licenses Copilot is potentially violating, all of which require attri­bu­tion of the author's name and copy­right, includ­ing the MIT license, the GPL and the Apache license, among others.

"Copilot ignores, violates, and removes the licenses offered by thousands—possibly millions—of software developers, thereby accomplishing software piracy on an unprecedented scale," the complaint alleges.

GitHub responded to the allegations in a statement: "We've been committed to innovating responsibly with Copilot from the start, and will continue to evolve the product to best serve developers across the globe." The company has also said it plans to introduce a new Copilot feature that will "provide a reference for suggestions that resemble public code on GitHub, so that you can make a more informed decision about whether and how to use that code," including "providing attribution where appropriate." GitHub also has a configurable filter to block suggestions matching public code.

The lawsuit was filed by the Joseph Saveri Law Firm, a San Francisco-based antitrust litigation law group, and Matthew Butterick, who is a lawyer, designer and coder, on behalf of "open-source programmers." 

Butterick, a longtime open-source advocate, expressed his concerns about GitHub Copilot this summer in a blog post entitled, "This Copilot is Stupid and Wants to Kill Me." He makes his case in some detail. (Recommended reading.)

"The fact is, since Copilot was released in its limited technical preview by Microsoft in June 2021, open-source programmers have been raising concerns about how it works," Butterick told me during a video conference. "I wrote that post because I agreed with members of the open-source community who felt that Copilot was really a device for laundering open-source licenses."

One of the points Butterick made during our conversation is that Microsoft is effectively passing the buck on this issue. Notably, on its About GitHub Copilot page, Microsoft writes, "You are responsible for ensuring the security and quality of your code. We recommend you take the same precautions when using code generated by GitHub Copilot that you would when using any code you didn't write yourself. These precautions include rigorous testing, IP scanning, and tracking for security vulnerabilities…."

"You have to ask, what are the ethics of just hoovering up all of this material and just kind of arrogating it to yourself for free?" Butterick said.

Copilot, which installs as an extension in a range of IDEs (e.g., Visual Studio, VS Code, Neovim and JetBrains), uses OpenAI's Codex, a system that translates natural language into code, to suggest code and entire functions in real time, directly from the editor. Codex is based on OpenAI's GPT-3 language model.

Its use of Codex is one of the things that makes Copilot different from traditional autocomplete tools, Butterick pointed out. Codex, which is licensed to Microsoft, makes it possible for Copilot to offer suggestions based on text prompts typed by the user. Although it can be used for small suggestions, Microsoft has touted its ability to suggest larger blocks of code, such as the entire body of a function, Butterick said.

Some have put forward the argument that Microsoft's use of code from GitHub constitutes fair use. For­mer GitHub CEO Nat Friedman claimed in a 2021 tweet:

In general: (1) training ML systems on public data is fair use (2) the output belongs to the operator, just like with a compiler.

But Bradley M. Kuhn, director of the Soft­ware Free­dom Con­ser­vancy, wrote in a February 2022 blog post:

While [Nat] Friedman ignored the community's requests publicly, we inquired privately with Friedman and other Microsoft and GitHub representatives in June 2021, asking for solid legal references for GitHub's public legal positions [for the tweeted assertions]. They provided none, and reiterated, without evidence, that they believed the model does not contain copies of the software, and output produced by Copilot can be licensed under any license. We further asked if there are no licensing concerns on either side, why did Microsoft not also train the system on their large proprietary codebases such as Office? They had no immediate answer. Microsoft and GitHub promised to get back to us, but have not.

And there have been some pointed reactions to the lawsuit.

"I've had people literally tweet that I am destroying the geopolitical order, because this lawsuit is going to be handing China an unbeatable advantage in AI," Butterick said. "It's really the opposite; I think we should have the best AI in the world. But look, I think we can agree that Spotify and Apple Music are better than Napster. Once we get through this 'Napster phase' of AI, we're going to bring creators to the table, and we're going to make it work for them. And the next generation of these tools is going to be much better."

The attorneys at the Joseph Saveri Law Firm noted in a press release that this is a potentially history-making lawsuit: "This lawsuit constitutes a critical chapter in an industry-wide debate regarding the ethics of training AI tools with data sourced without permission from its creators and what constitutes a fair use of intellectual property. Despite Microsoft's protestations to the contrary, it does not have the right to treat source code offered under an open-source license as if it were in the public domain."

Butterick believes this is the first class-action case in the United States chal­leng­ing the train­ing and out­put of AI sys­tems. He believes it will not be the last. On his blog, he wrote: "AI sys­tems are not exempt from the law. Those who cre­ate and oper­ate these sys­tems must remain account­able. If com­pa­nies like Microsoft, GitHub, and OpenAI choose to dis­re­gard the law, they should not expect that we the pub­lic will sit still. AI needs to be fair & eth­i­cal for every­one. If it's not, then it can never achieve its vaunted aims of ele­vat­ing human­ity. It will just become another way for the priv­i­leged few to profit from the work of the many."

Posted by John K. Waters on November 22, 2022