A developer building an AI agent found that the system had recommended a small obscure package he had written himself — with only a few stars and no recent updates — and suspected the AI had been trained on his own work without his knowledge

Tension: A developer discovers the AI system he is building has surfaced his own obscure, low-starred package as a recommendation — meaning a model trained on public code had ingested his work, learned from it, and was now deploying it without his knowledge, consent, or credit.
Noise: Coverage of AI and intellectual property focuses on landmark litigation — major publishers, record labels, studios suing large AI companies. The individual developer story barely registers.
Direct Message: The consent problem in AI training is not only a problem for famous works. It runs through the entire public internet — including the small, barely-noticed contributions of working developers who never imagined their code would become infrastructure for a system that competes with them. The discovery is personal in a way the lawsuits are not.

The moment of recognition arrived without fanfare. A developer — building an AI agent to assist with dependency management — watched the system surface a package recommendation. The package was obscure: fewer than a dozen GitHub stars, no commits in over two years, a README written in a single sitting and never revised. It was also, unmistakably, his own work. Something he had published years earlier and largely forgotten.

The experience was not quite accusatory, and not quite flattering. It was disorienting in a more particular way: the AI knew something about him that he had never told it. It had encountered his work somewhere in the vast sweep of public repositories it had been trained on, absorbed the patterns and decisions embedded in that code, and was now reproducing those decisions in a different context entirely — one he had built himself, using a different tool, for a different purpose. The loop was closed without his awareness or participation.

The scale of what got ingested

The training corpora behind modern code-oriented large language models are vast. GitHub Copilot, the AI pair-programming tool developed by GitHub (a Microsoft subsidiary) and OpenAI, was trained on billions of lines of publicly available code, a dataset that by its nature spans an enormous range of quality, recency, and intent. Models like Code Llama, StarCoder, and the code-generation capabilities embedded in general-purpose systems like GPT-4 and Claude draw on similarly broad collections, from sources such as GitHub, npm, PyPI, Stack Overflow, and assorted documentation repositories.

The result is that essentially anything a developer has published publicly (any package uploaded, any answer posted, any README committed) has a reasonable probability of being somewhere in the training data of at least one major AI system. The accumulated labor of working developers, in the form of small utilities, opinionated libraries, and half-finished experiments, constitutes a substantial part of what AI coding tools draw on to know how software is made.

“The training corpus is not abstract. It is the accumulated labor of millions of working developers — the small utilities, the opinionated libraries, the half-finished experiments published on a Thursday afternoon.”

What the licenses said, and what they didn’t

Most publicly available code carries an open-source license. The MIT License, which governs an enormous proportion of public repositories, grants permission to use, copy, modify, merge, publish, distribute, sublicense, and sell copies of the software. The Apache License 2.0 adds explicit patent grants and terms around attribution. Both are permissive. Both have been widely used since well before the current generation of AI development tools existed.

The licenses were not written with LLM training in mind. This is not a legal technicality — it is an observation about context. A developer publishing a small utility under MIT in 2017 was contemplating other developers using, forking, or building on that utility. The license was calibrated to a world of direct reuse: you take my code, you use my code, the license governs what you can do with it. The scenario where that code becomes a data point in the training of a commercial AI product — a product that then competes in the market the developer operates in — was not a use case the license writers were contemplating.

Whether that gap between contemplated use and actual use constitutes a legal problem remains genuinely contested. The class action lawsuit filed against GitHub, Microsoft, and OpenAI in 2022 alleged that Copilot’s training and outputs violated open-source licenses and the rights of developers whose code was used without attribution. In June 2024, a federal judge dismissed the majority of claims, including the central DMCA claim concerning the removal of copyright management information, on the grounds that Copilot’s outputs were not identical enough to the plaintiffs’ work. Two narrower claims, for breach of contract and open-source license violation, remain active, with a DMCA appeal filed at the Ninth Circuit in April 2025. The litigation continues, but the legal landscape has shifted considerably in the defendants’ favor. In the meantime, developers operate in a legal environment that has not caught up with the technology.

Public versus consented

The developer in this story published his package publicly. That choice matters. He made a decision to put his work into a shared space, to contribute it to an ecosystem he was part of. No one took anything from him in any simple sense. The package was findable; the AI found it.

But there is a meaningful distinction between “public” and “consented to for this specific use,” and the current architecture of AI training largely collapses that distinction. The reasoning runs: you made it public; public means available; available means usable. That logic is coherent as far as it goes, but it sidesteps the question of what “public” means in practice — a context-dependent concept that has always involved some expectation about who the audience is and what they will do with the material.

When a developer publishes code on GitHub, the implicit audience is other developers. The implicit uses are reading, forking, running, adapting. The implicit social contract is one of mutual contribution — the developer gives something to the commons; others give things to the commons; everyone benefits. The use of that code as training data for a commercial system that charges for access to capabilities derived from the commons sits differently in that social contract, even if it is not clearly prohibited by the formal terms of the license.

The phenomenology of the discovery

What makes this case story-worthy is not primarily the legal dimension. It is the experience itself — the specific texture of encountering your own work reflected back through a system you are using to build something new.

The developer built something. He put it into the world. He moved on. Years later, while he was working in a different context with a different tool, the work reappeared: not as something he had summoned or retrieved, but as something the system surfaced on its own, based on its own learned sense of what was relevant. The AI knew his work. He had not known the AI knew his work. That asymmetry, in which the system had absorbed his work while he remained unaware that it had, is the structural condition of the moment.

It is not plagiarism in the traditional sense. The AI did not reproduce his code verbatim. It recommended a package he had written, which is in some ways closer to a citation than a theft. But citation implies acknowledgment, and there was no acknowledgment here — no signal to the developer that his contribution had been absorbed, no way for him to know it had happened until the moment it surfaced unexpectedly in his own workflow.

A broader condition for the developer community

This developer’s experience is increasingly common, even if the specific form varies. Developers have documented encountering their own Stack Overflow answers reproduced in AI-generated responses — in some cases verbatim, in others lightly paraphrased — without attribution or acknowledgment. The phenomenon has been discussed at length on Stack Overflow’s own meta forums, where contributors have noted the asymmetry between freely offering answers to a community and having those answers harvested into commercial products. Others find their documentation style reproduced in coding assistant suggestions, or notice that an AI seems to have a particular familiarity with a library or approach they developed — a familiarity with no obvious source other than training on their public contributions.

These experiences have prompted discussion in developer communities, though that discussion has not yet coalesced into a coherent political or legal movement. Unlike the high-profile cases involving publishers suing AI companies or record labels filing copyright claims against model developers, the individual developer story lacks the institutional weight required to attract sustained legal attention. The packages are small. The developers are numerous. The harm, to each individual, is diffuse enough to be difficult to articulate as a concrete injury.

That diffuseness does not make it philosophically uninteresting. The training corpus of a major AI coding tool is, in aggregate, a portrait of how software development has worked — the conventions, the idioms, the debates encoded in commit messages and issue threads and README files. It is a collective artifact assembled from individual acts of contribution, each made under assumptions that did not include this outcome.

What it means for the developer-tool relationship

There is a stranger implication lurking beneath the surface of this case. If a developer’s past contributions are embedded in the training data of the tools they use today, the relationship between the developer and the tool is not simply one of user and instrument. The tool, in some sense, has been shaped by the developer’s prior work. The suggestions it makes have been influenced, at some fractional and unquantifiable level, by decisions the developer made years ago in a different context.

That is philosophically interesting in a way that is difficult to operationalize. It does not give the developer any particular rights or remedies. It does not change the outputs in any traceable way. But it suggests that the boundary between a developer’s agency and the tool’s agency is less sharp than the interface implies. The tool is not a neutral instrument. It is, among other things, a compressed and transformed version of the community that produced it — including people who had no say in the compression or the transformation.

The absence of a legal remedy is not the end of the question

The developer in the opening story almost certainly does not have a viable legal claim. The package was public. The license is permissive. Courts have been slow to find liability in cases involving training data drawn from publicly available material, and the legal theories most likely to succeed — copyright infringement through memorization and reproduction, breach of license conditions — are difficult to apply to a scenario where the output is a recommendation rather than reproduced code.

But the absence of a legal remedy does not make the underlying question disappear. It simply relocates it. If the law does not resolve the question of what consent means in the context of AI training — what obligations, if any, developers of AI systems have to the people whose work they trained on — then the answer will have to come from somewhere else: from community norms, from platform policies, from the evolving social contract of a developer ecosystem that is still working out what it owes its members.

The developer who watched an AI recommend his own forgotten package is not, in all likelihood, going to file a lawsuit. He is going to think about what it means to put work into the world in an era when the world has changed what it does with the things you give it.

That is a smaller story than the litigation. It is also, for the people living through it, the more immediate one.


Direct Message News
