A developer building an AI agent found that the system had recommended a small obscure package he had written himself — with only a few stars and no recent updates — and suspected the AI had been trained on his own work without his knowledge

The moment of recognition arrived without fanfare. A developer — building an AI agent to assist with dependency management — watched the system surface a package recommendation. The package was obscure: fewer than a dozen GitHub stars, no commits in over two years, a README written in a single sitting and never revised. It was also, unmistakably, his own work. Something he had published years earlier and largely forgotten.

The experience was not quite accusatory, and not quite flattering. It was disorienting in a more particular way: the AI knew something about him that he had never told it. It had encountered his work somewhere in the vast sweep of public repositories it had been trained on, absorbed the patterns and decisions embedded in that code, and was now reproducing those decisions in a different context entirely — one he had built himself, using a different tool, for a different purpose. The loop was closed without his awareness or participation.

The scale of what got ingested

The training corpora behind modern code-oriented large language models are not small. GitHub Copilot, the AI pair-programming tool developed by GitHub (a Microsoft subsidiary) and OpenAI, was trained on billions of lines of publicly available code, a dataset that by its nature spans an enormous range of quality, recency, and intent. Models like Code Llama, StarCoder, and the code-generation capabilities embedded in general-purpose systems like GPT-4 and Claude draw on similarly broad collections sourced from GitHub, npm, PyPI, Stack Overflow, and assorted documentation repositories.

The result is that essentially anything a developer has published publicly (any package uploaded, any answer posted, any README committed) has a reasonable probability of being somewhere in the training data of at least one major AI system. The accumulated labor of working developers, in the form of small utilities, opinionated libraries, and half-finished experiments, constitutes a substantial part of what AI coding tools draw on to know how software is made.

“The training corpus is not abstract. It is the accumulated labor of millions of working developers — the small utilities, the opinionated libraries, the half-finished experiments published on a Thursday afternoon.”

What the licenses said, and what they didn’t

Most publicly available code carries an open-source license. The MIT License, which governs an enormous proportion of public repositories, grants permission to use, copy, modify, merge, publish, distribute, sublicense, and sell copies of the software. The Apache License 2.0 adds explicit patent grants and terms around attribution. Both are permissive. Both have been widely used since well before the current generation of AI development tools existed.

The licenses were not written with LLM training in mind. This is not a legal technicality — it is an observation about context. A developer publishing a small utility under MIT in 2017 was contemplating other developers using, forking, or building on that utility. The license was calibrated to a world of direct reuse: you take my code, you use my code, the license governs what you can do with it. The scenario where that code becomes a data point in the training of a commercial AI product — a product that then competes in the market the developer operates in — was not a use case the license writers were contemplating.

Whether that gap between contemplated use and actual use constitutes a legal problem remains genuinely contested. The class action lawsuit filed against GitHub, Microsoft, and OpenAI in 2022 alleged that Copilot’s training and outputs violated open-source licenses and the rights of developers whose code was used without attribution. In June 2024, a federal judge dismissed the majority of claims, including the central DMCA claim over the removal of copyright management information, on the grounds that Copilot’s outputs were not identical enough to the plaintiffs’ work. Two narrower claims, for breach of contract and open-source license violation, remain active, with a DMCA appeal filed at the Ninth Circuit in April 2025. The litigation continues, but the legal landscape has shifted considerably in the defendants’ favor. In the meantime, developers operate in a legal environment that has not caught up with the technology.

Public versus consented

The developer in this story published his package publicly. That choice matters. He made a decision to put his work into a shared space, to contribute it to an ecosystem he was part of. No one took anything from him in any simple sense. The package was findable; the AI found it.

But there is a meaningful distinction between “public” and “consented to for this specific use,” and the current architecture of AI training largely collapses that distinction. The reasoning runs: you made it public; public means available; available means usable. That logic is coherent as far as it goes, but it sidesteps the question of what “public” means in practice — a context-dependent concept that has always involved some expectation about who the audience is and what they will do with the material.

When a developer publishes code on GitHub, the implicit audience is other developers. The implicit uses are reading, forking, running, adapting. The implicit social contract is one of mutual contribution — the developer gives something to the commons; others give things to the commons; everyone benefits. The use of that code as training data for a commercial system that charges for access to capabilities derived from the commons sits differently in that social contract, even if it is not clearly prohibited by the formal terms of the license.

The phenomenology of the discovery

What makes this case story-worthy is not primarily the legal dimension. It is the experience itself — the specific texture of encountering your own work reflected back through a system you are using to build something new.

The developer built something. He put it into the world. He moved on. Years later, while he was working in a different context with a different tool, the work reappeared, not as something he had summoned or retrieved, but as something the system surfaced on its own, based on its own learned sense of what was relevant. The AI knew his work. He had not known the AI knew his work. That asymmetry, in which the system held knowledge of his work while he had no idea that it did, is the structural condition of the moment.

It is not plagiarism in the traditional sense. The AI did not reproduce his code verbatim. It recommended a package he had written, which is in some ways closer to a citation than a theft. But citation implies acknowledgment, and there was no acknowledgment here — no signal to the developer that his contribution had been absorbed, no way for him to know it had happened until the moment it surfaced unexpectedly in his own workflow.

A broader condition for the developer community

This developer’s experience is increasingly common, even if the specific form varies. Developers have documented encountering their own Stack Overflow answers reproduced in AI-generated responses — in some cases verbatim, in others lightly paraphrased — without attribution or acknowledgment. The phenomenon has been discussed at length on Stack Overflow’s own meta forums, where contributors have noted the asymmetry between freely offering answers to a community and having those answers harvested into commercial products. Others find their documentation style reproduced in coding assistant suggestions, or notice that an AI seems to have a particular familiarity with a library or approach they developed — a familiarity with no obvious source other than training on their public contributions.

Discussion of these experiences has spread through developer communities, though it has not yet coalesced into a coherent political or legal movement. Unlike the high-profile cases involving publishers suing AI companies or record labels filing copyright claims against model developers, the individual developer story lacks the institutional weight required to attract sustained legal attention. The packages are small. The developers are numerous. The harm, to each individual, is diffuse enough to be difficult to articulate as a concrete injury.

That diffuseness does not make it philosophically uninteresting. The training corpus of a major AI coding tool is, in aggregate, a portrait of how software development has worked — the conventions, the idioms, the debates encoded in commit messages and issue threads and README files. It is a collective artifact assembled from individual acts of contribution, each made under assumptions that did not include this outcome.

What it means for the developer-tool relationship

There is a stranger implication lurking beneath the surface of this case. If a developer’s past contributions are embedded in the training data of the tools they use today, the relationship between the developer and the tool is not simply one of user and instrument. The tool, in some sense, has been shaped by the developer’s prior work. The suggestions it makes have been influenced, at some fractional and unquantifiable level, by decisions the developer made years ago in a different context.

That is philosophically interesting in a way that is difficult to operationalize. It does not give the developer any particular rights or remedies. It does not change the outputs in any traceable way. But it suggests that the boundary between a developer’s agency and the tool’s agency is less sharp than the interface implies. The tool is not a neutral instrument. It is, among other things, a compressed and transformed version of the community that produced it — including people who had no say in the compression or the transformation.

The developer in the opening story almost certainly does not have a viable legal claim. The package was public. The license is permissive. Courts have been slow to find liability in cases involving training data drawn from publicly available material, and the legal theories most likely to succeed — copyright infringement through memorization and reproduction, breach of license conditions — are difficult to apply to a scenario where the output is a recommendation rather than reproduced code.

But the absence of a legal remedy does not make the underlying question disappear. It simply relocates it. If the law does not resolve the question of what consent means in the context of AI training — what obligations, if any, developers of AI systems have to the people whose work they trained on — then the answer will have to come from somewhere else: from community norms, from platform policies, from the evolving social contract of a developer ecosystem that is still working out what it owes its members.

The developer who watched an AI recommend his own forgotten package is not, in all likelihood, going to file a lawsuit. He is going to think about what it means to put work into the world in an era when the world has changed what it does with the things you give it.

That is a smaller story than the litigation. It is also, for the people living through it, the more immediate one.

Direct Message News

Direct Message News is the byline under which DMNews publishes its editorial output. Our team produces content across psychology, politics, culture, digital, analysis, and news, applying the Direct Message methodology of moving beyond surface takes to deliver real clarity. Articles reflect our team's collective editorial process, sourcing, drafting, fact-checking, editing, and review, rather than a single writer's work. DMNews takes editorial responsibility for content under this byline. For more on how we work, see our editorial standards.
