Atlantic’s Alex Reisner makes 4 music AI datasets searchable, including 12M and 9M tracks
A public index of the songs used for AI training turns “opaque data sourcing” into something boards and regulators can audit.

Atlantic reporter Alex Reisner uncovered four datasets of music being used to train AI models and made them fully searchable for the public. Two of the sets are enormous, with 12 million and 9 million tracks. The other two are smaller, but still huge in absolute terms, at over 100,000 songs each. The consequence for decision-makers is that training-data sourcing is becoming measurable, not merely claimed.
This matters because the datasets are no longer just a vague “trust us” in research circles. According to Reisner, the sets have been downloaded thousands of times. And while it is impossible to know exactly who has used them, Google and Stability have both confirmed they have used the datasets in research papers. In other words, major AI players are not only in the conversation. They are in the dataset lineage.
So what exactly did The Atlantic do that changes the game? Reisner took four music datasets and built a searchable public view of what is in them. That is not just a journalist flex. It shifts the question from “Did your model train on copyrighted material?” to “Which specific catalog entries are included, and how can they be located?” When a dataset is searchable, due diligence becomes faster. Boards can ask better questions. Legal teams can compare what was claimed against what is actually indexed.
The dataset sizes also tell you something about the stakes for product and compliance. When two datasets hit 12 million and 9 million tracks, you are not talking about a pilot training set. You are talking about massive coverage that can influence what an AI learns about melody, style, structure, and even performance patterns. Even without knowing who has downloaded and used the files, the sheer scale creates risk that spreads across multiple downstream systems, from research prototypes to commercial offerings. Larger training sets can be more valuable for model performance, but they also make provenance problems harder to ignore.
Reisner’s reporting also highlights a key nuance: not all sources are the same kind of “free” in the public imagination. Some of the sources, like the Free Music Archive dataset, are free to stream for personal use. That detail matters because training is not the same thing as streaming, so terms that allow listening do not by themselves answer whether a work may be used to train a model. It also shapes how stakeholders interpret the dataset’s characteristics, and it helps explain why regulators and courts often focus on the specific rights and terms attached to particular works. The public indexing makes those distinctions easier to investigate because the dataset contents can be examined rather than just referenced.
Another second-order implication: dataset transparency can pressure the AI ecosystem into stricter attribution and documentation norms. Google and Stability confirmed they have used these datasets in research papers. That is an important anchor fact, because research papers are typically where methodology and training corpora get described, at least at a high level. When the underlying corpora become searchable, the gap between “we trained on music datasets” and “here is what those datasets include” becomes harder to sustain. It also raises the bar for how training-data disclosures are written, reviewed, and audited.
Regulators are already grappling with questions that sound abstract until you can point to an actual dataset. Music training data sits at the intersection of copyright, platform policy, and model governance. A public searchable database turns that intersection into something that can be audited by more than the people who already had access. Even if it is impossible to know exactly who used the datasets, the fact that downloads number in the thousands and that major labs acknowledge usage in papers suggests these are not fringe curiosities. They are tooling.
For executives and boards, the strategic stake is simple: training-data choices are becoming operational risk. If datasets can be indexed and checked publicly, then “we did what the industry does” is a weaker shield. The safest posture becomes documentation you can stand behind, governance that can answer specific dataset questions quickly, and contracts or policies that reflect the real sources used. Today it is four music datasets made searchable. Tomorrow it can be the same scrutiny applied to other training corpora, at a scale that turns compliance from a department issue into a board issue.