The Atlantic Reveals Music Database for AI

TL;DR: The Atlantic has published a tool that reveals millions of songs used to train AI without artists' permission. The reaction was immediate, with independent musicians and major stars alike finding their work in the datasets. The disclosure reopens the debate over the legality of training AI on copyrighted works.

On June 18, 2026, The Atlantic launched an expansion of its AI Watchdog tool, this time focused on music. The database lets users check whether a song appears in any of the four datasets most commonly used to train generative AI models like Suno. These datasets contain between 100,000 and 12 million songs each, totaling millions of tracks by artists who never gave their consent. The tool, based on research by Alex Reisner for The Atlantic, joins the original version of AI Watchdog launched in September 2025, which documented books, academic articles, and YouTube videos used in AI training. According to The Verge, which covered the news on June 20, the expansion into music is a significant step toward transparency about the use of copyrighted data.

The Origin of the Data

According to Reisner's investigation, the datasets have circulated for years within the AI development community and have been downloaded thousands of times. Google and Stability AI have confirmed using them in research papers. The two largest, with 12 million and 9 million songs respectively, include global superstars like Taylor Swift and Bad Bunny as well as independent artists like DJ Sabrina the Teenage DJ. The other two datasets each exceed 100,000 tracks. Datasets of this kind trace back to academic and open-source projects, such as the MusicCaps dataset (roughly 5,500 audio-text pairs) and others compiled by scraping platforms like YouTube and SoundCloud. Unlike those academic examples, however, the datasets identified by The Atlantic contain complete commercial recordings, which raises the legal stakes. The use of this data by companies like Suno, which generates music from text prompts, has been controversial since 2024, when the RIAA filed lawsuits against Suno and Udio for mass copyright infringement.

Artists' Reactions

The response on social media was immediate. DJ Sabrina the Teenage DJ discovered 22 of their tracks in the datasets and stated: "It's funny how there were accusations that my music sounded like AI before these datasets started being used to generate garbage." Backxwash said: "I'm 100% sure I never gave my consent." Sophia hjkl found 138 of their songs, nearly everything they released between 2017 and 2024. The catalog also includes Lady Gaga, Radiohead, Aphex Twin, Wu-Tang Clan, and Bruce Springsteen, showing that no one is exempt. The scale of the appropriation recalls the books by authors like J.K. Rowling and Stephen King that were used to train language models, a practice The Atlantic also documented in 2025. At that time, the AI Watchdog tool allowed authors to search for their works in datasets like Books3, leading to lawsuits and increased regulatory scrutiny. Musicians now face a similar situation, with one difference: AI-generated music is already being commercialized, as shown by viral Suno-generated songs that mimic the styles of famous artists.

Legal Implications

This case adds to a long list of controversies over the use of copyrighted data to train AI. In the United States, several class-action lawsuits by artists against AI companies are ongoing. The publication of this database could strengthen plaintiffs' arguments by demonstrating the large-scale use of unlicensed works. In particular, the RIAA lawsuit against Suno and Udio, filed in June 2024, alleges that these companies copied sound recordings without authorization to train their models. The Atlantic's database provides concrete evidence that could be used in court. In the European Union, the Copyright Directive requires transparency in training data, but its enforcement remains limited. The EU AI Act, which entered into force in August 2024 and began applying its obligations for general-purpose AI models in August 2025, also imposes transparency requirements on generative AI models, although music is not explicitly covered. In contrast, Japan has adopted a more permissive approach, allowing the use of copyrighted works for AI training without a license, which has drawn criticism from the global music industry.

What Should Readers Know?

For independent musicians, this tool is a way to verify whether their work has been used without permission. For the general public, it is a reminder that generative AI is built on the unpaid labor of creators. The music industry faces a dilemma: adapt to the new technology or demand fair compensation. Meanwhile, companies like Suno continue to operate in a legal vacuum, one that revelations like this put pressure on lawmakers to fill. The case also highlights the need for a global regulatory framework, since datasets circulate internationally while laws vary from country to country. In the United Kingdom, for example, the government has proposed an exception for text and data mining for research purposes, but not for commercial use. In Australia, a parliamentary inquiry recommended in 2025 that AI companies obtain licenses to use copyrighted content. The Atlantic's tool, while useful, is only a first step toward the transparency that artists demand. As artist Sophia hjkl noted: "138 of my songs are there. It's like they stole my entire discography." The next move belongs to legislators and the courts.
