When APIs work too well: Lessons from Spotify’s large-scale scraping
Zoran Gorgiev, Alessio Dalla Piazza
Organizations increasingly strive to develop scalable and flexible systems. Yet, in recent years, we’ve seen abuse of systems with these exact properties.
This is a discussion of a recurring design trade-off in big platforms like Spotify: legitimate, highly desirable system functionality accessible via APIs being prone to massive abuse, such as industrial-scale data extraction.
We focus on the recent Spotify large-scale data scraping incident but place it in a broader category, arguing that it is not an isolated case. We also contend that security testing, when it covers legitimate API function abuse, can help you spot weaknesses in anti-bot and anti-scraping protections before external actors find them.
Our argument is not that API security testing alone stops scraping, nor that every instance of scraping is a security failure. Instead, we argue:
- High-performing APIs enable illicit, silent automated data scraping (bots, AI agents) from highly scalable, flexible systems.
- Specialized API security testing allows you to substantially increase the probability of detecting this type of API and system abuse earlier than organizations typically do now.
API abuse without exploits
In most cases, API security testing concentrates on violations of explicit technical boundaries:
- Broken authentication or authorization
- Excessive data exposure per request
- Input validation failures
- Missing or misconfigured rate limits
And rightly so: these checks are essential. However, they do not cover a growing class of abuse in which no individual API request is problematic. Each call is authenticated, authorized, well-formed, and returns exactly the data the caller is authorized to access. The problem emerges only in aggregate, when the underlying system ends up serving an overwhelming volume of such requests.
This problem arises when three conditions combine:
- Scale — the same workflows are repeated millions of times.
- Distribution — requests are spread across accounts, IPs, devices, or regions.
- Aggregation — individually acceptable responses combine into a dataset that was never intended to be reconstructable.
When someone takes advantage of all three, no vulnerability exploitation in the traditional sense occurs. Instead, it is an exhaustive use of a system exposed via an API, or, more precisely, a misuse of the system's intended functionality for an unintended use case.
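The three conditions can be made concrete with a toy sketch. Every name below (`fetch_track_metadata`, the ID range) is illustrative, not any real API: each simulated call is individually valid, yet the loop reconstructs a complete dataset.

```python
# Hypothetical sketch of aggregation abuse. No single call violates a rule;
# the abuse exists only in the sum of the calls.

def fetch_track_metadata(track_id):
    # Stand-in for one authenticated, well-formed API request that returns
    # only data the caller is authorized to see.
    return {"id": track_id, "title": f"Track {track_id}"}

def reconstruct_catalog(track_ids):
    # Repeat the legitimate workflow across the whole identifier space.
    # Distribute this loop over accounts and IPs, and no per-request
    # control ever fires.
    return {tid: fetch_track_metadata(tid) for tid in track_ids}

catalog = reconstruct_catalog(range(1_000))
print(len(catalog))  # full coverage, one "legitimate" call at a time
```

The point of the sketch is that nothing inside `fetch_track_metadata` needs to change for the outcome to flip from acceptable to abusive; only the count and spread of calls change.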
The Spotify data scraping incident
On December 20, 2025, Anna’s Archive published the blog post “Backing up Spotify,” describing a massive scraping and archiving of Spotify content.
Besides descriptions of reconstructed API responses and database schemas, the post contained the following information:
- The Anna’s Archive group scraped metadata for 256 million tracks and 186 million unique ISRCs.
- These numbers represent metadata coverage for an estimated 99.9% of the tracks available on Spotify at the time of collection.
- The group archived approximately 86 million music files, accounting for around 99.6% of total listens.
- The overall archive size was roughly 300 TB, distributed via torrents and grouped by popularity.
- The data cutoff was July 2025; content released after this date might be only partially present.
An official, technically rich account of the operation is missing. Still, we can make inferences based on secondary sources and on how large streaming platforms generally operate. Accordingly, the scrapers likely
- Reverse-engineered Spotify’s internal APIs and used private endpoints, including those used by Spotify’s Web Player and Desktop apps, to harvest extensive metadata not exposed through the official API.
- Carried out massive account orchestration via automated tools that generated unique accounts with customized device fingerprints (browser User-Agent strings, time zones, Canvas fingerprints, and similar) to imitate legitimate users and evade detection.
- Relied on residential proxy networks to hide their actual IP addresses, rotating through millions of residential IPs and bypassing rate limits.
- Used headless browsers and actors, such as those from Apify, to simulate real user interactions with the platform’s dynamic JavaScript, circumventing JS-based protections.
- Bypassed Spotify’s DRM protection (Widevine L3) using techniques such as differential fault analysis (DFA) and key extraction. Once extracted, the audio was decrypted and saved as OGG Vorbis files, without re-encoding.
- Optimized the scraping infrastructure to run efficiently with low-cost cloud virtual machines, enabling thousands of emulated instances to scrape volumes of data with minimal resource consumption.
To the extent that the perpetrators captured full-track audio, a technically plausible explanation is that capture happened within a trusted client execution context, after decryption. In that case, the incident would constitute an abuse of client trust rather than a cryptographic failure or, for that matter, any other type of classic vulnerability exploitation.
Illicit data scraping isn’t a Spotify-only story
The Spotify data scraping incident is unique in many ways, but it is far from an outlier.
Below are five selected publicly known incidents that are similar to it in important respects. They all show failures of scale governance: systems that behave correctly at the human scale but become data-exfiltration surfaces when automated.
1. WhatsApp’s contact discovery
Cybersecurity researchers showed they could enumerate ~3.5 billion WhatsApp accounts by hammering WhatsApp’s contact-discovery API. They took advantage of the fact that the service must answer the question “Is this number registered?” fast enough to be useful, compelling it to respond on a massive scale.
What makes this example similar to the Spotify incident is that the researchers used automated scaling to extract a massive dataset by abusing a legitimate system feature accessible via an API.
2. Dell partner portal scrape
In 2024, attackers claimed that they scraped ~49 million customer records by abusing a Dell partner portal API and brute-forcing at high request rates.
This incident resembles the Spotify case in that actors could access a legitimate interface programmatically at scale, uninterrupted, turning individually valid, authorized lookups (by service tag) into a massive data-extraction pipeline.
3. LinkedIn public profile harvesting
In 2021, an attacker exploited LinkedIn’s official, overly permissive APIs to scrape the public data of approximately 700 million users.
Similar to Spotify’s large-scale scrape, this incident did not involve a technical hack. It was simply an aggregation of data, with the scraper using automated queries to extract metadata and personal histories from high-performing endpoints.
4. Facebook (Cambridge Analytica) permission abuse
This incident involved the wholesale harvesting of 87 million profiles via the Facebook Graph API.
Often framed as a leak, it was, technically, an example of programmatic scraping similar to Spotify’s case. A third-party app used legitimate API permissions to scrape the information from its own users and their entire friend networks. It abused a flexible system that was overly permissive about how actors could access secondary data.
Meta, the Facebook parent company, has since tightened access and monitoring controls for its APIs. It also expanded its Bug Bounty and Data Bounty programs to include research reports on
- Bugs that allow scraping (especially logic bypass issues), regardless of whether the targeted data is public
- Online databases scraped from the platform, containing PII or sensitive data
Source: Meta, “Expanding our Bug Bounty Program to Address Scraping,” December 2021
It’s worth noting that since 2021, major platforms have treated API-enabled scraping as a serious security problem: even authenticated actors can scrape large volumes of data through legitimate access paths, so the risk is not limited to anonymous attackers. If an authenticated user or app can automate bulk collection at scale, an exposed API effectively becomes a data-exfiltration channel.
5. Salesforce API exfiltration
In late 2024 and throughout 2025, threat actors used voice phishing to socially engineer Salesforce employees into authorizing malicious connected applications. Then, they used the resulting OAuth tokens to programmatically extract CRM data via legitimate Salesforce APIs.
This incident resembles the Spotify incident in that, after the attackers gained access via legitimate app authorization, they used high-performance APIs specifically designed for mass data movement. The APIs allowed them to silently siphon off millions of CRM records without raising alerts.
Large-scale extraction as an economic failure mode
Many commercial anti-bot and anti-scraping systems — CDN & WAF products, API gateways, middleboxes, etc. — are optimized to detect intensity rather than coverage.
They are good at spotting:
- Sudden traffic spikes
- High request rates per IP
- Obvious headless browser signatures
- Broken or malformed clients
They are less effective against:
- Slow, distributed traversal
- Many low-privilege accounts
- Long-running background extraction
- Behavior that remains within documented workflows
In all these cases, from the system’s perspective, nothing looks broken. From the attacker’s perspective, the system is drained gradually and patiently.
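The contrast between intensity-based and coverage-based detection can be sketched in a toy simulation. All names and thresholds below are illustrative, not any vendor's actual logic: a per-window rate limiter never trips on a low-and-slow client, while a simple coverage tracker eventually does.

```python
from collections import defaultdict

class IntensityDetector:
    """Flags a client that exceeds a per-window request rate."""
    def __init__(self, max_per_window=100):
        self.max_per_window = max_per_window
        self.counts = defaultdict(int)

    def observe(self, client):
        self.counts[client] += 1
        return self.counts[client] > self.max_per_window  # True -> flagged

class CoverageDetector:
    """Flags a client whose cumulative coverage of the identifier space
    grows too large, no matter how slowly the requests arrive."""
    def __init__(self, id_space_size, max_fraction=0.01):
        self.id_space_size = id_space_size
        self.max_fraction = max_fraction
        self.seen = defaultdict(set)

    def observe(self, client, item_id):
        self.seen[client].add(item_id)
        return len(self.seen[client]) / self.id_space_size > self.max_fraction

# A low-and-slow scraper: 50 distinct IDs per window, across 10 windows.
intensity = IntensityDetector(max_per_window=100)
coverage = CoverageDetector(id_space_size=10_000, max_fraction=0.01)

flagged_by_rate = flagged_by_coverage = False
for window in range(10):
    intensity.counts.clear()  # a new rate-limit window begins
    for i in range(50):
        item_id = window * 50 + i
        flagged_by_rate |= intensity.observe("scraper")
        flagged_by_coverage |= coverage.observe("scraper", item_id)

print(flagged_by_rate, flagged_by_coverage)  # False True
```

Because the scraper stays below the rate threshold in every window, intensity-based detection sees nothing; only the cumulative-coverage signal reveals the traversal.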
However, high-value targets, prominent large platforms, and organizations in data-sensitive industries typically have protections that treat coverage as a first-class detection signal. Spotify certainly falls within this category. So why didn’t those protections prevent the massive data scraping?
Coverage-aware protections detect patterns, assign risk scores, and trigger responses over time, once thresholds are exceeded. They do not immediately block suspicious activity or shut down everything that even remotely resembles a systematic operation.
Besides, platforms like Spotify must tolerate wide coverage by design, supporting real users, search engine crawlers, embedded players, third-party integrations, and whatnot. That means the defense response is constrained.
Thus, we’ve arrived at the exact failure mode of coverage-based systems: detection occurs after actors have already accessed meaningful data. By the time confidence is high enough to act, the dataset already exists elsewhere. And on top of likely reputational or copyright damage, you’ve already suffered a significant economic blow.
How do you fix this?
What can API security testing achieve?
To base your anti-scraping and anti-bot approach on realistic foundations, you must be precise about what API testing can and cannot do. In that spirit, this section doesn’t overpromise; it highlights the factual merits of API security testing, an admittedly underused protective mechanism.
Abuse-aware API testing
The benefits of API security testing change noticeably when it can answer questions about:
- Aggregation risk — do multiple endpoints, when combined, allow the reconstruction of a complete dataset?
- Traversal feasibility — can someone enumerate or infer identifiers systematically?
- Economic extractability — how much value can a single account or a cohort of accounts extract over time?
- Control effectiveness — do rate limits reduce speed, or do they reduce total achievable coverage?
- Trust assumptions — which APIs implicitly trust clients to behave like users, and what happens if they do not?
At Equixly, we treat these questions as part of API testing, not because testing replaces design reviews or runtime monitoring, but because they are testable properties of systems.
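One of these properties, traversal feasibility, lends itself to a simple pre-production probe. The sketch below is hypothetical (`lookup` stands in for the endpoint under test, and the ID distributions are invented): it measures how often sequential identifiers resolve to real objects, which is a proxy for how cheaply a scraper could walk the ID space.

```python
# Illustrative "traversal feasibility" check: a high hit rate on sequential
# IDs means the identifier space is enumerable without friction.

def lookup(item_id, valid_ids):
    # Stand-in for a test-harness call to the endpoint: did the ID resolve
    # (e.g., HTTP 200) or not (e.g., HTTP 404)?
    return item_id in valid_ids

def enumeration_hit_rate(start, count, valid_ids):
    """Fraction of sequential IDs in [start, start + count) that resolve."""
    hits = sum(lookup(i, valid_ids) for i in range(start, start + count))
    return hits / count

# Dense sequential IDs (risky) vs sparse randomized IDs (costlier to walk).
dense = set(range(1_000))
sparse = {hash(str(i)) % 10**12 for i in range(1_000)}

print(enumeration_hit_rate(0, 1_000, dense))   # 1.0: trivially enumerable
print(enumeration_hit_rate(0, 1_000, sparse))  # near zero: enumeration is expensive
```

A finding like "hit rate 1.0 on sequential IDs" does not prove abuse is happening; it proves that nothing but economics stands between a motivated actor and full coverage.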
Could Spotify’s massive data scraping have been detected earlier?
Without speculating about Spotify’s internal processes, we can observe that platforms exposed to scraping and bot abuse often show detectable signs before the abuse becomes public:
- Linear or near-linear growth in catalog coverage per account
- No diminishing returns from repeated access
- Identifier spaces that are enumerable without friction
- Endpoints whose outputs are individually minimal but collectively exhaustive
- Controls that cap requests per second, but not requests per lifetime
These are all properties that a platform like Equixly can analyze in pre-production environments through abuse-oriented API testing, and in production through telemetry.
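The last sign on the list, controls that cap requests per second but not per lifetime, suggests a complementary control: a cap on the total number of distinct objects an account may ever retrieve. The sketch below is hypothetical, with invented names and thresholds, not a production design.

```python
# Hypothetical lifetime-coverage quota: re-reads are free (real users repeat
# themselves), but the set of *distinct* items per account is bounded.

class LifetimeQuota:
    def __init__(self, max_distinct_items=5_000):
        self.max_distinct_items = max_distinct_items
        self.seen_per_account = {}

    def allow(self, account, item_id):
        seen = self.seen_per_account.setdefault(account, set())
        if item_id in seen:
            return True   # repeated access to known items stays allowed
        if len(seen) >= self.max_distinct_items:
            return False  # coverage cap hit: deny and flag for review
        seen.add(item_id)
        return True

# An account that only ever requests new items hits the cap quickly.
quota = LifetimeQuota(max_distinct_items=100)
allowed = [quota.allow("acct-1", i) for i in range(150)]
print(allowed.count(False))  # 50: every request past the 100th distinct item
```

A real deployment would need expiry, appeal paths, and cohort-level accounting, but even this crude version bounds total extraction per account, which a pure requests-per-second limit never does.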
The realistic role of testing
What testing can objectively do is reveal uncomfortable truths early. Here are a few examples:
- “If someone wanted to scrape volumes of data or launch automated bot attacks, they could most likely do it.”
- “Your current security measures do slow attackers down, but do not bound total extraction.”
- “You rely heavily on clients behaving honestly.”
Insights like these are often sufficient to trigger design changes, policy decisions, or architectural adjustments, and, crucially, to trigger them before someone quietly drains your system at scale.
Final thoughts
The Spotify scraping incident is not about Spotify, nor even about scraping per se. It is an example of a more general problem that affects any platform with valuable data and well-designed APIs. One that the industry is only beginning to treat as a first-class security concern.
When APIs work too well, when they are predictable, enumerable, and permissive at scale, they can enable abuse without ever being exploited. And finding this type of weakness requires system-level reasoning, where variables such as aggregation, economics, and time are critical.
API security testing can meaningfully contribute to this analysis, but only when it is explicitly designed to recognize where correct behavior becomes unsafe due to volume and intent.
Discuss abuse-aware API testing with us.
FAQs
Is data scraping always a security failure?
No. Not all scraping is malicious. A problem arises when legitimate API functionality enables silent, systematic extraction at large scale, allowing someone to reconstruct datasets that were never intended to be aggregated. Large-scale data scraping is therefore a design and governance problem, not necessarily a traditional vulnerability.
Why don’t rate limits and bot defenses stop this kind of abuse?
Many defenses are optimized to detect intensity (spikes, high request rates, malformed clients, and similar), not coverage over time. Distributed, low-and-slow abuse that stays within documented workflows can evade detection until massive data extraction has already happened.
How can API security testing help if no exploits are involved?
Abuse-aware API testing evaluates system-level properties — such as aggregation risk, identifier enumerability, and total extractable value — rather than the correctness of individual requests. That allows you to pinpoint where correct behavior becomes unsafe in aggregate, before attackers abuse it in the wild.
Zoran Gorgiev
Technical Content Specialist
Zoran is a technical content specialist with SEO mastery and practical cybersecurity and web technologies knowledge. He has rich international experience in content and product marketing, helping both small companies and large corporations implement effective content strategies and attain their marketing objectives. He applies his philosophical background to his writing to create intellectually stimulating content. Zoran is an avid learner who believes in continuous learning and never-ending skill polishing.
Alessio Dalla Piazza
CTO & FOUNDER
Former Founder & CTO of CYS4, he embarked on active digital surveillance work in 2014, collaborating with global and local law enforcement to combat terrorism and organized crime. He designed and utilized advanced eavesdropping technologies, identifying Zero-days in products like Skype, VMware, Safari, Docker, and IBM WebSphere. In June 2016, he transitioned to a research role at an international firm, where he crafted tools for automated offensive security and vulnerability detection. He discovered multiple vulnerabilities that, if exploited, would grant complete control. His expertise served the banking, insurance, and industrial sectors through Red Team operations, Incident Management, and Advanced Training, enhancing client security.