Instagram: Logic bugs and the infinite life of scraped data
Zoran Gorgiev, Alessio Dalla Piazza
In the first half of January 2026, a dataset containing 17-17.5M public Instagram user records surfaced on a dark web hacking forum. Around the same time, many users reported receiving unsolicited password reset emails from Instagram. Together, these two events fueled fears of a breach.
Meta, Instagram’s owner, denied that there had been a system breach, contrary to what some cybersecurity vendors had claimed. It clarified that the password reset messages were due to “an issue that let an external party request password reset emails for some people,” but that the platform had already resolved it.
But how do we explain the appearance of a massive Instagram user dataset on an underground hacking forum? And was it related to the unsolicited password reset emails? Also, what kind of “issue” did Instagram encounter? You’ll learn the answers in the following sections.
A data scraping incident: What, when, and how did it happen?
The Instagram user dataset, posted by a threat actor known as “Solonik,” included:
- Usernames
- Full names
- User IDs
- Email addresses
- Phone numbers
- Countries
- Partial locations (presumably region/area, city, or truncated GPS coordinates)
Clearly, this is sensitive personal data that malicious actors can abuse. But it is worth noting that, at least, no passwords were part of the dataset.
Where did the data come from?
Neither Meta nor security researchers have provided a definitive technical account of the dataset’s origins. However, plausible theories do exist, even though they are not fully confirmed.
According to Have I Been Pwned (as well as other cybersecurity and news outlets), the data was likely scraped from Instagram’s public API. The API allowed external actors to collect public account information, and someone took advantage of that to perform automated, large-scale scraping.
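To make the mechanics concrete, here is a minimal, purely illustrative sketch of ID-range enumeration. Everything in it is hypothetical: the `lookup_public_profile` helper stands in for a public lookup endpoint and is simulated locally, not a real Instagram API.

```python
import time

# Hypothetical stand-in for a public profile-lookup endpoint.
# In a real scraping scenario this would be an HTTP call; here it is
# simulated locally so the sketch stays self-contained.
FAKE_DIRECTORY = {
    1001: {"username": "alice", "country": "IT"},
    1002: {"username": "bob", "country": "DE"},
    1003: {"username": "carol", "country": "FR"},
}

def lookup_public_profile(user_id):
    """Return whatever the 'public API' exposes for a numeric user ID."""
    return FAKE_DIRECTORY.get(user_id)

def enumerate_ids(start, end, delay=0.0):
    """Walk a numeric ID range and collect every profile that resolves.

    A real scraper would add proxies, retries, and pacing, but the
    principle is the same: iterate identifiers, keep what answers.
    """
    records = []
    for user_id in range(start, end + 1):
        profile = lookup_public_profile(user_id)
        if profile is not None:
            records.append({"id": user_id, **profile})
        time.sleep(delay)  # "low and slow" pacing to evade volume-based limits
    return records

scraped = enumerate_ids(1000, 1005)
print(len(scraped))  # 3 profiles harvested from a 6-ID range
```

The point of the sketch is how little sophistication bulk collection requires once an endpoint answers unauthenticated identifier lookups at scale.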
This theory indirectly supports Meta’s official statement that hackers didn’t breach the platform’s internal systems.
But what’s more interesting is the hypothesis that the data came from an older 2024 or even 2022 API scraping incident. Solonik’s claim that it originated from a 2024 API leak supports this hypothesis, although Meta hasn’t officially confirmed any API incidents from that year or 2022.
Now, for better or worse, it is natural for companies to deny or downplay security incidents in public. The reasons can be varied: reputation management, ongoing investigations, legal and regulatory concerns, market competition, etc., all depending on the gravity of the situation.
In this case, however, the hypothesis that the Instagram data circulating on the dark web is from older scraping incidents is highly credible.
Based on general scraping patterns, hacking forum posts, and similarities with prior scraped datasets, the 2026 Instagram user records do seem like information scraped or aggregated from publicly exposed API sources (possibly over years), which just keeps resurfacing.
But even if it didn’t come from older scraping incidents, the question remains: how did Solonik get their hands on it? A new instance of large-scale data scraping is no less serious than the resurfacing of previously scraped information.
Why is scraped data a security risk?
Indeed, scraping is not a breach of internal systems. Typically, scraped data is public or semi‑public and collected without cracking passwords or servers. Nonetheless, even when it’s older data, it poses a security risk for at least the following reasons:
- Identity theft: Malicious actors can use PII for fraud.
- Social engineering: Old data aids classic or SMS phishing attacks.
- Credential stuffing: Scraped usernames and email addresses can be combined with passwords from other breaches to compromise accounts.
- Access to outdated systems: Old data may provide access to legacy systems, which often have weaker security by default.
- Increasing attack surface: Attackers can link old data with newly enumerated information to expand exploitation possibilities.
- Regulatory violations: Companies can face legal consequences for failing to protect old data.
- Long-term misuse: Old data may still be valuable for future attacks as technology and adversaries’ TTPs evolve.
For instance, data scraped from Facebook in 2019 was still available for sale on the dark web in 2021, implying continuous relevance and demand. It included phone numbers, email addresses, and other PII, which, two years later, remained useful for SIM swapping, social engineering, and targeted spam campaigns.
Once it’s out there, scraped data begins to live a virtually infinite, often highly eventful life. And it can always come back to bite you or your clients if you don’t take suitable security measures in time.
Consider that this risk is not unique to social media; it can plague any platform where user identifiers are public. For instance, messaging apps like Telegram and cryptocurrency exchanges make good targets for similar enumeration attacks.
Malicious actors can scrape these platforms to map a high volume of accounts to phone numbers or wallet addresses. That allows them to identify high-value targets, such as individuals with large crypto holdings or specific political affiliations, enabling highly personalized phishing or whaling attacks.
Unsolicited password reset emails
Malicious actors could abuse Instagram’s regular password-reset mechanism to send legitimate security emails at scale due to:
- An existing business logic flaw in the Instagram password reset process
- A dataset like the one discussed in the previous section, which, as a reminder, conveniently provided access to millions of Instagram usernames and email addresses in one place
The malicious actors could trigger password-reset emails to those addresses in bulk, which explains why so many users received multiple emails in a short time span from Instagram’s legitimate security email address.
What makes this incident a business logic flaw?
Instagram implemented the intended behavior correctly, allowing legitimate users to submit password-reset requests. However, the rules governing who could trigger that behavior and how often were not restrictive enough. That made them prone to abuse via Instagram’s public or semi-public (partner) API endpoints.
As a consequence, external parties could use automation to repeatedly call the reset functionality and send emails, even if they were not the account owners.
That’s the essence of business logic abuse: manipulating a flaw in the intended behavior to cause unintended consequences. In this case, the unintended consequence may have stemmed from inadequate rate limits, although this is speculation.
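To illustrate the kind of control whose absence or misconfiguration could enable such abuse, here is a minimal sliding-window throttle sketch, keyed by the target account rather than the caller’s IP so that bulk abuse against many victims from one script is still capped per victim. The class name and thresholds are invented for the example and say nothing about Instagram’s actual implementation.

```python
import time
from collections import deque

class ResetThrottle:
    """Sliding-window limit on password-reset requests per target account.

    Keying on the *target* account means an attacker who rotates IPs
    still cannot flood a single victim's inbox. Thresholds are
    illustrative, not Instagram's.
    """

    def __init__(self, max_requests=3, window_seconds=3600):
        self.max_requests = max_requests
        self.window = window_seconds
        self._events = {}  # account -> deque of request timestamps

    def allow(self, account, now=None):
        now = time.monotonic() if now is None else now
        q = self._events.setdefault(account, deque())
        while q and now - q[0] > self.window:
            q.popleft()  # drop events that fell outside the window
        if len(q) >= self.max_requests:
            return False  # silently drop or queue; don't send the email
        q.append(now)
        return True

throttle = ResetThrottle(max_requests=3, window_seconds=3600)
results = [throttle.allow("victim@example.com", now=t) for t in (0, 1, 2, 3)]
print(results)  # [True, True, True, False] -- the fourth attempt is blocked
```

A production system would layer this with per-caller limits, CAPTCHAs, and anomaly detection; the sketch only shows the per-target dimension that bulk reset abuse exploits.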
A recent example of a similar scraping risk associated with rate limiting was the 2025 WhatsApp API vulnerability. Researchers exploited a lack of backend rate limiting to enumerate 3.5 billion active accounts. They interacted with the backend XMPP endpoints to bypass the user interface’s throttling and scraped detailed metadata for nearly half of the world’s mobile users.
The compound risk of logic bugs and scraped data
Even though there wasn’t a breach of passwords or internal systems, the Instagram incident shows how abuse of legitimate features via APIs (or other exposed endpoints) can cause chaos — false alarms, phishing risk, and social engineering opportunities.
In addition, it shows that malicious actors can creatively combine two issues that at first glance seem completely unrelated: data scraping and a logic bug. They could use the newly resurfaced Instagram dataset to map valid targets and fuel the automation needed to exploit the password-reset logic weakness at scale.
It doesn’t matter whether it’s the same threat actor behind the dataset. As long as the scraped data remains available, it can serve as an important ingredient in a carefully timed, well-planned cyberattack, especially one involving an unidentified logic flaw.
Again, this is not necessarily an account of the events that Meta would confirm, but it is a plausible theory of what happened, based on available public information.
Proactive defense against automated abuse and logic flaws
Malicious actors rarely opt to break into a system when they can manipulate its intended functions and exploit the persistence of scraped data instead. Indeed, why take the more difficult path when you can carry out an attack that feeds off the system’s own behavior?
That is why addressing these risks calls for a proactive defensive approach that doesn’t over-rely on mechanisms like firewalls and rate limits. It requires a combined approach of continuous visibility and behavioral testing.
How to mitigate illicit large-scale data scraping
Mitigating the automated extraction of user records starts with a thorough understanding of your API attack surface.
For instance, many organizations suffer from shadow APIs, that is, undocumented or forgotten endpoints that operate outside standard security processes. Such endpoints lack the security rigor of primary services, meaning they carry a higher risk than usual.
But even aside from shadow APIs, not knowing which endpoints live in your environment, what they do, the types of data that pass through them, or who owns these assets is an unacceptably high risk in itself. It turns your web ecosystem into an obscure labyrinth that makes vulnerability exploitation easier and threat identification harder.
Equixly lets you mitigate this problem with automated API discovery. It maps your API landscape by thoroughly interacting with each endpoint to assess both its defined functions and actual behavior.
This process identifies discrepancies between API definitions and test results, which:
- Illuminates areas where documentation is out of sync with implementation
- Throws light on what malicious actors could see in real-world enumeration
- Exposes security gaps, performance issues, or untracked changes

Once you gain visibility, the defense against illicit scraping continues with the identification of subtle differences between a legitimate user and a scraping bot. In this context, rate limiting is handy and can stop high-volume request floods. However, advanced scrapers often operate low and slow to avoid appearing suspicious or plain anomalous.
This is why Equixly concentrates on offensive security testing. Its agentic workflows enable it to emulate distributed, automated attacks at scale to discern whether your system can distinguish between high-volume but genuine user requests and programmatic data harvesting, even when the latter mimics human behavior.
Moreover, it can discover missing and ineffective rate limits, as well as test for more sophisticated gaps, such as per-session quotas and endpoint-specific traffic patterns. The platform can precisely identify where the backend falls short in enforcing behavioral boundaries, regardless of how legitimate the authenticated session appears to be.
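One behavioral signal that catches low-and-slow harvesting where pure rate limiting fails is the number of distinct resources a session touches over a long observation window: a patient scraper stays under any per-minute threshold but still accumulates an abnormal breadth of unique profiles. A minimal sketch of that idea, with an invented log format and threshold:

```python
from collections import defaultdict

def flag_slow_scrapers(access_log, distinct_threshold=500):
    """Flag sessions that access an unusually high number of *distinct*
    profiles over the whole observation period, regardless of request rate.

    The threshold is illustrative; a real system would baseline it per
    endpoint and user population.
    """
    seen = defaultdict(set)
    for session_id, profile_id in access_log:
        seen[session_id].add(profile_id)
    return {s for s, profiles in seen.items() if len(profiles) >= distinct_threshold}

# A normal user revisits a handful of profiles; a patient scraper
# touches hundreds of unique ones, each at a polite rate.
log = [("human", p) for p in [1, 2, 3, 2, 1] * 40]   # 200 requests, 3 distinct profiles
log += [("scraper", p) for p in range(600)]          # 600 requests, 600 distinct profiles
print(flag_slow_scrapers(log))  # {'scraper'}
```

Note that the "human" session makes more requests per profile than the scraper does, which is exactly why request-rate limits alone miss this pattern.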
And find business logic flaws
Logic bugs — such as the one that allowed the unsolicited password resets on Instagram — are notoriously difficult to detect because they occur within the application’s intended flow; there is no signature or malicious code to flag.
To help you resolve problems caused by business logic flaws, Equixly’s proactive testing model explores the “what ifs” of an API’s business rules. The platform can execute complex test scenarios that analyze how an API handles sequential requests, unexpected parameter combinations, and high-frequency calls to sensitive functions such as password recovery.
The results show how APIs perform under realistic adversarial conditions, pointing to exploitation possibilities and the inadvertent logic gaps that allow them.
Break the attack chain
Using a dataset from a prior scraping incident to fuel an exploit of a logic flaw means chaining issues for a streamlined cyberattack.
Equixly:
- Integrates into CI/CD pipelines
- Offers continuous analysis of your API, web, and LLM applications’ behavior
- Turns your attention to elusive logic weaknesses
- Provides remediation advice
In this way, it supports your efforts to prevent automated data scraping and the abuse of business logic, making certain that malicious actors do not turn legitimate features into tools for operational or business disruption.
Final thoughts
The life cycle of an attack often begins long before an exploit is launched. As the Instagram incident shows, the infinite life of scraped data provides a lasting foundation for future abuse, especially when coupled with unidentified logic flaws.
Protecting modern API-first architectures necessitates continuous visibility and adversarial behavioral testing. Together, they make it possible to find stubborn, subtle flaws early, break potential attack chains, and let business functions remain what they should be: assets instead of liabilities.
Reach out to see how Equixly’s agentic offensive security testing can protect your systems from data scraping and logic bugs.
FAQs
What are the risks of data scraping, and how can it impact my organization even if no passwords are involved?
Data scraping exposes personal information that malicious actors can continuously use for malicious activities such as phishing, fraud, or credential stuffing, even if passwords aren’t part of the dataset.
How can I identify and mitigate business logic flaws in my systems that malicious actors can exploit for large-scale attacks?
You can identify business logic flaws through rigorous security testing and reviewing system behavior to validate that security mechanisms, such as rate limits, are adequately implemented and not easily bypassed.
What role does proactive security testing play in preventing the exploitation of API vulnerabilities and scraped data?
Proactive security testing helps you find hidden vulnerabilities in APIs and systems by emulating real-world attacks, substantially increasing the chances of discovering scraping possibilities and logic flaws long before malicious actors attempt to exploit them.
Zoran Gorgiev
Technical Content Specialist
Zoran is a technical content specialist with SEO mastery and practical cybersecurity and web technologies knowledge. He has rich international experience in content and product marketing, helping both small companies and large corporations implement effective content strategies and attain their marketing objectives. He applies his philosophical background to his writing to create intellectually stimulating content. Zoran is an avid learner who believes in continuous learning and never-ending skill polishing.
Alessio Dalla Piazza
CTO & FOUNDER
Former Founder & CTO of CYS4, he embarked on active digital surveillance work in 2014, collaborating with global and local law enforcement to combat terrorism and organized crime. He designed and utilized advanced eavesdropping technologies, identifying Zero-days in products like Skype, VMware, Safari, Docker, and IBM WebSphere. In June 2016, he transitioned to a research role at an international firm, where he crafted tools for automated offensive security and vulnerability detection. He discovered multiple vulnerabilities that, if exploited, would grant complete control. His expertise served the banking, insurance, and industrial sectors through Red Team operations, Incident Management, and Advanced Training, enhancing client security.