Equixly vs. Humans: Analyzing the Efficiency of AI vs. Human Penetration Testers

Empirical Assessment of AI-Automated Versus Human-Led Pentesting Approaches
Abstract. This article presents an empirical evaluation of the Equixly AI Agent in comparison with human-led penetration testing. It’s based on the analysis of 86,310 HTTP requests directed at a benchmark environment featuring 30 realistic API microservice challenges. Our results provide insights into the strategies and limitations of human-driven testing—both manual and semi-automated—across several key parameters, including payload identification and time-based attack classification. We conclude by highlighting the limitations of traditional penetration testing tools commonly used in the industry, particularly those that employ a high-volume, signature-based approach, common in template-driven scanners like Nuclei. These tools often fail to identify API vulnerabilities in modern environments effectively. The results make a strong, evidence-based case for integrating Equixly into the secure development life cycle for proactive testing. Furthermore, this data points to an evolution in the role of penetration testers. It encourages them to move beyond conventional techniques and adopt a hybrid approach that combines human expertise with AI-driven technologies for more comprehensive and accurate security assessments.
Introduction
Penetration testing (pentesting) is a cornerstone of cybersecurity, providing a means of identifying vulnerabilities within software systems. However, as systems grow increasingly complex and threat actors become more sophisticated, relying solely on manual testing is no longer sufficient. The demand for faster, scalable, and continuous testing has given rise to AI-driven and automated solutions.
In this study, we examine a dataset of 86,310 HTTP requests, gathered from a Capture the Flag (CTF)-style exercise designed for 15 human testers, organized into three teams of five. The testing environment comprised 30 microservice challenges, covering classic security issues such as SQL injection as well as OWASP Top 10 API vulnerabilities.
Over a two-hour testing period, the combined efforts of the three teams resulted in solving 14 out of the 30 challenges—primarily those of lower complexity.
In a parallel run, Equixly surfaced 230 unique security issues in one hour. These findings included all 30 benchmark challenges, which were the only items configured to return CTF flags. The remaining findings were not false positives but part of Equixly’s validation set, systematically enumerating recurring instances of the same flaws—a task often cut short in manual testing. For example, once a penetration tester finds an initial vulnerability, they may feel a sense of ‘mission accomplished’ and move on to hunt for different types of flaws, rather than systematically finding every recurring instance of the same issue.
This demonstrates broader coverage and a faster time-to-find than the three pentest teams achieved. To ensure a fair analysis, traffic generated by the Equixly AI Agent itself was excluded from the results.
2. Experiment Setup
2.1 Benchmark Environment
The test environment consisted of a microservice cluster implementing 30 security challenges, based on key vulnerabilities from the OWASP API Security Top 10 (2023) and earlier editions.
These challenges included a range of security issues, such as:
- Broken Object Level Authorization (BOLA): Testers were tasked with manipulating user permissions to gain unauthorized access to resources.
- Broken Function Level Authorization (BFLA): Challenges centered on improper authorization checks, allowing access to restricted functions based on user roles.
- Broken Authentication: Scenarios that involved weak authentication mechanisms, including poor session handling and the use of insecure tokens.
- Injection Attacks (e.g., SQL Injection and Command Injection): Testers needed to identify points where unvalidated inputs could be exploited to execute malicious commands or manipulate queries.
- Business Logic Flaws: Challenges that required testers to exploit poorly designed logic or workflows to manipulate the application’s intended behavior.
These vulnerabilities were selected to mirror real-world API security flaws. Critical operations required authentication, simulating real-world scenarios where testers needed to navigate login flows to authenticate and fully exploit the challenges.
The challenges reflected common issues faced by production-level systems, offering a robust environment designed to test the distinct capabilities of both human intuition and AI-driven analysis in penetration testing.
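To make the nature of these challenges concrete, the minimal sketch below shows the kind of probe a BOLA-style task requires: requesting another user's objects with a lower-privileged account's credentials. The base URL, endpoint path, object ID, and token are invented for illustration and are not the actual benchmark endpoints.

```python
# Illustrative BOLA probe: access a resource owned by another user using a
# lower-privileged token. Endpoint, ID, and token are invented for the example.
import requests

BASE_URL = "http://api.example.local"
USER_A_TOKEN = "eyJ...userA"   # token for the attacker-controlled account
OTHER_USER_ID = 42             # object ID belonging to a different user

response = requests.get(
    f"{BASE_URL}/v1/users/{OTHER_USER_ID}/orders",
    headers={"Authorization": f"Bearer {USER_A_TOKEN}"},
    timeout=5,
)

# A 200 response containing another user's data indicates Broken Object Level
# Authorization; a 403/404 suggests object-level checks are enforced here.
print(response.status_code, response.text[:200])
```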
2.2 Data Collection
Traffic was captured using PCAP files through a forward proxy approach, ensuring comprehensive visibility into both legitimate and potentially malicious requests. To process and analyze this traffic, we employed PyShark, a Python wrapper for TShark, which enabled parsing of each packet and the extraction of key data points. As the service ran on HTTP, the traffic was unencrypted, so no decryption steps were needed.
The following metrics were recorded for analysis:
- HTTP Method: Identifying request types such as GET, POST, PUT, DELETE, etc., to understand the distribution and effectiveness of different request types in exploiting vulnerabilities.
- Status Codes: Analyzing the 2xx, 3xx, 4xx, and 5xx status code distributions to gauge the success or failure of the attempted attacks and identify misconfigurations in the API response handling.
- Sniff Timestamps: Recording timestamps for each request to analyze request timing, which helped identify patterns such as brute-force attempts, rate-limiting bypass, or time-based attacks.
- User-Agents: Logging User-Agent strings to detect potential automation tools, proxies, or abnormal patterns indicative of malicious behavior.
- URI Paths and Query Parameters: Capturing the requested URI paths and query parameters to identify targeted endpoints and any potential attack vectors, including injection points and improper access to sensitive resources.
- Potentially Malicious Patterns: Parsing for substrings and patterns that are indicative of common attack techniques, such as SQL injection, Cross-Site Scripting (XSS), Path Traversal, Command Injection, XML External Entity (XXE) attacks, and Remote File Inclusion (RFI).
- Payload Anomalies: Detecting anomalies in request payloads, such as unusually large payloads or unexpected characters, which are commonly associated with buffer overflow or denial-of-service (DoS) attacks.
- Authentication Flaws: Monitoring authentication and session-related headers (e.g., Authorization, Cookie, Bearer Token) to detect attempts to bypass authentication mechanisms, reuse tokens, or exploit insecure session management practices.
By capturing and analyzing these data points, we gained deep insight into the attack strategies and effectiveness of the testers, enabling us to better understand the nuances of human-driven penetration testing compared to AI-assisted methods.
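As an illustration of this extraction step, the following minimal sketch (assuming a capture file named capture.pcap and the PyShark package) pulls the HTTP fields listed above from a forward-proxy capture; the actual pipeline used in this study may differ in detail.

```python
# Minimal PyShark sketch: extract the HTTP fields analyzed in this study
# from a forward-proxy PCAP. File name and field handling are illustrative.
import pyshark

records = []
capture = pyshark.FileCapture("capture.pcap", display_filter="http")

for pkt in capture:
    http = pkt.http
    records.append({
        # Request-side fields (present only on request packets)
        "method": getattr(http, "request_method", None),
        "uri": getattr(http, "request_uri", None),
        "user_agent": getattr(http, "user_agent", None),
        # Response-side fields (present only on response packets)
        "status": getattr(http, "response_code", None),
        # Capture timestamp, used later for inter-request timing analysis
        "timestamp": pkt.sniff_time,
    })

capture.close()
print(f"Parsed {len(records)} HTTP packets")
```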
2.3 Classification Methodology
- Signature-Based Identification: We identified and categorized traffic based on known penetration testing tool signatures ("PostmanRuntime", "sqlmap", "nuclei", "Burp Suite", etc.) and contrasted them with typical browser strings (Mozilla/5.0, Chrome/). This enabled us to distinguish between automated or tool-based testing attempts and legitimate browser traffic. Additionally, we tracked specific tool behavior, such as the unique header patterns or user-agent strings used by these tools, to pinpoint potential attacks.
- Time-Based Heuristics: We calculated the mean inter-request interval for each user-agent, providing insight into the rate of requests made by a particular source. When the average time between consecutive requests fell below one second, the traffic was classified as likely automated. This heuristic helped us identify high-velocity attack attempts, such as brute-force or fuzzing attacks, which typically generate rapid and consistent requests over a short period of time.
- Content-Based Analysis: Each URI path, body, and query parameter was screened for suspicious substrings or patterns commonly associated with malicious activities, such as:
  - SQL Injection Indicators: Keywords like sleep(, union select, or ' OR '1'='1.
  - Cross-Site Scripting (XSS): Tags like <script>, javascript:, or onerror=.
  - Path Traversal: Sequences such as ../../, /etc/passwd, or ..\..\.
  - Command Injection: Commands like ; ls -la, | cat /etc/passwd, or &&.
  - File Inclusion: Patterns like php://input, file://, or ../../etc/passwd.
This methodology enabled us to identify various types of injection attacks, resource manipulation attempts, and misconfigurations within the system that could lead to security breaches.
The detection methods, when combined, provided a comprehensive mechanism for identifying and categorizing both automated attack attempts and suspicious manual penetration testing activities.
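A simplified sketch of how these three signals can be combined is shown below. The tool signatures, one-second threshold, and attack patterns mirror the heuristics described above, while the function and variable names are illustrative rather than the exact implementation used in the study.

```python
# Illustrative request classifier combining the three heuristics described above:
# user-agent signatures, mean inter-request interval, and suspicious content patterns.
from collections import defaultdict
from statistics import mean

TOOL_SIGNATURES = ("postmanruntime", "sqlmap", "nuclei", "burp")
BROWSER_SIGNATURES = ("mozilla/5.0", "chrome/")
SUSPICIOUS_PATTERNS = (
    "sleep(", "union select", "' or '1'='1",   # SQL injection
    "<script", "javascript:", "onerror=",      # XSS
    "../../", "/etc/passwd", "..\\..\\",       # path traversal
    "; ls -la", "| cat /etc/passwd", "&&",     # command injection
    "php://input", "file://",                  # file inclusion
)

def classify_user_agent(user_agent):
    """Signature-based identification: tool vs. browser vs. unknown."""
    ua = (user_agent or "").lower()
    if any(sig in ua for sig in TOOL_SIGNATURES):
        return "tool"
    if any(sig in ua for sig in BROWSER_SIGNATURES):
        return "human"
    return "unclassified"

def is_automated_rate(timestamps, threshold_seconds=1.0):
    """Time-based heuristic: a mean gap below one second suggests automation."""
    if len(timestamps) < 2:
        return False
    ordered = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return mean(gaps) < threshold_seconds

def has_suspicious_content(uri, body=""):
    """Content-based analysis: flag known attack substrings in the URI or body."""
    haystack = f"{uri} {body}".lower()
    return any(pattern in haystack for pattern in SUSPICIOUS_PATTERNS)

def classify_traffic(records):
    """Aggregate per user-agent; records are dicts like those parsed from the PCAP."""
    by_agent = defaultdict(list)
    for rec in records:
        by_agent[rec.get("user_agent")].append(rec)

    results = {}
    for agent, recs in by_agent.items():
        label = classify_user_agent(agent)
        timestamps = [r["timestamp"] for r in recs if r.get("timestamp")]
        if label == "human" and is_automated_rate(timestamps):
            label = "tool"  # browser UA but machine-speed traffic
        suspicious = sum(
            has_suspicious_content(r.get("uri") or "", r.get("body") or "")
            for r in recs
        )
        results[agent] = {"label": label, "requests": len(recs), "suspicious": suspicious}
    return results
```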
3. Results and Observations
3.1 Overall Request Volume
- Total HTTP Requests: 86,310
- Capture Duration: ~2.25 hours (8,091.187 seconds)
- Requests per Second (RPS): ~10.67
This moderately high RPS suggests a mix of manual testing efforts and “spray-and-pray” tactics commonly employed by automated scanning tools.
Based on our time-based and signature analysis, fewer than 1% of the requests appear to be genuinely human-driven (i.e., manually crafted requests), while the majority display characteristics typical of automated or semi-automated tools. The automated traffic likely stems from well-known tools, such as "sqlmap," "Burp Suite," or custom scripts, which generate rapid, repetitive requests with minimal human interaction.
3.2 The 30 Microservice Challenges: Outcome
Even with more than two hours to complete the tasks, the 15 human pentesters successfully solved only 14 out of 30 challenges, which were categorized into three difficulty levels (easy, medium, and hard). The successes were overwhelmingly concentrated in the easiest category, typically involving simple, unauthenticated bugs, such as basic SQL injections, which standard automated tools (e.g., sqlmap) could easily find.
Conversely, the pentesters consistently failed on challenges requiring complex logic or multi-step exploitation, with all three teams focusing their successes on the same set of simple tasks. This outcome aligns with the high volume of observed erroneous requests, highlighting that the employed manual methods lacked the context for more difficult challenges.
Human testers were primarily focused on individual requests, which limited their capacity to identify multi-exploit scenarios or leverage the broader context of the application necessary for identifying deeper vulnerabilities.
In contrast, Equixly identified 230 security issues in just one hour, demonstrating a substantial gap in coverage and efficiency between the two approaches. This disparity highlights the limitations of traditional manual penetration testing, particularly in complex and dynamic environments.
It also emphasizes the need for human pentesters to have AI support when testing large perimeters, where AI-driven tools can provide faster and more comprehensive vulnerability identification, helping human testers focus on critical, high-impact findings.
AI is no longer merely a supplementary tool but a necessary component to enhance the effectiveness of modern penetration testing efforts.
3.3 HTTP Method Diversity
HTTP Methods Usage Statistics

| Method | Requests |
| --- | --- |
| CONNECT | 2 |
| DELETE | 11 |
| GET | 41,129 |
| HEAD | 10 |
| OPTIONS | 13 |
| POST | 45,077 |
| PUT | 58 |
| TRACE | 10 |
Interpretation
The distribution of HTTP methods reveals important insights into the testing behavior. While GET (~41k) and POST (~45k) requests dominate, which is typical for web applications, the presence of unusual methods like CONNECT (2), TRACE (10), and OPTIONS (13) suggests reconnaissance.
These less common methods are often used by automated scanning modules (e.g., in Burp Suite or Postman) to probe for server misconfigurations and discover “shadow” APIs—undocumented endpoints that are often overlooked in manual testing. For example:
- TRACE can be exploited in Cross-Site Tracing (XST) attacks to retrieve sensitive information from HTTP response headers.
- OPTIONS helps enumerate the available methods supported by a resource, a common technique for mapping an application’s functionality.
By tracking the frequency of these methods, we can detect HTTP enumeration attacks and intelligence-gathering activities. Therefore, it is essential for security assessments to actively test all HTTP methods to identify undocumented endpoints, as these can serve as significant attack vectors.
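A minimal check along these lines is sketched below, using the Python requests library against a hypothetical endpoint URL; in a real assessment this would be run across the entire API surface rather than a single resource.

```python
# Illustrative HTTP method enumeration against a single endpoint.
# The target URL is hypothetical and stands in for any API resource under test.
import requests

METHODS = ["GET", "POST", "PUT", "DELETE", "PATCH", "HEAD", "OPTIONS", "TRACE"]
TARGET = "http://api.example.local/v1/users/1"

for method in METHODS:
    try:
        response = requests.request(method, TARGET, timeout=5)
        # Anything other than 405 (or an Allow header listing extra methods)
        # hints at supported, possibly undocumented, functionality.
        allow = response.headers.get("Allow", "")
        print(f"{method:8} -> {response.status_code}" + (f"  Allow: {allow}" if allow else ""))
    except requests.RequestException as exc:
        print(f"{method:8} -> error: {exc}")
```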
3.4 Status Codes and Error Rates
HTTP Status Code Statistics

| Status Code | Requests |
| --- | --- |
| 200 | 5,170 |
| 201 | 1 |
| 204 | 18 |
| 301 | 338 |
| 302 | 12 |
| 303 | 30 |
| 307 | 14 |
| 400 | 6,763 |
| 401 | 39,927 |
| 404 | 10,822 |
| 500 | 23,143 |
| 501 | 20 |
| 502 | 17 |
- 4xx errors: 57,512
- 5xx errors: 23,180
Interpretation
The distribution of HTTP status codes provides key insights into the API’s performance and the effectiveness of the testing process. Here’s why these metrics matter:
- Authorization & Misconfigurations: The high number of 401 Unauthorized errors indicates that many requests failed due to a lack of proper credentials. In a microservice environment with nested roles, testers must systematically approach user privileges; failing to do so results in a large volume of “dead-end” requests.
- Input Validation Issues: A significant number of 400 Bad Request errors point to poor input generation, where the API is rejecting invalid or malformed data from automated tools.
- Exploit Attempts: The presence of 500 Internal Server Errors indicates that some requests triggered server-side issues, highlighting areas of the API that may be vulnerable to exploitation. While these errors don’t directly indicate a security breach, they reveal where unexpected inputs are causing failures.
The high volume of 400 and 500 errors underscores a key limitation of traditional automated tools: they often fuzz blindly, sending poorly formatted inputs without adapting to the API’s specific data type requirements (e.g., dates and numbers). This highlights the importance of AI-assisted testing, which can tailor inputs to the API’s context more effectively, thereby enhancing the quality and coverage of security assessments.
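To illustrate the difference, a context-aware generator can derive type-correct values from an API's parameter schema instead of firing arbitrary strings. The sketch below is a simplified illustration of that idea (the schema and field names are invented for the example) and is not a description of Equixly's internal engine.

```python
# Contrast between blind fuzzing and schema-aware input generation.
# The parameter schema below is invented for illustration only.
import random
import string
from datetime import date

PARAM_SCHEMA = {
    "user_id": {"type": "integer", "minimum": 1},
    "start_date": {"type": "string", "format": "date"},
    "amount": {"type": "number", "minimum": 0.01},
    "email": {"type": "string", "format": "email"},
}

def blind_fuzz_value():
    """Typical scanner behavior: random characters regardless of the expected type."""
    return "".join(random.choices(string.printable, k=32))

def schema_aware_value(spec):
    """Generate a value that satisfies the declared type and format, so the request
    gets past basic input validation and exercises deeper application logic."""
    if spec["type"] == "integer":
        return random.randint(spec.get("minimum", 0), 10_000)
    if spec["type"] == "number":
        return round(random.uniform(spec.get("minimum", 0.0), 10_000.0), 2)
    if spec.get("format") == "date":
        return date.today().isoformat()
    if spec.get("format") == "email":
        return "tester@example.com"
    return "test"

blind_payload = {name: blind_fuzz_value() for name in PARAM_SCHEMA}
aware_payload = {name: schema_aware_value(spec) for name, spec in PARAM_SCHEMA.items()}
print("blind:", blind_payload)   # likely rejected with 400 Bad Request
print("aware:", aware_payload)   # likely accepted, reaching the logic behind validation
```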
3.5 Unique Endpoints and Query Parameters
- Unique (Host, URI) pairs: 16,132
- Unique query parameters: 345
- Unique (param, value) pairs: 1,688
The testers and their scripts targeted a broad range of endpoints. While a wide coverage is often a desirable trait, the high number of 4xx and 5xx errors suggests that quantity did not necessarily result in quality. Many of the repeated parameters likely followed known exploit patterns or were simply outside the scope of the given testing environment.
3.6 User-Agent Analysis
User-Agent Request Statistics

| User-Agent | Requests |
| --- | --- |
| Unknown-UA | 54,495 |
| PostmanRuntime/7.43.0 | 21,144 |
| PostmanRuntime/7.29.2 | 10,191 |
| Mozilla/5.0 (X11; Linux x86_64)...Chrome/13... | 76 |
| Mozilla/5.0 (Windows NT 10.0; Win64; x64)...Chrome/... | 48 |
| Mozilla/5.0 (Windows NT 10.0; Win64; x64)...Chrome/... | 30 |
| Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Firefox/102.0 | 21 |
| Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/201001 F... | 16 |
| Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36...(Ins.) | 15 |
| axios/1.4.0 | 13 |
Over 99% of the traffic was generated by either Unknown-UA or Postman, indicating that the overwhelming majority of the testing was automated or semi-automated. Traditional browser-based User-Agent strings were relatively rare, appearing only in a handful of requests.
This pattern reinforces the idea that the testers likely proxied their traffic and then ran automated tools like “Burp Suite” or “sqlmap” to conduct further scanning, rather than engaging in fully manual, interactive testing.
4. “Human” vs. “Tool” Classification
Request Classification

| Classification | Requests |
| --- | --- |
| human | 220 |
| tool | 86,044 |
| unclassified | 40 |
Out of 86,310 total requests, only 220 (~0.25%) were classified as human-like, featuring standard browser User-Agent strings and natural interaction speeds. The remaining 86,044 requests (~99.7%) were generated by automated tools, displaying rapid inter-arrival times and known scanning signatures (Postman, Burp Suite, sqlmap), which is indicative of “spray-and-pray” behavior.
This high-volume, automated approach inflates the request count but often lacks the precision of human-driven exploration. Automated tools excel at scanning large perimeters quickly but struggle to adapt inputs to the API’s nuances, leading to a high volume of failed attempts and limited depth. In contrast, human testers are better equipped to evaluate logic flaws and complex contextual errors that these tools miss.
However, manual testing has its own limitations. The 220 human-like requests were spread over approximately two hours, translating to an average of just 0.03 requests per second. This slow rate demonstrates that while manual testing can be thoughtful, it is insufficient for achieving comprehensive coverage within a limited timeframe.
Therefore, while the volume of automated requests might suggest thoroughness, it does not guarantee effective vulnerability discovery. A hybrid approach that combines human intelligence with AI-assisted automation is vital for achieving both speed and depth in modern penetration testing.
5. Suspicious Payload Distribution in URL
Suspicious Pattern Matches

| Pattern | Matches |
| --- | --- |
| sleep( | 251 |
| etc/passwd | 128 |
| ../../ | 118 |
| 1=1 | 40 |
| http:// | 28 |
| https:// | 25 |
| <script | 9 |
| union select | 7 |
| ' OR '1'='1 | 5 |
- Requests with suspicious payloads: 657
- Most Frequent Attack: SQL Injection
The presence of suspicious payloads (657) suggests a focused attempt to identify vulnerabilities. The most frequent attack patterns observed include:
- Blind SQL injection attempts (sleep())
- Path traversal (etc/passwd and ../../)
- A few minimal XSS checks (<script>)
These patterns, along with the presence of http:// and https:// suggesting potential SSRF or redirect exploits, strongly indicate the use of standardized scanning modules from tools like Nuclei. The fact that multiple testers relied on these common, automated attack vectors likely limited the diversity of techniques used and the overall depth of the security assessment.
6. Conclusions and Future Directions
Our analysis revealed the following key insights:
- Widespread reliance on automated or semi-automated scanning: Approximately 86,000 requests were generated by automated tools, despite the testing being classified as “human pentesting.” This tendency indicates that the majority of the testing was automated, rather than genuine manual testing.
- Massive error rates: A high volume of 4xx (approximately 57,500) and 5xx (about 23,200) errors suggests a lack of focus or proper context in the testing approach. The “spray-and-pray” methodology, which relies on flooding the system with requests, often yields ineffective testing and results in missed vulnerabilities.
- Negligible purely manual exploration: Only 220 requests were made through purely manual exploration, which constitutes only a fraction of the total requests. This means that human involvement in the process was minimal, further reinforcing the dominance of automated or semi-automated tools.
The results suggest that AI-powered solutions, such as Equixly, can outperform manual testing in terms of speed and efficiency on large perimeters.
The future of penetration testing likely involves hybrid models, where human testers collaborate with AI tools, focusing on complex logic flaws and advanced exploit chains that remain challenging for automated scanners to detect.
These findings have several key implications for modern DevSecOps:
- Continuous, Automated Testing: Traditional penetration testing, often conducted on an annual or quarterly basis, is becoming outdated. With tools like Equixly, continuous scanning—on a daily or weekly basis—can help identify and triage vulnerabilities in real time, leading to more proactive security measures.
- Pentesters + AI: While human expertise is essential for identifying and exploiting complex security vulnerabilities (such as creative exploit chains and deep logic issues), the era of purely manual scanning at scale is quickly becoming obsolete. AI can assist in handling repetitive, low-level testing, freeing human testers to focus on more advanced tasks.
- Reduced Manual Overhead: Human testers often waste valuable time on repetitive tasks, such as dealing with credential-missing errors or performing brute-force attempts. By offloading these tasks to AI-driven tools, testers can prioritize more sophisticated techniques, improving both the efficiency and effectiveness of the testing process.
Given the limitations of traditional penetration testing services, particularly for large API environments, why rely on spraying generic tools across 100+ endpoints when you can leverage Equixly for more efficient, scalable, and comprehensive security testing?
Get in touch for a technical breakdown of the methodology behind Equixly’s results.

Alessio Dalla Piazza
CTO & FOUNDER
Former Founder & CTO of CYS4, he embarked on active digital surveillance work in 2014, collaborating with global and local law enforcement to combat terrorism and organized crime. He designed and utilized advanced eavesdropping technologies, identifying Zero-days in products like Skype, VMware, Safari, Docker, and IBM WebSphere. In June 2016, he transitioned to a research role at an international firm, where he crafted tools for automated offensive security and vulnerability detection. He discovered multiple vulnerabilities that, if exploited, would grant complete control. His expertise served the banking, insurance, and industrial sectors through Red Team operations, Incident Management, and Advanced Training, enhancing client security.