Feedback from internal security team assessment — Is this a fair evaluation of pentagi's current capabilities? #217
Hi @amitkiit1994, thank you for the thoughtful and fair review. We truly appreciate the time you spent evaluating PentAGI and the fact that you structured your observations around both its strengths and its current limitations.

Many AI systems in security today face the same challenges you described: depth of analysis, contextual reasoning, exploit validation, and business logic understanding still depend heavily on the underlying LLM, prompt design, agent configuration, and target complexity. Over the past year we have been steadily developing PentAGI, shipping meaningful improvements, and working to make the product more mature and practical. To be completely transparent, we also separately evaluate which capabilities should appear in the public version and which make more sense to keep in the enterprise segment.

Your assessment is broadly fair for the public version of PentAGI at its current maturity level, especially for complex real-world applications and a minimal initial prompt rather than narrowly scoped benchmarks. It is also important to keep in mind that, until scan types and target types are explicitly defined, the result depends significantly on task formulation, the amount of input data, and the selected operating mode. In other words, for the public version of PentAGI, the quality of the result is directly related to how much expertise and context the user brings into the process. The public version exists so users can see how multi-agent workflows behave in a real pentesting process. Today it would be unrealistic to expect a single public AI-based tool, in one uniform implementation and without strong expert guidance, to fully cover areas such as bug bounty work, red teaming, OWASP-oriented assessment, and other complex scenarios at the level of an experienced specialist.
One important point we would like to emphasize separately: PentAGI is not a one-off prototype or an experiment built around a single model. It is a product developed by a dedicated team with a long-term roadmap behind it. That roadmap is multi-directional by design. One key direction is increasing autonomy and achieving more stable results across different models, rather than relying solely on the strengths of one particular LLM at a given point in time. Another is creating more specialized configurations for different types of security assessment, where workflow, tooling, and evaluation criteria may differ significantly. Even now, PentAGI demonstrates a high level of full attack-chain coverage even on relatively weak local models that can literally be run on a few high-end GPUs, and for us this is an important signal that the architectural direction is the right one.

Now let me respond to your questions point by point.

1. Is pentagi intended to be a standalone pentest solution, or more of an augmentation layer for human pentesters?

2. Are there plans to improve OWASP Top 10 coverage depth, exploit validation chains, or contextual intelligence?

Speaking practically, the quality of the result here depends heavily on how the task is formulated. For example, if you describe in detail an approach for testing a specific OWASP Top 10 category, such as SQL Injection, and pass that into PentAGI, the result will be noticeably better than with a general prompt like "test this site against the OWASP Top 10." If we briefly outline some of the priorities, they include:
So the takeaway here is simple: for the public version of PentAGI, coverage quality depends directly on how clearly and specifically the task is defined. If the testing objective is explicitly scoped to OWASP Top 10 and the agent receives related instructions, coverage becomes significantly more stable and targeted.

3. Has anyone tuned system prompts or agent configuration to reduce false positives or improve business logic testing?

For in-depth analysis of system behavior, we also recommend using the built-in Langfuse stack to trace task execution and identify areas for improvement. Detailed prompting and proper configuration really do help reduce noise. At the same time, business logic testing remains one of the hardest areas for all such systems. In our view, what matters most here is the right combination of agents, strong context, and consistent traversal of user flows inside the application, with analysis of possible deviations, logic violations, and compromise points.

4. What's the recommended workflow — pentagi for initial automated sweep, then manual follow-up on findings?

If we are talking about red team operations, we would be more likely to recommend using PentAGI in Assistant mode for targeted analysis of complex perimeter services. Fully autonomous mode is not optimal in such scenarios, because the system may use more noticeable tools and techniques. If the task is application testing, then one of the most practical and effective workflows today is to use PentAGI as an initial autonomous layer for automated exploration and hypothesis generation, followed by manual validation and human-guided follow-up. The same principle applies here again: the better the initial prompt is formulated and the more deeply the intended testing approach is described, the more accurate the result will be.

5. Any input on whether this is a fair assessment at pentagi's current maturity level would be appreciated.
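To make the scoping point concrete, here is a minimal sketch of the difference between a general and a narrowly scoped task formulation. The prompt text below is our own hypothetical wording for illustration, not a built-in PentAGI template, and the target URL is a placeholder:

```python
# Hypothetical task formulations for illustration. PentAGI accepts
# free-form task descriptions; narrower scoping tends to yield more
# stable, targeted coverage.

# A general prompt leaves scope, methodology, and success criteria
# to the agent, so results vary widely between runs and models.
general_task = "Test https://target.example against the OWASP Top 10."

# A scoped prompt pins down the category, the entry points, the
# techniques to try, and what counts as a confirmed finding.
scoped_task = """\
Objective: test https://target.example for SQL Injection (OWASP A03).
Scope: the /search and /login endpoints, GET and POST parameters only.
Approach: start with error-based probes, then boolean- and time-based
blind techniques; record every payload and response difference.
Validation: a finding counts only if you can demonstrate data
extraction or a measurable time delay, not an error message alone.
Reporting: include the exact request, payload, and observed evidence.
"""

if __name__ == "__main__":
    # The scoped task carries far more actionable detail per run.
    print(f"general: {len(general_task.split())} words")
    print(f"scoped:  {len(scoped_task.split())} words")
```

The same pattern applies to any other category: the more precisely the objective, scope, methodology, and validation criteria are spelled out, the less the agent has to guess.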
Therefore, we would describe the current state of the product as practically useful and actively evolving, but not as a finished replacement for expert-led manual testing in every possible scenario. Once again, thank you for the review; this kind of feedback is genuinely valuable to us and to the future development of the product. If you would like to continue the discussion, we would be glad to see you in our Telegram/Discord community. We especially value real-world feedback, tuning experience, and workflow suggestions from practitioners.
---
Our internal security team recently evaluated pentagi as an automated penetration testing solution and provided the following assessment. I'd like to share it with the community to understand:
Whether these observations align with the project's current intended scope
If there are configuration improvements, prompt strategies, or upcoming features that address these gaps
How other users are working around these limitations in practice
Assessment Summary
1. Limited OWASP Top 10 Coverage
Operates on pre-programmed logic targeting known vulnerability patterns
Does not comprehensively cover OWASP Top 10 categories end-to-end
Critical areas such as injection, authentication, access control, and business logic are not deeply or systematically evaluated
2. Strength in Scale and Speed
Capable of delivering initial test results within 1–5 hours
Enables rapid first-pass coverage across applications
Well-suited for continuous and large-scale automated assessments
3. Higher False Positive Rate
Observed false positive rate in the range of 10–30%
Findings often require manual triage to validate exploitability
Several issues flagged are non-exploitable due to lack of contextual understanding
4. Limited Context & Business Logic Understanding
Struggles with custom business logic flaws, context-dependent vulnerabilities, and workflow abuse scenarios
Lacks ability to simulate real-world attacker decision-making in complex environments
5. Lack of Exploit Validation
Findings are not validated through end-to-end exploit scenarios
Results in theoretical risks being reported without confirming real attack feasibility
6. Severity Misalignment
Some findings are assigned higher severity without sufficient validation
Creates noise and impacts effective prioritization
Our Take
We see pentagi as a highly effective, rapid, and scalable first-pass security testing tool. The speed and automation are genuinely impressive. However, for complex real-world engagements — particularly business logic vulnerabilities, context-driven attack paths, and exploit validation — it currently falls short of replacing manual penetration testing.
Questions for the community / maintainers
Is pentagi intended to be a standalone pentest solution, or more of an augmentation layer for human pentesters?
Are there plans to improve OWASP Top 10 coverage depth, exploit validation chains, or contextual intelligence?
Has anyone tuned the system prompts or agent configuration to reduce false positives or improve business logic testing?
What's the recommended workflow — pentagi for initial automated sweep, then manual follow-up on findings?
Any input on whether this is a fair assessment at pentagi's current maturity level would be appreciated.