Wouldn’t it be great to have a system that automatically performs a pentest of any application you provide it with? This week’s paper is a step in that direction:

PENTESTGPT: Evaluating and Harnessing Large Language Models for Automated Penetration Testing

By Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, Stefan Rass

TL;DR

Vanilla LLMs struggle with end-to-end pentesting. PentestGPT splits the work across three cooperating modules (reasoning, generation, and parsing) and maintains structured context outside the model to overcome its limitations.

The Problem

Penetration testing is a manual, time-consuming, and expensive process. Companies spend significant resources on skilled professionals to identify security vulnerabilities. Given the capabilities of state-of-the-art (SOTA) large language models (LLMs), how effective are they at automating penetration testing? What are their limitations, and how can they be addressed?

The Solution

Two key components:

  1. Study of LLM Capabilities
  2. Design and Implementation of PentestGPT

Study

The authors curate a set of penetration testing targets from VulnHub and HackTheBox, covering a range of difficulty levels. Each target is broken into multiple subtasks, yielding 182 subtasks across 13 targets and covering 18 CWEs. Three professional penetration testers verified the entire process to ensure accuracy. Evaluation follows a human-in-the-loop approach in which testers execute the LLM's instructions verbatim, without adding their own expertise. The models tested are GPT-3.5, GPT-4, and Bard (based on LaMDA at the time).

The study finds that LLMs excel at using pentesting tools and interpreting source code, but struggle with limited long-term memory (losing context over long sessions), a tendency to fixate on the most recent task instead of the broader strategy (closer to depth-first than breadth-first search), and occasional incorrect tool use.
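
To make the setup concrete, here is a minimal sketch of how such a benchmark could be represented and scored. This is my own illustration, not the authors' code; the field names and the per-subtask completion flag are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    description: str          # e.g. "identify open ports with a port scan"
    completed: bool = False   # marked by the human tester during the session


@dataclass
class Target:
    name: str                                           # a VulnHub or HackTheBox machine
    difficulty: str                                      # "easy" | "medium" | "hard"
    cwes: list[str] = field(default_factory=list)        # associated CWE identifiers
    subtasks: list[Subtask] = field(default_factory=list)


def subtask_completion_rate(targets: list[Target]) -> float:
    """Fraction of all benchmark subtasks completed in an LLM-guided run."""
    subtasks = [s for t in targets for s in t.subtasks]
    return sum(s.completed for s in subtasks) / len(subtasks)
```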

PentestGPT: The Architecture

PentestGPT consists of three key modules:

  1. Parsing Module: Extracts relevant information from verbose tool outputs, removing noise.
  2. Reasoning Module: The core decision-making component, responsible for maintaining a structured, high-level plan of the pentest. It builds and updates a Pentesting Task Tree (PTT), which tracks completed, ongoing, and future tasks (see the sketch after this list). The PTT prevents the model from overly focusing on recent tasks and provides a structured view of the penetration test status. At each iteration, the reasoning module updates the PTT with newly acquired information, keeping the plan in step with the actual state of the test. The framework verifies the correctness of each update before the LLM selects the most beneficial next action based on estimated success probabilities and impact.
  3. Generation Module: Uses Chain-of-Thought (CoT) reasoning to determine tools, execution steps, and commands for the next pentesting action. It generates step-by-step instructions that are then presented to the human-in-the-loop for execution, ensuring precise and effective pentesting.
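
The PTT is the most interesting part of the design, so here is a minimal sketch of what such a tree could look like, reconstructed from the paper's description rather than taken from the actual implementation. The node names and the selection step are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    COMPLETED = "completed"
    ONGOING = "ongoing"
    TODO = "to-do"


@dataclass
class TaskNode:
    description: str                      # e.g. "enumerate SMB shares"
    status: Status = Status.TODO
    children: list["TaskNode"] = field(default_factory=list)

    def add(self, description: str) -> "TaskNode":
        child = TaskNode(description)
        self.children.append(child)
        return child


def open_leaves(node: TaskNode) -> list[TaskNode]:
    """Unfinished leaf tasks, i.e. the candidate next actions."""
    if not node.children:
        return [] if node.status is Status.COMPLETED else [node]
    return [leaf for child in node.children for leaf in open_leaves(child)]


# Hypothetical state after an initial scan: the reasoning module would serialize
# the tree into the prompt and ask the LLM to pick the most promising open leaf.
root = TaskNode("pentest 10.0.0.5", Status.ONGOING)
recon = root.add("reconnaissance")
recon.add("port scan with nmap").status = Status.COMPLETED
recon.add("enumerate web service on port 80")
candidates = open_leaves(root)   # the only open leaf: "enumerate web service on port 80"
```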

Bonus Feature: Active Feedback allows human testers to provide corrections and steer the process interactively.
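
Putting the pieces together, the overall loop might be wired roughly as below, reusing the TaskNode sketch above. All the callables are stand-ins for LLM prompts and for the human tester; this is not the project's real API, just my reading of the control flow.

```python
from typing import Callable


def pentest_session(
    root: TaskNode,                                    # the PTT from the sketch above
    pick_next_task: Callable[[TaskNode], TaskNode],    # reasoning module (LLM-backed)
    generate_steps: Callable[[TaskNode], str],         # generation module (CoT prompting)
    execute: Callable[[str], str],                     # the human tester runs the commands
    summarize: Callable[[str], str],                   # parsing module condenses raw output
    update_tree: Callable[[TaskNode, TaskNode, str], None],  # fold findings into the PTT
    max_turns: int = 50,
) -> None:
    """Illustrative control flow only; the real tool is an interactive CLI session."""
    for _ in range(max_turns):
        if not open_leaves(root):              # nothing left to try
            break
        task = pick_next_task(root)            # choose the most promising open task
        commands = generate_steps(task)        # concrete, step-by-step instructions
        raw_output = execute(commands)         # human runs them and pastes the output back
        findings = summarize(raw_output)       # strip tool noise before it re-enters context
        update_tree(root, task, findings)      # active feedback could also edit the tree here
```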

Results

  • Performance Comparison:
    • PentestGPT with GPT-3.5 can solve only easy challenges end-to-end.
    • PentestGPT with GPT-4 extends capability to some medium-level targets end-to-end.
    • Clear performance gap between vanilla GPT models and PentestGPT.
    • When looking at individual subtasks, both PentestGPT and vanilla LLMs show progress across all difficulty levels, even in hard challenges.
  • Strategic Comparison:
    • PentestGPT follows a more human-like strategy than raw LLMs.
    • However, it still over-prioritizes brute-force attacks, unlike expert testers.
  • Ablation Study:
    • All three modules are essential to PentestGPT’s success.
    • The Reasoning Module contributes the most to overall performance.

My Thoughts

Given the fast pace of progress in AI, this is no longer a recent paper. Since its release, smaller local models have reached or even surpassed GPT-3.5 performance. With open-source reasoning models, it’s now possible to have a pentesting assistant that can handle up to medium-level challenges—a huge leap forward!

While the study mentions that the human-in-the-loop testers were instructed not to use their expert knowledge, it would have been valuable to include performance breakdowns for different testers. This could help assess how repeatable and reliable the results are across varying levels of expertise.

Another interesting question: what happens if a non-expert is the human-in-the-loop? Could PentestGPT effectively guide an absolute beginner through a penetration test? This would be an exciting direction for future research.

Some tools used by PentestGPT require a GUI, which limits automation potential. Future iterations could explore how independent PentestGPT can become when given direct access to command-line tools. From what I can see, the team is actively developing in this direction, and there are even plans to integrate GUI-based tools. Excited to see how this progresses!

Finally, I really like the Pentesting Task Tree (PTT) concept. By storing the high-level plan externally, PentestGPT mitigates LLM memory constraints while leveraging structured programming principles—a great example of combining AI with traditional CS methodologies.

Overall, PentestGPT is a fascinating step towards automated penetration testing. While still requiring human oversight, it shows that LLMs can meaningfully contribute to security testing. As open-source reasoning models improve, autonomous pentesting assistants may soon be within reach.