How AI Can Navigate Code and Catch Complex Vulnerabilities

Jobert Abma
Co-founder & Engineering
Dan Mateer
Senior Director, Delivery Excellence

HackerOne has been on a mission to understand why “shift-left” security isn’t working and to build a methodology-based solution that gets it right. Developing this methodology has involved ongoing exploration of what responsibly applied AI technology is capable of.

One of the questions HackerOne Engineering set out to answer early on was: 

“Can an AI agent perform a security review, comprehensive or partial, on routine day-to-day changes to a codebase?”

The answer is yes.

This article covers how we proved this with the initial version of an AI agent capable of navigating a codebase to understand context in ways conventional static application security testing (SAST) engines can’t, and how the AI agent came up with an exploit path itself, without human intervention. This is just one of many research initiatives that led to the creation of our development security solution, HackerOne Code.

How vulnerabilities in code are introduced

There are two primary ways in which a code change can have security vulnerabilities: 

1) The code change itself contains a vulnerability; or,

2) The code change impacts existing code in a way that introduces a vulnerability (e.g., by creating an opportunity to bypass security safeguard logic).

This includes code changes that involve third-party dependencies and the chain of dependencies they install.

For a security review to be adequate, the context of the entire codebase needs to be taken into account. This is important in either case, but especially in #2. An AI agent needs to determine reachability through possible execution paths, then assess the likelihood of reachability, the potential business impact, and alternative implementations. To do this right, an AI system needs to be able to learn a repository (and learn it quickly).

This was a challenge because some of the proven solutions for knowledge-based problems in AI applications don’t work as well for code. For example, retrieval-augmented generation (RAG), commonly used for AI productivity in sales and support automations, works by retrieving passages from a corpus of written prose (e.g., policies, operational processes) and supplying them to a large language model (LLM) to improve factual accuracy. Such corpora are updated infrequently compared to a codebase, where several changes are merged every day. So for code security, a framework like RAG doesn’t work as a standalone solution: its source of knowledge would quickly become out of date (and its output inaccurate).

AI that learns code like a developer

In order to provide adequate context for an AI agent and minimize false positives, we first needed a strong foundational code navigation capability. It would need to mimic the actions and thought process of a real human engineer reviewing code for security flaws, asking questions like:

“I see that this function definition has changed. Where is this function used?”

“I see that this change adds another endpoint. Does it ensure the user is authenticated and authorized before executing the business logic?”

Doing this requires a mechanism capable of indexing a code repository. It needs to be able to track symbol definitions (e.g., classes, functions, modules, constants, variables) and symbol references (i.e., where a defined symbol is called), and to understand how they work together.
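As a rough illustration, such an index can be as simple as two maps keyed by symbol name. The structures below are a hypothetical sketch, not our implementation:

    # Hypothetical symbol index: definitions and references per name.
    SymbolDef = Struct.new(:name, :kind, :file, :line)
    SymbolRef = Struct.new(:name, :file, :line)

    class SymbolIndex
      def initialize
        @definitions = Hash.new { |h, k| h[k] = [] }
        @references  = Hash.new { |h, k| h[k] = [] }
      end

      def add_definition(sym)
        @definitions[sym.name] << sym
      end

      def add_reference(sym)
        @references[sym.name] << sym
      end

      # "Where is this function used?"
      def references_to(name)
        @references.fetch(name, [])
      end

      # "Where is this symbol defined?"
      def definitions_of(name)
        @definitions.fetch(name, [])
      end
    end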

The abstract syntax tree (AST) normalization challenge

The first attempt at solving this was to use Tree-sitter, a tool that generates programming-language-agnostic abstract syntax trees (ASTs), to extract symbol definitions (e.g., function definitions) and symbol references (i.e., where a defined function is called). The benefits of this approach are that it is completely deterministic and fast.

This proved to work, but only in isolation with pre-determined samples. To be useful, the system would need to be able to process any kind of code it encountered, so grammars for every programming language would need to be supported, including syntactic differences between language versions and various programming paradigms.

Tree-sitter offers the ability to combine grammars for files that contain multiple languages (e.g., ERB, PHP, JSX), but normalizing syntax trees for global compatibility is impractical. The scope of codebase compositions and the variety of ways developers write code is virtually boundless.

Testing with named entity recognition (NER) tasks

Seeking to fill the capability gaps we encountered using ASTs alone, we turned to LLMs. This included experimenting with LLMs behaving like a named entity recognition (NER) model. Our thesis was that by extracting symbol definitions and references on a best-effort basis, an AI agent would be able to understand execution paths through a control-flow graph with a high degree of accuracy. This was inspired in part by a common, real-life use case: how an experienced engineer reviews code changes in a repository they’re new to.

This solution is inherently non-deterministic; in theory this wouldn’t be as effective as using an AST built and finely tuned for the job. But it did prove effective as an indexing method to help an AI agent navigate code. 
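As a minimal sketch of the idea, the snippet below asks a model to label symbols and return structured output. The prompt, the JSON shape, and the llm_complete stub are all hypothetical stand-ins, not our production setup:

    require "json"

    # Hypothetical sketch: use an LLM as a NER-style extractor.
    # llm_complete is a stub standing in for a real model client.
    def llm_complete(prompt)
      # A real implementation would call a model provider here.
      '{"definitions": [], "references": [{"name": "say_hello"}]}'
    end

    def extract_symbols(source)
      prompt = <<~PROMPT
        Label every symbol in the source file below. Respond with JSON only:
        {"definitions": [{"name": "...", "kind": "..."}],
         "references": [{"name": "..."}]}

        #{source}
      PROMPT
      JSON.parse(llm_complete(prompt))
    end

    p extract_symbols('say_hello(params.slice("name"))')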

To understand how it works, let’s start with the following code snippet:

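A minimal reconstruction of that snippet, assuming a Sinatra-style app (the file name app.rb is illustrative):

    # app.rb (reconstruction; a Sinatra-style route is assumed)
    require "sinatra"
    require_relative "hello"

    # Note: an extractor would likely also flag Sinatra's get as a call.
    get "/" do
      say_hello(params.slice("name"))
    end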

The NER-like model extracts require, require_relative, params, slice, and say_hello, identifying them as called functions and finding no function definitions in the file.

Now let’s provide it with a file that contains a combination of definitions and references:

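A minimal reconstruction of such a file (again, the file name is illustrative):

    # hello.rb (reconstruction)
    require_relative "name"

    def say_hello(params)
      "Hello, #{get_name(params["name"])}!"
    end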

It extracts require_relative and get_name as functions that are called and say_hello as a function definition. 

This approach allows us to create a call graph for an AI agent to use. The first step is to repeat this process for each file to get all the symbol definitions. The second step is to create the edges between the nodes to capture the relationships between them.
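A rough sketch of what the resulting graph and its traversal can look like; the CallGraph class below is illustrative, not our implementation:

    # Illustrative call graph: caller name => callee names.
    class CallGraph
      def initialize
        @edges = Hash.new { |h, k| h[k] = [] }
      end

      def add_call(caller_name, callee_name)
        @edges[caller_name] << callee_name
      end

      def callers_of(name)
        @edges.select { |_, callees| callees.include?(name) }.keys
      end

      # Walk callers recursively until reaching roots (cycle handling
      # omitted), mirroring the agent's "where is this used?" loop.
      def paths_to_roots(name, path = [name])
        parents = callers_of(name)
        return [path] if parents.empty?
        parents.flat_map { |p| paths_to_roots(p, path + [p]) }
      end
    end

    graph = CallGraph.new
    graph.add_call("GET /", "say_hello")
    graph.add_call("say_hello", "get_name")

    p graph.paths_to_roots("get_name")
    # => [["get_name", "say_hello", "GET /"]]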

Putting it to the test

In order to prove this could be used by an agent to navigate a codebase, we created a simple test case using a raw git diff as the input:

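A reconstruction of the kind of diff used, consistent with the reconstructed files above (the file name name.rb is assumed):

    diff --git a/name.rb b/name.rb
    --- a/name.rb
    +++ b/name.rb
    @@ -1,3 +1,3 @@
     def get_name(name)
    -  "Jon Snow"
    +  name
     end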

By itself, this code doesn’t hint at a security vulnerability. Instead of returning a constant string (“Jon Snow”), the function would return the value of a parameter given to it.

To assess this change in the context of the repository it’d be introduced into, the first step an AI agent has to take is to look up where the get_name function is used.

This would return:

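In the reconstruction above, that’s the say_hello definition:

    # hello.rb
    require_relative "name"

    def say_hello(params)
      "Hello, #{get_name(params["name"])}!"
    end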

Then the AI agent would need to investigate where the say_hello function is used. This requires recursively looking up symbol references, essentially traversing the graph, until it determines whether the function is called with user input.

So the ideal next step, what a human engineer would do, is to look up references to the say_hello function:

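In the reconstruction, that’s the route handler:

    # app.rb
    get "/" do
      say_hello(params.slice("name"))
    end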

At this point, the AI agent should have enough information to determine that there is an HTTP endpoint (“/”) that takes a subset of GET parameters (name) and passes it to say_hello, where it’s concatenated into a string with the result of the get_name function. In other words, the proposed change would introduce a cross-site scripting vulnerability.

We built an AI agent with code navigation capabilities to run the test. Given just the git diff, it would need to recursively traverse the symbol graph, understand how the code is being used in the repository, and catch the vulnerability.

Results

It performed exactly how we hoped it would.

[Screenshot: the output of the AI agent capable of codebase navigation, describing the exploit path the proposed code would introduce.]

This works well in development workflows as commit checks or pipeline scans when a pull request is opened.

The code itself contained no obvious vulnerability. But by creating a call graph the AI agent could traverse to gain functional context, it was able to catch a cross-site scripting (XSS) vulnerability.

Implications and takeaways

These early results were exciting and created a strong foundation for further research and refinement. Building on the NER approach by combining it with lexical analysis, we can query for nodes like “entry points” for exploit paths. 
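For instance, with a call graph like the earlier sketch, a first cut at “entry points” can be as simple as nodes no other node calls (again, an illustrative sketch, not our implementation):

    # Hypothetical sketch: entry points are callers that never
    # appear as callees, e.g. HTTP route handlers.
    edges = {
      "GET /"     => ["say_hello"],
      "say_hello" => ["get_name"],
    }

    callees = edges.values.flatten.uniq
    entry_points = edges.keys.reject { |name| callees.include?(name) }

    p entry_points  # => ["GET /"]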

While we see incredible potential for application security testing in the era of AI, there’s no single automation solution for preventing vulnerabilities from being introduced in code. Powerful detection systems need to be balanced by strong validation mechanisms, like human-in-the-loop (HITL) validation and feedback channels. Without them, an AI agent can misinterpret code and overwhelm developers with low-signal feedback. This puts a huge cognitive burden on developers and is part of how “shift-left” security initiatives often fail. This is why incorporating expert validation is a critical component of our development security solution, HackerOne Code.

For more on AI capabilities for navigating complex codebases to find security flaws, check out Broken Security Promises: How Human-AI Collaboration Rebuilds Developer Trust.