Contextual Security Analysis
August 29, 2024

How We Harnessed LLMs for Security and Why Testing is Our Secret Weapon

Background

When we set out to build DryRun Security, we had no idea that we were going to use LLMs in any capacity, let alone put them at the heart of everything we do. That work, the job of keeping software secure, is the crux of why testing matters so much to us: the results of our product need to be accurate, consistent, and precise.

We need to be able to demonstrate that an LLM-backed approach is more robust than legacy Static Application Security Testing (SAST) tools.

The SAST tools you see today have been performing static analysis the same way for at least 26 years. While both the security and developer communities generally regard them as noisy and inaccurate, they were designed to perform their analysis deterministically. In other words, legacy SAST tools are tremendously limited in what they can do, but they do their analysis in a very controlled and repeatable way.

Today, and although we believe this will change, that deterministic behavior earns existing SAST solutions more trust from the security community than LLMs, which use probabilistic methods to make their assertions and at times can appear to “make things up”.

We will be the first to admit that without loads of work, LLMs often produce inconsistent and inaccurate answers. This is where proper testing really shines.

Testing our LLMs’ ability to produce consistent and accurate answers means we do not have to trust the system’s answers; we have the capability to verify them.

DryRun Security Components

Before we get into the meat of how we perform testing, it helps to understand the components that make up our product and what, specifically, we are testing in this post.

When GitHub pull requests are opened or updated, DryRun Security simultaneously runs a suite of analyzers that each evaluate either the pull request code OR other data points such as authorship, intent, or behavior. We call this Contextual Security Analysis, or “CSA”, and you can read more about it here.

In this article, we focus specifically on the testing behind our code analyzers. These analyzers work by asking a set of code-specific questions (we call this “Code Inquiry”) about the code that is changing; the responses then help our tool evaluate whether the code changes introduce a security vulnerability.
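
To make this concrete, here is a minimal sketch of what a “Code Inquiry”-style evaluation looks like in spirit. The class, question wording, and the `ask_llm` callable below are placeholders for illustration, not our production code.

```python
# Minimal sketch of a "Code Inquiry"-style evaluation. The class, question
# wording, and ask_llm callable are illustrative placeholders.
from dataclasses import dataclass
from typing import Callable


@dataclass
class InquiryQuestion:
    question_id: str  # e.g. "detect_user_input"
    prompt: str       # the code-specific question posed about the diff


def run_inquiry(diff_text: str,
                questions: list[InquiryQuestion],
                ask_llm: Callable[[str, str], str]) -> dict[str, bool]:
    """Ask each question about the changed code and collect boolean answers."""
    answers = {}
    for q in questions:
        # ask_llm stands in for whatever client actually queries the model;
        # here we assume it returns a plain "true"/"false" string.
        raw = ask_llm(q.prompt, diff_text)
        answers[q.question_id] = raw.strip().lower() == "true"
    return answers
```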

The Importance of Testing

However, we can do NONE of this without proper benchmarking and testing. A product that purports to leapfrog not only the competition but an entire class of tools needs to be AT LEAST as useful as those tools and then show additional value. It needs to be able to demonstrate this through comparisons and repeatable testing.

LLMs themselves are difficult to make repeatable by the very nature of the technology: they are designed to produce slightly different responses each time, primarily to avoid sounding robotic. If you are curious to learn more, we recommend reading “What Is ChatGPT Doing … and Why Does It Work?” by Stephen Wolfram. This is why LLMs appear to fall down quickly when connected to your code base and asked to repeat the same evaluation multiple times: the first request might be great, the second okay, and by the third and fourth the LLM has already gone off the rails and responds with completely inaccurate results.

We are convinced that LLMs producing different answers and factually incorrect information is a large part of why there is still hesitance amongst security technologists to embrace the new technology. We get that, and of course it makes sense.

It took incredible amounts of experimentation and research to make LLMs behave in a consistent and accurate manner. It took so much work that we know most people securing software are unlikely to ever have the same time to invest in figuring out how to tame this technology so that it can be used for software security.

Without that significant investment of time and effort, the value just isn’t obvious. Luckily, we had the resources and the desire, and we found that not only could we make it work, it could do things that legacy SAST tools cannot.

However, once we had consistent and repeatable results, we needed to ensure those results did not change as we made improvements and modifications to these analyzers. When building around LLMs at production scale, you need to account for the variance in results that comes from changing even a couple of words in a system prompt, or from trivially small changes to few-shot prompt examples, RAG-backed context, variables, and metadata. Any slight change, especially at the scale and level of complexity we are talking about, can cause massive variance, and we need to be able to uncover those issues before shipping to customers.

Hopefully, dear reader, we’ve convinced you how important testing is to us and how important it should be to any organization building around LLMs.

Testing Overview

As previously mentioned, and for the purposes of this article, we will focus on the testing of the analyzers that perform code analysis. Each time one of these analyzers believes it has discovered a valid vulnerability in code, it anonymizes the data and sends it to an ephemeral data store in what we call a “code hunk” format.
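
The exact schema of a code hunk is beside the point here, but conceptually it carries the anonymized change plus enough metadata to sort it later. A simplified illustration (the field names are ours, for illustration only):

```python
# Simplified illustration of an anonymized code hunk record; the field names
# are illustrative, not our exact production schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodeHunk:
    vulnerability_type: str   # e.g. "command_injection"
    action: str               # the question that produced it, e.g. "detect_user_input"
    language: str             # e.g. "python"
    framework: Optional[str]  # e.g. "flask", when one is detected
    diff: str                 # the anonymized code change itself
```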

When an analyzer needs to use these code hunks to perform testing, we pull them down locally into a git repository, sort them into their respective buckets, and point the analyzer to this local git repository via a configuration file.

When we run our tests within the analyzer, it will use the code hunks to determine if it is performing correctly. If even one test fails, we know we’ve got work to do. We run these tests not once but multiple times to ensure consistency.

When we say “sort into buckets”, we mean that any given analyzer may ask several questions, each of which is expected to produce a boolean answer. For example, imagine a command injection analyzer. At a very basic level, it needs to know a few things:

  1. Is user supplied input present?
  2. Is that user input being placed into a system call?
  3. Is that system invocation vulnerable or being used in a vulnerable way?

This leads us to ask several questions about the code and to backfill the LLM with the various bits of information it needs in order to answer them. Each answer must come back as either true or false (a boolean answer).
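
One plausible way to picture how those answers combine, with placeholder function names rather than our production logic, is sketched below: a false answer early in the chain means there is nothing further to evaluate.

```python
# Sketch of combining the three boolean questions for a command injection
# check. ask_boolean is a placeholder for an LLM-backed question that
# returns True or False; the chaining shown here is illustrative.

def evaluate_command_injection(diff_text: str, ask_boolean) -> bool:
    # 1. Is user-supplied input present?
    if not ask_boolean("detect_user_input", diff_text):
        return False  # no user input, so nothing further to evaluate

    # 2. Is that user input being placed into a system call?
    if not ask_boolean("analyze_system_call", diff_text):
        return False

    # 3. Is that system invocation vulnerable or being used in a vulnerable way?
    return ask_boolean("analyze_vuln", diff_text)
```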

In the above scenario, because we are asking three distinct questions, our test cases contain three separate folders, all listed under a specific vulnerability type. So imagine the following folder structure:

- command_injection
  - language
    - framework/library
  - detect_user_input
    - inbox
    - true
    - false
  - analyze_system_call
    - inbox
    - true
    - false
  - analyze_vuln
    - inbox
    - true
    - false
When we run our synchronize script, the anonymized code hunks are placed into the inbox folder under their respective vulnerability type and “action” (the type of action taken, e.g., detect_user_input).
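
A simplified sketch of that synchronization step, assuming a code hunk record like the one illustrated earlier (the paths and naming scheme are ours, for illustration):

```python
# Simplified sketch: drop each pulled code hunk into
# <vulnerability_type>/<action>/inbox/ inside the local test cases repo.
# The paths, naming scheme, and code hunk fields are illustrative.
import hashlib
from pathlib import Path


def sync_to_inbox(hunks, test_cases_repo: Path) -> None:
    for hunk in hunks:
        inbox = test_cases_repo / hunk.vulnerability_type / hunk.action / "inbox"
        inbox.mkdir(parents=True, exist_ok=True)
        # Hash-based file names keep repeated syncs from creating duplicates.
        name = hashlib.sha256(hunk.diff.encode()).hexdigest()[:12] + ".diff"
        (inbox / name).write_text(hunk.diff)
```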

Once the code hunks have been placed into the inbox folder, we humans review where they belong and sort them into the correct true/false folder. We then run our tests from the analyzer’s code base; in this case, that would be the command injection analyzer’s code base. Each analyzer is built on our analyzer framework, so a developer only needs to point to the location of the test cases repo on their local machine and define the vulnerability category type (which directly corresponds to the vulnerability category folder name).
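
That configuration is deliberately small. In spirit it only needs to convey two things, shown here as a plain Python mapping purely for illustration (the real configuration file format is not the point):

```python
# Illustration only: the two pieces of information the test configuration
# conveys. The actual configuration file format is not shown in this post.
TEST_CONFIG = {
    "test_cases_repo": "/home/dev/analyzer-test-cases",  # local clone of the test cases repo
    "vulnerability_category": "command_injection",       # matches the category folder name
}
```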

When we’re confident in the work we did to sort things correctly in the local test cases git repository, we’ll submit a pull request to add relevant code hunks to our analyzer tests repo.

Because all of our code-specific analyzers are written in Python, we opted to use the pytest framework. We primarily use unit tests to ensure that our knowledge base, which provides the LLM with all of the relevant information it needs, is in the correct format and in working order. We use integration tests to ensure the LLM responds correctly to each code hunk it evaluates.
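
As a rough illustration of what such an integration test can look like with pytest, the sketch below walks the true/false buckets described above and checks that the analyzer’s boolean answer matches the folder a human sorted each hunk into. The repo path, category, and `run_question` entry point are placeholders, not our real analyzer framework.

```python
# Illustrative pytest integration test: every code hunk sorted into a true/
# or false/ folder should get the matching boolean answer from the analyzer.
# The repo path, category, and run_question entry point are placeholders.
from pathlib import Path

import pytest

TEST_CASES_REPO = Path("~/analyzer-test-cases").expanduser()  # from the developer's config
CATEGORY = "command_injection"                                # vulnerability category folder


def run_question(action: str, code_hunk: str) -> bool:
    """Placeholder for the analyzer framework's real entry point."""
    raise NotImplementedError


def collect_cases():
    base = TEST_CASES_REPO / CATEGORY
    if not base.is_dir():
        return  # no local test cases checked out
    for action_dir in sorted(p for p in base.iterdir() if p.is_dir()):
        for expected in ("true", "false"):
            for hunk_file in sorted((action_dir / expected).glob("*.diff")):
                yield pytest.param(
                    action_dir.name, hunk_file, expected == "true",
                    id=f"{action_dir.name}/{hunk_file.name}",
                )


@pytest.mark.parametrize("action,hunk_file,expected", collect_cases())
def test_code_hunk_answers(action, hunk_file, expected):
    answer = run_question(action, hunk_file.read_text())
    assert answer == expected
```

As noted earlier, a single green run is never the end of it for us; we run these tests multiple times, since one passing run says little about consistency.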

Other relevant details to provide are:

  • We have a tool to recreate pull requests from OSS repos so that we can test with live data in our staging environment
  • We have the ability to run in production on silent mode so that when we make changes we can observe the effect without impacting customers
  • We have multiple facets of observability in place to ensure that our tool is behaving as intended

In conclusion, while large language models (LLMs) are potent tools, their default configurations often fall short of the accuracy and consistency required for reliable, critical analysis. Developers leveraging LLMs must not only invest in rigorous initial adjustments to ensure optimal performance but also establish comprehensive testing protocols to maintain that standard over time.

Special thanks to Joshua on our engineering team for his work on designing the first version of our testing framework.

Try DryRun Security Yourself

Ready to get more out of your secure code review? Request a demo or try DryRun Security free today and see how our suite of analyzers can help you secure your code with confidence.

Explore more at DryRun Security and download our free Contextual Security Analysis Guide to learn more about our innovative approach to application security.