Hi, I’m Ken Johnson, Co-founder and CTO at DryRun Security. If you are unfamiliar with DryRun Security, our product finds the needle in the haystack of code changes, so AppSec teams spot unknown risks before they start.
I wanted to take a moment to reflect on our journey with using LLMs for application security. Plenty of skeptics have doubted whether AI and LLMs could reliably detect vulnerabilities in source code, and unsurprisingly, many of those voices came from traditional SAST vendors. Our experience hasn’t been free of challenges, but as with any emerging technology, navigating the bumps along the way has been part of the process.
Over the past year, we at DryRun Security have immersed ourselves in leveraging large language models (LLMs) for application security. Our exploration of how best to use LLMs to assess risk in software development has been a challenging, yet enlightening, journey. In this post, I’ll share key insights from our experience, including the obstacles we encountered and the lessons we learned.
Why LLMs for Application Security?
Traditionally, identifying vulnerabilities in software has relied heavily on code scanning techniques. These methods typically involve parsing source code into an Abstract Syntax Tree (AST), building call graphs to map function interactions, and searching for specific patterns or naming conventions.
Therein lies the crux of our issues.
At the end of the day, these tools are still looking for fairly exact patterns. More nuanced flaws, the kind that require some level of intelligence to spot, simply slip past yesterday’s approach.
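To make that limitation concrete, here’s a deliberately tiny sketch of the exact-shape matching these tools rely on, written against Python’s built-in ast module. It’s an illustration, not a reproduction of any vendor’s engine: the rule flags the inline string-formatted query, but the same risky query assembled by a helper one line later sails right past it.

```python
# Classic static-analysis style check: walk the AST looking for one exact
# shape, cursor.execute(<string built with %>). Anything that doesn't match
# that literal shape is silently missed.
import ast
import textwrap

SAMPLE = textwrap.dedent("""
    def get_user(cursor, user_id):
        cursor.execute("SELECT * FROM users WHERE id = %s" % user_id)  # flagged
        query = build_query(user_id)  # query assembled elsewhere
        cursor.execute(query)         # same risk, but invisible to the rule
""")

class ExecutePatternVisitor(ast.NodeVisitor):
    def __init__(self):
        self.findings = []

    def visit_Call(self, node):
        # Match the exact pattern: <receiver>.execute(<BinOp using %>)
        if (
            isinstance(node.func, ast.Attribute)
            and node.func.attr == "execute"
            and node.args
            and isinstance(node.args[0], ast.BinOp)
            and isinstance(node.args[0].op, ast.Mod)
        ):
            self.findings.append(node.lineno)
        self.generic_visit(node)

visitor = ExecutePatternVisitor()
visitor.visit(ast.parse(SAMPLE))
print("Flagged lines:", visitor.findings)  # only the literal match is reported
```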
Some of the common complaints with legacy tools include:
- High noise-to-signal ratio
- Difficulty adapting to new technology stacks
- Inability to detect complex or nuanced issues
- Lack of context for changes in code
- Designed primarily for security experts, not developers
- Slow performance
Our hypothesis was that LLMs, known for their proficiency in text summarization and apparent understanding of code intent and behavior, could help overcome these limitations.
With LLMs, we believed we could go beyond exact pattern matching and detect a wider range of nuanced security issues in code.
We also believed we could build a product that spoke to developers the way humans do, one that helped them learn while keeping their code safe.
How Did It Go?
Initially, our results were incredibly underwhelming. We encountered several challenges, including:
- Inconsistent quality across LLMs: Not all LLMs are equally effective at code analysis. Many performed poorly, delivering unreliable results.
- Inconsistent outputs: Even LLMs that excel at analyzing code often struggled to produce consistent results, which is critical for security assessments.
- Privacy and security concerns: Balancing the need for secure analysis against both privacy concerns and the cost of running LLMs required a delicate tradeoff. We explored various solutions: hosting OSS models proved prohibitively expensive, OpenAI was a non-starter for many of the organizations we talked to, and we needed strong privacy and security guarantees in any solution we purchased.
- Training LLMs isn’t a silver bullet: While training models on specific datasets seemed like a potential solution, it came with drawbacks, including lock-in to a particular LLM, the overhead of maintaining the training process, and limits on how quickly we could add support for new technology stacks.
A crucial realization was that success with LLMs hinges on how they are used.
In an ideal world, we’d send code changes to a well-trained LLM and receive accurate, consistent insights about potential issues—complete with suggestions on how to fix them.
Unfortunately, the reality is more complicated. LLMs can’t simply be fed code and expected to identify vulnerabilities. Instead, they require a shift in approach and careful integration with existing tools and processes.
Key Lessons We Learned
While we’re still learning, here are some of the most important lessons we’ve discovered about using LLMs for application security:
- Choose the right LLM for the task. Different LLMs excel at different things. Sometimes you need a model specialized in embeddings; other times, you need one that can perform written tasks well or understand code deeply. Matching the right LLM to the specific job is critical.
- Ask the right questions. Treat LLMs like human code reviewers—broad or vague questions will yield unreliable answers. For example, asking “Does this code have SQL injection?” might result in an uncertain or incomplete response. Instead, the questions need to be specific, concrete, and backed by context. Don’t expect LLMs to solve complex issues on their own—break down your queries and use tuning techniques to guide the LLM effectively.
- LLMs don’t have all the answers, but they can learn. While training an LLM can supply the answers it lacks, that path carries the risk of vendor lock-in and a labor-intensive maintenance process. A more flexible approach is Retrieval-Augmented Generation (RAG), where you can quickly build a knowledge base without being tied to a specific LLM. This method also allows for more dynamic and scalable solutions (a minimal sketch of the pattern appears after this list).
- Robust testing is essential. Anytime you modify code, update your knowledge base, or switch to a different LLM, you need thorough testing in place. Without strong tests, you risk compromising the security insights you’ve worked so hard to generate. (A small regression-test example follows this list; also check out our detailed article on how we test LLMs here.)
- LLMs excel at summarizing behavior. One area where LLMs truly shine is their ability to summarize the behavior of code. With the right setup, they can provide a clear, high-level understanding of what code is doing, which can be incredibly useful for spotting behavioral anomalies.
- Combining deterministic and probabilistic methods works best. While LLMs excel in certain areas, we found that accuracy and speed improved significantly when we combined deterministic and probabilistic methods. For example, using deterministic techniques to identify whether a specific library is present in a codebase provided useful context for the LLM. This context, when fed into the LLM for probabilistic analysis, helped the model perform more effectively. By blending both methods, we were able to leverage the strengths of each and reduce uncertainty (see the sketch after this list).
- Agent-based execution enhances LLM performance. One of our biggest breakthroughs was realizing how effective LLMs become under an agent-based execution model (sketched at the end of this list). Instead of relying solely on single-shot question-and-answer interactions, we let the LLM follow a series of steps, essentially mimicking a chain of thought. By giving the LLM access to external tools and documentation, as well as a structured process for gathering the information it needed, we saw a dramatic improvement in outcomes. This approach allowed the LLM to function more like a human code reviewer, providing deeper insights and more accurate analysis.
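To make the RAG lesson concrete, here’s a minimal sketch of the pattern: retrieve the most relevant knowledge-base entries for a code change and prepend them to the prompt. The toy hash-based embedding and hard-coded knowledge base are stand-ins for a real embedding model and store, and `llm_complete` is a placeholder for whichever chat or completion API you choose; none of this is our production pipeline.

```python
# Minimal RAG sketch: embed a code diff, rank knowledge-base entries by
# cosine similarity, and build a context-rich prompt for the LLM.
import hashlib
import math

KNOWLEDGE_BASE = [
    "Rails: string interpolation inside where() clauses can lead to SQL injection.",
    "Django: mark_safe() disables auto-escaping and can introduce XSS.",
    "Express: res.redirect() with user-controlled URLs enables open redirects.",
]

def toy_embed(text, dims=64):
    """Deterministic bag-of-words embedding; swap in a real embedding model."""
    vec = [0.0] * dims
    for token in text.lower().split():
        bucket = int(hashlib.sha256(token.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve(query, k=2):
    q = toy_embed(query)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda doc: cosine(q, toy_embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(diff):
    context = "\n".join(retrieve(diff))
    return (
        "You are reviewing a code change for security risk.\n"
        f"Relevant guidance:\n{context}\n\n"
        f"Diff under review:\n{diff}\n\n"
        "List any concrete risks introduced by this diff."
    )

diff = 'User.where("name = \'#{params[:name]}\'")  # new Rails query'
print(build_prompt(diff))  # in a real pipeline, send this to llm_complete(...)
```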
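For the testing lesson, here’s the flavor of regression test we mean: pin an expected verdict for a known-bad snippet so that prompt, knowledge-base, or model changes that silently degrade results fail CI. It assumes pytest as the runner, and `analyze_change` is a hypothetical placeholder for whatever entry point your pipeline exposes, not our actual code.

```python
# Pin expected findings for known-bad snippets so pipeline changes that
# degrade detection show up as failing tests.
import pytest

KNOWN_BAD_DIFF = 'cursor.execute("DELETE FROM users WHERE id = %s" % user_id)'

def analyze_change(diff):
    """Placeholder: call the real LLM pipeline and return a list of finding tags."""
    return ["sql_injection"]  # hard-coded so this sketch runs standalone

@pytest.mark.parametrize("diff,expected", [(KNOWN_BAD_DIFF, "sql_injection")])
def test_pipeline_still_flags_known_issue(diff, expected):
    assert expected in analyze_change(diff)
```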
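Here’s a rough sketch of the deterministic-plus-probabilistic combination: a cheap, exact parse establishes which libraries the project declares, and that fact is handed to the LLM as context alongside a specific, concrete question instead of asking the model to guess. The Gemfile, diff, and prompt wording are illustrative assumptions, not our real implementation.

```python
# Deterministic step: exact parse of declared gems (no inference).
# Probabilistic step: an LLM prompt that includes that context plus a
# narrow, concrete question about the diff.
import re

GEMFILE = """
gem "rails", "~> 7.1"
gem "pg"
gem "sequel"
"""

def detect_libraries(gemfile_text):
    """Deterministic: return the list of gems declared in the Gemfile text."""
    return re.findall(r'^gem\s+"([^"]+)"', gemfile_text, flags=re.MULTILINE)

def build_security_prompt(diff, libraries):
    known = ", ".join(libraries) or "none detected"
    return (
        f"The project declares these libraries: {known}.\n"
        "Given that context, does the following change introduce SQL injection "
        "via string interpolation in a query builder? Point to the specific "
        "line and explain, or state that no such pattern is present.\n\n"
        f"{diff}"
    )

diff = 'DB["SELECT * FROM users WHERE name = \'#{name}\'"]  # Sequel raw query'
prompt = build_security_prompt(diff, detect_libraries(GEMFILE))
print(prompt)  # in production this prompt would go to the chosen LLM
```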
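And here’s a stripped-down sketch of the agent-style loop: the model repeatedly chooses a tool, we execute it, and the result is fed back until the model commits to a finding. The scripted `fake_llm`, the stubbed tools, and the repository contents are all stand-ins so the control flow runs on its own; this is not DryRun Security’s actual implementation.

```python
# Agent loop sketch: the "model" picks a tool, the harness executes it,
# and the observation is appended to the conversation until a final
# answer is produced.
import json

def read_file(path):
    """Tool: return the contents of a (stubbed) repository file."""
    repo = {"app/models/user.rb": 'User.where("name = \'#{name}\'")'}
    return repo.get(path, "<file not found>")

def search_docs(query):
    """Tool: return a (stubbed) documentation snippet."""
    return "ActiveRecord: pass user input as bind parameters, never interpolate."

TOOLS = {"read_file": read_file, "search_docs": search_docs}

SCRIPTED_STEPS = [  # stands in for successive LLM responses
    {"tool": "read_file", "args": {"path": "app/models/user.rb"}},
    {"tool": "search_docs", "args": {"query": "ActiveRecord safe queries"}},
    {"final": "Interpolated input in User.where is a likely SQL injection."},
]

def fake_llm(history):
    return SCRIPTED_STEPS[len([m for m in history if m["role"] == "assistant"])]

def run_agent(question, max_steps=5):
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = fake_llm(history)
        history.append({"role": "assistant", "content": json.dumps(step)})
        if "final" in step:
            return step["final"]
        result = TOOLS[step["tool"]](**step["args"])  # execute the requested tool
        history.append({"role": "tool", "content": result})
    return "No conclusion within step budget."

print(run_agent("Does this change introduce SQL injection?"))
```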
Problematic LLMs We Encountered
While many LLMs show promise, we also encountered several that presented too many issues for us to adopt them in our workflow. Below is a list of some of the problematic LLMs we tested, along with the key shortcomings:
- CodeLlama: While CodeLlama showed promise in certain areas, it struggled with consistent code analysis, particularly in distinguishing between different input types such as few-shot prompts, context, and the code itself. Additionally, using CodeLlama with cloud services like SageMaker proved challenging, requiring significant effort in prompt formatting and modifying libraries to ensure embeddings were processed correctly.
- LLaMA: At the time of this writing, LLaMA 3.2 has been released, but we have only tested versions 2 and 3. These earlier versions demonstrated notable limitations in understanding and processing more complex code, making them less effective for in-depth analysis. However, LLaMA excelled in other areas, particularly in writing and summarizing text, which can be useful for generating high-level overviews.
- Mixtral: With Mixtral, we encountered problems with specific programming languages, and its results were highly inconsistent.
- Mistral: Mistral generated overly broad or irrelevant responses during security assessments.
This journey with LLMs has been filled with challenges, but the potential benefits for application security are immense. We’re excited to continue pushing the boundaries of what LLMs can do and to share our progress with you along the way.
Summary
Our journey with LLMs for application security has been both challenging and rewarding. While traditional static analysis tools have limitations, LLMs offer a new approach to code analysis that allows for greater nuance and insight into security risks.
Through trial and error, we learned that success with LLMs requires the right combination of deterministic and probabilistic methods, agent-based execution, and robust testing frameworks. Although not all LLMs are well-suited for this task, those that excel can summarize code behavior and improve security outcomes when used correctly.
In case you were wondering, here’s an article on how we addressed privacy and security issues for our customers: How We Keep Your Code Safe at DryRun Security
As we continue to push the boundaries of what’s possible with AI in security, we’re excited to share more insights on how we scale and refine these methods in our next post. Stay tuned and follow us on LinkedIn to receive more updates.
Thanks for reading this far! If you’re interested in seeing how we leverage LLMs to find risk before it gets merged, then I’d recommend checking out our 3-min demo video or setting up a 1:1 personalized demo with our team.