Tool	Accuracy of Findings	Detects Non-Pattern-Based Issues?	Coverage of SAST Findings	Speed of Scanning	Usability & Dev Experience
DryRun Security	Very high – caught multiple critical issues missed by others	Yes – context-based analysis, logic flaws & SSRF	Broad coverage of standard vulns, logic flaws, and extendable	Near real-time PR feedback	Clear PR comments, expandable policies with no scripting or coding (NLCP)
Snyk Code	High on well-known patterns (SQLi, XSS), but misses other categories	Limited – AI-based, focuses on recognized vulnerabilities	Good coverage of standard vulns; may miss SSRF or advanced auth logic issues	Fast, often near PR speed	Decent GitHub integration, but rules are a black box
GitHub Advanced Security (CodeQL)	Very high precision for known queries, low false positives	Partial – strong dataflow for known issues, needs custom queries	Good for SQLi and XSS but logic flaws require advanced CodeQL experience.	Moderate to slow (GitHub Action based)	Requires CodeQL expertise for custom logic
Semgrep	Medium, but there is a good community for adding rules	Primarily pattern-based with limited dataflow	Decent coverage with the right rules, can still miss advanced logic or SSRF	Fast scans	Has custom rules, but dev teams must maintain them
SonarQube	Low – misses serious issues in our testing	Limited – mostly pattern-based, code quality oriented	Basic coverage for standard vulns, many hotspots require manual review	Moderate, usually in CI	Dashboard-based approach, can pass “quality gate” despite real vulns

Vulnerability Class	Snyk (partial)	GitHub (CodeQL) (partial)	Semgrep	SonarQube	DryRun Security
SQL Injection
Cross-Site Scripting (XSS)
SSRF
Auth Flaw / IDOR
User Enumeration
Hardcoded Token

Tool	Accuracy of Findings	Detects Non-Pattern-Based Issues?	Coverage of C# Vulnerabilities	Scan Speed	Developer Experience
DryRun Security	Very high – caught all critical flaws missed by others	Yes – context-based analysis finds logic errors, auth flaws, etc.	Broad coverage of OWASP Top 10 vulns plus business logic issues	Near real-time (PR comment within seconds)	Clear single PR comment with detailed insights; no config or custom scripts needed
Snyk Code	High on known patterns (SQLi, XSS), but misses logic/flow bugs	Limited – focuses on recognizable vulnerability patterns	Good for standard vulns; may miss SSRF or auth logic issues	Fast (integrates into PR checks)	Decent GitHub integration, but rules are a black box (no easy customization)
GitHub Advanced Security (CodeQL)	Low – missed everything except SQL Injection	Mostly pattern-based	Low – only discovered SQL Injection	Slowest of all but finished in 1 minute	Concise annotation with a suggested fix and optional auto-remediation
Semgrep	Medium – finds common issues with community rules, some misses	Primarily pattern-based, limited data flow analysis	Decent coverage with the right rules; misses advanced logic flaws	Very fast (runs as lightweight CI)	Custom rules possible, but require maintenance and security expertise
SonarQube	Low – missed serious issues in our testing	Mostly pattern-based (code quality focus)	Basic coverage for known vulns; many issues flagged as “hotspots” require manual review	Moderate (runs in CI/CD pipeline)	Results in dashboard; risk of false sense of security if quality gate passes despite vulnerabilities

Vulnerability Class	Snyk Code	GitHub Advanced Security (CodeQL)	Semgrep	SonarQube	DryRun Security
SQL Injection (SQLi)
Cross-Site Scripting (XSS)
Server-Side Request Forgery (SSRF)
Auth Logic/IDOR
User Enumeration
Hardcoded Credentials

Vulnerability	DryRun Security	Semgrep	GitHub CodeQL	SonarQube	Snyk Code
1. Remote Code Execution via Unsafe Deserialization
2. Code Injection via eval() Usage
3. SQL Injection in a Raw Database Query
4. Weak Encryption (AES ECB Mode)
5. Broken Access Control / Logic Flaw in Authentication
Total Found	5/5	3/5	1/5	0/5

Vulnerability	DryRun Security	Snyk	CodeQL	SonarQube	Semgrep
Server-Side Request Forgery (SSRF) (Hotspot)
Cross-Site Scripting (XSS)
SQL Injection (SQLi)
IDOR / Broken Access Control
Broken Authentication Logic
Invalid Token Validation Logic
Broken Email Verification Logic

Dimension	Why It Matters
Surface	Entry points & data sources highlight tainted flows early.
Language	Code idioms reveal hidden sinks and framework quirks.
Intent	What is the purpose of the code being changed/added?
Design	Robustness and resilience of changing code.
Environment	Libraries, build flags, infra metadata, and infrastructure as code (IaC) all give clues about the risks in changing code.

KPI	Pattern-Based SAST	DryRun CSA
Mean Time to Regex	3–8 hrs per noisy finding set	Not required
Mean Time to Context	N/A	< 1 min
False-Positive Rate	50–85%	< 5%
Logic-Flaw Detection	< 5 %	90%+

Severity	Critical	High
Location	utils/authorization.py :L118	utils/authorization.py :L49 & L82 & L164
Issue	JWT Algorithm Confusion Attack: jwt.decode() selects the algorithm from unverified JWT headers.	Insecure OIDC Endpoint Communication: urllib.request.urlopen called without explicit TLS/CA handling.
Impact	Complete auth bypass (switch RS256→HS256, forge tokens with public key as HMAC secret).	Susceptible to MITM if default SSL behavior is weakened or cert store compromised.
Remediation	Replace the dynamic algorithm selection with a fixed, expected algorithm list. Change line 118 from algorithms=[unverified_header.get('alg', 'RS256')] to algorithms=['RS256'] to only accept RS256 tokens. Add algorithm validation before token verification to ensure the header algorithm matches expected values.	Create a secure SSL context using ssl.create_default_context() with proper certificate verification. Configure explicit timeout values for all HTTP requests to prevent hanging connections. Add explicit SSL/TLS configuration by creating an HTTPSHandler with the secure SSL context. Implement proper error handling specifically for SSL certificate validation failures.
Key Insight	This vulnerability arises from trusting an unverified portion of the JWT to determine the verification method itself.	This vulnerability stems from a lack of explicit secure communication practices, leaving the application reliant on potentially weak default behaviors.
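The two remediations in the table above can be sketched with nothing but the Python standard library. This is a minimal illustration, not the code from utils/authorization.py: the helper names (header_algorithm, require_expected_algorithm, build_secure_opener, fetch_oidc_document, make_demo_token) are hypothetical, and the algorithm check only covers the "validate the header before verification" step. With a JWT library, the signature check itself would still run with the fixed algorithms=['RS256'] list quoted in the remediation.

```python
import base64
import json
import ssl
import urllib.error
import urllib.request

# Fixed allowlist: the accepted algorithm is never taken from the token itself.
ALLOWED_ALGORITHMS = ("RS256",)


def header_algorithm(token: str) -> str:
    """Read the 'alg' field from a JWT header without trusting it."""
    header_segment = token.split(".")[0]
    # JWT segments are unpadded base64url; restore the padding before decoding.
    padded = header_segment + "=" * (-len(header_segment) % 4)
    header = json.loads(base64.urlsafe_b64decode(padded))
    return str(header.get("alg", ""))


def require_expected_algorithm(token: str) -> None:
    """Reject a token whose header names an algorithm outside the allowlist."""
    alg = header_algorithm(token)
    if alg not in ALLOWED_ALGORITHMS:
        raise ValueError(f"unexpected JWT algorithm: {alg!r}")


def build_secure_opener():
    """Return a urllib opener, and its context, with explicit certificate checks."""
    context = ssl.create_default_context()  # verifies certificates and hostnames
    handler = urllib.request.HTTPSHandler(context=context)
    return urllib.request.build_opener(handler), context


def fetch_oidc_document(url: str, timeout: float = 5.0) -> bytes:
    """Fetch a document over HTTPS with an explicit timeout (hypothetical helper)."""
    opener, _ = build_secure_opener()
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as exc:
        # Surface certificate failures separately from other network errors.
        if isinstance(exc.reason, ssl.SSLCertVerificationError):
            raise RuntimeError(f"TLS certificate validation failed for {url}") from exc
        raise


def make_demo_token(alg: str) -> str:
    """Build an unsigned, token-shaped string for the demo (hypothetical helper)."""
    def encode(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    header = encode({"alg": alg, "typ": "JWT"})
    payload = encode({"sub": "demo"})
    return f"{header}.{payload}.sig"
```

Calling require_expected_algorithm(make_demo_token("HS256")) raises ValueError, while a token whose header says RS256 passes through to the real signature verification step.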

AI in AppSec • October 23, 2025

How We Turned Natural Language Into a Scalable Agentic AppSec Engine

When we first started experimenting with what would eventually become our Custom Policy Agent, the idea sounded deceptively simple:

What if you could ask a question about a pull request in plain English, something like “Does this change modify authentication logic?” and get an accurate answer?

At the time, we called them Behavioral Questions. They were defined in YAML, took a question and a bit of code context, and ran against pull requests. It was an early prototype of what we now call Natural Language Code Policies, and it was both exciting and wildly unstable. We even secured a spot as Black Hat Startup Finalists 2024 for this invention!

The Early Days: Simple Idea, Hard Problems

Those first iterations worked… sometimes. But we quickly ran into the kind of challenges that make or break a system like this.

Repeatability: The same question asked twice could produce slightly different answers.
Accuracy: Results drifted depending on the complexity of the code or question being asked.
Review Planning: Analysis runs were messy, with early AI-driven coordination and little human oversight.
Speed: Queueing and execution were not terrible, but they needed to be much faster.
Limited Context: The system only saw code inside the PR. That meant it missed dependencies, helper functions, or related logic outside the diff.

It was enough to prove the concept, but not enough to scale. We needed a foundation that could deliver consistent, high-fidelity results across any repo, language, or environment.

Evolution: From Behavioral Questions to Natural Language Code Policies

The next phase was a big leap. We rebuilt Behavioral Questions as Natural Language Code Policies (NLCPs) and moved the whole experience into our dashboard.

That shift mattered because it gave users real control instead of a YAML file they had to maintain. Policies became much more capable and intuitive, with:

Rich background context: Why this check matters, what files and folders to exclude, good and bad examples, documentation, and anything else that helps inform the analysis.
Custom remediation guidance: Policy builders can attach their organization’s specific mitigation strategy, which is shown as guidance when the issue is flagged in a code change.
Policy Dashboard: A visual interface for building, testing, and managing NLCPs all in one place.
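To make the pieces of a policy concrete, here is a rough sketch of how a question, its background context, exclusions, examples, and remediation guidance might be held together in code. The CodePolicy class and every field name are hypothetical and chosen for illustration only; this is not DryRun's actual policy schema.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class CodePolicy:
    """Hypothetical shape of a natural-language policy (illustration only)."""

    question: str
    background: str = ""
    exclude_globs: list[str] = field(default_factory=list)
    good_examples: list[str] = field(default_factory=list)
    bad_examples: list[str] = field(default_factory=list)
    remediation_guidance: str = ""

    def applies_to(self, path: str) -> bool:
        """Return False for files the policy author chose to exclude."""
        return not any(fnmatch(path, pattern) for pattern in self.exclude_globs)

    def finding_comment(self, path: str) -> str:
        """Build the developer-facing message shown when the policy flags a file."""
        comment = f"{path}: flagged by policy '{self.question}'"
        if self.remediation_guidance:
            comment += f"\nGuidance: {self.remediation_guidance}"
        return comment


policy = CodePolicy(
    question="Does this change modify authentication logic?",
    background="Authentication changes need a second reviewer.",
    exclude_globs=["tests/*"],
    remediation_guidance="Request a review from the security team.",
)
```

With this policy, policy.applies_to("tests/test_auth.py") is False because of the exclusion, and any comment for a flagged file carries the organization's own guidance text.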

We also rebuilt the evaluation layer from the ground up to ensure accuracy and repeatability. Under the hood, we focused heavily on orchestration: better queueing, improved notifications, and streamlined concurrency so that policies ran faster and more predictably.

Confidence Through Testing: The Policy Builder

Once NLCPs lived in the dashboard, we wanted users to trust them before deployment.
So we built the Policy Builder, a place to point policies at specific repos and pull requests to see how they behaved in real conditions.

It let users verify that:

The policy is working as intended
The results are consistent
Execution is fast enough for production use
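Those three checks can be pictured as a small test harness. The sketch below is an assumption about how such a harness could look, not the Policy Builder's implementation: check_policy and stub_evaluate are hypothetical names, and the stub stands in for the real model-backed evaluation with a simple keyword match.

```python
import time
from collections import Counter
from typing import Callable


def stub_evaluate(policy: str, diff: str) -> bool:
    """Stand-in evaluator (assumption): a keyword match instead of real analysis."""
    return "auth" in diff.lower()


def check_policy(
    evaluate: Callable[[str, str], bool],
    policy: str,
    diff: str,
    expected: bool,
    runs: int = 5,
    max_seconds: float = 2.0,
) -> dict:
    """Run the same policy several times and report the three checks."""
    answers = []
    started = time.perf_counter()
    for _ in range(runs):
        answers.append(evaluate(policy, diff))
    elapsed = time.perf_counter() - started

    counts = Counter(answers)
    majority, _ = counts.most_common(1)[0]
    return {
        "works_as_intended": majority == expected,     # matches what the author expects
        "consistent": len(counts) == 1,                # every run gave the same answer
        "fast_enough": elapsed / runs <= max_seconds,  # average time per run
    }


report = check_policy(
    stub_evaluate,
    "Does this change modify authentication logic?",
    "+ def check_auth(user): ...",
    expected=True,
)
```

An evaluator that alternates between answers would come back with "consistent" set to False, which is the repeatability problem described earlier.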

This was one of those simple-but-critical steps that turned experimentation into adoption across our user base.

The Agent Era: Expanding Context and Intelligence

Even with all that progress, one limitation kept surfacing. Sometimes, the PR itself wasn’t enough. Developers and security reviewers often need broader code context to understand what a change means in the larger system.

That’s when we introduced our first agent, a component capable of fetching relevant code from the wider repository for ephemeral analysis.

It gave each policy deeper insight into how a change fit into the surrounding codebase.
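One simple way to picture that kind of context expansion is to follow a changed file's imports to other files in the repository. The sketch below assumes a Python codebase and an in-memory repo mapping of path to source; imported_modules and gather_context are hypothetical helpers, and the real agent's retrieval strategy is not described here.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Collect the module names a piece of Python source imports."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return modules


def gather_context(changed_path: str, repo: dict[str, str]) -> dict[str, str]:
    """Return repository files, outside the diff, that the changed file imports.

    `repo` maps file paths to source text. The result lives only in memory,
    so nothing is persisted after the analysis finishes.
    """
    context: dict[str, str] = {}
    for module in imported_modules(repo[changed_path]):
        candidate = module.replace(".", "/") + ".py"
        if candidate in repo and candidate != changed_path:
            context[candidate] = repo[candidate]
    return context


demo_repo = {
    "app/views.py": "import json\nfrom utils.authorization import verify\n",
    "utils/authorization.py": "def verify(token):\n    return bool(token)\n",
}
related = gather_context("app/views.py", demo_repo)
```

Here related contains utils/authorization.py even though only app/views.py changed, while the standard-library import json is ignored because it is not a file in the repository.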

From there, we kept expanding:

A just-in-time research agent that could look up information about frameworks, languages, or vulnerabilities from approved documentation and allow-listed websites.
An SCA agent that checks for Common Vulnerabilities and Exposures (CVEs) tied to dependencies.
A license agent that validates open-source license compliance.
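Two of the guardrails in that list are easy to sketch: restricting research to allow-listed sites and checking dependency licenses against an approved set. The host list, license list, and function names below are assumptions made up for illustration, not DryRun's configuration.

```python
from urllib.parse import urlparse

# Example allow-list and license policy (assumptions for this sketch).
ALLOWED_HOSTS = {"docs.python.org", "owasp.org"}
ALLOWED_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause"}


def is_allowed_source(url: str) -> bool:
    """Allow only HTTPS URLs whose exact hostname is on the allow-list."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS


def license_violations(dependencies: dict[str, str]) -> list[str]:
    """Return the names of dependencies whose license is not approved."""
    return sorted(
        name for name, license_id in dependencies.items()
        if license_id not in ALLOWED_LICENSES
    )
```

Matching on the exact hostname matters: a lookalike such as owasp.org.evil.example is rejected because it is a different host, not a path under the approved one.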

Each agent added more depth, accuracy, and autonomy to our analyses. Together, they turned NLCPs into something much smarter than a static rules engine.

We Thought Natural Language Was Enough. We Were Wrong.

Our next lesson was a human one.

We assumed that using natural language meant people could easily write their own policies. It turns out… not really.

Everyone writes questions differently. Some are verbose, some vague, and some mix intent with background story. LLMs can handle a lot, but ambiguity still creates drift.

So we built an AI-powered Policy Assistant to help.

It walks users through creating policies by asking precise, context-specific questions and clarifying what they want to detect, what context matters, and what feedback should be shown to developers.

By the end of the conversation, the user has a ready-to-run, testable policy that’s been engineered for accuracy and clarity.

Collaboration and Creativity: The Policy Library

As our customer base grew, we started seeing patterns: different teams were solving the same kinds of problems in similar ways.

That insight led to the Policy Library, where users can use or customize policies depending on their goals and coding environment.

It’s a set of shared, pre-built policies that security and engineering teams can pick from to immediately begin experimenting, iterating, and learning from each other, in a community-driven layer built right into the product.

Custom Policy Agent: Where We Are Now

At this point, the term “NLCP” no longer fits all that we’re doing for policies. It’s no longer just natural-language prompts.

It’s a Custom Policy Agent that calls sub-agents on demand: an adaptive, autonomous system driven by your policy and capable of reaching beyond the code in the PR to reason about behavior and intent across any technology stack.

Our latest iteration introduced something we’re especially proud of:
Larger agents create execution plans for smaller, focused sub-agents, which means reviews are faster and more precise.

Each analysis involves reasoning, strategy, and decision-making, and it is all orchestrated in seconds. The smaller agents handle targeted inspection and coordination, while the larger ones perform deep inspection and synthesis. The result is faster, more accurate, and more consistent analysis than ever before.
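The plan-then-dispatch pattern can be sketched in a few lines. Everything here is a simplified assumption rather than the Custom Policy Agent itself: plan stands in for the larger agent with plain keyword rules, the two inspectors stand in for focused sub-agents, and the names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

SubAgent = Callable[[str], list[str]]


def auth_inspector(diff: str) -> list[str]:
    """Focused sub-agent (stand-in): looks only at authentication changes."""
    return ["authentication logic changed"] if "login" in diff else []


def secret_inspector(diff: str) -> list[str]:
    """Focused sub-agent (stand-in): looks only for hardcoded tokens."""
    return ["possible hardcoded token"] if "token =" in diff else []


SUB_AGENTS: dict[str, SubAgent] = {
    "auth": auth_inspector,
    "secrets": secret_inspector,
}


def plan(policy: str) -> list[str]:
    """Stand-in for the larger agent: decide which sub-agents the policy needs."""
    text = policy.lower()
    steps = []
    if "auth" in text:
        steps.append("auth")
    if "secret" in text or "token" in text:
        steps.append("secrets")
    return steps


def review(policy: str, diff: str) -> list[str]:
    """Run the planned sub-agents concurrently and merge their findings in plan order."""
    steps = plan(policy)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda name: SUB_AGENTS[name](diff), steps))
    return [finding for findings in results for finding in findings]


findings = review(
    "Flag authentication changes and hardcoded tokens",
    'def login(user):\n    token = "abc"\n',
)
```

Because the plan only names the sub-agents a policy actually needs, a policy about something else runs none of them, which is one plausible reason a planned review can be both faster and more precise.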


Looking Back, and Ahead

Competitors are now developing their own early, incomplete versions of this system, which we see as a significant compliment and a strong validation of our approach to the problem.

It just makes sense.

Why spend time writing brittle rules for each stack that can only match surface-level patterns when you can speak to a system in plain language, detect complex logic and authorization issues, and apply it anywhere?

That’s what our Custom Policy Agent delivers.

It started as a YAML file and a question.

Now, it is an intelligent, adaptive, agentic layer that helps AppSec teams reason about code the way developers do, in context.

Learn more: Six technologies in the DryRun Custom Policy Agent that make policies easier to adopt and deliver high accuracy.

Ken Johnson

Co-founder & CTO
