
Why We Trained Our Own Security Finding Classifier


(And Stopped Trusting Frontier Models to Do It)

Our code review agent produces thousands of security findings. Each one needs a label (sqli, xss, authz_missing, non_security_nitpick) drawn from a taxonomy of 54 vulnerability classes. Getting that label right isn’t academic. It’s the linchpin of everything downstream: filtering noisy results, applying customer-specific suppression rules, and feeding our agent’s learning system so it gets smarter about a customer’s codebase over time. A misclassified finding doesn’t just look wrong in a report. It poisons the feedback loop.

We tried three fundamentally different strategies before landing on the one that worked. This post walks through each, what we learned, and why we ultimately fine-tuned a 350-million-parameter open-source model that runs in 21 milliseconds per classification.

The Classification Problem

To understand why this matters, it helps to know how our analysis pipeline is structured. DryRun runs multiple specialized analyzers. Some are what we call “high-signal analyzers,” focused on well-understood vulnerability classes like SQL injection or XSS. These analyzers already know what they’re looking for. When they flag a finding, the classification is implicit: the analyzer that found it tells you what it is. The patterns are well-defined, and the language and framework signals our agents collect make the classification straightforward.

Then there’s the General Security Analyzer, which reviews pull requests in real time, and our Deep Scan agent, which performs full-repository security analysis. Both are research agents that find vulnerabilities outside the scope of classical, well-categorized issues: missing authorization checks, business logic flaws, insecure defaults, supply chain risks, prompt injection, IDOR, and the long tail of security issues that don’t fit neatly into a pattern-matching box. Both also run high-signal checks for known vulnerability classes, but their real value is in everything else.

This is where classification gets hard and where it matters most. A finding arrives as a title, description, and source filename. Something like:

The classifier’s job: map this to one of 54 labels. Some are obvious (sqli). Many are not. Is a missing authorization check authz_missing or idor? Is a weak default configuration insecure_defaults or a non_security_nitpick? These ambiguities aren’t edge cases. They’re the norm for the General Security Analyzer’s output. Security finding classification sits in a space where categories overlap by nature, and the boundaries between them are genuinely contested even among human reviewers.
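
To make the shape of the problem concrete, here is a minimal sketch of the input and output. The field names, label subset, and toy heuristic are illustrative assumptions, not DryRun’s actual schema or classifier:

```python
from dataclasses import dataclass

# Hypothetical shapes for illustration -- field names and the label subset are
# assumptions, not DryRun's actual schema or taxonomy.
@dataclass
class Finding:
    title: str
    description: str
    filename: str

LABELS = {"sqli", "xss", "authz_missing", "idor", "insecure_defaults",
          "non_security_nitpick", "other"}  # 7 of the 54 classes

def classify(finding: Finding) -> str:
    """Stand-in for the real classifier: must return exactly one label."""
    text = f"{finding.title} {finding.description}".lower()
    if "sql" in text:
        return "sqli"
    if "authorization" in text or "permission" in text:
        return "authz_missing"
    return "other"

finding = Finding(
    title="Missing permission check on admin endpoint",
    description="The /admin/export route does not verify the caller's role.",
    filename="app/routes/admin.py",
)
print(classify(finding))  # -> authz_missing
```

The hard part is everything the toy heuristic can’t do: the same description could plausibly be authz_missing or idor, and the classifier has to pick one, consistently.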

Why Classification Accuracy Matters So Much

Classification isn’t a reporting nicety. It sits on the hot path for multiple critical behaviors in our system.

First, it lets us apply category-specific controls to reduce noise. If you know a finding belongs to a particular vulnerability family, you can route it through the right suppression rules, ranking logic, and post-processing behavior. That’s much harder when every finding is just unstructured free text. When a customer tells us “stop flagging informational disclosure findings in our test fixtures,” that rule is only as good as the label. If information_disclosure findings are being tagged as insecure_defaults, the rule doesn’t fire and the customer sees noise.
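
The failure mode above can be sketched in a few lines. The rule shape and helper are hypothetical, not DryRun’s actual configuration format:

```python
# Hypothetical suppression-rule sketch: a rule only fires when the label is
# right. Rule shape and field names are illustrative.
SUPPRESSION_RULES = [
    # "stop flagging informational disclosure findings in our test fixtures"
    {"label": "information_disclosure", "path_prefix": "tests/fixtures/"},
]

def is_suppressed(label: str, filename: str) -> bool:
    return any(
        label == rule["label"] and filename.startswith(rule["path_prefix"])
        for rule in SUPPRESSION_RULES
    )

# Correctly labeled: the rule fires and the noise is filtered out.
print(is_suppressed("information_disclosure", "tests/fixtures/users.json"))  # True
# Mislabeled as insecure_defaults: the identical finding slips through.
print(is_suppressed("insecure_defaults", "tests/fixtures/users.json"))       # False
```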

Second, classification is one of the inputs to our feedback and learning systems. Over time, we want the agent to adapt to customer repositories, recurring architecture patterns, and organization-specific expectations. That only works if the system can consistently recognize what kind of issue it’s dealing with in the first place. We’ll write more about the learning system in a future post, but the short version is: garbage labels in, garbage behavior out.

Third, classification creates the foundation for meaningful aggregation. Secure code review outputs become much more useful when you can reason in terms of patterns — recurring authorization failures, repeated information disclosure issues, rising supply chain concerns. Without stable labels, none of that analysis is possible.

In other words, if classification is unstable, everything downstream gets noisier.

Attempt 1: Just Ask a Frontier Model

The most obvious approach: send each finding to a large language model along with the full label taxonomy and let it classify. We tested this with frontier-class models including Codex 5.1 Max. The prompt included all 54 label definitions, decision criteria for ambiguous cases, and one example per label.
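
A rough sketch of what that prompt construction looks like, with a three-label toy taxonomy standing in for the full 54 (the label definitions and wording here are assumptions, not our production prompt):

```python
# Toy taxonomy standing in for the 54-label production one.
TAXONOMY = {
    "sqli": "User-controlled input reaches a SQL query without parameterization.",
    "authz_missing": "A sensitive operation lacks an authorization check.",
    "non_security_nitpick": "Code-quality feedback with no security impact.",
}

def build_prompt(title: str, description: str, filename: str) -> str:
    label_block = "\n".join(f"- {name}: {desc}" for name, desc in TAXONOMY.items())
    return (
        "Classify the security finding into exactly one label.\n"
        f"Labels:\n{label_block}\n\n"
        f"Finding:\ntitle: {title}\ndescription: {description}\nfile: {filename}\n"
        "Answer with the label only."
    )

prompt = build_prompt(
    "Unparameterized query", "String-formatted SQL in the orders view.", "orders.py"
)
# This prompt would then go to the LLM at temperature 0 -- and even so,
# boundary cases came back with different labels on different runs.
print(prompt.splitlines()[0])
```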

It worked — sort of. Accuracy ranged from 78% to 85% depending on the run, which is the core of the problem.

Run the same batch of findings twice, and you’d get different labels for a meaningful percentage of them.

The model would waffle on boundary cases: calling something authz_missing on Monday and idor on Tuesday. Temperature 0 helped but didn’t eliminate the variance, especially on the ambiguous cases that matter most.

This non-determinism is a dealbreaker for a system that needs to apply rules and learn from its own output. If the same finding can get two different labels depending on when you ask, your entire downstream pipeline is built on sand.

There were also practical concerns. Every classification required an API call — added latency, added cost, and an external dependency in a critical path. At the volumes we process, those API costs add up fast. And any connectivity hiccup means classifications stall.

Attempt 2: Vector Similarity With RAG

Our next idea: build a reference library. We classified a large set of findings, embedded them into a vector store, and at classification time retrieved the most similar examples. The intuition was sound—if you’ve seen a finding like this before and labeled it the same way, the new one is probably in the same category too.

We tried two variants. Pure vector similarity used nearest-neighbor voting: embed the finding, pull the top-K neighbors from the index, and if enough neighbors agree on a label, use it. We experimented with both local FAISS indices (using OpenAI embeddings) and Pinecone (using AWS Bedrock Titan embeddings) to see if the embedding model or index implementation mattered.
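
The nearest-neighbor voting variant reduces to a few lines. This is a toy sketch with made-up 2-d vectors; the production system used FAISS or Pinecone over real embedding models:

```python
import math

# Toy reference store: (embedding, label) pairs. Real embeddings came from
# OpenAI or Bedrock Titan models; these 2-d vectors are stand-ins.
REFERENCE = [
    ([1.0, 0.0], "sqli"),
    ([0.9, 0.1], "sqli"),
    ([0.0, 1.0], "xss"),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def knn_vote(query, k=2, min_agree=2):
    ranked = sorted(REFERENCE, key=lambda ex: cosine(ex[0], query), reverse=True)
    votes = [label for _, label in ranked[:k]]
    best = max(set(votes), key=votes.count)
    # When the nearest neighbors disagree, fall back to the catch-all --
    # which is exactly why this approach said "other" far too often.
    return best if votes.count(best) >= min_agree else "other"

print(knn_vote([0.95, 0.05]))  # clear-cut neighborhood -> sqli
print(knn_vote([0.10, 0.90]))  # split vote            -> other
```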

The pure vector approach was fast and consistent from run to run, but it struggled with anything novel or ambiguous. It was essentially a lookup table with fuzzy matching, and when the closest examples in the store were themselves borderline cases, the voting just amplified the confusion. It returned other far too often on findings that a human would confidently label.

The RAG variant improved things. Instead of voting, we retrieved similar examples and sent them as context to a frontier model, letting the LLM make the final call with grounded references. This handled novel findings better—the model could reason about the examples rather than just count votes. Accuracy was more consistent than the frontier model approach, landing around 80%, but it still wasn’t where we needed it.
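
Sketched out, the RAG variant swaps the vote for a grounded prompt. Everything here is illustrative: retrieve() stands in for the vector-store query, and keyword overlap stands in for embedding similarity:

```python
def retrieve(finding: str, k: int = 2):
    # Stand-in reference store; the real one held embedded labeled findings.
    store = [
        ("String-concatenated SQL in reports view", "sqli"),
        ("Unescaped template output in profile page", "xss"),
        ("Role check missing on delete route", "authz_missing"),
    ]
    # Trivial word-overlap score in place of real embedding similarity.
    def score(text):
        return len(set(text.lower().split()) & set(finding.lower().split()))
    return sorted(store, key=lambda ex: score(ex[0]), reverse=True)[:k]

def build_grounded_prompt(finding: str) -> str:
    examples = "\n".join(f"- {text} => {label}" for text, label in retrieve(finding))
    return (f"Similar labeled findings:\n{examples}\n\n"
            f"New finding: {finding}\nAnswer with one label.")

grounded = build_grounded_prompt("SQL built by string concatenation in orders view")
print(grounded)
```

The LLM still makes the final call, which is why this variant inherited the frontier model’s run-to-run inconsistency even as accuracy stabilized.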

But it still inherited the core problems: API dependency, per-request cost, and inconsistency. The LLM was better-grounded now, but it was still an LLM making judgment calls that could vary between runs. And we’d added the cost and complexity of maintaining a vector store on top of the API calls.

Attempt 3: Fine-Tuning a Small Model

Here’s where things got interesting. We decided to try fine-tuning a small, open-source model specifically for our classification task. The hypothesis: a model that’s seen thousands of labeled examples of exactly this task should outperform a general-purpose model reasoning from descriptions alone, and it should do so without the run-to-run variation we kept hitting.

Choosing the Base Model

Picking a base model wasn’t a single decision — it was a tournament. We started by evaluating several models in the 1B to 9B parameter range on our classification task without any fine-tuning, just prompting with the full label taxonomy. We wanted to see which architectures had the strongest baseline intuition for code and security-related classification.

The Granite family stood out. IBM’s Granite models consistently outperformed comparably-sized alternatives on code-related and security-adjacent tasks in the untrained evaluations — they seemed to have a better innate sense of what security findings are describing, likely owing to IBM’s focus on enterprise and code-heavy training data.

From there, we fine-tuned three candidates head-to-head: IBM’s Granite 4.0 at 350M parameters, Granite at 1B, and Qwen3 at 4B. The expectation was that larger models would win — more parameters, more capacity, better accuracy. That’s not what happened. The 350M Granite model matched or beat both larger models after fine-tuning on our balanced training set.

Our hypothesis for why: the classification task is well-constrained. The input is structured (title, description, filename), the output is a single token from a fixed set of 54 labels, and the decision boundary is defined by the training data. A 350M-parameter model with a hybrid Transformer+Mamba Mixture-of-Experts architecture has more than enough capacity for this task, and the smaller model may actually benefit from having fewer parameters to overfit with.

The practical implications sealed the decision:

  • Small enough to serve on a single GPU or an Apple Silicon laptop
  • Apache 2.0 licensed — no enterprise licensing headaches
  • The hybrid architecture handles long context efficiently, which matters because our system prompt includes all 54 label definitions
  • At 350M parameters, it’s roughly 1,000x smaller than the frontier models we’d been calling via API

Building the Training Dataset

The dataset pipeline had three stages.

Stage one: labeling real findings as seeds. We pulled unique security findings from our production database and classified them using a frontier model plus human review, with the full taxonomy. These labeled real findings weren’t the training data themselves — they were the seed corpus. We needed a large, accurately labeled reference set so we could understand the distribution of vulnerability types in the wild and have high-quality examples to drive synthetic generation.

Stage two: building a balanced synthetic training set. Our real-world distribution has a massive long tail — information_disclosure findings vastly outnumber csv_injection. Training directly on this distribution would produce a model that’s great at common categories and useless at rare ones. So we used the labeled real findings as seeds to generate synthetic training data. For each label, we sampled real examples as context and asked a frontier model to produce realistic variations — findings that read like something our analyzers would actually produce, but with enough variation to teach the model the category rather than memorize specific phrasings. Overrepresented labels were capped; underrepresented labels were synthesized up to a minimum threshold. Every synthetic sample was tagged so we could track its provenance.
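
The cap-and-synthesize pass can be sketched as follows. The caps are made-up numbers, and synthesize() is a placeholder for the frontier-model generation step:

```python
import random

# Illustrative thresholds -- not the production values.
MAX_PER_LABEL = 4
MIN_PER_LABEL = 3

def synthesize(seed_examples, n):
    # Placeholder: the real pipeline prompts a frontier model with sampled
    # seeds and asks for realistic variations, tagging each as synthetic.
    return [{"text": f"variation of: {random.choice(seed_examples)['text']}",
             "synthetic": True} for _ in range(n)]

def balance(corpus_by_label):
    balanced = {}
    for label, examples in corpus_by_label.items():
        if len(examples) > MAX_PER_LABEL:        # cap the overrepresented head
            balanced[label] = random.sample(examples, MAX_PER_LABEL)
        elif len(examples) < MIN_PER_LABEL:      # synthesize up the long tail
            extra = synthesize(examples, MIN_PER_LABEL - len(examples))
            balanced[label] = examples + extra
        else:
            balanced[label] = list(examples)
    return balanced

corpus = {
    "information_disclosure": [{"text": f"finding {i}"} for i in range(10)],
    "csv_injection": [{"text": "formula injected into CSV export"}],
}
out = balance(corpus)
print({k: len(v) for k, v in out.items()})
# -> {'information_disclosure': 4, 'csv_injection': 3}
```

The provenance tag ("synthetic": True) is what lets us guarantee later that no synthetic sample leaks into evaluation.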

Stage three: stratified evaluation split. Critically, our test set is drawn from real findings, not synthetic ones. We held out a stratified portion of the labeled real findings for evaluation, preserving label proportions. Labels with very few real samples were collapsed into other. This means the model is trained on synthetic data but evaluated on production findings — which is the only honest way to measure whether the synthetic training actually generalizes.
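
A minimal sketch of that split logic, with an illustrative holdout fraction and rarity cutoff (the production values differ, and rare labels are pooled per-bucket here for simplicity):

```python
from collections import defaultdict

# Illustrative numbers -- not the production cutoffs.
MIN_REAL_SAMPLES = 5
HOLDOUT_FRACTION = 0.2

def stratified_holdout(findings):
    """findings: list of (text, label) pairs drawn from real production data."""
    by_label = defaultdict(list)
    for text, label in findings:
        by_label[label].append(text)
    train_set, test_set = [], []
    for label, texts in by_label.items():
        # Labels with too few real samples collapse into the catch-all.
        if len(texts) < MIN_REAL_SAMPLES:
            label = "other"
        # Hold out a proportional slice per label, preserving the distribution.
        n_test = max(1, int(len(texts) * HOLDOUT_FRACTION))
        test_set += [(t, label) for t in texts[:n_test]]
        train_set += [(t, label) for t in texts[n_test:]]
    return train_set, test_set

findings = [(f"sqli finding {i}", "sqli") for i in range(10)] \
         + [("formula in export", "csv_injection")]
train_set, test_set = stratified_holdout(findings)
print(len(train_set), len(test_set))  # -> 8 3
```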

Training

We fine-tuned with QLoRA (quantized Low-Rank Adaptation) using the Unsloth library. Rather than updating all 365 million parameters, LoRA trains small adapter matrices — about 13 million parameters, roughly 3.6% of the total model. This keeps training fast and memory-efficient.

The key architectural decision was train_on_responses_only. Each training sample is a chat-format conversation: system prompt with all 54 labels, user message with the finding, assistant response with just the label. Without response-only training, ~95% of the gradient signal gets wasted on learning to reproduce the static system prompt. Masking the loss on system and user tokens focuses all learning capacity on the actual classification decision.
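
The masking itself is simple; a sketch with fake token ids (the real pipeline uses Unsloth’s train_on_responses_only over tokenized chat templates):

```python
# The label index that PyTorch's cross-entropy loss is configured to skip.
IGNORE_INDEX = -100

def mask_prompt_tokens(token_ids, response_start):
    """Copy token_ids into labels, hiding everything before the response."""
    labels = list(token_ids)
    for i in range(response_start):
        labels[i] = IGNORE_INDEX
    return labels

# 10 prompt tokens (system + user) followed by 2 response tokens (the label).
token_ids = list(range(101, 113))
labels = mask_prompt_tokens(token_ids, response_start=10)
print(labels)
# -> [-100, -100, ..., -100, 111, 112]
# Only the last two positions carry gradient signal; without the mask, most of
# it would be spent re-learning the static 54-label system prompt.
```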

Training ran for 3 epochs on an H100, with a learning rate of 2e-4 and effective batch size of 32. Total wall-clock time was measured in minutes, not hours. Final training loss: 0.1164.

After training, we merged the LoRA adapters back into the base weights and exported at 16-bit precision. The result is a self-contained model — no adapter loading, no base model dependency at inference time.

Results

We benchmarked all seven approaches (frontier model prompting, two vector similarity variants, two RAG variants, the fine-tuned model, and a hybrid deterministic+fine-tuned approach) against the same held-out test set.

The fine-tuned Granite model hit 86.6% accuracy across all 54 categories at 21 milliseconds per sample with zero errors across the full test set.

To put those numbers in context:

Metric | Frontier LLM | Vector/RAG | Fine-tuned Granite
Accuracy | 78–85% (varied) | ~80% | 86.6%
Speed | Seconds (API) | Seconds (API + retrieval) | 21 ms
Cost per 1K | $0.15–0.60 | $0.10–0.40 | ~$0.01
Consistency | Varies between runs | Varies between runs | No variation observed
Requires internet | Yes | Yes | No

The accuracy beats the frontier model’s best runs and is stable across every invocation. The speed, cost, and consistency advantages are not close.

Where It Excels

Categories with clear, distinctive patterns scored above 0.95 F1: path traversal, remote code execution, resource exhaustion, excessive privileges, open CORS, XXE. These findings have characteristic language that the model locks onto reliably.

The access control family—idor, authz_missing, excessive_privileges—all scored above 0.88 F1, which is notable because these are genuinely hard to distinguish even for human reviewers.

Where It Struggles

The hardest categories fall into three clusters.

The security-vs-nitpick boundary. non_security_nitpick, insecure_defaults, and business_logic_flaw form a confusion cluster. The model correctly identifies ~73% of non-security nitpicks, but the boundary between “bad practice with no real security impact” and “misconfiguration that could be exploited” is legitimately fuzzy. This is the same boundary human security engineers argue about.

The dependency family. supply_chain_risk, vulnerable_dependency, and language_version overlap significantly. A finding about an outdated package could reasonably be any of these. The model prefers supply_chain_risk (the broader label) when uncertain, which is arguably the safer default.

The catch-all. other scored 0.49 F1 — expected for a category that’s defined as “none of the above.” It’s inherently noisy and mostly functions as a classifier confidence signal: when the model says other, it’s expressing uncertainty.

Misclassification Patterns Tell a Story

The confusion matrix is one of the most revealing artifacts of this entire project — and it looked dramatically different across approaches.

With untrained models and frontier LLM prompting, the confusion matrix was a mess. Errors were scattered broadly across categories with no discernible pattern. The models would confuse rce with information_disclosure, or label a clear sqli as business_logic_flaw — mistakes that don’t reflect any coherent reasoning about the security domain. The confusion wasn’t just frequent, it was unpredictable. You couldn’t look at the error patterns and extract useful signal about where the taxonomy was ambiguous versus where the model was simply lost.

After fine-tuning, the confusion matrix tightened dramatically. The fine-tuned model’s mistakes are almost always reasonable mistakes — the same mistakes a human reviewer might make. It confuses authz_missing with idor (both are access control issues). It confuses insecure_defaults with non_security_nitpick (both are about configuration quality). It confuses supply_chain_risk with vulnerable_dependency (both are about dependency problems).
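
Surfacing these clusters from evaluation output is mechanical: count (true, predicted) mismatches and rank them. The label pairs below are fabricated examples for illustration:

```python
from collections import Counter

# Fabricated (true_label, predicted_label) pairs standing in for real
# evaluation output.
pairs = [
    ("authz_missing", "idor"), ("authz_missing", "idor"),
    ("insecure_defaults", "non_security_nitpick"),
    ("sqli", "sqli"), ("xss", "xss"), ("sqli", "sqli"),
]

# Keep only the mismatches and count each confusion pair.
confusions = Counter((t, p) for t, p in pairs if t != p)
for (true, pred), n in confusions.most_common(2):
    print(f"{true} -> {pred}: {n}")
# authz_missing -> idor: 2
# insecure_defaults -> non_security_nitpick: 1
```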

Three distinct confusion clusters emerged:

  • The security-vs-nitpick boundary: non_security_nitpick / insecure_defaults / business_logic_flaw. This is the hardest boundary in the entire taxonomy — when does a bad practice cross from “you should fix this” to “this is exploitable”?
  • The access control family: authz_missing / idor / excessive_privileges / auth_bypass. All four are fundamentally about “who can access what,” just at different layers of the stack.
  • The dependency family: supply_chain_risk / vulnerable_dependency / language_version. These overlap because a vulnerable dependency is a supply chain risk — the distinction is about specificity.

In other words, the model’s error boundary aligns with the genuine ambiguity in the taxonomy itself. This is exactly the kind of confusion matrix you want to see: not random noise, but structured disagreement that mirrors the hard calls human reviewers face.

We’re actively using these patterns to improve. The confusion matrix isn’t a static report card — it’s an engineering input. We’re refining label definitions where the taxonomy creates artificial ambiguity, generating targeted synthetic data for the highest-confusion pairs, and exploring hierarchical classification that groups related labels (the access control family, the dependency family) so the model can get the family right even when the specific label is arguable. Each round of retraining tightens the matrix further. The remaining accuracy gains are likely to come as much from refining the taxonomy — collapsing or clarifying confusing label pairs — as from model improvements.

What We Learned

Consistency matters more than peak accuracy. A model that’s 88% accurate but gives different answers on different runs is less useful than one that’s 86% accurate and stable. Across our benchmarking, the fine-tuned model showed no variation between runs on the same inputs — not because small models are inherently deterministic (they’re not), but because the task is constrained enough and the model confident enough that the outputs didn’t shift. That kind of stability is what downstream systems — rules, learning, reporting — depend on.
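
The stability check itself is trivial to express; a minimal sketch of the kind of harness we ran across benchmarks, with a deterministic stand-in for the model:

```python
# classify() is a deterministic stand-in for the fine-tuned model.
def classify(text: str) -> str:
    return "sqli" if "query" in text.lower() else "other"

def is_stable(classifier, text: str, runs: int = 10) -> bool:
    """True when repeated classifications of the same input never differ."""
    labels = {classifier(text) for _ in range(runs)}
    return len(labels) == 1

print(is_stable(classify, "Raw SQL query built from request params"))  # True
```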

Small models can punch above their weight on narrow tasks. 350 million parameters is tiny by modern standards, but when the task is well-defined (54-class text classification from structured input), you don’t need a model that can also write poetry and explain quantum mechanics. Task-specific fine-tuning is a multiplier.

Synthetic data works if you’re careful. Training on LLM-generated data sounds like circular reasoning, but the key is the evaluation boundary. Real labeled findings seed the generation, synthetic data does the training, and real held-out findings do the evaluation. The model never sees the test set during training, and the test set is never synthetic. The synthetic data just fills gaps in the long tail of your label distribution so the model doesn’t ignore rare categories.

The real competition is between approaches, not models. We didn’t spend weeks agonizing over which base model to use. The bigger decision was the strategy: prompting vs. retrieval vs. fine-tuning. Once we committed to fine-tuning, the specific model choice was driven by practical constraints (size, license, architecture, base performance) rather than benchmark chasing.

Invest in your benchmark harness early. We built a reusable framework that let us test any new approach against the same test set with consistent metrics. When you’re comparing seven approaches, having reproducible, apples-to-apples evaluation isn’t a nice-to-have — it’s the only way to make a defensible decision.

What’s Next

The classifier is now a production component, but the real payoff is what it enables. Accurate, consistent labels are the foundation for our agent learning system — the mechanism by which our code review agent improves over time on a customer's specific codebase and security posture. We'll write about that system in an upcoming post.

On the classifier itself, we're already iterating on the confusion clusters described above — tightening label definitions, generating targeted training data for high-confusion pairs, and exploring hierarchical classification. We're also working on confidence calibration using model logits so downstream systems can treat low-confidence classifications differently, and building a continuous retraining pipeline as new finding types emerge from the evolving security landscape.

But classification is just the starting point for what small, domain-specific models can do in security tooling. We're now exploring whether the same principle — fine-tune a tiny model on a narrow, well-defined task — can extend beyond labeling findings to actually finding vulnerabilities themselves.

Early results are promising. We're training extremely small models to validate specific vulnerability classes given dataflow and code context. The first target is remote code execution: given a candidate dataflow path and the surrounding code, can a purpose-built model determine whether the RCE is real? Our initial experiments suggest it can — and that these models outperform untrained frontier models on the same task, which mirrors what we saw with the classifier.

The training approach is fundamentally different from classification. Instead of labeled examples, we're generating training data from agent traces — the reasoning chains our security agents produce as they analyze code. This captures not just the final verdict but the analytical process: how a skilled agent follows data from source to sink, what it checks along the way, and how it distinguishes a true positive from a false alarm. Distilling that reasoning into a small model is a different kind of fine-tuning problem, and we're still early, but the direction feels right.

If this pans out, the implications are significant. Instead of relying on expensive frontier models with inconsistent behaviors for every security judgment in the pipeline, we'd have a constellation of cheap, fast, specialized models — one that classifies, one that validates RCEs, one that confirms authorization flaws — each trained on the specific reasoning patterns that matter for its task. The classifier was our proof of concept that this approach works. Now we're scaling it.