Snyk
Snyk VulnBench JS 1.0 Reveals Inconsistencies in LLM Security Findings
Article Content
Snyk ran 300 vulnerability scans to measure how repeatable LLM security reviews are when the code and prompt are held identical. The benchmark, built around JavaScript and Express applications, was designed to measure model behavior under controlled conditions. Findings that matched the Snyk Code reference set were stable: 134 of 158 matched findings (about 85%) appeared in all five repetitions. Findings reported only by the LLM, with no match in the reference set, varied widely: of 161 unique unmatched findings, 80 (nearly 50%) appeared in just one of five identical scans. The benchmark shows that LLMs can identify high-signal exploit shapes, but recall is incomplete: the highest-recall LLM configuration detected only 81% of the Snyk Code reference vulnerabilities. These results raise questions about how reliable LLMs are for security assessments compared with traditional deterministic SAST tools.
Key Points:
• LLM-only findings vary widely between runs: about half (80 of 161) appeared in only one of five identical scans.
• Reference-matched findings were stable (134 of 158 appeared in every repetition), which points toward combining LLMs with traditional SAST tools.
• The highest-recall LLM configuration found only 81% of the Snyk Code reference vulnerabilities.
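The repeatability measure described above can be illustrated with a minimal Python sketch. It assumes each scan yields a set of finding identifiers that are comparable across runs. The article does not say how Snyk matched findings between scans, so the identifier format, the function names, and the sample data below are all hypothetical. The last two lines only recompute the shares implied by the counts quoted in the article.

```python
from collections import Counter


def appearance_counts(runs):
    """Count, for each finding ID, how many of the repeated scans reported it."""
    counts = Counter()
    for run in runs:
        counts.update(set(run))  # a finding counts at most once per scan
    return counts


def stability_summary(runs):
    """Summarize how many unique findings were seen once vs. in every scan."""
    counts = appearance_counts(runs)
    return {
        "unique": len(counts),
        "seen_once": sum(1 for n in counts.values() if n == 1),
        "seen_in_all": sum(1 for n in counts.values() if n == len(runs)),
    }


# Hypothetical data: five scans of the same code with the same prompt.
# The "rule:file:line" ID format is an assumption made for illustration.
runs = [
    {"sqli:routes/users.js:42", "xss:views/profile.js:10"},
    {"sqli:routes/users.js:42"},
    {"sqli:routes/users.js:42", "path-traversal:routes/files.js:7"},
    {"sqli:routes/users.js:42"},
    {"sqli:routes/users.js:42", "xss:views/profile.js:10"},
]
print(stability_summary(runs))

# Shares implied by the counts quoted in the article.
print(f"unmatched findings seen once: {80 / 161:.1%}")    # 49.7%
print(f"matched findings seen in all: {134 / 158:.1%}")   # 84.8%
```

In the sample data, one finding appears in all five scans, one in two scans, and one only once. A finding that appears in a single run out of five is the kind of result the benchmark flags as hard to rely on.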