Secrets detection has long been trapped in a tug-of-war between coverage and precision. While “typed” secrets are easily caught via classic regex, “untyped” and variable-based secrets remain invisible to pattern matching. The industry-standard fallback to entropy attempts to fill this gap, but it drowns engineers in false positives without guaranteeing coverage.
DeepSecrets has always tackled this problem differently, and version 2.0 brings the hard data to prove it.
Hi, Nikolai here.
Today, I’m thrilled to announce the breakthrough release of the DeepSecrets tool.
It’s the result of the last six months of research focused on search coverage, true positive rate, and performance.
What is it? Another token-wasting CLI proxy to an AI API?
Nope!
In our LLM-hype era, DeepSecrets still runs entirely on your machine, giving you great results offline, securely, and for free.
DeepSecrets extends the classic regex-based strategy for secrets scanning and relies heavily on semantic code analysis. The approach dramatically increases search coverage while maintaining reliable results: secret candidates are always semantically correct.
Every candidate passes an evaluation stage where many checks are performed on variable naming and value types. The initial under-the-hood story is covered in previous articles.
Show me the numbers.
DeepSecrets has always needed a way to track its quality in real-world scenarios.
Several months ago, I started testing every change to the tool against SecretBench, a dedicated benchmark for secrets-scanning tools. The benchmark consists of two components:
- an archive of ~50K files containing secrets;
- a CSV file where every secret is described, enriched with metadata, and marked as valid or invalid.
So, DeepSecrets v2.0 finds 93% of valid secrets while ignoring 92% of noisy secrets.
==Of course, I shared my findings with the SecretBench authors and kindly asked them to review them.==
This is how the result changes the competition covered in the article by SecretBench’s creators: DeepSecrets now tops the ranking within the benchmark’s scope.
| Rank | Tool | Type | Recall (Valid Secrets Found) | Precision (Fewer False Alarms) | F1-Score (Overall Balance) |
|---|---|---|---|---|---|
| ==1== | ==DeepSecrets 2.0== | ==Open-source== | ==93%== | ==69%== | ==79%== |
| 2 | GitLeaks | Open-source | 88% | 46% | 60% |
| 3 | GitHub Secret Scanner | Built-in Platform | 6% | 75% | 48% |
| 4 | Commercial X | Proprietary Enterprise | 45% | 25% | 32% |
| 5 | TruffleHog | Open-source | 52% | 6% | 11% |
| 6 | SpectralOps | Proprietary Enterprise | 68% | 1% | 2% |
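
For readers who want to sanity-check the last column: F1 is the harmonic mean of precision and recall. The snippet below is a minimal sketch that recomputes it for two rows of the table. The helper name `f1_score` is mine and is not part of DeepSecrets or SecretBench.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision) copied from the DeepSecrets 2.0 and GitLeaks rows above.
rows = {
    "DeepSecrets 2.0": (0.93, 0.69),
    "GitLeaks": (0.88, 0.46),
}

for tool, (recall, precision) in rows.items():
    print(f"{tool}: F1 = {f1_score(precision, recall):.0%}")
# DeepSecrets 2.0: F1 = 79%
# GitLeaks: F1 = 60%
```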
And… no article is truly complete without this textbook visual.
But the most interesting discovery lies beyond the benchmark.
Revealing the “Dark Matter” beyond SecretBench
The authors of SecretBench manually audited the lines of code flagged by their 761 regex patterns. If a secret didn’t trigger a regex match on BigQuery, it never made it to the review table and remained invisible.
DeepSecrets isn’t bound by text patterns and is designed to provide maximum search coverage while maintaining semantically correct, clean results.
So, outside the SecretBench scope, DeepSecrets uncovered 66,564 additional secret locations (only 9,425 distinct secrets). This is the “Dark Matter” of application security: valid, actionable secrets hiding in plain sight that even manual human auditing couldn’t catch.
We can and should challenge those enormous numbers. What is the true positive rate of those extra findings? Of course, there is no way to achieve a zero false-positive rate: even a classic password=password might be a valid result depending on the context.
DeepSecrets is smart enough to mark such candidates as low-confidence. Every candidate is scored using a system that evaluates naming layouts, value entropy, and “naturalness” (using an n-gram and a Bloom filter; technical details below). If the confidence is negative, the candidate is discarded. The reported score ranges from 0 to 10, and values of 6 and higher indicate valid findings in most cases. This is how 92% of scoped false positives were correctly filtered.
As for the “Dark Matter,” 4,568 (48%) of those hidden findings returned a confidence score of 6 or higher.
Confidence Drilldown (Distinct Values)

Technical Improvements
Performance and Stability
The update made the tool ~30% faster and more reliable for large files (up to 200 MB) with rich semantics. Previously, it was hard to tell if the tool was still alive during resource-intensive scans. The UI now shows the progress and estimates for each file, as well as the overall progress.

Semantic Scanning Coverage
The LexerTokenizer is now able to detect code nesting and correctly parse situations like “inline yaml inside yaml inside markdown”.

CheapVarDetector
The semantic lexing and parsing process has limitations (e.g., code that is commented out or contained within string variables).
To cover such cases, the release introduces a CheapVarDetector. It’s just a set of tight regular expressions that detect potential variable declarations in “unlexable” code. Yes, it generates noise variables, but that aspect is covered in the next section.
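
To give a feel for the idea, here is a minimal sketch of a "cheap" detector. The pattern and the function name `cheap_var_candidates` are hypothetical and written for illustration only; the actual CheapVarDetector rules live in the repository and are likely tighter.

```python
import re

# Hypothetical pattern (not the tool's actual rule): a variable-like name,
# an assignment operator (= or :), then a quoted value. All quantifiers are
# bounded to keep matching cheap.
ASSIGNMENT = re.compile(
    r"""
    (?P<name>[A-Za-z_][A-Za-z0-9_.-]{1,63})   # variable-like name
    \s*[:=]\s*                                # assignment operator
    (?P<quote>["'])                           # opening quote
    (?P<value>[^"'\n]{1,256})                 # value without quotes or newlines
    (?P=quote)                                # matching closing quote
    """,
    re.VERBOSE,
)


def cheap_var_candidates(text: str) -> list:
    """Return (name, value) pairs for anything that looks like an assignment."""
    return [(m.group("name"), m.group("value")) for m in ASSIGNMENT.finditer(text)]


# A commented-out line and a declaration buried inside a string literal:
sample = '''
# db_password = "s3cr3t-Pa55"
msg = "// api_token: 'abc123def456'"
'''

print(cheap_var_candidates(sample))
# [('db_password', 's3cr3t-Pa55'), ('api_token', 'abc123def456')]
```

Both hits come from code a lexer would treat as a plain comment or a plain string, which is exactly the blind spot this component is meant to cover.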
Variable Evaluation
It is now score-based and configurable, allowing you to tailor it to your specific needs.
The idea is straightforward: assess how “dangerous” a suspicious code construction is based on its semantic parameters. For a variable, these are its name, its value, and its file location.
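
A minimal sketch of what score-based evaluation can look like is shown below. Everything in it is an assumption made for illustration: the `Candidate` class, the rule names, the weights, and the keyword lists are mine, not the tool's actual configuration. Only the overall shape follows the article: signals from name, value, and location are summed, a negative total discards the candidate, and the reported score is capped at 10.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    value: str
    path: str


# Hypothetical, user-tunable weights; the real rules and numbers may differ.
WEIGHTS = {
    "suspicious_name": 4.0,
    "long_value": 3.0,
    "placeholder_value": -3.0,
    "test_location": -2.0,
}
SUSPICIOUS_NAME_PARTS = ("password", "passwd", "secret", "token", "api_key")
PLACEHOLDERS = {"password", "changeme", "example", "xxx", "todo"}
NOISY_PATH_PARTS = ("test", "example", "fixture")


def score(c: Candidate, weights: dict = WEIGHTS) -> Optional[float]:
    """Return a confidence in [0, 10], or None when the candidate is discarded."""
    raw = 0.0
    if any(part in c.name.lower() for part in SUSPICIOUS_NAME_PARTS):
        raw += weights["suspicious_name"]
    if len(c.value) >= 12:
        raw += weights["long_value"]
    if c.value.lower() in PLACEHOLDERS:
        raw += weights["placeholder_value"]
    if any(part in c.path.lower() for part in NOISY_PATH_PARTS):
        raw += weights["test_location"]
    if raw < 0:
        return None  # negative confidence: drop the candidate
    return min(raw, 10.0)


print(score(Candidate("db_password", "Xk9#mQ2$vL7pR4", "src/config.py")))  # 7.0
print(score(Candidate("password", "password", "src/settings.py")))         # 1.0
print(score(Candidate("retries", "changeme", "tests/test_app.py")))        # None
```

Note how password=password survives with a low score instead of being thrown away, while a placeholder in a test file drops below zero and disappears.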

Speaking of the value, the classic entropy is still our friend, BUT not the best one. It turned out that checking the “naturalness” of the secret’s value is a great additional signal.
To calculate that “naturalness”, the tool uses an n-gram and a bloom filter trained on a 300k-word natural-language dictionary.
The final score directly affects the confidence level discussed earlier.
So, this is how we can survive the storm of dummy candidates.
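
Here is a rough sketch of the "naturalness" signal under stated assumptions: I use character trigrams, a tiny word list standing in for the 300k-word dictionary, and a hand-rolled Bloom filter built on `hashlib`. The real implementation may use different n-gram sizes, hashing, and training data.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 3) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive several bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def ngrams(text: str, n: int = 3) -> list:
    """Lowercased character n-grams of the text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Tiny stand-in for the natural-language dictionary (assumption).
WORDS = ["password", "secret", "example", "default", "changeme",
         "placeholder", "development", "production"]

known = BloomFilter()
for word in WORDS:
    for gram in ngrams(word):
        known.add(gram)


def naturalness(value: str) -> float:
    """Fraction of the value's trigrams that were seen in the dictionary."""
    grams = ngrams(value)
    if not grams:
        return 0.0
    return sum(gram in known for gram in grams) / len(grams)


print(naturalness("changeme"))       # 1.0: every trigram comes from a known word
print(naturalness("xK9qZ7vB2mT4"))   # close to 0 for random-looking values
```

A value that reads like natural language scores high and is probably a placeholder; a value with almost no familiar n-grams looks machine-generated, which is what a real secret usually is.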
More Language Support
The update covers variable extraction edge cases in Shell, JavaScript, Markdown, PHP, and C#, and provides deeper support for R(d), Ruby, and Nix.
Update to Regexes
Classic regexes for known-format secrets also received a revamp, becoming more stringent and using multi-stage checks. Several catastrophic backtracking issues were also resolved.
The default ruleset can now better search for AWS, Stripe, MailChimp, and our favorite: -----BEGIN constructions.
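
As an illustration of what a multi-stage check can look like, here is a minimal sketch for AWS access key IDs. It is not the tool's actual rule: the well-known `AKIA` plus 16 uppercase alphanumerics format is used for stage one, and the stage-two heuristics (the `EXAMPLE` suffix and the character-diversity threshold) are my assumptions.

```python
import re

# Stage 1: cheap format match. Bounded quantifier and no nested repetition,
# so there is nothing here that can backtrack catastrophically.
AWS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")


def passes_second_stage(key: str) -> bool:
    """Stage 2: reject matches that fit the format but are obviously fake."""
    body = key[4:]
    if key.endswith("EXAMPLE"):   # documentation-style placeholder
        return False
    if len(set(body)) < 6:        # e.g. AKIA followed by sixteen X characters
        return False
    return True


def find_aws_key_ids(text: str) -> list:
    return [key for key in AWS_KEY_ID.findall(text) if passes_second_stage(key)]


text = "\n".join([
    "a = 'AKIAIOSFODNN7EXAMPLE'",       # placeholder from documentation
    "b = 'AKIA" + "X" * 16 + "'",       # masked value
    "c = 'AKIA2J7QK4PLM9ZT5RWD'",       # made-up, realistic-looking value
])

print(find_aws_key_ids(text))
# ['AKIA2J7QK4PLM9ZT5RWD']
```

The first stage stays strict and fast; the second stage is where the noise gets removed without loosening the pattern.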
Switching to SARIF reports
The default reporting format is now SARIF. It is an industry standard and provides seamless integration with orchestration and ASPM systems.
The legacy JSON report format is now deprecated and will be removed in the next release.
For now, you can still choose it via --outformat json, but you will get a deprecation warning.
SARIF Reports & Dynamic Confidence
Every finding from DeepSecrets gets a confidence score. However, different security platforms parse SARIF metrics in different ways. To ensure compatibility across modern ASPM dashboards, DeepSecrets does the following:
- Virtual Subrules (`rules[]`): Dynamically generates rules like `S105-LOW` or `S105-CRITICAL`. This forces GitHub Security and DefectDojo to map semantic precision variance properly without breaking native parsers. This can become a problem if you have used DeepSecrets before and have a set of correctly deduplicated findings. I am truly sorry for this temporary inconvenience, but this change is vital.
- Deterministic Result Level: The tool always explicitly sets `level: error` in the `results[]` model. This serves as a universal fallback for CI/CD pipelines and older SAST parsers, ensuring that exposed secrets reliably break builds or block Pull Requests regardless of individual rule interpretations.
- Contextual Messages: Injects the raw numeric confidence score natively into `result.message.text` so security analysts see it immediately on their dashboards.
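
The three points above can be sketched as a small SARIF 2.1.0 builder. This is an illustration under assumptions, not the tool's exporter: the input dictionary shape, the `band` thresholds, and the intermediate `MEDIUM` and `HIGH` suffixes are hypothetical (the article only names `LOW` and `CRITICAL`).

```python
import json


def band(confidence: float) -> str:
    """Map a 0-10 confidence to a subrule suffix (thresholds are assumptions)."""
    if confidence >= 9:
        return "CRITICAL"
    if confidence >= 6:
        return "HIGH"
    if confidence >= 3:
        return "MEDIUM"
    return "LOW"


def to_sarif(findings: list) -> dict:
    rules = {}
    results = []
    for f in findings:
        # Virtual subrule: base rule id plus confidence band.
        rule_id = f"{f['rule']}-{band(f['confidence'])}"
        rules.setdefault(rule_id, {"id": rule_id})
        results.append({
            "ruleId": rule_id,
            "level": "error",  # deterministic result level
            "message": {  # contextual message carrying the raw score
                "text": f"{f['description']} (confidence: {f['confidence']})"
            },
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": f["path"]},
                    "region": {"startLine": f["line"]},
                }
            }],
        })
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": "DeepSecrets", "rules": list(rules.values())}},
            "results": results,
        }],
    }


report = to_sarif([
    {"rule": "S105", "confidence": 9.5, "description": "Hardcoded secret",
     "path": "src/app.py", "line": 12},
    {"rule": "S105", "confidence": 2, "description": "Hardcoded secret",
     "path": "docs/readme.md", "line": 3},
])
print(json.dumps(report, indent=2))
```

Because the band is part of the rule id, a finding whose confidence changes between scans gets a different `ruleId`, which is why previously deduplicated findings may need to be re-triaged.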
Can’t wait for your feedback!
Don’t take my word for it, check it out yourself.
Let’s make our code cleaner together with DeepSecrets.
Github: https://github.com/ntoskernel/deepsecrets/
Release: https://github.com/ntoskernel/deepsecrets/releases/tag/v2.0.0