jonberenguer/sensitive-data-scanner

Important

The source code in this repository was generated by claude.ai (Sonnet 4.6). It is published strictly to review code generated by AI providers; modifications may follow in the future.

Warning

This project is provided as-is without warranty of any kind. Review the code before use in any environment you care about. Use at your own risk.

Sensitive Data Scanner

Goal

Cross-platform, self-contained binary that scans files for sensitive data and produces three outputs: a full-detail report, a redacted management report, and a structured JSON dataset for security team triage.

Scope

  • In scope: Detection of API keys, tokens, passwords, secrets, SSNs, credit card numbers, private keys, and connection strings via a user-configurable pattern library (patterns.json)
  • In scope: Three outputs — full report (raw values, restricted), redacted report, and structured JSON dataset
  • In scope: Native binaries for Linux and Windows; containerised execution via Docker
  • Out of scope: Automatic remediation or secret rotation — detection only
  • Out of scope: Real-time file system monitoring; this is a point-in-time scan

Repository Layout

.
├── src/                    # Go source code
│   ├── go.mod
│   ├── main.go             # CLI parsing, orchestration
│   ├── scanner.go          # File walking, pattern matching, entropy detection
│   ├── worker.go           # Parallel worker pool
│   ├── reporter.go         # Text report + JSON dataset writers
│   ├── reporter_html.go    # Self-contained HTML report writer
│   └── patterns.go         # Pattern struct, JSON loader, regex compiler
├── node/                   # Original Node.js implementation (archived)
│   ├── scanner.js
│   ├── patterns.js
│   ├── scan.sh
│   ├── scan.ps1
│   └── Dockerfile
├── fixtures/               # Test data (fake credentials)
├── patterns.json           # User-configurable pattern library
├── Dockerfile              # Multi-stage Go build + runtime image
├── build.sh                # Linux/macOS build script (uses Docker)
├── build.ps1               # Windows build script (uses Docker)
├── scan.sh                 # Linux/macOS run wrapper
└── scan.ps1                # Windows run wrapper

build/ is created by the build scripts and is git-ignored.


Building

Docker is the only build requirement — no Go installation needed on the host.

Linux / macOS

chmod +x build.sh scan.sh
./build.sh                # builds all targets
./build.sh linux          # Linux amd64 only
./build.sh linux-arm64    # Linux arm64 only
./build.sh windows        # Windows amd64 only

Windows (PowerShell)

.\build.ps1                          # builds all targets
.\build.ps1 -Target linux            # Linux amd64 only
.\build.ps1 -Target linux-arm64      # Linux arm64 only
.\build.ps1 -Target windows          # Windows amd64 only

Binaries are written to build/:

| File | Platform |
| --- | --- |
| `build/scanner-linux-amd64` | Linux x86-64 |
| `build/scanner-linux-arm64` | Linux ARM64 (Raspberry Pi, AWS Graviton, Apple Silicon via Rosetta) |
| `build/scanner-windows-amd64.exe` | Windows x86-64 |

Running

Linux / macOS (wrapper)

chmod +x scan.sh
./scan.sh /path/to/scan
./scan.sh /path/to/scan --ext .js,.env --exclude vendor,tmp

Windows (PowerShell wrapper)

.\scan.ps1 C:\path\to\scan
.\scan.ps1 C:\path\to\scan -Ext ".js,.env" -Exclude "vendor,tmp" -Out "C:\reports"

Windows output files are automatically suffixed with -win to distinguish them from Linux/macOS runs.

Direct (any platform)

./build/scanner-linux-amd64 /path/to/scan --patterns ./patterns.json

Docker (containerised)

# Demo run against bundled fixtures:
docker build -t sensitive-data-scanner .
docker run --rm sensitive-data-scanner

# Scan a directory on the host:
docker run --rm \
  -v /host/path/to/scan:/target:ro \
  -v /host/output:/out \
  sensitive-data-scanner /target --patterns /app/patterns.json --out /out

Options

| Flag | Description |
| --- | --- |
| `--ext .js,.env,...` | Only scan files with these extensions (comma-separated; dot optional) |
| `--exclude dir1,dir2` | Additional directories to exclude |
| `--suffix <str>` | Append a suffix to all output filenames |
| `--out <path>` | Custom output directory (default: `./scan-output-<timestamp>`) |
| `--patterns <path>` | Path to the patterns JSON file (default: `patterns.json` in working dir) |
| `--summary` | Print finding counts by type to stdout; skip writing output files |
| `--entropy` | Enable high-entropy string detection (catches secrets with no known prefix) |
| `--entropy-threshold <float>` | Entropy threshold in bits/char (default 4.5). Also enables `--entropy`. |
| `--entropy-min-len <int>` | Minimum token length for entropy check (default 20). Also enables `--entropy`. |
| `--threads <int>` | Parallel scan workers (default 1). Output order is always deterministic. |
| `-h, --help` | Show usage |
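How can parallel workers produce deterministic output? One common approach — sketched below with illustrative names, not the project's actual worker.go API — is to write each result into an index-addressed slice, so assembly order never depends on worker timing:

```go
// Deterministic fan-out: workers pull job indices from a channel and
// store results by index, so the final slice matches submission order
// regardless of which worker finishes first.
package main

import (
	"fmt"
	"sync"
)

func scanAll(files []string, threads int, scan func(string) string) []string {
	results := make([]string, len(files))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < threads; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				results[i] = scan(files[i]) // index keeps output order stable
			}
		}()
	}
	for i := range files {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	out := scanAll([]string{"a", "b", "c"}, 4, func(f string) string { return "scanned:" + f })
	fmt.Println(out) // [scanned:a scanned:b scanned:c]
}
```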

Output Files

Each scan creates a timestamped directory scan-output-<timestamp>/ containing:

| File | Description | Access |
| --- | --- | --- |
| `full-report.txt` | All findings with raw secret values | `chmod 600` — RESTRICTED |
| `redacted-report.txt` | Findings with partially redacted values (safe for management) | Unrestricted |
| `redacted-report.html` | Same as above in self-contained HTML (print-friendly) | Unrestricted |
| `findings.json` | Structured dataset for triage (no raw secrets) | Unrestricted |
| `skipped.log` | Binary or unreadable files (if any) | Unrestricted |

Redaction Format

  • SSN: ***-**-6789 (last 4 digits visible)
  • Credit Card: ****-****-****-1111 (last 4 digits visible)
  • Private Key: [PRIVATE KEY DETECTED — see full report]
  • Everything else: ABCD****WXYZ (first 4 + last 4 chars)
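The rules above can be sketched in a few lines of Go. This is illustrative, not the project's actual reporter.go; the `kind` labels are hypothetical:

```go
// redact applies the redaction rules documented above.
package main

import (
	"fmt"
	"strings"
)

func redact(kind, value string) string {
	last4 := func(s string) string { return s[len(s)-4:] } // assumes len >= 4
	switch kind {
	case "ssn":
		return "***-**-" + last4(value)
	case "credit-card":
		return "****-****-****-" + last4(value)
	case "private-key":
		return "[PRIVATE KEY DETECTED — see full report]"
	default:
		if len(value) <= 8 {
			// too short to expose any characters safely
			return strings.Repeat("*", len(value))
		}
		return value[:4] + "****" + last4(value)
	}
}

func main() {
	fmt.Println(redact("ssn", "123-45-6789"))              // ***-**-6789
	fmt.Println(redact("generic", "AKIAIOSFODNN7EXAMPLE")) // AKIA****MPLE
}
```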

Pattern Library (patterns.json)

Patterns are defined in patterns.json at the project root. Each entry follows this schema:

{
  "id": "aws-access-key-id",
  "name": "AWS Access Key ID",
  "description": "...",
  "pattern": "\\bAKIA[0-9A-Z]{16}\\b",
  "caseInsensitive": false,
  "captureGroup": 0
}
| Field | Type | Description |
| --- | --- | --- |
| `id` | string | Unique identifier |
| `name` | string | Human-readable label used in reports |
| `description` | string | Documents what the pattern targets |
| `pattern` | string | RE2-compatible regex (JSON-escaped) |
| `caseInsensitive` | bool | Prepends `(?i)` when true |
| `captureGroup` | int | `0` = use full match; `1` = use first capture group |
| `validator` | string (optional) | Post-match validator: `"ssn"` (rejects invalid area/group/serial ranges) or `"luhn"` (Luhn check-digit for credit cards) |

Add, remove, or modify entries in patterns.json to customise detection without recompiling.
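A minimal sketch of how an entry with this schema could be loaded and compiled (field names mirror the table above; the loader itself is illustrative, not the project's actual patterns.go):

```go
// Load a patterns.json entry and compile it with Go's RE2 engine, so
// unsupported constructs (e.g. lookarounds) fail loudly at load time.
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

type Pattern struct {
	ID              string `json:"id"`
	Name            string `json:"name"`
	Description     string `json:"description"`
	Pattern         string `json:"pattern"`
	CaseInsensitive bool   `json:"caseInsensitive"`
	CaptureGroup    int    `json:"captureGroup"`
	Validator       string `json:"validator,omitempty"`
}

// compilePattern applies the caseInsensitive flag before compiling.
func compilePattern(p Pattern) (*regexp.Regexp, error) {
	expr := p.Pattern
	if p.CaseInsensitive {
		expr = "(?i)" + expr
	}
	return regexp.Compile(expr)
}

func main() {
	raw := `{"id":"aws-access-key-id","name":"AWS Access Key ID",
	         "pattern":"\\bAKIA[0-9A-Z]{16}\\b","caseInsensitive":false,"captureGroup":0}`
	var p Pattern
	if err := json.Unmarshal([]byte(raw), &p); err != nil {
		panic(err)
	}
	re, err := compilePattern(p)
	if err != nil {
		panic(err)
	}
	fmt.Println(re.MatchString("key = AKIAABCDEFGHIJKLMNOP")) // true
}
```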

Note: Go uses RE2 syntax. Lookahead/lookbehind assertions are not supported. Use the validator field for post-match filtering instead (as done for SSN and credit cards).
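For reference, the `luhn` validator is the standard Luhn check-digit algorithm. A self-contained sketch (not necessarily the project's exact implementation):

```go
// luhnValid reports whether the digits in s pass the Luhn checksum:
// working right to left, every second digit is doubled (subtracting 9
// if the result exceeds 9) and the total must be divisible by 10.
package main

import "fmt"

func luhnValid(s string) bool {
	sum, double := 0, false
	for i := len(s) - 1; i >= 0; i-- {
		c := s[i]
		if c < '0' || c > '9' {
			continue // skip separators like '-' or ' '
		}
		d := int(c - '0')
		if double {
			d *= 2
			if d > 9 {
				d -= 9
			}
		}
		sum += d
		double = !double
	}
	return sum%10 == 0
}

func main() {
	fmt.Println(luhnValid("4111111111111111")) // true (well-known Visa test number)
	fmt.Println(luhnValid("4111111111111112")) // false
}
```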


Detected Secret Types

| Pattern | Example |
| --- | --- |
| AWS Access Key ID | `AKIA...` |
| AWS Secret Access Key | `aws_secret_access_key = ...` |
| GCP API Key | `AIza...` |
| GCP Service Account Key | `client_email: ...` |
| GitHub PAT (classic + fine-grained) | `ghp_...`, `github_pat_...` |
| Slack Bot/User/App Token | `xoxb-...` |
| Stripe Secret/Publishable Key | `sk_live_...`, `pk_test_...` |
| SendGrid API Key | `SG....` |
| Twilio Auth Token | `TWILIO_AUTH_TOKEN = ...` |
| Bearer Token | `Authorization: Bearer ...` |
| JSON Web Token (JWT) | `eyJ....eyJ....` |
| Generic API Key | `api_key = ...` |
| Generic Secret / Token | `secret = ...`, `access_token = ...` |
| Generic Password Field | `password = ...` |
| Private Key (PEM header) | `-----BEGIN RSA PRIVATE KEY-----` |
| Database Connection String | `postgresql://user:pass@host/db` |
| Azure Storage Connection String | `DefaultEndpointsProtocol=...AccountKey=...` |
| Social Security Number (SSN) | `123-45-6789` |
| Credit Card Number | Visa, Mastercard, Amex, Discover (Luhn-validated) |
| High-Entropy String | Any token ≥ 20 chars scoring ≥ 4.5 bits/char (requires `--entropy`) |
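The high-entropy check is based on Shannon entropy in bits per character. A sketch of that measure with the documented defaults (≥ 20 chars, ≥ 4.5 bits/char); illustrative only, and names do not mirror the project's scanner.go:

```go
// shannonEntropy computes -sum(p * log2(p)) over character frequencies,
// i.e. the average information content per character in bits.
package main

import (
	"fmt"
	"math"
)

func shannonEntropy(s string) float64 {
	if s == "" {
		return 0
	}
	freq := make(map[rune]int)
	n := 0
	for _, r := range s {
		freq[r]++
		n++
	}
	h := 0.0
	for _, count := range freq {
		p := float64(count) / float64(n)
		h -= p * math.Log2(p)
	}
	return h
}

// isHighEntropy applies the documented defaults: length >= 20 and
// entropy >= 4.5 bits/char. Random base64/hex secrets tend to clear
// the bar; ordinary prose and repeated characters do not.
func isHighEntropy(token string) bool {
	return len(token) >= 20 && shannonEntropy(token) >= 4.5
}

func main() {
	fmt.Printf("%.2f\n", shannonEntropy("aaaaaaaaaaaaaaaaaaaa")) // 0.00
	fmt.Println(isHighEntropy("password123"))                    // false
}
```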

Default Excluded Directories

.git, node_modules, .cache, dist, build, vendor, __pycache__, .yarn, .next, .nuxt, target, .venv, venv, .tox, coverage, .nyc_output, .parcel-cache, .turbo, .svelte-kit, out, .output

Additional directories can be excluded with --exclude.


Security Notice

The full report contains raw, unredacted secret values.

  • Do NOT commit it to version control
  • Do NOT email or share it without encryption
  • chmod 600 is applied automatically on Linux/macOS

Node.js Implementation (Archived)

The original Node.js implementation lives in the node/ directory and requires Node.js 18+.

# Linux/macOS
node/scan.sh /path/to/scan

# Windows
node/scan.ps1 C:\path\to\scan

# Docker (built from node/ context)
docker build -t scanner-node -f node/Dockerfile node/
docker run --rm scanner-node

To Do

  • Expand Azure patterns: SAS tokens, Cosmos DB connection strings
  • Add unit tests for the pattern library (one fixture per pattern)
