Architecture

Design Philosophy

Every module in this project has one job. encoders.py transforms data. detector.py scores formats. peeler.py orchestrates recursive decoding. formatter.py renders output. cli.py wires user input to functions. No module reaches into another module's concern.

This isn't over-engineering for a small project. It's how you keep a small project from becoming an unmanageable one. When you need to add a new encoding format, you add a member to the EncodingFormat enum in constants.py, then touch encoders.py and detector.py. That's it. The CLI, formatter, and peeler don't change.

Module Dependency Graph

cli.py
├── constants.py     (EncodingFormat, ExitCode, PEEL_MAX_DEPTH)
├── encoders.py      (encode, decode, encode_url, decode_url)
├── detector.py      (detect_encoding)
├── peeler.py        (peel)
├── formatter.py     (print_encoded, print_decoded, print_detection, print_peel_result, print_chain_result)
└── utils.py         (resolve_input_bytes, resolve_input_text)

peeler.py
├── constants.py     (CONFIDENCE_THRESHOLD, PEEL_MAX_DEPTH, EncodingFormat)
├── detector.py      (detect_best)
└── utils.py         (safe_bytes_preview, truncate)

detector.py
├── constants.py     (charsets, thresholds, EncodingFormat)
├── encoders.py      (try_decode)
└── utils.py         (is_printable_text)

formatter.py
├── constants.py     (EncodingFormat, PREVIEW_LENGTH)
├── detector.py      (DetectionResult, type only)
├── peeler.py        (PeelResult, type only)
└── utils.py         (safe_bytes_preview)

encoders.py
└── constants.py     (EncodingFormat)

utils.py
└── (no internal deps)

constants.py
└── (no internal deps)

The dependency arrows always point downward. constants.py and utils.py sit at the bottom with zero internal dependencies. cli.py sits at the top, importing from everything. Nothing in the middle reaches upward. This is a directed acyclic graph (DAG), and if you ever create a circular import, Python will usually fail at import time with an ImportError instead of letting the problem surface later.
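As a rough illustration of the DAG property, the graph above can be copied into a dictionary and checked with the standard library's graphlib. The dictionary and the helper names (import_order, has_cycle) are assumptions made for this sketch; they are not part of the project.

```python
from graphlib import CycleError, TopologicalSorter

# Internal imports copied from the dependency graph above:
# each module maps to the set of modules it imports.
DEPENDENCIES: dict[str, set[str]] = {
    "cli": {"constants", "encoders", "detector", "peeler", "formatter", "utils"},
    "peeler": {"constants", "detector", "utils"},
    "detector": {"constants", "encoders", "utils"},
    "formatter": {"constants", "detector", "peeler", "utils"},
    "encoders": {"constants"},
    "utils": set(),
    "constants": set(),
}


def import_order(graph: dict[str, set[str]]) -> list[str]:
    """Return modules with dependencies first; raises CycleError on a cycle."""
    return list(TopologicalSorter(graph).static_order())


def has_cycle(graph: dict[str, set[str]]) -> bool:
    """True if the graph contains at least one circular dependency."""
    try:
        import_order(graph)
    except CycleError:
        return True
    return False
```

A check like this could run in a test suite, so that an upward import (say, utils.py importing cli.py) is reported before it causes trouble at runtime.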

Data Flow

Encode Command

User Input (str or file or stdin)
    │
    ▼
resolve_input_bytes()          ← utils.py:12
    │  Converts any input source to raw bytes
    ▼
encode(raw, fmt)               ← encoders.py:88
    │  Dispatches via ENCODER_REGISTRY to format-specific function
    ▼
encode_base64(data) (or other) ← encoders.py:22
    │  Returns encoded string
    ▼
print_encoded(result, fmt)     ← formatter.py:31
    │  Rich panel if terminal, raw stdout if piped
    ▼
Output

Decode Command

User Input (str or file or stdin)
    │
    ▼
resolve_input_text()           ← utils.py:29
    │  Converts any input source to stripped text
    ▼
decode(text, fmt)              ← encoders.py:93
    │  Dispatches via ENCODER_REGISTRY to format-specific function
    ▼
decode_base64(data) (or other) ← encoders.py:26
    │  Returns decoded bytes
    ▼
print_decoded(result)          ← formatter.py:44
    │  Safe preview (UTF-8 if possible, hex fallback)
    ▼
Output

Detect Command

User Input (str)
    │
    ▼
detect_encoding(text)          ← detector.py:206
    │
    ├──► _score_base64(text)   ← detector.py:31
    ├──► _score_base64url(text) ← detector.py:70
    ├──► _score_base32(text)   ← detector.py:97
    ├──► _score_hex(text)      ← detector.py:126
    └──► _score_url(text)      ← detector.py:174
    │
    │  Each scorer returns 0.0–1.0
    │  Results filtered by CONFIDENCE_THRESHOLD (0.6)
    │  Sorted by confidence descending
    ▼
print_detection(results)       ← formatter.py:58
    │  Rich table: format, confidence %, decoded preview
    ▼
Output

Peel Command (the star feature)

User Input (str)
    │
    ▼
peel(text, max_depth=20)       ← peeler.py:33
    │
    ├──► LOOP (up to max_depth iterations):
    │    │
    │    ├── detect_best(current_text)  ← detector.py:226
    │    │   Returns highest-confidence detection
    │    │
    │    ├── Break if: no detection, below threshold, decode fails
    │    │
    │    ├── Record PeelLayer (depth, format, confidence, previews)
    │    │
    │    └── decoded_bytes → current_text for next iteration
    │        (break if bytes aren't valid UTF-8)
    │
    ▼
PeelResult(layers, final_output, success)
    │
    ▼
print_peel_result(result)      ← formatter.py:94
    │  Layer-by-layer display + final output panel
    ▼
Output

Chain Command

User Input (str) + --steps "base64,hex,url"
    │
    ▼
resolve_input_bytes()          ← utils.py:12
    │
    ▼
_parse_chain_steps("base64,hex,url")  ← cli.py:264
    │  Validates each format name against EncodingFormat enum
    │  Returns [BASE64, HEX, URL]
    ▼
LOOP over formats:
    │
    ├── encode(current_bytes, fmt)     ← encoders.py:88
    ├── Record (fmt, encoded_string)
    └── encoded_string → bytes for next iteration
    │
    ▼
print_chain_result(steps, final)       ← formatter.py:130
    │  Step-by-step display + final panel
    ▼
Output
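A minimal sketch of the chain loop follows, under stated assumptions: parse_steps() and chain() are hypothetical names, format names are plain strings where the project uses an enum, and each encoded string is converted back to bytes as UTF-8 before the next step.

```python
import base64
from collections.abc import Callable
from urllib.parse import quote

# Stand-in encoders keyed by name; the project dispatches via ENCODER_REGISTRY.
ENCODERS: dict[str, Callable[[bytes], str]] = {
    "base64": lambda data: base64.b64encode(data).decode("ascii"),
    "hex": lambda data: data.hex(),
    "url": lambda data: quote(data, safe=""),
}


def parse_steps(spec: str) -> list[str]:
    """Split "base64,hex,url" into names and reject unknown formats."""
    names = [part.strip().lower() for part in spec.split(",") if part.strip()]
    unknown = [name for name in names if name not in ENCODERS]
    if unknown:
        raise ValueError(f"unknown format(s): {', '.join(unknown)}")
    return names


def chain(data: bytes, names: list[str]) -> list[tuple[str, str]]:
    """Apply each encoder to the previous step's output, recording every step."""
    steps: list[tuple[str, str]] = []
    current = data
    for name in names:
        encoded = ENCODERS[name](current)
        steps.append((name, encoded))
        current = encoded.encode("utf-8")  # feed the text back in as bytes
    return steps
```

Validating every name before encoding anything means a typo in the last step fails fast, before any work is done.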

Key Patterns

Registry Pattern (encoders.py:73–85)

Instead of a chain of if fmt == "base64": ... elif fmt == "base64url": ..., every encoder and decoder pair is registered in a dictionary:

ENCODER_REGISTRY: dict[EncodingFormat, tuple[EncoderFn, DecoderFn]]

Adding a new format means adding one entry to the registry and writing the two functions. The dispatch functions encode() and decode() never change. This is the open-closed principle: open for extension, closed for modification.
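For illustration, a stripped-down registry might look like the sketch below. It assumes only two formats and uses plain alias assignments so it also runs on interpreters older than 3.12; the function bodies are guesses at the obvious standard-library calls, not copies of encoders.py.

```python
import base64
from collections.abc import Callable
from enum import Enum


class EncodingFormat(Enum):
    # Stand-in for the enum in constants.py; only two members for brevity.
    BASE64 = "base64"
    HEX = "hex"


# Plain aliases so the sketch also runs before Python 3.12.
EncoderFn = Callable[[bytes], str]
DecoderFn = Callable[[str], bytes]


def encode_base64(data: bytes) -> str:
    return base64.b64encode(data).decode("ascii")


def decode_base64(text: str) -> bytes:
    return base64.b64decode(text, validate=True)


def encode_hex(data: bytes) -> str:
    return data.hex()


def decode_hex(text: str) -> bytes:
    return bytes.fromhex(text)


ENCODER_REGISTRY: dict[EncodingFormat, tuple[EncoderFn, DecoderFn]] = {
    EncodingFormat.BASE64: (encode_base64, decode_base64),
    EncodingFormat.HEX: (encode_hex, decode_hex),
}


def encode(data: bytes, fmt: EncodingFormat) -> str:
    """Dispatch to the registered encoder; no per-format branching."""
    encoder, _ = ENCODER_REGISTRY[fmt]
    return encoder(data)


def decode(text: str, fmt: EncodingFormat) -> bytes:
    """Dispatch to the registered decoder."""
    _, decoder = ENCODER_REGISTRY[fmt]
    return decoder(text)
```

Supporting Base32 in this sketch would take one enum member, two functions, and one registry entry; encode() and decode() stay untouched.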

Frozen Dataclasses (detector.py:24, peeler.py:17, peeler.py:26)

All result types use @dataclass(frozen=True, slots=True). Frozen means the fields can't be mutated after creation. Slots means no __dict__ per instance, which uses less memory and is slightly faster. For data that flows through a pipeline and should never be changed, frozen dataclasses are the right tool.

Pipeline-Friendly Output (formatter.py:22–28)

The tool detects whether stdout is a terminal or a pipe. When piped (echo "data" | b64tool decode | other_tool), it writes raw text to stdout with no Rich formatting. When interactive, it shows panels, tables, and colors. This happens via is_piped() checking sys.stdout.isatty().

Rich output goes to stderr (Console(stderr=True) at formatter.py:19), so diagnostic messages never contaminate piped data. This is a standard Unix convention that many CLI tools get wrong.

Scorer Architecture (detector.py:195–203)

Detection uses the same registry pattern as encoding. Each format has a scorer function with the signature Callable[[str], float]. The _SCORERS dictionary maps EncodingFormat to its scorer. This means adding detection for a new format requires writing one scorer function and adding one dict entry.

Every scorer follows the same structure:

  1. Quick rejection (charset check, length check)
  2. Accumulate a confidence score based on structural signals
  3. Attempt actual decoding
  4. Bonus if decoded output is printable text
  5. Return clamped to [0.0, 1.0]
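The five steps can be sketched for a hex scorer. The weights (0.5, 0.1, 0.1, 0.3) and the specific structural signals are invented for illustration and are not taken from detector.py; is_printable_text() is a simplified stand-in for the helper in utils.py.

```python
import string


def is_printable_text(data: bytes) -> bool:
    """Simplified stand-in: valid UTF-8 made of printable characters."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return bool(text) and all(c.isprintable() or c in "\t\n\r" for c in text)


def score_hex(text: str) -> float:
    # 1. Quick rejection: hex needs an even number of hex digits.
    if len(text) < 2 or len(text) % 2 != 0:
        return 0.0
    if any(c not in string.hexdigits for c in text):
        return 0.0

    # 2. Structural signals (weights are illustrative).
    score = 0.5
    if text == text.lower() or text == text.upper():
        score += 0.1  # consistent letter case
    if len(text) >= 8:
        score += 0.1  # longer inputs are less likely to match by accident

    # 3. Attempt the actual decode.
    try:
        decoded = bytes.fromhex(text)
    except ValueError:
        return 0.0

    # 4. Bonus when the result looks like readable text.
    if is_printable_text(decoded):
        score += 0.3

    # 5. Clamp to [0.0, 1.0].
    return max(0.0, min(1.0, score))
```

With these weights, the hex encoding of "hello" scores higher than an equally well-formed hex string that decodes to non-text bytes, which is the behaviour the printable-text bonus is meant to produce.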

Type Aliases with PEP 695 (encoders.py:18–19)

type EncoderFn = Callable[[bytes], str]
type DecoderFn = Callable[[str], bytes]

Python 3.12+ type statements (PEP 695) replace TypeAlias from typing. They're lazily evaluated and more readable. These aliases document the contract: encoders take bytes and return strings, decoders take strings and return bytes.

Error Handling Strategy

Errors are handled at two levels:

Module level: Functions like try_decode() (encoders.py:98) catch encoding-specific exceptions and return None. The detector and peeler use this to gracefully handle decode failures without crashing.

CLI level: Each command (cli.py) wraps its body in a try/except. typer.BadParameter is re-raised (Typer formats these nicely). All other exceptions get a [red]Error:[/red] message and exit code 1. This prevents stack traces from leaking to end users.

The intermediate modules (detector, peeler) never catch exceptions themselves. They call try_decode() and check for None. This keeps error handling at the boundaries, not scattered through business logic.

Why Not a Class?

None of the core modules use classes for behavior (only for data: EncodingFormat, DetectionResult, PeelLayer, PeelResult). The encoder functions are pure functions. The scorers are pure functions. The peeler is a function. There's no shared mutable state to encapsulate, so there's no reason for a class.

An Encoder class with encode() and decode() methods would add indirection without adding value. The registry dict achieves the same polymorphism with less ceremony. This is idiomatic Python: use classes for data, functions for behavior, unless you have state to manage.