CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Essential Build and Development Commands

Core Development Commands

# Default command - shows help
make

# Run all tests
make test

# Run tests with race detector
make test-race

# Run a single test
go test ./llama3 -run TestTokenizerEncode -v

# Run benchmarks
make bench
go test ./llama3 -bench=BenchmarkEncode -run=^$

# Build the CLI binary
make build

# Install the tokenizer CLI
make install

# Generate documentation (updates README files with gomarkdoc)
make generate

# Format code
make fmt-all

# Run linter
make lint

# Generate coverage report
make coverage

Release Commands

# Create and push a version tag
make tag VERSION=v1.0.0
git push origin v1.0.0

# Test release process locally
make release-snapshot

# Build for all platforms
make build-all

Development Tools

# Start development with hot-reload
make dev

# Run devbox shell for consistent environment
make devbox

# Install all development tools
make install-tools

# Install pre-commit hooks
make install-pre-commit

High-Level Architecture

Package Structure

The tokenizer project is organized into modular packages with clear separation of concerns:

cmd/tokenizer/ - CLI implementation
- main.go - Entry point with build variables (version, commit, buildDate, goVersion)
- root.go - Root command and version command
- Subcommands are delegated to individual tokenizer implementations (e.g., llama3)
llama3/ - Core Llama 3 tokenizer implementation
- tokenizer.go - Main tokenizer struct implementing Encoder/Decoder interfaces
- scanner.go - Streaming tokenization API following bufio.Scanner pattern
- vocab.go - Vocabulary management with embedded data
- options.go - Configuration options for encoding/decoding
- constants.go - Token IDs and vocabulary size constants
- errors.go - Custom error types
llama3/internal/ - Internal implementation details
- bpe/ - Byte Pair Encoding algorithm with caching
- pretokenizer/ - State machine for pre-tokenization (regex-free)
- vocabulary/ - Vocabulary data loading and management
- encoding/ - Byte-to-unicode encoding utilities
- tokens/ - Special token handling
llama3/cmd/llama3/ - Llama3-specific CLI commands
- encode.go - Text encoding command (with memory-efficient streaming for stdin)
- decode.go - Token decoding command
- info.go - Tokenizer information command

Key Architectural Decisions

Interface-Based Design: The tokenizer uses small, focused interfaces (Encoder, Decoder, Scanner, PreTokenizer, BPE, Cache) to allow for testing and future extensibility.
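To illustrate why small interfaces help with testing, here is a minimal sketch. The method signatures for Encoder and Decoder, the byteCodec stand-in, and the countTokens helper are all assumptions made for this example; the real interfaces in the llama3 package may look different.

```go
package main

import (
	"fmt"
	"strings"
)

// Encoder and Decoder use hypothetical signatures for this sketch; the
// real interfaces in the llama3 package may differ.
type Encoder interface {
	Encode(text string) []int
}

type Decoder interface {
	Decode(tokens []int) string
}

// byteCodec is a stand-in implementation that maps each byte to one ID.
// It is not a real tokenizer; it only makes the interfaces testable.
type byteCodec struct{}

func (byteCodec) Encode(text string) []int {
	ids := make([]int, 0, len(text))
	for i := 0; i < len(text); i++ {
		ids = append(ids, int(text[i]))
	}
	return ids
}

func (byteCodec) Decode(tokens []int) string {
	var sb strings.Builder
	for _, id := range tokens {
		sb.WriteByte(byte(id))
	}
	return sb.String()
}

// countTokens depends only on the Encoder interface, so a test can pass
// any implementation, including the stand-in above.
func countTokens(enc Encoder, text string) int {
	return len(enc.Encode(text))
}

func main() {
	var codec byteCodec
	fmt.Println(countTokens(codec, "hello"))
	fmt.Println(codec.Decode(codec.Encode("hello")))
}
```

Code that accepts the narrow interface rather than the concrete tokenizer can be exercised with a trivial fake, which is the main benefit this design decision is after.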
Embedded Data: Vocabulary and merge data are embedded at compile time using go:embed, eliminating runtime file dependencies while allowing custom data loading through interfaces.
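State Machine Pre-tokenization: Instead of a regular expression, a custom state machine implements the pre-tokenization pattern so that its output matches the JavaScript reference exactly. This avoids Go/JS regex incompatibilities; for example, Go's regexp package uses RE2 syntax, which has no lookahead assertions.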
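As a rough illustration of the regex-free approach, the sketch below splits text into runs of letters, digits, whitespace, and other characters by tracking an explicit state. It is deliberately much simpler than the actual Llama 3 pre-tokenization pattern, and the names charClass, classify, and splitRuns are hypothetical rather than taken from internal/pretokenizer.

```go
package main

import (
	"fmt"
	"unicode"
)

// charClass is the scanner state: which kind of run we are inside.
type charClass int

const (
	classNone charClass = iota
	classLetter
	classDigit
	classSpace
	classOther
)

// classify maps a rune to the run type it belongs to.
func classify(r rune) charClass {
	switch {
	case unicode.IsLetter(r):
		return classLetter
	case unicode.IsDigit(r):
		return classDigit
	case unicode.IsSpace(r):
		return classSpace
	default:
		return classOther
	}
}

// splitRuns walks the input once and cuts a new piece whenever the
// character class changes. No regular expressions are involved.
func splitRuns(text string) []string {
	var pieces []string
	state := classNone
	start := 0
	for i, r := range text {
		c := classify(r)
		if state != classNone && c != state {
			pieces = append(pieces, text[start:i])
			start = i
		}
		state = c
	}
	if start < len(text) {
		pieces = append(pieces, text[start:])
	}
	return pieces
}

func main() {
	fmt.Printf("%q\n", splitRuns("Hello, world 123"))
}
```

The real state machine has to encode many more rules than this, which is why changes to it call for the comparison tests against the JavaScript reference.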
Performance Optimizations:
- Object pooling with sync.Pool for state machines
- LRU caching for BPE results
- Pre-allocated slices and careful memory management
- Zero-allocation methods for performance-critical paths
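The pooling and caching ideas above can be sketched roughly as follows. This is an assumption-laden illustration, not the project's implementation: lruCache, entry, scanState, and statePool are hypothetical names, the token IDs are made up, and the cache assumes a positive capacity.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// entry is one cached result: a pre-token and its token IDs.
type entry struct {
	key    string
	tokens []int
}

// lruCache is a small mutex-guarded LRU cache (hypothetical type).
type lruCache struct {
	mu       sync.Mutex
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> element in order
}

func newLRUCache(capacity int) *lruCache {
	return &lruCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element, capacity),
	}
}

// Get returns the cached tokens and marks the key as recently used.
func (c *lruCache) Get(key string) ([]int, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).tokens, true
}

// Put stores tokens for key, evicting the least recently used entry
// when the cache is full.
func (c *lruCache) Put(key string, tokens []int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).tokens = tokens
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.capacity {
		if oldest := c.order.Back(); oldest != nil {
			c.order.Remove(oldest)
			delete(c.items, oldest.Value.(*entry).key)
		}
	}
	c.items[key] = c.order.PushFront(&entry{key: key, tokens: tokens})
}

// scanState stands in for a reusable state machine (hypothetical type).
type scanState struct{ pieces []string }

// statePool recycles scanState values so hot paths avoid reallocating.
var statePool = sync.Pool{New: func() any { return &scanState{} }}

func main() {
	cache := newLRUCache(2)
	cache.Put("hello", []int{1})
	cache.Put("world", []int{2})
	cache.Get("hello")       // "hello" becomes the most recent entry
	cache.Put("!", []int{3}) // evicts "world"
	_, ok := cache.Get("world")
	fmt.Println("world still cached:", ok)

	st := statePool.Get().(*scanState)
	st.pieces = append(st.pieces[:0], "reused") // reset, keep capacity
	fmt.Println(st.pieces)
	statePool.Put(st)
}
```

Resetting the pooled slice with `[:0]` keeps its backing array, which is how pooling and pre-allocation work together to reduce garbage on repeated calls.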
Streaming Support: The Scanner interface provides memory-efficient tokenization for large texts with proper UTF-8 boundary handling.
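One way to handle UTF-8 boundaries when reading input in chunks is to hold back any incomplete trailing rune until more bytes arrive. The sketch below shows that idea using the standard unicode/utf8 package; completePrefix is a hypothetical helper for illustration and is not taken from scanner.go.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// completePrefix returns the length of the longest prefix of buf that
// does not end in the middle of a multi-byte UTF-8 sequence.
func completePrefix(buf []byte) int {
	// A rune is at most utf8.UTFMax bytes, so only the tail matters.
	for i := len(buf) - 1; i >= 0 && i >= len(buf)-utf8.UTFMax; i-- {
		if utf8.RuneStart(buf[i]) {
			if utf8.FullRune(buf[i:]) {
				return len(buf) // trailing rune is complete
			}
			return i // trailing rune is cut off; hold it back
		}
	}
	// No rune start found in the tail: treat the bytes as-is.
	return len(buf)
}

func main() {
	data := []byte("héllo €")
	chunk := data[:2] // ends after the first byte of "é"
	n := completePrefix(chunk)
	fmt.Printf("emit %q, carry over %d byte(s)\n", chunk[:n], len(chunk)-n)
}
```

A streaming reader would tokenize only the first n bytes of each chunk and prepend the carried-over bytes to the next read.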

Build System Integration

The Makefile uses LDFLAGS to embed build information:

LDFLAGS := -ldflags "-X main.version=$(VERSION) -X main.commit=$(COMMIT) -X main.buildDate=$(BUILD_DATE) -X main.goVersion=$(GO_VERSION)"

These variables are defined in cmd/tokenizer/main.go and displayed by the version command.
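For the linker's -X flag to work, each target must be a package-level string variable in the named package. A minimal sketch of what such declarations could look like follows; the variable names come from the LDFLAGS line above, while the default values and the versionString helper are assumptions made for this example.

```go
package main

import "fmt"

// Populated at build time via -ldflags "-X main.<name>=<value>".
// The defaults below are placeholders assumed for this sketch; they
// apply when the binary is built without the Makefile's LDFLAGS.
var (
	version   = "dev"
	commit    = "none"
	buildDate = "unknown"
	goVersion = "unknown"
)

// versionString formats the build metadata for display.
func versionString() string {
	return fmt.Sprintf("tokenizer %s (commit %s, built %s, %s)",
		version, commit, buildDate, goVersion)
}

func main() {
	fmt.Println(versionString())
}
```

Building with plain `go build` leaves the placeholders in place, which makes it easy to tell a development binary from a release build.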

Testing Strategy

The project uses multiple testing approaches:

- Unit tests for individual components
- Integration tests for the full tokenization pipeline
- Comparison tests against the JavaScript reference implementation
- Benchmark tests for performance monitoring
- Test vectors (100+ cases) for edge case validation

Documentation Generation

The project uses gomarkdoc with embed tags to maintain documentation:

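- generate.go files in each package configure gomarkdoc
- README files contain gomarkdoc embed markers (the <!-- gomarkdoc:embed:start --> and <!-- gomarkdoc:embed:end --> comment tags)
- Running make generate updates the documentation between those markers without overwriting manual content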

Important Notes

- When modifying the tokenizer, always run comparison tests to ensure JavaScript compatibility
- The state machine in internal/pretokenizer/state_machine.go is critical for correctness; changes require careful testing
- Build variables must be in the main package (not subpackages) for LDFLAGS to work
- The project follows conventional commit messages for automated changelog generation
- Pre-commit hooks run automatically; ensure they pass before committing
