Bit-Axon

Minimal Bits, Maximal Impulse

Bit-Axon is a 3.2B-parameter hybrid small language model engine built from the ground up for Apple Silicon. It combines Mamba-style state space models, shared-expert mixture-of-experts, and aggressive 4-bit quantization into a single architecture that runs inference on a MacBook Air M4 with 16 GB unified memory. It is built with Python and Apple's MLX framework rather than PyTorch.

Key Features

Linear complexity: Mamba-style Axon-SSM layers keep a fixed-size recurrent state instead of a KV cache, so memory stays constant as the context grows and long contexts avoid the quadratic cost of full attention
Sparse activation: 8-expert shared-expert MoE with top-2 routing, activating only ~1.4B of the 3.2B parameters per token, so more than half of the model (roughly 56%) stays idle
Aggressive quantization: NF4 inference, weight-decomposed DoRA fine-tuning, and planned TurboQuant KV cache compression to fit 64K context into under 3 GB
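
To make the sparse-activation bullet concrete, here is a minimal sketch of top-2 routing with an always-on shared expert, written in plain Python on scalars. The function names, the additive combination of shared and routed outputs, and the renormalization of the two selected weights are illustrative assumptions, not the actual code in `moe.py`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are softmax probabilities renormalized over the selected experts
    (an assumed convention for this sketch).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, shared_expert, router_logits, k=2):
    """Add the shared expert's output to the weighted outputs of k routed experts."""
    out = shared_expert(x)
    for idx, weight in route_top_k(router_logits, k):
        out += weight * experts[idx](x)
    return out

# Toy setup: 8 scalar "experts" that scale the input by 1..8 (hypothetical).
experts = [lambda x, s=s: s * x for s in range(1, 9)]
shared_expert = lambda x: 0.5 * x
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

selected = route_top_k(router_logits)
print([idx for idx, _ in selected])   # indices of the two routed experts
print(moe_forward(1.0, experts, shared_expert, router_logits))
```

Only two of the eight routed experts run for a given token, which is where the reduced active-parameter count comes from.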

Architecture Overview

Bit-Axon uses a 24-layer sandwich structure where each third of the network serves a distinct role:

Layer  1-8:  ████████████████████ Pure Axon-SSM (Linear, no KV cache)    → Context absorption
Layer  9-16: ████████████████████ SWA + MoE (Attention + Sparse)          → Deep reasoning
Layer 17-24: ████████████████████ SSM + MoE (Linear + Sparse)             → Output synthesis

The first eight layers are pure SSM, absorbing raw context with constant memory. The middle eight add sliding window attention (4K window) alongside MoE for focused reasoning. The final eight drop attention entirely, relying on SSM plus sparse experts for fast output synthesis.
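
The three-stage layout can be written down as a small lookup. This is a sketch only: `block_type` and the string labels are hypothetical names chosen for illustration and do not come from the package.

```python
def block_type(layer: int) -> str:
    """Map a 1-based layer number to its block variant, following the overview."""
    if not 1 <= layer <= 24:
        raise ValueError(f"layer must be in 1..24, got {layer}")
    if layer <= 8:
        return "ssm"        # pure Axon-SSM, no KV cache
    if layer <= 16:
        return "swa_moe"    # sliding window attention + MoE
    return "ssm_moe"        # SSM + MoE, no attention

layout = [block_type(i) for i in range(1, 25)]
attention_layers = [i for i in range(1, 25) if block_type(i) == "swa_moe"]
print(attention_layers)  # [9, 10, 11, 12, 13, 14, 15, 16]
```

Only the middle third uses attention, so only those eight layers need a KV cache.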

Model Configuration

| Parameter | Value | Notes |
|---|---|---|
| `vocab_size` | 32,000 | Tokenizer vocabulary |
| `hidden_dim` | 2,560 | Model width (d_model) |
| `num_layers` | 24 | Total transformer/SSM layers |
| `num_heads` | 32 | SWA attention heads |
| `head_dim` | 80 | 2,560 / 32 |
| `d_source_model` | 2,048 | Qwen2.5-3B bridge dimension |
| `ssm_d_state` | 16 | SSM state vector size |
| `ssm_d_conv` | 4 | SSM 1D convolution kernel |
| `ssm_expand` | 3 | SSM expansion ratio |
| `swa_window_size` | 4,096 | Sliding window attention span |
| `moe_num_experts` | 8 | MoE expert count |
| `moe_top_k` | 2 | Active experts per token |
| `moe_intermediate_dim` | 4,096 | Expert FFN dimension |
| `moe_shared_expert` | true | Shared expert always active |
| `max_seq_len` | 65,536 | Maximum context length |
| `weight_tying` | true | Embedding and output head shared |
| `rms_norm_eps` | 1e-6 | RMSNorm epsilon |
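
A few of these values are related to each other, and the relationships can be checked with a small stand-in dataclass. `ConfigSketch` is a hypothetical illustration that mirrors part of the table; it is not the package's `BitAxonConfig`, and the `ssm_inner_dim` formula assumes the usual Mamba convention of inner width = expansion ratio × model width.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConfigSketch:
    """Hypothetical stand-in mirroring the table; not the real BitAxonConfig."""
    vocab_size: int = 32_000
    hidden_dim: int = 2_560
    num_layers: int = 24
    num_heads: int = 32
    ssm_d_state: int = 16
    ssm_d_conv: int = 4
    ssm_expand: int = 3
    swa_window_size: int = 4_096
    moe_num_experts: int = 8
    moe_top_k: int = 2
    moe_intermediate_dim: int = 4_096
    max_seq_len: int = 65_536

    @property
    def head_dim(self) -> int:
        # Per-head width of the SWA layers: hidden_dim split evenly across heads.
        if self.hidden_dim % self.num_heads:
            raise ValueError("hidden_dim must be divisible by num_heads")
        return self.hidden_dim // self.num_heads

    @property
    def ssm_inner_dim(self) -> int:
        # Assumption: Mamba-style inner width = expand * model width.
        return self.ssm_expand * self.hidden_dim

cfg = ConfigSketch()
print(cfg.head_dim)                            # 80
print(cfg.ssm_inner_dim)                       # 7680
print(cfg.max_seq_len // cfg.swa_window_size)  # 16 windows span the full context
```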

Memory Budget

All figures assume a MacBook Air M4 with 16 GB unified memory and roughly 8 GB available for the model.

| Configuration | Weight Memory | Inference Memory |
|---|---|---|
| FP16 weights | ~6,400 MB | N/A (does not fit) |
| Q4 weights, 4K context | ~1,760 MB | ~2,500 MB |
| Q4 weights, 64K context | ~1,760 MB | ~2,900 MB |
| QLoRA training (4-bit) | ~1,760 MB | ~3,200–3,700 MB |
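
The weight figures follow from simple arithmetic on the 3.2B parameter count. The sketch below reproduces them and also estimates an upper bound on the sliding-window KV cache. The KV estimate rests on assumptions the table does not state: an FP16 cache, full-width keys and values in each of the eight SWA layers (no grouped-query sharing), and decimal megabytes. It does not model activations, SSM state, or runtime overhead, so it is not a breakdown of the inference-memory column.

```python
PARAMS = 3.2e9   # parameter count from the introduction
MB = 1e6         # decimal megabytes, matching the table's round numbers

def weight_mb(bits_per_param: float) -> float:
    """Weight storage in MB for a given average bit width."""
    return PARAMS * bits_per_param / 8 / MB

fp16_mb = weight_mb(16)                 # 6400.0
pure_4bit_mb = weight_mb(4)             # 1600.0
# Average bits per parameter implied by the ~1,760 MB figure; the excess
# over 4 bits would cover quantization scales and any unquantized tensors.
implied_bits = 1760 * MB * 8 / PARAMS

def swa_kv_cache_mb(context_len: int, window: int = 4096, swa_layers: int = 8,
                    hidden_dim: int = 2560, bytes_per_value: int = 2) -> float:
    """Upper bound on the SWA KV cache: keys and values for at most `window` tokens."""
    cached_tokens = min(context_len, window)
    return 2 * swa_layers * cached_tokens * hidden_dim * bytes_per_value / MB

print(fp16_mb, pure_4bit_mb)               # 6400.0 1600.0
print(round(implied_bits, 2))              # 4.4
print(round(swa_kv_cache_mb(4_096), 1))    # 335.5
print(round(swa_kv_cache_mb(65_536), 1))   # 335.5
```

Under these assumptions the attention cache stops growing once the context exceeds the 4,096-token window, which is consistent with the small gap between the 4K and 64K rows.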

Installation

pip install bit-axon

For a development install, which also pulls in pytest and pytest-xdist, run this from a clone of the repository:

pip install -e ".[dev]"

Requires Python 3.10+ and an Apple Silicon Mac. The mlx and numpy dependencies are declared in the package metadata and install automatically.

Quick Start

CLI

pip install bit-axon
bit-axon download skyoo2003/bit-axon
bit-axon run "Hello, world!"
bit-axon run --chat  # Interactive chat mode

Python API

import mlx.core as mx
from bit_axon import BitAxonConfig, BitAxonModel

config = BitAxonConfig()
model = BitAxonModel(config)

input_ids = mx.array([[1, 42, 100, 200, 500]])
logits, caches = model(input_ids)

print(f"Output shape: {logits.shape}")  # (1, 5, 32000)

The returned caches list contains KV cache objects for the SWA layers (9 through 16) and None for the layers without attention (1 through 8 and 17 through 24), since SSM layers maintain internal state without external caching.
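
A quick way to check which layers returned a cache is a helper like the one below. It is a sketch that assumes `caches` holds one entry per layer in layer order; the stand-in list uses placeholder objects so the example runs without MLX or model weights, and `layers_with_cache` is a hypothetical name.

```python
def layers_with_cache(caches):
    """Return 1-based layer numbers whose cache entry is not None."""
    return [i for i, c in enumerate(caches, start=1) if c is not None]

# Stand-in shaped like the description above: 24 entries, with placeholder
# objects for layers 9-16 and None everywhere else.
fake_caches = [object() if 9 <= i <= 16 else None for i in range(1, 25)]
print(layers_with_cache(fake_caches))  # [9, 10, 11, 12, 13, 14, 15, 16]
```

With a real model, passing the `caches` value returned by `model(input_ids)` should give the same layer numbers if the one-entry-per-layer assumption holds.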

CLI Commands

| Command | Description |
|---|---|
| `bit-axon run "prompt"` | Run LLM inference |
| `bit-axon train data.json` | Fine-tune with SFT (thermal-aware QLoRA) |
| `bit-axon quantize ./model` | Quantize model weights |
| `bit-axon merge --base-model ./model --adapter ./adapter` | Merge LoRA/DoRA adapters |
| `bit-axon benchmark` | Benchmark model performance |
| `bit-axon download [repo]` | Download model from HuggingFace Hub |

Use bit-axon <command> --help for full options.

macOS App

Bit-Axon includes a native SwiftUI app for real-time chat on Apple Silicon.

cd BitAxonApp
swift build
open BitAxonApp.xcodeproj  # or open in Xcode

Features:

Real-time token streaming
Token speed and GPU memory monitoring
Drag-and-drop fine-tuning

Project Structure

bit-axon/
├── pyproject.toml              # Build config, dependencies, test config
├── src/bit_axon/
│   ├── __init__.py             # Package version
│   ├── config.py               # BitAxonConfig dataclass
│   ├── model.py                # BitAxonModel (24-layer sandwich)
│   ├── layers/
│   │   ├── axon_ssm.py         # Mamba-style State Space Model
│   │   ├── block.py            # 3 block variants (SSM, SWA+MoE, SSM+MoE)
│   │   ├── moe.py              # Shared-Expert Mixture of Experts
│   │   ├── rms_norm.py         # RMSNorm
│   │   └── swa.py              # Sliding Window Attention
│   ├── quantization/
│   │   ├── nf4.py              # 4-bit NormalFloat quantization
│   │   ├── ternary.py          # 1.58-bit BitNet (planned)
│   │   └── turboquant.py       # TurboQuant KV cache compression (planned)
│   ├── training/
│   │   ├── lora.py             # LoRA adapter
│   │   └── dora.py             # DoRA (weight-decomposed LoRA) adapter
│   └── utils/
│       └── cache.py            # KV cache utilities
├── tests/                      # Mirrors src/bit_axon structure
└── docs/plans/                 # Development plans (EN + KO)

Roadmap

Core Primitives (Weeks 1–4): Axon-SSM, shared-expert MoE, and DoRA adapter implementations
Architecture Synthesis (Weeks 5–8): Weight porting from Qwen2.5-3B, NF4 quantization, initial benchmarks
Training (Weeks 9–14): QLoRA supervised fine-tuning with thermal-aware scheduling for sustained training on a fanless MacBook
Alignment (Weeks 15–18): ORPO preference optimization and adapter merging
Release (Weeks 19–24): CLI inference tool, SwiftUI chat application, and open-source publication

Links

GitHub: skyoo2003/bit-axon
HuggingFace: skyoo2003/bit-axon
PyPI: bit-axon

Contributing

See CONTRIBUTING.md for guidelines on submitting issues, opening pull requests, and development workflow.

License

Bit-Axon is released under the Apache License 2.0.
