
ML Fundamentals

Machine learning algorithms implemented from scratch using NumPy, verified against scikit-learn and PyTorch on canonical datasets.

Overview

This repository contains clean, educational implementations of 17 machine learning algorithms built from first principles. Each implementation uses only NumPy for numerical operations, with comprehensive test suites and interactive Streamlit visualizations. All models are benchmarked against scikit-learn to verify correctness.


Benchmark Results

All implementations are verified against scikit-learn on canonical datasets using 5-fold cross-validation.
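
This protocol is easy to reproduce with a small harness. The sketch below assumes the repo's KNN takes a k constructor argument (its real signature may differ) and scores it head-to-head with scikit-learn's KNeighborsClassifier on Iris:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from ml_models import KNN   # this repo's from-scratch implementation

def cv_accuracy(make_model, X, y, folds=5):
    """Mean accuracy over k folds for any object exposing fit/predict."""
    scores = []
    for tr, te in KFold(n_splits=folds, shuffle=True, random_state=0).split(X):
        model = make_model()
        model.fit(X[tr], y[tr])
        scores.append(np.mean(model.predict(X[te]) == y[te]))
    return float(np.mean(scores))

X, y = load_iris(return_X_y=True)
print("from-scratch KNN:", cv_accuracy(lambda: KNN(k=5), X, y))   # k=5 is an assumed arg name
print("sklearn KNN:     ", cv_accuracy(lambda: KNeighborsClassifier(n_neighbors=5), X, y))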

Benchmark figures (rendered from docs/images/):

  • Implementation status overview
  • Accuracy heatmap
  • From-scratch vs scikit-learn comparison
  • Performance gap analysis
  • Available datasets overview

Classification Performance by Dataset

| Model | Iris | Wine | Breast Cancer | Digits | Titanic | Avg |
|---|---|---|---|---|---|---|
| KNN | 96.0% | 94.9% | 96.0% | 97.7% | - | 96.2% |
| Decision Tree | 95.3% | 89.3% | 92.6% | 85.4% | - | 90.7% |
| Naive Bayes | 96.0% | 98.3% | 93.7% | 80.4% | - | 92.1% |
| Random Forest | 95.3% | 98.3% | 95.4% | 97.2% | 82.8% | 93.8% |
| Logistic Regression | - | - | - | - | 82.1% | 82.1% |
| XGBoost | - | - | - | - | 80.7% | 80.7% |
| MLP | - | - | - | 96.9% | - | 96.9% |

Clustering & Other Tasks

| Model | Dataset | Metric | Score |
|---|---|---|---|
| K-Means | Blobs | ARI | 1.000 |
| DBSCAN | Moons | ARI | 1.000 |
| Hierarchical | Blobs | ARI | 1.000 |
| PCA | Digits | Explained Variance | 100% match |
| Ridge | California Housing | MSE | 0.556 |
| Lasso | California Housing | MSE | 0.548 |

Summary

| Model | Avg Accuracy | Gap vs sklearn (negative = from-scratch ahead) | Notes |
|---|---|---|---|
| KNN | 96.2% | 0.0% | Identical to sklearn |
| Decision Tree | 90.7% | 0.7% | Within variance |
| Naive Bayes | 92.1% | -1.1% | Slightly better on some datasets |
| Random Forest | 93.8% | 0.4% | Matches sklearn closely |
| Logistic Regression | 82.1% | 0.0% | Binary classification |
| XGBoost | 80.7% | -9.5% | Outperforms sklearn GradientBoosting |
| K-Means | 100% | 0.0% | ARI = 1.0 |
| DBSCAN | 100% | 0.0% | ARI = 1.0 |
| Hierarchical | 100% | 0.0% | ARI = 1.0 |
| PCA | 100% | 0.0% | Same explained variance |
| t-SNE | N/A | N/A | Shape matches, stochastic |
| MLP | 96.9% | 0.9% | Minor tuning difference |
| Ridge | N/A | 0.0% | MSE matches exactly |
| Lasso | N/A | 0.0% | MSE matches exactly |

Algorithms Implemented

Supervised Learning - Regression

| Algorithm | Description | Key Features |
|---|---|---|
| Linear Regression | Ordinary least squares with gradient descent | Closed-form & iterative solutions |
| Ridge Regression | L2 regularization | Prevents overfitting, handles multicollinearity |
| Lasso Regression | L1 regularization | Feature selection via sparsity |
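
To make the "closed-form & iterative solutions" row concrete: both routes minimize the same squared-error loss, one via the normal equations and one via gradient descent. A minimal NumPy sketch, independent of the repo's class API:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)
Xb = np.c_[np.ones(len(X)), X]               # prepend a bias column

# Closed form: solve the normal equations (X^T X) w = X^T y
w_closed = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Iterative: gradient descent on the mean squared error
w = np.zeros(Xb.shape[1])
for _ in range(5000):
    w -= 0.1 * (2 / len(y)) * Xb.T @ (Xb @ w - y)   # step along -d(MSE)/dw

print(np.allclose(w, w_closed, atol=1e-3))   # both arrive at the same weights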

Supervised Learning - Classification

| Algorithm | Description | Key Features |
|---|---|---|
| Logistic Regression | Binary/multiclass classification | Sigmoid activation, gradient descent |
| K-Nearest Neighbors | Instance-based learning | Distance metrics, weighted voting |
| Decision Tree | Recursive partitioning | Gini impurity, entropy, max depth control |
| Random Forest | Ensemble of decision trees | Bootstrap aggregating, feature subsampling |
| Naive Bayes | Probabilistic classifier | Gaussian likelihood, Bayes theorem |
| Support Vector Machine | Maximum margin classifier | Linear & RBF kernels, SMO algorithm |
| XGBoost | Gradient boosting | Second-order gradients, regularization |
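
As one concrete piece of the above, the Gini impurity that scores candidate decision-tree splits takes only a few lines of NumPy. A sketch of the standard formula (the repo's internal helper may be organized differently):

import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_c p_c**2; 0 for a pure node, higher when mixed."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(np.array([0, 0, 0, 0])))   # 0.0  (pure node)
print(gini(np.array([0, 0, 1, 1])))   # 0.5  (maximally mixed, two classes)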

Neural Networks

| Algorithm | Description | Key Features |
|---|---|---|
| Perceptron | Single-layer classifier | Online learning, linear decision boundary |
| Multi-Layer Perceptron | Deep feedforward network | Backpropagation, ReLU/sigmoid, Adam optimizer |
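
The perceptron's online update rule is compact enough to sketch in full: each misclassified point nudges the weights toward itself. A minimal version, assuming labels in {-1, +1}:

import numpy as np

def perceptron_train(X, y, epochs=20, lr=1.0):
    """Online perceptron; y must be in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
                w += lr * yi * xi        # nudge the boundary toward this point
                b += lr * yi
    return w, b

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = np.where(X[:, 0] > 0, 1, -1)
X[:, 0] += 0.5 * y                       # push the classes apart: guarantees a margin
w, b = perceptron_train(X, y)
print(np.mean(np.sign(X @ w + b) == y))  # 1.0 on linearly separable data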

Clustering

| Algorithm | Description | Key Features |
|---|---|---|
| K-Means | Centroid-based clustering | Lloyd's algorithm, k-means++ initialization |
| DBSCAN | Density-based clustering | No predefined k, handles arbitrary shapes |
| Hierarchical Clustering | Agglomerative clustering | Single/complete/average linkage, dendrogram |
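
Lloyd's algorithm, the core of K-Means, alternates between assigning points to their nearest centroid and moving each centroid to the mean of its points. A minimal sketch with plain random initialization (k-means++ seeding, which the repo's KMeans uses, refines the first step; empty clusters are not handled here):

import numpy as np
from sklearn.datasets import make_blobs

def lloyd(X, k, iters=100, seed=0):
    """Plain Lloyd iterations starting from a random sample of k data points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):          # converged
            break
        centroids = new
    return centroids, labels

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
centroids, labels = lloyd(X, k=3)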

Dimensionality Reduction

| Algorithm | Description | Key Features |
|---|---|---|
| PCA | Principal Component Analysis | Eigendecomposition, variance explained |
| t-SNE | t-distributed Stochastic Neighbor Embedding | Non-linear, preserves local structure |
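
The PCA row above comes down to an eigendecomposition of the covariance matrix. A minimal sketch whose explained-variance ratios should match scikit-learn's explained_variance_ratio_ on the same data:

import numpy as np
from sklearn.datasets import load_digits

def pca(X, n_components):
    """PCA via eigendecomposition of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)                      # center the data
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: ascending eigenvalues for symmetric input
    order = np.argsort(eigvals)[::-1][:n_components]
    explained = eigvals[order] / eigvals.sum()   # fraction of total variance per component
    return Xc @ eigvecs[:, order], explained

X, _ = load_digits(return_X_y=True)
Z, ratios = pca(X, n_components=2)
print(ratios)                                    # compare against sklearn's explained_variance_ratio_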

Interactive Visualizations

Explore each algorithm with interactive Streamlit dashboards:

# Launch the visualization dashboard
ml-viz

Available Visualizations

  • Linear Regression: Fit lines, view residuals, compare regularization
  • KNN: Visualize decision boundaries, adjust k
  • K-Means: Watch clustering iterations, compare initializations
  • PCA: Explore variance explained, visualize projections
  • Decision Tree: View splits, adjust hyperparameters
  • Random Forest: Compare individual trees vs ensemble
  • Logistic Regression: Binary classification boundaries
  • Naive Bayes: Class distributions and predictions
  • SVM: Support vectors, kernel comparisons
  • Perceptron: Linear classifier training
  • MLP: Network architecture, training curves
  • DBSCAN: Density-based cluster discovery
  • Hierarchical Clustering: Dendrograms, linkage methods
  • t-SNE: High-dimensional data visualization
  • XGBoost: Boosting iterations, feature importance

RAG Pipeline (Bonus)

A complete Retrieval-Augmented Generation implementation:

from rag import RAGPipeline

pipeline = RAGPipeline()
pipeline.add_documents(["doc1.pdf", "doc2.pdf"])
response = pipeline.query("What is the main topic?")
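
Internally, a pipeline like this reduces to embedding document chunks and ranking them against the query embedding. A sketch of the retrieval step by cosine similarity, assuming the vectors come from whatever model rag/embeddings.py wraps:

import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Indices of the k chunks most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    C = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    return np.argsort(C @ q)[::-1][:k]           # best-matching chunks first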

Installation

# Clone the repository
git clone https://github.com/jwlutz/ml_fundamentals.git
cd ml_fundamentals

# Install core dependencies
pip install -e .

# For RAG functionality
pip install -e ".[rag]"

# For PyTorch benchmarks
pip install -e ".[benchmarks]"

Usage

As a Library

from ml_models import (
    LinearRegression, LogisticRegression,
    KNN, DecisionTree, RandomForest,
    KMeans, DBSCAN, PCA, MLP
)

# Example: Train a decision tree
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)

tree = DecisionTree(max_depth=5, criterion='gini')
tree.fit(X, y)
predictions = tree.predict(X)

# Example: K-Means clustering
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=3)

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
labels = kmeans.predict(X)

# Example: Neural network
# (assumes MNIST-style X_train/y_train loaded separately:
#  784-dimensional inputs, 10 classes)
mlp = MLP(
    layer_sizes=[784, 128, 64, 10],
    activation='relu',
    learning_rate=0.001
)
mlp.fit(X_train, y_train, epochs=100, batch_size=32)

Running Benchmarks

# Quick benchmark (~30 seconds, no downloads)
python benchmarks/quick_benchmark.py

# Full sklearn comparison (~2 minutes)
python benchmarks/sklearn_comparison.py

# Comprehensive with all datasets (~5+ minutes)
python benchmarks/comprehensive_benchmark.py

# Compare MLP against PyTorch
python benchmarks/pytorch_mnist_comparison.py

# Generate visualization images
python benchmarks/generate_visualizations.py

Running Tests

# Run all tests
pytest tests/

# Run specific model tests
pytest tests/test_decision_tree.py -v

# Run with coverage
pytest tests/ --cov=ml_models

Project Structure

ml_fundamentals/
├── ml_models/                 # Core implementations
│   ├── linear_regression.py
│   ├── ridge_regression.py
│   ├── lasso_regression.py
│   ├── logistic_regression.py
│   ├── knn.py
│   ├── decision_tree.py
│   ├── random_forest.py
│   ├── naive_bayes.py
│   ├── svm.py
│   ├── xgboost.py
│   ├── perceptron.py
│   ├── mlp.py
│   ├── kmeans.py
│   ├── dbscan.py
│   ├── hierarchical_clustering.py
│   ├── pca.py
│   └── tsne.py
├── visualizers/               # Streamlit dashboards
│   ├── ml_app.py             # Main entry point
│   └── pages/                # Individual model visualizations
├── rag/                       # RAG pipeline
│   ├── pipeline.py
│   ├── chunking.py
│   ├── embeddings.py
│   └── retrieval.py
├── benchmarks/                # Performance comparisons
│   ├── sklearn_comparison.py
│   ├── pytorch_mnist_comparison.py
│   ├── visualize.py
│   └── datasets.py
├── tests/                     # Unit tests
├── docs/images/               # Benchmark visualizations
└── pyproject.toml

Datasets Used for Validation

| Dataset | Samples | Features | Task | Difficulty |
|---|---|---|---|---|
| Iris | 150 | 4 | Classification (3 classes) | Easy |
| Wine | 178 | 13 | Classification (3 classes) | Medium |
| Banknote | 1,372 | 4 | Binary Classification | Easy |
| Breast Cancer | 569 | 30 | Binary Classification | Medium |
| Titanic | 891 | 7 | Binary Classification | Medium |
| Digits | 1,797 | 64 | Classification (10 classes) | Medium |
| Spam | 4,601 | 57 | Binary Classification | Medium |
| Adult Income | 48,842 | 14 | Binary Classification | Hard |
| California Housing | 20,640 | 8 | Regression | Medium |
| Covertype | 581,012 | 54 | Classification (7 classes) | Hard |
| Synthetic Hard | 2,000 | 20 | Classification (4 classes) | Hard |
| Blobs | 300 | 2 | Clustering | Easy |
| Moons | 500 | 2 | Clustering | Easy |
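
Most of these ship with scikit-learn. For reference, typical loaders look like the following (the repo's own loading code lives in benchmarks/datasets.py and may differ):

from sklearn.datasets import (
    load_iris, load_wine, load_breast_cancer, load_digits,
    fetch_california_housing, make_blobs, make_moons,
)

X, y = load_wine(return_X_y=True)                 # 178 samples, 13 features, 3 classes
X, y = fetch_california_housing(return_X_y=True)  # 20,640 samples, regression (downloads once)
X, y = make_moons(n_samples=500, noise=0.05)      # synthetic clustering data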

Design Principles

  1. Educational First: Code prioritizes readability over optimization
  2. NumPy Only: Core algorithms use only NumPy (no ML frameworks)
  3. Tested Against Production: All models verified against scikit-learn
  4. Interactive Learning: Every algorithm has a visual dashboard

License

MIT License
