Machine learning algorithms implemented from scratch using NumPy, verified against scikit-learn and PyTorch on canonical datasets.
## Overview
This repository contains clean, educational implementations of 18 machine learning algorithms built from first principles. Each implementation uses only NumPy for numerical operations, with comprehensive test suites and interactive Streamlit visualizations. All models are benchmarked against scikit-learn to verify correctness.
## Benchmark Results
All implementations are verified against scikit-learn on canonical datasets using 5-fold cross-validation.
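The verification pattern is simple: run the from-scratch model and the corresponding scikit-learn estimator on the same folds and compare mean accuracy. The sketch below illustrates that idea with a tiny NumPy k-NN; it is only an illustration of the approach, not the repository's actual test harness or class API.

```python
# A minimal sketch (not the repository's actual test suite) of checking a
# from-scratch model against scikit-learn with 5-fold cross-validation.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier


def knn_predict(X_train, y_train, X_test, k=5):
    """Plain NumPy k-NN: majority vote among the k closest training points."""
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])


X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scratch, reference = [], []
for tr, te in cv.split(X, y):
    scratch.append(np.mean(knn_predict(X[tr], y[tr], X[te]) == y[te]))
    clf = KNeighborsClassifier(n_neighbors=5).fit(X[tr], y[tr])
    reference.append(clf.score(X[te], y[te]))

print(f"from-scratch KNN: {np.mean(scratch):.3f}  sklearn KNN: {np.mean(reference):.3f}")
```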
*Benchmark figures: Implementation Status, Accuracy Heatmap, From-Scratch vs Scikit-Learn, Performance Gap Analysis, Available Datasets.*
### Classification Performance by Dataset

| Model | Iris | Wine | Breast Cancer | Digits | Titanic | Avg |
|---|---|---|---|---|---|---|
| KNN | 96.0% | 94.9% | 96.0% | 97.7% | - | 96.2% |
| Decision Tree | 95.3% | 89.3% | 92.6% | 85.4% | - | 90.7% |
| Naive Bayes | 96.0% | 98.3% | 93.7% | 80.4% | - | 92.1% |
| Random Forest | 95.3% | 98.3% | 95.4% | 97.2% | 82.8% | 93.8% |
| Logistic Regression | - | - | - | - | 82.1% | 82.1% |
| XGBoost | - | - | - | - | 80.7% | 80.7% |
| MLP | - | - | - | 96.9% | - | 96.9% |
### Clustering & Other Tasks

| Model | Dataset | Metric | Score |
|---|---|---|---|
| K-Means | Blobs | ARI | 1.000 |
| DBSCAN | Moons | ARI | 1.000 |
| Hierarchical | Blobs | ARI | 1.000 |
| PCA | Digits | Explained Variance | 100% match |
| Ridge | California Housing | MSE | 0.556 |
| Lasso | California Housing | MSE | 0.548 |
### Summary

A negative gap means the from-scratch implementation scored higher than its scikit-learn counterpart.

| Model | Avg Accuracy | Gap vs sklearn | Notes |
|---|---|---|---|
| KNN | 96.2% | 0.0% | Identical to sklearn |
| Decision Tree | 90.7% | 0.7% | Within variance |
| Naive Bayes | 92.1% | -1.1% | Slightly better on some datasets |
| Random Forest | 93.8% | 0.4% | Matches sklearn closely |
| Logistic Regression | 82.1% | 0.0% | Binary classification |
| XGBoost | 80.7% | -9.5% | Outperforms sklearn GradientBoosting |
| K-Means | 100% | 0.0% | ARI = 1.0 |
| DBSCAN | 100% | 0.0% | ARI = 1.0 |
| Hierarchical | 100% | 0.0% | ARI = 1.0 |
| PCA | 100% | 0.0% | Same explained variance |
| t-SNE | N/A | N/A | Shape matches, stochastic |
| MLP | 96.9% | 0.9% | Minor tuning difference |
| Ridge | N/A | 0.0% | MSE matches exactly |
| Lasso | N/A | 0.0% | MSE matches exactly |
## Algorithms Implemented

### Supervised Learning - Regression

| Algorithm | Description | Key Features |
|---|---|---|
| Linear Regression | Ordinary least squares with gradient descent | Closed-form & iterative solutions |
| Ridge Regression | L2 regularization | Prevents overfitting, handles multicollinearity |
| Lasso Regression | L1 regularization | Feature selection via sparsity |
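As an illustration of the "closed-form & iterative solutions" noted above, the normal-equation form of ordinary least squares and its ridge variant fit in a few lines of NumPy. This is a sketch with illustrative function names, not the repository's actual classes:

```python
# Illustrative closed-form fits; the repository wraps these ideas in
# fit/predict classes rather than free functions.
import numpy as np


def linear_regression_fit(X, y):
    """Ordinary least squares via the normal equation: w = (X^T X)^-1 X^T y."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a bias column
    return np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y


def ridge_fit(X, y, alpha=1.0):
    """Ridge regression: solve (X^T X + alpha*I) w = X^T y, intercept unpenalized."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0                              # do not shrink the intercept
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3 + rng.normal(scale=0.1, size=100)
print(linear_regression_fit(X, y))   # approximately [0.3, 2.0, -1.0, 0.5]
print(ridge_fit(X, y, alpha=1.0))    # slightly shrunken coefficients
```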
### Supervised Learning - Classification

| Algorithm | Description | Key Features |
|---|---|---|
| Logistic Regression | Binary/multiclass classification | Sigmoid activation, gradient descent |
| K-Nearest Neighbors | Instance-based learning | Distance metrics, weighted voting |
| Decision Tree | Recursive partitioning | Gini impurity, entropy, max depth control |
| Random Forest | Ensemble of decision trees | Bootstrap aggregating, feature subsampling |
| Naive Bayes | Probabilistic classifier | Gaussian likelihood, Bayes' theorem |
| Support Vector Machine | Maximum margin classifier | Linear & RBF kernels, SMO algorithm |
| XGBoost | Gradient boosting | Second-order gradients, regularization |
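To make the Naive Bayes row above concrete: a Gaussian Naive Bayes classifier only needs per-class means, variances, and priors, combined through Bayes' theorem in log space under a feature-independence assumption. The sketch below is illustrative and does not mirror the repository's class-based API:

```python
# Illustrative Gaussian Naive Bayes in plain NumPy.
import numpy as np
from sklearn.datasets import load_wine


def gaussian_nb_fit(X, y):
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])
    return classes, priors, means, variances


def gaussian_nb_predict(X, classes, priors, means, variances):
    # log P(c | x) is proportional to log P(c) + sum_j log N(x_j; mu_cj, var_cj)
    log_likelihood = -0.5 * (
        np.log(2 * np.pi * variances)[None, :, :]
        + (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
    ).sum(axis=2)
    log_posterior = np.log(priors)[None, :] + log_likelihood
    return classes[np.argmax(log_posterior, axis=1)]


X, y = load_wine(return_X_y=True)
params = gaussian_nb_fit(X, y)
preds = gaussian_nb_predict(X, *params)
print(f"training accuracy: {np.mean(preds == y):.3f}")
```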
### Neural Networks

| Algorithm | Description | Key Features |
|---|---|---|
| Perceptron | Single-layer classifier | Online learning, linear decision boundary |
| Multi-Layer Perceptron | Deep feedforward network | Backpropagation, ReLU/sigmoid, Adam optimizer |
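The perceptron's online learning rule listed above is small enough to show in full: whenever a sample is misclassified, nudge the weights toward the correct side of the boundary. A minimal sketch (illustrative function name, ±1 labels assumed):

```python
# Illustrative perceptron with the classic online update rule.
import numpy as np


def perceptron_fit(X, y, lr=0.1, epochs=20):
    """y must be labeled +1 / -1; returns weights including a bias term."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if yi * (xi @ w) <= 0:      # misclassified (or on the boundary)
                w += lr * yi * xi       # move the boundary toward xi's class
    return w


rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
w = perceptron_fit(X, y)
preds = np.sign(np.hstack([np.ones((100, 1)), X]) @ w)
print(f"accuracy on linearly separable toy data: {np.mean(preds == y):.2f}")
```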
### Clustering

| Algorithm | Description | Key Features |
|---|---|---|
| K-Means | Centroid-based clustering | Lloyd's algorithm, k-means++ initialization |
| DBSCAN | Density-based clustering | No predefined k, handles arbitrary shapes |
| Hierarchical Clustering | Agglomerative clustering | Single/complete/average linkage, dendrogram |
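The K-Means row above refers to Lloyd's algorithm: alternate between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. A minimal sketch follows (random initialization instead of k-means++ to keep it short; illustrative only, not the repository's API):

```python
# Illustrative Lloyd's algorithm for K-Means in plain NumPy.
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: centroids move to the mean of their assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in [(-2, -2), (0, 2), (3, 0)]])
labels, centroids = kmeans(X, k=3)
print(centroids.round(2))
```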
### Dimensionality Reduction

| Algorithm | Description | Key Features |
|---|---|---|
| PCA | Principal Component Analysis | Eigendecomposition, variance explained |
| t-SNE | t-distributed Stochastic Neighbor Embedding | Non-linear, preserves local structure |
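For PCA, the "eigendecomposition, variance explained" entry above amounts to diagonalizing the covariance matrix of the centered data and projecting onto the top eigenvectors. A minimal sketch, not the repository's API:

```python
# Illustrative PCA via eigendecomposition of the covariance matrix.
import numpy as np
from sklearn.datasets import load_digits


def pca(X, n_components):
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)             # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]   # largest variance first
    components = eigvecs[:, order]
    explained_ratio = eigvals[order] / eigvals.sum()
    return X_centered @ components, explained_ratio


X, _ = load_digits(return_X_y=True)
Z, ratio = pca(X, n_components=2)
print(Z.shape, ratio.round(3))   # (1797, 2) and the first two variance ratios
```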
## Interactive Visualizations
Explore each algorithm with interactive Streamlit dashboards:
```bash
# Launch the visualization dashboard
ml-viz
```
### Available Visualizations
- **Linear Regression**: Fit lines, view residuals, compare regularization