Commit 2cd2d28

Author: Fitsum — committed: Add beautiful documentation with Docsify
1 parent 3863ced · commit 2cd2d28

4 files changed

Lines changed: 471 additions & 0 deletions

File tree

docs/.nojekyll

Whitespace-only changes.

docs/README.md

Lines changed: 276 additions & 0 deletions
# 🚀 Reproducible Machine Learning Pipeline with Real-Time Serving

## 🎯 Project Overview

**Solving the ML Reproducibility Crisis**

Built to explore and solve real ML engineering challenges. This project demonstrates how modern MLOps tools can transform chaotic ML workflows into professional, reproducible systems with production-ready model serving.

**Problem solved:** *"It works on my machine"* → *"It works everywhere, reliably"*

**Tools**: DVC, MLflow, DagHub, Docker, Flask, Scikit-learn
**Dataset**: Pima Indians Diabetes Database

## 🏗️ Architecture Overview

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Data Pipeline  │    │ Model Training  │    │ Model Serving   │
│                 │    │                 │    │                 │
│ DVC Versioning  │───▶│ MLflow Tracking │───▶│ Flask API +     │
│ Automated Prep  │    │ Hyperparameter  │    │ Live Dashboard  │
│                 │    │ Optimization    │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
        │                      │                      │
        ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│     DagHub      │    │   Experiment    │    │     Docker      │
│   Integration   │    │   Comparison    │    │Containerization │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```

**Complete MLOps workflow from raw data to production-ready API endpoint.**

![Architecture](../img/architecture.png)

## 📁 Project Structure

```
ML_Pipeline/
├── 📁 .dvc/              # DVC configuration and cache
├── 📊 Data/              # Dataset storage
├── 📷 img/               # Images and screenshots
├── 🧠 models/            # Trained models and artifacts
├── 🌐 serve/             # Model serving components
├── 📜 src/               # Source code
├── 🧪 tests/             # Testing suite
├── 🐍 venv/              # Virtual environment
├── 🐳 .dockerignore      # Docker ignore rules
├── 🔒 .env               # Environment variables (MLflow credentials)
├── 📋 .gitignore         # Git ignore rules
├── 🐳 Dockerfile         # Main pipeline container
├── 🔒 dvc.lock           # DVC lock file
├── 📄 dvc.yaml           # Pipeline definition
├── 📄 params.yaml        # Hyperparameters & settings
├── 📄 README.md          # This documentation
└── 📄 requirements.txt   # Python dependencies
```

![Project Structure](../img/Project%20structure.png)

## 🚀 Quick Start

### **Prerequisites**
- Docker installed
- Git repository access
- Python 3.10+ (for local development)

### **1. Clone and Setup**
```bash
git clone https://github.com/fitsblb/ML_Pipeline.git
cd ML_Pipeline

# Create environment file
echo "MLFLOW_TRACKING_URI=https://dagshub.com/fitsblb/ML_Pipeline.mlflow" > .env
echo "MLFLOW_TRACKING_USERNAME=your_username" >> .env
echo "MLFLOW_TRACKING_PASSWORD=your_token" >> .env
```
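
Inside the container these credentials arrive via Docker's `--env-file`; for local development, a minimal sketch of how the pipeline code might load the same `.env` file (illustrative only — the project may use `python-dotenv` instead; `load_env` is a hypothetical helper):

```python
import os

def load_env(path=".env"):
    """Parse simple KEY=VALUE lines into os.environ (no quoting rules)."""
    env = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks, comments, and malformed lines
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    os.environ.update(env)
    return env

# Usage: load_env() — MLflow then picks up MLFLOW_TRACKING_URI automatically.
```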

### **2. Run Complete Pipeline**
```bash
# Build and run the ML pipeline
docker build -t ml-pipeline .

# %cd% is Windows cmd syntax; use $(pwd) on macOS/Linux
docker run --rm -v %cd%:/app --env-file .env ml-pipeline
```

### **3. Deploy Model Server**
```bash
# Build serving container
docker build -t model-server -f serve/Dockerfile .

# Launch API server with live dashboard
docker run -p 8000:8000 model-server
```

### **4. Access Live Dashboard**
Open a browser at **http://localhost:8000**

![Live Dashboard](../img/host_output.png)

## 🧪 Testing the API

### **Health Check**
```bash
curl http://localhost:8000
# Response: "Model is ready to predict!"
```

### **Single Prediction**
```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "Pregnancies": 6,
    "Glucose": 148,
    "BloodPressure": 72,
    "SkinThickness": 35,
    "Insulin": 0,
    "BMI": 33.6,
    "DiabetesPedigreeFunction": 0.627,
    "Age": 50
  }'
```

**Response** (`1` = diabetes risk, `0` = no diabetes):
```json
{"prediction": [1]}
```
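
The same request from Python, using only the standard library (the endpoint and field names are taken from the curl example above; `predict` is an illustrative helper, not part of the project code):

```python
import json
from urllib import request

PATIENT = {
    "Pregnancies": 6, "Glucose": 148, "BloodPressure": 72,
    "SkinThickness": 35, "Insulin": 0, "BMI": 33.6,
    "DiabetesPedigreeFunction": 0.627, "Age": 50,
}

def predict(features, url="http://localhost:8000/predict"):
    """POST one feature dict to the serving container and return parsed JSON."""
    req = request.Request(
        url,
        data=json.dumps(features).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# With the server from step 3 running:
# predict(PATIENT)  # e.g. {"prediction": [1]}
```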

### **Multiple Test Cases**
```bash
# Run comprehensive testing
python tests/test_multiple.py
```

![Terminal Output](../img/terminal_output.png)

## 🛠️ Technical Stack

| Tool | Purpose | Why This Tool |
|------|---------|---------------|
| **DVC** | Data & Model Versioning | Git for data - track large files efficiently |
| **MLflow** | Experiment Tracking | Compare runs, track metrics, log models |
| **DagHub** | Remote Collaboration | GitHub for ML - share data and experiments |
| **Scikit-learn** | ML Framework | Random Forest with hyperparameter tuning |
| **Flask** | Model Serving | Lightweight REST API with live dashboard |
| **Docker** | Containerization | Identical environments from dev to production |
| **Python** | Implementation | Pandas, YAML configuration, pickle serialization |
## ✨ Key Features
148+
149+
### 🔁 **Complete Reproducibility**
150+
- One command reproduces entire pipeline
151+
- Deterministic results with fixed random seeds
152+
- Version-controlled data, code, and models
153+
154+
### 📊 **Smart Dependency Tracking**
155+
- Automatic recomputation when dependencies change
156+
- Skip unchanged stages for efficiency
157+
- Clear visualization of pipeline dependencies
158+
159+
### 🎛️ **Configurable Parameters**
160+
- Centralized configuration in `params.yaml`
161+
- Easy experimentation with different settings
162+
- Parameter versioning and tracking
163+
164+
### 🚀 **Experiment Management**
165+
- Track all runs with MLflow
166+
- Compare model performance across experiments
167+
- Remote experiment sharing via DagHub
168+
169+
### 👥 **Team Collaboration**
170+
- Shared data storage with DVC
171+
- Reproducible environments
172+
- Clear pipeline documentation
173+
174+
## 🔄 Pipeline Stages

### 1️⃣ **Preprocessing Stage**
```bash
python src/preprocess.py
```
- Loads raw diabetes dataset
- Handles missing values and outliers
- Feature scaling and engineering
- Outputs cleaned dataset
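
A sketch of the missing-value step above. In the Pima dataset, zeros in several columns encode missing measurements; the exact rules live in `src/preprocess.py`, so treat this as an illustration, not the project's code:

```python
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Treat physiologically impossible zeros as missing, impute the median."""
    zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
    df = df.copy()
    # Zeros in these columns mean "not measured", not a real reading
    df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)
    df[zero_as_missing] = df[zero_as_missing].fillna(df[zero_as_missing].median())
    return df
```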

### 2️⃣ **Training Stage**
```bash
python src/train.py
```
- Hyperparameter tuning with GridSearchCV
- Random Forest model training
- MLflow experiment logging
- Model serialization
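
The core of that stage, sketched under the assumptions the README states (Random Forest + GridSearchCV; `tune` is a hypothetical name, and the MLflow calls are shown as comments):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def tune(X, y, grid, random_state=42, cv=5):
    """Exhaustive grid search over a Random Forest; returns the fitted search."""
    search = GridSearchCV(
        RandomForestClassifier(random_state=random_state),
        param_grid=grid,
        cv=cv,
        n_jobs=-1,  # use all cores
    )
    search.fit(X, y)
    return search

# The real train.py additionally logs to MLflow, roughly:
#   mlflow.log_params(search.best_params_)
#   mlflow.sklearn.log_model(search.best_estimator_, "model")
```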

### 3️⃣ **Evaluation Stage**
```bash
python src/evaluate.py
```
- Model performance evaluation
- Metrics calculation and logging
- Results visualization
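
A minimal sketch of the metrics step (the exact metric set in `src/evaluate.py` may differ; these are typical choices for a binary classifier):

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(model, X_test, y_test):
    """Compute headline metrics on held-out data; evaluate.py logs these."""
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
    return {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
    }
```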

## 📊 Results

The pipeline achieves:
- **Accuracy**: ~85% on diabetes prediction
- **Reproducibility**: 100% identical results across runs
- **Efficiency**: Only recomputes changed stages
- **Scalability**: Easy to add new features or models

## 🔧 Experimentation

Modify parameters in `params.yaml` and rerun:

```yaml
train:
  random_state: 42
  hyperparameter_grid:
    n_estimators: [100, 200, 300]    # Try more trees
    max_depth: [5, 10, 15, null]     # Experiment with depth
    min_samples_split: [2, 5, 10]    # Adjust splitting
```

```bash
# Pipeline automatically detects parameter changes
dvc repro  # Only reruns affected stages
```
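
How those parameters reach the training code — a sketch assuming PyYAML and the `params.yaml` structure shown above (`load_params` is an illustrative helper):

```python
import yaml

def load_params(path="params.yaml", section="train"):
    """Read one section of the DVC-tracked parameter file."""
    with open(path) as fh:
        return yaml.safe_load(fh)[section]

# params = load_params()
# grid = params["hyperparameter_grid"]   # fed straight into GridSearchCV
# seed = params["random_state"]
```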

## 🎓 Key Learnings

### **Technical Insights:**
- DVC transforms chaotic data workflows into organized pipelines
- MLflow provides crucial experiment tracking for ML iterations
- Proper dependency management eliminates "works on my machine" issues
- Parameter externalization enables systematic experimentation

### **MLOps Best Practices:**
- Version control everything: data, code, models, and configurations
- Automate pipeline execution and dependency tracking
- Centralize experiment tracking for team collaboration
- Design for reproducibility from day one

### **Engineering Principles:**
- Separation of concerns: preprocessing, training, evaluation
- Configuration-driven development with `params.yaml`
- Fail-fast validation with proper error handling
- Documentation-as-code approach

## 🛡️ Production Considerations

This pipeline demonstrates production-ready practices:
- **Environment isolation** with `requirements.txt`
- **Credential management** with environment variables
- **Error handling** and logging throughout
- **Modular design** for easy maintenance and testing
- **CI/CD readiness** with single-command execution

## 🔮 Future Enhancements

- [ ] Add automated model validation
- [ ] Implement A/B testing framework
- [ ] Add data drift detection
- [ ] Create web API for model serving
- [ ] Add automated model retraining triggers

---

## 🙏 Acknowledgments

Built to explore and solve real ML engineering challenges. This project demonstrates how modern MLOps tools can transform chaotic ML workflows into professional, reproducible systems.

**Tools**: DVC, MLflow, DagHub, Scikit-learn
**Dataset**: Pima Indians Diabetes Database

---

*"In ML, reproducibility isn't a nice-to-have—it's the foundation of trustworthy AI systems."*

docs/_sidebar.md

Lines changed: 15 additions & 0 deletions
<!-- _sidebar.md -->

* [🏠 Home](/)
* [🏗️ Architecture](/#-architecture-overview)
* [📁 Project Structure](/#-project-structure)
* [🚀 Quick Start](/#-quick-start)
* [🧪 Testing the API](/#-testing-the-api)
* [🛠️ Technical Stack](/#-technical-stack)
* [✨ Key Features](/#-key-features)
* [🔄 Pipeline Stages](/#-pipeline-stages)
* [📊 Results](/#-results)
* [🔧 Experimentation](/#-experimentation)
* [🎓 Key Learnings](/#-key-learnings)
* [🛡️ Production Considerations](/#-production-considerations)
* [🔮 Future Enhancements](/#-future-enhancements)
