# 🚀 Reproducible Machine Learning Pipeline with Real-Time Serving

## 🎯 Project Overview

**Solving the ML Reproducibility Crisis**

Built to explore and solve real ML engineering challenges. This project demonstrates how modern MLOps tools can transform chaotic ML workflows into professional, reproducible systems with production-ready model serving.

**Problem Solved:** *"It works on my machine"* → *"It works everywhere, reliably"*

**Tools**: DVC, MLflow, DagHub, Docker, Flask, Scikit-learn
**Dataset**: Pima Indians Diabetes Database

## 🏗️ Architecture Overview

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Data Pipeline │    │  Model Training │    │  Model Serving  │
│                 │    │                 │    │                 │
│ DVC Versioning  │───▶│ MLflow Tracking │───▶│  Flask API +    │
│ Automated Prep  │    │ Hyperparameter  │    │ Live Dashboard  │
│                 │    │ Optimization    │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│     DagHub      │    │   Experiment    │    │     Docker      │
│   Integration   │    │   Comparison    │    │ Containerization│
└─────────────────┘    └─────────────────┘    └─────────────────┘
```

**Complete MLOps workflow from raw data to production-ready API endpoint.**


## 📁 Project Structure

```
ML_Pipeline/
├── 📁 .dvc/                    # DVC configuration and cache
├── 📊 Data/                    # Dataset storage
├── 📷 img/                     # Images and screenshots
├── 🧠 models/                  # Trained models and artifacts
├── 🌐 serve/                   # Model serving components
├── 📜 src/                     # Source code
├── 🧪 tests/                   # Testing suite
├── 🐍 venv/                    # Virtual environment
├── 🐳 .dockerignore            # Docker ignore rules
├── 🔒 .env                     # Environment variables (MLflow credentials)
├── 📋 .gitignore               # Git ignore rules
├── 🐳 Dockerfile               # Main pipeline container
├── 🔒 dvc.lock                 # DVC lock file
├── 📄 dvc.yaml                 # Pipeline definition
├── 📄 params.yaml              # Hyperparameters & settings
├── 📄 README.md                # This documentation
└── 📄 requirements.txt         # Python dependencies
```

## 🚀 Quick Start

### **Prerequisites**
- Docker installed
- Git repository access
- Python 3.10+ (for local development)

### **1. Clone and Setup**
```bash
git clone https://github.com/fitsblb/ML_Pipeline.git
cd ML_Pipeline

# Create environment file
echo "MLFLOW_TRACKING_URI=https://dagshub.com/fitsblb/ML_Pipeline.mlflow" > .env
echo "MLFLOW_TRACKING_USERNAME=your_username" >> .env
echo "MLFLOW_TRACKING_PASSWORD=your_token" >> .env
```

### **2. Run Complete Pipeline**
```bash
# Build and run the ML pipeline
docker build -t ml-pipeline .
docker run --rm -v "$(pwd)":/app --env-file .env ml-pipeline  # on Windows cmd, use %cd% instead of $(pwd)
```

### **3. Deploy Model Server**
```bash
# Build serving container
docker build -t model-server -f serve/Dockerfile .

# Launch API server with live dashboard
docker run -p 8000:8000 model-server
```

### **4. Access Live Dashboard**
Open browser: **http://localhost:8000**
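The serving code lives in `serve/`; as a rough sketch of what such a Flask app can look like (the app factory, model path, and route bodies here are illustrative, not the project's actual code):

```python
# serve_sketch.py - illustrative Flask serving app (not the actual serve/ code)
import pickle

from flask import Flask, jsonify, request

# Feature order must match the columns the model was trained on.
FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]

def create_app(model):
    """Build the API around any object with a sklearn-style predict()."""
    app = Flask(__name__)

    @app.route("/")
    def health():
        return "Model is ready to predict!"

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json()
        row = [[payload[f] for f in FEATURES]]  # keep feature order fixed
        return jsonify({"prediction": [int(p) for p in model.predict(row)]})

    return app

if __name__ == "__main__":
    with open("models/model.pkl", "rb") as fh:  # hypothetical artifact path
        app = create_app(pickle.load(fh))
    app.run(host="0.0.0.0", port=8000)
```

Taking the model as a constructor argument keeps the routes testable with a stub model, without needing a trained artifact on disk.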


## 🧪 Testing the API

### **Health Check**
```bash
curl http://localhost:8000
# Response: "Model is ready to predict!"
```

### **Single Prediction**
```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "Pregnancies": 6,
    "Glucose": 148,
    "BloodPressure": 72,
    "SkinThickness": 35,
    "Insulin": 0,
    "BMI": 33.6,
    "DiabetesPedigreeFunction": 0.627,
    "Age": 50
  }'
```

**Response** (`1` = diabetes risk, `0` = no diabetes):
```json
{"prediction": [1]}
```

### **Multiple Test Cases**
```bash
# Run comprehensive testing
python tests/test_multiple.py
```
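The batch tester can be sketched along these lines (the actual `tests/test_multiple.py` may differ; the sample rows and helper names here are illustrative):

```python
# test_sketch.py - illustrative batch tester (tests/test_multiple.py may differ)
import json
import urllib.request

# Feature order matches the API payload shown above.
FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]

def make_payload(values):
    """Map a row of raw values onto the named feature dict the API expects."""
    assert len(values) == len(FEATURES)
    return dict(zip(FEATURES, values))

def predict(payload, url="http://localhost:8000/predict"):
    """POST one observation and return the parsed prediction list."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["prediction"]

if __name__ == "__main__":
    cases = [
        [6, 148, 72, 35, 0, 33.6, 0.627, 50],  # higher-risk-looking profile
        [1, 85, 66, 29, 0, 26.6, 0.351, 31],   # lower-risk-looking profile
    ]
    for row in cases:
        print(row, "->", predict(make_payload(row)))
```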


## 🛠️ Technical Stack

| Tool | Purpose | Why This Tool |
|------|---------|---------------|
| **DVC** | Data & Model Versioning | Git for data - track large files efficiently |
| **MLflow** | Experiment Tracking | Compare runs, track metrics, log models |
| **DagHub** | Remote Collaboration | GitHub for ML - share data and experiments |
| **Scikit-learn** | ML Framework | Random Forest with hyperparameter tuning |
| **Flask** | Model Serving | Lightweight REST API with live dashboard |
| **Docker** | Containerization | Same environment from dev to deployment |
| **Python** | Implementation | Pandas, YAML configuration, pickle serialization |

## ✨ Key Features

### 🔁 **Complete Reproducibility**
- One command reproduces entire pipeline
- Deterministic results with fixed random seeds
- Version-controlled data, code, and models
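Fixed seeds are what make the determinism claim concrete; a small self-contained illustration (synthetic data, not the project's training code):

```python
# Two estimators built with the same random_state learn identical models.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=42)

preds = [
    RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y).predict(X)
    for _ in range(2)
]
print((preds[0] == preds[1]).all())  # identical predictions on every row
```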

### 📊 **Smart Dependency Tracking**
- Automatic recomputation when dependencies change
- Skip unchanged stages for efficiency
- Clear visualization of pipeline dependencies

### 🎛️ **Configurable Parameters**
- Centralized configuration in `params.yaml`
- Easy experimentation with different settings
- Parameter versioning and tracking

### 🚀 **Experiment Management**
- Track all runs with MLflow
- Compare model performance across experiments
- Remote experiment sharing via DagHub

### 👥 **Team Collaboration**
- Shared data storage with DVC
- Reproducible environments
- Clear pipeline documentation

## 🔄 Pipeline Stages

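The three stages below are chained together in `dvc.yaml`; a hedged sketch of what that wiring can look like (stage names, file paths, and outputs here are illustrative, not copied from the repo):

```yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps: [src/preprocess.py, Data/diabetes.csv]
    outs: [Data/processed.csv]
  train:
    cmd: python src/train.py
    deps: [src/train.py, Data/processed.csv]
    params: [train]
    outs: [models/model.pkl]
  evaluate:
    cmd: python src/evaluate.py
    deps: [src/evaluate.py, models/model.pkl, Data/processed.csv]
```

`dvc repro` walks this graph and reruns only the stages whose `deps` or `params` changed.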
### 1️⃣ **Preprocessing Stage**
```bash
python src/preprocess.py
```
- Loads raw diabetes dataset
- Handles missing values and outliers
- Feature scaling and engineering
- Outputs cleaned dataset
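A minimal sketch of that stage, assuming the usual Pima quirk that zeros in several columns mean "missing" (the actual `src/preprocess.py` may handle this differently):

```python
# preprocess_sketch.py - illustrative preprocessing (src/preprocess.py may differ)
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# In the Pima dataset, zeros in these columns really mean "missing".
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out[ZERO_AS_MISSING] = out[ZERO_AS_MISSING].replace(0, np.nan)
    out = out.fillna(out.median())  # simple median imputation
    features = out.drop(columns="Outcome")
    # Standardize features to zero mean, unit variance; leave the label alone.
    out[features.columns] = StandardScaler().fit_transform(features)
    return out
```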

### 2️⃣ **Training Stage**
```bash
python src/train.py
```
- Hyperparameter tuning with GridSearchCV
- Random Forest model training
- MLflow experiment logging
- Model serialization
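The core of that stage can be sketched as follows, on stand-in data; the grid values and variable names are illustrative, and the MLflow calls are shown as comments since they need live tracking credentials:

```python
# train_sketch.py - illustrative training step (src/train.py may differ)
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data; the real script loads the preprocessed diabetes dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), grid, cv=3)
search.fit(X, y)

# The real script also logs the run to MLflow, roughly:
#   mlflow.log_params(search.best_params_)
#   mlflow.log_metric("cv_accuracy", search.best_score_)

blob = pickle.dumps(search.best_estimator_)  # serialized tuned model
```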

### 3️⃣ **Evaluation Stage**
```bash
python src/evaluate.py
```
- Model performance evaluation
- Metrics calculation and logging
- Results visualization
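A sketch of the metrics step, again on stand-in data (the real `src/evaluate.py` loads the serialized model and preprocessed dataset, and the metric set shown is an assumption):

```python
# evaluate_sketch.py - illustrative metrics step (src/evaluate.py may differ)
import json

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
pred = model.predict(X_te)

metrics = {
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred),
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
}
print(json.dumps(metrics, indent=2))  # the real script logs these to MLflow
```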

## 📊 Results

The pipeline achieves:
- **Accuracy**: ~85% on diabetes prediction
- **Reproducibility**: 100% identical results across runs
- **Efficiency**: Only recomputes changed stages
- **Scalability**: Easy to add new features or models

## 🔧 Experimentation

Modify parameters in `params.yaml` and rerun:

```yaml
train:
  random_state: 42
  hyperparameter_grid:
    n_estimators: [100, 200, 300]     # Try more trees
    max_depth: [5, 10, 15, null]      # Experiment with depth
    min_samples_split: [2, 5, 10]     # Adjust splitting
```

```bash
# Pipeline automatically detects parameter changes
dvc repro  # Only reruns affected stages
```
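Scripts can pick these values up with PyYAML; a minimal sketch assuming the key names in the `params.yaml` excerpt above:

```python
# params_sketch.py - read hyperparameters from params.yaml (requires PyYAML)
import yaml

def load_train_params(path="params.yaml"):
    """Return the `train:` section of params.yaml as a plain dict."""
    with open(path) as fh:
        return yaml.safe_load(fh)["train"]

if __name__ == "__main__":
    params = load_train_params()
    print(params["hyperparameter_grid"])  # feed straight into GridSearchCV
```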

## 🎓 Key Learnings

### **Technical Insights:**
- DVC transforms chaotic data workflows into organized pipelines
- MLflow provides crucial experiment tracking for ML iterations
- Proper dependency management eliminates "works on my machine" issues
- Parameter externalization enables systematic experimentation

### **MLOps Best Practices:**
- Version control everything: data, code, models, and configurations
- Automate pipeline execution and dependency tracking
- Centralize experiment tracking for team collaboration
- Design for reproducibility from day one

### **Engineering Principles:**
- Separation of concerns: preprocessing, training, evaluation
- Configuration-driven development with `params.yaml`
- Fail-fast validation with proper error handling
- Documentation-as-code approach

## 🛡️ Production Considerations

This pipeline demonstrates production-ready practices:
- **Environment isolation** with `requirements.txt`
- **Credential management** with environment variables
- **Error handling** and logging throughout
- **Modular design** for easy maintenance and testing
- **CI/CD readiness** with single-command execution

## 🔮 Future Enhancements

- [ ] Add automated model validation
- [ ] Implement A/B testing framework
- [ ] Add data drift detection
- [ ] Extend the Flask serving API (batch predictions)
- [ ] Add automated model retraining triggers
---

## 🙏 Acknowledgments

This project builds on the open-source MLOps ecosystem.

**Tools**: DVC, MLflow, DagHub, Docker, Flask, Scikit-learn
**Dataset**: Pima Indians Diabetes Database

---

*"In ML, reproducibility isn't a nice-to-have—it's the foundation of trustworthy AI systems."*