GoPredict - Machine Learning Pipeline for Trip Duration Prediction

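An end-to-end machine learning pipeline for predicting trip durations. It combines feature engineering (distance, time, cluster, and weather features), seven regression models ranging from linear regression to XGBoost and a neural network, and optional XGBoost hyperparameter tuning.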

Medium post: https://medium.com/@hphadtare02/how-machine-learning-predicts-trip-duration-just-like-uber-zomato-91f7db6e9ce9

Demo

Demo video: gopredict-1762027007945.mp4

Project Structure

GoPredict/
├── main.py                          # Main runner script
├── start_api.py                     # API server startup script
├── test_api.py                      # API testing script
├── config.py                        # Project configuration
├── requirements.txt                  # Python dependencies
├── README.md                        # This file
├── CONTRIBUTING.md                  # Development and integration guide
├── CODE_OF_CONDUCT.md               # Code of conduct and security
│
├── api/                            # FastAPI backend
│   └── main.py                     # FastAPI application
│
├── frontend/                       # React frontend
│   └── src/
│       └── lib/
│           └── api.ts              # API client library
│
├── data/                            # Data directory
│   ├── raw/                         # Raw data files
│   │   ├── train.csv               # Training data
│   │   └── test.csv                # Test data
│   ├── processed/                   # Processed data files
│   │   ├── feature_engineered_train.csv
│   │   ├── feature_engineered_test.csv
│   │   └── gmapsdata/              # Google Maps data
│   └── external/                    # External data sources
│       └── precipitation.csv       # Weather data
│
├── src/                            # Source code
│   ├── model/                      # Model-related modules
│   │   ├── models.py              # All ML models and pipeline
│   │   ├── evaluation.py          # Model evaluation functions
│   │   └── save_models.py         # Model persistence
│   ├── features/                   # Feature engineering modules
│   │   ├── distance.py            # Distance calculations
│   │   ├── geolocation.py         # Geographic features
│   │   ├── gmaps.py               # Google Maps integration
│   │   ├── precipitation.py       # Weather features
│   │   ├── time.py                # Time-based features
│   │   └── weather_api.py         # Weather API integration
│   ├── feature_pipe.py            # Feature engineering pipeline
│   ├── data_preprocessing.py      # Data preprocessing
│   └── complete_pipeline.py       # Complete ML pipeline
│
├── notebooks/                      # Jupyter notebooks
│   ├── 01_EDA.ipynb               # Exploratory Data Analysis
│   ├── 02_Feature_Engineering.ipynb # Feature engineering
│   ├── 03_Model_Training.ipynb    # Model training
│   ├── figures/                   # Generated plots
│   └── gmaps/                     # Interactive maps
│
├── saved_models/                   # Trained models (auto-created)
├── output/                         # Predictions and submissions (auto-created)
└── logs/                          # Log files (auto-created)

Quick Start

1. Installation

# Clone the repository
git clone <your-repo-url>
cd GoPredict

# Install dependencies
pip install -r requirements.txt

# Create necessary directories
mkdir -p logs output saved_models

2. API Server

Start the FastAPI server to connect your frontend with ML models:

# Start the API server
python start_api.py

# Test the API
python test_api.py

# View API documentation
# Visit http://localhost:8000/docs

3. Frontend Development

# Install frontend dependencies
cd frontend
npm install

# Start development server
npm run dev

API Documentation

The GoPredict API is built with FastAPI and exposes REST endpoints for machine-learning-based trip duration prediction.

Quick API Start

# Start the API server
python start_api.py

# Or with custom options
python start_api.py --host 0.0.0.0 --port 8000 --reload

API Access Points

Interactive Documentation: http://localhost:8000/docs
Alternative Documentation: http://localhost:8000/redoc
Health Check: http://localhost:8000/health

Core API Endpoints

Weather API

GET /weather - Get weather data for a specific location and time

Parameters:

latitude (float): Latitude coordinate
longitude (float): Longitude coordinate
timestamp (str): ISO format timestamp (e.g., "2016-01-01T17:00:00")

Example:

curl "http://localhost:8000/weather?latitude=40.767937&longitude=-73.982155&timestamp=2016-01-01T17:00:00"

Response:

{
  "success": true,
  "data": {
    "temp": 5.0,
    "humidity": 53.0,
    "pressure": 1013.25
  },
  "location": { "latitude": 40.767937, "longitude": -73.982155 },
  "timestamp": "2016-01-01T17:00:00"
}
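
For callers that prefer Python over curl, the request above can be sketched with the standard library alone. The helper names `build_weather_url` and `fetch_weather` are hypothetical and not part of the project; the sketch assumes the server is running locally on port 8000 and that the response has the shape shown above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the API server started by start_api.py is listening here.
BASE_URL = "http://localhost:8000"


def build_weather_url(latitude, longitude, timestamp, base_url=BASE_URL):
    """Build the GET /weather URL with URL-encoded query parameters."""
    query = urlencode(
        {"latitude": latitude, "longitude": longitude, "timestamp": timestamp}
    )
    return f"{base_url}/weather?{query}"


def fetch_weather(latitude, longitude, timestamp, timeout=10.0):
    """Call GET /weather and return the "data" object from the response."""
    url = build_weather_url(latitude, longitude, timestamp)
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    # The documented response carries a "success" flag next to the data.
    if not payload.get("success"):
        raise RuntimeError(f"Weather lookup failed: {payload}")
    return payload["data"]
```

Calling `fetch_weather(40.767937, -73.982155, "2016-01-01T17:00:00")` against a running server should return the dictionary under `data`. `urlencode` percent-encodes the colons in the timestamp, which is safe for query strings.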

Distance Calculation API

POST /distance - Calculate Manhattan and/or Euclidean distances

Parameters:

start_lat (float): Starting latitude
start_lng (float): Starting longitude
end_lat (float): Ending latitude
end_lng (float): Ending longitude
method (str): "manhattan", "euclidean", or "both" (default: "both")

Example:

curl -X POST "http://localhost:8000/distance" \
  -H "Content-Type: application/json" \
  -d '{
    "start_lat": 40.767937,
    "start_lng": -73.982155,
    "end_lat": 40.748817,
    "end_lng": -73.985428,
    "method": "both"
  }'
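
To illustrate what the two distance methods mean for latitude/longitude pairs, here is a minimal sketch. It is not the project's implementation (see src/features/distance.py for that); it assumes a local flat-earth approximation, which is reasonable for city-scale trips, and the function names and result keys are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def planar_offsets_km(start_lat, start_lng, end_lat, end_lng):
    """Return (east-west, north-south) offsets in km on a local flat plane."""
    mean_lat = math.radians((start_lat + end_lat) / 2)
    dy = math.radians(end_lat - start_lat) * EARTH_RADIUS_KM
    # Longitude degrees shrink with latitude, hence the cosine factor.
    dx = math.radians(end_lng - start_lng) * math.cos(mean_lat) * EARTH_RADIUS_KM
    return dx, dy


def distances(start_lat, start_lng, end_lat, end_lng, method="both"):
    """Compute Manhattan and/or Euclidean distance in km between two points."""
    if method not in {"manhattan", "euclidean", "both"}:
        raise ValueError(f"Unknown method: {method!r}")
    dx, dy = planar_offsets_km(start_lat, start_lng, end_lat, end_lng)
    result = {}
    if method in ("euclidean", "both"):
        result["euclidean"] = math.hypot(dx, dy)  # straight-line distance
    if method in ("manhattan", "both"):
        result["manhattan"] = abs(dx) + abs(dy)  # sum of axis-aligned legs
    return result
```

The Manhattan value is never smaller than the Euclidean one, which makes it a rough stand-in for travel along a street grid.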

Time Features API

POST /time-features - Extract time-based features from datetime

Parameters:

datetime_str (str): ISO format datetime string

Example:

curl -X POST "http://localhost:8000/time-features" \
  -H "Content-Type: application/json" \
  -d '{"datetime_str": "2016-01-01T17:00:00"}'
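
The kind of features such an endpoint might derive can be sketched with the standard library. The feature names below are hypothetical, since the actual response fields are not listed here; the project's own logic lives in src/features/time.py.

```python
from datetime import datetime


def extract_time_features(datetime_str):
    """Derive simple calendar features from an ISO format datetime string."""
    moment = datetime.fromisoformat(datetime_str)
    day_of_week = moment.weekday()  # Monday = 0 ... Sunday = 6
    return {
        "hour": moment.hour,
        "day_of_week": day_of_week,
        "month": moment.month,
        "is_weekend": day_of_week >= 5,
    }
```

For the example timestamp "2016-01-01T17:00:00", this yields hour 17 on a Friday (day_of_week 4) in month 1, with is_weekend set to False.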

Prediction API

POST /predict - Predict trip duration using ML models

Parameters (JSON Body):

{
  "from": {
    "lat": 40.767937,
    "lon": -73.982155
  },
  "to": {
    "lat": 40.748817,
    "lon": -73.985428
  },
  "startTime": "2016-01-01T17:00:00",
  "city": "new_york",
  "model_name": "XGBoost"
}

Response:

{
  "minutes": 5.2,
  "confidence": 0.75,
  "model_version": "XGBoost",
  "distance_km": 2.1,
  "city": "new_york"
}
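
A Python client for this endpoint could look like the sketch below. It uses only the standard library; `build_predict_request` and `predict_minutes` are hypothetical helper names, and the base URL assumes a locally running server. The field names mirror the request and response bodies shown above.

```python
import json
from urllib.request import Request, urlopen

# Assumption: the API server started by start_api.py is listening here.
BASE_URL = "http://localhost:8000"


def build_predict_request(from_lat, from_lon, to_lat, to_lon, start_time,
                          city="new_york", model_name="XGBoost",
                          base_url=BASE_URL):
    """Build a POST /predict request whose JSON body mirrors the example."""
    body = {
        "from": {"lat": from_lat, "lon": from_lon},
        "to": {"lat": to_lat, "lon": to_lon},
        "startTime": start_time,
        "city": city,
        "model_name": model_name,
    }
    return Request(
        f"{base_url}/predict",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_minutes(request, timeout=10.0):
    """Send the request and return the predicted duration in minutes."""
    with urlopen(request, timeout=timeout) as response:
        result = json.load(response)
    return float(result["minutes"])
```

Building the request separately from sending it keeps the payload easy to inspect or log before any network call is made.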

Model Management API

GET /models - List available trained models
GET /models/{model_name} - Get specific model information
POST /models/train - Train models in background

Example:

# List models
curl "http://localhost:8000/models"

# Train models
curl -X POST "http://localhost:8000/models/train" \
  -H "Content-Type: application/json" \
  -d '{"models_to_run": ["XGBoost", "Random Forest"]}'

Health & Status API

GET /health - Health check endpoint
GET /status - Detailed API status

Frontend Integration

The frontend uses the API client in frontend/src/lib/api.ts:

import { predictTravelTime } from "@/lib/api";

// Example usage
const prediction = await predictTravelTime({
  from: { lat: 40.767937, lon: -73.982155 },
  to: { lat: 40.748817, lon: -73.985428 },
  startTime: "2016-01-01T17:00:00",
  city: "new_york",
});

ML Pipeline Usage

Simple Pipeline (Default)

python main.py

Runs the complete end-to-end pipeline:

Data preprocessing - Loads and cleans raw data
Feature engineering - Adds distance, time, cluster, and weather features
Model training - Trains all specified models
Model evaluation - Compares model performance
Prediction generation - Creates submission files

Custom Models

python main.py --models XGB,RF

Train only specific models.

With Hyperparameter Tuning

python main.py --tune-xgb

Enable XGBoost hyperparameter tuning.
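
The two flags above can be parsed with `argparse` roughly as follows. This is a sketch of one way to handle them, not the code in main.py; the default of running every model and the validation against the documented model codes are assumptions.

```python
import argparse

# Model codes as documented in this README.
MODEL_CODES = ("LINREG", "RIDGE", "LASSO", "SVR", "XGB", "RF", "NN")


def parse_model_list(value):
    """Turn a comma-separated string such as "XGB,RF" into a list of codes."""
    codes = [code.strip().upper() for code in value.split(",") if code.strip()]
    unknown = [code for code in codes if code not in MODEL_CODES]
    if not codes or unknown:
        raise argparse.ArgumentTypeError(
            f"expected a comma-separated subset of {', '.join(MODEL_CODES)}, "
            f"got {value!r}"
        )
    return codes


def build_parser():
    """Create a parser that accepts --models and --tune-xgb."""
    parser = argparse.ArgumentParser(description="Run the trip duration pipeline")
    parser.add_argument(
        "--models",
        type=parse_model_list,
        default=list(MODEL_CODES),  # assumption: no flag means all models
        help="comma-separated model codes, e.g. XGB,RF",
    )
    parser.add_argument(
        "--tune-xgb",
        action="store_true",
        help="enable XGBoost hyperparameter tuning",
    )
    return parser
```

With this parser, `build_parser().parse_args(["--models", "XGB,RF"])` produces a `models` list of two codes and leaves `tune_xgb` as False, while an unknown code makes argparse exit with a usage error.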

📈 Output Files

Predictions

output/[model_name]/test_prediction_YYYYMMDD_HHMMSS.csv
Ready-to-submit prediction files with timestamps

Models

saved_models/[model_name]_YYYYMMDD_HHMMSS.pkl
Trained models with metadata

Logs

logs/main.log - Complete pipeline execution log
Detailed progress tracking and metrics

Visualizations

output/prediction_comparison_YYYYMMDD_HHMMSS.png
Model comparison plots
Feature importance plots
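
The YYYYMMDD_HHMMSS suffix used in the file names above corresponds to the `strftime` pattern `%Y%m%d_%H%M%S`. The sketch below shows one way to build such paths; `timestamped_path` is a hypothetical helper, not a function from the project.

```python
from datetime import datetime
from pathlib import Path

TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"  # renders as YYYYMMDD_HHMMSS


def timestamped_path(directory, stem, suffix, now=None):
    """Return directory/stem_YYYYMMDD_HHMMSS<suffix> for the given time."""
    now = now or datetime.now()
    return Path(directory) / f"{stem}_{now.strftime(TIMESTAMP_FORMAT)}{suffix}"


def parse_timestamp(path):
    """Recover the datetime from a path created by timestamped_path."""
    stamp = "_".join(Path(path).stem.split("_")[-2:])
    return datetime.strptime(stamp, TIMESTAMP_FORMAT)
```

Because the timestamp sorts lexicographically in chronological order, the most recent model or prediction file in a directory can be found with a plain sorted listing.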

🔧 Configuration

Edit config.py to customize:

Model parameters
Data paths
Output directories
Hyperparameter tuning ranges
Logging settings

Usage Examples

Basic Usage

from src.model.models import run_complete_pipeline
import pandas as pd

# Load data
train_df = pd.read_csv('data/processed/feature_engineered_train.csv')
test_df = pd.read_csv('data/processed/feature_engineered_test.csv')

# Run complete pipeline
results = run_complete_pipeline(
    train_df=train_df,
    test_df=test_df,
    models_to_run=['LINREG', 'RIDGE', 'XGB'],
    tune_xgb=True,
    create_submission=True
)

Individual Components

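import pandas as pd

from src.model.models import run_regression_models, predict_duration, to_submission

# Load the feature-engineered data (same files as in Basic Usage)
train_df = pd.read_csv('data/processed/feature_engineered_train.csv')
test_df = pd.read_csv('data/processed/feature_engineered_test.csv')

# Train models
models = run_regression_models(train_df, ['XGB', 'RF'])

# Make predictions with the trained XGBoost model
predictions = predict_duration(models['XGBoost'], test_df)

# Create submission
submission = to_submission(predictions, test_df)
submission.to_csv('my_submission.csv', index=False)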

Testing

API Testing

# Run comprehensive API tests
python test_api.py

Frontend Testing

cd frontend
npm run test
npm run test:coverage

Available Models

LINREG - Linear Regression
RIDGE - Ridge Regression
LASSO - Lasso Regression
SVR - Support Vector Regression
XGB - XGBoost
RF - Random Forest
NN - Neural Network

Contributing

See CONTRIBUTING.md for development guidelines and frontend integration details.

Code of Conduct

See CODE_OF_CONDUCT.md for our community guidelines and security policies.

License

This project is licensed under the MIT License - see the LICENSE file for details.
