Skip to content

IT21314742/Manufacturing_Analytics

Repository files navigation

Manufacturing & Financial Data Analytics Hub

πŸ“Š Project Overview

end-to-end data pipeline platform that integrates manufacturing operational data with financial metrics. Built with Python, PostgreSQL, and Apache Airflow, it demonstrates modern data engineering practices including ETL orchestration, star schema warehousing, and business intelligence visualization.

The project showcases the complete data lifecycle from raw data ingestion through transformation to interactive dashboards, providing a scalable foundation for manufacturing performance analysis.


🎯 Project Goals

  • Build a production-ready ETL pipeline for manufacturing data
  • Implement a star schema data warehouse in PostgreSQL for analytical queries
  • Demonstrate workflow orchestration with Apache Airflow
  • Create interactive BI dashboards with Tableau Public
  • Establish CI/CD practices using Docker and GitHub Actions
  • Produce actionable insights combining manufacturing and financial metrics

🧠 How the Project Works

The system operates through three integrated layers:

1️⃣ Data Ingestion & Staging

Raw manufacturing data (production logs, machine sensors) and financial records are:

  • Extracted from source files/APIs
  • Validated for completeness and accuracy
  • Staged in temporary tables for transformation

2️⃣ Transformation & Warehousing

The ETL process:

  • Cleans and normalizes raw data
  • Applies business logic and calculations
  • Loads into a star schema with fact and dimension tables
  • Maintains slowly changing dimensions for historical accuracy

3️⃣ Analytics & Visualization

The curated data enables:

  • Production efficiency tracking (OEE, downtime analysis)
  • Cost per unit calculations
  • Revenue and profitability trends
  • Interactive Tableau dashboards for decision support

πŸ—οΈ System Architecture Overview

At a high level, the system consists of:

  • A PostgreSQL database with star schema design
  • Python ETL scripts using Pandas and SQLAlchemy
  • Apache Airflow DAGs for orchestration and scheduling
  • Tableau Public for visualization and reporting
  • Docker containers for consistent development/deployment
  • GitHub Actions for automated testing and deployment

Full System Architecture Diagram

Architecture Diagram

This architecture diagram illustrates the end-to-end data pipeline for the Manufacturing Analytics platform. It shows how data flows from multiple sources (production databases, financial systems, sensors, and CSV files) through the ingestion layer, into the ETL pipeline orchestrated by Apache Airflow, and finally into the PostgreSQL data warehouse with its star schema design. The warehouse feeds into Tableau Public for business intelligence, while the entire infrastructure is containerized with Docker and automated via GitHub Actions CI/CD. The diagram highlights the separation of concerns between data sources, processing layers, storage, and visualization components


πŸ“ˆ Detailed Workflow Sequence Diagram

Sequence Diagram

Sequence diagram illustrates the runtime behavior of the ETL pipeline from trigger to completion. It shows the chronological interaction between system components during a typical pipeline execution. The flow begins with a user or automated trigger activating the Airflow DAG, which then orchestrates the Python ETL scripts. The Python modules extract raw data from PostgreSQL, perform transformations and KPI calculations, validate data quality, and finally load the processed data back into the warehouse's staging area, dimensions, and fact tables. Upon successful completion, Airflow sends a notification, and the user can then access updated visualizations in Tableau, which queries the fresh data from PostgreSQL.


πŸ—οΈ Star Schema Diagram

Star Scema Diagram

This entity-relationship diagram represents the star schema design of the PostgreSQL data warehouse. The schema follows dimensional modeling best practices with clearly separated dimension and fact tables. The dimension tables (product, time, machine, location, customer) contain descriptive attributes that provide context for analysis. These are connected to fact tables (production, financial, maintenance, inventory) that store quantitative metrics and measurements. This design optimizes query performance for analytical workloads, enables intuitive drill-down analysis, and supports complex business questions about manufacturing operations, financial performance, and operational efficiency. The foreign key relationships between dimensions and facts create a star-like pattern that gives this schema architecture its name.


βŒ›οΈ Runtime Sequence Explanation

The system follows this execution flow:

  1. Trigger - Airflow DAG starts based on schedule or manual trigger
  2. Extract Phase - Python scripts connect to data sources and pull raw data
  3. Staging - Raw data is loaded into staging tables in PostgreSQL
  4. Transform Phase - Data is cleaned, joined, and business logic is applied
  5. Load Phase - Transformed data populates the star schema (fact/dimension tables)
  6. Validation - Data quality checks ensure integrity and completeness
  7. Notification - Success/failure alerts are logged and sent
  8. Visualization - Tableau connects to the warehouse for dashboard updates

Workflow States

The ETL pipeline transitions through these states:

  • 🟑 Pending - DAG initialized, waiting for execution
  • πŸ”΅ Running - Tasks currently executing
  • 🟒 Success - All tasks completed successfully
  • πŸ”΄ Failed - Error encountered, retry mechanism activated
  • πŸ”„ Retrying - Automatic retry of failed tasks
  • ⏸️ Paused - Manual pause of DAG execution

πŸ› οΈ Technology Stack

Component Technology Choice
Database PostgreSQL 15+ (Star Schema Design)
ETL Python 3.9+, Pandas, SQLAlchemy
Orchestration Apache Airflow
BI & Reporting Tableau Public
Container Docker, docker-compose
CI/CD GitHub Actions
Version Control Git/GitHub
Monitoring Airflow Logs, etl_pipeline.log

πŸ“ Project Structure

β”‚
β”œβ”€β”€ πŸ“‚ src/ # Core ETL code
β”‚ β”œβ”€β”€ πŸ“‚ extract/ # Data extraction modules
β”‚ β”‚ β”œβ”€β”€ extract_production.py
β”‚ β”‚ β”œβ”€β”€ extract_financial.py
β”‚ β”‚ └── extract_machine_data.py
β”‚ β”‚
β”‚ β”œβ”€β”€ πŸ“‚ transform/ # Data transformation logic
β”‚ β”‚ β”œβ”€β”€ clean_data.py
β”‚ β”‚ β”œβ”€β”€ calculate_kpis.py
β”‚ β”‚ └── merge_datasets.py
β”‚ β”‚
β”‚ └── πŸ“‚ load/ # Database loading scripts
β”‚ β”œβ”€β”€ load_dimensions.py
β”‚ └── load_facts.py
β”‚
β”œβ”€β”€ πŸ“‚ airflow/
β”‚ └── πŸ“‚ dags/ # Airflow DAG definitions
β”‚ β”œβ”€β”€ manufacturing_etl.py # Main ETL pipeline DAG
β”‚ └── data_quality_dag.py # Data validation DAG
β”‚
β”œβ”€β”€ πŸ“‚ config/ # Configuration files
β”‚ β”œβ”€β”€ database.ini # DB connection settings
β”‚ └── logging.conf # Logging configuration
β”‚
β”œβ”€β”€ πŸ“‚ notebooks/ # Jupyter notebooks for exploration
β”‚ └── exploratory_analysis.ipynb
β”‚
β”œβ”€β”€ πŸ“‚ docs/ # Documentation
β”‚ └── data_dictionary.md # Schema documentation
β”‚
β”œβ”€β”€ πŸ“‚ tests/ # Unit and integration tests
β”‚ β”œβ”€β”€ test_extract.py
β”‚ β”œβ”€β”€ test_transform.py
β”‚ └── test_load.py
β”‚
β”œβ”€β”€ πŸ“‚ .vscode/ # VS Code configuration
β”‚ └── settings.json
β”‚
β”œβ”€β”€ πŸ“„ PostgreSQL_Schema.sql # Complete database schema
β”œβ”€β”€ πŸ“„ DB_Manipulation_Queries.sql # Sample analytical queries
β”œβ”€β”€ πŸ“„ docker-compose.yml # Container orchestration
β”œβ”€β”€ πŸ“„ .env.example # Environment variables template
β”œβ”€β”€ πŸ“„ requirements.txt # Python dependencies
β”œβ”€β”€ πŸ“„ environment.yml # Conda environment
β”œβ”€β”€ πŸ“„ start_postgres.py # DB initialization helper
β”œβ”€β”€ πŸ“„ etl_pipeline.log # Pipeline execution logs
β”œβ”€β”€ πŸ“„ .gitattributes # Git attributes
β”œβ”€β”€ πŸ“„ .gitignore # Git ignore rules
└── πŸ“„ README.md # You are here Mate!!

πŸ“¦ Installation

Prerequisites

  • Python 3.9+
  • PostgreSQL 15+
  • Docker (optional, for containerized setup)
  • Git

Step 1: Clone the Repository

git clone https://github.com/IT21314742/manufacturing_analytics.git
cd manufacturing_analytics

Step 2: Set Up Python Environment

Using pip:

pip install -r requirements.txt

Using Conda:

conda env create -f environment.yml
conda activate manufacturing-analytics

Step 3: Configure Database

  1. Create a PostgreSQL database:

    CREATE DATABASE manufacturing_db;
    
  2. Set up environment variables:

    cp .env.example .env
    # Edit .env with your database credentials
    
  3. Initialize the schema:

    # Using Python helper
    python start_postgres.py
    
    # Or manually with psql
    psql -d manufacturing_db -f PostgreSQL_Schema.sql
    

Step 4: Docker Setup

docker-compose up -d

This will start:

  • PostgreSQL container

  • Adminer for database management (port 8080)

  • Other services as configured

▢️ Usage

Running the ETL Pipeline

Option 1: Manual Execution

python src/extract/extract_production.py
python src/transform/calculate_kpis.py
python src/load/load_facts.py

Option 2: Using Airflow

# Start Airflow
airflow standalone

# Access Airflow UI at http://localhost:8080
# Trigger the 'manufacturing_etl' DAG

Running Analytical Queries

Execute predefined analytical queries:

psql -d manufacturing_db -f DB_Manipulation_Queries.sql

Sample queries included:

  • Monthly production efficiency trends

  • Cost analysis by product line

  • Revenue forecasting

  • Machine downtime patterns

Jupyter Notebook Exploration

jupyter notebook notebooks/exploratory_analysis.ipynb

πŸ“Š Example Output

Sample Dashboard Metrics

Metric Value Period Trend
Overall Equipment Effectiveness (OEE) 78.5% Q1 2026 πŸ“ˆ +5.2%
Production Volume 125,000 units March 2026 πŸ“Š On Target
Average Cost Per Unit $24.50 March 2026 πŸ“‰ -3.1%
Downtime Percentage 12.3% March 2026 🟑 Warning
Revenue $3.2M Q1 2026 πŸ“ˆ +8.7%

Sample Query Result

-- Top 5 products by profitability
SELECT 
    product_name,
    total_revenue,
    total_cost,
    (total_revenue - total_cost) as profit,
    ROUND((total_revenue - total_cost)/total_revenue * 100, 2) as profit_margin
FROM profitability_analysis
WHERE date_trunc('month', transaction_date) = '2026-03-01'
ORDER BY profit DESC
LIMIT 5;

Output:

product_name total_revenue total_cost profit profit_margin
Industrial Fan 450,000 310,000 140,000 31.11%
Motor Assembly 380,000 275,000 105,000 27.63%
Control Unit 295,000 210,000 85,000 28.81%
Bearing Set 210,000 155,000 55,000 26.19%
Wiring Harness 175,000 132,000 43,000 24.57%

🧱 Extending the System

The architecture is designed for extension. Possible improvements include:

Additional Features

  • Real-time streaming - Integrate Kafka for live sensor data
  • Machine Learning - Add predictive maintenance models
  • Additional data sources - Connect to ERP systems, IoT platforms
  • Advanced visualizations - Add more Tableau dashboards

Export Options

  • JSON export for API consumption
  • CSV exports for Excel users
  • Automated PDF report generation
  • Email notifications with summary attachments

Performance Optimizations

  • Incremental loading strategies
  • Partitioning large fact tables
  • Materialized views for frequent queries
  • Query optimization and indexing

πŸ§ͺ Testing

Run the test suite:

# Run all tests
pytest tests/

# Run specific test modules
pytest tests/test_transform.py -v

# Run with coverage report
pytest --cov=src tests/

🀝 Contributing

Contributions, issues, and feature requests are welcome! This project is intended for data engineers, analysts, and developers interested in:

  • Data pipeline architecture
  • ETL/ELT processes
  • Data warehousing
  • Business intelligence
  • Manufacturing analytics

πŸ™ Acknowledgments

  • PostgreSQL community
  • Apache Airflow team
  • Tableau Public
  • All contributors and testers

About

End-to-end data pipeline integrating manufacturing and financial data. Demonstrates ETL processes, a PostgreSQL data warehouse with a star schema, and business intelligence capabilities using PowerBI. Built with Python, Apache Airflow, and Docker.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages