An end-to-end data pipeline platform that integrates manufacturing operational data with financial metrics. Built with Python, PostgreSQL, and Apache Airflow, it demonstrates modern data engineering practices including ETL orchestration, star schema warehousing, and business intelligence visualization.
The project showcases the complete data lifecycle from raw data ingestion through transformation to interactive dashboards, providing a scalable foundation for manufacturing performance analysis.
- Build a production-ready ETL pipeline for manufacturing data
- Implement a star schema data warehouse in PostgreSQL for analytical queries
- Demonstrate workflow orchestration with Apache Airflow
- Create interactive BI dashboards with Tableau Public
- Establish CI/CD practices using Docker and GitHub Actions
- Produce actionable insights combining manufacturing and financial metrics
The system operates through three integrated layers:
Raw manufacturing data (production logs, machine sensors) and financial records are:
- Extracted from source files/APIs
- Validated for completeness and accuracy
- Staged in temporary tables for transformation
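The validation step above could look roughly like the following. This is a minimal sketch using only the standard library; the column names (`machine_id`, `product_id`, `units_produced`, `recorded_at`) and the rejection rules are assumptions for illustration, not the project's actual schema or checks.

```python
from dataclasses import dataclass, field

# Hypothetical column names; the real production log schema may differ.
REQUIRED_FIELDS = ("machine_id", "product_id", "units_produced", "recorded_at")


@dataclass
class ValidationResult:
    valid: list = field(default_factory=list)
    rejected: list = field(default_factory=list)  # (row, reason) pairs


def validate_rows(rows):
    """Split raw rows into valid rows and rejected rows with a reason."""
    result = ValidationResult()
    for row in rows:
        # Completeness: every required field must be present and non-empty.
        missing = [f for f in REQUIRED_FIELDS if row.get(f) in (None, "")]
        if missing:
            result.rejected.append((row, f"missing: {', '.join(missing)}"))
            continue
        # Accuracy: the unit count must be a non-negative integer.
        try:
            units = int(row["units_produced"])
        except (TypeError, ValueError):
            result.rejected.append((row, "units_produced is not an integer"))
            continue
        if units < 0:
            result.rejected.append((row, "units_produced is negative"))
            continue
        result.valid.append({**row, "units_produced": units})
    return result


raw = [
    {"machine_id": "M1", "product_id": "P1", "units_produced": "120",
     "recorded_at": "2026-03-01T08:00"},
    {"machine_id": "M2", "product_id": "", "units_produced": "80",
     "recorded_at": "2026-03-01T08:00"},
    {"machine_id": "M3", "product_id": "P2", "units_produced": "-5",
     "recorded_at": "2026-03-01T08:00"},
]
checked = validate_rows(raw)
print(len(checked.valid), "valid,", len(checked.rejected), "rejected")
```

Keeping the rejected rows together with a reason, rather than dropping them silently, makes it possible to log or stage them separately for later inspection.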
The ETL process:
- Cleans and normalizes raw data
- Applies business logic and calculations
- Loads into a star schema with fact and dimension tables
- Maintains slowly changing dimensions for historical accuracy
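The text does not say which slowly changing dimension strategy is used. As an illustration only, the sketch below assumes a Type 2 approach, where a changed attribute closes the current row and appends a new version. The column names (`valid_from`, `valid_to`, `is_current`) and the `apply_scd2` helper are hypothetical, and plain dictionaries stand in for a PostgreSQL dimension table.

```python
from datetime import date


def apply_scd2(dimension, incoming, key, tracked, today):
    """Return a new dimension list with Type 2 history applied.

    dimension: list of dicts with 'valid_from', 'valid_to', 'is_current'.
    incoming:  list of dicts holding the latest source attributes.
    """
    result = [dict(row) for row in dimension]  # copy, leave the input untouched
    current = {row[key]: row for row in result if row["is_current"]}
    for new in incoming:
        old = current.get(new[key])
        if old is not None and all(old[c] == new[c] for c in tracked):
            continue  # nothing changed, keep the current version
        if old is not None:
            # Close the previous version instead of overwriting it.
            old["valid_to"] = today
            old["is_current"] = False
        new_row = {**new, "valid_from": today, "valid_to": None, "is_current": True}
        result.append(new_row)
        current[new[key]] = new_row
    return result


dim_machine = [
    {"machine_id": "M1", "location": "Plant A", "valid_from": date(2025, 1, 1),
     "valid_to": None, "is_current": True},
]
updates = [
    {"machine_id": "M1", "location": "Plant B"},  # machine moved
    {"machine_id": "M2", "location": "Plant A"},  # new machine
]
history = apply_scd2(dim_machine, updates, key="machine_id",
                     tracked=["location"], today=date(2026, 3, 1))
for row in history:
    print(row["machine_id"], row["location"], row["valid_from"],
          row["valid_to"], row["is_current"])
```

With this layout, a fact recorded in 2025 can still be joined to the "Plant A" version of machine M1, which is what preserves historical accuracy.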
The curated data enables:
- Production efficiency tracking (OEE, downtime analysis)
- Cost per unit calculations
- Revenue and profitability trends
- Interactive Tableau dashboards for decision support
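For reference, OEE is conventionally the product of availability, performance, and quality. The sketch below shows that arithmetic along with a simple cost-per-unit figure. The function names, the input values, and the choice to divide cost by good units are assumptions for illustration; the project's `calculate_kpis.py` may define these metrics differently.

```python
def compute_oee(planned_minutes, downtime_minutes, ideal_cycle_minutes,
                total_units, good_units):
    """Return availability, performance, quality and OEE as fractions."""
    run_minutes = planned_minutes - downtime_minutes
    if planned_minutes <= 0 or run_minutes <= 0 or total_units <= 0:
        raise ValueError("planned time, run time and total units must be positive")
    availability = run_minutes / planned_minutes
    performance = (ideal_cycle_minutes * total_units) / run_minutes
    quality = good_units / total_units
    return {
        "availability": availability,
        "performance": performance,
        "quality": quality,
        "oee": availability * performance * quality,
    }


def cost_per_unit(total_cost, good_units):
    """Spread the total cost over sellable units (an assumed convention)."""
    if good_units <= 0:
        raise ValueError("good_units must be positive")
    return total_cost / good_units


# Made-up shift data, not taken from the project's datasets.
shift = compute_oee(planned_minutes=480, downtime_minutes=60,
                    ideal_cycle_minutes=0.5, total_units=800, good_units=760)
print({name: round(value, 4) for name, value in shift.items()})
print(round(cost_per_unit(total_cost=19000, good_units=760), 2))
```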
At a high level, the system consists of:
- A PostgreSQL database with star schema design
- Python ETL scripts using Pandas and SQLAlchemy
- Apache Airflow DAGs for orchestration and scheduling
- Tableau Public for visualization and reporting
- Docker containers for consistent development/deployment
- GitHub Actions for automated testing and deployment
This architecture diagram illustrates the end-to-end data pipeline for the Manufacturing Analytics platform. It shows how data flows from multiple sources (production databases, financial systems, sensors, and CSV files) through the ingestion layer, into the ETL pipeline orchestrated by Apache Airflow, and finally into the PostgreSQL data warehouse with its star schema design. The warehouse feeds into Tableau Public for business intelligence, while the entire infrastructure is containerized with Docker and automated via GitHub Actions CI/CD. The diagram highlights the separation of concerns between data sources, processing layers, storage, and visualization components.
This sequence diagram illustrates the runtime behavior of the ETL pipeline from trigger to completion. It shows the chronological interaction between system components during a typical pipeline execution. The flow begins with a user or automated trigger activating the Airflow DAG, which then orchestrates the Python ETL scripts. The Python modules extract raw data from PostgreSQL, perform transformations and KPI calculations, validate data quality, and finally load the processed data back into the warehouse's staging area, dimensions, and fact tables. Upon successful completion, Airflow sends a notification, and the user can then access updated visualizations in Tableau, which queries the fresh data from PostgreSQL.
This entity-relationship diagram represents the star schema design of the PostgreSQL data warehouse. The schema follows dimensional modeling best practices with clearly separated dimension and fact tables. The dimension tables (product, time, machine, location, customer) contain descriptive attributes that provide context for analysis. These are connected to fact tables (production, financial, maintenance, inventory) that store quantitative metrics and measurements. This design optimizes query performance for analytical workloads, enables intuitive drill-down analysis, and supports complex business questions about manufacturing operations, financial performance, and operational efficiency. The foreign key relationships between dimensions and facts create a star-like pattern that gives this schema architecture its name.
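To make the fact and dimension relationship concrete, here is a small self-contained sketch. It uses the standard library's in-memory SQLite as a stand-in for PostgreSQL so that it runs without a server, and the table and column names (`dim_product`, `dim_time`, `fact_production`, and so on) are hypothetical rather than copied from `PostgreSQL_Schema.sql`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL
);
CREATE TABLE dim_time (
    time_key  INTEGER PRIMARY KEY,
    full_date TEXT NOT NULL,
    month     TEXT NOT NULL
);
-- The fact table stores measures plus one foreign key per dimension.
CREATE TABLE fact_production (
    product_key      INTEGER NOT NULL REFERENCES dim_product(product_key),
    time_key         INTEGER NOT NULL REFERENCES dim_time(time_key),
    units_produced   INTEGER NOT NULL,
    downtime_minutes INTEGER NOT NULL
);
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "Industrial Fan"), (2, "Motor Assembly")])
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                 [(1, "2026-03-01", "2026-03"), (2, "2026-03-02", "2026-03")])
conn.executemany("INSERT INTO fact_production VALUES (?, ?, ?, ?)",
                 [(1, 1, 100, 10), (1, 2, 150, 0), (2, 1, 80, 30)])

# A typical analytical query: join facts to dimensions, then aggregate.
rows = conn.execute("""
    SELECT p.product_name, t.month, SUM(f.units_produced) AS units
    FROM fact_production AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_time    AS t ON t.time_key = f.time_key
    GROUP BY p.product_name, t.month
    ORDER BY p.product_name
""").fetchall()
for row in rows:
    print(row)
```

The query shape (one central fact table joined outward to each dimension) is the same in PostgreSQL; only the connection and some data types would differ.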
The system follows this execution flow:
- Trigger - Airflow DAG starts based on schedule or manual trigger
- Extract Phase - Python scripts connect to data sources and pull raw data
- Staging - Raw data is loaded into staging tables in PostgreSQL
- Transform Phase - Data is cleaned, joined, and business logic is applied
- Load Phase - Transformed data populates the star schema (fact/dimension tables)
- Validation - Data quality checks ensure integrity and completeness
- Notification - Success/failure alerts are logged and sent
- Visualization - Tableau connects to the warehouse for dashboard updates
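The ordering of these phases can be sketched in plain Python. In the project itself this sequencing is handled by the Airflow DAG, so the `run_pipeline` helper and the placeholder step functions below are purely illustrative assumptions that show the stop-on-first-failure behaviour without requiring Airflow.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("manufacturing_etl")


def run_pipeline(steps, context=None):
    """Run (name, callable) steps in order and stop at the first failure."""
    context = {} if context is None else context
    completed = []
    for name, step in steps:
        log.info("starting %s", name)
        try:
            step(context)
        except Exception:
            log.exception("step %s failed", name)
            return {"status": "failed", "failed_step": name, "completed": completed}
        completed.append(name)
    return {"status": "success", "failed_step": None, "completed": completed}


# Placeholder steps: each one reads and writes the shared context dict.
def extract(ctx):
    ctx["raw"] = [{"units": 10}, {"units": None}, {"units": 5}]


def stage(ctx):
    ctx["staged"] = list(ctx["raw"])


def transform(ctx):
    ctx["clean"] = [r for r in ctx["staged"] if r["units"] is not None]


def load(ctx):
    ctx["fact_rows"] = len(ctx["clean"])


def validate(ctx):
    if ctx["fact_rows"] == 0:
        raise ValueError("no rows loaded")


context = {}
outcome = run_pipeline(
    [("extract", extract), ("stage", stage), ("transform", transform),
     ("load", load), ("validate", validate)],
    context,
)
print(outcome["status"], outcome["completed"])
```

The returned dictionary is the kind of summary a notification step could log or send once the run finishes.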
The ETL pipeline transitions through these states:
- 🟡 Pending - DAG initialized, waiting for execution
- 🔵 Running - Tasks currently executing
- 🟢 Success - All tasks completed successfully
- 🔴 Failed - Error encountered, retry mechanism activated
- 🔄 Retrying - Automatic retry of failed tasks
- ⏸️ Paused - Manual pause of DAG execution
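These states can be modelled as a small state machine. The allowed transitions below are an assumption inferred from the descriptions above (for example, that a failed run moves to retrying and then back to running); Airflow manages its own task and DAG run states internally, so this sketch is only a way to reason about the lifecycle.

```python
from enum import Enum


class PipelineState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"
    RETRYING = "retrying"
    PAUSED = "paused"


S = PipelineState

# Assumed transition table; adjust it to match the real pipeline's behaviour.
ALLOWED_TRANSITIONS = {
    S.PENDING: {S.RUNNING, S.PAUSED},
    S.RUNNING: {S.SUCCESS, S.FAILED, S.PAUSED},
    S.FAILED: {S.RETRYING},
    S.RETRYING: {S.RUNNING},
    S.PAUSED: {S.PENDING},
    S.SUCCESS: set(),
}


def transition(current, target):
    """Return the target state, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


# Walk through a run that fails once, retries, and then succeeds.
state = S.PENDING
for nxt in (S.RUNNING, S.FAILED, S.RETRYING, S.RUNNING, S.SUCCESS):
    state = transition(state, nxt)
print(state.name)
```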
| Component | Technology Choice |
|---|---|
| Database | PostgreSQL 15+ (Star Schema Design) |
| ETL | Python 3.9+, Pandas, SQLAlchemy |
| Orchestration | Apache Airflow |
| BI & Reporting | Tableau Public |
| Container | Docker, docker-compose |
| CI/CD | GitHub Actions |
| Version Control | Git/GitHub |
| Monitoring | Airflow Logs, etl_pipeline.log |
│
├── src/                          # Core ETL code
│   ├── extract/                  # Data extraction modules
│   │   ├── extract_production.py
│   │   ├── extract_financial.py
│   │   └── extract_machine_data.py
│   │
│   ├── transform/                # Data transformation logic
│   │   ├── clean_data.py
│   │   ├── calculate_kpis.py
│   │   └── merge_datasets.py
│   │
│   └── load/                     # Database loading scripts
│       ├── load_dimensions.py
│       └── load_facts.py
│
├── airflow/
│   └── dags/                     # Airflow DAG definitions
│       ├── manufacturing_etl.py  # Main ETL pipeline DAG
│       └── data_quality_dag.py   # Data validation DAG
│
├── config/                       # Configuration files
│   ├── database.ini              # DB connection settings
│   └── logging.conf              # Logging configuration
│
├── notebooks/                    # Jupyter notebooks for exploration
│   └── exploratory_analysis.ipynb
│
├── docs/                         # Documentation
│   └── data_dictionary.md        # Schema documentation
│
├── tests/                        # Unit and integration tests
│   ├── test_extract.py
│   ├── test_transform.py
│   └── test_load.py
│
├── .vscode/                      # VS Code configuration
│   └── settings.json
│
├── PostgreSQL_Schema.sql         # Complete database schema
├── DB_Manipulation_Queries.sql   # Sample analytical queries
├── docker-compose.yml            # Container orchestration
├── .env.example                  # Environment variables template
├── requirements.txt              # Python dependencies
├── environment.yml               # Conda environment
├── start_postgres.py             # DB initialization helper
├── etl_pipeline.log              # Pipeline execution logs
├── .gitattributes                # Git attributes
├── .gitignore                    # Git ignore rules
└── README.md                     # You are here Mate!!
- Python 3.9+
- PostgreSQL 15+
- Docker (optional, for containerized setup)
- Git
git clone https://github.com/IT21314742/manufacturing_analytics.git
cd manufacturing_analytics
Using pip:
pip install -r requirements.txt
Using Conda:
conda env create -f environment.yml
conda activate manufacturing-analytics
- Create a PostgreSQL database:
CREATE DATABASE manufacturing_db;
- Set up environment variables:
cp .env.example .env
# Edit .env with your database credentials
- Initialize the schema:
# Using Python helper
python start_postgres.py
# Or manually with psql
psql -d manufacturing_db -f PostgreSQL_Schema.sql
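Once the credentials are in `.env`, the ETL code needs to turn them into a connection string. The sketch below shows one way to do that with the standard library. The variable names (`DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT`, `DB_NAME`) and their defaults are assumptions, so check `.env.example` for the names the project actually uses.

```python
import os
from urllib.parse import quote


def build_database_url(env=None):
    """Build a postgresql:// URL from environment-style settings."""
    env = os.environ if env is None else env
    # Hypothetical variable names and defaults, for illustration only.
    user = env.get("DB_USER", "postgres")
    password = env.get("DB_PASSWORD", "")
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    name = env.get("DB_NAME", "manufacturing_db")
    # Percent-encode credentials so characters such as '@' or '/' stay valid.
    credentials = quote(user, safe="")
    if password:
        credentials += ":" + quote(password, safe="")
    return f"postgresql://{credentials}@{host}:{port}/{name}"


example_env = {
    "DB_USER": "etl",
    "DB_PASSWORD": "p@ss word",
    "DB_HOST": "db",
    "DB_PORT": "5433",
    "DB_NAME": "manufacturing_db",
}
print(build_database_url(example_env))
```

Passing the settings in as a dictionary keeps the function easy to test; calling it with no argument reads from the process environment instead.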
docker-compose up -d
This will start:
- PostgreSQL container
- Adminer for database management (port 8080)
- Other services as configured
Option 1: Manual Execution
python src/extract/extract_production.py
python src/transform/calculate_kpis.py
python src/load/load_facts.py
Option 2: Using Airflow
# Start Airflow
airflow standalone
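# Access the Airflow UI at http://localhost:8080 (Adminer from the Docker setup also uses port 8080, so avoid running both on that port at once)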
# Trigger the 'manufacturing_etl' DAG
Execute predefined analytical queries:
psql -d manufacturing_db -f DB_Manipulation_Queries.sql
- Monthly production efficiency trends
- Cost analysis by product line
- Revenue forecasting
- Machine downtime patterns
jupyter notebook notebooks/exploratory_analysis.ipynb
| Metric | Value | Period | Trend |
|---|---|---|---|
| Overall Equipment Effectiveness (OEE) | 78.5% | Q1 2026 | +5.2% |
| Production Volume | 125,000 units | March 2026 | On Target |
| Average Cost Per Unit | $24.50 | March 2026 | -3.1% |
| Downtime Percentage | 12.3% | March 2026 | 🟡 Warning |
| Revenue | $3.2M | Q1 2026 | +8.7% |
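-- Top 5 products by profitability
-- The numeric cast avoids integer division and lets ROUND(value, 2) work;
-- NULLIF guards against division by zero when revenue is 0.
SELECT
  product_name,
  total_revenue,
  total_cost,
  (total_revenue - total_cost) AS profit,
  ROUND((total_revenue - total_cost)::numeric / NULLIF(total_revenue, 0) * 100, 2) AS profit_margin
FROM profitability_analysis
WHERE date_trunc('month', transaction_date) = '2026-03-01'
ORDER BY profit DESC
LIMIT 5;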
| product_name | total_revenue | total_cost | profit | profit_margin |
|---|---|---|---|---|
| Industrial Fan | 450,000 | 310,000 | 140,000 | 31.11% |
| Motor Assembly | 380,000 | 275,000 | 105,000 | 27.63% |
| Control Unit | 295,000 | 210,000 | 85,000 | 28.81% |
| Bearing Set | 210,000 | 155,000 | 55,000 | 26.19% |
| Wiring Harness | 175,000 | 132,000 | 43,000 | 24.57% |
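The `profit` and `profit_margin` columns follow directly from revenue and cost, and the arithmetic can be reproduced in a few lines of Python. The `profit_margin` helper is a hypothetical name; it uses `Decimal` with half-up rounding to mirror how the query rounds a numeric value to two places, and the figures are the ones from the table above.

```python
from decimal import Decimal, ROUND_HALF_UP


def profit_margin(total_revenue, total_cost):
    """Return the margin as a percentage rounded to 2 places, or None if revenue is 0."""
    revenue = Decimal(str(total_revenue))
    cost = Decimal(str(total_cost))
    if revenue == 0:
        return None  # same effect as NULLIF(total_revenue, 0) in SQL
    margin = (revenue - cost) / revenue * 100
    return margin.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


products = [
    ("Industrial Fan", 450000, 310000),
    ("Motor Assembly", 380000, 275000),
    ("Control Unit", 295000, 210000),
    ("Bearing Set", 210000, 155000),
    ("Wiring Harness", 175000, 132000),
]
margins = {name: profit_margin(rev, cost) for name, rev, cost in products}
for name, rev, cost in products:
    print(name, rev - cost, margins[name])
```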
The architecture is designed for extension. Possible improvements include:
- Real-time streaming - Integrate Kafka for live sensor data
- Machine Learning - Add predictive maintenance models
- Additional data sources - Connect to ERP systems, IoT platforms
- Advanced visualizations - Add more Tableau dashboards
- JSON export for API consumption
- CSV exports for Excel users
- Automated PDF report generation
- Email notifications with summary attachments
- Incremental loading strategies
- Partitioning large fact tables
- Materialized views for frequent queries
- Query optimization and indexing
# Run all tests
pytest tests/
# Run specific test modules
pytest tests/test_transform.py -v
# Run with coverage report
pytest --cov=src tests/
Contributions, issues, and feature requests are welcome! This project is intended for data engineers, analysts, and developers interested in:
- Data pipeline architecture
- ETL/ELT processes
- Data warehousing
- Business intelligence
- Manufacturing analytics
- PostgreSQL community
- Apache Airflow team
- Tableau Public
- All contributors and testers


