This project is an ETL (Extract, Transform, Load) pipeline for hourly energy price data from the EnergyZero API. Using Dockerized Apache Airflow, it automates the path from raw API responses to analytics-ready Parquet storage.
| Phase | Description | Tools |
|---|---|---|
| Extract | Retrieval of hourly energy prices from EnergyZero API. | Requests, JSON |
| Transform | Data cleaning, VAT calculation (21%), and date/time engineering. | Pandas |
| Load | Compressed and schema-enforced storage in Parquet format. | PyArrow |
| Orchestrate | Fully automated scheduling and monitoring. | Airflow |
- Extract: A Python script fetches the last 7 days of electricity prices.
- Transform:
  - Splits `ReadingDate` into separate `Date` and `Time` columns.
  - Calculates `Price_with_VAT` (Base Price * 1.21).
  - Enforces correct data types for downstream analytics.
- Load: Saves the resulting dataframe as a `.parquet` file in the `data/processed/` directory.
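The Transform and Load steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the base price column name `Price`, the output file name `energy_prices.parquet`, and the choice to store `Date` and `Time` as strings are all assumptions made for the example. Only `ReadingDate`, `Date`, `Time`, `Price_with_VAT`, the 21% rate, and the `data/processed/` directory come from the description above.

```python
from pathlib import Path

import pandas as pd

VAT_RATE = 0.21  # 21% VAT, as stated in the phase table


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Split the timestamp, add the VAT-inclusive price, and enforce dtypes."""
    df = raw.copy()
    # Parse ISO 8601 timestamps; utc=True normalizes any offsets to UTC.
    ts = pd.to_datetime(df["ReadingDate"], utc=True)
    df["ReadingDate"] = ts
    # Assumption: Date and Time are kept as plain strings for readability.
    df["Date"] = ts.dt.strftime("%Y-%m-%d")
    df["Time"] = ts.dt.strftime("%H:%M")
    # "Price" is an assumed name for the base price column.
    df["Price"] = df["Price"].astype("float64")
    df["Price_with_VAT"] = df["Price"] * (1 + VAT_RATE)
    return df


def load(df: pd.DataFrame, out_dir: str = "data/processed") -> Path:
    """Write the frame to a Parquet file via PyArrow and return its path."""
    out_path = Path(out_dir) / "energy_prices.parquet"  # hypothetical file name
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out_path, engine="pyarrow", index=False)
    return out_path
```

Calling `load(transform(raw_df))` on a dataframe with `ReadingDate` and `Price` columns would write a single Parquet file under `data/processed/`. Writing requires PyArrow to be installed.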
energyzero_etl/
├── 📁 dags/ # Airflow DAG definitions
├── 📁 scripts/ # Python ETL logic
├── 📁 data/
│ ├── 📁 raw/ # Raw JSON landing zone
│ └── 📁 processed/ # Optimized Parquet files
├── 🐳 Dockerfile # Custom Airflow image
├── 🚢 docker-compose.yml # Infrastructure as Code
└── 📄 requirements.txt # Dependency list
Prerequisite: Ensure Docker Desktop is installed and running.
git clone https://github.com/BeyzaNurSarikaya/energyzero-etl.git
cd energyzero-etl
docker-compose up --build -d
Access the Airflow Dashboard at http://localhost:8080:
- User: `admin`
- Pass: `admin`
- Integrate a PostgreSQL database as the final "Load" destination.
- Add a Streamlit dashboard for real-time price visualization.
- Implement Slack/Email alerts for failed pipeline tasks.
Beyza Nur Sarıkaya
- LinkedIn: beyza-nur-sarikaya