DataFlow - Data Pipeline Platform

An ETL platform for managing data workflows, still early-stage and under active development.

DataFlow is a comprehensive, scalable data pipeline platform designed for modern enterprises. It supports both real-time streaming and batch ETL workloads with advanced data quality monitoring, lineage tracking, and operational visibility.

🚀 Key Features

Multi-Source Data Ingestion

  • 20+ Built-in Connectors: PostgreSQL, MySQL, MongoDB, Salesforce, REST APIs, Kafka, S3, HDFS
  • Pluggable Architecture: Easy to add custom connectors
  • Schema Evolution: Automatic handling of schema changes
  • Incremental Loading: Optimized data synchronization

Real-Time & Batch Processing

  • Streaming ETL: Apache Kafka-based real-time processing
  • Batch Processing: Apache Spark integration for large-scale transformations
  • Hybrid Workloads: Seamless combination of streaming and batch
  • Auto-scaling: Kubernetes-native horizontal scaling

Data Quality & Governance

  • Built-in Validators: 50+ pre-built data quality rules
  • Custom Validators: Python-based extensible validation framework
  • Data Profiling: Automatic statistics and anomaly detection
  • Data Lineage: End-to-end tracking of data transformations
  • GDPR Compliance: Built-in PII detection and handling

Enterprise-Grade Operations

  • Web Dashboard: React-based management interface
  • RESTful APIs: Complete programmatic control
  • Monitoring & Alerting: Prometheus metrics with Grafana dashboards
  • Multi-tenancy: Isolated environments for different teams
  • Role-based Access: Fine-grained security controls

πŸ—οΈ Architecture

DataFlow follows a microservices architecture designed for cloud-native deployments:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Data Sources  │    │   Processing    │    │   Destinations  │
│                 │    │                 │    │                 │
│ • Databases     │────│ • Stream Proc   │────│ • Data Lake     │
│ • APIs          │    │ • Batch Proc    │    │ • Data Warehouse│
│ • Files         │    │ • Validators    │    │ • Analytics     │
│ • Streams       │    │ • Transformers  │    │ • ML Platforms  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                    ┌─────────────────┐
                    │   DataFlow      │
                    │   Control Plane │
                    │                 │
                    │ • Orchestration │
                    │ • Monitoring    │
                    │ • Metadata      │
                    │ • Security      │
                    └─────────────────┘

🛠️ Technology Stack

  • Runtime: Python 3.7+, Apache Spark 2.4+, Apache Kafka 2.2+
  • Storage: PostgreSQL, Redis, Apache Parquet
  • Container Platform: Docker, Kubernetes
  • Monitoring: Prometheus, Grafana, ELK Stack
  • Web Framework: Flask, React, TypeScript

📦 Quick Start

Prerequisites

  • Docker 18.09+
  • Kubernetes 1.14+ (for production)
  • Python 3.7+ (for development)

Installation

  1. Deploy with Docker Compose (Development):

     git clone https://github.com/your-org/dataflow.git
     cd dataflow
     docker-compose up -d

  2. Deploy on Kubernetes (Production):

     helm repo add dataflow https://charts.dataflow.io
     helm install dataflow dataflow/dataflow-platform

  3. Local Development Setup:

     python -m venv venv
     source venv/bin/activate
     pip install -r requirements.txt
     python -m dataflow.cli init --dev

Creating Your First Pipeline

from dataflow import Pipeline, PostgreSQLSource, S3Sink, Validator

pipeline = Pipeline("user_events_etl")

# Source: PostgreSQL database
source = PostgreSQLSource(
    connection_string="postgresql://user:pass@localhost/prod",
    query="SELECT * FROM user_events WHERE created_at > '{last_run}'"
)

# Validation: Ensure data quality
validator = Validator()
validator.add_rule("user_id", "not_null")
validator.add_rule("event_type", "in", ["click", "view", "purchase"])
validator.add_rule("timestamp", "datetime_format", "%Y-%m-%d %H:%M:%S")

# Sink: Amazon S3 data lake
sink = S3Sink(
    bucket="company-data-lake",
    prefix="events/year={year}/month={month}/day={day}",
    format="parquet",
    compression="snappy"
)

# Build pipeline
pipeline.source(source)
pipeline.validate(validator)
pipeline.transform("events_transformer.py")
pipeline.sink(sink)

# Schedule for every hour
pipeline.schedule("0 * * * *")
pipeline.save()
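
The transform step above references events_transformer.py, which is not shown in this README. Here is a minimal sketch of what such a module might contain, assuming DataFlow invokes a record-level transform() function (the name and contract are assumptions):

# events_transformer.py -- hypothetical contents
from datetime import datetime, timezone

def transform(record: dict) -> dict:
    """Normalize one user_events row before it is written to the S3 sink."""
    out = dict(record)
    out["event_type"] = out["event_type"].lower()                  # consistent partition values
    out["processed_at"] = datetime.now(timezone.utc).isoformat()   # processing timestamp (UTC)
    return out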

📊 Use Cases

Real-Time Analytics Pipeline

from dataflow import RedisCache  # RedisCache needs an import; top-level path mirrors S3Sink above
from dataflow.streaming import KafkaSource, StreamProcessor

processor = StreamProcessor("real_time_analytics")
processor.source(KafkaSource("user_events"))                 # consume the user_events topic
processor.window(size="5m", slide="1m")                      # 5-minute window, sliding every minute
processor.aggregate(["user_id"], ["count", "sum(revenue)"])  # per-user counts and revenue totals
processor.sink(RedisCache("analytics_cache"))                # serve results from a Redis cache

Data Quality Monitoring

from dataflow.quality import QualityMonitor

monitor = QualityMonitor("daily_quality_check")
monitor.profile_all_tables()                          # collect statistics for every registered table
monitor.detect_anomalies(sensitivity=0.95)            # higher sensitivity surfaces more anomalies
monitor.alert_on_failure(slack_channel="#data-ops")   # notify the channel when a check fails

🔧 Configuration

DataFlow uses YAML-based configuration:

# dataflow.yml
engine:
  parallelism: 4
  checkpoint_interval: "5m"
  
storage:
  metadata_db: "postgresql://localhost/dataflow_meta"
  cache: "redis://localhost:6379"

monitoring:
  prometheus:
    enabled: true
    port: 9090
  logging:
    level: INFO
    format: json

security:
  authentication: oauth2
  encryption: aes256
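
As a sketch of reading this file from application code using PyYAML (an assumption; DataFlow's own config loader may differ):

# read_config.py -- minimal sketch, requires `pip install pyyaml`
import yaml

with open("dataflow.yml") as f:
    config = yaml.safe_load(f)

print(config["engine"]["parallelism"])               # -> 4
print(config["monitoring"]["prometheus"]["port"])    # -> 9090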

🚦 Production Deployment

Scaling Guidelines

  • Small deployment: 2-4 worker nodes, handles ~1GB/hour
  • Medium deployment: 8-16 worker nodes, handles ~50GB/hour
  • Large deployment: 32+ worker nodes, handles ~500GB/hour

High Availability Setup

# kubernetes/values.yml
replicaCount: 3
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70

📈 Performance Benchmarks

Workload Type   Throughput   Latency   Resource Usage
Batch ETL       2GB/min      N/A       4 CPU, 8GB RAM
Streaming       100k msg/s   <100ms    2 CPU, 4GB RAM
Data Quality    1M rows/s    <50ms     1 CPU, 2GB RAM

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

make dev-setup
make test
make lint

Running Tests

# Unit tests
pytest tests/unit/

# Integration tests  
pytest tests/integration/

# End-to-end tests
pytest tests/e2e/

📄 License

DataFlow is licensed under the Apache License 2.0. See LICENSE for details.

🆘 Support

Made with ❤️ by the DataFlow Team
