A growing ETL platform for managing data workflows. Still early stage, but maturing quickly.
DataFlow is a comprehensive, scalable data pipeline platform designed for modern enterprises. It supports both real-time streaming and batch ETL workloads with advanced data quality monitoring, lineage tracking, and operational visibility.
- 20+ Built-in Connectors: PostgreSQL, MySQL, MongoDB, Salesforce, REST APIs, Kafka, S3, HDFS
- Pluggable Architecture: Easy to add custom connectors (see the connector sketch after this feature list)
- Schema Evolution: Automatic handling of schema changes
- Incremental Loading: Optimized data synchronization
- Streaming ETL: Apache Kafka-based real-time processing
- Batch Processing: Apache Spark integration for large-scale transformations
- Hybrid Workloads: Seamless combination of streaming and batch
- Auto-scaling: Kubernetes-native horizontal scaling
- Built-in Validators: 50+ pre-built data quality rules
- Custom Validators: Python-based extensible validation framework
- Data Profiling: Automatic statistics and anomaly detection
- Data Lineage: End-to-end tracking of data transformations
- GDPR Compliance: Built-in PII detection and handling
- Web Dashboard: React-based management interface
- RESTful APIs: Complete programmatic control
- Monitoring & Alerting: Prometheus metrics with Grafana dashboards
- Multi-tenancy: Isolated environments for different teams
- Role-based Access: Fine-grained security controls
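As a rough sketch of how a custom connector plugin could be written (the `dataflow.connectors.Source` base class, its import path, and the `read()` iterator contract below are illustrative assumptions, not confirmed DataFlow API):

```python
# Hypothetical custom connector. The Source base class, its import path,
# and the read() contract are assumptions for illustration only.
import csv
import urllib.request
from typing import Dict, Iterator

from dataflow.connectors import Source  # assumed plugin base class


class HttpCsvSource(Source):
    """Reads records from a CSV file served over HTTP."""

    def __init__(self, url: str):
        self.url = url

    def read(self) -> Iterator[Dict[str, str]]:
        # Fetch the file once and yield one dict per CSV row.
        with urllib.request.urlopen(self.url) as resp:
            lines = resp.read().decode("utf-8").splitlines()
        yield from csv.DictReader(lines)
```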
DataFlow follows a microservices architecture designed for cloud-native deployments:
```
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   Data Sources   │     │    Processing    │     │   Destinations   │
│                  │     │                  │     │                  │
│  • Databases     │────▶│  • Stream Proc   │────▶│  • Data Lake     │
│  • APIs          │     │  • Batch Proc    │     │  • Data Warehouse│
│  • Files         │     │  • Validators    │     │  • Analytics     │
│  • Streams       │     │  • Transformers  │     │  • ML Platforms  │
└──────────────────┘     └──────────────────┘     └──────────────────┘
         │                        │                        │
         └────────────────────────┼────────────────────────┘
                                  │
                         ┌──────────────────┐
                         │     DataFlow     │
                         │  Control Plane   │
                         │                  │
                         │  • Orchestration │
                         │  • Monitoring    │
                         │  • Metadata      │
                         │  • Security      │
                         └──────────────────┘
```
- Runtime: Python 3.7+, Apache Spark 2.4+, Apache Kafka 2.2+
- Storage: PostgreSQL, Redis, Apache Parquet
- Container Platform: Docker, Kubernetes
- Monitoring: Prometheus, Grafana, ELK Stack
- Web Framework: Flask, React, TypeScript
- Docker 18.09+
- Kubernetes 1.14+ (for production)
- Python 3.7+ (for development)
- Deploy with Docker Compose (Development):

```bash
git clone https://github.com/your-org/dataflow.git
cd dataflow
docker-compose up -d
```

- Deploy on Kubernetes (Production):

```bash
helm repo add dataflow https://charts.dataflow.io
helm install dataflow dataflow/dataflow-platform
```

- Local Development Setup:

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python -m dataflow.cli init --dev
```

A basic batch ETL pipeline is defined in Python:

```python
from dataflow import Pipeline, PostgreSQLSource, S3Sink, Validator

pipeline = Pipeline("user_events_etl")

# Source: PostgreSQL database
source = PostgreSQLSource(
    connection_string="postgresql://user:pass@localhost/prod",
    query="SELECT * FROM user_events WHERE created_at > '{last_run}'"
)

# Validation: ensure data quality
validator = Validator()
validator.add_rule("user_id", "not_null")
validator.add_rule("event_type", "in", ["click", "view", "purchase"])
validator.add_rule("timestamp", "datetime_format", "%Y-%m-%d %H:%M:%S")

# Sink: Amazon S3 data lake
sink = S3Sink(
    bucket="company-data-lake",
    prefix="events/year={year}/month={month}/day={day}",
    format="parquet",
    compression="snappy"
)

# Build the pipeline
pipeline.source(source)
pipeline.validate(validator)
pipeline.transform("events_transformer.py")
pipeline.sink(sink)

# Schedule for every hour
pipeline.schedule("0 * * * *")
pipeline.save()
```
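The `pipeline.transform("events_transformer.py")` step above points at a user-supplied script. A minimal sketch of what such a script might contain, assuming a `transform(df)` entry point operating on pandas DataFrames (that convention is an assumption for illustration, not a documented DataFlow contract):

```python
# events_transformer.py — hypothetical user-supplied transform.
# The transform(df) entry point and pandas DataFrame in/out are assumed, not confirmed API.
import pandas as pd


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize event rows before they are written to the data lake."""
    # Drop rows missing the keys the validator checks upstream.
    df = df.dropna(subset=["user_id", "event_type"])
    # Parse timestamps and discard rows that cannot be parsed.
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    df = df.dropna(subset=["timestamp"])
    # Standardize event types for downstream aggregation.
    df["event_type"] = df["event_type"].str.lower()
    return df
```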
A streaming pipeline over Kafka with windowed aggregation:

```python
from dataflow.streaming import KafkaSource, StreamProcessor, RedisCache  # RedisCache import path assumed

processor = StreamProcessor("real_time_analytics")
processor.source(KafkaSource("user_events"))
processor.window(size="5m", slide="1m")
processor.aggregate(["user_id"], ["count", "sum(revenue)"])
processor.sink(RedisCache("analytics_cache"))
```

Automated data quality monitoring:

```python
from dataflow.quality import DataProfiler, QualityMonitor

monitor = QualityMonitor("daily_quality_check")
monitor.profile_all_tables()
monitor.detect_anomalies(sensitivity=0.95)
monitor.alert_on_failure(slack_channel="#data-ops")
```

DataFlow uses YAML-based configuration:
```yaml
# dataflow.yml
engine:
  parallelism: 4
  checkpoint_interval: "5m"

storage:
  metadata_db: "postgresql://localhost/dataflow_meta"
  cache: "redis://localhost:6379"

monitoring:
  prometheus:
    enabled: true
    port: 9090

logging:
  level: INFO
  format: json

security:
  authentication: oauth2
  encryption: aes256
```

Deployment sizing guidelines:

- Small deployment: 2-4 worker nodes, handles ~1GB/hour
- Medium deployment: 8-16 worker nodes, handles ~50GB/hour
- Large deployment: 32+ worker nodes, handles ~500GB/hour
Horizontal scaling is configured through Helm chart values:

```yaml
# kubernetes/values.yml
replicaCount: 3

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
```

| Workload Type | Throughput | Latency | Resource Usage |
|---|---|---|---|
| Batch ETL | 2GB/min | N/A | 4 CPU, 8GB RAM |
| Streaming | 100k msg/s | <100ms | 2 CPU, 4GB RAM |
| Data Quality | 1M rows/s | <50ms | 1 CPU, 2GB RAM |
We welcome contributions! Please see our Contributing Guide for details.
```bash
make dev-setup
make test
make lint
```

```bash
# Unit tests
pytest tests/unit/
# Integration tests
pytest tests/integration/
# End-to-end tests
pytest tests/e2e/
```

DataFlow is licensed under the Apache License 2.0. See LICENSE for details.
- Documentation
- Community Slack
- Issue Tracker
- Enterprise Support
Made with ❤️ by the DataFlow Team