DataFlow - Data Pipeline Platform

An ETL platform for managing data workflows, still early-stage and under active development.

DataFlow is a comprehensive, scalable data pipeline platform designed for modern enterprises. It supports both real-time streaming and batch ETL workloads with advanced data quality monitoring, lineage tracking, and operational visibility.

🚀 Key Features

Multi-Source Data Ingestion

  • 20+ Built-in Connectors: PostgreSQL, MySQL, MongoDB, Salesforce, REST APIs, Kafka, S3, HDFS
  • Pluggable Architecture: Easy to add custom connectors
  • Schema Evolution: Automatic handling of schema changes
  • Incremental Loading: Optimized data synchronization

Real-Time & Batch Processing

  • Streaming ETL: Apache Kafka-based real-time processing
  • Batch Processing: Apache Spark integration for large-scale transformations
  • Hybrid Workloads: Seamless combination of streaming and batch
  • Auto-scaling: Kubernetes-native horizontal scaling

Data Quality & Governance

  • Built-in Validators: 50+ pre-built data quality rules
  • Custom Validators: Python-based extensible validation framework
  • Data Profiling: Automatic statistics and anomaly detection
  • Data Lineage: End-to-end tracking of data transformations
  • GDPR Compliance: Built-in PII detection and handling

Enterprise-Grade Operations

  • Web Dashboard: React-based management interface
  • RESTful APIs: Complete programmatic control
  • Monitoring & Alerting: Prometheus metrics with Grafana dashboards
  • Multi-tenancy: Isolated environments for different teams
  • Role-based Access: Fine-grained security controls

πŸ—οΈ Architecture

DataFlow follows a microservices architecture designed for cloud-native deployments:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Data Sources  │    │   Processing    │    │   Destinations  │
│                 │    │                 │    │                 │
│ • Databases     │────│ • Stream Proc   │────│ • Data Lake     │
│ • APIs          │    │ • Batch Proc    │    │ • Data Warehouse│
│ • Files         │    │ • Validators    │    │ • Analytics     │
│ • Streams       │    │ • Transformers  │    │ • ML Platforms  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                    ┌─────────────────┐
                    │   DataFlow      │
                    │   Control Plane │
                    │                 │
                    │ • Orchestration │
                    │ • Monitoring    │
                    │ • Metadata      │
                    │ • Security      │
                    └─────────────────┘

🛠️ Technology Stack

  • Runtime: Python 3.7+, Apache Spark 2.4+, Apache Kafka 2.2+
  • Storage: PostgreSQL, Redis, Apache Parquet
  • Container Platform: Docker, Kubernetes
  • Monitoring: Prometheus, Grafana, ELK Stack
  • Web Framework: Flask, React, TypeScript

📦 Quick Start

Prerequisites

  • Docker 18.09+
  • Kubernetes 1.14+ (for production)
  • Python 3.7+ (for development)

Installation

  1. Deploy with Docker Compose (Development):

     git clone https://github.com/your-org/dataflow.git
     cd dataflow
     docker-compose up -d

  2. Deploy on Kubernetes (Production):

     helm repo add dataflow https://charts.dataflow.io
     helm install dataflow dataflow/dataflow-platform

  3. Local Development Setup:

     python -m venv venv
     source venv/bin/activate
     pip install -r requirements.txt
     python -m dataflow.cli init --dev

Creating Your First Pipeline

from dataflow import Pipeline, PostgreSQLSource, S3Sink, Validator

pipeline = Pipeline("user_events_etl")

# Source: PostgreSQL database
source = PostgreSQLSource(
    connection_string="postgresql://user:pass@localhost/prod",
    query="SELECT * FROM user_events WHERE created_at > '{last_run}'"
)

# Validation: Ensure data quality
validator = Validator()
validator.add_rule("user_id", "not_null")
validator.add_rule("event_type", "in", ["click", "view", "purchase"])
validator.add_rule("timestamp", "datetime_format", "%Y-%m-%d %H:%M:%S")

# Sink: Amazon S3 data lake
sink = S3Sink(
    bucket="company-data-lake",
    prefix="events/year={year}/month={month}/day={day}",
    format="parquet",
    compression="snappy"
)

# Build pipeline
pipeline.source(source)
pipeline.validate(validator)
pipeline.transform("events_transformer.py")
pipeline.sink(sink)

# Schedule for every hour
pipeline.schedule("0 * * * *")
pipeline.save()
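
The transform step above references events_transformer.py, which is not shown in this README. Here is a minimal sketch of what such a module might contain, assuming DataFlow invokes a record-level transform() function (the name and contract are assumptions):

# events_transformer.py -- hypothetical contents
from datetime import datetime, timezone

def transform(record: dict) -> dict:
    """Normalize one user_events row before it is written to the S3 sink."""
    out = dict(record)
    out["event_type"] = out["event_type"].lower()                  # consistent partition values
    out["processed_at"] = datetime.now(timezone.utc).isoformat()   # processing timestamp (UTC)
    return out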

📊 Use Cases

Real-Time Analytics Pipeline

from dataflow import RedisCache  # RedisCache needs an import; top-level path mirrors S3Sink above
from dataflow.streaming import KafkaSource, StreamProcessor

processor = StreamProcessor("real_time_analytics")
processor.source(KafkaSource("user_events"))                 # consume the user_events topic
processor.window(size="5m", slide="1m")                      # 5-minute window, sliding every minute
processor.aggregate(["user_id"], ["count", "sum(revenue)"])  # per-user counts and revenue totals
processor.sink(RedisCache("analytics_cache"))                # serve results from a Redis cache

Data Quality Monitoring

from dataflow.quality import QualityMonitor

monitor = QualityMonitor("daily_quality_check")
monitor.profile_all_tables()                          # collect statistics for every registered table
monitor.detect_anomalies(sensitivity=0.95)            # higher sensitivity surfaces more anomalies
monitor.alert_on_failure(slack_channel="#data-ops")   # notify the channel when a check fails

🔧 Configuration

DataFlow uses YAML-based configuration:

# dataflow.yml
engine:
  parallelism: 4
  checkpoint_interval: "5m"
  
storage:
  metadata_db: "postgresql://localhost/dataflow_meta"
  cache: "redis://localhost:6379"

monitoring:
  prometheus:
    enabled: true
    port: 9090
  logging:
    level: INFO
    format: json

security:
  authentication: oauth2
  encryption: aes256
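
As a sketch of reading this file from application code using PyYAML (an assumption; DataFlow's own config loader may differ):

# read_config.py -- minimal sketch, requires `pip install pyyaml`
import yaml

with open("dataflow.yml") as f:
    config = yaml.safe_load(f)

print(config["engine"]["parallelism"])               # -> 4
print(config["monitoring"]["prometheus"]["port"])    # -> 9090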

🚦 Production Deployment

Scaling Guidelines

  • Small deployment: 2-4 worker nodes, handles ~1GB/hour
  • Medium deployment: 8-16 worker nodes, handles ~50GB/hour
  • Large deployment: 32+ worker nodes, handles ~500GB/hour

High Availability Setup

# kubernetes/values.yml
replicaCount: 3
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70

📈 Performance Benchmarks

Workload Type   Throughput   Latency   Resource Usage
Batch ETL       2GB/min      N/A       4 CPU, 8GB RAM
Streaming       100k msg/s   <100ms    2 CPU, 4GB RAM
Data Quality    1M rows/s    <50ms     1 CPU, 2GB RAM

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

make dev-setup
make test
make lint

Running Tests

# Unit tests
pytest tests/unit/

# Integration tests  
pytest tests/integration/

# End-to-end tests
pytest tests/e2e/

📄 License

DataFlow is licensed under the Apache License 2.0. See LICENSE for details.

🆘 Support

Made with ❤️ by the DataFlow Team
