Quick Start Guide

Get up and running with the Data Contract Execution Engine.

Prerequisites

  • Python 3.7+
  • pip
  • AWS Account (for Lambda deployment)

1. Installation

# Clone the repository
git clone https://github.com/<your-username>/data-contract-execution-engine.git
cd data-contract-execution-engine

# Install dependencies
pip install -r requirements.txt

2. Define Your Contract

Create or edit contracts/my_contract.yaml:

name: "Customer Data"
version: "1.0"

source_s3_path: "s3://my-bucket/input/customers.csv"
target_s3_path: "s3://my-bucket/output/customers_validated.csv"

schema:
  columns:
    customer_id:
      type: "integer"
      nullable: false
    name:
      type: "string"
      nullable: false
    email:
      type: "string"
    age:
      type: "integer"

sla:
  min_rows: 1
  max_rows: 1000000
  completeness_threshold: 0.95
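
Before running the engine, it can help to confirm the file is well-formed YAML. A minimal sanity check, assuming PyYAML is installed (a typical dependency for YAML-based contracts); the required-key list below is inferred from the sample contract above:

python -c "
import yaml

# Illustrative check only: parse the contract and look for the expected top-level keys
contract = yaml.safe_load(open('contracts/my_contract.yaml'))
missing = [k for k in ('name', 'version', 'schema', 'sla') if k not in contract]
print('Contract looks well-formed' if not missing else 'Missing keys: ' + ', '.join(missing))
"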

3. Test Locally

Run the test suite to verify everything works:

# Run all tests
python -m pytest tests/ -v

# Run with coverage report
python -m pytest tests/ --cov=engine --cov=runtime

Expected output:

tests/test_contract_parser.py::test_load_contract PASSED
tests/test_validation_engine.py::test_validate_schema PASSED
tests/test_sla_enforcer.py::test_enforce_sla PASSED
tests/test_pipeline_generator.py::test_generate_pipeline PASSED
tests/test_lambda_handler.py::test_handler PASSED

======================== 18+ tests passed ========================

Or test manually with sample data:

python -c "
import pandas as pd
from engine.contract_parser import load_contract
from engine.pipeline_generator import PipelineGenerator

# Load contract and test
contract = load_contract('contracts/sample_contract.yaml')
df = pd.read_csv('examples/customers_expanded.csv')
pipeline = PipelineGenerator(contract)
results = pipeline.generate(df)
print('Validation Success!' if results.get('success') else 'Validation Failed')
print(results)
"

4. Test with Local Files

The engine supports both local file paths and S3 paths. This is useful for testing before deploying to Lambda.
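
Under the hood, telling the two apart usually comes down to checking for the s3:// prefix. A rough sketch of the idea (illustrative only; the engine's own helpers may differ):

# Illustrative only -- not the engine's actual implementation
import io
import boto3
import pandas as pd

def read_dataframe(path: str) -> pd.DataFrame:
    if path.startswith("s3://"):
        bucket, key = path[len("s3://"):].split("/", 1)
        body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
        return pd.read_csv(io.BytesIO(body))
    return pd.read_csv(path)  # plain local path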

Step 1: Create output directory

mkdir output

Step 2: Test with local paths

python -c "
from runtime.lambda_handler import handler

# Test with local files
event = {
    'contract_path': 'contracts/sample_contract.yaml',
    'source_path': 'examples/customers_expanded.csv',
    'target_path': 'output/customers_validated.csv'
}

result = handler(event, None)
print(result)
"

Step 3: Verify output

# Check the validated data
cat output/customers_validated.csv

You can mix local and S3 paths:

# Read from local, write to S3
{
    "contract_path": "contracts/sample_contract.yaml",
    "source_path": "examples/customers_expanded.csv",
    "target_path": "s3://my-bucket/output/validated.csv"
}

# Read from S3, write to local
{
    "contract_path": "s3://my-bucket/contracts/contract.yaml",
    "source_path": "s3://my-bucket/data/customers.csv",
    "target_path": "output/customers_validated.csv"
}

5. Deploy to Lambda

For detailed Lambda deployment instructions, see LAMBDA_DEPLOYMENT.md.

Quick deployment:

# Create deployment package
mkdir lambda_build
cp -r engine/ lambda_build/
cp -r runtime/ lambda_build/
cp -r contracts/ lambda_build/
cp requirements.txt lambda_build/

# Create ZIP
cd lambda_build
zip -r ../lambda_function.zip .
cd ..

# Deploy (requires AWS CLI configured)
aws lambda update-function-code \
  --function-name data-contract-executor \
  --zip-file fileb://lambda_function.zip
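
Note that copying requirements.txt into lambda_build/ does not bundle the libraries themselves. Unless the function pulls its dependencies from a Lambda layer, you will likely need to install them into the package before the ZIP step, for example:

# Install dependencies into the deployment package (run before zipping)
pip install -r requirements.txt -t lambda_build/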

6. Test Lambda

Create a test event:

{
  "contract_path": "contracts/my_contract.yaml"
}

Or override paths (supports both S3 and local):

{
  "contract_path": "contracts/my_contract.yaml",
  "source_path": "examples/customers_expanded.csv",
  "target_path": "output/customers_validated.csv"
}
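
One way to exercise the deployed function from the command line is the AWS CLI; examples/lambda_event.json in this repository is a natural payload to start with. (With AWS CLI v2, the --cli-binary-format flag keeps the JSON payload from being treated as base64.)

# Invoke the deployed function with a test event
aws lambda invoke \
  --function-name data-contract-executor \
  --cli-binary-format raw-in-base64-out \
  --payload file://examples/lambda_event.json \
  response.json

cat response.json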

The Lambda handler will (see the sketch after this list):

  1. Load the contract (S3 or local)
  2. Read data from source path (S3 or local)
  3. Validate against schema & SLA rules
  4. Write to target path (S3 or local)
  5. Return validation results
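
A stripped-down sketch of that flow, for orientation only (the real implementation is runtime/lambda_handler.py and also handles S3 reads, writes, and error reporting):

# Illustrative skeleton; not the actual handler
import pandas as pd
from engine.contract_parser import load_contract
from engine.pipeline_generator import PipelineGenerator

def handler(event, context):
    contract = load_contract(event["contract_path"])       # 1. load the contract
    df = pd.read_csv(event["source_path"])                  # 2. read source data (local path shown)
    results = PipelineGenerator(contract).generate(df)      # 3. validate against schema & SLA
    # 4. writing the validated frame to target_path is omitted in this sketch
    return {                                                 # 5. return validation results
        "statusCode": 200 if results.get("success") else 500,
        "body": {"pipeline_results": results},
    }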

Response Example

Success:

{
  "statusCode": 200,
  "body": {
    "message": "Data contract execution completed successfully",
    "contract": "Customer Data",
    "input_rows": 10,
    "output_rows": 5,
    "pipeline_results": {
      "input_rows": 5,
      "success": true,
      "steps": [...]
    }
  }
}

Error:

{
  "statusCode": 500,
  "body": {
    "message": "Data contract execution failed: error details"
  }
}

Running Tests

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=engine --cov=runtime

# Run specific test
pytest tests/test_validation_engine.py -v

Common Issues

Issue              | Solution
Missing columns    | Add columns to CSV or update contract schema
SLA violation      | Check min_rows, max_rows, completeness_threshold
S3 access denied   | Verify Lambda IAM role has S3 permissions
FileNotFoundError  | Ensure contract_path is correct

Project Structure

data-contract-execution-engine/
├── engine/                  # Core modules
│   ├── contract_parser.py
│   ├── validation_engine.py
│   ├── sla_enforcer.py
│   └── pipeline_generator.py
├── runtime/
│   └── lambda_handler.py
├── contracts/
│   └── sample_contract.yaml
├── examples/
│   ├── sample_data.csv
│   └── lambda_event.json
├── tests/                   # Test suite
├── requirements.txt
├── README.md
└── QUICKSTART.md (this file)

Next Steps

Security Best Practices

  1. Use IAM Roles - Never store AWS credentials in code
  2. S3 Encryption - Enable encryption for sensitive data
  3. Contract Validation - Validate contracts before deployment
  4. Logging - Review CloudWatch logs for security issues
  5. Least Privilege - Grant minimum required permissions (see the example policy below)
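
For item 5, an illustrative least-privilege policy for the Lambda role might look like the following (bucket name and prefixes are placeholders; adjust them to your own layout):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::my-bucket/contracts/*",
        "arn:aws:s3:::my-bucket/input/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::my-bucket/output/*"
    }
  ]
}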

Performance Tips

For Lambda with data under 100MB:

  • Set Lambda memory to at least 3GB
  • Use CSV format
  • Keep transformations simple
  • Set timeout to 15 minutes

For larger datasets over 100MB, consider AWS Glue with Parquet format and auto-scaling workers.


Questions? Check the documentation or open an issue on GitHub.