feat: implement AWS Batch infrastructure and AutoML training pipeline #1
cristofima merged 29 commits into main from
Conversation
Implemented AWS Batch with Fargate Spot compute for cost-effective AutoML training. This includes the creation of security groups, compute environments, job queues, and job definitions.

Modified files (10):
- infrastructure/terraform/batch.tf: Added Batch resources
- infrastructure/terraform/data.tf: Added data sources for VPC and subnets
- infrastructure/terraform/dynamodb.tf: Created DynamoDB tables for datasets and training jobs
- infrastructure/terraform/ecr.tf: Added ECR repository for training container
- infrastructure/terraform/iam.tf: Defined IAM roles and policies for Lambda and Batch
- infrastructure/terraform/lambda.tf: Configured Lambda function and CloudWatch logs
- infrastructure/terraform/main.tf: Set up Terraform backend and provider
- infrastructure/terraform/outputs.tf: Defined outputs for infrastructure
- infrastructure/terraform/prod.tfvars: Added production variables
- infrastructure/terraform/s3.tf: Created S3 buckets for datasets, models, and reports

Cost impact: Utilizing Fargate Spot can reduce training costs by up to 70% compared to on-demand pricing.
…d training status page

Added a new file upload component for CSV files with validation, along with a training status page that displays job details and progress. This enhances user experience by providing real-time feedback on training jobs.

Modified files (5):
- frontend/components/FileUpload.tsx: New component for file uploads
- frontend/app/training/[jobId]/page.tsx: New training status page
- frontend/lib/api.ts: API client for upload and job details
- frontend/lib/utils.ts: Utility functions for formatting and validation
- frontend/styles/globals.css: Added styles for new components

This change allows users to upload datasets and monitor the training process seamlessly.
This change introduces a complete training pipeline for AutoML using FLAML. It includes data preprocessing, exploratory data analysis (EDA), model training, and saving the trained model to S3. The pipeline is designed to be executed as an AWS Batch job, allowing for scalable training on large datasets.

Modified files (8):
- backend/training/train.py: Main training script with job handling
- backend/training/preprocessor.py: Automatic data preprocessing logic
- backend/training/eda.py: EDA report generation using Sweetviz
- backend/training/model_trainer.py: FLAML model training integration
- backend/api/services/s3_service.py: S3 operations for model and report storage
- backend/api/utils/helpers.py: Configuration settings for AWS resources
- backend/training/Dockerfile: Dockerfile for training container setup
- backend/training/requirements.txt: Updated dependencies for training, including FLAML for AutoML and the libraries needed for EDA

This implementation enhances the training capabilities of the system, allowing for automated model selection and hyperparameter tuning.
Introduced a Dockerfile to streamline the CI/CD process for building the Lambda deployment package. This change enhances the deployment efficiency and ensures that all necessary dependencies are included in the package.

Modified files (1):
- infrastructure/terraform/scripts/Dockerfile.lambda: New file for Lambda package creation
This change introduces multiple workflows for managing the AWS infrastructure using Terraform. The following workflows have been added:
- CI Terraform: Validates Terraform syntax and formatting on pull requests and pushes.
- Deploy Infrastructure: Automates the deployment of infrastructure changes with manual dispatch options.
- Deploy Lambda API: Handles the deployment of the Lambda API with environment selection.
- Deploy Training Container: Manages the building and pushing of the training container to ECR.
- Destroy Environment: Provides a manual workflow to safely destroy environments with confirmation checks.

These workflows enhance the CI/CD process, ensuring that infrastructure changes are validated and deployed efficiently.

Modified files (6):
- .github/workflows/ci-terraform.yml
- .github/workflows/deploy-infrastructure.yml
- .github/workflows/deploy-lambda-api.yml
- .github/workflows/deploy-training-container.yml
- .github/workflows/destroy-environment.yml
Introduced a PowerShell script to automate the setup of the Terraform S3 backend with DynamoDB state locking. This script creates the necessary AWS resources, including the S3 bucket and DynamoDB table, and configures versioning, encryption, and access policies.

Modified files (1):
- tools/setup-backend.ps1: New script for backend setup

This enhancement simplifies the initial configuration process for users, ensuring a consistent and efficient setup for managing Terraform state.
Added a comprehensive guide for generating conventional commit messages specific to the AWS AutoML Lite project. This includes commit structure, types, scopes, body guidelines, and common mistakes to avoid. Also configured VSCode to reference this guide for GitHub Copilot.

Modified files (2):
- .github/git-commit-messages-instructions.md: New commit message guidelines
- .vscode/settings.json: Reference to commit message guidelines for Copilot
This commit introduces a comprehensive quick start guide for deploying the AWS AutoML Lite platform using Terraform. The guide includes prerequisites, step-by-step instructions for configuring the AWS CLI, deploying infrastructure, building and pushing the training container, and testing the deployment.

Modified files (2):
- docs/QUICKSTART.md: New file with deployment instructions
- docs/README.md: Updated index to include quick start guide

The quick start guide aims to streamline the onboarding process for new users and provide clear, actionable steps for setting up the AutoML Lite platform.
Modified files (1):
- README.md: Updated prerequisites and corrected links to documentation

Changes include specifying the Terraform version and enhancing documentation links for better navigation and clarity in setup instructions.
Modified workflows to ignore Markdown and text files in Terraform and backend directories, ensuring cleaner CI runs without unnecessary file changes triggering validations.

Modified files (4):
- .github/workflows/ci-terraform.yml
- .github/workflows/deploy-infrastructure.yml
- .github/workflows/deploy-lambda-api.yml
- .github/workflows/deploy-training-container.yml
TERRAFORM IMPROVEMENTS:
- Add variable validation (environment, lambda_memory, lambda_timeout, aws_region)
- Mark sensitive outputs (lambda_arn, batch_job_definition)
- Apply terraform fmt to all files
- Document folder structure best practices

WORKFLOW FIXES:
- Fix workspace creation in deploy-lambda-api.yml
- Fix workspace creation in deploy-training-container.yml
- Both now use: terraform workspace select $ENV || terraform workspace new $ENV

DOCUMENTATION:
- Add comprehensive TERRAFORM_BEST_PRACTICES.md
- Document folder structure (matches 90% of AVM standard)
- Include Microsoft Learn references
- Add priority recommendations

Validation: terraform validate
Based on: Microsoft Learn Terraform Best Practices + AVM Standards
…er consistency

CRITICAL FIXES VERIFIED:
- terraform_wrapper: false in ALL terraform setup steps (4/4 workflows)
  - deploy-infrastructure plan job (FOUND MISSING during verification, FIXED)
  - deploy-infrastructure deploy job (already correct)
  - deploy-lambda-api (already correct)
  - deploy-training-container (already correct)
  - ci-terraform (not needed - validation only)
- Artifact paths corrected (deploy-infrastructure)
  - Upload: tfplan-$ENV (relative to working-directory)
  - Download: . (current directory)
- Infrastructure existence checks (2 workflows)
  - deploy-lambda-api: warns if infrastructure missing
  - deploy-training-container: fails if infrastructure missing
  - Both use: terraform state list | grep -q <resource>

PRODUCTION ERRORS FIXED:
1. Artifact not found → Path mismatch corrected
2. Invalid format 'Warning: No outputs found' → Wrapper disabled everywhere
3. Empty workspace detection → State list check before output

TERRAFORM BEST PRACTICES:
- Sensitive outputs marked (lambda_arn, batch_job_definition)
- Variable validation (environment, lambda_memory, timeout, region)
- Terraform fmt applied to all .tf files
- Folder structure documented (90% AVM compliance)

VERIFICATION COMPLETED:
- All 4 workflows have terraform_wrapper: false where needed
- AWS_ROLE_ARN properly configured (6 references)
- Gitignore excludes terraform files
- All outputs use -raw flag (5 instances)
- Infrastructure checks properly implemented

EXECUTION ORDER (CRITICAL):
1. deploy-infrastructure (FIRST - creates all resources)
2. deploy-lambda-api (optional - Lambda code updates only)
3. deploy-training-container (optional - ECR image updates only)

Fixes: Artifact paths, terraform wrapper consistency, empty workspace handling
Validation: terraform validate, all verification checks, 100% complete
Updated the deployment workflows to enhance the checks for existing infrastructure by directly attempting to retrieve outputs. This change ensures more reliable detection of deployed resources and provides clearer feedback in the GitHub Actions summary.

Modified files (2):
- .github/workflows/deploy-lambda-api.yml: Improved API URL check
- .github/workflows/deploy-training-container.yml: Enhanced ECR URL check
Improved checks for API and ECR URLs in deployment workflows to ensure valid outputs without warnings or errors. This change prevents potential issues during deployment by ensuring that the infrastructure is correctly set up before proceeding.

Modified files (2):
- .github/workflows/deploy-lambda-api.yml: Enhanced API URL validation
- .github/workflows/deploy-training-container.yml: Enhanced ECR URL validation
Updated workflows to improve Lambda package building and Terraform workspace handling. Added checks for workspace existence and resource deployment status to prevent errors during execution.

Modified files (3):
- .github/workflows/deploy-infrastructure.yml: Added Lambda package build step
- .github/workflows/deploy-lambda-api.yml: Improved workspace selection logic
- .github/workflows/deploy-training-container.yml: Enhanced resource checks
- Add __init__.py to backend/api/ and all subdirectories
- Fix .gitignore to not ignore backend/api/models/
- Add models/schemas.py that was previously ignored
- Add dummy lambda zip creation to CI terraform workflow

This fixes the Lambda import error: 'No module named api.models'
Pull request overview
This PR implements a comprehensive AWS Batch infrastructure and AutoML training pipeline for the AWS AutoML Lite project. It establishes the complete serverless ML platform using Terraform, including API endpoints (Lambda + API Gateway), training infrastructure (AWS Batch + Fargate), data storage (S3 + DynamoDB), and a Next.js frontend. The implementation follows AWS best practices with OIDC authentication for CI/CD, comprehensive documentation, and environment-specific configurations.
Key Changes:
- Complete Terraform infrastructure (44+ AWS resources) with S3 backend and DynamoDB state locking
- Training pipeline using AWS Batch + Fargate Spot with FLAML AutoML
- FastAPI Lambda API with presigned S3 URLs and DynamoDB integration
- Next.js 14 frontend with TypeScript and TailwindCSS
- CI/CD workflows with GitHub Actions and OIDC authentication
- Comprehensive documentation and operational tools
Reviewed changes
Copilot reviewed 75 out of 79 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| infrastructure/terraform/*.tf | Complete IaC defining 44 AWS resources including Lambda, API Gateway, Batch, S3, DynamoDB, IAM roles |
| backend/api/ | FastAPI application with Mangum for Lambda deployment, S3/DynamoDB/Batch service integrations |
| backend/training/ | Training container with FLAML, preprocessing, EDA generation, and model persistence |
| frontend/app/ | Next.js 14 App Router pages for upload, configuration, training status, and results |
| tools/*.ps1 | PowerShell scripts for Terraform backend setup and resource verification |
| docs/ | Comprehensive documentation including quickstart, reference, and architecture decisions |
```hcl
variable "batch_vcpu" {
  description = "Batch job vCPU"
  type        = string
  default     = "2"
}

variable "batch_memory" {
  description = "Batch job memory in MB"
  type        = string
  default     = "4096"
}
```
batch_vcpu and batch_memory are defined as strings but represent numeric resource specifications. Consider changing to type = number for better type safety and validation, since AWS Batch expects numeric values. This would also allow for validation constraints (e.g., min/max values).
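A number-typed variant with a validation block might look like the following sketch. The allowed values shown are an assumption for illustration (Fargate accepts only specific vCPU sizes), not taken from the PR:

```hcl
variable "batch_vcpu" {
  description = "Batch job vCPU"
  type        = number
  default     = 2

  validation {
    # Illustrative constraint: Fargate supports a discrete set of vCPU sizes.
    condition     = contains([1, 2, 4, 8, 16], var.batch_vcpu)
    error_message = "batch_vcpu must be one of 1, 2, 4, 8, or 16."
  }
}
```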
```python
if df[col].isnull().any():
    df[col].fillna(df[col].mode()[0] if not df[col].mode().empty else 'Unknown', inplace=True)
```
[nitpick] The nested ternary operator for categorical missing value imputation is hard to read. Consider extracting this logic into a clearer multi-line statement: mode_value = df[col].mode()[0] if not df[col].mode().empty else 'Unknown' followed by df[col].fillna(mode_value, inplace=True).
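The suggested refactor can be sketched as follows (a minimal illustration assuming a pandas DataFrame `df` and a categorical column name `col`; the helper name `impute_categorical` is hypothetical). Assigning the result of `fillna` back also sidesteps `inplace=True`, which pandas discourages:

```python
import pandas as pd

def impute_categorical(df: pd.DataFrame, col: str) -> None:
    """Fill missing values in a categorical column with its mode,
    falling back to 'Unknown' when the column has no mode
    (i.e., every value is null)."""
    if df[col].isnull().any():
        modes = df[col].mode()
        mode_value = modes[0] if not modes.empty else 'Unknown'
        df[col] = df[col].fillna(mode_value)

df = pd.DataFrame({"color": ["red", "red", None, "blue"]})
impute_categorical(df, "color")
print(df["color"].tolist())  # ['red', 'red', 'red', 'blue']
```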
```python
dataset_id=request.dataset_id,
target_column=request.target_column,
job_id=job_id,
config=request.config.model_dump()
```
Potential AttributeError if request.config is None. The TrainRequest schema defines config as Optional[TrainingConfig] = TrainingConfig(), which provides a default, but defensive coding would add a null check or use request.config.model_dump() if request.config else {}.
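The defensive pattern can be sketched like this. The classes below are minimal stand-ins for the project's Pydantic models (the names `TrainRequest` and `TrainingConfig` come from the review comment; the `time_budget` field is illustrative):

```python
from typing import Optional

class TrainingConfig:
    """Stand-in for the project's Pydantic TrainingConfig model."""
    def __init__(self, time_budget: int = 60):
        self.time_budget = time_budget

    def model_dump(self) -> dict:
        return {"time_budget": self.time_budget}

class TrainRequest:
    """Stand-in for the project's TrainRequest schema."""
    def __init__(self, config: Optional[TrainingConfig] = None):
        self.config = config

def build_config(request: TrainRequest) -> dict:
    # Defensive: tolerate config=None even though the schema default
    # should normally prevent it.
    return request.config.model_dump() if request.config else {}

print(build_config(TrainRequest(TrainingConfig())))  # {'time_budget': 60}
print(build_config(TrainRequest(config=None)))       # {}
```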
```powershell
Test-Resource "API Health Endpoint" `
    "curl -s https://sirpi54231.execute-api.us-east-1.amazonaws.com/dev/health" `
    "healthy"
```
Hardcoded API Gateway ID (sirpi54231) and AWS account ID (835503570883) are exposed in multiple files. These should be replaced with environment variables or Terraform outputs to avoid exposing infrastructure details in version control.
```typescript
// API Client for AWS AutoML Lite
// Centralized API calls to backend

const API_URL = process.env.NEXT_PUBLIC_API_URL || 'http://localhost:8000';
```
Fallback to localhost in production builds could mask configuration errors. Consider throwing an error in production environments when NEXT_PUBLIC_API_URL is not set, or at minimum log a warning to help detect misconfiguration.
```python
import boto3
import pandas as pd
from datetime import datetime
from io import StringIO
```
Import of 'StringIO' is not used.
```python
from fastapi import APIRouter, HTTPException, status, Query
```
Import of 'Query' is not used.
Suggested change:

```diff
-from fastapi import APIRouter, HTTPException, status, Query
+from fastapi import APIRouter, HTTPException, status
```
```python
from fastapi import APIRouter, HTTPException, status, Query
from datetime import datetime
from typing import Optional
```
Import of 'Optional' is not used.
Suggested change:

```diff
-from typing import Optional
```
```python
from fastapi import APIRouter, HTTPException, status
from datetime import datetime
```
Import of 'datetime' is not used.
Suggested change:

```diff
-from datetime import datetime
```
```python
import uuid
from ..models.schemas import UploadRequest, UploadResponse
from ..services.s3_service import s3_service
from ..services.dynamo_service import dynamodb_service
```
Import of 'dynamodb_service' is not used.
Suggested change:

```diff
-from ..services.dynamo_service import dynamodb_service
```
Introduced a new datasets router for managing dataset uploads and metadata. This includes endpoints for confirming uploads and retrieving dataset metadata from DynamoDB. The changes enhance the API's capability to handle datasets effectively, allowing for better integration with the training process.

Modified files (6):
- backend/api/routers/datasets.py: New router for dataset operations
- backend/api/models/schemas.py: Added DatasetMetadata schema
- backend/api/routers/models.py: Updated to include dataset_id in job response
- backend/api/services/dynamo_service.py: Added methods for dataset metadata
- backend/api/services/s3_service.py: New methods for S3 object management
- backend/api/utils/helpers.py: Updated settings for new configurations
Modified files (3):
- infrastructure/terraform/batch.tf: Convert vCPU and memory to number type
- infrastructure/terraform/iam.tf: Add S3 ListBucket permission for batch job role
- infrastructure/terraform/variables.tf: Change batch_vcpu and batch_memory types to number

These changes ensure proper type handling for batch job configurations and enhance IAM permissions for accessing S3 resources.
Improved the heuristic for determining problem type based on target column data characteristics. Added checks for unique values in numeric columns to better classify them as regression or classification tasks. This change aims to increase the accuracy of model training by ensuring the correct problem type is identified.

Modified files (2):
- backend/api/routers/training.py: Updated training job logic
- backend/training/model_trainer.py: Adjusted metric selection for multiclass classification
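A heuristic of this kind can be sketched as follows. This is a simplified illustration, not the project's exact logic; the function name and the cutoff of 10 unique values are assumptions:

```python
import pandas as pd

def infer_problem_type(target: pd.Series, max_classes: int = 10) -> str:
    """Guess 'regression' vs 'classification' from a target column.

    Non-numeric targets are treated as classification; numeric targets
    with few distinct integer-like values (e.g., 0/1 labels) are also
    classification. The max_classes cutoff is illustrative.
    """
    if not pd.api.types.is_numeric_dtype(target):
        return "classification"
    # Numeric but with few distinct whole-number values: likely class labels.
    if target.nunique() <= max_classes and (target.dropna() % 1 == 0).all():
        return "classification"
    return "regression"

print(infer_problem_type(pd.Series([0, 1, 1, 0])))          # classification
print(infer_problem_type(pd.Series([3.2, 7.8, 1.1, 9.4])))  # regression
```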