Skip to content

feat: implement AWS Batch infrastructure and AutoML training pipeline#1

Merged
cristofima merged 29 commits intomainfrom
dev
Nov 30, 2025
Merged

feat: implement AWS Batch infrastructure and AutoML training pipeline#1
cristofima merged 29 commits intomainfrom
dev

Conversation

@cristofima
Copy link
Copy Markdown
Owner

@cristofima cristofima commented Nov 28, 2025

This PR implements a comprehensive AWS Batch infrastructure and AutoML training pipeline for the AWS AutoML Lite project. It establishes the complete serverless ML platform using Terraform, including API endpoints (Lambda + API Gateway), training infrastructure (AWS Batch + Fargate), data storage (S3 + DynamoDB), and a Next.js frontend. The implementation follows AWS best practices with OIDC authentication for CI/CD, comprehensive documentation, and environment-specific configurations.

Key Changes:

  • Complete Terraform infrastructure (44+ AWS resources) with S3 backend and DynamoDB state locking
  • Training pipeline using AWS Batch + Fargate Spot with FLAML AutoML
  • FastAPI Lambda API with presigned S3 URLs and DynamoDB integration
  • Next.js 16 frontend with TypeScript and TailwindCSS
  • CI/CD workflows with GitHub Actions and OIDC authentication
  • Comprehensive documentation and operational tools

Implemented AWS Batch with Fargate Spot compute for cost-effective AutoML training. This includes the creation of security groups, compute environments, job queues, and job definitions.

Modified files (10):
- infrastructure/terraform/batch.tf: Added Batch resources
- infrastructure/terraform/data.tf: Added data sources for VPC and subnets
- infrastructure/terraform/dynamodb.tf: Created DynamoDB tables for datasets and training jobs
- infrastructure/terraform/ecr.tf: Added ECR repository for training container
- infrastructure/terraform/iam.tf: Defined IAM roles and policies for Lambda and Batch
- infrastructure/terraform/lambda.tf: Configured Lambda function and CloudWatch logs
- infrastructure/terraform/main.tf: Set up Terraform backend and provider
- infrastructure/terraform/outputs.tf: Defined outputs for infrastructure
- infrastructure/terraform/prod.tfvars: Added production variables
- infrastructure/terraform/s3.tf: Created S3 buckets for datasets, models, and reports

Cost impact: Utilizing Fargate Spot can reduce training costs by up to 70% compared to on-demand pricing.
…d training status page

Added a new file upload component for CSV files with validation, along with a training status page that displays job details and progress. This enhances user experience by providing real-time feedback on training jobs.

Modified files (5):
- frontend/components/FileUpload.tsx: New component for file uploads
- frontend/app/training/[jobId]/page.tsx: New training status page
- frontend/lib/api.ts: API client for upload and job details
- frontend/lib/utils.ts: Utility functions for formatting and validation
- frontend/styles/globals.css: Added styles for new components

This change allows users to upload datasets and monitor the training process seamlessly.
This change introduces a complete training pipeline for AutoML using FLAML. It includes data preprocessing, exploratory data analysis (EDA), model training, and saving the trained model to S3. The pipeline is designed to be executed as an AWS Batch job, allowing for scalable training on large datasets.

Modified files (10):
- backend/training/train.py: Main training script with job handling
- backend/training/preprocessor.py: Automatic data preprocessing logic
- backend/training/eda.py: EDA report generation using Sweetviz
- backend/training/model_trainer.py: FLAML model training integration
- backend/api/services/s3_service.py: S3 operations for model and report storage
- backend/api/utils/helpers.py: Configuration settings for AWS resources
- backend/training/Dockerfile: Dockerfile for training container setup
- backend/training/requirements.txt: Updated dependencies for training
- backend/training/requirements.txt: Added necessary libraries for EDA and model training
- backend/training/requirements.txt: Included FLAML for AutoML capabilities

This implementation enhances the training capabilities of the system, allowing for automated model selection and hyperparameter tuning.
Introduced a Dockerfile to streamline the CI/CD process for building the Lambda deployment package. This change enhances the deployment efficiency and ensures that all necessary dependencies are included in the package.

Modified files (1):
- infrastructure/terraform/scripts/Dockerfile.lambda: New file for Lambda package creation
This change introduces multiple workflows for managing the AWS infrastructure using Terraform. The following workflows have been added:

- CI Terraform: Validates Terraform syntax and formatting on  pull requests and pushes.
- Deploy Infrastructure: Automates the deployment of infrastructure changes with manual dispatch options.
- Deploy Lambda API: Handles the deployment of the Lambda API with environment selection.
- Deploy Training Container: Manages the building and pushing of the training container to ECR.
- Destroy Environment: Provides a manual workflow to safely destroy environments with confirmation checks.

These workflows enhance the CI/CD process, ensuring that infrastructure changes are validated and deployed efficiently.

Modified files (6):
- .github/workflows/ci-terraform.yml
- .github/workflows/deploy-infrastructure.yml
- .github/workflows/deploy-lambda-api.yml
- .github/workflows/deploy-training-container.yml
- .github/workflows/destroy-environment.yml
Introduced a PowerShell script to automate the setup of the Terraform S3 backend with DynamoDB state locking. This script creates the necessary AWS resources, including S3 bucket, DynamoDB table, and configures versioning, encryption, and access policies.

Modified files (1):
- tools/setup-backend.ps1: New script for backend setup

This enhancement simplifies the initial configuration process for users, ensuring a consistent and efficient setup for managing Terraform state.
Added a comprehensive guide for generating conventional commit messages specific to the AWS AutoML Lite project. This includes commit structure, types, scopes, body guidelines, and common mistakes to avoid. Also configured VSCode to reference this guide for GitHub Copilot.

Modified files (2):
- .github/git-commit-messages-instructions.md: New commit message guidelines
- .vscode/settings.json: Reference to commit message guidelines for Copilot
This commit introduces a comprehensive quick start guide for deploying the AWS AutoML Lite platform using Terraform. The guide includes prerequisites, step-by-step instructions for configuring the AWS CLI, deploying infrastructure, building and pushing the training container, and testing the deployment.

Modified files (2):
- docs/QUICKSTART.md: New file with deployment instructions
- docs/README.md: Updated index to include quick start guide

The quick start guide aims to streamline the onboarding process for new users and provide clear, actionable steps for setting up the AutoML Lite platform.
Modified files (1):
- README.md: Updated prerequisites and corrected links to documentation

Changes include specifying Terraform version and enhancing documentation links for better navigation and clarity in setup instructions.
Modified workflows to ignore Markdown and text files in Terraform and backend directories, ensuring cleaner CI runs without unnecessary file changes triggering validations.

Modified files (4):
- .github/workflows/ci-terraform.yml
- .github/workflows/deploy-infrastructure.yml
- .github/workflows/deploy-lambda-api.yml
- .github/workflows/deploy-training-container.yml
TERRAFORM IMPROVEMENTS:
- Add variable validation (environment, lambda_memory, lambda_timeout, aws_region)
- Mark sensitive outputs (lambda_arn, batch_job_definition)
- Apply terraform fmt to all files
- Document folder structure best practices

WORKFLOW FIXES:
- Fix workspace creation in deploy-lambda-api.yml
- Fix workspace creation in deploy-training-container.yml
- Both now use: terraform workspace select \$ENV || terraform workspace new \$ENV

DOCUMENTATION:
- Add comprehensive TERRAFORM_BEST_PRACTICES.md
- Document folder structure (matches 90% of AVM standard)
- Include Microsoft Learn references
- Add priority recommendations

Validation: terraform validate
Based on: Microsoft Learn Terraform Best Practices + AVM Standards
…er consistency

CRITICAL FIXES VERIFIED:
terraform_wrapper: false in ALL terraform setup steps (4/4 workflows)
   - deploy-infrastructure plan job (FOUND MISSING during verification, FIXED)
   - deploy-infrastructure deploy job (already correct)
   - deploy-lambda-api (already correct)
   - deploy-training-container (already correct)
   - ci-terraform (not needed - validation only)

Artifact paths corrected (deploy-infrastructure)
   - Upload: tfplan-`$ENV (relative to working-directory)
   - Download: . (current directory)

Infrastructure existence checks (2 workflows)
   - deploy-lambda-api: warns if infrastructure missing
   - deploy-training-container: fails if infrastructure missing
   - Both use: terraform state list | grep -q <resource>

PRODUCTION ERRORS FIXED:
1. Artifact not found → Path mismatch corrected
2. Invalid format 'Warning: No outputs found' → Wrapper disabled everywhere
3. Empty workspace detection → State list check before output

TERRAFORM BEST PRACTICES:
- Sensitive outputs marked (lambda_arn, batch_job_definition)
- Variable validation (environment, lambda_memory, timeout, region)
- Terraform fmt applied to all .tf files
- Folder structure documented (90% AVM compliance)

VERIFICATION COMPLETED:
- All 4 workflows have terraform_wrapper: false where needed
- AWS_ROLE_ARN properly configured (6 references)
- Gitignore excludes terraform files
- All outputs use -raw flag (5 instances)
- Infrastructure checks properly implemented

EXECUTION ORDER (CRITICAL):
1. deploy-infrastructure (FIRST - creates all resources)
2. deploy-lambda-api (optional - Lambda code updates only)
3. deploy-training-container (optional - ECR image updates only)

Fixes: Artifact paths, terraform wrapper consistency, empty workspace handling
Validation: terraform validate, all verification checks, 100% complete
Updated the deployment workflows to enhance the checks for existing infrastructure by directly attempting to retrieve outputs. This change ensures more reliable detection of deployed resources and provides clearer feedback in the GitHub Actions summary.

Modified files (2):
- .github/workflows/deploy-lambda-api.yml: Improved API URL check
- .github/workflows/deploy-training-container.yml: Enhanced ECR URL check
Improved checks for API and ECR URLs in deployment workflows to ensure valid outputs without warnings or errors. This change prevents potential issues during deployment by ensuring that the infrastructure is correctly set up before proceeding.

Modified files (2):
- .github/workflows/deploy-lambda-api.yml: Enhanced API URL validation
- .github/workflows/deploy-training-container.yml: Enhanced ECR URL validation
Updated workflows to improve Lambda package building and Terraform workspace handling. Added checks for workspace existence and resource deployment status to prevent errors during execution.

Modified files (3):
- .github/workflows/deploy-infrastructure.yml: Added Lambda package build step
- .github/workflows/deploy-lambda-api.yml: Improved workspace selection logic
- .github/workflows/deploy-training-container.yml: Enhanced resource checks
- Add __init__.py to backend/api/ and all subdirectories
- Fix .gitignore to not ignore backend/api/models/
- Add models/schemas.py that was previously ignored
- Add dummy lambda zip creation to CI terraform workflow

This fixes the Lambda import error: 'No module named api.models'
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a comprehensive AWS Batch infrastructure and AutoML training pipeline for the AWS AutoML Lite project. It establishes the complete serverless ML platform using Terraform, including API endpoints (Lambda + API Gateway), training infrastructure (AWS Batch + Fargate), data storage (S3 + DynamoDB), and a Next.js frontend. The implementation follows AWS best practices with OIDC authentication for CI/CD, comprehensive documentation, and environment-specific configurations.

Key Changes:

  • Complete Terraform infrastructure (44+ AWS resources) with S3 backend and DynamoDB state locking
  • Training pipeline using AWS Batch + Fargate Spot with FLAML AutoML
  • FastAPI Lambda API with presigned S3 URLs and DynamoDB integration
  • Next.js 14 frontend with TypeScript and TailwindCSS
  • CI/CD workflows with GitHub Actions and OIDC authentication
  • Comprehensive documentation and operational tools

Reviewed changes

Copilot reviewed 75 out of 79 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
infrastructure/terraform/*.tf Complete IaC defining 44 AWS resources including Lambda, API Gateway, Batch, S3, DynamoDB, IAM roles
backend/api/ FastAPI application with Mangum for Lambda deployment, S3/DynamoDB/Batch service integrations
backend/training/ Training container with FLAML, preprocessing, EDA generation, and model persistence
frontend/app/ Next.js 14 App Router pages for upload, configuration, training status, and results
tools/*.ps1 PowerShell scripts for Terraform backend setup and resource verification
docs/ Comprehensive documentation including quickstart, reference, and architecture decisions

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +51 to +61
variable "batch_vcpu" {
description = "Batch job vCPU"
type = string
default = "2"
}

variable "batch_memory" {
description = "Batch job memory in MB"
type = string
default = "4096"
}
Copy link

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

batch_vcpu and batch_memory are defined as strings but represent numeric resource specifications. Consider changing to type = number for better type safety and validation, since AWS Batch expects numeric values. This would also allow for validation constraints (e.g., min/max values).

Copilot uses AI. Check for mistakes.
Comment on lines +48 to +49
if df[col].isnull().any():
df[col].fillna(df[col].mode()[0] if not df[col].mode().empty else 'Unknown', inplace=True)
Copy link

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The nested ternary operator for categorical missing value imputation is hard to read. Consider extracting this logic into a clearer multi-line statement: mode_value = df[col].mode()[0] if not df[col].mode().empty else 'Unknown' followed by df[col].fillna(mode_value, inplace=True).

Copilot uses AI. Check for mistakes.
dataset_id=request.dataset_id,
target_column=request.target_column,
job_id=job_id,
config=request.config.model_dump()
Copy link

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential AttributeError if request.config is None. The TrainRequest schema defines config as Optional[TrainingConfig] = TrainingConfig(), which provides a default, but defensive coding would add a null check or use request.config.model_dump() if request.config else {}.

Copilot uses AI. Check for mistakes.
Comment on lines +43 to +45
Test-Resource "API Health Endpoint" `
"curl -s https://sirpi54231.execute-api.us-east-1.amazonaws.com/dev/health" `
"healthy"
Copy link

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded API Gateway ID (sirpi54231) and AWS account ID (835503570883) are exposed in multiple files. These should be replaced with environment variables or Terraform outputs to avoid exposing infrastructure details in version control.

Copilot uses AI. Check for mistakes.
// API Client for AWS AutoML Lite
// Centralized API calls to backend

const API_URL = process.env.NEXT_PUBLIC_API_URL || 'http://localhost:8000';
Copy link

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fallback to localhost in production builds could mask configuration errors. Consider throwing an error in production environments when NEXT_PUBLIC_API_URL is not set, or at minimum log a warning to help detect misconfiguration.

Copilot uses AI. Check for mistakes.
import boto3
import pandas as pd
from datetime import datetime
from io import StringIO
Copy link

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'StringIO' is not used.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,135 @@
from fastapi import APIRouter, HTTPException, status, Query
Copy link

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'Query' is not used.

Suggested change
from fastapi import APIRouter, HTTPException, status, Query
from fastapi import APIRouter, HTTPException, status

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,135 @@
from fastapi import APIRouter, HTTPException, status, Query
from datetime import datetime
from typing import Optional
Copy link

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'Optional' is not used.

Suggested change
from typing import Optional

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,42 @@
from fastapi import APIRouter, HTTPException, status
from datetime import datetime
Copy link

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'datetime' is not used.

Suggested change
from datetime import datetime

Copilot uses AI. Check for mistakes.
import uuid
from ..models.schemas import UploadRequest, UploadResponse
from ..services.s3_service import s3_service
from ..services.dynamo_service import dynamodb_service
Copy link

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'dynamodb_service' is not used.

Suggested change
from ..services.dynamo_service import dynamodb_service

Copilot uses AI. Check for mistakes.
Introduced a new datasets router for managing dataset uploads and metadata.
This includes endpoints for confirming uploads and retrieving dataset metadata from DynamoDB. The changes enhance the API's capability to handle datasets effectively, allowing for better integration with the training process.

Modified files (6):
- backend/api/routers/datasets.py: New router for dataset operations
- backend/api/models/schemas.py: Added DatasetMetadata schema
- backend/api/routers/models.py: Updated to include dataset_id in job response
- backend/api/services/dynamo_service.py: Added methods for dataset metadata
- backend/api/services/s3_service.py: New methods for S3 object management
- backend/api/utils/helpers.py: Updated settings for new configurations
Modified files (3):
- infrastructure/terraform/batch.tf: Convert vCPU and memory to number type
- infrastructure/terraform/iam.tf: Add S3 ListBucket permission for batch job role
- infrastructure/terraform/variables.tf: Change batch_vcpu and batch_memory types to number

These changes ensure proper type handling for batch job configurations and enhance IAM permissions for accessing S3 resources.
Improved the heuristic for determining problem type based on target column data characteristics. Added checks for unique values in numeric columns to better classify them as regression or classification tasks. This change
aims to increase the accuracy of model training by ensuring the correct problem type is identified.

Modified files (2):
- backend/api/routers/training.py: Updated training job logic
- backend/training/model_trainer.py: Adjusted metric selection for multiclass classification
@cristofima cristofima added the enhancement New feature or request label Nov 30, 2025
@cristofima cristofima merged commit b4b7b85 into main Nov 30, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants