feat: implement AWS Batch infrastructure and AutoML training pipeline by cristofima · Pull Request #1 · cristofima/AWS-AutoML-Lite

cristofima · 2025-11-28T23:59:53Z

This PR implements a comprehensive AWS Batch infrastructure and AutoML training pipeline for the AWS AutoML Lite project. It establishes the complete serverless ML platform using Terraform, including API endpoints (Lambda + API Gateway), training infrastructure (AWS Batch + Fargate), data storage (S3 + DynamoDB), and a Next.js frontend. The implementation follows AWS best practices with OIDC authentication for CI/CD, comprehensive documentation, and environment-specific configurations.

Key Changes:

Complete Terraform infrastructure (44+ AWS resources) with S3 backend and DynamoDB state locking
Training pipeline using AWS Batch + Fargate Spot with FLAML AutoML
FastAPI Lambda API with presigned S3 URLs and DynamoDB integration
Next.js 16 frontend with TypeScript and TailwindCSS
CI/CD workflows with GitHub Actions and OIDC authentication
Comprehensive documentation and operational tools

Implemented AWS Batch with Fargate Spot compute for cost-effective AutoML training. This includes the creation of security groups, compute environments, job queues, and job definitions. Modified files (10): - infrastructure/terraform/batch.tf: Added Batch resources - infrastructure/terraform/data.tf: Added data sources for VPC and subnets - infrastructure/terraform/dynamodb.tf: Created DynamoDB tables for datasets and training jobs - infrastructure/terraform/ecr.tf: Added ECR repository for training container - infrastructure/terraform/iam.tf: Defined IAM roles and policies for Lambda and Batch - infrastructure/terraform/lambda.tf: Configured Lambda function and CloudWatch logs - infrastructure/terraform/main.tf: Set up Terraform backend and provider - infrastructure/terraform/outputs.tf: Defined outputs for infrastructure - infrastructure/terraform/prod.tfvars: Added production variables - infrastructure/terraform/s3.tf: Created S3 buckets for datasets, models, and reports Cost impact: Utilizing Fargate Spot can reduce training costs by up to 70% compared to on-demand pricing.

…d training status page Added a new file upload component for CSV files with validation, along with a training status page that displays job details and progress. This enhances user experience by providing real-time feedback on training jobs. Modified files (5): - frontend/components/FileUpload.tsx: New component for file uploads - frontend/app/training/[jobId]/page.tsx: New training status page - frontend/lib/api.ts: API client for upload and job details - frontend/lib/utils.ts: Utility functions for formatting and validation - frontend/styles/globals.css: Added styles for new components This change allows users to upload datasets and monitor the training process seamlessly.

This change introduces a complete training pipeline for AutoML using FLAML. It includes data preprocessing, exploratory data analysis (EDA), model training, and saving the trained model to S3. The pipeline is designed to be executed as an AWS Batch job, allowing for scalable training on large datasets. Modified files (10): - backend/training/train.py: Main training script with job handling - backend/training/preprocessor.py: Automatic data preprocessing logic - backend/training/eda.py: EDA report generation using Sweetviz - backend/training/model_trainer.py: FLAML model training integration - backend/api/services/s3_service.py: S3 operations for model and report storage - backend/api/utils/helpers.py: Configuration settings for AWS resources - backend/training/Dockerfile: Dockerfile for training container setup - backend/training/requirements.txt: Updated dependencies for training - backend/training/requirements.txt: Added necessary libraries for EDA and model training - backend/training/requirements.txt: Included FLAML for AutoML capabilities This implementation enhances the training capabilities of the system, allowing for automated model selection and hyperparameter tuning.

Introduced a Dockerfile to streamline the CI/CD process for building the Lambda deployment package. This change enhances the deployment efficiency and ensures that all necessary dependencies are included in the package. Modified files (1): - infrastructure/terraform/scripts/Dockerfile.lambda: New file for Lambda package creation

This change introduces multiple workflows for managing the AWS infrastructure using Terraform. The following workflows have been added: - CI Terraform: Validates Terraform syntax and formatting on pull requests and pushes. - Deploy Infrastructure: Automates the deployment of infrastructure changes with manual dispatch options. - Deploy Lambda API: Handles the deployment of the Lambda API with environment selection. - Deploy Training Container: Manages the building and pushing of the training container to ECR. - Destroy Environment: Provides a manual workflow to safely destroy environments with confirmation checks. These workflows enhance the CI/CD process, ensuring that infrastructure changes are validated and deployed efficiently. Modified files (6): - .github/workflows/ci-terraform.yml - .github/workflows/deploy-infrastructure.yml - .github/workflows/deploy-lambda-api.yml - .github/workflows/deploy-training-container.yml - .github/workflows/destroy-environment.yml

Introduced a PowerShell script to automate the setup of the Terraform S3 backend with DynamoDB state locking. This script creates the necessary AWS resources, including S3 bucket, DynamoDB table, and configures versioning, encryption, and access policies. Modified files (1): - tools/setup-backend.ps1: New script for backend setup This enhancement simplifies the initial configuration process for users, ensuring a consistent and efficient setup for managing Terraform state.

Added a comprehensive guide for generating conventional commit messages specific to the AWS AutoML Lite project. This includes commit structure, types, scopes, body guidelines, and common mistakes to avoid. Also configured VSCode to reference this guide for GitHub Copilot. Modified files (2): - .github/git-commit-messages-instructions.md: New commit message guidelines - .vscode/settings.json: Reference to commit message guidelines for Copilot

This commit introduces a comprehensive quick start guide for deploying the AWS AutoML Lite platform using Terraform. The guide includes prerequisites, step-by-step instructions for configuring the AWS CLI, deploying infrastructure, building and pushing the training container, and testing the deployment. Modified files (2): - docs/QUICKSTART.md: New file with deployment instructions - docs/README.md: Updated index to include quick start guide The quick start guide aims to streamline the onboarding process for new users and provide clear, actionable steps for setting up the AutoML Lite platform.

Modified files (1): - README.md: Updated prerequisites and corrected links to documentation Changes include specifying Terraform version and enhancing documentation links for better navigation and clarity in setup instructions.

Modified workflows to ignore Markdown and text files in Terraform and backend directories, ensuring cleaner CI runs without unnecessary file changes triggering validations. Modified files (4): - .github/workflows/ci-terraform.yml - .github/workflows/deploy-infrastructure.yml - .github/workflows/deploy-lambda-api.yml - .github/workflows/deploy-training-container.yml

TERRAFORM IMPROVEMENTS: - Add variable validation (environment, lambda_memory, lambda_timeout, aws_region) - Mark sensitive outputs (lambda_arn, batch_job_definition) - Apply terraform fmt to all files - Document folder structure best practices WORKFLOW FIXES: - Fix workspace creation in deploy-lambda-api.yml - Fix workspace creation in deploy-training-container.yml - Both now use: terraform workspace select \$ENV || terraform workspace new \$ENV DOCUMENTATION: - Add comprehensive TERRAFORM_BEST_PRACTICES.md - Document folder structure (matches 90% of AVM standard) - Include Microsoft Learn references - Add priority recommendations Validation: terraform validate Based on: Microsoft Learn Terraform Best Practices + AVM Standards

…er consistency CRITICAL FIXES VERIFIED: terraform_wrapper: false in ALL terraform setup steps (4/4 workflows) - deploy-infrastructure plan job (FOUND MISSING during verification, FIXED) - deploy-infrastructure deploy job (already correct) - deploy-lambda-api (already correct) - deploy-training-container (already correct) - ci-terraform (not needed - validation only) Artifact paths corrected (deploy-infrastructure) - Upload: tfplan-`$ENV (relative to working-directory) - Download: . (current directory) Infrastructure existence checks (2 workflows) - deploy-lambda-api: warns if infrastructure missing - deploy-training-container: fails if infrastructure missing - Both use: terraform state list | grep -q <resource> PRODUCTION ERRORS FIXED: 1. Artifact not found → Path mismatch corrected 2. Invalid format 'Warning: No outputs found' → Wrapper disabled everywhere 3. Empty workspace detection → State list check before output TERRAFORM BEST PRACTICES: - Sensitive outputs marked (lambda_arn, batch_job_definition) - Variable validation (environment, lambda_memory, timeout, region) - Terraform fmt applied to all .tf files - Folder structure documented (90% AVM compliance) VERIFICATION COMPLETED: - All 4 workflows have terraform_wrapper: false where needed - AWS_ROLE_ARN properly configured (6 references) - Gitignore excludes terraform files - All outputs use -raw flag (5 instances) - Infrastructure checks properly implemented EXECUTION ORDER (CRITICAL): 1. deploy-infrastructure (FIRST - creates all resources) 2. deploy-lambda-api (optional - Lambda code updates only) 3. deploy-training-container (optional - ECR image updates only) Fixes: Artifact paths, terraform wrapper consistency, empty workspace handling Validation: terraform validate, all verification checks, 100% complete

Updated the deployment workflows to enhance the checks for existing infrastructure by directly attempting to retrieve outputs. This change ensures more reliable detection of deployed resources and provides clearer feedback in the GitHub Actions summary. Modified files (2): - .github/workflows/deploy-lambda-api.yml: Improved API URL check - .github/workflows/deploy-training-container.yml: Enhanced ECR URL check

Improved checks for API and ECR URLs in deployment workflows to ensure valid outputs without warnings or errors. This change prevents potential issues during deployment by ensuring that the infrastructure is correctly set up before proceeding. Modified files (2): - .github/workflows/deploy-lambda-api.yml: Enhanced API URL validation - .github/workflows/deploy-training-container.yml: Enhanced ECR URL validation

Updated workflows to improve Lambda package building and Terraform workspace handling. Added checks for workspace existence and resource deployment status to prevent errors during execution. Modified files (3): - .github/workflows/deploy-infrastructure.yml: Added Lambda package build step - .github/workflows/deploy-lambda-api.yml: Improved workspace selection logic - .github/workflows/deploy-training-container.yml: Enhanced resource checks

- Add __init__.py to backend/api/ and all subdirectories - Fix .gitignore to not ignore backend/api/models/ - Add models/schemas.py that was previously ignored - Add dummy lambda zip creation to CI terraform workflow This fixes the Lambda import error: 'No module named api.models'

Copilot

Pull request overview

This PR implements a comprehensive AWS Batch infrastructure and AutoML training pipeline for the AWS AutoML Lite project. It establishes the complete serverless ML platform using Terraform, including API endpoints (Lambda + API Gateway), training infrastructure (AWS Batch + Fargate), data storage (S3 + DynamoDB), and a Next.js frontend. The implementation follows AWS best practices with OIDC authentication for CI/CD, comprehensive documentation, and environment-specific configurations.

Key Changes:

Complete Terraform infrastructure (44+ AWS resources) with S3 backend and DynamoDB state locking
Training pipeline using AWS Batch + Fargate Spot with FLAML AutoML
FastAPI Lambda API with presigned S3 URLs and DynamoDB integration
Next.js 14 frontend with TypeScript and TailwindCSS
CI/CD workflows with GitHub Actions and OIDC authentication
Comprehensive documentation and operational tools

Reviewed changes

Copilot reviewed 75 out of 79 changed files in this pull request and generated 15 comments.

Show a summary per file

File	Description
infrastructure/terraform/*.tf	Complete IaC defining 44 AWS resources including Lambda, API Gateway, Batch, S3, DynamoDB, IAM roles
backend/api/	FastAPI application with Mangum for Lambda deployment, S3/DynamoDB/Batch service integrations
backend/training/	Training container with FLAML, preprocessing, EDA generation, and model persistence
frontend/app/	Next.js 14 App Router pages for upload, configuration, training status, and results
tools/*.ps1	PowerShell scripts for Terraform backend setup and resource verification
docs/	Comprehensive documentation including quickstart, reference, and architecture decisions

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-11-29T00:02:34Z

infrastructure/terraform/variables.tf

+variable "batch_vcpu" {
+  description = "Batch job vCPU"
+  type        = string
+  default     = "2"
+}
+
+variable "batch_memory" {
+  description = "Batch job memory in MB"
+  type        = string
+  default     = "4096"
+}


batch_vcpu and batch_memory are defined as strings but represent numeric resource specifications. Consider changing to type = number for better type safety and validation, since AWS Batch expects numeric values. This would also allow for validation constraints (e.g., min/max values).

Copilot · 2025-11-29T00:02:34Z

backend/training/preprocessor.py

+            if df[col].isnull().any():
+                df[col].fillna(df[col].mode()[0] if not df[col].mode().empty else 'Unknown', inplace=True)


[nitpick] The nested ternary operator for categorical missing value imputation is hard to read. Consider extracting this logic into a clearer multi-line statement: mode_value = df[col].mode()[0] if not df[col].mode().empty else 'Unknown' followed by df[col].fillna(mode_value, inplace=True).

Copilot · 2025-11-29T00:02:34Z

backend/api/routers/training.py

+            dataset_id=request.dataset_id,
+            target_column=request.target_column,
+            job_id=job_id,
+            config=request.config.model_dump()


Potential AttributeError if request.config is None. The TrainRequest schema defines config as Optional[TrainingConfig] = TrainingConfig(), which provides a default, but defensive coding would add a null check or use request.config.model_dump() if request.config else {}.

Copilot · 2025-11-29T00:02:35Z

tools/verify-resources.ps1

+Test-Resource "API Health Endpoint" `
+    "curl -s https://sirpi54231.execute-api.us-east-1.amazonaws.com/dev/health" `
+    "healthy"


Hardcoded API Gateway ID (sirpi54231) and AWS account ID (835503570883) are exposed in multiple files. These should be replaced with environment variables or Terraform outputs to avoid exposing infrastructure details in version control.

Copilot · 2025-11-29T00:02:35Z

frontend/lib/api.ts

+// API Client for AWS AutoML Lite
+// Centralized API calls to backend
+
+const API_URL = process.env.NEXT_PUBLIC_API_URL || 'http://localhost:8000';


Fallback to localhost in production builds could mask configuration errors. Consider throwing an error in production environments when NEXT_PUBLIC_API_URL is not set, or at minimum log a warning to help detect misconfiguration.

Copilot · 2025-11-29T00:02:37Z

backend/training/train.py

+import boto3
+import pandas as pd
+from datetime import datetime
+from io import StringIO


Import of 'StringIO' is not used.

Copilot · 2025-11-29T00:02:37Z

backend/api/routers/training.py

@@ -0,0 +1,135 @@
+from fastapi import APIRouter, HTTPException, status, Query


Import of 'Query' is not used.

Suggested change

from fastapi import APIRouter, HTTPException, status, Query

from fastapi import APIRouter, HTTPException, status

Copilot · 2025-11-29T00:02:37Z

backend/api/routers/training.py

@@ -0,0 +1,135 @@
+from fastapi import APIRouter, HTTPException, status, Query
+from datetime import datetime
+from typing import Optional


Import of 'Optional' is not used.

Suggested change

from typing import Optional

Copilot · 2025-11-29T00:02:37Z

backend/api/routers/upload.py

@@ -0,0 +1,42 @@
+from fastapi import APIRouter, HTTPException, status
+from datetime import datetime


Import of 'datetime' is not used.

Suggested change

from datetime import datetime

Copilot · 2025-11-29T00:02:38Z

backend/api/routers/upload.py

+import uuid
+from ..models.schemas import UploadRequest, UploadResponse
+from ..services.s3_service import s3_service
+from ..services.dynamo_service import dynamodb_service


Import of 'dynamodb_service' is not used.

Suggested change

from ..services.dynamo_service import dynamodb_service

Introduced a new datasets router for managing dataset uploads and metadata. This includes endpoints for confirming uploads and retrieving dataset metadata from DynamoDB. The changes enhance the API's capability to handle datasets effectively, allowing for better integration with the training process. Modified files (6): - backend/api/routers/datasets.py: New router for dataset operations - backend/api/models/schemas.py: Added DatasetMetadata schema - backend/api/routers/models.py: Updated to include dataset_id in job response - backend/api/services/dynamo_service.py: Added methods for dataset metadata - backend/api/services/s3_service.py: New methods for S3 object management - backend/api/utils/helpers.py: Updated settings for new configurations

Modified files (3): - infrastructure/terraform/batch.tf: Convert vCPU and memory to number type - infrastructure/terraform/iam.tf: Add S3 ListBucket permission for batch job role - infrastructure/terraform/variables.tf: Change batch_vcpu and batch_memory types to number These changes ensure proper type handling for batch job configurations and enhance IAM permissions for accessing S3 resources.

Improved the heuristic for determining problem type based on target column data characteristics. Added checks for unique values in numeric columns to better classify them as regression or classification tasks. This change aims to increase the accuracy of model training by ensuring the correct problem type is identified. Modified files (2): - backend/api/routers/training.py: Updated training job logic - backend/training/model_trainer.py: Adjusted metric selection for multiclass classification

cristofima added 26 commits November 28, 2025 08:42

docs: add MIT License file

88f5f77

docs: add CODEOWNERS file for review requirements

71e558c

docs: add architecture decisions for containers vs Lambda

0142c84

docs: update README with Terraform prerequisites and links

cf2880b

Modified files (1): - README.md: Updated prerequisites and corrected links to documentation Changes include specifying Terraform version and enhancing documentation links for better navigation and clarity in setup instructions.

chore: update .gitignore to exclude .vscode directory

8235dfb

docs: update commit message guidelines to remove issue references

3b61f72

ci: add Terraform setup step to deployment workflows

8d49b1e

build(terraform): add dev.tfvars for environment configuration

48f8d9f

ci: capture terraform output exit code correctly in deploy-lambda-api

2c3cff7

ci: update Terraform version to 1.9.8 for state compatibility

86fb3af

ci: add workflow file paths to triggers for self-activation

b062d46

cristofima requested a review from Copilot November 28, 2025 23:59

Copilot started reviewing on behalf of cristofima November 29, 2025 00:00 View session

Copilot finished reviewing on behalf of cristofima November 29, 2025 00:01

Copilot AI reviewed Nov 29, 2025

View reviewed changes

cristofima added 3 commits November 29, 2025 11:18

cristofima added the enhancement New feature or request label Nov 30, 2025

cristofima merged commit b4b7b85 into main Nov 30, 2025
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: implement AWS Batch infrastructure and AutoML training pipeline#1

feat: implement AWS Batch infrastructure and AutoML training pipeline#1
cristofima merged 29 commits intomainfrom
dev

cristofima commented Nov 28, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Nov 29, 2025

Uh oh!

Copilot AI Nov 29, 2025

Uh oh!

Copilot AI Nov 29, 2025

Uh oh!

Copilot AI Nov 29, 2025

Uh oh!

Copilot AI Nov 29, 2025

Uh oh!

Copilot AI Nov 29, 2025

Uh oh!

Copilot AI Nov 29, 2025

Uh oh!

Copilot AI Nov 29, 2025

Uh oh!

Copilot AI Nov 29, 2025

Uh oh!

Copilot AI Nov 29, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		if df[col].isnull().any():
		df[col].fillna(df[col].mode()[0] if not df[col].mode().empty else 'Unknown', inplace=True)

		@@ -0,0 +1,135 @@
		from fastapi import APIRouter, HTTPException, status, Query

		@@ -0,0 +1,42 @@
		from fastapi import APIRouter, HTTPException, status
		from datetime import datetime

Conversation

cristofima commented Nov 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 29, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cristofima commented Nov 28, 2025 •

edited

Loading