The Serverless Receipt Processor is an intelligent document processing pipeline that automates the tedious task of manual expense logging. By simply dropping a receipt image into an S3 bucket, the system leverages OCR and Machine Learning to extract key metadata—such as vendor name, date, and total amount—storing the results in a NoSQL database and notifying the user via email.
This project demonstrates a fully automated CI/CD infrastructure approach where every component (S3, Lambda, DynamoDB, SES, IAM, and CloudWatch) is provisioned dynamically using Terraform, ensuring zero manual configuration in the AWS Console.
- Python 3.13: The latest stable Lambda runtime utilizing Boto3 for AWS SDK integrations.
- Terraform: Used for Infrastructure as Code (IaC) with dynamic resource linking and circular-dependency protection.
- AWS SES: Transactional email service for instant processing summaries.
- Amazon Textract (AnalyzeExpense): Specialized ML models to extract structured receipt data without manual templates.
- Amazon DynamoDB: Scalable NoSQL storage for structured receipt metadata.
- Personal Expense Tracking: Automatically log grocery and retail receipts into a digital ledger.
- Automated Bookkeeping: Small business owners can bulk-upload receipts to generate monthly expense reports.
- Tax Compliance: Maintain a searchable, permanent database of all business-related expenditures.
- Trigger: User uploads an image/PDF to the `incoming/` prefix in S3.
- Processing: S3 event notification triggers the Python 3.13 Lambda.
- Analysis: Lambda sends the document to Amazon Textract for specialized expense extraction.
- Storage: Extracted vendor, date, and total amount are saved into DynamoDB with a unique UUID.
- Notification: Amazon SES sends a summary email to the verified administrator address.
- Monitoring: CloudWatch Logs (managed by Terraform) track every execution step.
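The extraction step in the flow above can be sketched in Python. This is a minimal illustration, not the project's actual `lambda_function.py`: the helper name and sample payload are hypothetical, while the field types (`VENDOR_NAME`, `INVOICE_RECEIPT_DATE`, `TOTAL`) follow Textract's `AnalyzeExpense` summary-field taxonomy.

```python
# Hypothetical helper sketching how the Lambda might reduce a Textract
# AnalyzeExpense response to the fields stored in DynamoDB.
import uuid


def parse_expense_summary(response: dict) -> dict:
    """Pull vendor, date, and total out of an AnalyzeExpense response."""
    wanted = {
        "VENDOR_NAME": "vendor",
        "INVOICE_RECEIPT_DATE": "date",
        "TOTAL": "total",
    }
    item = {"receipt_id": str(uuid.uuid4())}  # unique UUID per receipt
    for doc in response.get("ExpenseDocuments", []):
        for field in doc.get("SummaryFields", []):
            field_type = field.get("Type", {}).get("Text")
            if field_type in wanted:
                item[wanted[field_type]] = field.get("ValueDetection", {}).get("Text")
    return item


# Trimmed-down, AnalyzeExpense-shaped sample response:
sample = {
    "ExpenseDocuments": [{
        "SummaryFields": [
            {"Type": {"Text": "VENDOR_NAME"}, "ValueDetection": {"Text": "ACME Mart"}},
            {"Type": {"Text": "TOTAL"}, "ValueDetection": {"Text": "12.50"}},
        ]
    }]
}
print(parse_expense_summary(sample)["vendor"])  # ACME Mart
```

In the real pipeline the response would come from `boto3`'s `textract.analyze_expense(...)` call and the resulting item would be written to DynamoDB before the SES summary email is sent.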
```
AWS-TERRAFORM-RECEIPT-PROCESSOR/
├── .terraform/                  # Terraform managed internal directory
├── .github/workflows/
│   ├── cd.yml                   # Production GitHub Actions pipeline
│   ├── ci.yml                   # Integration GitHub Actions pipeline
│   └── documentation.yml        # Documentation GitHub Actions pipeline
├── modules/                     # Modularized infrastructure components
│   ├── database/                # DynamoDB resources
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   ├── lambda/                  # Lambda logic and IAM roles
│   │   ├── src/
│   │   │   ├── lambda_function.py   # Core Python processing logic
│   │   │   └── lambda_function.zip  # Generated deployment package
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   └── storage/                 # S3 bucket and lifecycle configurations
│       ├── main.tf
│       ├── outputs.tf
│       └── variables.tf
├── scripts/                     # CD bash scripts (health-check, etc.)
│   ├── health-check.sh
│   └── integration-test.sh
├── assets/                      # Project documentation assets (diagrams, images)
├── .gitignore                   # Files and folders excluded from Git
├── .terraform.lock.hcl          # Provider dependency lock file
├── .pre-commit-config.yaml      # Local git-hook orchestration
├── .tflint.hcl                  # TFLint AWS ruleset configuration
├── .checkov.yml                 # Checkov scan ignore list
├── .terraform-docs.yml          # terraform-docs config for the documentation workflow
├── main.tf                      # Root module: orchestrates the modules
├── outputs.tf                   # Aggregated outputs from modules
├── variables.tf                 # Global variables (Region, Email)
├── providers.tf                 # Terraform & provider requirements
├── terraform.tfstate            # Current state of deployed infrastructure
├── terraform.tfstate.backup     # Previous state version for recovery
└── README.md                    # Project documentation
```

This section is automatically updated with the latest infrastructure details.
Detailed Infrastructure Specifications
| Name | Version |
|---|---|
| terraform | >= 1.5.0 |
| aws | ~> 5.0 |
| random | ~> 3.0 |
| Name | Source | Version |
|---|---|---|
| database | ./modules/database | n/a |
| lambda | ./modules/lambda | n/a |
| storage | ./modules/storage | n/a |
No resources.
| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| aws_region | The AWS region to deploy resources in | string | "us-east-1" | no |
| lambda_name | Name for the Lambda function | string | "ReceiptProcessor" | no |
| user_email | The verified email for SES sending and receiving | string | n/a | yes |
| Name | Description |
|---|---|
| bucket_id | The ID of the S3 bucket created |
| dynamodb_table_name | The name of the DynamoDB table |
| lambda_function_name | The name of the Lambda function created |
| region | The AWS region being used |
- AWS CLI configured with Admin permissions.
- Terraform CLI (v1.5.0+) installed locally.
- Terraform Cloud account for remote state management.
- Set your AWS Region: Set `aws_region` in `variables.tf` to your desired region.
- Create a new Workspace with the GitHub version control workflow in Terraform Cloud.
- In the Variables tab, add the required Terraform Variables (for this project, `user_email`).
- Add the following Environment Variables (AWS Credentials):
  - `AWS_ACCESS_KEY_ID`
  - `AWS_SECRET_ACCESS_KEY`
- Run the command in the Terraform CLI:
terraform login
- Create a token and follow the steps in the browser to complete the Terraform Cloud connection.
- Add the `backend` block inside the `terraform` block:

```hcl
backend "remote" {
  hostname     = "app.terraform.io"
  organization = "<your-organization-name>"

  workspaces {
    name = "<your-workspace-name>"
  }
}
```

- Run the command in the Terraform CLI to migrate the state into Terraform Cloud:
terraform init -migrate-state
- Clone the Repository:
git clone https://github.com/ShenLoong99/aws-terraform-receipt-processor-automation.git
- Provision Infrastructure:
Terraform Cloud → Initialize & Apply: Push your code to GitHub. Terraform Cloud will automatically detect the change, run a `plan`, and wait for your approval.

- Observe workflow:

GitHub (GitOps) → GitHub Actions: Observe the CI/CD workflow in the Actions tab in GitHub.
- Critical: Check your inbox for the "AWS Notification - Identity Verification" email and click the confirmation link.

This project uses a fully automated GitOps pipeline to ensure code quality and deployment reliability. The Pre-commit framework implements a "Shift-Left" strategy, ensuring that code is formatted, documented, and secure before it ever leaves your machine.
- Branch Protection Rulesets

  To ensure high code quality and prevent unauthorized changes to the production environment, the `main` branch is governed by a GitHub Branch Ruleset.
  - Pull Request Mandatory: No code can be pushed directly to `main`. All changes must originate from a feature branch and be merged via a Pull Request.
  - Required Status Checks: The `Infrastructure CI` (Terraform Plan & Static Analysis) must pass successfully before a merge is permitted.
  - Bypass Authority: The dedicated GitHub App is added to the Bypass List with "Always allow" permissions. This allows the bot to push documentation updates directly to `main` without being blocked by PR requirements.
- Pre-commit
  - Tool: Executes `terraform fmt`, `terraform validate`, `TFLint`, `terraform_docs` and `checkov` to ensure the code is clean.
  - Trigger: Runs on every `git commit`.
  - Outcome: If any check fails, the commit is blocked. You fix the error, re-add the file, and commit again.
- Continuous Integration (PR)
  - Tool: Executes `terraform fmt -check`, `terraform validate` and `checkov`, then runs a `plan` with cost estimation and prints the results on the PR.
  - Trigger: Runs on every Pull Request.
  - Outcome: This acts as the "Gatekeeper" before code is merged to `main`.
- Continuous Delivery (Deployment)
  - Tool: Terraform Cloud + GitHub Actions OIDC.
  - Trigger: Merges to the `main` branch.
  - Outcome: The pipeline verifies the infrastructure state and runs post-deployment health checks (`health-check.sh` & `smoke-test-website.sh`).
- Dynamic README documentation updates
  - Tool: `terraform_docs` + GitHub Actions.
  - Trigger: Merges to the `main` branch.
  - Outcome: The pipeline verifies the infrastructure state in Terraform Cloud, retrieves its outputs, and updates the README documentation dynamically.
- Repository Secret: `TF_API_TOKEN` is required for GitHub to communicate with Terraform Cloud.
- Trigger: A GitHub Actions OIDC role (`GitHubActionRole`) allows the runner to verify AWS resources without long-lived keys.
- Automated Documentation via GitHub App: Instead of using a Personal Access Token (PAT) or the default `GITHUB_TOKEN`, this project uses a custom GitHub App for automated tasks.

  | Secret | Description | Source |
  |---|---|---|
  | `BOT_APP_ID` | The unique numerical ID assigned to your GitHub App. | App Settings > General |
  | `BOT_PRIVATE_KEY` | The full content of the generated `.pem` private key file. | App Settings > Private keys |
- Upload a receipt file (JPG/PNG) using the AWS CLI to trigger the system:
aws s3 cp <your-receipt-image> s3://<your-bucket-name>/incoming/
- Verify Database: Check the DynamoDB console for a new entry.
- Verify Email: You will receive an email summary of the extracted data (check your inbox or spam folder).
- Verify Logs: Confirm each execution step in the CloudWatch log group `/aws/lambda/ReceiptProcessor` (managed by Terraform).
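The upload in the first step can also be done with `boto3` instead of the AWS CLI. A hedged sketch follows: the bucket name and file path are placeholders, and the `incoming_key` helper is an illustrative addition, not part of the project's code. Only the key-building logic runs without AWS credentials.

```python
# Sketch: upload a receipt to the incoming/ prefix the Lambda listens on.
# Bucket name and file path are placeholders.
from pathlib import Path

ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png", ".pdf"}


def incoming_key(local_path: str) -> str:
    """Build the S3 object key under incoming/, rejecting unsupported file types."""
    path = Path(local_path)
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported receipt type: {path.suffix}")
    return f"incoming/{path.name}"


def upload_receipt(bucket: str, local_path: str) -> str:
    """Upload a local receipt image to S3 (requires configured AWS credentials)."""
    import boto3  # deferred so the pure helper above works without AWS set up
    key = incoming_key(local_path)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key


# Example (needs real credentials and bucket):
# upload_receipt("<your-bucket-name>", "receipt.jpg")
print(incoming_key("receipt.jpg"))  # incoming/receipt.jpg
```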
- Python 3.13 Migration: Upgraded from 3.9 for longevity and performance.
- Auto-Naming: Used `random_id` to generate globally unique S3 bucket names.
- PDF Support: Enhance Textract logic to handle multi-page PDF documents.
- Web Dashboard: Build a React frontend to visualize receipts from DynamoDB.
| Challenge | Solution |
|---|---|
| Circular Dependencies | Resolved "Cycle" errors by using locals for function names instead of direct resource references in Log Groups. |
| Empty Bucket Deletion | Implemented force_destroy = true to allow Terraform to clean up S3 buckets even if they contain receipt images. |
| Silent Failures | Added explicit print() statements to Python logic to ensure visibility in CloudWatch Logs during Textract calls. |
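The "Silent Failures" fix amounts to wrapping each pipeline step with explicit prints so start, success, and failure all show up in CloudWatch Logs. A minimal sketch of the idea, assuming a generic wrapper (the function name and message format are illustrative, not the project's exact code):

```python
# Sketch: explicit print() markers around each pipeline step so failures
# are visible in CloudWatch instead of silently swallowed.
def call_with_logging(step: str, fn, *args, **kwargs):
    """Run one step, printing markers that CloudWatch Logs will capture."""
    print(f"[{step}] starting")
    try:
        result = fn(*args, **kwargs)
        print(f"[{step}] ok")
        return result
    except Exception as exc:
        print(f"[{step}] FAILED: {exc}")
        raise  # re-raise so the Lambda invocation is still marked failed


# In the real handler this would wrap the Textract call, e.g.:
# response = call_with_logging("textract", textract.analyze_expense, Document={...})
print(call_with_logging("double", lambda x: x * 2, 21))  # 42
```

Re-raising after logging keeps the failure visible to Lambda's retry and DLQ machinery rather than masking it.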
This project is designed with the six pillars of the AWS Well-Architected Framework in mind to ensure a secure, high-performing, resilient, and efficient infrastructure.
- Operational Excellence
- Infrastructure as Code (IaC): The entire environment is modularized and managed via Terraform, enabling version control, repeatability, and automated provisioning through Terraform Cloud.
- Observability & Logging: Implemented structured CloudWatch logging with custom debug prints to monitor the health of Textract extractions and DynamoDB transactions.
- Deployment Automation: A robust GitHub Actions CI/CD pipeline ensures consistent deployments with automated post-deployment health checks and integration probes.
- Security
- Principle of Least Privilege: IAM roles are strictly scoped to specific resource ARNs (e.g., restricting SES permissions to a single verified identity and S3 access to the specific bucket).
- Data Protection at Rest: The S3 bucket is configured with `AES256` server-side encryption by default and a `public_access_block` to prevent unauthorized exposure.
- Secure Infrastructure: SQS permissions are explicitly defined to allow the Lambda function to send failed events to the Dead Letter Queue without broad administrative access.
- Reliability
- Fault Tolerance: Integrated an SQS Dead Letter Queue (DLQ) to capture and analyze failed processing events, preventing data loss during unexpected service interruptions.
- Point-in-Time Recovery (PITR): DynamoDB is configured with PITR enabled, protecting the extracted receipt data against accidental deletion or code bugs.
- Managed Service Resiliency: Utilizing AWS Lambda and Amazon Textract ensures the system scales and recovers automatically across multiple Availability Zones.
- Performance Efficiency
- Serverless Scaling: The architecture scales horizontally and instantaneously from zero to peak demand, as AWS handles the compute scaling for Lambda and the throughput for Textract.
- Optimized Memory: Lambda is configured with 512MB of RAM to balance execution speed and cost, ensuring faster processing of high-resolution receipt images.
- Selection of Right Services: Used Amazon Textract's specialized `AnalyzeExpense` API to offload complex OCR and document-to-data mapping, reducing the need for heavy custom ML code.
- Cost Optimization
- Zero-Waste Cleanup: Implemented an S3 Lifecycle Rule to delete receipt images after 1 day (for demo purposes) and abort incomplete multipart uploads after 7 days to eliminate unnecessary storage costs.
- Pay-as-you-go Model: Utilized DynamoDB `PAY_PER_REQUEST` billing and Lambda serverless compute to ensure the project costs $0.00 when not in use.
- Resource Tagging: Applied a centralized `common_tags` local map (Project, Environment, Owner) across all resources to enable granular cost tracking in the AWS Billing Dashboard.
- Sustainability
- Minimizing Idle Resources: By choosing a fully serverless stack, the project minimizes the environmental impact by only consuming energy during the milliseconds required to process a receipt.
- Managed Service Efficiency: Shifting hardware management to AWS allows the project to benefit from the high-occupancy and power-optimized data centers managed by the cloud provider.
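On the Reliability pillar above: events that land in the SQS Dead Letter Queue carry the original Lambda invocation payload, so replaying them means decoding the S3 event back out of the message body. A minimal sketch, assuming the standard S3 event notification shape (the helper name and replay step are hypothetical, not the project's actual tooling):

```python
# Sketch: recover (bucket, key) pairs from a dead-lettered S3 event payload
# so failed receipts can be re-processed.
import json


def s3_objects_from_dlq_body(body: str) -> list:
    """Extract (bucket, key) pairs from an S3 event JSON stored in an SQS body."""
    event = json.loads(body)
    return [
        (record["s3"]["bucket"]["name"], record["s3"]["object"]["key"])
        for record in event.get("Records", [])
    ]


# Sample body shaped like a standard S3 event notification:
sample_body = json.dumps({
    "Records": [
        {"s3": {"bucket": {"name": "receipts"}, "object": {"key": "incoming/a.jpg"}}}
    ]
})
print(s3_objects_from_dlq_body(sample_body))  # [('receipts', 'incoming/a.jpg')]
```

A replay script could then re-invoke the Lambda (or re-copy the object under `incoming/`) for each recovered pair.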
Special thanks to Tech with Lucy for the architectural inspiration and excellent AWS tutorials that helped shape this pipeline.
- See her YouTube channel here: Tech With Lucy
- Watch her video here: 5 Intermediate AWS Cloud Projects To Get You Hired (2025)

