
Contributors Forks Stargazers Issues Unlicense License LinkedIn

📑 Serverless Receipt Processor

amazon-ses-logo

Automated AI-Powered Receipt Data Extraction & Archiving

AWS Terraform Python


GitHub Actions
Infrastructure CI Production Deployment Update Documentation


Last Commit Repo Size pre-commit Checkov Security

Explore the docs »

Table of Contents
  1. About The Project
  2. Built With
  3. Use Cases
  4. Architecture
  5. File Structure
  6. Technical Reference
  7. Getting Started
  8. GitOps & CI/CD Workflow
  9. Usage & Testing
  10. Roadmap
  11. Challenges
  12. AWS Well-Architected Framework Alignment
  13. Acknowledgements

About The Project

The Serverless Receipt Processor is an intelligent document-processing pipeline that automates the tedious task of manual expense logging. Drop a receipt image into an S3 bucket, and the system uses OCR and machine learning to extract key metadata (vendor name, date, and total amount), stores the results in a NoSQL database, and notifies the user via email.

This project demonstrates a fully automated CI/CD infrastructure approach in which every component (S3, Lambda, DynamoDB, SES, IAM, and CloudWatch) is provisioned dynamically with Terraform, ensuring zero manual configuration in the AWS Console.

Built With

python lambda terraform textract dynamodb

  • Python 3.13: The latest stable Lambda runtime utilizing Boto3 for AWS SDK integrations.
  • Terraform: Used for Infrastructure as Code (IaC) with dynamic resource linking and circular-dependency protection.
  • AWS SES: Transactional email service for instant processing summaries.
  • Amazon Textract (AnalyzeExpense): Specialized ML models that extract structured receipt data without manual templates (see the sketch after this list).
  • Amazon DynamoDB: Scalable NoSQL storage for structured receipt metadata.
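
The heavy lifting in the list above is Textract's AnalyzeExpense API. The repository's real logic lives in modules/lambda/src/lambda_function.py; purely as an illustration (the function name and return shape here are assumptions, not the project's code), a minimal Boto3 call that reads the standard VENDOR_NAME, INVOICE_RECEIPT_DATE, and TOTAL summary fields could look like this:

```python
import boto3

textract = boto3.client("textract")

def extract_receipt_fields(bucket: str, key: str) -> dict:
    """Run AnalyzeExpense on an S3 object and collect its key summary fields."""
    response = textract.analyze_expense(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    fields = {}
    for document in response["ExpenseDocuments"]:
        for field in document["SummaryFields"]:
            field_type = field.get("Type", {}).get("Text", "")
            value = field.get("ValueDetection", {}).get("Text", "")
            # Textract tags vendor, date, and total with these standard types.
            if field_type in ("VENDOR_NAME", "INVOICE_RECEIPT_DATE", "TOTAL"):
                fields.setdefault(field_type, value)
    return fields
```

Because AnalyzeExpense returns typed fields rather than raw text, no per-vendor template is needed, which is what lets the pipeline accept arbitrary receipts.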

Use Cases

  • Personal Expense Tracking: Automatically log grocery and retail receipts into a digital ledger.
  • Automated Bookkeeping: Small business owners can bulk-upload receipts to generate monthly expense reports.
  • Tax Compliance: Maintain a searchable, permanent database of all business-related expenditures.

Architecture

architecture-diagram

  1. Trigger: User uploads an image/PDF to the incoming/ prefix in S3.
  2. Processing: S3 event notification triggers the Python 3.13 Lambda.
  3. Analysis: Lambda sends the document to Amazon Textract for specialized expense extraction.
  4. Storage: Extracted vendor, date, and total amount are saved into DynamoDB with a unique UUID.
  5. Notification: Amazon SES sends a summary email to the verified administrator address.
  6. Monitoring: CloudWatch Logs (managed by Terraform) track every execution step.
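
As a hedged sketch of how steps 1-5 could be wired into a single handler (the environment variable names, the receipt_id key, and the item shape below are illustrative assumptions, not the repository's exact code):

```python
import json
import os
import uuid
from urllib.parse import unquote_plus

import boto3

textract = boto3.client("textract")
dynamodb = boto3.resource("dynamodb")
ses = boto3.client("ses")

def lambda_handler(event, context):
    # Steps 1-2: the S3 event notification carries the bucket and object key.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # keys arrive URL-encoded

    # Step 3: send the document to Textract's expense-specific model.
    result = textract.analyze_expense(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    summary = {
        f.get("Type", {}).get("Text", ""): f.get("ValueDetection", {}).get("Text", "")
        for doc in result["ExpenseDocuments"]
        for f in doc["SummaryFields"]
    }

    # Step 4: persist the extracted metadata under a unique UUID.
    item = {
        "receipt_id": str(uuid.uuid4()),  # hypothetical partition key
        "vendor": summary.get("VENDOR_NAME", "Unknown"),
        "date": summary.get("INVOICE_RECEIPT_DATE", "Unknown"),
        "total": summary.get("TOTAL", "0.00"),
        "s3_key": key,
    }
    dynamodb.Table(os.environ["TABLE_NAME"]).put_item(Item=item)

    # Step 5: email a summary to the verified SES identity (the same verified
    # address is used as sender and recipient, matching the user_email input).
    ses.send_email(
        Source=os.environ["USER_EMAIL"],
        Destination={"ToAddresses": [os.environ["USER_EMAIL"]]},
        Message={
            "Subject": {"Data": f"Receipt processed: {item['vendor']}"},
            "Body": {"Text": {"Data": json.dumps(item, indent=2)}},
        },
    )
    return {"statusCode": 200, "body": json.dumps(item)}
```

Step 6 needs no application code: the CloudWatch Log Group is provisioned by Terraform, and every print() or exception from the handler lands in it automatically.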

File Structure

AWS-TERRAFORM-RECEIPT-PROCESSOR/
├── .terraform/                  # Terraform managed internal directory
├── .github/workflows/
│   ├── cd.yml                   # Production GitHub Actions pipeline
│   ├── ci.yml                   # Integration GitHub Actions pipeline
│   └── documentation.yml        # Documentation GitHub Actions pipeline
├── modules/                     # Modularized infrastructure components
│   ├── database/                # DynamoDB resources
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   ├── lambda/                  # Lambda logic and IAM roles
│   │   ├── src/
│   │   │   ├── lambda_function.py   # Core Python processing logic
│   │   │   └── lambda_function.zip  # Generated deployment package
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   └── storage/                 # S3 bucket and lifecycle configurations
│       ├── main.tf
│       ├── outputs.tf
│       └── variables.tf
├── scripts/                     # CD bash scripts (health-check, etc.)
│   ├── health-check.sh
│   └── integration-test.sh
├── assets/                      # Project documentation assets (diagrams, images)
├── .gitignore                   # Specified files and folders to ignore in Git
├── .terraform.lock.hcl          # Provider dependency lock file
├── .pre-commit-config.yaml      # Local git-hook orchestration
├── .tflint.hcl                  # TFLint AWS ruleset configuration
├── .checkov.yml                 # Checkov scan ignore list
├── .terraform-docs.yml          # Config for terraform documentation during workflow
├── main.tf                      # Root module: orchestrates the modules
├── outputs.tf                   # Aggregated outputs from modules
├── variables.tf                 # Global variables (Region, Email)
├── providers.tf                 # Terraform & Provider requirements
├── terraform.tfstate            # Current state of deployed infrastructure
├── terraform.tfstate.backup     # Previous state version for recovery
└── README.md                    # Project documentation

Technical Reference

This section is automatically updated with the latest infrastructure details.
Detailed Infrastructure Specifications

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.5.0 |
| aws | ~> 5.0 |
| random | ~> 3.0 |

Modules

| Name | Source | Version |
|------|--------|---------|
| database | ./modules/database | n/a |
| lambda | ./modules/lambda | n/a |
| storage | ./modules/storage | n/a |

Resources

No resources.

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| aws_region | The AWS region to deploy resources in | string | "us-east-1" | no |
| lambda_name | Name for the Lambda function | string | "ReceiptProcessor" | no |
| user_email | The verified email for SES sending and receiving | string | n/a | yes |

Outputs

| Name | Description |
|------|-------------|
| bucket_id | The ID of the S3 bucket created |
| dynamodb_table_name | The name of the DynamoDB table |
| lambda_function_name | The name of the Lambda function created |
| region | The AWS region being used |

Getting Started

Prerequisites

  • AWS CLI configured with Admin permissions.
  • Terraform CLI (v1.5.0+) installed locally.
  • Terraform Cloud account for remote state management.
  • AWS Region: Set the aws_region variable in variables.tf to your preferred region (defaults to us-east-1).

Terraform Cloud State Management

  1. Create a new Workspace in Terraform Cloud using the GitHub version-control workflow.
  2. In the Variables tab, add the required Terraform Variable user_email (the verified email address for SES sending and receiving).
  3. Add the following Environment Variables (AWS Credentials):
    • AWS_ACCESS_KEY_ID
    • AWS_SECRET_ACCESS_KEY
  4. Run the following command in the Terraform CLI:
    terraform login
  5. Create a token and follow the steps in browser to complete the Terraform Cloud Connection.
  6. Add the backend block inside the terraform block of your configuration:
    backend "remote" {
      hostname     = "app.terraform.io"
      organization = "<your-organization-name>"
      workspaces {
        name = "<your-workspace-name>"
      }
    }
  7. Run the command in Terraform CLI to migrate the state into Terraform Cloud:
    terraform init -migrate-state

Installation & Deployment

  1. Clone the Repository:
    git clone https://github.com/ShenLoong99/aws-terraform-receipt-processor-automation.git
  2. Provision Infrastructure:
    Initialize & Apply (Terraform Cloud): Push your code to GitHub. Terraform Cloud will automatically detect the change, run a plan, and wait for your approval.
  3. Observe workflow:
    GitHub Actions (GitOps): Observe the CI/CD workflow in the Actions tab on GitHub.
  4. Critical: Check your inbox for the "AWS Notification - Identity Verification" email and click the confirmation link.
    verify-identity-email
    ses-identity-verified

GitOps & CI/CD Workflow

This project uses a fully automated GitOps pipeline to ensure code quality and deployment reliability. The Pre-commit framework implements a "Shift-Left" strategy, ensuring that code is formatted, documented, and secure before it ever leaves your machine.

Workflow

  1. Branch Protection Rulesets
    To ensure high code quality and prevent unauthorized changes to the production environment, the main branch is governed by a GitHub Branch Ruleset.
    • Pull Request Mandatory: No code can be pushed directly to main. All changes must originate from a feature branch and be merged via a Pull Request.
    • Required Status Checks: The Infrastructure CI (Terraform Plan & Static Analysis) must pass successfully before a merge is permitted.
    • Bypass Authority: The dedicated GitHub App is added to the Bypass List with "Always allow" permissions. This allows the bot to push documentation updates directly to main without being blocked by PR requirements.
  2. Pre-commit
    • Tool: Executes terraform fmt, terraform validate, TFLint, terraform-docs, and Checkov to keep the code clean.
    • Trigger: Runs on every git commit.
    • Outcome: If any check fails, the commit is blocked. You fix the error, re-add the file, and commit again.
  3. Continuous Integration (PR)
    • Tool: Executes terraform fmt -check, terraform validate, and Checkov, then runs a plan with cost estimation and posts the results on the PR.
    • Trigger: Runs on every Pull Request.
    • Outcome: This acts as the "Gatekeeper" before code is merged to main.
  4. Continuous Delivery (Deployment)
    • Tool: Terraform Cloud + GitHub Actions OIDC.
    • Trigger: Merges to the main branch.
    • Outcome: The pipeline verifies the infrastructure state and runs post-deployment checks (health-check.sh & integration-test.sh).
  5. Dynamic README Documentation Updates
    • Tool: terraform_docs + GitHub Actions.
    • Trigger: Merges to the main branch.
    • Outcome: The pipeline verifies the infrastructure state in Terraform Cloud, retrieves the workspace outputs, and updates the README dynamically (see the sketch after this list).
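
For step 5, Terraform Cloud exposes a workspace's current outputs through its REST API. Below is a minimal Python sketch of that retrieval; it assumes a TFC_WORKSPACE_ID value alongside the TF_API_TOKEN secret and is not the pipeline's actual script:

```python
import os

import requests

TFC_API = "https://app.terraform.io/api/v2"

def fetch_workspace_outputs(workspace_id: str, token: str) -> dict:
    """Return {output_name: value} from a workspace's current state version."""
    response = requests.get(
        f"{TFC_API}/workspaces/{workspace_id}/current-state-version-outputs",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/vnd.api+json",
        },
        timeout=30,
    )
    response.raise_for_status()
    # Terraform Cloud answers in JSON:API format; each output is one entry.
    return {
        item["attributes"]["name"]: item["attributes"]["value"]
        for item in response.json()["data"]
    }

if __name__ == "__main__":
    outputs = fetch_workspace_outputs(
        os.environ["TFC_WORKSPACE_ID"], os.environ["TF_API_TOKEN"]
    )
    print(outputs.get("bucket_id"), outputs.get("lambda_function_name"))
```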

Prerequisites for GitOps

  • Repository Secret TF_API_TOKEN: Required for GitHub to communicate with Terraform Cloud.
  • OIDC Role: A GitHub Actions OIDC role (GitHubActionRole) allows the runner to verify AWS resources without long-lived keys.
  • Automated Documentation via GitHub App: Instead of using a Personal Access Token (PAT) or the default GITHUB_TOKEN, this project uses a custom GitHub App for automated tasks.
    | Secret | Description | Source |
    |--------|-------------|--------|
    | BOT_APP_ID | The unique numerical ID assigned to your GitHub App. | App Settings > General |
    | BOT_PRIVATE_KEY | The full content of the generated .pem private key file. | App Settings > Private keys |

Usage & Testing

  • Upload a receipt file (JPG/PNG) using the AWS CLI to trigger the system:
    aws s3 cp <your-receipt-image> s3://<your-bucket-name>/incoming/
    upload-item-into-bucket
  • Verify Database: Check the DynamoDB console for a new entry.
    dynamodb-stored-items
  • Verify Email: You will receive an email summary of the extracted data (Check inbox or spam section).
    receipt-summary-email
  • Verify Logs: Check CloudWatch Logs under /aws/lambda/ReceiptProcessor; the log group is created and managed by Terraform.
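
If you prefer a scripted check over console clicks, the sketch below runs the same upload-and-verify loop in Python. It is an illustration, not the repository's integration-test.sh; substitute the bucket and table names from your Terraform outputs.

```python
import time

import boto3

BUCKET = "<your-bucket-name>"    # bucket_id Terraform output
TABLE = "<your-dynamodb-table>"  # dynamodb_table_name Terraform output

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table(TABLE)

# Uploading into the incoming/ prefix triggers the Lambda via S3 notification.
s3.upload_file("sample-receipt.jpg", BUCKET, "incoming/sample-receipt.jpg")

# Poll DynamoDB: Lambda plus Textract typically finish within a few seconds.
for _ in range(12):
    time.sleep(5)
    items = table.scan(Limit=10).get("Items", [])
    if items:
        print("Processed:", items[0])
        break
else:
    raise SystemExit("No item appeared in DynamoDB within 60 seconds.")
```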

Roadmap

  • Python 3.13 Migration: Upgraded from 3.9 for longevity and performance.
  • Auto-Naming: Used random_id for globally unique S3 buckets.
  • PDF Support: Enhance Textract logic to handle multi-page PDF documents.
  • Web Dashboard: Build a React frontend to visualize receipts from DynamoDB.

Challenges

| Challenge | Solution |
|-----------|----------|
| Circular Dependencies | Resolved "Cycle" errors by using locals for function names instead of direct resource references in Log Groups. |
| Empty Bucket Deletion | Implemented force_destroy = true so Terraform can clean up S3 buckets even if they still contain receipt images. |
| Silent Failures | Added explicit print() statements to the Python logic for visibility in CloudWatch Logs during Textract calls. |

AWS Well-Architected Framework Alignment

This project is designed with the six pillars of the AWS Well-Architected Framework in mind to ensure a secure, high-performing, resilient, and efficient infrastructure.

  1. Operational Excellence
    • Infrastructure as Code (IaC): The entire environment is modularized and managed via Terraform, enabling version control, repeatability, and automated provisioning through Terraform Cloud.
    • Observability & Logging: Implemented structured CloudWatch logging with custom debug prints to monitor the health of Textract extractions and DynamoDB transactions.
    • Deployment Automation: A robust GitHub Actions CI/CD pipeline ensures consistent deployments with automated post-deployment health checks and integration probes.
  2. Security
    • Principle of Least Privilege: IAM roles are strictly scoped to specific resource ARNs (e.g., restricting SES permissions to a single verified identity and S3 access to the specific bucket).
    • Data Protection at Rest: The S3 bucket is configured with AES256 server-side encryption by default and public_access_block to prevent unauthorized exposure.
    • Secure Infrastructure: SQS permissions are explicitly defined to allow the Lambda function to send failed events to the Dead Letter Queue without broad administrative access.
  3. Reliability
    • Fault Tolerance: Integrated an SQS Dead Letter Queue (DLQ) to capture and analyze failed processing events, preventing data loss during unexpected service interruptions.
    • Point-in-Time Recovery (PITR): DynamoDB is configured with PITR enabled, protecting the extracted receipt data against accidental deletion or code bugs.
    • Managed Service Resiliency: Utilizing AWS Lambda and Amazon Textract ensures the system scales and recovers automatically across multiple Availability Zones.
  4. Performance Efficiency
    • Serverless Scaling: The architecture scales horizontally and instantaneously from zero to peak demand, as AWS handles the compute scaling for Lambda and the throughput for Textract.
    • Optimized Memory: Lambda is configured with 512MB of RAM to balance execution speed and cost, ensuring faster processing of high-resolution receipt images.
    • Selection of Right Services: Used Amazon Textract’s specialized AnalyzeExpense API to offload complex OCR and document-to-data mapping, reducing the need for heavy custom ML code.
  5. Cost Optimization
    • Zero-Waste Cleanup: Implemented an S3 Lifecycle Rule to delete receipt images after 1 day (for demo purposes) and abort incomplete multipart uploads after 7 days to eliminate unnecessary storage costs.
    • Pay-as-you-go Model: Utilized DynamoDB PAY_PER_REQUEST and Lambda serverless compute to ensure the project costs are $0.00 when not in use.
    • Resource Tagging: Applied a centralized common_tags local map (Project, Environment, Owner) across all resources to enable granular cost tracking in the AWS Billing Dashboard.
  6. Sustainability
    • Minimizing Idle Resources: By choosing a fully serverless stack, the project minimizes the environmental impact by only consuming energy during the milliseconds required to process a receipt.
    • Managed Service Efficiency: Shifting hardware management to AWS allows the project to benefit from the high-occupancy and power-optimized data centers managed by the cloud provider.
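
To spot-check two of these pillars against a live deployment, here is a small Boto3 audit sketch (the resource names are placeholders for your own Terraform outputs; this script is not part of the repository):

```python
import boto3

BUCKET = "<your-bucket-name>"
TABLE = "<your-dynamodb-table>"

s3 = boto3.client("s3")
dynamodb = boto3.client("dynamodb")

# Security pillar: confirm default server-side encryption is AES256.
encryption = s3.get_bucket_encryption(Bucket=BUCKET)
rule = encryption["ServerSideEncryptionConfiguration"]["Rules"][0]
print("SSE algorithm:", rule["ApplyServerSideEncryptionByDefault"]["SSEAlgorithm"])

# Reliability pillar: confirm point-in-time recovery is enabled on the table.
backups = dynamodb.describe_continuous_backups(TableName=TABLE)
pitr = backups["ContinuousBackupsDescription"]["PointInTimeRecoveryDescription"]
print("PITR status:", pitr["PointInTimeRecoveryStatus"])
```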

Acknowledgements

Special thanks to Tech with Lucy for the architectural inspiration and excellent AWS tutorials that helped shape this pipeline.
