
Contributors Forks Stargazers Issues Unlicense License LinkedIn

📑 Serverless Receipt Processor

amazon-ses-logo

Automated AI-Powered Receipt Data Extraction & Archiving

AWS Terraform Python


GitHub Actions
Infrastructure CI Production Deployment Update Documentation


Last Commit Repo Size pre-commit Checkov Security

Explore the docs »

Table of Contents
  1. About The Project
  2. Built With
  3. Use Cases
  4. Architecture
  5. File Structure
  6. Technical Reference
  7. Getting Started
  8. GitOps & CI/CD Workflow
  9. Usage & Testing
  10. Roadmap
  11. Challenges
  12. AWS Well-Architected Framework Alignment
  13. Acknowledgements

About The Project

The Serverless Receipt Processor is an intelligent document-processing pipeline that automates the tedious task of manual expense logging. Drop a receipt image into an S3 bucket, and the system uses OCR and machine learning to extract key metadata (vendor name, date, and total amount), stores the results in a NoSQL database, and notifies the user via email.

This project demonstrates a fully automated CI/CD infrastructure approach in which every component (S3, Lambda, DynamoDB, SES, IAM, and CloudWatch) is provisioned dynamically with Terraform, ensuring zero manual configuration in the AWS Console.

Built With

python lambda terraform textract dynamodb

  • Python 3.13: The latest stable Lambda runtime utilizing Boto3 for AWS SDK integrations.
  • Terraform: Used for Infrastructure as Code (IaC) with dynamic resource linking and circular-dependency protection.
  • AWS SES: Transactional email service for instant processing summaries.
  • Amazon Textract (AnalyzeExpense): Specialized ML models that extract structured receipt data without manual templates (see the sketch after this list).
  • Amazon DynamoDB: Scalable NoSQL storage for structured receipt metadata.
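
The heavy lifting in the list above is Textract's AnalyzeExpense API. The repository's real logic lives in modules/lambda/src/lambda_function.py; purely as an illustration (the function name and return shape here are assumptions, not the project's code), a minimal Boto3 call that reads the standard VENDOR_NAME, INVOICE_RECEIPT_DATE, and TOTAL summary fields could look like this:

```python
import boto3

textract = boto3.client("textract")

def extract_receipt_fields(bucket: str, key: str) -> dict:
    """Run AnalyzeExpense on an S3 object and collect its key summary fields."""
    response = textract.analyze_expense(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    fields = {}
    for document in response["ExpenseDocuments"]:
        for field in document["SummaryFields"]:
            field_type = field.get("Type", {}).get("Text", "")
            value = field.get("ValueDetection", {}).get("Text", "")
            # Textract tags vendor, date, and total with these standard types.
            if field_type in ("VENDOR_NAME", "INVOICE_RECEIPT_DATE", "TOTAL"):
                fields.setdefault(field_type, value)
    return fields
```

Because AnalyzeExpense returns typed fields rather than raw text, no per-vendor template is needed, which is what lets the pipeline accept arbitrary receipts.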

Use Cases

  • Personal Expense Tracking: Automatically log grocery and retail receipts into a digital ledger.
  • Automated Bookkeeping: Small business owners can bulk-upload receipts to generate monthly expense reports.
  • Tax Compliance: Maintain a searchable, permanent database of all business-related expenditures.

Architecture

architecture-diagram

  1. Trigger: User uploads an image/PDF to the incoming/ prefix in S3.
  2. Processing: S3 event notification triggers the Python 3.13 Lambda.
  3. Analysis: Lambda sends the document to Amazon Textract for specialized expense extraction.
  4. Storage: Extracted vendor, date, and total amount are saved into DynamoDB with a unique UUID.
  5. Notification: Amazon SES sends a summary email to the verified administrator address.
  6. Monitoring: CloudWatch Logs (managed by Terraform) track every execution step.
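
As a hedged sketch of how steps 1-5 could be wired into a single handler (the environment variable names, the receipt_id key, and the item shape below are illustrative assumptions, not the repository's exact code):

```python
import json
import os
import uuid
from urllib.parse import unquote_plus

import boto3

textract = boto3.client("textract")
dynamodb = boto3.resource("dynamodb")
ses = boto3.client("ses")

def lambda_handler(event, context):
    # Steps 1-2: the S3 event notification carries the bucket and object key.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # keys arrive URL-encoded

    # Step 3: send the document to Textract's expense-specific model.
    result = textract.analyze_expense(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    summary = {
        f.get("Type", {}).get("Text", ""): f.get("ValueDetection", {}).get("Text", "")
        for doc in result["ExpenseDocuments"]
        for f in doc["SummaryFields"]
    }

    # Step 4: persist the extracted metadata under a unique UUID.
    item = {
        "receipt_id": str(uuid.uuid4()),  # hypothetical partition key
        "vendor": summary.get("VENDOR_NAME", "Unknown"),
        "date": summary.get("INVOICE_RECEIPT_DATE", "Unknown"),
        "total": summary.get("TOTAL", "0.00"),
        "s3_key": key,
    }
    dynamodb.Table(os.environ["TABLE_NAME"]).put_item(Item=item)

    # Step 5: email a summary to the verified SES identity (the same verified
    # address is used as sender and recipient, matching the user_email input).
    ses.send_email(
        Source=os.environ["USER_EMAIL"],
        Destination={"ToAddresses": [os.environ["USER_EMAIL"]]},
        Message={
            "Subject": {"Data": f"Receipt processed: {item['vendor']}"},
            "Body": {"Text": {"Data": json.dumps(item, indent=2)}},
        },
    )
    return {"statusCode": 200, "body": json.dumps(item)}
```

Step 6 needs no application code: the CloudWatch Log Group is provisioned by Terraform, and every print() or exception from the handler lands in it automatically.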

File Structure

AWS-TERRAFORM-RECEIPT-PROCESSOR/
├── .terraform/                  # Terraform managed internal directory
├── .github/workflows/
│   ├── cd.yml                   # Production GitHub Actions pipeline
│   ├── ci.yml                   # Integration GitHub Actions pipeline
│   └── documentation.yml        # Documentation GitHub Actions pipeline
├── modules/                     # Modularized infrastructure components
│   ├── database/                # DynamoDB resources
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   ├── lambda/                  # Lambda logic and IAM roles
│   │   ├── src/
│   │   │   ├── lambda_function.py   # Core Python processing logic
│   │   │   └── lambda_function.zip  # Generated deployment package
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   └── storage/                 # S3 bucket and lifecycle configurations
│       ├── main.tf
│       ├── outputs.tf
│       └── variables.tf
├── scripts/                     # CD bash scripts (health-check, etc.)
│   ├── health-check.sh
│   └── integration-test.sh
├── assets/                      # Project documentation assets (diagrams, images)
├── .gitignore                   # Specified files and folders to ignore in Git
├── .terraform.lock.hcl          # Provider dependency lock file
├── .pre-commit-config.yaml      # Local git-hook orchestration
├── .tflint.hcl                  # TFLint AWS ruleset configuration
├── .checkov.yml                 # Checkov scan ignore list
├── .terraform-docs.yml          # Config for terraform documentation during workflow
├── main.tf                      # Root module: orchestrates the modules
├── outputs.tf                   # Aggregated outputs from modules
├── variables.tf                 # Global variables (Region, Email)
├── providers.tf                 # Terraform & Provider requirements
├── terraform.tfstate            # Current state of deployed infrastructure
├── terraform.tfstate.backup     # Previous state version for recovery
└── README.md                    # Project documentation

Technical Reference

This section is automatically updated with the latest infrastructure details.
Detailed Infrastructure Specifications

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.5.0 |
| aws | ~> 5.0 |
| random | ~> 3.0 |

Modules

| Name | Source | Version |
|------|--------|---------|
| database | ./modules/database | n/a |
| lambda | ./modules/lambda | n/a |
| storage | ./modules/storage | n/a |

Resources

No resources.

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| aws_region | The AWS region to deploy resources in | string | "us-east-1" | no |
| lambda_name | Name for the Lambda function | string | "ReceiptProcessor" | no |
| user_email | The verified email for SES sending and receiving | string | n/a | yes |

Outputs

| Name | Description |
|------|-------------|
| bucket_id | The ID of the S3 bucket created |
| dynamodb_table_name | The name of the DynamoDB table |
| lambda_function_name | The name of the Lambda function created |
| region | The AWS region being used |

Getting Started

Prerequisites

  • AWS CLI configured with Admin permissions.
  • Terraform CLI (v1.5.0+) installed locally.
  • Terraform Cloud account for remote state management.
  • AWS Region: Set the aws_region variable in variables.tf to your preferred region (defaults to us-east-1).

Terraform Cloud State Management

  1. Create a new Workspace in Terraform Cloud using the GitHub version-control workflow.
  2. In the Variables tab, add the required Terraform Variable user_email (the verified email address for SES sending and receiving).
  3. Add the following Environment Variables (AWS Credentials):
    • AWS_ACCESS_KEY_ID
    • AWS_SECRET_ACCESS_KEY
  4. Run the following command in the Terraform CLI:
    terraform login
  5. Create a token and follow the steps in browser to complete the Terraform Cloud Connection.
  6. Add the backend block inside the terraform block of your configuration:
    backend "remote" {
      hostname     = "app.terraform.io"
      organization = "<your-organization-name>"
      workspaces {
        name = "<your-workspace-name>"
      }
    }
  7. Run the command in Terraform CLI to migrate the state into Terraform Cloud:
    terraform init -migrate-state

Installation & Deployment

  1. Clone the Repository:
    git clone https://github.com/ShenLoong99/aws-terraform-receipt-processor-automation.git
  2. Provision Infrastructure:
    Initialize & Apply (Terraform Cloud): Push your code to GitHub. Terraform Cloud will automatically detect the change, run a plan, and wait for your approval.
  3. Observe workflow:
    GitHub Actions (GitOps): Observe the CI/CD workflow in the Actions tab on GitHub.
  4. Critical: Check your inbox for the "AWS Notification - Identity Verification" email and click the confirmation link.
    verify-identity-email
    ses-identity-verified

GitOps & CI/CD Workflow

This project uses a fully automated GitOps pipeline to ensure code quality and deployment reliability. The Pre-commit framework implements a "Shift-Left" strategy, ensuring that code is formatted, documented, and secure before it ever leaves your machine.

Workflow

  1. Branch Protection Rulesets
    To ensure high code quality and prevent unauthorized changes to the production environment, the main branch is governed by a GitHub Branch Ruleset.
    • Pull Request Mandatory: No code can be pushed directly to main. All changes must originate from a feature branch and be merged via a Pull Request.
    • Required Status Checks: The Infrastructure CI (Terraform Plan & Static Analysis) must pass successfully before a merge is permitted.
    • Bypass Authority: The dedicated GitHub App is added to the Bypass List with "Always allow" permissions. This allows the bot to push documentation updates directly to main without being blocked by PR requirements.
  2. Pre-commit
    • Tool: Executes terraform fmt, terraform validate, TFLint, terraform-docs, and Checkov to keep the code clean.
    • Trigger: Runs on every git commit.
    • Outcome: If any check fails, the commit is blocked. You fix the error, re-add the file, and commit again.
  3. Continuous Integration (PR)
    • Tool: Executes terraform fmt -check, terraform validate, and Checkov, then runs a plan with cost estimation and posts the results on the PR.
    • Trigger: Runs on every Pull Request.
    • Outcome: This acts as the "Gatekeeper" before code is merged to main.
  4. Continuous Delivery (Deployment)
    • Tool: Terraform Cloud + GitHub Actions OIDC.
    • Trigger: Merges to the main branch.
    • Outcome: The pipeline verifies the infrastructure state and runs post-deployment checks (health-check.sh & integration-test.sh).
  5. Dynamic README Documentation Updates
    • Tool: terraform_docs + GitHub Actions.
    • Trigger: Merges to the main branch.
    • Outcome: The pipeline verifies the infrastructure state in Terraform Cloud, retrieves the workspace outputs, and updates the README dynamically (see the sketch after this list).
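
For step 5, Terraform Cloud exposes a workspace's current outputs through its REST API. Below is a minimal Python sketch of that retrieval; it assumes a TFC_WORKSPACE_ID value alongside the TF_API_TOKEN secret and is not the pipeline's actual script:

```python
import os

import requests

TFC_API = "https://app.terraform.io/api/v2"

def fetch_workspace_outputs(workspace_id: str, token: str) -> dict:
    """Return {output_name: value} from a workspace's current state version."""
    response = requests.get(
        f"{TFC_API}/workspaces/{workspace_id}/current-state-version-outputs",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/vnd.api+json",
        },
        timeout=30,
    )
    response.raise_for_status()
    # Terraform Cloud answers in JSON:API format; each output is one entry.
    return {
        item["attributes"]["name"]: item["attributes"]["value"]
        for item in response.json()["data"]
    }

if __name__ == "__main__":
    outputs = fetch_workspace_outputs(
        os.environ["TFC_WORKSPACE_ID"], os.environ["TF_API_TOKEN"]
    )
    print(outputs.get("bucket_id"), outputs.get("lambda_function_name"))
```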

Prerequisites for GitOps

  • Repository Secret TF_API_TOKEN: Required for GitHub to communicate with Terraform Cloud.
  • OIDC Role: A GitHub Actions OIDC role (GitHubActionRole) allows the runner to verify AWS resources without long-lived keys.
  • Automated Documentation via GitHub App: Instead of using a Personal Access Token (PAT) or the default GITHUB_TOKEN, this project uses a custom GitHub App for automated tasks.
    | Secret | Description | Source |
    |--------|-------------|--------|
    | BOT_APP_ID | The unique numerical ID assigned to your GitHub App. | App Settings > General |
    | BOT_PRIVATE_KEY | The full content of the generated .pem private key file. | App Settings > Private keys |

Usage & Testing

  • Upload a receipt file (JPG/PNG) using the AWS CLI to trigger the system:
    aws s3 cp <your-receipt-image> s3://<your-bucket-name>/incoming/
    upload-item-into-bucket
  • Verify Database: Check the DynamoDB console for a new entry.
    dynamodb-stored-items
  • Verify Email: You will receive an email summary of the extracted data (Check inbox or spam section).
    receipt-summary-email
  • Verify Logs: Check CloudWatch Logs under /aws/lambda/ReceiptProcessor; the log group is created and managed by Terraform.
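
If you prefer a scripted check over console clicks, the sketch below runs the same upload-and-verify loop in Python. It is an illustration, not the repository's integration-test.sh; substitute the bucket and table names from your Terraform outputs.

```python
import time

import boto3

BUCKET = "<your-bucket-name>"    # bucket_id Terraform output
TABLE = "<your-dynamodb-table>"  # dynamodb_table_name Terraform output

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table(TABLE)

# Uploading into the incoming/ prefix triggers the Lambda via S3 notification.
s3.upload_file("sample-receipt.jpg", BUCKET, "incoming/sample-receipt.jpg")

# Poll DynamoDB: Lambda plus Textract typically finish within a few seconds.
for _ in range(12):
    time.sleep(5)
    items = table.scan(Limit=10).get("Items", [])
    if items:
        print("Processed:", items[0])
        break
else:
    raise SystemExit("No item appeared in DynamoDB within 60 seconds.")
```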

Roadmap

  • Python 3.13 Migration: Upgraded from 3.9 for longevity and performance.
  • Auto-Naming: Used random_id for globally unique S3 buckets.
  • PDF Support: Enhance Textract logic to handle multi-page PDF documents.
  • Web Dashboard: Build a React frontend to visualize receipts from DynamoDB.

Challenges

| Challenge | Solution |
|-----------|----------|
| Circular Dependencies | Resolved "Cycle" errors by using locals for function names instead of direct resource references in Log Groups. |
| Empty Bucket Deletion | Implemented force_destroy = true so Terraform can clean up S3 buckets even if they still contain receipt images. |
| Silent Failures | Added explicit print() statements to the Python logic for visibility in CloudWatch Logs during Textract calls. |

AWS Well-Architected Framework Alignment

This project is designed with the six pillars of the AWS Well-Architected Framework in mind to ensure a secure, high-performing, resilient, and efficient infrastructure.

  1. Operational Excellence
    • Infrastructure as Code (IaC): The entire environment is modularized and managed via Terraform, enabling version control, repeatability, and automated provisioning through Terraform Cloud.
    • Observability & Logging: Implemented structured CloudWatch logging with custom debug prints to monitor the health of Textract extractions and DynamoDB transactions.
    • Deployment Automation: A robust GitHub Actions CI/CD pipeline ensures consistent deployments with automated post-deployment health checks and integration probes.
  2. Security
    • Principle of Least Privilege: IAM roles are strictly scoped to specific resource ARNs (e.g., restricting SES permissions to a single verified identity and S3 access to the specific bucket).
    • Data Protection at Rest: The S3 bucket is configured with AES256 server-side encryption by default and public_access_block to prevent unauthorized exposure.
    • Secure Infrastructure: SQS permissions are explicitly defined to allow the Lambda function to send failed events to the Dead Letter Queue without broad administrative access.
  3. Reliability
    • Fault Tolerance: Integrated an SQS Dead Letter Queue (DLQ) to capture and analyze failed processing events, preventing data loss during unexpected service interruptions.
    • Point-in-Time Recovery (PITR): DynamoDB is configured with PITR enabled, protecting the extracted receipt data against accidental deletion or code bugs.
    • Managed Service Resiliency: Utilizing AWS Lambda and Amazon Textract ensures the system scales and recovers automatically across multiple Availability Zones.
  4. Performance Efficiency
    • Serverless Scaling: The architecture scales horizontally and instantaneously from zero to peak demand, as AWS handles the compute scaling for Lambda and the throughput for Textract.
    • Optimized Memory: Lambda is configured with 512MB of RAM to balance execution speed and cost, ensuring faster processing of high-resolution receipt images.
    • Selection of Right Services: Used Amazon Textract’s specialized AnalyzeExpense API to offload complex OCR and document-to-data mapping, reducing the need for heavy custom ML code.
  5. Cost Optimization
    • Zero-Waste Cleanup: Implemented an S3 Lifecycle Rule to delete receipt images after 1 day (for demo purposes) and abort incomplete multipart uploads after 7 days to eliminate unnecessary storage costs.
    • Pay-as-you-go Model: Utilized DynamoDB PAY_PER_REQUEST and Lambda serverless compute to ensure the project costs are $0.00 when not in use.
    • Resource Tagging: Applied a centralized common_tags local map (Project, Environment, Owner) across all resources to enable granular cost tracking in the AWS Billing Dashboard.
  6. Sustainability
    • Minimizing Idle Resources: By choosing a fully serverless stack, the project minimizes the environmental impact by only consuming energy during the milliseconds required to process a receipt.
    • Managed Service Efficiency: Shifting hardware management to AWS allows the project to benefit from the high-occupancy and power-optimized data centers managed by the cloud provider.
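
To spot-check two of these pillars against a live deployment, here is a small Boto3 audit sketch (the resource names are placeholders for your own Terraform outputs; this script is not part of the repository):

```python
import boto3

BUCKET = "<your-bucket-name>"
TABLE = "<your-dynamodb-table>"

s3 = boto3.client("s3")
dynamodb = boto3.client("dynamodb")

# Security pillar: confirm default server-side encryption is AES256.
encryption = s3.get_bucket_encryption(Bucket=BUCKET)
rule = encryption["ServerSideEncryptionConfiguration"]["Rules"][0]
print("SSE algorithm:", rule["ApplyServerSideEncryptionByDefault"]["SSEAlgorithm"])

# Reliability pillar: confirm point-in-time recovery is enabled on the table.
backups = dynamodb.describe_continuous_backups(TableName=TABLE)
pitr = backups["ContinuousBackupsDescription"]["PointInTimeRecoveryDescription"]
print("PITR status:", pitr["PointInTimeRecoveryStatus"])
```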

Acknowledgements

Special thanks to Tech with Lucy for the architectural inspiration and excellent AWS tutorials that helped shape this pipeline.
