A Resilient Node.js Application with Automated Scaling and Real-Time Observability
This project demonstrates a production-ready Two-Tier Architecture deployed on AWS using Terraform. It features a Node.js web application running on a fleet of EC2 instances managed by an Auto Scaling Group (ASG) for high availability and a Multi-AZ RDS MySQL database for data persistence.
The core focus of this implementation is Operational Excellence. By integrating the CloudWatch Agent via SSM Parameter Store, the infrastructure automatically captures system logs and application health metrics, providing deep visibility into the environment without manual configuration of individual servers.
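To make the pattern concrete, a minimal sketch of storing an agent configuration in SSM Parameter Store is shown below. The parameter name matches the one referenced later in this README; the JSON value and log group name are illustrative assumptions, not the project's exact config (which lives in the Terraform modules).

```hcl
# Store the CloudWatch Agent config centrally so every instance boots with the
# same settings. On first boot, user data can load it by name:
#   amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 \
#     -c ssm:AmazonCloudWatch-linux-webapp -s
resource "aws_ssm_parameter" "cw_agent_config" {
  name = "AmazonCloudWatch-linux-webapp"
  type = "String"

  value = jsonencode({
    logs = {
      logs_collected = {
        files = {
          collect_list = [{
            file_path       = "/var/log/cloud-init-output.log"
            log_group_name  = "/webapp/cloud-init" # illustrative log group name
            log_stream_name = "{instance_id}"
          }]
        }
      }
    }
  })
}
```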
Infrastructure as Code:
- Terraform & Terraform Cloud: Orchestrates the entire lifecycle of the AWS resources. State is managed remotely in Terraform Cloud to enable team collaboration and safe state-locking.
- HCL (HashiCorp Configuration Language): Used to define modular resources (VPC, EC2, RDS) for high reusability.
Application & Compute:
- Node.js & Express: The core application engine handling RESTful CRUD operations.
- Auto Scaling Group (ASG): Ensures high availability by maintaining a minimum of 2 instances across multiple Availability Zones (AZs).
- Launch Templates: Standardizes the "Golden Image" configuration, including instance tags, IAM profiles, and UserData.
- Application Load Balancer (ALB): Acts as the single entry point, distributing traffic and performing health checks to ensure users never hit a failing server.
Database:
- Amazon RDS (MySQL): A managed database instance residing in a Private Subnet Group, isolated from the public internet.
Custom VPC Architecture: Segments the network into public and private subnets across two Availability Zones:
- Public Subnets: Hosting the ALB and (temporarily) the EC2 instances.
- Private Subnets: Dedicated to data persistence.
- Internet Gateway (IGW): Enables external connectivity for the web tier.
- Route Tables: Manages the traffic flow between the internet and internal subnets.
- CloudWatch Logs: Dedicated log group for flow analysis, providing a complete audit trail for security and compliance.
Monitoring, Logging & Security:
- AWS Systems Manager (SSM) Parameter Store: Centrally stores the CloudWatch Agent JSON configuration, allowing for "Configuration-as-Code" updates.
- CloudWatch Logs: Centralized log repository for application journals and `cloud-init` output.
- Amazon S3: Used for ALB Access Logging. Every request processed by the load balancer is logged as a gzipped file, capturing client IPs, request paths, and server response times for deep traffic analysis.
- IAM Roles & Instance Profiles: Implements Least Privilege Access, granting EC2 only the permissions needed to write to CloudWatch and read from SSM.
- VPC Flow Logs: Captures IP traffic metadata (source/destination IPs, ports, and protocols) for all network interfaces within the VPC.
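As a sketch of how the network-level logging above might be wired in Terraform (resource names are assumptions; the delivery IAM role is referenced but omitted for brevity), using the log group name that appears later in this README:

```hcl
resource "aws_cloudwatch_log_group" "flow_logs" {
  name              = "/aws/vpc/flow-logs-debug"
  retention_in_days = 7 # automated retention keeps log costs bounded
}

resource "aws_flow_log" "vpc" {
  vpc_id               = aws_vpc.main.id            # assumed VPC resource name
  traffic_type         = "ALL"                      # capture both ACCEPT and REJECT records
  log_destination_type = "cloud-watch-logs"
  log_destination      = aws_cloudwatch_log_group.flow_logs.arn
  iam_role_arn         = aws_iam_role.flow_logs.arn # role with logs:CreateLogStream/PutLogEvents
}
```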
- Fault-Tolerant Web Hosting: Proving that the application remains live even if an entire Availability Zone (AZ) fails.
- Automated Monitoring Setup: A reusable pattern for DevOps engineers to inject monitoring agents into a fleet of servers automatically.
- Observability Template: A reference for setting up unified logging across a dynamic fleet of servers.
- Infrastructure Hardening: Demonstration of IAM least-privilege roles and secure VPC networking.
The system is deployed into a custom VPC spanning two Availability Zones to ensure high availability. The Web Tier is managed by an Auto Scaling Group (ASG) and distributed by an Application Load Balancer (ALB), while the Data Tier is strictly isolated in private subnets with restricted ingress.
- Client Ingress & Routing: Traffic enters via the Internet Gateway (IGW) and is intercepted by the Application Load Balancer. The ALB acts as the single entry point, offloading SSL (if configured) and performing health checks to ensure traffic only reaches healthy EC2 nodes.
- Elastic Compute & Scaling: The ASG maintains a minimum of 2 instances. It utilizes a Target Group to seamlessly register/deregister instances during scaling events or failovers, ensuring zero-downtime deployments (see the ASG sketch after this list).
Multi-Layered Observability:
- Host Level: The CloudWatch Agent retrieves its `ssm:AmazonCloudWatch-linux-webapp` configuration to stream `/var/log/cloud-init-output.log` and application logs.
- Network Level: VPC Flow Logs capture all IP traffic metadata to monitor for rejected connection attempts.
- Access Level: ALB Access Logs are archived in Amazon S3 for long-term auditability and traffic pattern analysis.
- Secure Data Persistence: The Node.js application communicates with the RDS MySQL instance located in the Private Subnets. Security Groups are configured using Security Group Referencing (allowing 3306 ONLY from the Web Security Group), ensuring the database is never exposed to the public internet.
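A minimal sketch of the ASG/Target Group wiring described above. Attribute values are assumptions (the real definition lives in `modules/ec2/main.tf`), but the `health_check_type = "ELB"`, `instance_refresh`, grace period, and `min_healthy_percentage` settings mirror the behavior documented elsewhere in this README:

```hcl
resource "aws_autoscaling_group" "web" {
  min_size            = 2
  max_size            = 3 # capped to stay within Free Tier budgets (assumed value)
  desired_capacity    = 2
  vpc_zone_identifier = var.public_subnet_ids # one subnet per AZ

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }

  # Register with the ALB target group and adopt its health checks so
  # failing nodes are replaced automatically.
  target_group_arns         = [var.alb_target_group_arn]
  health_check_type         = "ELB"
  health_check_grace_period = 300

  # Roll instances without dropping below half capacity during updates.
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 50
    }
  }
}
```

And a sketch of the Security Group Referencing pattern for the data tier (security group resource names assumed): the database admits MySQL traffic only from the web tier's security group, never from CIDR ranges.

```hcl
resource "aws_security_group_rule" "db_ingress_from_web" {
  type                     = "ingress"
  from_port                = 3306
  to_port                  = 3306
  protocol                 = "tcp"
  security_group_id        = aws_security_group.rds.id # attached to the RDS instance
  source_security_group_id = aws_security_group.web.id # only web-tier traffic matches
}
```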
The Logging & Auditing Ecosystem:
- S3 Bucket (ALB Access Logs): Every request hitting the Load Balancer is logged into a dedicated Amazon S3 bucket. This provides a durable audit trail of client IPs, request paths, and response latencies, crucial for compliance and traffic analysis.
CloudWatch (System & Network):
- Host Level: EC2 instances stream `/var/log/cloud-init-output.log` and application logs to CloudWatch Logs.
- Network Level: VPC Flow Logs capture all IP traffic metadata to monitor for rejected connection attempts.
IAM Roles & Security Governance:
Logging functionality is enabled through an IAM Instance Profile attached to the EC2 instances. This role follows the Principle of Least Privilege, granting specific permissions to:
- Retrieve configurations from SSM Parameter Store.
- Write log streams to CloudWatch via the `CloudWatchAgentServerPolicy`.
- Allow the ALB service principal to write access logs to the S3 bucket via a bucket policy.
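A minimal sketch of that instance profile (role and profile names are assumptions; the attached managed policies are the ones named in this README):

```hcl
resource "aws_iam_role" "ec2_role" {
  name = "webapp-ec2-role" # assumed name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

# AWS-managed policies covering agent log/metric writes and SSM access.
resource "aws_iam_role_policy_attachment" "cw_agent" {
  role       = aws_iam_role.ec2_role.name
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
}

resource "aws_iam_role_policy_attachment" "ssm_core" {
  role       = aws_iam_role.ec2_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

resource "aws_iam_instance_profile" "ec2_profile" {
  name = "webapp-ec2-profile"
  role = aws_iam_role.ec2_role.name
}
```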
```text
AWS-TERRAFORM-2-TIER-WEBAPP/
├── 📁 .github/                  # GitHub Actions workflows
│   └── workflows/               # CI/CD pipeline definitions
│       ├── cd.yml               # Production deployment
│       ├── ci.yml               # Terraform PR insights (Checkov, TFLint, Plan)
│       └── documentation.yml    # Automated documentation sync via terraform-docs
├── 📁 .terraform/               # Terraform working directory
├── 📁 assets/                   # Project documentation and images
├── 📁 modules/                  # Reusable infrastructure modules
│   ├── 📁 alb/                  # Application Load Balancer configuration
│   ├── 📁 ec2/                  # Compute tier configuration
│   │   ├── 📁 scripts/          # User data and initialization scripts (e.g., user_data.tftpl)
│   │   ├── main.tf              # EC2 Launch Template and ASG resources
│   │   ├── outputs.tf           # EC2-specific output values
│   │   ├── providers.tf         # Version constraints (no cloud block!)
│   │   └── variables.tf         # EC2-specific input variables
│   ├── 📁 rds/                  # Managed database configuration
│   ├── 📁 security_groups/      # Networking security rules
│   └── 📁 vpc/                  # Virtual Private Cloud network setup
├── 📁 scripts/                  # Automation & validation scripts
│   ├── verify-deployment.sh     # Post-deployment test script
│   └── test-functions.sh        # Functional API curl test script
├── .gitignore                   # Files excluded from version control
├── .terraform.lock.hcl          # Provider dependency lock file
├── .pre-commit-config.yaml      # Local git-hook orchestration
├── .tflint.hcl                  # TFLint AWS ruleset configuration
├── .checkov.yml                 # Checkov scan ignore list
├── .terraform-docs.yml          # terraform-docs configuration for the docs workflow
├── main.tf                      # Root module orchestrating all tiers
├── outputs.tf                   # Global output values (e.g., ALB DNS)
├── project-key.pem              # Private SSH key for EC2 access
├── providers.tf                 # AWS provider and Terraform version config
├── README.md                    # Project documentation (auto-injected by terraform-docs)
├── README.template.md           # Documentation template
├── terraform.tfstate            # Current state of deployed infrastructure
├── terraform.tfstate.backup     # Previous state backup
└── variables.tf                 # Global input variables
```

This section is automatically updated with the latest infrastructure details.
Detailed Infrastructure Specifications
Requirements

| Name | Version |
|---|---|
| terraform | >= 1.5.0 |
| aws | ~> 5.0 |
| local | ~> 2.0 |
| tls | ~> 4.0 |
Modules

| Name | Source | Version |
|---|---|---|
| alb | ./modules/alb | n/a |
| ec2 | ./modules/ec2 | n/a |
| rds | ./modules/rds | n/a |
| security_groups | ./modules/security_groups | n/a |
| storage | ./modules/storage | n/a |
| vpc | ./modules/vpc | n/a |
Resources

| Name | Type |
|---|---|
| aws_key_pair.generated_key | resource |
| local_file.private_key | resource |
| tls_private_key.main | resource |
Inputs

| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| admin_ip | CIDR block for admin IPs allowed to access resources | `string` | n/a | yes |
| availability_zones | Availability zones to use | `list(string)` | `[...]` | no |
| aws_region | AWS region to deploy resources in | `string` | `"us-east-1"` | no |
| db_password | RDS root password | `string` | n/a | yes |
| private_subnets | Private subnet CIDR blocks | `list(string)` | `[...]` | no |
| public_subnets | Public subnet CIDR blocks | `list(string)` | `[...]` | no |
| vpc_cidr | VPC CIDR block | `string` | `"10.0.0.0/16"` | no |
Outputs

| Name | Description |
|---|---|
| alb_dns_name | The DNS name of the load balancer |
| alb_sg_id | Security Group ID of the ALB |
| alb_target_group_arn | n/a |
| app_url | The clickable URL for the application |
| aws_region | The AWS region in use |
| ec2_instance_ids | IDs from the EC2 module |
| ec2_public_ips | Public IP addresses of the EC2 instances |
| ec2_sg_id | Security Group ID of the EC2 instances |
| private_subnet_ids | Private subnet IDs |
| public_subnet_ids | Public subnet IDs |
| rds_endpoint | The connection endpoint for the RDS instance |
| rds_sg_id | Security Group ID of the RDS instance |
| vpc_id | VPC ID |
- AWS Account with permissions to provision VPC, EC2, RDS, S3, CloudWatch, and IAM resources.
- Terraform CLI (v1.5.0+) installed locally.
- Terraform Cloud account for remote state management.
- Create a new Workspace in Terraform Cloud using the GitHub version control workflow.
- In the Variables tab, add the required Terraform Variables (at minimum `admin_ip` and `db_password`; mark `db_password` as sensitive).
- Add the following Environment Variables (AWS credentials):
  - `AWS_ACCESS_KEY_ID`
  - `AWS_SECRET_ACCESS_KEY`
- Run the following command in the Terraform CLI:

  ```bash
  terraform login
  ```

- Create a token and follow the steps in the browser to complete the Terraform Cloud connection.
- Add the `backend` block inside the `terraform` block:

  ```hcl
  terraform {
    backend "remote" {
      hostname     = "app.terraform.io"
      organization = "<your-organization-name>"

      workspaces {
        name = "<your-workspace-name>"
      }
    }
  }
  ```
- Run the following command in the Terraform CLI to migrate the state into Terraform Cloud:

  ```bash
  terraform init -migrate-state
  ```
- Clone the Repository:

  ```bash
  git clone https://github.com/ShenLoong99/aws-terraform-2-tier-webapp.git
  ```
- Provision Infrastructure:
  Terraform Cloud → Initialize & Apply: Push your code to GitHub. Terraform Cloud will automatically detect the change, run a plan, and wait for your approval.
- Observe the Workflow:
  GitHub (GitOps) → GitHub Actions: Follow the CI/CD workflow runs in the Actions tab on GitHub.
This project uses a fully automated GitOps pipeline to ensure code quality and deployment reliability. The Pre-commit framework implements a "Shift-Left" strategy, ensuring that code is formatted, documented, and secure before it ever leaves your machine.
- Branch Protection Rulesets
  To ensure high code quality and prevent unauthorized changes to the production environment, the `main` branch is governed by a GitHub Branch Ruleset.
  - Pull Request Mandatory: No code can be pushed directly to `main`. All changes must originate from a feature branch and be merged via a Pull Request.
  - Required Status Checks: The `Infrastructure CI` job (Terraform Plan & Static Analysis) must pass successfully before a merge is permitted.
  - Bypass Authority: The dedicated GitHub App is added to the Bypass List with "Always allow" permissions. This allows the bot to push documentation updates directly to `main` without being blocked by PR requirements.
- Pre-commit
  - Tool: Executes `terraform fmt`, `terraform validate`, `TFLint`, `terraform_docs`, and `checkov` to ensure the code is clean.
  - Trigger: Runs on every `git commit`.
  - Outcome: If any check fails, the commit is blocked. You fix the error, re-add the file, and commit again.
- Continuous Integration (PR)
  - Tool: Executes `terraform fmt -check`, `terraform validate`, and `checkov`, then runs a `plan` with cost estimation and posts the results on the PR.
  - Trigger: Runs on every Pull Request.
  - Outcome: Acts as the "Gatekeeper" before code is merged to `main`.
- Continuous Delivery (Deployment)
  - Tool: Terraform Cloud + GitHub Actions OIDC.
  - Trigger: Merges to the `main` branch.
  - Outcome: The pipeline verifies the infrastructure state and runs post-deployment health checks (`health-check.sh` & `smoke-test-website.sh`; a sketch appears after this section).
- Dynamic README Documentation Updates
  - Tool: `terraform_docs` + GitHub Actions.
  - Trigger: Merges to the `main` branch.
  - Outcome: The pipeline verifies the infrastructure state in Terraform Cloud, retrieves the outputs, and updates the README documentation dynamically.
- Repository Secret `TF_API_TOKEN`: Required for GitHub to communicate with Terraform Cloud.
- OIDC Role: A GitHub Actions OIDC role (`GitHubActionRole`) allows the runner to verify AWS resources without long-lived keys.
- Automated Documentation via GitHub App: Instead of using a Personal Access Token (PAT) or the default `GITHUB_TOKEN`, this project uses a custom GitHub App for automated tasks.

| Secret | Description | Source |
|---|---|---|
| `BOT_APP_ID` | The unique numerical ID assigned to your GitHub App. | App Settings > General |
| `BOT_PRIVATE_KEY` | The full content of the generated `.pem` private key file. | App Settings > Private keys |
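The post-deployment check referenced in the CD stage might look like this minimal sketch. It is illustrative only, not the project's actual `health-check.sh`; it assumes the ALB listens on HTTP port 80 and takes the DNS name (the `alb_dns_name` output) as its first argument.

```bash
#!/usr/bin/env bash
# Minimal post-deployment health check sketch.
set -euo pipefail

ALB_DNS="$1" # e.g. the value of the alb_dns_name Terraform output

# Retry for ~60s while fresh instances finish registering with the target group.
for attempt in {1..12}; do
  if curl -fsS "http://${ALB_DNS}/health" > /dev/null; then
    echo "Health check passed on attempt ${attempt}."
    exit 0
  fi
  echo "Attempt ${attempt} failed; retrying in 5s..." >&2
  sleep 5
done

echo "Health check failed." >&2
exit 1
```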
Once the Auto Scaling Group shows `InService` instances:
- Access the App: Copy the Public IP of any running instance (or the Load Balancer DNS if applicable) and paste it into your browser at port 3000.
- Add an Item: Type a task (e.g., "Setup CloudWatch Alarms") in the input box and click Add. Verify the item appears in the list.
- Mark Complete: Click the Complete button. The item should move or change status, confirming a `PUT` request to the RDS MySQL backend.
- Delete an Item: Click Delete. Verify the item is removed from the UI and the database via a `DELETE` request.
The application is designed to demonstrate horizontal scaling and stateless execution. You can observe the Load Balancer's "Round Robin" or "Least Outstanding Requests" algorithm in action by following these steps:
- Observe the Footer: The webpage UI displays the Instance ID (from the EC2 metadata) of the specific server that processed the request.
- Refresh the Browser: By refreshing repeatedly, you will notice the Server ID toggle between the two instances in the Auto Scaling Group.
- The Significance: This confirms that the ALB is successfully distributing ingress traffic across multiple Availability Zones. It also demonstrates that the application state is correctly decoupled; despite switching between different backend servers, the To-Do List data remains consistent because both instances are connected to the same central RDS MySQL database.
- Manually terminate one instance in the console. Watch the ASG Activity tab; you will see the ASG detect the "Unhealthy" status and automatically spin up a replacement in a different subnet to maintain the desired capacity of 2.
- Generate Traffic: Click through your application for 5-10 minutes (add/delete some tasks).
- Verify S3 Storage: Navigate to the Amazon S3 Console and open your logging bucket (e.g., `webapp-alb-logs-...`).
- Inspect Logs: Follow the folder path `AWSLogs/<ACCOUNT_ID>/elasticloadbalancing/<REGION>/`.
- Download a Log File: Extract the gzipped file and view it in a text editor.
- Generate a "Reject" Event: Attempt to connect to your RDS database directly from your home computer. Since the RDS is in a private subnet, this connection will time out.
- Verify in CloudWatch: Navigate to CloudWatch > Log Groups > `/aws/vpc/flow-logs-debug`.
- Search for Rejections: Look for logs where the `action` is `REJECT`. This confirms that your Security Group "Least Privilege" rules are actively blocking unauthorized external traffic.
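The same search works from the terminal with the AWS CLI, using the log group name above:

```bash
# List recent REJECT records from the flow-log group.
aws logs filter-log-events \
  --log-group-name /aws/vpc/flow-logs-debug \
  --filter-pattern "REJECT" \
  --max-items 20 \
  --query 'events[].message' \
  --output text
```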
If the application is not responding or you notice issues with the initial setup, SSH into an instance and use these commands to diagnose the root cause.
- Debugging & Troubleshooting
- Check the Network Tab: Open Developer Tools (F12) and ensure requests to `/health` are returning `200 OK`.
- Incognito Mode: If "Mark Complete" or "Delete" buttons fail with a `removeChild` error in the console, try the app in an Incognito Window. This prevents browser extensions from interfering with the dynamic DOM updates.
- Hard Reload: Use `Ctrl + F5` to ensure your browser isn't running a cached, older version of the JavaScript bundle.
There are two primary ways to access your application instance:

Option 1: SSH. Access the virtual machine securely using your generated RSA private key.

```bash
ssh -i project-key.pem ec2-user@<INSTANCE_PUBLIC_IP>
```
⚠️ IMPORTANT: Ensure you are operating in your local terminal linked to the Terraform Cloud Workspace before proceeding with these commands.

Option 2: Session Manager. If you do not have SSH access or want to connect directly via the browser:
- Navigate to the EC2 Console and select your instance.
- Click the Connect button at the top.
- Select the Session Manager tab and click Connect.
Note: This is the recommended secure method as it doesn't require opening port 22 to the public internet.
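If you prefer the terminal, the same Session Manager connection can be opened via the AWS CLI (assumes the Session Manager plugin is installed; the instance ID is illustrative):

```bash
# Open a Session Manager shell without opening port 22.
aws ssm start-session --target i-0123456789abcdef0
```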
- Cloud-init Logs: Record the `UserData` execution during the first boot. Check these if Node.js or the CloudWatch Agent fails to install (see the command sketch after this list).
- Application Service: The application runs as a background service managed by `systemd`. Check the service status to verify that the Node.js API is active.
- Database Connectivity: Verify that your EC2 instance can talk to the private RDS instance across subnets. Note: You must install the MySQL client on the EC2 instance first to run these queries.

| Step | Command |
|---|---|
| 1. Install Client | `sudo dnf install mariadb105 -y` |
| 2. Test Connection | `mysql -h <RDS_ENDPOINT> -u admin -p<PASSWORD> webapp_2_tier_db -e "SELECT * FROM items;"` |

Troubleshooting: If the connection times out after installing the client, ensure the RDS Security Group has an Inbound Rule allowing Port 3306 from the EC2 Security Group ID.
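These standard commands cover the first two checks. The `webapp` unit name is an assumption; confirm the actual service name defined in `user_data.tftpl`.

```bash
# Did UserData finish cleanly?
sudo tail -n 50 /var/log/cloud-init-output.log

# Is the Node.js API active? ("webapp" is an assumed unit name)
sudo systemctl status webapp
sudo journalctl -u webapp --since "10 minutes ago"

# Is the CloudWatch Agent running?
sudo systemctl status amazon-cloudwatch-agent
```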
- Multi-AZ Deployment: Instances distributed across different subnets and managed by an Auto Scaling Group (ASG).
- Config-as-Code: SSM Parameter Store used for unified agent settings.
- Cost Controls: Automated log retention and Infrequent Access tiers.
- CI/CD Integration: Automate Terraform Apply via GitHub Actions.
- ALB Integration: Application Load Balancer implemented as a single entry point.
- Traffic Auditing: S3-based ALB access logging for long-term compliance.
- Network Visibility: VPC Flow Logs integrated with CloudWatch for deep packet metadata analysis.
- Dynamic UI Rendering: Fixed Terraform interpolation and Node.js DOM conflicts for a resilient frontend.
- Reliability & Failure Design: Implemented ASG with `health_check_type = "ELB"` and `instance_refresh` to ensure automated recovery and zero-downtime updates.
- HTTPS: Implement SSL termination using AWS Certificate Manager (ACM).
- Dynamic Scaling Policy: Implement `aws_autoscaling_policy` to scale out based on CPU utilization (currently configured as a fixed-size fleet; see the sketch after this list).
- Bastion Host: Move the database and web servers to private subnets for enhanced security, accessing them via a bastion host.
- Edge Security (WAF): Deploy AWS WAF to protect the Application Load Balancer from common web exploits like SQL Injection and Cross-Site Scripting (XSS).
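A sketch of what the planned scaling policy might look like (not yet in the codebase; the policy name, ASG reference, and target value are assumptions):

```hcl
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "webapp-cpu-target-tracking" # assumed name
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 60 # add capacity when average CPU rises above ~60%
  }
}
```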
| Challenge | Solution |
|---|---|
| Logging Visibility | Amazon Linux 2023 removed /var/log/messages. Switched agent tracking to cloud-init-output.log and application-specific paths. |
| The "3 Instances" Mystery | Observation of 3 instances running simultaneously during updates. Learned that the min_healthy_percentage (50%) forces ASG to launch a new instance before killing the old one. |
| ASG Grace Periods | Default cooldowns and grace periods (300s) caused delays. Adjusted health_check_grace_period for faster development cycles. |
| Terraform Cloud State Migration | Moving to a remote backend caused local_file resources to "flap" (recreate on every apply). I learned that local file resources are incompatible with ephemeral cloud runners and shifted to managing keys via sensitive outputs. |
| Log Class UI Errors | Encountered "Operation not supported" errors when viewing INFREQUENT_ACCESS logs. Learned that while IA saves costs, it has a UI propagation delay and limited "Live Tail" support compared to STANDARD logs. |
| IAM Propagation | Encountered AccessDenied during initial agent boot. Ensured the AmazonSSMManagedInstanceCore policy was attached to the Instance Profile. |
| Free Tier Constraints | To avoid unexpected costs, I limited the max_size of the ASG and deferred automatic scaling policies. This highlights the balance between high-availability architecture and budget management in a dev/test environment. |
| Load Balancer Integration | Configuring the aws_lb_target_group to correctly perform health checks on Port 3000. Learned that the ASG must be explicitly attached to the Target Group to receive traffic. |
| Public vs. Private Security | Successfully isolated the RDS in a private subnet. Navigating the challenge of allowing EC2-to-RDS communication while keeping the database shielded from the public internet. |
| Statelessness | To achieve this, the Node.js application was built to be stateless. No session data is stored on the local disk of the EC2 instance; all persistent data is offloaded to RDS. This allows the ALB to swap servers mid-session without the user losing their progress. |
| ALB Algorithms | By default, the ALB uses a round-robin approach at the request level. If you don't see the ID change immediately, it may be due to HTTP Keep-Alive (the browser reusing a connection) or Stickiness settings (which are disabled in this project to better demonstrate load distribution). |
| Terraform Variable Interpolation in UserData | One of the most complex challenges was passing JavaScript code through Terraform's templatefile() function. Terraform interprets ${...} as variables to be replaced. I implemented Double-Dollar Escaping ($${id}) for browser-side template literals and transitioned to standard String Concatenation for complex HTML generation to ensure Terraform and the Bash cat command didn't corrupt the JavaScript logic. |
| Stateful Database Initialization | Ensuring the database schema was ready before the application started required a robust Node.js initialization script that handles "Table Already Exists" errors and performs conditional migrations (like adding the completed column) without crashing the service. |
| Browser Extension DOM Interference | I identified that the browser extension MetaMask can cause NotFoundError when scripts rapidly update the DOM via innerHTML. Documented the requirement for testing in Incognito Mode to isolate application logic from third-party browser scripts. |
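To make the escaping challenge concrete, here is a minimal, self-contained demonstration of the double-dollar rule (the snippet content is illustrative, not the project's actual template):

```hcl
locals {
  app_port = 3000

  # $${...} escapes Terraform interpolation, so the rendered string keeps a
  # literal ${row.id} for the browser's template literal. Rendered result:
  #   console.log(`id: ${row.id} on port 3000`)
  snippet = "console.log(`id: $${row.id} on port ${local.app_port}`)"
}

output "rendered_snippet" {
  value = local.snippet
}
```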
This project demonstrates a Cost-Optimized Development Environment, prioritizing security and observability over high-availability redundancy to minimize AWS expenditure while maintaining production-grade standards.
Special thanks to Tech with Lucy for the architectural inspiration and excellent AWS tutorials that helped shape this pipeline.
- See her YouTube channel here: Tech With Lucy
- Watch her video here: 5 Intermediate AWS Cloud Projects To Get You Hired (2025)




