A high-performance Java application that targets a ~6x speedup in processing large log files by implementing a concurrent producer-consumer pipeline.
- The Problem
- The Solution
- Key Skills & Concepts Showcase
- Architecture
- Performance Benchmark
- Tech Stack
- Getting Started
- Contributing
- License
In data engineering, processing massive log files sequentially is a critical bottleneck. This common approach leaves modern multi-core CPUs severely underutilized, leading to wasted time, increased infrastructure costs, and delayed data insights. A single-threaded application simply cannot keep up with the scale of modern data.
This project addresses the problem with a concurrent Producer-Consumer pipeline. This architecture decouples the I/O-bound task (reading the file) from the CPU-bound task (parsing data), so the two run in parallel: a reader thread keeps feeding lines while a pool of worker threads parses them. The expected result is a shorter processing time (an estimated ~6x speedup with 8 threads) and better use of the available CPU cores.
This project is a practical demonstration of a strong command of:
- Core Concepts:
  - Concurrency & Multithreading
  - Producer-Consumer Architectural Pattern
  - Performance Benchmarking & Analysis
- Key Java Technologies:
  - `java.util.concurrent` (`ExecutorService`, `BlockingQueue`, `ConcurrentHashMap`)
  - Java 17 Records & `Optional` for clean, robust data modeling
  - Stream API for data manipulation
- Development Practices:
  - Professional Documentation & Project Presentation
  - Standardized Git Feature Branch Workflow
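As an illustration of the Records, `Optional`, and Stream API items above, here is a minimal sketch of how a log line might be modeled and parsed. The `LogEntry` record, the `parse` and `parseAll` helpers, and the regular expression are hypothetical names chosen for this example, and the sketch assumes the input follows the Common Log Format; the project's actual data model may differ.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogEntryDemo {

    // Hypothetical immutable model of one access-log line (not the project's actual class).
    public record LogEntry(String host, String method, String path, int status) {}

    // Assumed Common Log Format:
    // host ident user [timestamp] "METHOD path PROTOCOL" status bytes
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[[^\\]]+\\] \"(\\S+) (\\S+)[^\"]*\" (\\d{3}) \\S+$");

    // Returns an empty Optional instead of null or an exception for malformed lines.
    public static Optional<LogEntry> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new LogEntry(
                m.group(1), m.group(2), m.group(3), Integer.parseInt(m.group(4))));
    }

    // Stream API: parse every line and silently drop the ones that do not match.
    public static List<LogEntry> parseAll(List<String> lines) {
        return lines.stream()
                .map(LogEntryDemo::parse)
                .flatMap(Optional::stream)
                .toList();
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] \"GET /history/apollo/ HTTP/1.0\" 200 6245",
                "not a log line");
        parseAll(lines).forEach(System.out::println);
    }
}
```

Returning `Optional<LogEntry>` makes the "line could not be parsed" case explicit in the type, so callers cannot forget to handle it.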
The system's data flow is designed for maximum parallelism:
graph TD
A[Input: nasa_access.log] --> B(Reader Thread);
B -- Puts lines into --> C{BlockingQueue};
C -- Threads take lines from --> D((Processor Thread Pool));
D -- Aggregates counts into --> E((ConcurrentHashMap));
E --> F[Output: Final Report];
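The flow in the diagram can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the project's actual code: the class and method names (`PipelineSketch`, `countHosts`), the queue capacity of 1024, the "poison pill" shutdown signal, and the choice to count requests per host (the first space-separated token of each line) are all assumptions made for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class PipelineSketch {

    // Sentinel ("poison pill") telling a worker that no more lines will arrive.
    // A distinct String instance, compared by identity, so no real line can match it.
    private static final String POISON_PILL = new String("__EOF__");

    // Counts lines per host; "host" is assumed to be the first space-separated token.
    public static Map<String, Long> countHosts(Iterable<String> lines, int workers) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024); // bounded: applies back-pressure
        Map<String, Long> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Consumers: each worker takes lines from the queue until it sees the pill.
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        String line = queue.take();
                        if (line == POISON_PILL) {
                            break;
                        }
                        String host = line.split(" ", 2)[0];
                        counts.merge(host, 1L, Long::sum); // atomic per-key update
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer: a dedicated reader thread puts lines, then one pill per worker.
        Thread reader = new Thread(() -> {
            try {
                for (String line : lines) {
                    queue.put(line);
                }
                for (int i = 0; i < workers; i++) {
                    queue.put(POISON_PILL);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "reader");
        reader.start();

        try {
            reader.join();
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the pipeline", e);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "host-a - - [01/Jul/1995:00:00:01 -0400] \"GET / HTTP/1.0\" 200 100",
                "host-b - - [01/Jul/1995:00:00:02 -0400] \"GET / HTTP/1.0\" 200 100",
                "host-a - - [01/Jul/1995:00:00:03 -0400] \"GET / HTTP/1.0\" 404 0");
        System.out.println(countHosts(sample, 4));
    }
}
```

In a real run the reader would stream lines from the log file rather than iterate over an in-memory list. The bounded queue keeps memory use in check: if the workers fall behind, `put` blocks the reader until space frees up.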
The following benchmark illustrates the expected performance improvement when processing a large log file (~500MB) on a standard multi-core machine.
| Mode | Threads | Execution Time | Speedup |
|---|---|---|---|
| Single-Threaded | 1 | ~25s (?) | 1x |
| Multi-Threaded | 8 | ~4s (?) | ~6.2x |
Note: The `(?)` indicates that these are estimated target values. Actual benchmarks will be populated upon completion of Issue #2 and Issue #3.
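For reference, the speedup column is the single-threaded time divided by the multi-threaded time; with the estimated values, 25 s / 4 s = 6.25, which the table rounds to ~6.2x. The sketch below shows one simple way such numbers could be measured and derived. The class and method names are hypothetical, the parallel-efficiency figure (speedup divided by thread count) is an extra metric not reported in the table, and a single `System.nanoTime()` measurement ignores JVM warm-up, so treat it as a rough illustration rather than a rigorous benchmark harness.

```java
final class BenchmarkSketch {

    // Wall-clock time of one run, in milliseconds (no warm-up or repetition).
    public static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Speedup = baseline time / parallel time.
    public static double speedup(double baselineSeconds, double parallelSeconds) {
        return baselineSeconds / parallelSeconds;
    }

    // Parallel efficiency = speedup / number of threads (1.0 would be perfect scaling).
    public static double efficiency(double speedup, int threads) {
        return speedup / threads;
    }

    public static void main(String[] args) {
        // Estimated target values from the table, not measured results.
        double s = speedup(25.0, 4.0);
        System.out.println(s);                // prints 6.25
        System.out.println(efficiency(s, 8)); // prints 0.78125
    }
}
```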
| Technology | Justification |
|---|---|
| Java 17 | Chosen for its Long-Term Support (LTS) status and modern features like Records for creating immutable data models concisely. |
| Maven | Selected for its declarative dependency management and standardized build lifecycle, making the project portable and easy to build. |
| JUnit 5 | Used for its modern, modular architecture for writing clean and organized unit tests. |
- JDK 17 or higher (`java --version`)
- Apache Maven 3.8.0 or higher (`mvn --version`)
- Clone the Repository

      # Clones the project to your local machine
      git clone https://github.com/manhtruong03/concurrent-log-processor.git
      cd concurrent-log-processor

- Build the Project with Maven

      # This command compiles the code, runs tests, and packages it into a single executable JAR
      mvn clean package

  - Expected Outcome: You will see a `[INFO] BUILD SUCCESS` message and a new JAR file located in the `/target` directory.

- Run the Application

      # The output, including the execution time, will be printed to the console
      java -jar target/concurrent-log-processor-1.0-SNAPSHOT.jar
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
Please see the CONTRIBUTING.md file for detailed guidelines on the development workflow.
This project is distributed under the MIT License. See the LICENSE file for more information.

