A scalable, fault-tolerant distributed job queue system using Redis to manage tasks across worker nodes with job tracking, retries, prioritization, and a dashboard for monitoring and health checks.
- Architecture Overview
- Features
- Getting Started
- Folder Structure
- API Endpoints
- Deployments
- Test The Application
The objective is to design and implement a distributed job queue system using Redis that can:
- Distribute computational tasks dynamically across multiple worker nodes.
- Track job statuses (`pending`, `processing`, `completed`, `failed`) and handle failures through retries or alternative mechanisms.
- Provide a user-friendly dashboard for real-time monitoring of job statuses and worker health.
- Support horizontal scaling of worker nodes and job prioritization.
- Ensure fault tolerance and graceful recovery from worker or network failures.
- Efficiently manage a distributed queue that handles job priorities and dependencies.
- Implement a robust retry mechanism for failed jobs and a dead-letter queue for irrecoverable tasks.
- Store and retrieve job results in a scalable manner.
- Handle dynamic workload variations and enable worker auto-scaling based on queue length.
- Implement job dependencies so that certain jobs start only after others are completed.
- Track real-time job progress for better monitoring and debugging (see the job-record sketch below).
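To make these objectives concrete, here is a minimal sketch of what a job record could look like in TypeScript. The field names are illustrative assumptions based on the API examples later in this README, not the repo's actual schema:

```typescript
// Illustrative job record; field names are assumptions, not the actual schema.
type JobStatus = "pending" | "processing" | "completed" | "failed";

interface Job {
  id: string;                    // e.g. a UUID
  type: string;                  // e.g. "email"
  data: Record<string, unknown>; // arbitrary payload passed with the job
  priority: number;              // higher-priority jobs are scheduled first
  dependencies: string[];        // IDs of jobs that must complete first
  attempts: number;              // retry attempts consumed so far
  status: JobStatus;
  progress: number;              // 0-100, for real-time progress tracking
}
```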
1. Frontend:
A React.js application with an intuitive interface for monitoring and managing the system, providing:
- Worker health and active worker status.
- Queue length and job benchmarks.
- Total jobs (processing, completed, failed).
- Detailed view of all jobs (pending, processing, canceled, failed, completed) with type, status, progress, and priorities, including dynamic pagination and filtering by parameters.
- Input modal for simulating jobs.
2. Backend:
A Node.js (TypeScript) server that manages the job queue, workers, and job lifecycle on the Redis cluster.
3. Cloud Infrastructure:
- Networking:
  - AWS VPC for managing network configurations.
  - AWS EC2 for hosting the application instances.
  - AWS Security Groups for managing access control to EC2 instances.
  - AWS NAT Gateway for enabling internet access from private subnets.
- DevOps:
  - Pulumi as IaC to manage AWS resources and automate deployments.
- Priority-based job scheduling
- Automatic worker scaling (1-10 workers)
- Job retry with exponential backoff
- Dead letter queue for failed jobs (both sketched after this list)
- Real-time job progress tracking
- Worker health monitoring
- Comprehensive metrics collection
- Circuit breaker pattern implementation
- Job dependency management
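As a rough illustration of the retry-with-backoff and dead-letter features above, here is a minimal sketch assuming ioredis and invented key names (`queue:retry`, `queue:dead-letter`); the repo's actual implementation may differ:

```typescript
import Redis from "ioredis";

// Illustrative sketch only: the key names and the retry policy below are
// assumptions, not the repo's actual implementation.
const redis = new Redis({ host: process.env.REDIS_HOST ?? "localhost" });

const MAX_ATTEMPTS = 5;     // assumed retry cap
const BASE_DELAY_MS = 1000; // first backoff delay

async function handleFailure(job: { id: string; attempts: number }): Promise<void> {
  if (job.attempts + 1 >= MAX_ATTEMPTS) {
    // Irrecoverable: park the job on the dead-letter queue for inspection.
    await redis.lpush("queue:dead-letter", JSON.stringify(job));
    return;
  }
  // Exponential backoff: 1s, 2s, 4s, 8s, ... between attempts.
  const delayMs = BASE_DELAY_MS * 2 ** job.attempts;
  // A sorted set scored by "retry at" time lets a poller re-enqueue due jobs.
  await redis.zadd(
    "queue:retry",
    Date.now() + delayMs,
    JSON.stringify({ ...job, attempts: job.attempts + 1 })
  );
}
```

A separate poller (or the worker loop itself) would periodically move due entries from `queue:retry` back onto the main queue.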
Follow these steps to run the application locally
1. Clone the Repository
git clone https://github.com/BayajidAlam/r-queue
cd r-queue
2. Install Dependencies
cd client
yarn install
3. Set Up Environment Variables
Create a .env file in the /client directory and add this:
VITE_PUBLIC_API_URL=<your-backend-url>
4. Run the development server
yarn dev
1. Install Dependencies
cd server
yarn install
2. Set Up Environment Variables
Create a .env file in the /server directory and add this:
REDIS_HOST=localhost
PORT=5000
3. Navigate to the docker-compose folder and run all containers:
cd docker-compose
docker-compose up -d
4. Run the following command to create the cluster (with --cluster-replicas 1, the six nodes form three masters, each with one replica):
redis-cli --cluster create \
<node-1 IP>:6379 <node-2 IP>:6379 <node-3 IP>:6379 \
<node-4 IP>:6379 <node-5 IP>:6379 <node-6 IP>:6379 \
--cluster-replicas 1
You will see something like this:
5. Verify the Cluster
redis-cli -c cluster nodes
6. Now run the server and test your application:
yarn dev
You will see something like this:
- /client: Frontend
  - /public: Static files and assets.
  - /src: Application code.
  - .env: Frontend environment variables.
  - package.json
- /server: Backend
  - /src: Backend source code.
  - bulkJobSimulation.ts: Script for creating jobs in bulk.
  - docker-compose: For creating the Redis cluster locally in a Docker environment.
  - .env: Backend environment variables.
  - package.json
- /IaC: Infrastructure
  - /pulumi
    - index.ts: Pulumi IaC files for managing AWS resources, including the networking and compute needed to create the distributed Redis cluster.
  - ansible: Ansible files for creating and configuring the frontend, backend, Redis setup, and Redis cluster.
The application has the following APIs. Base URL:
http://localhost:5000/api
API Endpoint:
http://localhost:5000/api/health
Sample response:
{
"status": "unhealthy",
"details": {
"redisConnected": true,
"activeWorkers": 0,
"queueLength": 0,
"processingJobs": 0,
"metrics": {
"avgProcessingTime": 0,
"errorRate": 0,
"throughput": 0
}
},
"timestamp": "2025-01-10T12:20:37.856Z",
"version": "1.0"
}
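For a quick programmatic check, a minimal sketch assuming an ES module on Node 18+ and the local port above (the sample response presumably reports unhealthy because no workers are active):

```typescript
// Poll the health endpoint and log degraded states.
const health = await fetch("http://localhost:5000/api/health").then((r) => r.json());
if (health.status !== "healthy") {
  console.warn("Queue unhealthy:", health.details); // e.g. activeWorkers: 0
}
```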
API Endpoint:
http://localhost:5000/api/jobs
To create a job, your request body should look like the following:
{
"type": "email",
"data": {
"Hello": "Hello",
"world": "world"
},
"priority": 3,
"dependencies": [
"a3342ec2-fcae-4e8d-8df8-8f59a2c7d58c"
]
}
Sample response:
{
"acknowledged": true,
"insertedId": "675002aea8b348ab91f524d0"
}
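For reference, the same request can be issued from a script; a minimal sketch assuming an ES module on Node 18+ and POST as the method:

```typescript
// Enqueue a job via the API; the payload mirrors the example above.
const res = await fetch("http://localhost:5000/api/jobs", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    type: "email",
    data: { Hello: "Hello", world: "world" },
    priority: 3,
    dependencies: [], // or IDs of jobs that must complete first
  }),
});
console.log(await res.json()); // e.g. { "acknowledged": true, "insertedId": "..." }
```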
Before deploying the application, ensure you have the following:
- An AWS account with EC2 setup permissions.
- Docker installed on your local machine for building containers.
- AWS CLI installed and configured with your credentials.
- Node.js (version 18 or above), with npm and yarn, installed for both the frontend and backend applications.
- Pulumi installed for managing AWS infrastructure as code.
- TypeScript installed on your computer.
1. Clone the Repository
git clone https://github.com/BayajidAlam/r-queue
cd r-queue/IaC/pulumi
2. Configure AWS CLI
Run the following and provide your Access Key and Secret Key:
aws configure
3. Create Key Pair
Create a new key pair for our instances using the following command:
aws ec2 create-key-pair --key-name MyKeyPair --query 'KeyMaterial' --output text > MyKeyPair.pem
4. Deploy the infrastructure
pulumi up
Your AWS VPC resource map will look like this:
5. Run the Ansible Playbooks
First, navigate to the ansible directory in pulumi and run the following commands:
ansible-playbook -e @vars.yml playbooks/redis-setup.yml
ansible-playbook -e @vars.yml playbooks/redis-cluster.yml
ansible-playbook -e @vars.yml playbooks/backend-setup.yml
ansible-playbook -e @vars.yml playbooks/frontend-setup.yml
Now access the frontend at <public-IP>:5173 and you will see something like this:
Create a job: Click on the Add new job modal and provide the necessary inputs:
- Job type: the type of job you want to simulate.
- Processing Time: how long the job will take to process.
- Priority: the priority of the job.
- Job Data (JSON): the data passed with the job.
- Dependencies (comma-separated job IDs): if the job depends on another job, add its ID here from the Recent Activity dashboard.
- Simulate Failure: check this if you want to simulate a failure.
Now click on the Add new job button.
Summary:
- One Active worker
- Processing 1
- One item is showing in Recent Job
- After the job processing is done, Completed = 1. Using this job ID you can create a new job with dependencies, and by checking Simulate Failure you can create a job that will fail at the end.
Test with bulk input: First, SSH into your backend EC2 instance, navigate to /opt/r-queue/server/src, and run:
simulate totalJobs duration batchSize
For example, simulate 20 2 10 creates 20 total jobs with a duration of 2 and a batch size of 10.