Let's assume your business is about selling tickets for events (concerts, shows, etc ...). Your design architecture is pretty simple:
Users search for events and select a ticket to purchase
The event is sent to a ticketing service
The payment service processes the payment
Once the payment processed, the user and ticket details are updated in the database.
To finish, email service and sms/phone service send notification to the user with the purchase confirmation and the details of the event.
As illustrated in the following diagram
So far so good, until a major event happens, an artist is planning to perform in multiple cities, and the demand is growing exponentially.
There are about 1 million users trying to purchase tickets for the same event, and the system is not able to handle the load.
Email service and phone/sms service are crashing and fail to process the requests from the ticketing service
The ticketing service cannot process new requests
There is no payment processing, with a huge impact and a loss of revenue
Users are frustrated with their experiences (slow response, etc..), and some decide to cancel the purchasing process.
Purchased tickets event and user information cannot be updated in the database
Users don't receive notifications of their purchase, and send you complaints
As illustrated in the following diagram, the bottleneck creates cascading failures.
There are few ways to solve this problem, one of them is to use a message queue system.
A messaging queue system is a system that allows you to send and receive messages between services. It's a way to decouple the services communications by adding an extra layer to prevent cascade failures when one of the services goes down.
The service/sender instead of using a direct api call to another service, sends the message to a queue (asynchronously), and the service/receiver retrieves (polls) the message from the queue. There are no tight dependencies between both services, allowing more reliability and fault-tolerance in your system.
The main components of a messaging queue system are :
Producer : The application that sends the message to the message queue system.
Queue : stores the messages.
Consumer : The application that receives the message from the message queue system.
The main advantages of messaging queue :
The sender (application/producer) doesn't have to wait for a response from another application
The receiver (application/consumer) can process the messages at any time
Multiple consumers can process messages concurrently, messages aren't lost if the service goes down.
Here is an example of message/event (json) that can be sent between sender and receiver :
{
"event": "TICKET_PURCHASED",
"timestamp": "2025-10-06T14:32:00Z",
"data": {
"ticket_id": "TCKT-98423",
"event_id": "EVT-12345",
"event_name": "Jazz World Tour 2025 - Marseille",
"user": {
"user_id": "USR-7821",
"name": "John Doe",
"email": "john.doe@minimalistsa.com"
},
"seat": {
"section": "A",
"row": "5",
"seat_number": "42"
},
"price": {
"amount": 25.00,
"currency": "EUR"
},
"payment": {
"transaction_id": "PAY-982374",
"status": "SUCCESS",
"method": "Credit Card"
}
}
}
The new architecture should look like this :
The new architecture component "queue system" should handle more events and more loads.
The email and phone/sms service should process the events/messages at their own pace .
If one of the services (ticketing service, email service, sms/phone service) goes down, the messages are still kept in the queue for later processing.
Messaging queue service often offer core features :
Message Enqueueing / Dequeueing
Acknowledgements (ACK/NACK)
Retries & Backoff
Dead Letter Queue (DLQ)
Priority Queues
Message Ordering
This is the basic functionality of messaging queue system, the producer (sender) pushes an event to the queue, then the consumer (receiver) polls the event from the queue. There are different ways the consumer can retrieve the messages, either on best effort or using FIFO (first in first out) model.
ACK : the message has been processed successfully by the consumer
NACK : the message cannot be processed by the consumer and is returned to the queue
Failed messages can be processed again by the consumer, you can define retry limit or interval between retries.
Messages that failed too many times or cannot be processed are sent to the DLQ (Dead Letter Queue), to reduce the size of the messages to be processed in the queue.
They can be processed manually or more investigation can be done.
Priority can be assigned to messages, so that higher priority messages are processed before messages with lower priority.
During peak workload, you probably want to process in priority critical events (orders, payments), and process lower priority events later (product recommendations, etc ..).
This guarantees that message are processed in the order they arrive (FIFO : First In First Out)
Capacity planning is critical and important, but sometimes things can break with a proportion not fully anticipated, causing cascading failures.
One option to prevent and reduce the risks, is to implement a graceful degradation. The goal is to progressively reject the requests as the load increases or disable temporarily some minor services or features not essential or critical to the user experience. What the user sees is a working system (eventually with a slight degradation), not a full crash.
The messaging queue service as all services can also be under pressure, heavy load, it's important to monitor how the service behaves and which metrics can help you understand and alert if threshold are met.
Monitoring Objectives | Description | Core metrics to collect | Alerts and thresholds |
---|---|---|---|
Availability & Health |
Is the broker up and reachable? Are producers and consumers connected? |
broker_up active_connections queue_status (ready/unavailable) |
broker_up == 0 (broker down) active_connections == 0 for > 5 min |
Performance |
Is the queue processing messages in a timely manner? Are consumers catching up with the producers? |
message_publish_rate message_ack_rate consumer_lag (Kafka) message_latency_p95 / p99 |
consumer_lag > 1000 messages for > 5m p95_latency > 500ms |
Resources & Capacity |
Is the broker running out of memory or disk space? Are queues growing too large? |
memory_used / memory_available disk_usage queue_length |
memory > 85% for 5m disk_usage > 80% queue_length > 10k |
Errors & Faults |
Are there any failed message deliveries or rejected messages? Are dead-letter queues (DLQs) growing? |
message_error_rate dlq_message_count |
error_rate > 1% over 5m DLQ count > 100 |
Security & Access |
Monitor unauthorized access, privilege changes, or suspicious traffic. |
failed_auth_count unauthorized_connection_attempts |
failed_auth_count > 5/min suspicious IPs detected |
Replication & Cluster State |
Ensure replicas are in sync and partitions healthy. |
partition_under_replicated leader_election_rate |
under_replicated_partitions > 0 leader_election_rate > 1/min |
Capacity planning should help you determine how many messages can be handled during peak time. The following should help you plan for peak workloads
How many messages/s can be ingested ?
What's the size of each message ?
How long the message remain in the queue ?
How many consumers can read simultaneously ?
Here is a brief comparison of the available messaging queue
Amazon SQS | RabbitMQ | Apache Kafka |
---|---|---|
Managed distributed queue (AWS) | Traditional message broker | Distributed log streaming platform |
Up to 3,000–5,000 msg/sec per queue (standard), can scale horizontally | Typically 20–50k msg/sec per node; depends on hardware | Easily 1M+ msg/sec per cluster, horizontally scalable |
You can also perform stress tests on your messaging queue system with different loads with tools like :
Industry / Context | Use Case / Scenario | Description / Impact | Sources |
---|---|---|---|
E-Commerce | Order Processing | High-volume orders handled asynchronously, ensuring smooth checkout and backend processing. | More details |
Fintech / Payment Systems | Transaction Processing | Payment requests queued for asynchronous processing and retries, ensuring reliability during spikes. | More details |
Healthcare | Patient Appointments & Records | Patient scheduling and data handling queued asynchronously, preventing double-bookings or delays. | More details |
Email / Marketing Platforms | Large-scale Email Dispatch | Email sending queued to handle high volume and ensure timely delivery. | More details |
IoT / Sensor Networks | Device Data Ingestion | Sensor data queued for async processing, ensuring scalability and reliability. | More details |
Microservices | Inter-Service Communication | Messages queued to decouple services, enabling asynchronous and reliable communication. | More details |