In September 2025, Nudato processed over 50,000 real-time notifications during a single 3-day event. Our original pipeline wasn't ready. This is the story of how we rebuilt it — the technical decisions, the mistakes we made, the AWS services we leaned on, and the monitoring setup that let us sleep through the next conference season.
The problem
Nudato's original notification system was simple: when an event occurred (new registration, room change, organizer message), a worker processed the notification and sent it via WebSocket to the client.
The architecture was straightforward. An Express API received the action, wrote to PostgreSQL, and a set of background workers polled the database every 2 seconds looking for new notifications to send. When they found one, they pushed it through a WebSocket connection to the target device.
It worked fine with 500 attendees. With 5,000 at a major tech conference in Mexico City, we started seeing:
- 8-15 second latency on notifications (unacceptable for real-time room changes)
- Message loss during activity spikes (~2% of messages never reached clients)
- Saturated workers blocking other task processing (report generation, email sends)
- Database CPU at 85% from polling queries running every 2 seconds across 4 workers
// The original pipeline (simplified)
Event → API → DB Write → Worker (poll DB every 2s) → WebSocket Push
                             ↓
               Bottleneck here: workers poll every 2s,
               each poll = SELECT query with JOIN across 3 tables
The polling approach had a fundamental problem: under load, the workers couldn't process the backlog fast enough, which meant the next poll picked up even more records, which meant processing took even longer. A classic cascading degradation pattern.
We knew this wouldn't scale for conference season (October-November), where Nudato typically handles 15-20 concurrent events with 2,000-8,000 attendees each. We had roughly 6 weeks to fix it.
The solution: Event-driven with AWS
We redesigned the pipeline using a pure event-driven architecture. Instead of polling for changes, the system reacts to events as they happen. Each action emits a typed event, and downstream consumers process those events independently and asynchronously.
Designing the event schema
Before writing any infrastructure code, we spent a full day designing our event schema. This is the part most teams skip, and it's the part that matters most. A poorly designed event schema leads to tightly coupled consumers and brittle integrations.
We settled on a structured envelope format based on CloudEvents:
// Event envelope — every event in the system follows this shape
interface NudatoEvent<T extends string, D> {
  id: string;              // UUID v4, unique per event
  source: string;          // e.g., "nudato.registration", "nudato.schedule"
  type: T;                 // e.g., "room-change", "new-attendee"
  time: string;            // ISO 8601 timestamp
  specversion: '1.0';
  data: D;                 // Event-specific payload
  metadata: {
    eventId: string;       // Nudato event (conference) ID
    correlationId: string; // For tracing across services
    causationId: string;   // ID of the event that caused this one
    actor: {
      type: 'user' | 'system' | 'organizer';
      id: string;
    };
  };
}

// Example: Room change event
type RoomChangeEvent = NudatoEvent<'room-change', {
  sessionId: string;
  sessionTitle: string;
  previousRoom: string;
  newRoom: string;
  reason: string;
  effectiveAt: string;
  affectedAttendeeCount: number;
}>;
Key design decisions:
- Events carry enough context for consumers to act without querying back to the source. The room change event includes the session title and affected attendee count — consumers don't need to look these up.
- Correlation and causation IDs let us trace a chain of events across the system. When debugging "why did this attendee get 3 notifications?", we can follow the causation chain.
- Events are immutable facts. They describe something that happened, not a command to do something. This distinction matters for replay and auditing.
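To keep those causation chains intact, every follow-up event copies the parent's correlation ID and records the parent's own ID as its causation ID. A minimal sketch of that convention — the `deriveEvent` helper and its types are ours for illustration, not part of the platform code:

```typescript
import { randomUUID } from 'node:crypto';

interface EventMetadata {
  eventId: string;
  correlationId: string;
  causationId: string;
  actor: { type: 'user' | 'system' | 'organizer'; id: string };
}

interface Envelope<D> {
  id: string;
  source: string;
  type: string;
  time: string;
  specversion: '1.0';
  data: D;
  metadata: EventMetadata;
}

// Build a follow-up event: same correlation chain, caused by the parent.
function deriveEvent<D>(
  parent: Envelope<unknown>,
  source: string,
  type: string,
  data: D
): Envelope<D> {
  return {
    id: randomUUID(),
    source,
    type,
    time: new Date().toISOString(),
    specversion: '1.0',
    data,
    metadata: {
      ...parent.metadata,
      correlationId: parent.metadata.correlationId, // unchanged across the whole chain
      causationId: parent.id,                       // points at the event that triggered this one
    },
  };
}
```

When debugging "why did this attendee get 3 notifications?", walking `causationId` links back toward the shared `correlationId` reconstructs the full chain.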
New architecture
                    Event-Driven Pipeline

Mobile App / Web
     ↓
API Gateway (REST)
     ↓
Lambda (validate + emit)
     ↓
EventBridge (central event bus)
     ↓              ↓                 ↓
┌─────────┐  ┌──────────────┐  ┌──────────────┐
│  SQS    │  │     SQS      │  │   Kinesis    │
│ Urgent  │  │ Informational│  │   Firehose   │
│ Queue   │  │    Queue     │  │ (analytics)  │
└────┬────┘  └──────┬───────┘  └──────┬───────┘
     ↓              ↓                 ↓
  Lambda         Lambda         S3 (raw events)
(immediate    (batch every      → Athena queries
   send)      30 seconds)
     ↓              ↓
  API Gateway WebSocket
 (persistent connections)
            ↓
DynamoDB (connection registry + message persistence)
Key components:
- Amazon EventBridge as the central event bus. Every action on the platform emits a typed event. EventBridge gives us content-based routing, built-in retry logic, and an archive for event replay.
- Separate SQS queues by priority. Urgent notifications (room changes, emergency alerts) go to a queue with no batching — Lambda processes them immediately. Informational notifications (new attendee joined, feedback received) go to a queue with 30-second batching to reduce WebSocket noise.
- Lambda functions that process each notification type independently. Each function is small, focused, and independently deployable. A bug in the "feedback notification" Lambda doesn't affect room change notifications.
- API Gateway WebSocket for persistent client connections. Connection state is tracked in DynamoDB, which gives us a registry of "which user is connected on which device."
- Kinesis Firehose as a parallel consumer that streams all events to S3 for analytics and auditing. Every event that flows through the system is permanently recorded.
The pattern that helped most: Fan-out with filters
EventBridge lets you route events to different targets based on their content. This gave us the flexibility to process each notification type optimally without building routing logic into our application code:
// EventBridge rule for urgent notifications
// These go directly to Lambda with no batching
{
  "source": ["nudato.events"],
  "detail-type": ["room-change", "emergency-alert", "schedule-update"],
  "detail": {
    "priority": ["high"]
  }
}
// → Target: Lambda function "process-urgent-notification"
// → Retry: 3 attempts, then DLQ
// → Concurrency: up to 100 simultaneous executions

// Rule for informational notifications
// These get batched to reduce client-side noise
{
  "source": ["nudato.events"],
  "detail-type": ["new-attendee", "feedback-received", "photo-uploaded"],
  "detail": {
    "priority": ["low"]
  }
}
// → Target: SQS queue with 30-second batching window
// → Lambda consumes in batches of up to 10 messages
// → Client receives a single "3 new updates" notification

// Rule for analytics (every event, no filtering)
{
  "source": [{ "prefix": "nudato" }]
}
// → Target: Kinesis Firehose → S3
// → Every event is archived for analysis
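For intuition on how these patterns behave, here is a deliberately simplified matcher covering just the two constructs used above — arrays of literal values (OR semantics) and `prefix` matching — applied to flat, top-level fields. It illustrates the semantics; it is not EventBridge's implementation, and it skips nested `detail` matching:

```typescript
type Matcher = string | { prefix: string };

// A field matches when ANY entry in the pattern array matches its value (OR).
function fieldMatches(patterns: Matcher[], value: string): boolean {
  return patterns.some((p) =>
    typeof p === 'string' ? p === value : value.startsWith(p.prefix)
  );
}

// A rule matches when EVERY field named in the pattern matches (AND).
// Fields present on the event but absent from the pattern are ignored.
function ruleMatches(
  pattern: Record<string, Matcher[]>,
  event: Record<string, string>
): boolean {
  return Object.entries(pattern).every(([field, matchers]) => {
    const value = event[field];
    return value !== undefined && fieldMatches(matchers, value);
  });
}
```

So the analytics rule `{ source: [{ prefix: 'nudato' }] }` matches every event whose source starts with `nudato`, regardless of its other fields — which is exactly why it fans every event out to Firehose.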
WebSocket connection management
One of the trickier parts of the system was managing WebSocket connections at scale. We needed to track which users were connected, on which devices, and route notifications to the right connections.
// Lambda: Handle WebSocket connection
import { APIGatewayProxyWebsocketHandlerV2 } from 'aws-lambda';
import { DynamoDBClient, PutItemCommand, DeleteItemCommand } from '@aws-sdk/client-dynamodb';

const db = new DynamoDBClient({});

export const connectHandler: APIGatewayProxyWebsocketHandlerV2 = async (event) => {
  const connectionId = event.requestContext.connectionId;
  const userId = event.requestContext.authorizer?.userId;
  const eventId = event.queryStringParameters?.eventId;

  // Reject connections that arrive without an authenticated user or a conference id
  if (!userId || !eventId) {
    return { statusCode: 400 };
  }

  await db.send(new PutItemCommand({
    TableName: process.env.CONNECTIONS_TABLE!,
    Item: {
      PK: { S: `EVENT#${eventId}` },
      SK: { S: `USER#${userId}#CONN#${connectionId}` },
      connectionId: { S: connectionId },
      userId: { S: userId },
      connectedAt: { S: new Date().toISOString() },
      GSI1PK: { S: `USER#${userId}` },
      GSI1SK: { S: `CONN#${connectionId}` },
      ttl: { N: String(Math.floor(Date.now() / 1000) + 86400) }, // 24h TTL
    },
  }));

  return { statusCode: 200 };
};

export const disconnectHandler: APIGatewayProxyWebsocketHandlerV2 = async (event) => {
  const connectionId = event.requestContext.connectionId;
  // Query GSI to find the connection record (we don't know the PK at disconnect time)
  // Then delete it
  await deleteConnectionByConnectionId(connectionId);
  return { statusCode: 200 };
};
// Lambda: Send notification to all connected devices for a user
import { ApiGatewayManagementApi } from '@aws-sdk/client-apigatewaymanagementapi';
import { DynamoDBClient, QueryCommand } from '@aws-sdk/client-dynamodb';

const db = new DynamoDBClient({});

async function sendToUser(userId: string, eventId: string, payload: NotificationPayload) {
  const wsClient = new ApiGatewayManagementApi({
    endpoint: process.env.WEBSOCKET_ENDPOINT!,
  });

  // Find all active connections for this user in this event
  const result = await db.send(new QueryCommand({
    TableName: process.env.CONNECTIONS_TABLE!,
    IndexName: 'GSI1',
    KeyConditionExpression: 'GSI1PK = :userId',
    ExpressionAttributeValues: {
      ':userId': { S: `USER#${userId}` },
    },
  }));

  const connections = result.Items ?? [];
  const staleConnections: string[] = [];

  // Fan out to all devices
  await Promise.allSettled(
    connections.map(async (conn) => {
      const connectionId = conn.connectionId.S!;
      try {
        await wsClient.postToConnection({
          ConnectionId: connectionId,
          Data: Buffer.from(JSON.stringify(payload)),
        });
      } catch (error: any) {
        // SDK v3 surfaces a 410 Gone as GoneException
        if (error?.name === 'GoneException' || error?.$metadata?.httpStatusCode === 410) {
          // Connection is stale — mark for cleanup
          staleConnections.push(connectionId);
        }
      }
    })
  );

  // Clean up stale connections
  await Promise.all(
    staleConnections.map((connId) => deleteConnectionByConnectionId(connId))
  );
}
Dead-letter queue processing
In production, messages fail. Network blips, malformed payloads, Lambda timeouts. Without a dead-letter queue strategy, you lose those messages silently. Our DLQ setup ensures nothing is lost:
// Lambda: Process dead-letter queue messages
// Runs every 5 minutes, processes failed messages
import { SQSEvent } from 'aws-lambda';
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
import { DynamoDBClient, PutItemCommand } from '@aws-sdk/client-dynamodb';
import { SNSClient, PublishCommand } from '@aws-sdk/client-sns';

const sqs = new SQSClient({});
const db = new DynamoDBClient({});
const sns = new SNSClient({});

export const dlqProcessor = async (event: SQSEvent) => {
  for (const record of event.Records) {
    const originalEvent = JSON.parse(record.body);
    const failureCount = Number(record.attributes.ApproximateReceiveCount);

    // Log the failure with full context
    console.error('DLQ message received', {
      messageId: record.messageId,
      failureCount,
      originalEvent,
      firstFailedAt: record.attributes.SentTimestamp,
    });

    if (failureCount <= 3) {
      // Retry: put it back on the main queue
      await sqs.send(new SendMessageCommand({
        QueueUrl: process.env.MAIN_QUEUE_URL!,
        MessageBody: record.body,
        MessageAttributes: {
          retryCount: { DataType: 'Number', StringValue: String(failureCount) },
        },
      }));
    } else {
      // Permanent failure: store for manual review
      await db.send(new PutItemCommand({
        TableName: process.env.FAILED_EVENTS_TABLE!,
        Item: {
          PK: { S: `FAILED#${new Date().toISOString().split('T')[0]}` },
          SK: { S: record.messageId },
          event: { S: record.body },
          failureCount: { N: String(failureCount) },
          lastError: { S: originalEvent.lastError || 'unknown' },
        },
      }));

      // Alert the team
      await sns.send(new PublishCommand({
        TopicArn: process.env.ALERTS_TOPIC!,
        Subject: '[Nudato] Permanent notification failure',
        Message: `Message ${record.messageId} failed ${failureCount} times. Event type: ${originalEvent.type}`,
      }));
    }
  }
};
Infrastructure as Code
The entire pipeline is defined in AWS CDK. This was non-negotiable — we needed to be able to reproduce the infrastructure in a staging environment for load testing, and we needed a clear audit trail of infrastructure changes.
// Simplified CDK stack for the notification pipeline
import * as cdk from 'aws-cdk-lib';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as lambda from 'aws-cdk-lib/aws-lambda-nodejs';
import * as sources from 'aws-cdk-lib/aws-lambda-event-sources';

export class NotificationPipelineStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    // Central event bus
    const bus = new events.EventBus(this, 'NudatoBus', {
      eventBusName: 'nudato-events',
    });

    // Dead-letter queue
    const dlq = new sqs.Queue(this, 'NotificationDLQ', {
      queueName: 'nudato-notifications-dlq',
      retentionPeriod: cdk.Duration.days(14),
    });

    // Urgent notification queue
    const urgentQueue = new sqs.Queue(this, 'UrgentQueue', {
      queueName: 'nudato-urgent-notifications',
      visibilityTimeout: cdk.Duration.seconds(30),
      deadLetterQueue: { queue: dlq, maxReceiveCount: 3 },
    });

    // Informational queue with batching
    const infoQueue = new sqs.Queue(this, 'InfoQueue', {
      queueName: 'nudato-info-notifications',
      visibilityTimeout: cdk.Duration.seconds(60),
      deadLetterQueue: { queue: dlq, maxReceiveCount: 3 },
    });

    // Route urgent events
    new events.Rule(this, 'UrgentRule', {
      eventBus: bus,
      eventPattern: {
        source: ['nudato.events'],
        detailType: ['room-change', 'emergency-alert', 'schedule-update'],
      },
      targets: [new targets.SqsQueue(urgentQueue)],
    });

    // Route informational events
    new events.Rule(this, 'InfoRule', {
      eventBus: bus,
      eventPattern: {
        source: ['nudato.events'],
        detailType: ['new-attendee', 'feedback-received', 'photo-uploaded'],
      },
      targets: [new targets.SqsQueue(infoQueue)],
    });

    // Lambda processors
    const urgentProcessor = new lambda.NodejsFunction(this, 'UrgentProcessor', {
      entry: 'src/lambdas/process-urgent.ts',
      timeout: cdk.Duration.seconds(10),
      memorySize: 256,
      reservedConcurrentExecutions: 100,
    });

    const infoProcessor = new lambda.NodejsFunction(this, 'InfoProcessor', {
      entry: 'src/lambdas/process-info.ts',
      timeout: cdk.Duration.seconds(30),
      memorySize: 256,
      reservedConcurrentExecutions: 50,
    });

    // Wire each queue to its processor: urgent messages one at a time,
    // informational messages in batches with the 30-second window
    urgentProcessor.addEventSource(new sources.SqsEventSource(urgentQueue, { batchSize: 1 }));
    infoProcessor.addEventSource(new sources.SqsEventSource(infoQueue, {
      batchSize: 10,
      maxBatchingWindow: cdk.Duration.seconds(30),
    }));
  }
}
Monitoring and observability
We set up monitoring before migrating a single line of application code. This was the most important decision of the project. Without observability in place, we would have been flying blind during the migration and unable to compare the old and new systems.
CloudWatch dashboards
We built three dashboards:
1. Pipeline Health (real-time):
- Messages in each queue (current depth)
- Lambda invocation count and error rate
- WebSocket active connections
- EventBridge event count by type
2. Latency Tracking:
- End-to-end latency: from event emission to client receipt of the notification (P50, P95, P99)
- Per-stage latency: API to EventBridge, EventBridge to SQS, SQS to Lambda, Lambda to WebSocket
- Latency by event type (urgent vs. informational)
3. Business Metrics:
- Notifications sent per event (conference)
- Delivery success rate
- DLQ depth (should be 0 in steady state)
- Cost per 1,000 notifications
Alarms
// Critical alarms that page the on-call engineer
const alarms = [
  {
    name: 'DLQ Depth > 0',
    metric: dlq.metricApproximateNumberOfMessagesVisible(),
    threshold: 1,
    evaluationPeriods: 1,
    action: 'PAGE — messages are failing',
  },
  {
    name: 'Urgent Latency P95 > 500ms',
    metric: urgentLatencyMetric.p95(),
    threshold: 500,
    evaluationPeriods: 3,
    action: 'PAGE — urgent notifications are delayed',
  },
  {
    name: 'WebSocket Errors > 5%',
    metric: wsErrorRate,
    threshold: 5,
    evaluationPeriods: 2,
    action: 'WARN — check for stale connections',
  },
  {
    name: 'Lambda Throttles > 0',
    metric: urgentProcessor.metricThrottles(),
    threshold: 1,
    evaluationPeriods: 1,
    action: 'WARN — increase reserved concurrency',
  },
];
Custom metrics with embedded metric format
We used CloudWatch Embedded Metric Format to emit custom metrics directly from Lambda functions without the overhead of separate PutMetricData API calls:
import { createMetricsLogger, Unit } from 'aws-embedded-metrics';

async function processNotification(event: NudatoEvent) {
  const metrics = createMetricsLogger();
  metrics.setNamespace('Nudato/Notifications');
  metrics.setDimensions({ EventType: event.type, Priority: event.data.priority });

  const startTime = Date.now();
  try {
    await sendToUser(event.data.targetUserId, event.data.eventId, event.data.payload);
    const latency = Date.now() - startTime;
    metrics.putMetric('DeliveryLatency', latency, Unit.Milliseconds);
    metrics.putMetric('DeliverySuccess', 1, Unit.Count);
  } catch (error) {
    metrics.putMetric('DeliveryFailure', 1, Unit.Count);
    throw error;
  } finally {
    await metrics.flush();
  }
}
The migration: Running old and new in parallel
We didn't do a big-bang migration. For 2 weeks, we ran both systems in parallel: the old polling system continued to deliver notifications, and the new event-driven system ran alongside it in "shadow mode" — processing events and logging what it would have sent, but not actually delivering to clients.
This let us compare:
- Did the new system process every event the old system processed?
- Was the new system's latency consistently better?
- Were there any event types the new system dropped?
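The comparison itself was mostly set arithmetic over the two systems' delivery logs. A sketch of the kind of check this involves — the `DeliveryRecord` shape and helper name are illustrative, not our production code:

```typescript
interface DeliveryRecord {
  eventId: string;   // the notification event's unique id
  latencyMs: number; // emit-to-deliver latency
}

interface ShadowReport {
  missingInNew: string[]; // delivered by the old system, never seen by the new one
  extraInNew: string[];   // seen only by the new system
  slowerInNew: string[];  // events where the new path was not faster
}

function compareShadowLogs(oldLog: DeliveryRecord[], newLog: DeliveryRecord[]): ShadowReport {
  const oldById = new Map(oldLog.map((r) => [r.eventId, r]));
  const newById = new Map(newLog.map((r) => [r.eventId, r]));

  const missingInNew = [...oldById.keys()].filter((id) => !newById.has(id));
  const extraInNew = [...newById.keys()].filter((id) => !oldById.has(id));
  const slowerInNew = [...oldById.values()]
    .filter((r) => {
      const n = newById.get(r.eventId);
      return n !== undefined && n.latencyMs >= r.latencyMs;
    })
    .map((r) => r.eventId);

  return { missingInNew, extraInNew, slowerInNew };
}
```

Anything in `missingInNew` is a dropped event type to chase down; a non-empty `slowerInNew` means the migration isn't buying what it should.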
We found 3 bugs during shadow mode that would have caused production incidents:
- A race condition where the WebSocket connection registry showed a user as connected on a device they had already closed
- A serialization issue with emoji in event names (a surprisingly common edge case in multilingual events)
- A DynamoDB throughput limit we hadn't accounted for during burst scenarios
After shadow mode, we did a gradual rollout: 10% of events on Monday, 25% on Tuesday, 50% on Wednesday, 100% on Thursday. Each step was monitored for 24 hours before proceeding.
Results
After the redesign:
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Latency (P50) | 3-5s | 45ms | ~100x faster |
| Latency (P95) | 8-15s | 120ms | ~100x faster |
| Latency (P99) | 20-30s | 350ms | ~85x faster |
| Lost messages | ~2% during spikes | 0% (with DLQ) | Eliminated |
| Monthly cost | $340 (EC2 workers 24/7) | $45 (Lambda pay-per-use) | 87% reduction |
| Max concurrent users | ~2,000 | ~50,000+ | 25x capacity |
| Deploy risk | High (monolith) | Low (per-function) | Isolated deploys |
| Time to add new event type | 2-3 days | 2-4 hours | ~10x faster |
The most impactful change was cost: by moving from EC2 workers running 24/7 to Lambda that only executes when there are events, we reduced costs by 87%. During off-peak hours (nights, weekdays without events), the system costs essentially nothing. During a 5,000-attendee conference, Lambda scales automatically to handle the load and we pay only for the compute we actually use.
The second most impactful change was development velocity. Adding a new notification type used to require modifying the worker code, updating the polling query, testing the interaction with existing notification types, and deploying the entire worker fleet. Now it's: define the event schema, write a Lambda function, create an EventBridge rule, deploy. Each new notification type is completely isolated.
Lessons learned
After running this system in production for 6 months and handling dozens of multi-thousand-attendee events, here's what we'd tell anyone building event-driven systems:
1. Not everything needs to be event-driven
Simple CRUD reads and writes are still request-response. We only moved to events what truly benefits from asynchronous processing. The event detail page? That's a regular API call. Room availability check? Regular API call. Room change notification to 3,000 attendees? Event-driven.
Our rule of thumb: if the action needs to happen synchronously for the user to continue their task, it's request-response. If it's a reaction to something that already happened and the user doesn't need to wait for it, it's an event.
2. Dead-letter queues are mandatory, not optional
In production, there will always be messages that fail. Malformed payloads, transient network errors, downstream service outages, Lambda cold start timeouts. Without DLQ, you lose those messages silently. With DLQ, you have a safety net and a clear signal that something needs attention.
We review our DLQ every morning as part of our operational routine. If the DLQ has any messages, it's the first thing we investigate. Most of the time it's a transient error and the automatic retry handled it. Occasionally it surfaces a real bug.
3. Observability first, always
Before migrating, we set up CloudWatch dashboards with latency, throughput, and error metrics by event type. Without this, we would have been flying blind. The dashboards paid for themselves on day one of the migration when we caught a latency spike that would have gone unnoticed.
Invest in structured logging. Every log line from every Lambda function includes the correlation ID, event type, and processing stage. When something goes wrong at 11 PM, the difference between "Lambda failed" and "Lambda process-urgent failed on room-change event for conference ABC, correlation-id XYZ, at the WebSocket delivery stage" is the difference between 30 minutes and 3 minutes of debugging.
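A minimal shape for those log lines — one JSON object per line so CloudWatch Logs Insights can filter on any field. The `logStage` helper and its field names are illustrative, not our exact production logger:

```typescript
interface LogContext {
  correlationId: string;
  eventType: string;
  conferenceId: string;
}

// Emit one structured JSON line per processing stage.
function logStage(
  ctx: LogContext,
  stage: 'received' | 'validated' | 'routed' | 'delivered' | 'failed',
  detail: Record<string, unknown> = {}
): string {
  const line = JSON.stringify({
    timestamp: new Date().toISOString(),
    level: stage === 'failed' ? 'error' : 'info',
    stage,
    ...ctx,
    ...detail,
  });
  console.log(line); // CloudWatch Logs captures stdout per invocation
  return line;
}
```

With this in place, the 11 PM query is a one-liner: filter on `correlationId` and read the stages in order.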
4. Idempotency is not optional
In a distributed system, messages can be delivered more than once. SQS guarantees at-least-once delivery, not exactly-once. Every consumer must be idempotent — processing the same message twice should produce the same result as processing it once.
// Idempotent notification delivery
import { DynamoDBClient, GetItemCommand, PutItemCommand } from '@aws-sdk/client-dynamodb';

const db = new DynamoDBClient({});

async function processNotification(event: NudatoEvent) {
  const idempotencyKey = `notif:${event.id}:${event.metadata.correlationId}`;

  // Check if we've already processed this event
  const existing = await db.send(new GetItemCommand({
    TableName: process.env.PROCESSED_EVENTS_TABLE!,
    Key: { PK: { S: idempotencyKey } },
  }));

  if (existing.Item) {
    console.log('Duplicate event, skipping', { eventId: event.id });
    return; // Already processed
  }

  // Process the notification
  await sendToUser(event);

  // Mark as processed (with TTL for automatic cleanup).
  // Note: check-then-mark leaves a small race window between concurrent
  // consumers; a conditional write (attribute_not_exists(PK)) closes it
  // when strict single-delivery marking matters.
  await db.send(new PutItemCommand({
    TableName: process.env.PROCESSED_EVENTS_TABLE!,
    Item: {
      PK: { S: idempotencyKey },
      processedAt: { S: new Date().toISOString() },
      ttl: { N: String(Math.floor(Date.now() / 1000) + 86400) }, // 24h TTL
    },
  }));
}
5. Design for replay from day one
Events are immutable facts. If you archive them (and you should), you can replay them through the system to recover from failures, backfill new features, or debug production issues.
We've used event replay three times in production:
- Once to backfill a new analytics dashboard that needed historical notification data
- Once to recover notifications that were processed but not delivered due to a WebSocket API Gateway issue
- Once to reproduce a bug that only occurred with a specific sequence of events
6. Start with fewer event types than you think you need
Our initial design had 23 event types. We launched with 8. The other 15 either weren't needed or were better handled as attributes on existing event types. More event types mean more routing rules, more Lambda functions, more things to monitor, and more things to go wrong.
Start coarse, split when you have data. It's much easier to split a "session-update" event into "room-change" and "time-change" events later than it is to merge them back.
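One way to keep that later split cheap is to carry the fine-grained distinction as a discriminating field on the coarse event first, then promote it to its own event type once the data justifies it. A sketch of the pattern (the type and field names here are ours for illustration):

```typescript
// Coarse event payload: one event type, with a discriminating field.
type SessionUpdate =
  | { kind: 'room-change'; sessionId: string; newRoom: string }
  | { kind: 'time-change'; sessionId: string; newStart: string };

// Consumers switch on the discriminant today; when "room-change" is promoted
// to its own event type, each case body moves verbatim into the new consumer.
function handleSessionUpdate(update: SessionUpdate): string {
  switch (update.kind) {
    case 'room-change':
      return `Session ${update.sessionId} moved to ${update.newRoom}`;
    case 'time-change':
      return `Session ${update.sessionId} rescheduled to ${update.newStart}`;
  }
}
```

The discriminated union also gives the compiler an exhaustiveness check: adding a third `kind` without handling it becomes a type error rather than a silent drop.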
The best distributed system is the one you don't need. But when you need it, do it right from the start — with event schemas, dead-letter queues, idempotency, observability, and a migration plan that lets you validate before committing.
What's next
We're currently extending the event-driven pipeline to handle:
- Push notifications (Firebase Cloud Messaging + APNs) as an additional delivery channel alongside WebSocket
- Event sourcing for the session scheduling module, where we need a complete audit trail of every change to the conference schedule
- Cross-event analytics that aggregate patterns across multiple conferences to help organizers improve their events
The event-driven foundation makes each of these extensions a matter of adding new consumers to existing events — not redesigning the system.
Does your system need to scale? Let's talk about designing an architecture that can handle growth.