enes · 2026-04-17 · engineering · 5 min read

Real-time with Socket.io across instances: what I learned the hard way.

Socket.io behind a load balancer with multiple backend instances breaks in three specific ways. Each one teaches you something the docs underplayed.

The Socket.io tutorial works on your laptop. One Node process, one client, messages flow. You ship it to production, scale to two backend instances behind a load balancer, and the messages stop reaching half the users. Welcome to the part the tutorial skipped.

The problem isn't Socket.io. The problem is that real-time over a fleet of stateless backends is a different shape than HTTP request/response, and the load balancer can't help you the way it usually does.

The setup

Imagine a collaborative platform, multiple users on the same page, edits and presence updates flowing in both directions. The frontend opens a Socket.io connection. The backend listens on a Socket.io server. Messages emit, clients receive. Until you have two backends.

Three things break the moment the second instance comes up.

Trap 1: the handshake doesn't survive the load balancer

Socket.io's transport layer negotiates. By default it starts with HTTP long-polling and upgrades to WebSocket once it confirms the connection works; if WebSocket is blocked, it stays on long-polling. Long-polling sends a chain of HTTP requests where each request must reach the same backend instance, because the connection state lives in memory on that instance.

A naive load balancer round-robins requests. The first long-polling request hits backend A, which creates a session. The second request hits backend B, which has no record of that session and rejects it. Connection fails. The client retries. The retry hits backend A or B at random. Sometimes the connection establishes, sometimes it doesn't.

The fix is sticky sessions: the load balancer needs to route every request from a given client to the same backend instance for the lifetime of the session. AWS ALB calls this stickiness, configured per target group with a duration-based cookie. Most load balancers have an equivalent. Without it, long-polling never works behind your fleet, and any client whose network blocks WebSocket falls into a permanent retry loop.

The tutorial assumed your load balancer would do this. Production load balancers don't, by default.
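If your load balancer is an ALB defined in code, the knob lives on the target group. A rough sketch in AWS CDK, with the construct names, port, and cookie duration invented for illustration:

```ts
// Sketch: duration-based stickiness on an ALB target group via AWS CDK v2.
// Names, port, and durations are placeholders, not values from this post.
import { App, Stack, Duration } from "aws-cdk-lib";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import * as elbv2 from "aws-cdk-lib/aws-elasticloadbalancingv2";

const app = new App();
const stack = new Stack(app, "RealtimeStack");
const vpc = new ec2.Vpc(stack, "Vpc");

const alb = new elbv2.ApplicationLoadBalancer(stack, "Alb", {
  vpc,
  internetFacing: true,
});
const listener = alb.addListener("Http", { port: 80 });

listener.addTargets("SocketBackends", {
  port: 3000,
  protocol: elbv2.ApplicationProtocol.HTTP,
  // Setting a cookie duration enables load-balancer-generated stickiness:
  // every request carrying the cookie goes back to the same target, which
  // is what the long-polling handshake needs.
  stickinessCookieDuration: Duration.hours(1),
});
```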

Trap 2: a message emitted on instance A never reaches instance B's clients

Sticky sessions get the connection up. They don't solve the next problem.

User Alice is connected to instance A. User Bob is connected to instance B. Bob does something, instance B handles the action and emits a message intended for Alice. Instance B holds Bob's connection but not Alice's. The message goes nowhere. Alice keeps refreshing wondering why nothing updates.

This is the case the docs put one paragraph on and call "the adapter problem." Socket.io ships a default in-memory adapter that only knows about the clients connected to its own instance. To reach across instances, you need a shared adapter, typically Redis. The Redis adapter publishes every emit over Redis Pub/Sub so all instances see it; each instance then forwards messages to its own connected clients.

Diagram: Bob emits to Instance B, which publishes to Redis Pub/Sub. Redis broadcasts to Instance A, which delivers to Alice. The cross-instance hop happens through Redis.

The Redis adapter is two NPM packages and a connection string. The load it puts on Redis is small for most products; a few hundred messages per second is nothing. The cost of skipping it is that half your users miss half your messages, in a non-deterministic way that's brutal to debug because it works on staging where you only have one instance.
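For reference, the wiring looks roughly like this, a minimal sketch using @socket.io/redis-adapter and the node-redis client, with the Redis URL and port as placeholders:

```ts
// Sketch: attaching the Redis adapter so emits fan out across instances.
// Run the same setup on every backend instance; the URL is a placeholder.
import { createServer } from "node:http";
import { Server } from "socket.io";
import { createClient } from "redis";
import { createAdapter } from "@socket.io/redis-adapter";

const httpServer = createServer();
const io = new Server(httpServer);

// The adapter needs two Redis connections: one to publish, one to subscribe.
const pubClient = createClient({ url: "redis://localhost:6379" });
const subClient = pubClient.duplicate();
await Promise.all([pubClient.connect(), subClient.connect()]);

io.adapter(createAdapter(pubClient, subClient));

// From here on, io.emit() and io.to(room).emit() reach clients on every
// instance, not just the ones connected to this process.
httpServer.listen(3000);
```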

Trap 3: deploys disconnect everyone, and the reconnect storm hurts

Your fleet rolls. A new image deploys, instances drain, clients reconnect. With a hundred concurrent users, a deploy means a hundred simultaneous reconnect attempts. The new instances boot, get hit by all of them at once, and CPU spikes to 100% during the warm-up window. Connection establishment is heavier than steady-state traffic.

Two parts to mitigation:

The reconnect side. The Socket.io client supports reconnection backoff with jitter. The defaults are okay, but explicitly set reconnectionDelay and reconnectionDelayMax so the storm spreads out. We use a base of 1 second and a max of 30 seconds: enough to keep individual reconnects fast while preventing a synchronized retry across the fleet (sketched below).

The deploy side. The instance refresh strategy controls how many old instances drain at once. If the ASG drains all of them simultaneously, the storm is unavoidable. Configure a rolling deploy with a small batch size and a warmup period: half the fleet stays up while the new half spins up, reconnects come in waves rather than all at once, and CPU stays manageable.
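Here's the client side of that, a minimal sketch with socket.io-client; the server URL is a placeholder and the delays are the 1s/30s values mentioned above:

```ts
// Sketch: spreading the reconnect storm with backoff and jitter.
// URL is a placeholder; tune the delays to your fleet.
import { io } from "socket.io-client";

const socket = io("https://app.example.com", {
  reconnection: true,
  reconnectionDelay: 1_000,      // first retry after ~1s
  reconnectionDelayMax: 30_000,  // back off to at most 30s between retries
  randomizationFactor: 0.5,      // jitter so clients don't retry in lockstep
});

socket.on("connect", () => {
  console.log("connected", socket.id);
});
```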

The trap is that none of this matters in single-instance dev. Reconnect storms only show up under real load with a real fleet.

What the docs gloss over

A WebSocket connection is a long-lived TCP connection. Your load balancer's default idle timeout might kill it after a minute of silence: ALB defaults to 60 seconds. Raise it to several minutes, or make sure periodic Socket.io pings keep traffic on the wire. Socket.io has a built-in heartbeat; verify it's tuned to fit inside your timeout window, not the other way around.
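A rough server-side sketch; the values mirror Socket.io's defaults and are illustrative, the point being that pingInterval stays well under the load balancer's idle timeout:

```ts
// Sketch: keeping heartbeat traffic on the wire more often than the load
// balancer's idle timeout. Values are illustrative.
import { createServer } from "node:http";
import { Server } from "socket.io";

const httpServer = createServer();

const io = new Server(httpServer, {
  pingInterval: 25_000, // send a ping every 25s, well under a 60s idle timeout
  pingTimeout: 20_000,  // drop the connection if no pong arrives within 20s
});

httpServer.listen(3000);
```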

If your application uses Socket.io rooms, the room membership is per-instance with the in-memory adapter and global with the Redis adapter. Your code that joins a room works the same way; the difference is in who can emit to that room. With the Redis adapter, any instance can emit to any room. Without it, only the instance hosting the connection can.
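A minimal sketch, reusing the io server from the adapter snippet above, with the room and event names invented for illustration:

```ts
// Sketch: rooms on a Redis-adapter-backed server (the `io` instance from the
// adapter sketch above). Room and event names are invented for illustration.
io.on("connection", (socket) => {
  // Joining a room is the same call with either adapter.
  socket.join("doc:42");

  socket.on("edit", (change) => {
    // With the Redis adapter this reaches every member of doc:42 on every
    // instance; with the in-memory adapter it only reaches members whose
    // connection lives on this instance.
    io.to("doc:42").emit("edit", change);
  });
});
```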

Authorization happens at handshake time. The token your client sends gets validated when the connection is established; after that, the connection lives on. If you revoke a user's access, their socket connection persists until they reconnect. Decide whether that's acceptable or whether you need active session validation on emit; most products are fine with the gap, but security-sensitive products aren't.
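The handshake-time check usually lives in a Socket.io middleware on the same io server. A minimal sketch, where verifyToken is a hypothetical stand-in for whatever your auth system provides:

```ts
// Sketch: validating the token once, at handshake time. After next() succeeds
// the connection lives until the client disconnects, which is exactly the gap
// described above. verifyToken is a hypothetical stand-in for your auth check.
io.use(async (socket, next) => {
  try {
    const token = socket.handshake.auth.token as string | undefined;
    if (!token) throw new Error("missing token");
    socket.data.user = await verifyToken(token); // hypothetical helper
    next();
  } catch {
    next(new Error("unauthorized")); // connection is rejected at handshake
  }
});
```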

The shape that works

Sticky sessions on the load balancer. Redis adapter on every instance. Tuned reconnect backoff on the client. Rolling deploys with controlled batch size. Idle timeout aligned with your heartbeat. Authorization model that matches your security needs.

That's the production checklist. The dev tutorial has none of it because dev runs one instance. Every team building real-time on a fleet rediscovers each one.

The trap underneath the traps: real-time is stateful, and your stateless backend deploy assumptions don't carry over. Treat connections as state that needs explicit migration paths the same way you'd treat database connections, and the rest follows.