Engineering · Elixir/OTP

TownSquare is a perfect match for BEAM

Why a realtime-presence widget is a textbook example for Elixir/OTP, and how the BEAM spares you the hard time Node (and friends) are giving you.

Josef Richter

TownSquare is a lovely little thing: drop in a script tag and a strip of stick figures appears at the bottom of your page, one per visitor reading right now. You can walk around, see who’s on the same article, wave, chat. No accounts, no history. Caue Napier built it, open-sourced it, and runs a free public server. Thank you for all that!

What the app actually is

Strip it down:

Mostly-independent scenes, one per site, each a roster of who’s present.
Many long-lived WebSocket connections: the visitors on the scene.
Per-visitor state (position, page, recent messages).
“Many tabs = one visitor”: deduplication.
Broadcast within a scene; clean up the instant someone leaves.

What caught my eye is that it’s an almost perfect use case for the BEAM and, at the same time, a genuinely troublesome one for Node. The same goes for Python, Ruby, and PHP, which share Node’s shape, and even for Go and Rust, which fix the cheap-concurrency half but still leave you without per-entity isolation or supervision.

That list up there is an actor-per-entity system with presence and pub/sub — the thing Erlang was built for: Ericsson needed millions of isolated little state machines surviving each other’s crashes. Coincidentally, they created a system that turned out to be perfect for today’s realtime multiuser (and multiagent) world wide web.

Reading the original JavaScript code — and especially its own docs/tech-debt.md — confirmed it. The author had already run a careful self-audit, and the list reads like a catalogue of exactly the troubles this shape causes on Node.

So I forked it and rewrote the realtime core in Elixir — same wire protocol, the exact same browser client, byte for byte — and ran TownSquare’s own protocol tests against it. The rest of this post is the why.

The one idea you have to swallow first

In Node your whole server is one process, one event loop, one heap. Every connection, scene, and visitor is an object in that shared space, taking turns on a single thread:

node

one process, one loop, one heap

+---------------------------------------+
|  event loop -> cb -> cb -> cb -> ...  |
|  scenes . identities . sockets        |
+---------------------------------------+

  ^ all shared, mutated in place
  one throw in any callback can drop everything

On the BEAM you get millions of tiny processes, each with its own heap and mailbox, scheduled preemptively, sharing nothing. They touch each other only by sending messages:

beam

many tiny processes, share nothing

+------------+  +------------+  +------------+             +------------+
|  socket A  |  |  socket B  |  |  socket C  |   --msg-->  |  scene     |
|  own heap  |  |  own heap  |  |  own heap  |             |  roster    |
|  own mbox  |  |  own mbox  |  |  own mbox  |             |  own heap  |
+------------+  +------------+  +------------+             +------------+

  a crash in one stays in that one

Two words in there are carrying the whole idea, so let’s slow down. A process here is nothing like an OS process — it’s handed an allocated slice of CPU time and automatically suspended when that slice is up, and the BEAM juggles millions of them at once.

And when Elixir people say “a process,” they almost always mean one running a receive loop: wait for a message, handle it, loop back, wait for the next. (A bare spawned process doesn’t loop at all — it runs its function once and exits. The loop is the pattern that turns a process into a long-lived thing you can talk to: a socket, a scene, a visitor.)

defmodule Demo do
  # runs its body once, then the process is gone
  def noloop, do: IO.puts("I run once and die")

  # loops forever, parked between messages
  def loop do
    receive do
      msg -> IO.puts("got #{inspect(msg)}")
    end

    loop()
  end

  def run do
    spawn(&noloop/0)   # runs and dies
    spawn(&loop/0)     # stays around; spawn returns its pid so you can message it
  end
end

pid = Demo.run()
send(pid, :hello)      # the looping process prints: got :hello
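Why the &loop/0? spawn needs a function value, and &name/0 is how you hand it a named function. Writing spawn(loop()) would call loop first and pass its return value instead, and since loop never returns, nothing would ever be spawned. (A bare spawn(loop) does not even compile in recent Elixir versions: without parentheses, loop is read as a variable.)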

“Scheduled preemptively” is the other word, and it’s the real aha. That suspend-when-the-slice-is-up happens whether the process cooperates or not — the scheduler takes the core back mid-flight and hands it to the next one. No single process can hog a core, and none has to remember to yield. The scheduler just keeps everyone moving. That’s the fair part.

Easiest to feel by misbehaving on purpose. Each of these processes would be fatal on Node’s single loop — a runaway computation, a near-endless hang, a flat-out crash:

defmodule Misbehave do
  # counts forever, computing bigger and bigger factorials
  def infinite_loop(n \\ 1) do
    _ = Enum.reduce(1..n, 1, &*/2)
    infinite_loop(n + 1)
  end

  # hangs forever (receive timeouts are capped at roughly 49.7 days,
  # so a year-long Process.sleep would raise instead of hanging)
  def hanging_loop, do: Process.sleep(:infinity)

  # blows up on the spot
  def explode_loop, do: raise("Kaboom!")

  # spawn all three: the rest of the system never even notices
  def run do
    spawn(&infinite_loop/0)
    spawn(&hanging_loop/0)
    spawn(&explode_loop/0)
  end
end

Misbehave.run()
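
A rough way to check the "never even notices" claim for yourself: start more busy processes than there are schedulers, then see whether a brand-new process still gets CPU time. This is a sketch only; FairnessDemo and its function names are made up for illustration, and the 5-second timeout is an arbitrary safety margin.

```elixir
defmodule FairnessDemo do
  # A CPU hog: computes ever larger factorials and never yields on purpose.
  def busy(n \\ 1) do
    _ = Enum.reduce(1..n, 1, &*/2)
    busy(n + 1)
  end

  # Start twice as many hogs as schedulers, then ask a fresh process to ping us.
  # Returns :ok if the ping arrives, :starved if it does not within 5 seconds.
  def run do
    hogs = for _ <- 1..(System.schedulers_online() * 2), do: spawn(&busy/0)
    parent = self()
    spawn(fn -> send(parent, :pong) end)

    result =
      receive do
        :pong -> :ok
      after
        5_000 -> :starved
      end

    # Clean up the hogs so they do not keep burning CPU.
    Enum.each(hogs, &Process.exit(&1, :kill))
    result
  end
end

IO.inspect(FairnessDemo.run())
```

With preemptive scheduling the ping should get through even though every scheduler has a hog queued on it.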

It’s also smart. The moment a process calls receive with an empty mailbox, the scheduler knows there’s no point spending cycles on it — so it suspends the process early, before its slice is even up, and wakes it the instant a message arrives. Same for a process that’s sitting in :timer.sleep or blocked on an I/O call like File.read: it’s parked, not spinning, and the core it would have held goes to a process with real work to do. Fair, and smart — just how you like your schedulers.

That last part matters more than it sounds for an app like this. Most visitors are just reading — connection open, doing nothing for minutes at a stretch. On the BEAM each one is a suspended process that costs essentially nothing until a message wakes it, so a thousand people quietly reading is a thousand parked processes — not a thousand things elbowing for one event loop’s attention.
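
To put a rough number on "essentially nothing", here is a small sketch that parks ten thousand idle readers and asks the runtime what each one costs. ParkedDemo is a hypothetical module written for this post, not part of the port, and the exact figure depends on your OTP version and architecture.

```elixir
defmodule ParkedDemo do
  # An idle reader: blocks in receive until told to stop.
  def reader do
    receive do
      :stop -> :ok
    end
  end

  # Spawn `n` parked readers and return their pids.
  def park(n), do: for(_ <- 1..n, do: spawn(&reader/0))

  # Average memory per parked process in bytes, as reported by Process.info/2.
  def avg_bytes(pids) do
    total =
      pids
      |> Enum.map(fn pid ->
        {:memory, bytes} = Process.info(pid, :memory)
        bytes
      end)
      |> Enum.sum()

    div(total, length(pids))
  end
end

pids = ParkedDemo.park(10_000)
IO.puts("#{length(pids)} parked readers, about #{ParkedDemo.avg_bytes(pids)} bytes each")
Enum.each(pids, &send(&1, :stop))
```

On a typical 64-bit machine the per-process figure tends to land in the low kilobytes, which is why a thousand quiet readers barely register.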

And those slices aren’t all on one core. The BEAM runs a scheduler on every CPU core, and because the processes share nothing, they don’t just take turns — they genuinely run in parallel. Node uses one core per process; to touch a second core you start a second Node process and wire up the coordination by hand. On the BEAM you get all your cores for nothing — and the same share-nothing model that fans work across cores fans it across machines too: add a node to the cluster and its processes talk to the rest by sending the exact same messages. That’s the road from a thousand readers to millions — more cores, more boxes, no rewrite.
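
As an illustration of "all your cores for nothing", the sketch below fans CPU-bound jobs out over one process per job, capped at the number of schedulers. CoresDemo is a made-up module; Task.async_stream/3 and System.schedulers_online/0 are standard library calls.

```elixir
defmodule CoresDemo do
  # CPU-bound busy work: n! via a plain reduce.
  def factorial(n), do: Enum.reduce(1..n, 1, &*/2)

  # Run each job in its own process, at most one per scheduler at a time.
  # Results come back in the same order as the input list.
  def parallel(jobs) do
    jobs
    |> Task.async_stream(&factorial/1,
      max_concurrency: System.schedulers_online(),
      ordered: true
    )
    |> Enum.map(fn {:ok, result} -> result end)
  end
end

IO.puts("schedulers online: #{System.schedulers_online()}")
IO.inspect(CoresDemo.parallel([5, 10, 20]))
```

There is no worker pool to configure and no second OS process to coordinate with; the schedulers pick up whichever processes are ready.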

One OS process holds millions of tiny Elixir processes, each filling a pie as it earns its slice of CPU time. The BEAM runs a scheduler on every CPU core, so the schedulers keep grabbing ready processes and running them — every core busy, in true parallel. Diagram inspired by The Pragmatic Studio’s Elixir course — strongly recommended, if you want these concepts to really ‘click’.

If that picture lands, everything below is just consequences.

The principles, mapped to the tech-debt

Here’s the BEAM mental model in five pieces. Against each is the line (or two) from TownSquare’s own docs/tech-debt.md it makes disappear — the author’s words, not mine.

1 · A connection is a process → crashes are contained

T1 (top priority) — “One unguarded throw kills every connection — no try/catch around request dispatch, no uncaughtException guard, no SIGTERM drain.”

Node:  bad frame -> throw -> unhandled -> every socket on every site drops
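BEAM:  bad frame -> that ONE process crashes -> that one socket closes, every other connection carries on

Each connection is its own process, so an unguarded throw crashes that one socket and nothing else: the visitor’s client reconnects into a fresh process, while the supervised pieces around it (the listener, the scenes) keep running or are restarted by their supervisor. There’s nothing to wrap and no global net to bolt on; the bug class doesn’t exist.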
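
The containment can be sketched without any web server at all. Below, two stand-in "connection" processes run the same loop; one is fed a bad frame and dies, the other keeps answering. CrashDemo and the frame shapes are assumptions for illustration, not the port's actual socket code.

```elixir
defmodule CrashDemo do
  # A stand-in for a connection process: echoes frames until it gets a bad one.
  def conn do
    receive do
      {:frame, :bad} -> raise "bad frame"
      {:frame, data, from} -> send(from, {:echo, data})
    end

    conn()
  end

  # Returns {a_alive?, b_alive?, echoed_data} after crashing only process a.
  def run do
    a = spawn(&conn/0)
    b = spawn(&conn/0)
    ref = Process.monitor(a)

    # Crash process a only (this logs an error report, which is expected).
    send(a, {:frame, :bad})

    # Wait until the runtime confirms that a is gone.
    receive do
      {:DOWN, ^ref, :process, ^a, _reason} -> :ok
    end

    # Process b never noticed and still answers.
    send(b, {:frame, "hi", self()})

    receive do
      {:echo, data} -> {Process.alive?(a), Process.alive?(b), data}
    after
      1_000 -> :timeout
    end
  end
end

IO.inspect(CrashDemo.run())
```

Note that the processes are started with spawn rather than spawn_link, so the crash is not propagated to the caller either.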

2 · State is a process you message → the scene is the roster

In Node a scene is data: a scenes Map → an identities Map → a Set of sockets, all mutated in place. On the BEAM a scene is a GenServer — a process whose state map is the roster. You never reach in and read it; you send it a message and it updates itself:

socket --{:move, pid, x}--> scene process
   the scene then updates its own state.ids[id]
   and broadcasts the change to the sockets
   (one message at a time -> no locks, no races)

One process handles one message at a time, so there are no locks and no races — and the roster and the pub/sub are the same process, with no separate presence library to keep in sync. (No tech-debt line here: this is the backbone the rest stand on.)
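
A minimal sketch of that shape, assuming a roster keyed by visitor id with a single x position and a single socket pid per visitor. SceneSketch and its message names are hypothetical; the real scene tracks more state and several sockets per visitor.

```elixir
defmodule SceneSketch do
  use GenServer

  # Client API: every interaction is a message to the scene process.
  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)
  def join(scene, id, x), do: GenServer.call(scene, {:join, id, x})
  def move(scene, id, x), do: GenServer.cast(scene, {:move, id, x})
  def roster(scene), do: GenServer.call(scene, :roster)

  @impl true
  def init(:ok), do: {:ok, %{ids: %{}}}

  @impl true
  def handle_call({:join, id, x}, {pid, _tag}, state) do
    # The caller's pid stands in for the visitor's socket process.
    {:reply, :ok, put_in(state.ids[id], %{x: x, socket: pid})}
  end

  def handle_call(:roster, _from, state) do
    {:reply, Map.new(state.ids, fn {id, v} -> {id, v.x} end), state}
  end

  @impl true
  def handle_cast({:move, id, x}, state) do
    case state.ids do
      %{^id => visitor} ->
        state = put_in(state.ids[id], %{visitor | x: x})

        # Broadcast the change to every other visitor's socket.
        for {other, %{socket: pid}} <- state.ids, other != id do
          send(pid, {:moved, id, x})
        end

        {:noreply, state}

      _unknown ->
        {:noreply, state}
    end
  end
end
```

Because the scene handles one message at a time, a roster call made after a move from the same process always sees the updated position, with no lock anywhere in the code.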

3 · Death is a message → disconnect handles itself

H5 — “Replaced WebSockets leak on reconnect — old socket listeners not removed/closed.”

H7 — “Leave timers fire against deleted scenes — not cleared on site/scene deletion or shutdown.”

This is the one that deletes a whole category of bugs. When a socket joins, the scene calls Process.monitor(pid). When that process exits — tab closed, laptop slept, crash, network drop — the scene receives one message:

{:DOWN, _ref, :process, pid, _}      <- "that socket is gone"

That message is the leave event. There’s no ws.on('close') to remember to wire up — that’s H5 — and the reconnect-grace timer is created inside the scene with Process.send_after(self(), …), so it can’t outlive the scene it guards — that’s H7. Both gone by construction.
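
Here is a small sketch of the monitor pattern on its own, assuming a scene whose state is just a map of monitor reference to socket pid. PresenceSketch is a hypothetical module for illustration; the port's scene also handles the reconnect-grace window, which is left out here.

```elixir
defmodule PresenceSketch do
  use GenServer

  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, %{})
  def join(scene, pid), do: GenServer.call(scene, {:join, pid})
  def present(scene), do: GenServer.call(scene, :present)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:join, pid}, _from, state) do
    # From now on the runtime will tell us when this process exits, for any reason.
    ref = Process.monitor(pid)
    {:reply, :ok, Map.put(state, ref, pid)}
  end

  def handle_call(:present, _from, state) do
    {:reply, Map.values(state), state}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, state) do
    # The :DOWN message is the leave event: drop the visitor from the roster.
    {:noreply, Map.delete(state, ref)}
  end
end
```

Kill a joined process with Process.exit(pid, :kill) and it drops out of PresenceSketch.present/1 shortly afterwards, without any close handler having been registered.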

4 · “Let it crash” → supervisors, not safety nets

T5 — “Plugin registration runs at module top-level with no try/catch — one malformed plugin crashes boot.”

            Supervisor
           /     |      \
       scene   scene   Bandit --> one process per connection

   one process crashes -> only that one is restarted

Each scene starts as a supervised process; one that fails to start is isolated and logged, not fatal to the others or to boot. You stop writing defensive try/catch as a load-bearing safety net — isolation plus restart is the model.
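
The tree in the diagram can be sketched with the standard library alone. The names SketchRegistry, SketchSupervisor, SupervisedScene, and SupervisedDemo are assumptions for this example, and the polling helper exists only because a restart happens asynchronously.

```elixir
defmodule SupervisedScene do
  use GenServer

  # Each scene registers under its site key, so it can be found after a restart.
  def start_link(key), do: GenServer.start_link(__MODULE__, key, name: via(key))
  def whereis(key), do: GenServer.whereis(via(key))
  defp via(key), do: {:via, Registry, {SketchRegistry, key}}

  @impl true
  def init(key), do: {:ok, %{key: key, ids: %{}}}
end

defmodule SupervisedDemo do
  # Boot a tiny tree: a registry plus a dynamic supervisor for scenes.
  def boot do
    children = [
      {Registry, keys: :unique, name: SketchRegistry},
      {DynamicSupervisor, name: SketchSupervisor, strategy: :one_for_one}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end

  def start_scene(key) do
    DynamicSupervisor.start_child(SketchSupervisor, {SupervisedScene, key})
  end

  # Poll until the supervisor has registered a replacement for `old`.
  def await_replacement(key, old, tries \\ 100) do
    case SupervisedScene.whereis(key) do
      pid when is_pid(pid) and pid != old ->
        {:ok, pid}

      _ when tries > 0 ->
        Process.sleep(10)
        await_replacement(key, old, tries - 1)

      _ ->
        :timeout
    end
  end

  # Crash one scene and report what happened to it and to its sibling.
  def run do
    {:ok, _sup} = boot()
    {:ok, a} = start_scene("site-a")
    {:ok, b} = start_scene("site-b")

    # Simulate a crash in one scene.
    Process.exit(a, :kill)
    {:ok, a2} = await_replacement("site-a", a)

    %{restarted?: a2 != a, other_untouched?: SupervisedScene.whereis("site-b") == b}
  end
end
```

SupervisedDemo.run/0 should report that the killed scene came back under a new pid while its sibling kept the pid it started with.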

5 · Preemptive scheduling → two more just vanish

T3 — “Synchronous full-registry saveSites() … on the WS/admin hot path blocks the event loop.”

H6 — “Unbounded in-memory growth — per-IP-per-scene activity map and scenes never bounded.”

There’s no shared event loop to block: a process doing slow disk I/O can’t stall message delivery to any other, so T3 isn’t a hazard. And an empty scene process terminates and the BEAM reclaims its heap, so the hand-rolled eviction behind H6 becomes “let the process exit.”

That’s six tech-debt items and six bug classes that don’t survive the move — none about cleverness, all about which runtime you started from.

Now the code is boring — and that’s the point

Because the runtime does the hard parts, the port is small. The socket is four callbacks:

lib/town_square_beam/socket.ex

def init(opts), do: {:ok, %{scene: Scene.ensure(opts[:scene_key]), id: nil}}

def handle_in({text, [opcode: :text]}, state),   # ~ ws "message"
  do: dispatch(Jason.decode!(text), state)

def handle_info({:ws_push, payload}, state),     # a scene broadcast → this tab
  do: {:push, {:text, payload}, state}

def terminate(_reason, _state), do: :ok          # ~ ws "close" (scene cleans up via monitor)

And the broadcast, the entire “pub/sub”, is a loop: encode the JSON once (Jason.encode! playing the part of JSON.stringify), then send to many.

lib/town_square_beam/scene.ex

defp broadcast(state, frame, except: pid) do
  payload = Jason.encode!(frame)
  for i <- Map.values(state.ids), {p, _} <- i.sockets, p != pid,
    do: send(p, {:ws_push, payload})
end

No Phoenix, no Channels, no phoenix_pubsub, no Phoenix.Presence — on one node they’re redundant. Same altitude as Node’s http + ws, so the only variable left between the two is the runtime.

Proof: their own tests, green

I pointed TownSquare’s own smoke-test.js — same ws client, same JSON frames — at the Elixir server:

$ node parity/core-smoke.mjs
Core parity test passed.

That covers identity dedup across tabs, the peer snapshot, browserId never leaking, join/leave, move/say/typing/gestures/reading, server-derived reading labels, rate-limiting, the 140-char cap, seat arbitration, and the multi-tab grace-window semantics. The same client can’t tell the two servers apart.

What I didn’t port — and why that’s the argument

The Elixir core is ~740 lines; server.js is ~3,300. That’s not a fair 4.5× — server.js also has the site registry, admin API, moderation, the world map, proof-of-work, a Plausible proxy, plugins. I ported none of it.

That omission is the point: all of it is ordinary HTTP CRUD where the runtime doesn’t matter — roughly the same size in any language. The BEAM’s advantage is concentrated entirely in the realtime core, which is exactly the part where the tech-debt list evaporates. The boring 80% stays boring everywhere; the hard 20% is the 20% the BEAM was designed for.

When I’d reach for it

Greenfield with the words presence, realtime, multiplayer, or chat in the brief: the BEAM, without a second thought. An existing, working, not-growing product: leave it — rewriting working software is usually a mistake.

One honest caveat on the webring Caue wants next: today “walk to a neighbour” is just a navigation to their site, which is correct and needs no special runtime — and on his single hosted server every square is already one process anyway. The BEAM only pulls ahead if this becomes a federation of independently-run squares sharing live presence, where an all-BEAM cluster gets cross-node messaging for free (:pg, in the standard library). Real, but a narrower claim than it first sounds.
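
For a feel of what :pg offers, here is a single-node sketch of group membership and broadcast. The scope name, the {:square, name} group key, and the PgSketch module are all assumptions for illustration; on a connected cluster the same calls would also return members living on other nodes.

```elixir
defmodule PgSketch do
  # Hypothetical scope name; :pg scopes keep unrelated groups apart.
  @scope :town_square_sketch

  def start, do: :pg.start_link(@scope)
  def join(square, pid), do: :pg.join(@scope, {:square, square}, pid)
  def members(square), do: :pg.get_members(@scope, {:square, square})

  # Send `msg` to every process that joined the given square.
  def broadcast(square, msg) do
    for pid <- members(square), do: send(pid, msg)
    :ok
  end
end
```

A process that has joined a square via PgSketch.join/2 receives whatever PgSketch.broadcast/2 sends to that square, and :pg drops it from the group automatically when it exits.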

Don’t reach for Node by reflex. Reach for the runtime built for exactly this, and spend your cleverness on the part your users actually see.

After-notes: from core to deployable

The port above proves the point, but a demo server needs a little more: lock the socket to one site, survive a bot opening ten thousand connections, and ship as a single artifact. Three small additions — and each is smaller, or differently shaped, on the BEAM, for the same runtime reason the rest of the post is about.

Origin lock-down → reject before the process exists

On Node the origin and abuse checks live tangled inside the one big message handler — the connection already exists by the time you look at it. Here the gate is just a branch before the upgrade, so a rejected request never becomes a process at all:

lib/town_square_beam/router.ex

cond do
  not origin_allowed?(origin) ->
    send_resp(conn, 403, "origin not allowed")

  RateLimit.take(client_ip(conn)) == :rate_limited ->
    send_resp(conn, 429, "too many connections")

  true ->
    WebSockAdapter.upgrade(conn, Socket, ...)   # only now is a process spawned
end

Same lesson as the crash story: the cheapest place to stop a bad connection is before it becomes a process, and the runtime makes “before it exists” an obvious seam.
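
The post does not show the body of origin_allowed?, so the following is only one plausible shape for it: parse the Origin header and compare the host exactly against an allow-list. OriginSketch and the example hosts are hypothetical.

```elixir
defmodule OriginSketch do
  # Hypothetical allow-list; a real server would read this from configuration.
  @allowed_hosts ["example.com", "www.example.com"]

  def origin_allowed?(origin, allowed \\ @allowed_hosts)

  # A missing Origin header is rejected.
  def origin_allowed?(nil, _allowed), do: false

  def origin_allowed?(origin, allowed) when is_binary(origin) do
    case URI.parse(origin) do
      %URI{scheme: scheme, host: host}
      when scheme in ["http", "https"] and is_binary(host) ->
        # Exact host match, so "example.com.evil.net" does not slip through.
        host in allowed

      _other ->
        false
    end
  end
end
```

Because it is a pure function, it can sit in front of the upgrade and be unit-tested without opening a single socket.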

Rate limiting → ETS is the answer to shared state under real parallelism

The Node limiter is a Map of timestamps, kept safe only because the single event loop serialises every access. The BEAM has a scheduler on every core (the whole point of that animation up top), so mutable state shared the Node way would be a genuine data race. ETS is the runtime’s answer: an in-memory table whose counter updates are atomic and isolated, with no lock for you to manage, so the hot path stays concurrent across all those cores while a supervised owner process sweeps it:

lib/town_square_beam/rate_limit.ex

# atomic across every scheduler — no lock, no GenServer call on the hot path
count = :ets.update_counter(table, {ip, bucket}, {2, 1}, {{ip, bucket}, 0, expires})
if count <= limit, do: :ok, else: :rate_limited

It’s the one spot where the parallelism the post celebrates would actually bite you — and OTP hands you the tool that turns it into a non-issue.
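
For context, here is a self-contained sketch of a fixed-window limiter built around that same update_counter call. The table name, window size, limit, and the sweep function are assumptions; in the port a supervised owner process creates the table and runs the sweep on a timer.

```elixir
defmodule RateLimitSketch do
  # Hypothetical settings: at most 5 connections per IP per 10-second window.
  @table :rate_limit_sketch
  @window_ms 10_000
  @limit 5

  # Create the shared table once. :public lets any process bump counters directly.
  def init do
    :ets.new(@table, [:named_table, :public, :set, write_concurrency: true])
    :ok
  end

  # Count one attempt for `ip` in the current window.
  def take(ip, now_ms \\ System.monotonic_time(:millisecond)) do
    bucket = Integer.floor_div(now_ms, @window_ms)
    expires = (bucket + 1) * @window_ms

    # Atomically increment element 2, inserting {key, 0, expires} first if missing.
    count = :ets.update_counter(@table, {ip, bucket}, {2, 1}, {{ip, bucket}, 0, expires})
    if count <= @limit, do: :ok, else: :rate_limited
  end

  # Delete rows whose window has passed; returns how many were removed.
  def sweep(now_ms \\ System.monotonic_time(:millisecond)) do
    :ets.select_delete(@table, [{{:_, :_, :"$1"}, [{:"=<", :"$1", now_ms}], [true]}])
  end
end
```

Keying rows by {ip, bucket} means a new window simply starts a new row, and the sweep only has to look at the expiry column.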

Releasing it → “no framework” is a small artifact

The release is the BEAM plus four hex packages (Bandit, Plug, WebSock, Jason): no Phoenix, no Node, no separate presence or pub/sub service, because those pieces come with the Elixir/OTP standard library (Registry, DynamicSupervisor, send/2). Two things fall out: one self-contained image configured entirely by the environment at boot (config/runtime.exs), and a supervision tree that is the deployment description. The rate limiter and the scene supervisor are siblings in one readable list, not glue:

lib/town_square_beam/application.ex

children = [
  {Registry, keys: :unique, name: SceneRegistry},
  {DynamicSupervisor, name: SceneSupervisor, strategy: :one_for_one},
  TownSquareBeam.RateLimit,
  {Bandit, plug: Router, port: port}
]

One honest gotcha. The router resolves the widget directory with Path.expand("../../public", __DIR__) baked into a module attribute — which is a build-time path. It works in dev, but the release image has to put public/ back at that same path for it to resolve. A small reminder that __DIR__ is compiled in, not discovered at run time.

None of this changes the thesis — it’s plumbing, not the core — but it’s a fair illustration that the runtime keeps paying out past the realtime core: less to wire, fewer ways to get it wrong.
