Engineering · Elixir/OTP
TownSquare is a perfect match for BEAM
Why a realtime-presence widget is a textbook case for Elixir/OTP, and how the BEAM removes the trouble Node (and friends) are giving you.
TownSquare is a lovely little thing: drop in a script tag and a strip of stick figures appears at the bottom of your page, one per visitor reading right now. You can walk around, see who’s on the same article, wave, chat. No accounts, no history. Caue Napier built it, open-sourced it, and runs a free public server. Thank you for all that!
What the app actually is
Strip it down:
- Mostly-independent scenes — one per site — each a roster of who’s present.
- Many long-lived WebSocket connections – visitors on the scene.
- Per-visitor state (position, page, recent messages).
- “Many tabs = one visitor” – deduplication.
- Broadcast within a scene; clean up the instant someone leaves.
What caught my eye is that it’s almost a perfect use case for the BEAM and, at the same time, a genuinely troublesome one for Node. The same goes for Python, Ruby, and PHP, which share Node’s shape, and even for Go and Rust, which fix the cheap-concurrency half but still leave you without built-in per-entity isolation or supervision.
That list up there is an actor-per-entity system with presence and pub/sub — the thing Erlang was built for: Ericsson needed millions of isolated little state machines surviving each other’s crashes. Coincidentally, they created a system that turned out to be perfect for today’s realtime multiuser (and multiagent) world wide web.
Reading the original JavaScript code, and especially its own docs/tech-debt.md, confirmed it. The author had already run a careful self-audit, and the list reads like a catalogue of exactly the troubles this shape causes on Node.
So I forked it and rewrote the realtime core in Elixir — same wire protocol, the exact same browser client, byte for byte — and ran TownSquare’s own protocol tests against it. The rest of this post is the why.
The one idea you have to swallow first
In Node your whole server is one process, one event loop, one heap. Every connection, scene, and visitor is an object in that shared space, taking turns on a single thread:
     one process, one loop, one heap
+---------------------------------------+
| event loop -> cb -> cb -> cb -> ...   |
| scenes . identities . sockets         |
+---------------------------------------+
     ^ all shared, mutated in place
one throw in any callback can drop everything
On the BEAM you get millions of tiny processes, each with its own heap and mailbox, scheduled preemptively, sharing nothing. They touch each other only by sending messages:
many tiny processes, share nothing
+------------+ +------------+ +------------+          +------------+
|  socket A  | |  socket B  | |  socket C  | --msg--> |   scene    |
|  own heap  | |  own heap  | |  own heap  |          |   roster   |
|  own mbox  | |  own mbox  | |  own mbox  |          |  own heap  |
+------------+ +------------+ +------------+          +------------+
a crash in one stays in that one
Two words in there are carrying the whole idea, so let’s slow down. A process here is nothing like an OS process: it’s a lightweight unit managed by the VM itself, handed a slice of CPU time (measured as a budget of “reductions”, roughly function calls, rather than by the clock) and automatically suspended when that slice is up, and the BEAM juggles millions of them at once.
And when Elixir people say “a process,” they almost always mean one running a receive loop: wait for a message, handle it, loop back, wait for the next. (A bare spawned process doesn’t loop at all — it runs its function once and exits. The loop is the pattern that turns a process into a long-lived thing you can talk to: a socket, a scene, a visitor.)
# (both defs live inside a module; def is not allowed at the top level)
# runs its body once, then the process is gone
def noloop, do: IO.puts("I run once and die")
spawn(&noloop/0)

# loops forever, parked between messages
def loop do
  receive do
    msg -> IO.puts("got #{inspect(msg)}")
  end

  loop()
end
spawn(&loop/0)
# the first runs and dies; the second stays around,
# and spawn returns its pid so you can message it
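Why the &loop/0? spawn needs a function value, and &name/0 is how you hand it a named function. Writing spawn(loop()) instead would call loop right there, in the current process, and try to pass its return value to spawn. This loop never returns, so nothing would ever be spawned. (A bare spawn(loop) doesn’t even compile on current Elixir: loop without parentheses is read as an undefined variable.)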
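To make the "you can message it" part concrete, here is a minimal sketch of a round trip with a looping process. The `Echo` module and the `:ping`/`:pong` message shapes are made up for illustration; they are not part of TownSquare.

```elixir
defmodule Echo do
  # Receive loop: wait for a ping, answer the sender, wait again.
  def loop do
    receive do
      {:ping, from} -> send(from, {:pong, self()})
    end

    loop()
  end
end

# spawn returns the pid; that pid is the only handle we need.
pid = spawn(&Echo.loop/0)
send(pid, {:ping, self()})

reply =
  receive do
    {:pong, ^pid} -> :pong
  after
    1_000 -> :timeout
  end

IO.inspect(reply)
```

The process is still alive after replying, parked in `receive` until the next message shows up.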
“Scheduled preemptively” is the other word, and it’s the real aha. That suspend-when-the-slice-is-up happens whether the process cooperates or not — the scheduler takes the core back mid-flight and hands it to the next one. No single process can hog a core, and none has to remember to yield. The scheduler just keeps everyone moving. That’s the fair part.
Easiest to feel by misbehaving on purpose. Each of these processes would be fatal on Node’s single loop — a runaway computation, a near-endless hang, a flat-out crash:
# counts forever, computing bigger and bigger factorials
def infinite_loop(n \\ 1) do
  _ = Enum.reduce(1..n, 1, &*/2)
  infinite_loop(n + 1)
end

# hangs forever; :infinity because integer timeouts are capped at
# 4_294_967_295 ms (about 49.7 days), so a year in ms would raise
def hanging_loop, do: Process.sleep(:infinity)

# blows up on the spot
def explode_loop, do: raise("Kaboom!")

# spawn all three: the rest of the system never even notices
spawn(&infinite_loop/0)
spawn(&hanging_loop/0)
spawn(&explode_loop/0)
It’s also smart. The moment a process calls receive with an empty mailbox,
the scheduler knows there’s no point spending cycles on it — so it suspends the process
early, before its slice is even up, and wakes it the instant a message arrives. Same for a process
that’s sitting in :timer.sleep or blocked on an I/O call like
File.read: it’s parked, not spinning, and the core it would have held goes to a
process with real work to do. Fair, and smart — just how you like your schedulers.
That last part matters more than it sounds for an app like this. Most visitors are just reading — connection open, doing nothing for minutes at a stretch. On the BEAM each one is a suspended process that costs essentially nothing until a message wakes it, so a thousand people quietly reading is a thousand parked processes — not a thousand things elbowing for one event loop’s attention.
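A rough way to see the "thousand parked readers" claim for yourself. This is a sketch under assumptions: `Reader` and the `:wave` message are hypothetical stand-ins for an idle visitor connection, and the memory figure it prints will vary by OTP version and platform.

```elixir
defmodule Reader do
  # An idle visitor: parked in receive until somebody waves.
  def wait do
    receive do
      {:wave, from} -> send(from, {:waved_back, self()})
    end

    wait()
  end
end

readers = for _ <- 1..1_000, do: spawn(&Reader.wait/0)

# Give them a moment to reach receive, then ask the VM what they are doing.
Process.sleep(50)
parked = Enum.count(readers, &(Process.info(&1, :status) == {:status, :waiting}))
{:memory, bytes} = Process.info(hd(readers), :memory)
IO.puts("#{parked} of #{length(readers)} parked, about #{bytes} bytes each")

# Waking one of them is just a message.
send(hd(readers), {:wave, self()})

waved =
  receive do
    {:waved_back, _pid} -> :waved_back
  after
    1_000 -> :timeout
  end
```

None of the parked processes is polled or ticked; each one only costs its small heap until a message arrives.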
And those slices aren’t all on one core. The BEAM runs a scheduler on every CPU core, and because the processes share nothing, they don’t just take turns — they genuinely run in parallel. Node uses one core per process; to touch a second core you start a second Node process and wire up the coordination by hand. On the BEAM you get all your cores for nothing — and the same share-nothing model that fans work across cores fans it across machines too: add a node to the cluster and its processes talk to the rest by sending the exact same messages. That’s the road from a thousand readers to millions — more cores, more boxes, no rewrite.
If that picture lands, everything below is just consequences.
The principles, mapped to the tech-debt
Here’s the BEAM mental model in five pieces. Against each is the line (or two) from
TownSquare’s own docs/tech-debt.md it makes disappear — the author’s
words, not mine.
1 · A connection is a process → crashes are contained
T1 (top priority) — “One unguarded throw kills every connection — no try/catch around request dispatch, no
uncaughtException guard, no SIGTERM drain.”
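Node: bad frame -> throw -> unhandled -> every socket on every site drops
BEAM: bad frame -> that ONE process crashes -> the client reconnects into a fresh one
Each connection is its own process, so an unguarded throw crashes that one socket and nothing else. The crash is logged, the dead process is cleaned up, and the browser reconnects into a fresh one (a dropped connection can’t be re-dialled from the server side; supervisor restarts are for long-lived processes such as scenes). There’s nothing to wrap and no global net to bolt on: the bug class doesn’t exist.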
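Here is a small sketch of that containment, with two fake "connections" under one supervisor. `FakeSocket`, the `:bad_frame` cast, and the `restart: :temporary` setting are assumptions of this sketch, not code from the port.

```elixir
defmodule FakeSocket do
  # Assumption: connections are temporary children, since a dropped
  # socket cannot be restarted from the server side.
  use GenServer, restart: :temporary

  def start_link(id), do: GenServer.start_link(__MODULE__, id)

  @impl true
  def init(id), do: {:ok, id}

  @impl true
  def handle_call(:ping, _from, id), do: {:reply, {:pong, id}, id}

  @impl true
  def handle_cast(:bad_frame, _id), do: raise("bad frame")
end

{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)
{:ok, a} = DynamicSupervisor.start_child(sup, {FakeSocket, :a})
{:ok, b} = DynamicSupervisor.start_child(sup, {FakeSocket, :b})

# Crash connection a with an unguarded raise and wait until it is gone.
ref = Process.monitor(a)
GenServer.cast(a, :bad_frame)

crashed =
  receive do
    {:DOWN, ^ref, :process, ^a, _reason} -> true
  after
    1_000 -> false
  end

# Connection b never noticed.
neighbour = GenServer.call(b, :ping)
```

The crash shows up as an error in the log, and that is the whole blast radius: `b` and the supervisor keep running.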
2 · State is a process you message → the scene is the roster
In Node a scene is data: a scenes Map → an identities Map
→ a Set of sockets, all mutated in place. On the BEAM a scene is a
GenServer — a process whose state map is the roster. You never reach
in and read it; you send it a message and it updates itself:
socket --{:move, pid, x}--> scene process
the scene then updates its own state.ids[id]
and broadcasts the change to the sockets
(one message at a time -> no locks, no races)
One process handles one message at a time, so there are no locks and no races — and the roster and the pub/sub are the same process, with no separate presence library to keep in sync. (No tech-debt line here: this is the backbone the rest stand on.)
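A stripped-down sketch of that shape. `SceneSketch`, its function names, and the `{:moved, id, x}` frame are hypothetical; the real port's scene carries more state and a different wire format.

```elixir
defmodule SceneSketch do
  use GenServer

  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, :ok)
  def join(scene, id, socket), do: GenServer.call(scene, {:join, id, socket})
  def move(scene, id, x), do: GenServer.cast(scene, {:move, id, x})
  def roster(scene), do: GenServer.call(scene, :roster)

  @impl true
  def init(:ok), do: {:ok, %{ids: %{}}}

  @impl true
  def handle_call({:join, id, socket}, _from, state) do
    {:reply, :ok, put_in(state.ids[id], %{x: 0, socket: socket})}
  end

  def handle_call(:roster, _from, state), do: {:reply, state.ids, state}

  @impl true
  def handle_cast({:move, id, x}, %{ids: ids} = state) when is_map_key(ids, id) do
    # Update our own state, then tell every other socket about it.
    state = put_in(state.ids[id].x, x)
    for {other, %{socket: pid}} <- state.ids, other != id, do: send(pid, {:moved, id, x})
    {:noreply, state}
  end

  # Moves from ids that never joined are ignored.
  def handle_cast({:move, _id, _x}, state), do: {:noreply, state}
end

{:ok, scene} = SceneSketch.start_link()
# For the demo, this process plays the socket for both visitors.
:ok = SceneSketch.join(scene, "ann", self())
:ok = SceneSketch.join(scene, "bob", self())
SceneSketch.move(scene, "ann", 42)

seen =
  receive do
    {:moved, id, x} -> {:moved, id, x}
  after
    1_000 -> :timeout
  end
```

Only bob's socket hears about ann's move, and the roster is read by asking the scene, never by reaching into it.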
3 · Death is a message → disconnect handles itself
H5 — “Replaced WebSockets leak on reconnect — old socket listeners not removed/closed.”
H7 — “Leave timers fire against deleted scenes — not cleared on site/scene deletion or shutdown.”
This is the one that deletes a whole category of bugs. When a socket joins, the scene calls
Process.monitor(pid). When that process exits — tab closed, laptop slept, crash,
network drop — the scene receives one message:
{:DOWN, _ref, :process, pid, _} <- "that socket is gone"
That message is the leave event. There’s no ws.on('close') to remember to
wire up — that’s H5 — and the reconnect-grace timer is created
inside the scene with Process.send_after(self(), …), so it can’t
outlive the scene it guards — that’s H7. Both gone by construction.
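A minimal sketch of the monitor-then-grace-timer flow. The `PresenceSketch` module, the 50 ms window, and the `{:left, id}` notification are assumptions made for the example; the real grace window also lets a reconnecting tab cancel the leave, which is left out here.

```elixir
defmodule PresenceSketch do
  use GenServer

  # Hypothetical grace window, kept tiny so the demo finishes quickly.
  @grace_ms 50

  def start_link(notify), do: GenServer.start_link(__MODULE__, notify)
  def join(scene, id, socket), do: GenServer.call(scene, {:join, id, socket})
  def ids(scene), do: GenServer.call(scene, :ids)

  @impl true
  def init(notify), do: {:ok, %{notify: notify, sockets: %{}}}

  @impl true
  def handle_call({:join, id, socket}, _from, state) do
    # From now on the runtime tells us when this socket process dies.
    Process.monitor(socket)
    {:reply, :ok, put_in(state.sockets[socket], id)}
  end

  def handle_call(:ids, _from, state), do: {:reply, Map.values(state.sockets), state}

  @impl true
  def handle_info({:DOWN, _ref, :process, socket, _reason}, state) do
    # The timer targets self(), so it is cancelled if this scene exits.
    Process.send_after(self(), {:grace_over, socket}, @grace_ms)
    {:noreply, state}
  end

  def handle_info({:grace_over, socket}, state) do
    {id, sockets} = Map.pop(state.sockets, socket)
    send(state.notify, {:left, id})
    {:noreply, %{state | sockets: sockets}}
  end
end

{:ok, scene} = PresenceSketch.start_link(self())
tab = spawn(fn -> Process.sleep(:infinity) end)
:ok = PresenceSketch.join(scene, "ann", tab)

# Tab closed, laptop slept, crash: to the scene they all look like this.
Process.exit(tab, :kill)

left =
  receive do
    {:left, id} -> id
  after
    1_000 -> :timeout
  end
```

There is no close handler anywhere in that code; the `:DOWN` message is the only leave path, and it cannot be forgotten.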
4 · “Let it crash” → supervisors, not safety nets
T5 — “Plugin registration runs at module top-level with no try/catch — one malformed plugin crashes boot.”
          Supervisor
         /    |    \
    scene   scene   Bandit --> one process per connection
one process crashes -> only that one is restarted
Each scene starts as a supervised process; one that fails to start is isolated and logged, not fatal
to the others or to boot. You stop writing defensive try/catch as a load-bearing safety net
— isolation plus restart is the model.
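A sketch of "only that one is restarted", using two named scene processes under a plain `Supervisor`. `FlakyScene` and the `:scene_a`/`:scene_b` names are hypothetical; the port itself starts scenes on demand instead of listing them statically.

```elixir
defmodule FlakyScene do
  use GenServer

  def start_link(name), do: GenServer.start_link(__MODULE__, name, name: name)

  @impl true
  def init(name), do: {:ok, name}
end

children = [
  Supervisor.child_spec({FlakyScene, :scene_a}, id: :scene_a),
  Supervisor.child_spec({FlakyScene, :scene_b}, id: :scene_b)
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

old_a = Process.whereis(:scene_a)
old_b = Process.whereis(:scene_b)

# Kill scene_a the hard way and wait until it is really gone.
ref = Process.monitor(old_a)
Process.exit(old_a, :kill)

receive do
  {:DOWN, ^ref, :process, _pid, _reason} -> :ok
after
  1_000 -> :timeout
end

# Poll until the supervisor has registered the replacement.
wait_for_restart = fn wait ->
  case Process.whereis(:scene_a) do
    pid when is_pid(pid) and pid != old_a ->
      pid

    _ ->
      Process.sleep(10)
      wait.(wait)
  end
end

new_a = wait_for_restart.(wait_for_restart)
```

With `:one_for_one`, `scene_b` keeps its original pid and state while `scene_a` comes back fresh.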
5 · Preemptive scheduling → two more just vanish
T3 – “Synchronous full-registry saveSites()… on the WS/admin hot path blocks the event loop.”
H6 – “Unbounded in-memory growth – per-IP-per-scene activity map and scenes never bounded.”
There’s no shared event loop to block: a process doing slow disk I/O can’t stall message delivery to any other, so T3 isn’t a hazard. And an empty scene process terminates and the BEAM reclaims its heap, so the hand-rolled eviction behind H6 becomes “let the process exit.”
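A small experiment that shows the "nothing can block the loop" half. It starts twice as many CPU-bound processes as there are schedulers and then checks that an unrelated process still answers. `Hog` and the `:ping` message are made up for the demo.

```elixir
defmodule Hog do
  # A CPU-bound loop that never yields voluntarily.
  def spin(n), do: spin(n + 1)
end

# More hogs than cores, so every scheduler has a runaway process on it.
hogs =
  for _ <- 1..(System.schedulers_online() * 2) do
    spawn(fn -> Hog.spin(0) end)
  end

echo =
  spawn(fn ->
    receive do
      {:ping, from} -> send(from, :pong)
    end
  end)

send(echo, {:ping, self()})

answer =
  receive do
    :pong -> :pong
  after
    2_000 -> :timeout
  end

# Clean up the runaways.
Enum.each(hogs, &Process.exit(&1, :kill))
```

On a single event loop the first hog would be the last thing that ever ran. Here the scheduler keeps taking the core back, so the echo process gets its turn.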
That’s six tech-debt items and six bug classes that don’t survive the move — none about cleverness, all about which runtime you started from.
Now the code is boring — and that’s the point
Because the runtime does the hard parts, the port is small. The socket is four callbacks:
def init(opts), do: {:ok, %{scene: Scene.ensure(opts[:scene_key]), id: nil}}

def handle_in({text, [opcode: :text]}, state), # ~ ws "message"
  do: dispatch(Jason.decode!(text), state)

def handle_info({:ws_push, payload}, state), # a scene broadcast → this tab
  do: {:push, {:text, payload}, state}

def terminate(_reason, _state), do: :ok # ~ ws "close" (scene cleans up via monitor)
And the broadcast, the entire “pub/sub”, is a loop: encode once (the JSON.stringify step on Node), send to many.
defp broadcast(state, frame, except: pid) do
  payload = Jason.encode!(frame)

  for i <- Map.values(state.ids), {p, _} <- i.sockets, p != pid,
    do: send(p, {:ws_push, payload})
end
No Phoenix, no Channels, no phoenix_pubsub, no Phoenix.Presence — on
one node they’re redundant. Same altitude as Node’s http + ws, so
the only variable left between the two is the runtime.
Proof: their own tests, green
I pointed TownSquare’s own smoke-test.js — same ws
client, same JSON frames — at the Elixir server:
Core parity test passed.
That covers identity dedup across tabs, the peer snapshot, browserId never leaking,
join/leave, move/say/typing/gestures/reading, server-derived reading labels, rate-limiting, the
140-char cap, seat arbitration, and the multi-tab grace-window semantics. The same client can’t
tell the two servers apart.
What I didn’t port — and why that’s the argument
The Elixir core is ~740 lines; server.js is ~3,300. That’s not a
fair 4.5× — server.js also has the site registry, admin API, moderation, the
world map, proof-of-work, a Plausible proxy, plugins. I ported none of it.
That omission is the point: all of it is ordinary HTTP CRUD where the runtime doesn’t matter — roughly the same size in any language. The BEAM’s advantage is concentrated entirely in the realtime core, which is exactly the part where the tech-debt list evaporates. The boring 80% stays boring everywhere; the hard 20% is the 20% the BEAM was designed for.
When I’d reach for it
Greenfield with the words presence, realtime, multiplayer, or chat in the brief: the BEAM, without a second thought. An existing, working, not-growing product: leave it — rewriting working software is usually a mistake.
One honest caveat on the webring Caue wants next: today “walk to a neighbour” is
just a navigation to their site, which is correct and needs no special runtime — and on his
single hosted server every square is already one process anyway. The BEAM only pulls ahead if this
becomes a federation of independently-run squares sharing live presence, where an all-BEAM
cluster gets cross-node messaging for free (:pg, in the standard library). Real, but a
narrower claim than it first sounds.
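For a feel of what `:pg` gives you, here is a single-node sketch. The group key `{:square, "example.org"}` is a made-up naming scheme. On a connected cluster the same calls would see members from other nodes, which this sketch does not exercise.

```elixir
# Start the default :pg scope if nothing has started it yet.
case :pg.start_link() do
  {:ok, _pid} -> :ok
  {:error, {:already_started, _pid}} -> :ok
end

# Hypothetical group key: one group per square.
group = {:square, "example.org"}

visitor = spawn(fn -> Process.sleep(:infinity) end)
:ok = :pg.join(group, visitor)
:ok = :pg.join(group, self())

# Every process in the group, wherever in the cluster it lives.
members = :pg.get_members(group)
IO.inspect(length(members), label: "members")
```

Processes that exit are dropped from their groups automatically, so the membership list doubles as a coarse presence list.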
Don’t reach for Node by reflex. Reach for the runtime built for exactly this, and spend your cleverness on the part your users actually see.
After-notes: from core to deployable
The port above proves the point, but a demo server needs a little more: lock the socket to one site, survive a bot opening ten thousand connections, and ship as a single artifact. Three small additions — and each is smaller, or differently shaped, on the BEAM, for the same runtime reason the rest of the post is about.
Origin lock-down → reject before the socket exists
On Node the origin and abuse checks live tangled inside the one big message handler, and the connection already exists by the time you look at it. Here the gate is just a branch before the upgrade. The HTTP request is handled by a short-lived process either way, but a rejected request is answered and gone before it ever becomes a long-lived socket process:
cond do
  not origin_allowed?(origin) ->
    send_resp(conn, 403, "origin not allowed")

  RateLimit.take(client_ip(conn)) == :rate_limited ->
    send_resp(conn, 429, "too many connections")

  true ->
    WebSockAdapter.upgrade(conn, Socket, ...) # only now does this become a socket process
end
Same lesson as the crash story: the cheapest place to stop a bad connection is before it becomes a socket process, and the runtime makes “before it exists” an obvious seam.
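The post doesn't show `origin_allowed?`, so here is one plausible shape for it, purely as an assumption: parse the `Origin` header and compare the host against an allow-list. The `OriginGate` module and the hosts in it are placeholders.

```elixir
defmodule OriginGate do
  # Hypothetical allow-list; a real deployment would load this from config.
  @allowed_hosts ["example.org", "www.example.org"]

  # True only for http(s) origins whose host is on the allow-list.
  def origin_allowed?(origin) when is_binary(origin) do
    case URI.parse(origin) do
      %URI{scheme: scheme, host: host} when scheme in ["http", "https"] and is_binary(host) ->
        String.downcase(host) in @allowed_hosts

      _ ->
        false
    end
  end

  # A missing Origin header (nil) is rejected as well.
  def origin_allowed?(_), do: false
end

IO.inspect(OriginGate.origin_allowed?("https://example.org"))
IO.inspect(OriginGate.origin_allowed?("https://example.org.evil.test"))
```

Comparing the parsed host, rather than matching on the raw string, is what keeps look-alike origins such as the second one out.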
Rate limiting → ETS is the answer to shared state under real parallelism
The Node limiter is a Map of timestamps, kept safe only because the single event loop serialises every access. The BEAM has a scheduler on every core (the whole point of the picture up top), so a plain shared mutable map, if you could have one, would be a genuine data race. Processes can’t share one, and funnelling every check through a single limiter process would serialise the hot path all over again. ETS is the runtime’s answer: an in-memory table whose counter updates are atomic and need no lock in your code, so the hot path stays concurrent across all those cores while a supervised owner process sweeps it:
# atomic across every scheduler — no lock, no GenServer call on the hot path
count = :ets.update_counter(table, {ip, bucket}, {2, 1}, {{ip, bucket}, 0, expires})
if count <= limit, do: :ok, else: :rate_limited
It’s the one spot where the parallelism the post celebrates would actually bite you — and OTP hands you the tool that turns it into a non-issue.
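To put that one line in context, here is a self-contained sketch of a fixed-window limiter around it. The module name, the limit of 3, the 10-second window, and the `sweep/1` helper are all assumptions. In the port the table is owned by a supervised process, while here the calling process owns it.

```elixir
defmodule RateLimitSketch do
  @table :rate_limit_sketch
  # Hypothetical policy: 3 connections per IP per 10-second window.
  @limit 3
  @window_s 10

  def init do
    :ets.new(@table, [:named_table, :public, :set, write_concurrency: true])
    :ok
  end

  # Safe to call from any process; the increment itself is atomic.
  def take(ip, now \\ System.system_time(:second)) do
    bucket = div(now, @window_s)
    expires = (bucket + 1) * @window_s
    count = :ets.update_counter(@table, {ip, bucket}, {2, 1}, {{ip, bucket}, 0, expires})
    if count <= @limit, do: :ok, else: :rate_limited
  end

  # Delete every row whose window has ended; returns how many were removed.
  def sweep(now \\ System.system_time(:second)) do
    :ets.select_delete(@table, [{{:_, :_, :"$1"}, [{:"=<", :"$1", now}], [true]}])
  end
end

:ok = RateLimitSketch.init()
results = for _ <- 1..4, do: RateLimitSketch.take("203.0.113.7", 1_000)
IO.inspect(results)
```

The fourth call in the same window is the first one refused. A periodic `sweep/1` keeps the table from growing without bound.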
Releasing it → “no framework” is a small artifact
The release is the BEAM plus four hex packages (Bandit, Plug, WebSock, Jason): no Phoenix, no Node, no separate presence or pub/sub service, because those pieces ship with Elixir and OTP (Registry, DynamicSupervisor, send/2). Two things fall out: one self-contained image configured entirely by the environment at boot (config/runtime.exs), and a supervision tree that is the deployment description, with the rate limiter and the scene supervisor as siblings in one readable list, not glue:
children = [
  {Registry, keys: :unique, name: SceneRegistry},
  {DynamicSupervisor, name: SceneSupervisor, strategy: :one_for_one},
  TownSquareBeam.RateLimit,
  {Bandit, plug: Router, port: port}
]
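The Registry and DynamicSupervisor pair is what makes a find-or-start call like `Scene.ensure` possible. Here is a guess at its shape, not the port's actual code: `SceneProc`, `SketchRegistry`, and `SketchSupervisor` are stand-in names.

```elixir
defmodule SceneProc do
  use GenServer

  def start_link(key), do: GenServer.start_link(__MODULE__, key, name: via(key))

  # Register under the scene key so there is at most one process per key.
  def via(key), do: {:via, Registry, {SketchRegistry, key}}

  # Find the scene for this key, starting it under the supervisor if needed.
  def ensure(key) do
    case DynamicSupervisor.start_child(SketchSupervisor, {SceneProc, key}) do
      {:ok, pid} -> pid
      {:error, {:already_started, pid}} -> pid
    end
  end

  @impl true
  def init(key), do: {:ok, %{key: key, ids: %{}}}
end

children = [
  {Registry, keys: :unique, name: SketchRegistry},
  {DynamicSupervisor, name: SketchSupervisor, strategy: :one_for_one}
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

a1 = SceneProc.ensure("site-a")
a2 = SceneProc.ensure("site-a")
b = SceneProc.ensure("site-b")
```

Two visitors racing to the same site still end up in one scene, because the registry refuses the second registration and hands back the existing pid.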
One honest gotcha. The router resolves the widget directory with
Path.expand("../../public", __DIR__) baked into a module attribute — which is a
build-time path. It works in dev, but the release image has to put public/ back
at that same path for it to resolve. A small reminder that __DIR__ is compiled in, not
discovered at run time.
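The same compile-time-versus-run-time split can be shown without a release at all. In this sketch an environment variable stands in for the directory. `SKETCH_PUBLIC_DIR`, `PathsSketch`, and both paths are invented for the demo.

```elixir
# Pretend this is the build machine.
System.put_env("SKETCH_PUBLIC_DIR", "/build/public")

defmodule PathsSketch do
  # A module attribute is evaluated once, while the module is being compiled.
  @baked System.get_env("SKETCH_PUBLIC_DIR")

  def baked, do: @baked

  # A function body is evaluated every time it is called.
  def runtime, do: System.get_env("SKETCH_PUBLIC_DIR")
end

# Pretend this is the release image, with a different layout.
System.put_env("SKETCH_PUBLIC_DIR", "/app/public")

IO.inspect({PathsSketch.baked(), PathsSketch.runtime()})
```

The attribute still holds the build-time value after the environment changes, which is exactly what happens to a path built from `__DIR__`. One common alternative, if the assets can live under `priv/`, is to resolve them at run time with `Application.app_dir/2`.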
None of this changes the thesis — it’s plumbing, not the core — but it’s a fair illustration that the runtime keeps paying out past the realtime core: less to wire, fewer ways to get it wrong.
Further reading
Where to go deeper — the companion piece, the original project, and the OTP primitives this port leans on.
- TownSquare and its changelog – Caue Napier’s original, built in the open.
- Caue’s release write-up – the project’s own story and roadmap (the webring).
- The fork on GitHub – the Elixir rewrite at the repo root, the Node original preserved on the js-original branch, and the parity test.
- “The Soul of Erlang and Elixir” by Saša Jurić – the best 40 minutes on why the BEAM is different.
- OTP design principles and the Elixir Supervisor docs – supervision trees, “let it crash”, restart strategies.
- :pg process groups – the standard-library piece behind cross-node presence (the webring claim).
- Bandit and WebSock – the HTTP/WebSocket server the port runs on.
The fork, the Elixir server, and the parity test are at the
beam-backend branch of github.com/josefrichter/TownSquare
(main is still the original Node.js server).
Huge thanks to
Caue Napier for building
TownSquare in the open — go
add it to your site.