Internal · SpaceMusic Engineering

The Engine, the Exe, and Three Reaches

SpaceMusic has one user interface. A small program on the same computer builds it and feeds it the video, so it feels instant and needs no internet. The same interface also reaches the local network and the wider web — but the local computer always comes first.

The constraint that picks the architecture

SpaceMusic needs one user interface that does three things at once. On the computer running the engine, it has to feel instant. It has to keep working when there is no network at all. And, from the exact same code, it also has to appear on a tablet across the room and in a browser across the world. Any one of these on its own is easy. The hard part is that SpaceMusic needs all three, and they pull in different directions.

The engine draws new frames every few milliseconds. If the UI shows them 150 milliseconds late, that feels broken when you are sitting right at the machine. But over the internet, the same 150 milliseconds is fine — there, the only other option is no picture at all. In other words, how much delay is acceptable depends on where the viewer is. "Good enough" means one thing up close and something else far away. A design that ignores this ends up in one of two bad places: it makes the local case slow just to keep the remote case simple, or it makes the remote case impossible just to keep the local case fast. We want neither.

This document describes the way out: one UI codebase, three different ways to move the video (we call them transports), and one simple rule that picks the right transport on its own — so the UI never has to choose.

How streaming UIs usually get built

Most web UIs today are built the same way. The UI is hosted in the cloud and loaded from a content network. It talks to a backend server over HTTPS, and any live video passes through a media server in the middle. This is a clean, simple model: there is one place the UI comes from, one login system, and one place to deploy updates. For a product whose users are always somewhere far away, it is the right choice.

It is also the wrong starting point for us, and it is worth being honest about why — because it was the first thing I tried. Sending everything through a server makes the design nice and uniform. But it does that by making the common case pay for the rare one: the person sitting right at the machine has to take the slow path, just so a remote viewer can be supported. For SpaceMusic, that trade is the wrong way around.

Cloud-first. One hosted UI, and all traffic goes through the server. Uniform and simple, but every viewer pays for the trip to a far-away datacenter, and nothing works without internet. Examples: most SaaS dashboards, cloud creative tools.
Local-only. A native app tied directly to the engine. It feels instant, but it is a second codebase, and it cannot reach a phone or a remote browser without building a whole new transport. Example: classic desktop control surfaces.
Local-first, server-optional (where we land). One web UI, mostly served and fed by a program on the local machine, with the server added only for the case that truly needs the internet.

One sentence about the product rules out the cloud-first default: this is the only UI we have, it must feel instant on the same machine, and it must run with no internet. Once that is fixed and not up for debate, the server cannot sit on the path the local case depends on. The rest of the design follows from there.

What we have already proven

We are not starting from nothing. The most important reach is also the one with the least risk left in it. In an earlier test (plan 037), we sent the engine's picture into a browser on the same computer over a localhost WebSocket and decoded it with WebCodecs. The end-to-end delay was about 9.5 milliseconds — fast enough that it looks the same as a normal native window. So the same-machine transport is already built and measured.

Three more pieces are already in place. First, the engine is headless: it shares its images over Spout and its parameters over a WebSocket, and does no streaming work itself — the encoding happens in a separate program. Second, we just switched the encoder to Main 4:2:0 H.264. This is exactly the format the internet path needs, so that reach is now unblocked at the codec level. Third, the streamer now does demand-driven downscale: it encodes each image at the size and frame rate the viewer actually asks for. That last change is what kept the picture stable on a weaker device at 30 frames per second. So the parts exist. What is missing is the design that connects them.
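The sizing step behind demand-driven downscale can be sketched roughly as follows. This is an illustration under stated assumptions, not the streamer's actual code: the names `fitEven` and `toEven` and the rounding policy are made up for this example. It fits the source image inside the box a viewer asks for, never upscales, and rounds both dimensions down to even numbers, because 4:2:0 chroma is subsampled by two in each direction.

```typescript
// Hypothetical helper; the real streamer may size its images differently.
interface Size {
  width: number;
  height: number;
}

// Round down to the nearest even number, but never below 2.
function toEven(n: number): number {
  return Math.max(2, Math.floor(n / 2) * 2);
}

// Fit `src` inside the box the viewer asked for, keeping the aspect ratio.
// Never upscales: a request at least as large as the source returns the source size.
function fitEven(src: Size, requested: Size): Size {
  if (requested.width >= src.width && requested.height >= src.height) {
    return { width: toEven(src.width), height: toEven(src.height) };
  }
  // Compare cross products to find the limiting side without dividing first.
  const widthLimited = src.width * requested.height > src.height * requested.width;
  if (widthLimited) {
    const height = (src.height * requested.width) / src.width;
    return { width: toEven(requested.width), height: toEven(height) };
  }
  const width = (src.width * requested.height) / src.height;
  return { width: toEven(width), height: toEven(requested.height) };
}
```

For example, a 1920×1080 source requested inside a 640×640 box comes out as 640×360, so a small tile on a tablet costs the encoder a fraction of the full-size work.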

Transport follows the serving origin

The key idea is small enough to say in one line: do not pick one transport for everything — let the UI talk back to whatever served it.

If the page was loaded from localhost, then the thing that served it is the local program sitting right next to the engine. So the UI streams straight from it over a WebSocket — instant, offline, with no server involved. If the page was loaded from a local-network address, it is the same story, just one network hop further away. If the page was loaded from the public server, the engine cannot be reached directly, so the UI receives the video through LiveKit instead. Where the page came from already tells you which reach it is. So if we tie the transport to that, the UI makes the right choice on its own — with no settings to configure, and no second codebase.
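The origin rule can be sketched as a small pure function. This is a minimal sketch, assuming the reach can be read from the hostname the page was loaded from; the names `Reach`, `reachForHost`, and `transportFor`, and the `/video` path, are hypothetical and not part of the existing code.

```typescript
// Hypothetical names throughout; only the three reaches come from the design.
type Reach = "same-machine" | "lan" | "wan";

type Transport =
  | { kind: "websocket"; url: string } // stream straight from whatever served the page
  | { kind: "livekit" };               // receive through the media server

// True for the private IPv4 ranges 10/8, 172.16/12 and 192.168/16.
function isPrivateIPv4(host: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return a === 10 || (a === 172 && b >= 16 && b <= 31) || (a === 192 && b === 168);
}

// `hostname` is what `location.hostname` would report in the browser.
function reachForHost(hostname: string): Reach {
  // Assumption: only these three loopback spellings are handled here.
  if (hostname === "localhost" || hostname === "127.0.0.1" || hostname === "[::1]") {
    return "same-machine";
  }
  if (isPrivateIPv4(hostname) || hostname.endsWith(".local")) return "lan";
  return "wan";
}

// Talk back to whatever served the page; only the public origin uses LiveKit.
function transportFor(hostname: string, port: string): Transport {
  if (reachForHost(hostname) === "wan") return { kind: "livekit" };
  const hostPort = port ? `${hostname}:${port}` : hostname;
  return { kind: "websocket", url: `ws://${hostPort}/video` };
}
```

In the page this would be called once at startup with `location.hostname` and `location.port`. The UI components never see the result directly; they only see the source that was built from it.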

Figure 1 · System topology: one engine, one exe, three reaches

The engine on the left never even knows the server exists. The local program is the only thing that connects the local world to the internet, and it only ever makes outgoing connections. That means there is no incoming port to open in the studio's firewall. Two of the three arrows leaving the program stay on the local network and never touch the dashed box. Only the internet reach goes down into the server — and only when someone actually opens the page from outside.

Three reaches, three transports

With that origin rule in place, each reach can use the transport that fits it best, instead of one shared transport that is merely the easiest. The differences are real, not cosmetic. The video path, the parameter path, the delay, and even the login model all change from one column to the next.

Figure 2 · The three reaches, side by side

The two local reaches

Same-machine and LAN use the same transport: H.264 video over a WebSocket, decoded by WebCodecs. They differ only in the network hop and one browser technicality. A page loaded from localhost counts as a secure context, so WebCodecs runs over plain HTTP with no certificate needed. The same page loaded from a local-network IP address does not count as secure. Chrome enforces this rule and withholds WebCodecs on that page; Safari does not, and lets it through. That one browser quirk is why the LAN reach is "Safari for now, a dedicated app later" rather than "works everywhere today." It is also why neither local reach needs the certificate setup that the phrase "we need HTTPS" might suggest.
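One way the UI could tell these cases apart at startup is sketched below. It is an assumption-laden sketch, not existing code: `DecoderProbe`, `decoderStatus`, and `probeBrowser` are hypothetical names. It relies on feature detection first, because the decoder may be present even on a page the browser does not treat as secure, and uses the secure-context flag only to explain a missing decoder.

```typescript
// Hypothetical probe result. In a browser the two flags come from the
// global `isSecureContext` property and the presence of `VideoDecoder`.
interface DecoderProbe {
  isSecureContext: boolean;
  hasVideoDecoder: boolean;
}

type DecoderStatus = "ok" | "blocked-insecure-context" | "unsupported";

// Feature detection decides; the secure-context flag only explains a failure.
function decoderStatus(probe: DecoderProbe): DecoderStatus {
  if (probe.hasVideoDecoder) return "ok";
  return probe.isSecureContext ? "unsupported" : "blocked-insecure-context";
}

// Reads the flags from the current global scope without assuming DOM typings.
function probeBrowser(): DecoderProbe {
  const g = globalThis as unknown as {
    isSecureContext?: boolean;
    VideoDecoder?: unknown;
  };
  return {
    isSecureContext: g.isSecureContext === true,
    hasVideoDecoder: typeof g.VideoDecoder === "function",
  };
}
```

A "blocked-insecure-context" result on a LAN address is the cue to show a hint such as "open this page in Safari or use the app", instead of a generic decoder error.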

The remote reach, and the auth split

WAN is the only column that touches the server, and the only one with real authentication. The local program publishes each active image into a LiveKit room using WHIP. A remote browser loads the UI from the server behind single sign-on, then joins that room with a short-lived token to receive the video. There are three credentials, and each has one job. The person signs in once with SSO. The browser is given a scoped token (a JWT) that only lets it receive, nothing else. The headless program uses a long-lived API key, which it trades for a token that lets it publish. The local reaches need none of this — at most a simple pairing code, so a random device on the network cannot connect.
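The split between the two token kinds can be written down as data. The sketch below is illustrative only: the grant field names are modelled on LiveKit's video grant but should be checked against the LiveKit documentation, and the role names, the `claimsFor` helper, and the ten-minute lifetime are assumptions. Signing is left out on purpose.

```typescript
// Field names modelled on LiveKit's video grant; treat the exact shape as an
// assumption to verify, not as the library's API.
interface RoomGrant {
  room: string;
  roomJoin: boolean;
  canSubscribe: boolean;
  canPublish: boolean;
}

// Hypothetical role names for the two token holders described above.
type Role = "browser-viewer" | "exe-publisher";

// A browser may only receive; the headless program may only publish.
function grantFor(role: Role, room: string): RoomGrant {
  const publisher = role === "exe-publisher";
  return { room, roomJoin: true, canSubscribe: !publisher, canPublish: publisher };
}

// Unsigned claims; `exp` is seconds since the epoch, as in a standard JWT.
interface TokenClaims {
  sub: string;
  exp: number;
  video: RoomGrant;
}

// Assumed default lifetime of 600 seconds; the real value is a policy choice.
function claimsFor(
  identity: string,
  role: Role,
  room: string,
  nowSeconds: number,
  ttlSeconds: number = 600,
): TokenClaims {
  return { sub: identity, exp: nowSeconds + ttlSeconds, video: grantFor(role, room) };
}

function isExpired(claims: TokenClaims, nowSeconds: number): boolean {
  return nowSeconds >= claims.exp;
}
```

Keeping the grant a pure function of the role makes the "one job per credential" rule easy to review: there is no code path that hands a browser a publish grant.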

One sharp detail here, which we learned the hard way: the browser must get its token from its own origin. It must not fetch the token from the API on a different origin. The reason is that the SSO layer answers a cross-origin pre-check with a redirect and no CORS header. This fails quietly when the user has no active session, but works when they already have a warm login cookie. That is the worst kind of bug to track down, because it looks fine in testing and breaks in the real world.
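A guard like the following can make the same-origin rule hard to break by accident. It is a sketch under assumptions: the endpoint path `/api/livekit-token`, the `{ token }` response shape, and the names `sameOriginPath` and `fetchViewerToken` are all hypothetical. The fetch function is passed in so the logic can run outside a browser.

```typescript
// Minimal shape of the fetch call this sketch needs.
type FetchLike = (
  url: string,
  init: { credentials: "same-origin" },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Accept only origin-relative paths such as "/api/x". Absolute URLs and
// protocol-relative forms ("//host" or "/\host") are rejected.
function sameOriginPath(path: string): string {
  const second = path.charAt(1);
  if (path.charAt(0) !== "/" || second === "/" || second === "\\") {
    throw new Error(`token endpoint must be a same-origin path, got: ${path}`);
  }
  return path;
}

// Hypothetical endpoint and response shape: { "token": "<jwt>" }.
async function fetchViewerToken(
  fetchFn: FetchLike,
  path: string = "/api/livekit-token",
): Promise<string> {
  const res = await fetchFn(sameOriginPath(path), { credentials: "same-origin" });
  if (!res.ok) throw new Error(`token request failed with status ${res.status}`);
  const body = (await res.json()) as { token?: unknown };
  if (typeof body.token !== "string") throw new Error("token response had no token");
  return body.token;
}
```

Because the path is relative, the request always goes to whatever origin served the page, so the SSO session cookie applies and no cross-origin check is involved.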

The hard edges

Three things at the edges are worth naming, so they are deliberate choices and not surprises later.

The LAN certificate. To get Chrome — not just Safari — working on the local network, the clean way is to put a trusted certificate on the local program. But a private local IP address cannot be given a public certificate. Doing it anyway would mean special DNS setup and a hand-made certificate, which is the one genuinely awkward piece of infrastructure in the whole design. We avoid it rather than solve it: a dedicated native app for the tablet does not follow the browser's secure-context rules at all, so the app makes the certificate problem disappear. Until that app exists, Safari covers the LAN.

The demand signal has to travel over the wire. The encoder should only do the work a viewer actually needs: which channels, at what size, at what frame rate. On the local machine, that request is a direct WebSocket message. Over the internet, it travels through the same relay as the parameters. The mechanism is the same in both cases — it is the per-tile downscale we already built — only the wire is longer. This is what stops the internet upload from carrying full-resolution video that nobody is even looking at.
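The demand signal might look like the sketch below. The message fields, the `type: "demand"` envelope, and the rule of taking the largest request per channel when several viewers are connected are assumptions made for illustration; the document only fixes what the signal carries (channels, size, frame rate).

```typescript
// Hypothetical message shape; field names are illustrative.
interface Demand {
  channel: string; // which engine image the viewer is showing
  width: number;   // largest size it is drawn at, in pixels
  height: number;
  fps: number;     // highest frame rate the viewer wants
}

// Combine the requests of all viewers into one request per channel.
// Channels nobody asks for are absent, so the encoder can skip them.
function mergeDemands(demands: Demand[]): Map<string, Demand> {
  const merged = new Map<string, Demand>();
  for (const d of demands) {
    const prev = merged.get(d.channel);
    merged.set(
      d.channel,
      prev
        ? {
            channel: d.channel,
            width: Math.max(prev.width, d.width),
            height: Math.max(prev.height, d.height),
            fps: Math.max(prev.fps, d.fps),
          }
        : { ...d },
    );
  }
  return merged;
}

// The same JSON text can travel over the local WebSocket or the remote relay.
function encodeDemand(d: Demand): string {
  return JSON.stringify({ type: "demand", ...d });
}

// Returns null for anything that is not a well-formed demand message.
function decodeDemand(text: string): Demand | null {
  let msg: unknown;
  try {
    msg = JSON.parse(text);
  } catch {
    return null;
  }
  if (typeof msg !== "object" || msg === null) return null;
  const { type, channel, width, height, fps } = msg as Record<string, unknown>;
  if (type !== "demand" || typeof channel !== "string") return null;
  if (typeof width !== "number" || typeof height !== "number" || typeof fps !== "number") return null;
  return { channel, width, height, fps };
}
```

Since the message is plain JSON, the encoder side does not need to know which wire it arrived on. Only the delivery path differs between the local and the internet reach.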

WAN is under-built on purpose. Two viewers, one room, on a server that could handle thousands. That is not a mistake — it is the right size for something that is, for now, just a demonstration. It becomes important the day we run the engine itself in the cloud. The design is ready for that day, without being over-built for it today.

Why local-first is the whole game

"The server adds reach, but the product never depends on it. Unplug the network, and the UI works exactly as well as it did a second ago."

We turn the usual model around and serve the UI from a local program — but not for the sake of speed alone. The real reason is that this UI is the product. There is no other one. A product whose only interface goes black when the Wi-Fi drops, or stutters when a datacenter in another country is busy, is a fragile product. Local-first makes the common case the fast case, and the offline case the normal case. The internet then becomes a bonus that extends our reach, instead of a single thing the whole product hangs on and fails without.

All of this rests on one piece of discipline: a single seam, built in from the very first commit. This seam is a transport abstraction — a thin layer that hides where the video comes from. With it, the UI components do not know or care whether their pixels arrived from a localhost socket nine milliseconds ago, or from a LiveKit room a hundred and fifty milliseconds ago. Get this seam right early, and the three reaches are just three versions behind one shared interface. Get it wrong, and we end up writing the UI twice. Everything else in this document follows from that one decision.
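As a rough picture of what the seam could look like: the sketch below uses the `VideoSource` and `ParamSource` names from the plan, but the method signatures, the generic `Frame` type, and the in-memory `FakeVideoSource` are assumptions made for illustration. What a "frame" actually is on each transport is deliberately left open.

```typescript
// Hypothetical seam; only the names VideoSource and ParamSource come from the plan.
type Unsubscribe = () => void;

interface VideoSource<Frame> {
  // Start delivering frames for one channel; the returned function stops delivery.
  subscribe(channel: string, onFrame: (frame: Frame) => void): Unsubscribe;
}

interface ParamSource {
  set(name: string, value: number): void;
  onChange(listener: (name: string, value: number) => void): Unsubscribe;
}

// In-memory stand-in used here in place of a WebSocket or LiveKit source.
class FakeVideoSource<Frame> implements VideoSource<Frame> {
  private listeners: Array<{ channel: string; onFrame: (frame: Frame) => void }> = [];

  subscribe(channel: string, onFrame: (frame: Frame) => void): Unsubscribe {
    const entry = { channel, onFrame };
    this.listeners.push(entry);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== entry);
    };
  }

  // Stands in for "a frame arrived from the wire".
  push(channel: string, frame: Frame): void {
    for (const l of this.listeners) {
      if (l.channel === channel) l.onFrame(frame);
    }
  }
}

// A UI component written against the seam cannot tell which transport is behind it.
function mirrorFrames<Frame>(
  source: VideoSource<Frame>,
  channel: string,
  sink: Frame[],
): Unsubscribe {
  return source.subscribe(channel, (frame) => {
    sink.push(frame);
  });
}
```

A side benefit of this shape is that UI components can be tested against the fake source with no engine, no socket, and no server running.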

Glossary

Terms and acronyms used in this document, in plain language.

Spout: A Windows mechanism for sharing a GPU texture between processes on the same machine with no copy. How the engine hands frames to the local exe.
WebCodecs: A browser feature that lets web code use the device's hardware video decoder directly, skipping the usual <video> buffering. This is what makes decoding fast enough for low delay inside a web page.
secure context: A browser safety rule. Some features, WebCodecs among them, only run on pages loaded over HTTPS — or from localhost, which the browser treats as safe.
NVENC: NVIDIA's hardware H.264/H.265 encoder. The local exe uses it to turn textures into a video stream without taxing the engine.
Main 4:2:0: An H.264 profile plus chroma format that is broadly decodable (browsers and WebRTC alike). 4:2:0 stores a quarter of the chroma samples of 4:4:4, which halves the raw data per frame. This is the format we settled the encoder on.
WHIP: WebRTC-HTTP Ingestion Protocol. A simple HTTP handshake for pushing a WebRTC stream into a media server — how the exe publishes video to the server.
LiveKit: The open-source WebRTC media server (an SFU) deployed on our server; it ingests the exe's WHIP stream and fans it out to remote viewers.
SFU: Selective Forwarding Unit. A media server that takes one copy of a stream and forwards it to many viewers, so the sender only has to upload once.
Centrifugo: The real-time message relay on our server; carries parameters (and the demand signal) between the exe and a remote UI.
Authentik: Our single-sign-on provider. Gates the server-hosted UI; issues the session the token-mint step checks.
devpush: The self-hosted platform-as-a-service on our server that builds and hosts web apps from a git push — where the WAN UI is deployed.
JWT: JSON Web Token. A short-lived signed credential; here, scoped to "subscribe to room X" for a browser or "publish to room X" for the exe.
glass-to-glass: End-to-end latency measured from the frame rendered on the engine to that frame visible in the viewer — the number that actually matters.

Settled

The same-machine pipe

localhost WebSocket + WebCodecs, about 9.5 ms, offline, no certificate. Proven in plan 037 — the reach with the least risk left.

Next step

The real UI + the transport seam

Build the real UI on the local pipe, behind a VideoSource / ParamSource layer, and have the local program serve it offline.

Later

LAN app & WAN

A native LAN app to remove the certificate question. WHIP → LiveKit + SSO for the ≤2-user remote case, which becomes real once we run SpaceMusic in the cloud.