Failover & Reliability

Now Playing is designed to stay live even during deployments and primary-host outages. The mechanism has two parts: a blue-green deployment process that routes traffic between two environments during upgrades, and an automatic failover that activates a standby cloud VM when the primary is unhealthy.

The Two Environments

Environment	Host	Default State
Blue (primary)	Proxmox homelab, LXC container CT 109	Always running
Green (failover)	GCP Compute Engine VM, us-west1-a	Stopped by default

Both environments run the same container image and connect to the same managed PostgreSQL database and Redis instance, so either one can serve traffic as soon as it is running.

How Traffic Routing Works

A Cloudflare load balancer sits in front of both environments. It runs a health check every 60 seconds against GET /api/health on each pool and routes based on the result.
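To make the health check concrete, here is a minimal sketch of what a probe against GET /api/health amounts to. It is not Cloudflare's monitor configuration. The function names, the example origin, and the rule that any 2xx status counts as healthy are assumptions made for illustration.

```python
from typing import Callable
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

HEALTH_PATH = "/api/health"   # endpoint named in the text above
CHECK_INTERVAL_S = 60         # interval named in the text above


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET on url, or 0 if no response arrives."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:       # server answered with a 4xx/5xx status
        return err.code
    except (URLError, OSError):    # DNS failure, refused connection, timeout
        return 0


def pool_is_healthy(origin: str,
                    fetch_status: Callable[[str], int] = http_status) -> bool:
    """Assumption: a pool is healthy when the endpoint returns any 2xx status."""
    status = fetch_status(origin.rstrip("/") + HEALTH_PATH)
    return 200 <= status < 300
```

Passing `fetch_status` as a parameter keeps the decision logic testable without a network, for example `pool_is_healthy("https://blue.example", lambda url: 503)` evaluates to `False`.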

                    app.nowplayingapp.com
                             │
                 ┌───────────┴────────────┐
                 │  Cloudflare Load       │
                 │  Balancer              │
                 │  Health check: 60s     │
                 │  Steering: failover    │
                 │  Session affinity: IP  │
                 └───────────┬────────────┘
                             │
               ┌─────────────┼─────────────┐
               ▼                           ▼
          Blue pool                   Green pool
       (proxmox-primary)           (gcp-failover)
         PRIMARY                    FALLBACK

Steering is set to failover mode: while the primary pool is healthy, it receives 100% of traffic. If the primary becomes unhealthy, traffic shifts to the fallback pool.
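The steering rule can be sketched as a priority-ordered lookup. This is an illustration of the idea, not Cloudflare's implementation. The pool names come from the diagram above, while `choose_pool` and the choice to return `None` when every pool is down are assumptions of this sketch.

```python
from typing import Mapping, Optional, Sequence

# Pool names taken from the diagram; the order encodes failover priority.
POOL_ORDER = ("proxmox-primary", "gcp-failover")


def choose_pool(health: Mapping[str, bool],
                order: Sequence[str] = POOL_ORDER) -> Optional[str]:
    """Return the first healthy pool in priority order, or None if none is healthy."""
    for pool in order:
        if health.get(pool, False):   # a pool with no health data counts as unhealthy
            return pool
    return None


# The primary wins whenever it is healthy, regardless of the fallback's state.
print(choose_pool({"proxmox-primary": True, "gcp-failover": True}))
# The fallback only receives traffic once the primary is marked unhealthy.
print(choose_pool({"proxmox-primary": False, "gcp-failover": True}))
```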

Automatic Failover

When the primary goes down, a GCP Cloud Function activates the failover VM. The function is triggered automatically (and can also be invoked manually). The sequence:

Cloudflare marks the primary pool unhealthy.
Cloud Function start-failover starts the standby VM.
The VM boots (~30 seconds) and auto-starts the unified container (~30 seconds).
Cloudflare’s next health check passes, and traffic shifts to the failover.
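Service is typically restored one to two minutes after the outage is detected: about one minute for the VM and container to start, plus up to one 60-second health-check interval.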
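The recovery time follows from the figures in the sequence above. A rough estimate, assuming the next health check can land anywhere between immediately after startup and one full interval later, and ignoring any requirement for several consecutive passing checks:

```python
from typing import Tuple

VM_BOOT_S = 30                 # approximate VM boot time from the steps above
CONTAINER_START_S = 30         # approximate container start time from the steps above
HEALTH_CHECK_INTERVAL_S = 60   # Cloudflare health check interval


def recovery_window_s(boot: int = VM_BOOT_S,
                      container: int = CONTAINER_START_S,
                      interval: int = HEALTH_CHECK_INTERVAL_S) -> Tuple[int, int]:
    """Return (best case, worst case) seconds from VM start to traffic shifting."""
    startup = boot + container
    # Best case: a check fires right as the container is ready.
    # Worst case: the container becomes ready just after a check, so it waits a full interval.
    return startup, startup + interval


print(recovery_window_s())  # (60, 120)
```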

When the primary comes back online:

Cloudflare health checks begin passing on the primary.
Traffic automatically shifts back to the primary pool.
Cloud Function stop-failover stops the GCP VM to avoid ongoing cost.
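Taken together, the two sequences above reduce to a small decision rule: start the standby when the primary is down, and stop it once the primary has recovered. The sketch below is hypothetical. The `Action` enum and `next_action` are illustrative names, not the actual Cloud Function code, and the rule deliberately ignores deployments, where the standby runs while the primary is still healthy.

```python
from enum import Enum


class Action(Enum):
    START_FAILOVER = "start-failover"   # matches the Cloud Function name in the text
    STOP_FAILOVER = "stop-failover"     # matches the Cloud Function name in the text
    NONE = "none"


def next_action(primary_healthy: bool, standby_running: bool) -> Action:
    """Decide which function, if any, should run for the current state."""
    if not primary_healthy and not standby_running:
        return Action.START_FAILOVER    # outage detected, standby still stopped
    if primary_healthy and standby_running:
        return Action.STOP_FAILOVER     # primary is back, stop paying for the VM
    return Action.NONE                  # steady state, or failover already active
```

A real implementation would also need a guard, such as a deploy-in-progress flag, so that the standby is not stopped in the middle of a blue-green deployment.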

Blue-Green Deployments

A deployment is a controlled version of failover. Instead of waiting for a failure, the deploy process proactively:

Starts the failover VM.
Flips the load balancer to route traffic to the failover.
Upgrades the primary container on the homelab.
Waits for the primary to become healthy again.
Flips traffic back to the primary.
Stops the failover VM.
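The deployment steps above can be sketched as a short orchestration routine. Everything here is an assumption made for illustration: the `DeployOps` interface, its method names, and the timeouts are hypothetical, and the extra wait for the failover to become healthy before flipping traffic is a safety step this sketch adds, not one listed above.

```python
import time
from typing import Callable, Protocol

BLUE = "proxmox-primary"   # pool names taken from the routing diagram
GREEN = "gcp-failover"


class DeployOps(Protocol):
    """Hypothetical interface over the VM, load balancer, and homelab container."""
    def start_failover_vm(self) -> None: ...
    def stop_failover_vm(self) -> None: ...
    def route_to(self, pool: str) -> None: ...
    def upgrade_primary(self) -> None: ...
    def is_healthy(self, pool: str) -> bool: ...


def wait_until_healthy(ops: DeployOps, pool: str, timeout_s: float,
                       poll_s: float, sleep: Callable[[float], None]) -> bool:
    """Poll the pool's health until it passes or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if ops.is_healthy(pool):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)


def blue_green_deploy(ops: DeployOps, timeout_s: float = 300.0, poll_s: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run the six steps in order; return True only if traffic ends up back on blue."""
    ops.start_failover_vm()                                    # 1. start the failover VM
    if not wait_until_healthy(ops, GREEN, timeout_s, poll_s, sleep):
        ops.stop_failover_vm()                                 # abort before touching traffic
        return False
    ops.route_to(GREEN)                                        # 2. flip traffic to the failover
    ops.upgrade_primary()                                      # 3. upgrade the primary
    if not wait_until_healthy(ops, BLUE, timeout_s, poll_s, sleep):   # 4. wait for health
        return False                                           # stay on green, keep the VM up
    ops.route_to(BLUE)                                         # 5. flip traffic back
    ops.stop_failover_vm()                                     # 6. stop the failover VM
    return True
```

If the upgraded primary never becomes healthy, this sketch leaves traffic on the failover and the VM running, so a bad release does not take the service down.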

The result: your overlay stays live for the entire upgrade. Viewers see no disconnect, and in-flight tracks continue to be delivered.

What Stays Consistent Across Failover

Because both environments connect to the same GCP-hosted PostgreSQL and managed Redis, all data is consistent across failover:

Track history is written to the same database.
User settings, authentication sessions, and overlay tokens are shared.
In-progress track enrichment continues without loss.

The only element that does not share state is the desktop-to-cloud Socket.IO connection. When traffic shifts, the desktop app reconnects automatically to the new active endpoint, typically within a few seconds.
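Socket.IO clients ship with automatic reconnection that retries with increasing delays. The sketch below illustrates that kind of schedule with a capped exponential backoff. The specific values and the `reconnect_delays` helper are assumptions chosen for illustration, not the desktop app's actual settings.

```python
from typing import List


def reconnect_delays(base_s: float = 1.0, factor: float = 2.0,
                     cap_s: float = 5.0, attempts: int = 5) -> List[float]:
    """Return the wait before each reconnect attempt: exponential growth, capped."""
    return [min(cap_s, base_s * factor ** i) for i in range(attempts)]


print(reconnect_delays())  # [1.0, 2.0, 4.0, 5.0, 5.0]
```

With a schedule like this, a client that loses its connection during a traffic shift retries after one second and again two seconds later, which is consistent with reconnecting within a few seconds once the new endpoint is serving.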

What Failover Does Not Cover

Failover protects against primary-host outages and against downtime during deployments. It does not protect against:

Your local DJ software crashing. Now Playing can only relay what it receives.
Your internet connection dropping. The desktop app buffers tracks while offline and reports a disconnected state in the dashboard.
A Cloudflare-wide outage. If the edge is down, traffic cannot reach either environment.
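The offline buffering mentioned for dropped internet connections can be pictured as a bounded queue that holds tracks while disconnected and flushes them in order on reconnect. This is a hypothetical sketch: the `TrackBuffer` class, its method names, and the 500-track limit are assumptions, not the desktop app's actual implementation.

```python
from collections import deque
from typing import Callable, Deque


class TrackBuffer:
    """Queue tracks while offline and deliver them in order after reconnecting."""

    def __init__(self, send: Callable[[str], None], max_buffered: int = 500) -> None:
        self._send = send
        # A bounded deque silently drops the oldest track once the limit is reached.
        self._pending: Deque[str] = deque(maxlen=max_buffered)
        self.connected = False

    def publish(self, track: str) -> None:
        if self.connected:
            self._send(track)            # online: deliver immediately
        else:
            self._pending.append(track)  # offline: hold for later

    def on_connect(self) -> None:
        self.connected = True
        while self._pending:             # flush oldest first to preserve play order
            self._send(self._pending.popleft())

    def on_disconnect(self) -> None:
        self.connected = False


sent = []
buf = TrackBuffer(sent.append)
buf.publish("Track A")   # buffered, the app starts disconnected
buf.publish("Track B")
buf.on_connect()         # both tracks are flushed in order
print(sent)              # ['Track A', 'Track B']
```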

For the vast majority of disruptions (maintenance windows, host failures, deploys), failover keeps the service available, and the switch is transparent to DJs and their viewers.