Testing a Self-Hosted Mesh VPN: Headscale + Tailscale on NixOS
Tailscale makes "every device on its own private network" feel effortless, but in many production scenarios you do not want the control plane to sit in a third-party cloud. Headscale is the popular open-source reimplementation of that control plane — the same Tailscale clients keep working, but the coordination server is yours.
This tutorial builds a fully reproducible NixOS integration test that spins up:
- a self-hosted Headscale control server with its own embedded DERP relay,
- two HTTP services behind nginx, reachable only over the tailnet,
- an end-user client device that joins the tailnet and verifies end-to-end connectivity, MagicDNS, and the tailnet allow-list.
In other words: an entire private mesh VPN, provisioned, joined, and exercised from scratch — in seconds, on every CI run.
Three things make this more than a "VPN works" smoke test:
- MagicDNS is validated end-to-end — the part of Tailscale that people most often misconfigure.
- The allow-list is checked as a negative test. Proving the services are reachable only through the tailnet is the difference between defence in depth and the illusion of it.
- The whole thing is hermetic. No external Headscale, no real Tailscale coordination, no network access required — it runs the same way on your laptop and in CI.
It is also a fair template for any private-overlay setup built on Tailscale or Headscale: WireGuard configurations, ACLs, subnet routers, exit nodes, tsnet applications.
What you will validate
By the end of the test, you will have proven — automatically, in a sandbox — that:
- Headscale comes up and serves its API over HTTPS with a self-signed certificate the clients trust.
- Two infrastructure machines join the tailnet using a pre-shared admin key (the bootstrap flow you would use in real production with agenix or sops-nix).
- An end-user client joins separately with its own, separately issued pre-auth key (the laptop/phone flow).
- All nodes can reach each other via tailscale ping.
- HTTP services resolve and respond via MagicDNS on *.acme.internal.
- The same services are unreachable over the underlying test LAN: the nginx allow-list really gates traffic by tailnet CIDR.
Architecture
Three NixOS nodes, one tailnet, two reachability assertions, and two reachability non-assertions. The whole arrangement is described in a single test file.
The test
{ pkgs, ... }:
let
  tls-cert = # (1)
    pkgs.runCommand "selfSignedCerts"
      {
        buildInputs = [ pkgs.openssl ];
      }
      ''
        openssl req \
          -x509 -newkey rsa:2048 -sha256 -days 365 \
          -nodes -out cert.pem -keyout key.pem \
          -subj '/CN=control' -addext "subjectAltName=DNS:control"
        mkdir -p $out
        cp key.pem cert.pem $out
      '';

  # IPv4 prefix headscale allocates tailnet IPs from. Single source of truth
  # for both the headscale allocator and the nginx allow-list.
  tailnetCidrV4 = "100.64.0.0/10";

  helloVhost = # (2)
    { config, ... }:
    {
      services.nginx = {
        enable = true;
        virtualHosts.hello = {
          default = true;
          extraConfig = ''
            allow ${tailnetCidrV4};
            deny all;
          '';
          locations."/".return = "200 'hello from ${config.networking.hostName}\\n'";
        };
      };
      networking.firewall.interfaces.tailscale0.allowedTCPPorts = [ 80 ];
    };

  tailscaleJoin = # (3)
    { config, ... }:
    {
      services.tailscale = {
        enable = true;
        authKeyFile = "/var/lib/tailscale/preauthkey";
        extraUpFlags = [
          "--login-server=https://control"
          "--hostname=${config.networking.hostName}"
        ];
      };
    };

  headscaleControlPlane = # (4)
    { config, ... }:
    {
      services.headscale = {
        enable = true;
        settings = {
          server_url = "https://control";
          derp = {
            # (5)
            urls = [ ];
            server = {
              enabled = true;
              region_id = 999;
              stun_listen_addr = "0.0.0.0:${toString config.services.tailscale.derper.stunPort}";
              # Pin IPv4 so tailscaled's netcheck doesn't need to DNS-resolve
              # the test-VLAN hostname.
              ipv4 = config.networking.primaryIPAddress;
            };
          };
          dns = {
            base_domain = "acme.internal";
            override_local_dns = false;
          };
          prefixes.v4 = tailnetCidrV4;
        };
      };
      networking.firewall.allowedUDPPorts = [ config.services.tailscale.derper.stunPort ];
      environment.systemPackages = [
        config.services.headscale.package
        pkgs.jq
      ];
    };

  tlsReverseProxy = # (6)
    { config, ... }:
    {
      services.nginx = {
        enable = true;
        virtualHosts.control = {
          onlySSL = true;
          sslCertificate = "${tls-cert}/cert.pem";
          sslCertificateKey = "${tls-cert}/key.pem";
          locations."/" = {
            proxyPass = "http://127.0.0.1:${toString config.services.headscale.port}";
            proxyWebsockets = true;
          };
        };
      };
      networking.firewall.allowedTCPPorts = [ 443 ];
    };
in
{
  name = "headscale-tailnet-3node";

  # Every node must trust the self-signed control-plane certificate.
  defaults.security.pki.certificateFiles = [ "${tls-cert}/cert.pem" ];

  nodes = {
    control = {
      # (7)
      imports = [
        helloVhost
        tailscaleJoin
        headscaleControlPlane
        tlsReverseProxy
      ];
    };
    app = {
      imports = [
        helloVhost
        tailscaleJoin
      ];
    };
    client = {
      services.tailscale.enable = true;
    };
  };
  testScript = ''
    auth_key_path = "/var/lib/tailscale/preauthkey"

    start_all()
    control.wait_for_unit("headscale.service")
    control.wait_for_open_port(443)
    control.wait_for_unit("nginx.service")
    control.wait_for_unit("tailscaled.service")
    app.wait_for_unit("nginx.service")
    app.wait_for_unit("tailscaled.service")
    client.wait_for_unit("tailscaled.service")

    # In a real deployment, all servers would be freshly provisioned and
    # tailscaled-autoconnect would fail at boot because the pre-auth key
    # has not been distributed yet. The bootstrap therefore happens once
    # per deployment, in three steps:

    # 1.) the admin creates a user and an infra-wide pre-auth key.
    #     `headscale preauthkeys create` takes the numeric user ID, not
    #     the name, so we grab it from the user-creation JSON output.
    user_id = control.succeed(
        "headscale users create infra --output json | jq -r .id"
    ).strip()
    auth_key = control.succeed(
        f"headscale preauthkeys create --user {user_id} --reusable --expiration 24h "
        "--output json | jq -r .key"
    ).strip()

    # 2.) the admin ships the key via a secret-management tool (e.g. agenix)
    #     and redeploys. We emulate that by writing the key to disk with
    #     restrictive permissions and restarting the autoconnect unit:
    for s in [control, app]:
        s.succeed(
            f"umask 077 && mkdir -p $(dirname {auth_key_path}) && "
            f"echo {auth_key} > {auth_key_path}"
        )
        s.succeed("systemctl restart tailscaled-autoconnect.service")

    # 3.) all infra nodes are now members of the tailnet.

    # End-user devices (laptops, phones) usually join through the browser.
    # We emulate that flow on the CLI:
    client_key = control.succeed(
        f"headscale preauthkeys create --user {user_id} --reusable --expiration 24h "
        "--output json | jq -r .key"
    ).strip()
    client.execute(
        f"tailscale up --login-server 'https://control' "
        f"--auth-key {client_key} --hostname=client"
    )

    # Wait for an actual working tailnet path to each server.
    client.wait_until_succeeds("tailscale ping --c 1 --timeout 2s control")
    client.wait_until_succeeds("tailscale ping --c 1 --timeout 2s app")

    # Reach both nginx instances via MagicDNS over the tailnet.
    out1 = client.succeed("curl --fail --max-time 5 http://control.acme.internal/")
    assert "hello from control" in out1, f"unexpected: {out1!r}"
    out2 = client.succeed("curl --fail --max-time 5 http://app.acme.internal/")
    assert "hello from app" in out2, f"unexpected: {out2!r}"

    # nginx must NOT be reachable over the unsecured test LAN.
    client.fail("curl --fail --max-time 5 http://control/")
    client.fail("curl --fail --max-time 5 http://app/")
  '';
}
- (1) A self-signed TLS certificate: Tailscale clients require an HTTPS control server. We mint a one-shot self-signed certificate in a runCommand derivation and distribute it to every node via security.pki.certificateFiles. Setting that option once under the test's top-level defaults is enough: it is merged into every node, so all three machines trust the cert without repeating the line.
- (2) helloVhost, the tailnet-only HTTP service: A small NixOS module that defines an nginx virtual host returning "hello from <hostname>". The allow ${tailnetCidrV4}; deny all; block is the entire allow-list: only traffic with a source IP in the tailnet CIDR is served. The matching tailscale0 firewall rule keeps the port open exclusively on the tailnet interface. We will assert later that this really blocks LAN traffic.
- (3) tailscaleJoin, the client side of the mesh: Enables services.tailscale and points it at our Headscale instance via the pre-auth key on disk. extraUpFlags sets the --login-server and a stable hostname.
- (4) headscaleControlPlane, the coordination server: The Headscale service itself, the matching firewall rule for its built-in DERP relay, and the CLI tools (headscale, jq) the test script needs to bootstrap the tailnet.
- (5) The embedded DERP relay: Tailscale uses DERP relays as a fallback path when peer-to-peer NAT traversal fails. In an isolated test VLAN with no public STUN servers reachable, that fallback is the only reliable path, so we enable Headscale's built-in DERP server. We pin its IPv4 to config.networking.primaryIPAddress to short-circuit tailscaled's netcheck, which would otherwise try to resolve the test-VLAN hostname.
- (6) tlsReverseProxy, HTTPS in front of Headscale: Tailscale clients refuse to talk to an HTTP control server, so we put nginx in front of Headscale and terminate TLS with our self-signed cert. Combined with helloVhost on the same machine, control ends up running two virtual hosts: the "hello" vhost (default, port 80, tailnet only) and the control vhost (HTTPS, port 443, reverse-proxying to Headscale). The NixOS module system merges both services.nginx blocks automatically.
- (7) The composed control server: control is now a one-liner of imports: helloVhost, tailscaleJoin, headscaleControlPlane, tlsReverseProxy. Each module owns exactly one concern; the node definition reads like an inventory. app keeps only the two modules it needs, and client is plain services.tailscale.enable. See also the feature page on module composition.
How the test script flows
The testScript reads like a short runbook for a fresh deployment, which is exactly the point.
1. Bring services up
Standard machine wait helpers — wait for systemd units and the HTTPS port to be available.
2. Bootstrap the admin pre-auth key
user_id = control.succeed(
"headscale users create infra --output json | jq -r .id"
).strip()
auth_key = control.succeed(
f"headscale preauthkeys create --user {user_id} --reusable --expiration 24h "
"--output json | jq -r .key"
).strip()
for s in [control, app]:
    s.succeed(
        f"umask 077 && mkdir -p $(dirname {auth_key_path}) && "
        f"echo {auth_key} > {auth_key_path}"
    )
    s.succeed("systemctl restart tailscaled-autoconnect.service")
This is the once-per-deployment ritual: create a Headscale user, then mint a reusable pre-auth key that infra machines can use to join unattended. headscale preauthkeys create --user takes a numeric user ID rather than the name, so we grab it from the user-creation JSON output and reuse it for both the infra and the client key below. In a real cluster you would now ship this key into your secret manager (agenix, sops-nix, …) and roll out the configuration. The test emulates that step with umask 077 so the key file lands with secret-appropriate permissions, then restarts tailscaled-autoconnect — which had failed at boot when the key file did not yet exist.
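The jq filters above pull a single field out of each JSON document. The same extraction can be sketched in plain Python, which is handy if jq is not installed on the node. This is only an illustration: the helper name extract_field and both sample payloads are hypothetical, and the only fields taken from the test are .id and .key.

```python
import json


def extract_field(raw_json: str, field: str) -> str:
    """Return one top-level field of a JSON object as a stripped string."""
    doc = json.loads(raw_json)
    if field not in doc:
        raise KeyError(f"field {field!r} missing from JSON output")
    return str(doc[field]).strip()


# Hypothetical sample payloads; real headscale output carries more fields.
sample_user = '{"id": 1, "name": "infra"}'
sample_key = '{"key": "0123abcd", "reusable": true}'

user_id = extract_field(sample_user, "id")
auth_key = extract_field(sample_key, "key")
print(user_id, auth_key)
```

Raising on a missing field makes a changed output format fail loudly instead of writing an empty key file.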
3. Let the client join the human way
client_key = control.succeed("headscale preauthkeys create ...").strip()
client.execute(f"tailscale up --login-server 'https://control' --auth-key {client_key} ...")
End-user devices normally join through a browser flow — but it is the same tailscale up underneath. We use a second pre-auth key to keep the test deterministic.
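Because the key is interpolated into a shell command, it can be worth quoting each argument. The following is a minimal sketch, not part of the original test: the helper name tailscale_up_command and the sample key are hypothetical, while the three flags are the ones the test already passes.

```python
import shlex


def tailscale_up_command(login_server: str, auth_key: str, hostname: str) -> str:
    """Build a shell-safe `tailscale up` command line from its arguments."""
    parts = [
        "tailscale", "up",
        "--login-server", login_server,
        "--auth-key", auth_key,
        f"--hostname={hostname}",
    ]
    # shlex.quote leaves plain tokens untouched and quotes anything risky.
    return " ".join(shlex.quote(part) for part in parts)


# Hypothetical key value, for illustration only.
cmd = tailscale_up_command("https://control", "abc123", "client")
print(cmd)
```

In the test script, the resulting string could be passed to client.execute or client.succeed unchanged.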
4. Wait for a working tailnet path
client.wait_until_succeeds("tailscale ping --c 1 --timeout 2s control")
client.wait_until_succeeds("tailscale ping --c 1 --timeout 2s app")
tailscale ping does not use ICMP: it pings over the tailnet itself, including DERP fallback. A successful tailscale ping is the strongest "the mesh is actually working" signal you can get without sending application traffic.
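The retry-until-success idea behind this step can be sketched as a small polling loop. This is an illustrative stand-in, not the test driver's actual implementation: the function wait_until, its defaults, and the fake check that fails twice are all assumptions made for the example.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 5.0, interval: float = 0.01) -> int:
    """Poll check() until it returns True and report how many attempts it took."""
    deadline = time.monotonic() + timeout
    attempts = 0
    while True:
        attempts += 1
        if check():
            return attempts
        if time.monotonic() >= deadline:
            raise TimeoutError(f"check still failing after {attempts} attempts")
        time.sleep(interval)


# Stand-in for a ping that fails twice while the mesh converges.
results = iter([False, False, True])
attempts = wait_until(lambda: next(results))
print(attempts)  # 3
```

Polling with a deadline is what lets the test tolerate the few seconds a mesh needs to converge without hard-coding a sleep.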
5. Hit the services via MagicDNS
out1 = client.succeed("curl --fail --max-time 5 http://control.acme.internal/")
assert "hello from control" in out1, f"unexpected: {out1!r}"
MagicDNS resolves the hostnames inside the tailnet. The fact that curl succeeds — and that the response really came from the right host — closes the loop.
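The two checks follow one pattern: build the MagicDNS name from the hostname and the base domain, then confirm the greeting names the same host. A hedged sketch of that pattern, with hypothetical helper names and the base domain taken from the test's dns.base_domain setting:

```python
BASE_DOMAIN = "acme.internal"  # matches dns.base_domain in the test


def magicdns_url(hostname: str) -> str:
    """URL of a host's HTTP service as seen through MagicDNS."""
    return f"http://{hostname}.{BASE_DOMAIN}/"


def response_matches(hostname: str, body: str) -> bool:
    """True if the response body identifies the host we asked for."""
    return f"hello from {hostname}" in body


for host in ["control", "app"]:
    print(magicdns_url(host))
```

Checking the hostname in the body guards against a misrouted name that still returns HTTP 200 from the wrong machine.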
6. Verify the LAN is not a backdoor
client.fail("curl --fail --max-time 5 http://control/")
client.fail("curl --fail --max-time 5 http://app/")
This is the subtle but important part. The test driver gives every node a regular VLAN where control and app resolve to LAN IPs. Port 80 is opened only on tailscale0 and nginx serves only sources inside the tailnet CIDR, so a request over the LAN has to fail. If those restrictions were misconfigured, curl http://control/ would also succeed and the tailnet would not actually be enforcing anything. Asserting .fail() here is what turns "the services are reachable" into "the services are reachable only through the tailnet".
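The decision nginx makes for each request can be modelled in a few lines. nginx evaluates allow and deny rules in order and applies the first match, so the sketch below does the same. It is a simplified, IPv4-only model, and the sample addresses (including the 192.168.1.2 LAN address) are hypothetical.

```python
import ipaddress

# Ordered like the nginx block: "allow 100.64.0.0/10; deny all;"
RULES = [
    ("allow", ipaddress.ip_network("100.64.0.0/10")),
    ("deny", ipaddress.ip_network("0.0.0.0/0")),  # IPv4 stand-in for "deny all"
]


def is_allowed(source_ip: str) -> bool:
    """Apply the first matching rule, as nginx does for allow/deny."""
    addr = ipaddress.ip_address(source_ip)
    for action, network in RULES:
        if addr in network:
            return action == "allow"
    return True  # no rule matched: nginx allows by default


print(is_allowed("100.64.0.1"))   # True: inside the tailnet CIDR
print(is_allowed("192.168.1.2"))  # False: a LAN address
```

The /10 prefix spans 100.64.0.0 through 100.127.255.255, so an address such as 100.128.0.1 is rejected as well.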
Related references
- Tailscale: the mesh VPN built on top of WireGuard. This test exercises real Tailscale clients against a self-hosted control plane.
- Headscale: open-source, self-hosted implementation of the Tailscale coordination server. The GitHub project has the full feature matrix.
- The reference manual chapter for the framework that runs this test. Worth a bookmark.
- Multi-node and multi-network tests: the foundation tutorial for tests that orchestrate several machines on isolated VLANs.
- Build the test as .driverInteractive and step through it from an IPython prompt: tailscale status, tailscale netcheck, and friends, on the running VMs.
- Connecting to nodes interactively: when a mesh refuses to converge, the fastest path forward is a shell on the affected node.
- The helloVhost/tailscaleJoin reusable modules in this test follow the standard NixOS module pattern, the same as nixpkgs itself.