Homelab on RevoluGame

Designing a Homelab Backup Strategy I Can Actually Trust

Tue, 23 Jun 2026 10:00:00 +0100

Most homelab diagrams start with the fun parts: the NAS, the containers, the dashboards, the automations, the small machines doing useful little jobs around the house. Backup diagrams are usually less glamorous. A few arrows to a NAS, maybe one more arrow to the cloud, and the comforting feeling that important files probably exist in more than one place.

In short: I want the NAS to be the local backup hub, Synology DSM to handle snapshots and Hyper Backup jobs, USB disks to provide an offline copy, cloud storage to provide encrypted offsite history, and Prometheus plus ntfy to tell me when the system stops doing its job.

That word, “probably”, is the problem.

I do not want a backup strategy that looks reassuring in a diagram. I want one that answers boring, specific questions: what happens if I delete a folder by mistake? What happens if the NAS dies? What happens if ransomware encrypts a mounted share? What happens if storage gets corrupted despite the UPS and clean shutdown path? What happens if I have to rebuild the whole thing on different hardware?

So this is the target architecture I want my homelab backups to move toward: not just more copies, but copies that fail for different reasons, are encrypted where they leave the house, and are tested often enough that “restore” is not a theory.

The rule behind the design

The common version is the 3-2-1 rule: keep three copies, on two different types of media, with one copy offsite. For a homelab, I think the more useful target is closer to 3-2-1-1-0:

three copies of important data
two different storage types or systems
one offsite copy
one offline or immutable copy
zero untested restores

The last two matter more than they look. A cloud sync is useful, but it is not the same thing as an offline backup. If a bad script deletes a directory and that deletion syncs perfectly to the cloud, the cloud did its job and I still lost the data. Likewise, a backup I have never restored from is mostly a hope with timestamps.
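One way to keep the rule from staying aspirational is to write down each copy of a dataset and check the set against the five numbers. This is a minimal sketch, not a feature of any backup tool: the `Copy` record, its field names, and the media labels are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Copy:
    """One copy of a dataset. Every field here is an illustrative assumption."""
    name: str
    media: str                    # e.g. "nas", "cloud", "usb"
    primary: bool = False         # the live data itself counts as one copy
    offsite: bool = False
    offline: bool = False         # offline or immutable
    restore_tested: bool = False


def check_32110(copies: List[Copy]) -> List[str]:
    """Return the parts of 3-2-1-1-0 that this set of copies fails (empty = pass)."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than three copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than two storage types")
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    if not any(c.offline for c in copies):
        problems.append("no offline or immutable copy")
    # Only backups need a restore test; the primary copy is what gets restored.
    untested = [c.name for c in copies if not c.primary and not c.restore_tested]
    if untested:
        problems.append("untested restores: " + ", ".join(untested))
    return problems
```

Fed a NAS share plus a cloud sync, this reports three failures at once: too few copies, nothing offline, and an untested restore. That is exactly the "cloud sync is not a backup" situation described above.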

The goal is not to back up everything with the same level of paranoia. The goal is to classify data by how painful it would be to lose, then give each class the right recovery path.

Layer	Job	In this setup
Local hub	Fast recovery and one place to collect backups	Synology DSM
Snapshots	Quick rollback from mistakes	DSM Snapshot Replication
Offsite	Survive local loss	Encrypted Hyper Backup to cloud
Offline	Survive compromised or damaged online copies	Rotated USB disks
Monitoring	Notice broken backup jobs	Prometheus and ntfy
Restore tests	Prove the plan works	Scheduled restores from NAS, cloud, and USB

What needs protecting

In my setup, the important things fall into a few buckets.

Irreplaceable data is the obvious one: documents, photos, personal notes, scanned paperwork, source repositories, and anything else that cannot be recreated from a package manager or a public download.

Service state is the data that makes self-hosted apps mine: Docker bind mounts, named volumes, databases, Home Assistant backups, Gitea repositories, application config, and the little bits of state that are easy to forget until a restore fails without them.

Rebuild information is everything needed to reconstruct the machines: compose files, .env files, systemd units, NUT configuration, firewall notes, package lists, and the “why is this weird thing configured this way?” documentation that future me will absolutely need.

Convenience data is useful but not precious: media files, caches, generated reports, downloads, and anything I would be annoyed to lose but not devastated by.

Those buckets should not all get the same policy. Photos deserve versioned, offsite, offline protection. A container image cache does not.

Data class	Examples	Backup policy
Irreplaceable	photos, documents, notes, source repositories	NAS, snapshots, encrypted cloud, offline USB
Service state	Home Assistant, Gitea, app data, databases	app-aware export to NAS, then cloud and USB
Rebuild information	compose files, `.env` references, NUT config, systemd units	Git where safe, NAS backup for secrets and local-only files
Convenience	media, downloads, generated reports	NAS if useful, lower retention, no drama
Ephemeral	caches, container images, build artifacts	usually not backed up

The NAS is the hub, not the backup strategy

The NAS is the center of the design because it is the easiest place for every machine to send backups. In my case that NAS is a Synology running DSM, which gives me a few useful primitives out of the box: shared folders, Time Machine support, snapshots, notifications, USB disk handling, and Hyper Backup for versioned backup jobs. Home Assistant can push scheduled backups to it. Macs can use it as a Time Machine target. Linux machines can send restic, borg, kopia, or plain snapshot artifacts to it. Docker hosts can dump databases and copy application state to it.

But the NAS is not the strategy by itself. It is just the first aggregation point.

The layout I want is explicit enough that each source has its own place:

/backups/
 home-assistant/
 macos-time-machine/
 ubuntu/
 docker/
 databases/
 native-apps/
 sentinel/
 gitea/
 restore-tests/

That structure matters less than the habit behind it: every source should have an obvious owner, an obvious schedule, and an obvious restore procedure. If I cannot tell what created a backup or how to use it, the backup is already weaker than it looks.

On DSM, I want Snapshot Replication enabled for the important shared folders where it is available. Snapshots are not a substitute for backup, because they live on the same system, but they are excellent for fast recovery from accidental deletion, bad sync jobs, and “I changed this file yesterday and now regret it” moments.

Docker apps need app-aware backups

Backing up Docker compose files is necessary, but not sufficient. A compose file tells me how to start the container; it does not necessarily contain the application state.

For each Docker app, I want four things backed up:

the compose file
the environment file or secret reference
bind-mounted application data or named volumes
database dumps created by the database itself

That last point is where a lot of homelab backups get fragile. Copying a live database directory may work until the day it does not. For Postgres, MariaDB, SQLite-backed apps, and similar systems, the backup job should either use the application/database’s recommended export mechanism or stop/quiesce the service before taking the copy.

In practice, the pattern should be boring:

prepare app -> dump database -> snapshot/copy data -> send to NAS -> verify artifact

The restore procedure should be just as boring:

create clean app directory -> restore compose/env -> restore data -> import database -> start app

If I cannot write that procedure down for an app, I do not really have a backup of that app yet.
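The "dump database" and "verify artifact" steps of that pattern can be sketched for a Postgres container as follows. This is a hedged example, not a description of my actual jobs: the container, database, and user names are whatever the caller passes in, the `.partial` suffix is my own convention, and the only verification shown is "the file exists, is not empty, and has a recorded checksum".

```python
import hashlib
import subprocess
from pathlib import Path
from typing import List


def pg_dump_command(container: str, database: str, user: str) -> List[str]:
    """Build a `docker exec ... pg_dump` call that writes a custom-format dump to stdout."""
    return ["docker", "exec", container,
            "pg_dump", "-U", user, "--format=custom", database]


def dump_database(container: str, database: str, user: str, target: Path) -> Path:
    """Stream the dump to a temporary name and rename it only if pg_dump succeeded."""
    target.parent.mkdir(parents=True, exist_ok=True)
    partial = target.with_name(target.name + ".partial")  # hypothetical naming convention
    with partial.open("wb") as out:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(pg_dump_command(container, database, user),
                       stdout=out, check=True)
    partial.replace(target)
    return target


def verify_artifact(path: Path, min_bytes: int = 1) -> str:
    """Reject missing or suspiciously small artifacts, then return a SHA-256 hex digest."""
    if not path.is_file():
        raise FileNotFoundError(path)
    size = path.stat().st_size
    if size < min_bytes:
        raise ValueError(f"{path} is only {size} bytes")
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Writing to a temporary name first means a failed dump never overwrites yesterday's good artifact. The checksum only proves the file arrived intact; it says nothing about whether the dump can be imported, which is what the restore procedure is for.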

Home Assistant gets its own lane

Home Assistant OS already has a good backup concept, so I do not want to fight it. The ideal version is simple:

scheduled Home Assistant backups
automatic copy to the NAS
NAS backup copied onward to cloud and offline storage
occasional restore into a test VM or spare install

The last item is the important one. Home Assistant is full of integrations, devices, add-ons, secrets, and local assumptions. A backup file existing on disk is not the same thing as knowing that I can restore it and have the house come back in a sane state.

For this category, I care less about elegant tooling and more about a tested recovery note: where the backup lives, what credentials I need, what device integrations might need manual attention, and how I know the restore worked.

The small machines count too

It is easy to forget the little infrastructure boxes because they do not feel like data stores. My NUT server is a good example. If it disappeared, I could probably rebuild it from memory, but “probably” is exactly what this strategy is trying to remove.

For small utility machines, I want a lightweight backup of:

/etc files specific to the service
systemd units and timers
scripts
package list or install notes
any local state that is not disposable

For something like a NUT server, that means backing up /etc/nut/, notification scripts, and service overrides to the NAS, while also keeping the non-secret parts in Git. The backup does not need to be large. It just needs to make rebuilds boring.
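A lightweight config backup like that can be little more than a tarball with a list of what was missing. The sketch below assumes the caller supplies the paths; `/etc/nut` comes from the text above, while any script or unit paths would be specific to the machine.

```python
import tarfile
from pathlib import Path
from typing import Iterable, List


def archive_configs(paths: Iterable[Path], archive: Path) -> List[str]:
    """Write the paths that exist into a gzip tarball and return the ones that were missing."""
    missing = []
    archive.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive), "w:gz") as tar:
        for path in paths:
            if path.exists():
                # Store names without the leading slash so a restore can unpack anywhere.
                tar.add(str(path), arcname=str(path).lstrip("/"))
            else:
                missing.append(str(path))
    return missing
```

Returning the missing paths instead of silently skipping them matters: a renamed config directory should show up as a warning in the job report, not as a smaller archive nobody notices.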

Gitea is not just “on the Mac”

Time Machine is good for recovering a Mac. It is not automatically a good application-level backup for every service that happens to run on that Mac.

For Gitea, I want a dedicated backup path: repositories, database, app.ini, custom templates, LFS data if used, and the pieces that make Git-over-SSH work. In my case SSH is enabled through Gitea’s built-in SSH server, so the restore procedure needs to account for Gitea’s SSH host keys, the configured SSH port, and the user/container mapping that lets Git operations reach the right repositories. Gitea has its own dump command, and that should be part of the plan rather than relying only on a filesystem-level Mac backup.

The reason is simple: restoring the web UI is not the same thing as restoring the developer workflow. If the repositories and database come back but every remote now fails on git push, the backup is incomplete.

The nice property of an app-native Gitea backup is that it can be restored somewhere else. That is the bar I care about. If the Mac dies, I should be able to bring Gitea up on another machine without first resurrecting the Mac exactly as it was.

Cloud backup should be encrypted and versioned

The cloud copy should not be a raw mirror of the NAS. It should be an encrypted, versioned backup repository.

The exact tool matters less than the properties:

client-side encryption before data leaves home
versioned snapshots
retention policy
integrity checks
credentials with the smallest practical permissions
restore procedure documented outside the backup itself

restic, borg, kopia, and similar tools all fit this model better than a blind sync. Since the NAS is Synology DSM, Hyper Backup is also a natural option here: it can send versioned, encrypted backups to cloud providers, rsync destinations, another Synology, or local USB storage. The important part is not the brand of tool, but that the cloud target is a backup repository with history, not just a synchronized copy of today’s mistakes.
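To make "retention policy" concrete, here is a small sketch of the keep-daily, keep-weekly, keep-monthly idea that these tools implement in their own ways. It is not the algorithm of restic, borg, kopia, or Hyper Backup; the function name and the default counts are assumptions chosen for illustration.

```python
from datetime import datetime
from typing import Iterable, List, Set


def select_keep(snapshots: Iterable[datetime], daily: int = 7,
                weekly: int = 4, monthly: int = 6) -> List[datetime]:
    """Keep the newest snapshot in each of the most recent N days, ISO weeks, and months."""
    ordered = sorted(snapshots, reverse=True)  # newest first
    rules = [
        (daily, lambda t: t.date()),
        (weekly, lambda t: tuple(t.isocalendar())[:2]),  # (ISO year, ISO week)
        (monthly, lambda t: (t.year, t.month)),
    ]
    keep: Set[datetime] = set()
    for limit, bucket_of in rules:
        seen = set()
        for snap in ordered:
            if len(seen) >= limit:
                break
            bucket = bucket_of(snap)
            if bucket not in seen:
                # First hit while walking newest-first is the newest snapshot in the bucket.
                seen.add(bucket)
                keep.add(snap)
    return sorted(keep, reverse=True)
```

Everything not returned is a candidate for pruning. The useful property is that history thins out with age instead of being cut off at a fixed date, which is what separates a versioned repository from a mirror.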

A local restore should not need the cloud provider, and a cloud restore should not need the local NAS. If both are required at the same time, the design has a hidden coupling.

USB disks are for offline recovery

The USB disk is not there because I enjoy plugging in drives. It is there because offline storage survives a different class of failures.

On DSM, this is a good fit for a Hyper Backup task targeting an external USB disk. An ideal USB backup flow looks like this:

plug in disk -> run backup -> verify -> unmount -> physically disconnect

Even better, rotate two disks: one at home, one somewhere else. That is less convenient than a permanently attached drive, but convenience is not the job of this copy. Its job is to be unreachable when a compromised machine, broken script, or accidental deletion tries to destroy everything it can see.

This is the copy I want if the NAS and cloud repository are both logically damaged. Not because that is likely, but because that is the kind of failure that makes every online copy suspect at the same time.

A second NAS is useful if it changes the failure mode

An offsite NAS would be a good future upgrade, but only if it is not just another always-mounted destination with a different hostname.

The best version is pull-based: the offsite NAS connects in, pulls encrypted backup artifacts, and stores them with its own retention. That way, if the primary NAS is compromised, it cannot trivially reach out and delete the offsite copy with the same credentials it uses for normal backups.

If that is too much complexity, cloud plus rotated USB disks may be a better tradeoff. The point is not to collect backup destinations. The point is to avoid shared failure modes.

Monitoring is part of the backup system

A backup job that fails silently for three months is not a backup job. It is a delayed surprise.

Every recurring backup should report somewhere when it succeeds and when it fails. In my homelab, that means Prometheus for machine-readable state and history, and ntfy for the human-facing “you need to look at this” notification. The tooling is less important than the invariant: if a backup stops running, I should find out before I need it.

DSM’s own notifications should be part of this too. If a Hyper Backup task fails, a USB disk is not mounted, a volume degrades, or a snapshot job stops running, that should end up in the same alerting path as the rest of the homelab health checks.

The signal I want from each job is small:

last successful run
duration
size or number of changed files
destination
verification status
retention/prune result

Those fields are enough to spot most weirdness: a job that stopped running, a backup that suddenly became tiny, a cloud upload that never finished, or a prune operation that failed and left the repository growing forever.
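A sketch of how a job could publish some of those fields for Prometheus, assuming the output is written to a `.prom` file that node_exporter's textfile collector picks up. The metric names, the label names, and the staleness threshold are hypothetical; only the text exposition format itself is standard.

```python
import time
from typing import Optional


def render_backup_metrics(job: str, destination: str, last_success_ts: float,
                          duration_s: float, size_bytes: int, verified: bool) -> str:
    """Render one job's status in the Prometheus text exposition format."""
    # "backup_job" rather than "job", which Prometheus already uses for scrape targets.
    labels = f'backup_job="{job}",destination="{destination}"'
    rows = {
        "backup_last_success_timestamp_seconds": last_success_ts,
        "backup_duration_seconds": duration_s,
        "backup_size_bytes": size_bytes,
        "backup_verified": 1 if verified else 0,
    }
    lines = []
    for name, value in rows.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name}{{{labels}}} {value}")
    return "\n".join(lines) + "\n"


def is_stale(last_success_ts: float, max_age_s: float,
             now: Optional[float] = None) -> bool:
    """True if the last successful run is older than the allowed age."""
    now = time.time() if now is None else now
    return now - last_success_ts > max_age_s
```

Exporting the timestamp of the last success, rather than a success/failure flag, is the design choice that catches the job that silently stopped running: an alert on "timestamp older than N hours" fires even when the job never gets far enough to report a failure.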

Restore tests make it real

This is the part I most want to make non-optional.

I want a small restore calendar:

monthly: restore a random document or photo from NAS and cloud
quarterly: restore a Docker app into a temporary directory or VM
quarterly: restore a Home Assistant backup into a test instance
yearly: simulate losing the NAS and recover the most important data from cloud or USB

The test does not have to be dramatic. It just has to be real. A restored file should open. A restored database should start. A restored Home Assistant instance should boot far enough to prove the backup is usable.

The restore notes should live somewhere I can reach during an outage. A recovery plan stored only on the NAS it is meant to recover is a joke with excellent formatting.
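The monthly "restore a random file" test can be partly mechanical. The sketch below picks a random file from a source tree and checks that a restored copy is byte-identical. It assumes the restore itself was done by the backup tool into `restored_root`, and that the chosen file has not changed since the backup ran; both directory names are placeholders.

```python
import hashlib
import random
from pathlib import Path
from typing import Optional


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large photos do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pick_sample(source_root: Path, rng: Optional[random.Random] = None) -> Path:
    """Pick one random file under source_root, as a path relative to it."""
    files = sorted(p for p in source_root.rglob("*") if p.is_file())
    if not files:
        raise ValueError(f"no files under {source_root}")
    return (rng or random).choice(files).relative_to(source_root)


def restore_matches(source_root: Path, restored_root: Path, relative: Path) -> bool:
    """True if the restored copy exists and is byte-identical to the source file."""
    restored = restored_root / relative
    return restored.is_file() and sha256_of(restored) == sha256_of(source_root / relative)
```

A matching hash is a floor, not the whole test. A restored file should still be opened, and a restored database should still be started, before the result counts.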

The target architecture

The architecture I want to end up with looks like this:

Mac
 -> Time Machine -> NAS
 -> Gitea app backup -> NAS

Ubuntu
 -> compose/config backup -> NAS
 -> app data + database dumps -> NAS

Home Assistant OS
 -> scheduled backups -> NAS

Sentinel / NUT server
 -> config + scripts -> NAS

NAS
 -> DSM Snapshot Replication for important shares
 -> Hyper Backup encrypted cloud backup
 -> Hyper Backup rotated offline USB backup
 -> optional offsite NAS pull backup

Monitoring
 <- every backup job reports status

Restore tests
 <- periodically restore from NAS, cloud, and USB

It is not the most exotic setup. That is a feature. The best backup strategy for a homelab is one I will actually maintain when nothing is on fire.

My checklist before calling this done

Before I consider this strategy real, I want to be able to check off the following:

every important service has a documented restore procedure
every database has an app-aware dump or quiesced backup
Home Assistant backups are copied beyond the Home Assistant machine
Gitea has an app-native backup, not just Time Machine coverage
small utility machines have their service configs backed up
DSM snapshots are enabled for important shared folders
Hyper Backup cloud jobs are encrypted, versioned, and periodically verified
at least one Hyper Backup USB target exists and is disconnected after backup
backup jobs report success and failure somewhere visible
restore tests happen on a schedule
recovery notes are available without depending on the NAS

The real lesson is that “where do I copy this?” is the wrong first question. The better question is: “which failure is this copy supposed to survive?”

Once each backup has a job, the architecture becomes easier to reason about. The NAS is for fast local recovery. The cloud is for offsite encrypted history. The USB disk is for offline survival. App-native exports are for portable restores. Monitoring is for noticing when the whole system quietly stops doing its job.

And restore tests are what turn the drawing from a comforting picture into a system I can actually trust.

My homelab stack in 2026: what runs, why, and how it all connects

Sat, 20 Jun 2026 10:00:00 +0100

I’m not going to make the case for self-hosting here. If you’re reading this, you already get it. What I want to do instead is be honest about what I actually run, why I made the specific choices I made, and - more interestingly - how the pieces talk to each other in ways that weren’t always planned from the start.

In short: my 2026 homelab runs on a mix of Raspberry Pis, a Mac Mini, and an Ubuntu mini PC. Traefik handles routing, Tailscale provides remote access, Prometheus watches the stack, Ntfy connects notifications, and local AI runs through Ollama and OpenWebUI.

The stack runs across four machines. The bulk of Docker workloads are split between a Mac Mini and an Ubuntu mini PC. Network infrastructure - Traefik and CoreDNS - runs on a Raspberry Pi 4, which also handles the NUT server for UPS management; keeping the network layer on its own always-on hardware means a container crash elsewhere doesn’t take down routing or DNS. Home Assistant runs on a Raspberry Pi 5 with the Hailo 8 AI HAT. Frigate runs alongside it as a Home Assistant add-on, which is what gives it direct access to the Hailo 8 for hardware-accelerated camera object detection - no GPU needed in the main machines for that workload.

The mental model that makes sense of the whole thing: everything gets served through Traefik, everything ships metrics to Prometheus, and Ntfy acts as the notification bus that ties async events together. Most of the rest are applications that plug into those three rails.

Category	Tools
Network	Tailscale, Traefik, CoreDNS
Dev & CI/CD	Gitea, GitHub, Woodpecker CI, Docker Registry, WUD
AI	Ollama, OpenWebUI
Search	SearXNG
Documents	Paperless-NGX, Paperless-AI
Passwords	Vaultwarden
Monitoring & management	Prometheus, Portainer, Homer
Notifications & sharing	Ntfy, Pairdrop
Home automation	Home Assistant, Frigate

graph TD
 subgraph pi4["Pi 4 — Network layer"]
 TS[Tailscale]
 TR[Traefik]
 DNS[CoreDNS]
 end

 subgraph main["Mac Mini + Ubuntu — Services"]
 GIT[Gitea]
 WP[Woodpecker CI]
 REG[Docker Registry]
 WUD[WUD]
 OLLAMA[Ollama]
 OWU[OpenWebUI]
 SEARX[SearXNG]
 PAI[Paperless-AI]
 PAPER[Paperless-NGX]
 NTFY[Ntfy]
 end

 subgraph pi5["Pi 5 — Home automation"]
 HA[Home Assistant]
 FRIGATE[Frigate]
 end

 TS --> TR
 GIT --> WP --> REG
 WUD -->|alert| NTFY
 OLLAMA --> OWU
 OLLAMA --> PAI --> PAPER
 OWU -->|search| SEARX
 FRIGATE --> HA
 HA -->|notify| NTFY

The foundation

Traefik

Everything HTTP goes through Traefik. It’s the reverse proxy in front of every Docker-hosted service, and the main reason I chose it over Nginx or Caddy is Docker-native autodiscovery. When I bring up a new container with the right labels, it appears behind a subdomain with automatic TLS, no config file reload required. That removes enough friction that I’m less tempted to leave things running unproxied on bare ports.

Traefik handles Let’s Encrypt certificate issuance and renewal. Services that aren’t on the public internet use a DNS-01 challenge, so they get valid certs without being exposed to the web. The rest of the stack is effectively Traefik labels plus container configs all the way down.
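For readers who have not used the label approach, this is roughly what "the right labels" amount to for one service. The label keys follow Traefik's Docker provider conventions, but the entrypoint name `websecure`, the resolver name `letsencrypt`, and the example hostname are assumptions that depend on the static configuration, not values taken from my setup.

```python
from typing import Dict


def traefik_labels(name: str, host: str, port: int,
                   cert_resolver: str = "letsencrypt",
                   entrypoint: str = "websecure") -> Dict[str, str]:
    """Build the container labels that put one service behind a TLS-enabled router."""
    router = f"traefik.http.routers.{name}"
    return {
        "traefik.enable": "true",
        f"{router}.rule": f"Host(`{host}`)",
        f"{router}.entrypoints": entrypoint,        # assumed entrypoint name
        f"{router}.tls.certresolver": cert_resolver,  # assumed resolver name
        f"traefik.http.services.{name}.loadbalancer.server.port": str(port),
    }
```

Calling `traefik_labels("searxng", "search.home.example", 8080)` yields the five key/value pairs that would go under `labels:` in a compose file. In practice they are usually written by hand; the function only makes the naming pattern visible.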

Tailscale

Tailscale is how every machine in the stack is reachable from outside the local network. All four machines are on the same Tailnet, which means I can reach any service from anywhere without opening ports or maintaining a VPN server.

📝 Note

Tailscale isn’t fully self-hosted: the coordination server is Tailscale’s. The traffic itself is end-to-end encrypted and normally flows directly between devices, falling back to Tailscale’s relay servers when a direct connection isn’t possible, but key distribution and coordination go through their infrastructure. For a homelab, that trade-off is easy to accept; for stricter control, Headscale is the self-hosted alternative.

CoreDNS

CoreDNS handles local name resolution. All internal subdomains resolve to the local machine without ever leaving the network, which means Traefik’s label-based routing is actually usable on every device without editing hosts files or relying on split-horizon DNS from the router. CoreDNS sits upstream of the system resolver and forwards anything it doesn’t own to the public DNS of choice. It’s invisible when it works, which is most of the time.

Dev & CI/CD: the pipeline

This is where the most deliberate architecture lives, because I wanted something that felt like a real deployment pipeline rather than manually building and copying images around.

Gitea

Gitea is where all private repositories live: infra configs, personal projects, anything I don’t want on a third-party server. Public projects still go to GitHub because that’s where the audience is, but a number of those GitHub repositories are mirrored back to Gitea as a local backup. The split is simple: Gitea for control and resilience, GitHub for reach.

Woodpecker CI

Woodpecker CI is the pipeline runner, hooked directly into Gitea webhooks. Push to a branch, a pipeline runs. The config lives as a .woodpecker.yml at the repo root, which means pipeline definitions are versioned alongside the code. Woodpecker builds Docker images and pushes them to a local registry on the same host.

Local Docker registry

A plain Docker registry container, served behind Traefik. Woodpecker pushes here; docker compose pulls from here. Keeping images local means builds are fast, there are no rate limits, and nothing depends on an external registry being up. Not sophisticated. Exactly as much complexity as needed.

WUD - What’s Up Docker

WUD watches running containers and detects when upstream image versions are newer than what’s deployed. It doesn’t auto-update anything in my setup; I use it as a detection layer. When it spots a new version, it fires a notification through Ntfy, which I’ll get to shortly. The result is that upstream updates surface as notifications I can act on rather than surprises I discover when something breaks.

Local AI: Ollama + OpenWebUI

I run Ollama for local model inference. The main draws are the obvious ones: no data leaves the machine, no per-token cost, models available offline. Local models cover summarization, document classification, and general Q&A: tasks where a smaller model is good enough and keeping data local matters.

Code assistance is the clear exception: small models aren’t reliable enough there, so that goes to cloud APIs - Claude or Codex depending on the task. OpenWebUI makes the split seamless: it sits in front of Ollama but also accepts API keys for cloud providers. In practice I open one interface and pick a model from a dropdown rather than switching tools. Local by default, cloud when it’s actually worth it.

OpenWebUI also connects to SearXNG as a search tool, which means it can pull current information without phoning home to a commercial search provider.

Search: SearXNG

SearXNG is my default search engine on all devices. It’s a meta-search engine: it queries multiple sources and aggregates the results. The key property, though, is that queries don’t get tied to an account or used to build a profile. Results are good enough for 95% of searches, and for the other 5% I have a single click to fall back to whatever source I want.

The setup is minimal: one container behind Traefik, set as the default search engine in the browser. It’s one of those things that took twenty minutes to deploy and I’ve never thought about since.

Documents: Paperless-NGX + Paperless-AI

Paperless-NGX handles all document management. Scan or drop a PDF into the inbox, it gets OCR’d, indexed, and stored. The tagging and correspondent system means I can find anything in seconds rather than digging through a folder hierarchy.

On top of that I run Paperless-AI, which hooks into the Paperless-NGX API and uses a local Ollama model to automatically suggest tags, correspondents, titles (and various other custom properties) as documents come in. This closes a loop with the AI section: Ollama isn’t just a chat model, it’s doing practical classification work for real files. The whole thing runs locally, so no document content touches an external service.

Password management: Vaultwarden

Vaultwarden is a self-hosted Bitwarden-compatible server. All credentials live locally, sync across devices through the standard Bitwarden clients, and never touch a third-party server. It’s one of those services where the self-hosted case is unusually strong: the upstream Bitwarden clients are excellent, Vaultwarden is a drop-in replacement, and keeping your password vault on your own hardware removes a meaningful point of trust.

💡 Tip

A password vault is the one thing you cannot lose and cannot recover from a partial backup. I wrote a dedicated post about the backup architecture covering how I handle this specifically for Vaultwarden.

Monitoring & management

Prometheus

Prometheus scrapes metrics from the stack. Node exporter covers the host, cAdvisor covers containers, and individual services expose their own endpoints where supported. The main value isn’t dashboards (though those exist) - it’s having a queryable record of system state over time, and a place to hook alerts when something drifts.

Portainer

Portainer is the visual management layer. I use it for quick container inspection, pulling logs, and managing stacks without SSHing in every time. It doesn’t replace Prometheus; they have different jobs. Prometheus tells me what happened and when; Portainer tells me what’s running right now and lets me poke at it.

Homer

Homer is a static dashboard - a single page with links to every service, configured via a YAML file. It’s the least technical piece in the stack and the one my family actually uses. Rather than memorizing subdomains or digging through bookmarks, everyone has Homer as a home page: a clean grid of icons that opens whatever they need. The split between Portainer and Homer is intentional - Portainer is for me, Homer is for everyone else.

Ntfy

Ntfy is the thread that ties the whole async event model together. It’s a self-hosted push notification server. HTTP POST to a topic, and every subscribed client gets a notification. Woodpecker sends build results here. WUD sends image update alerts here. Home Assistant sends automation notifications here. Having one place where things send notifications means I can manage subscriptions in one app and stop checking dashboards compulsively.

The pattern is simple enough that anything can use it: if a script or service needs to tell me something happened, it makes an HTTP request. No SDK, no auth complexity, just a POST.
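As a sketch of that pattern using only the standard library: the message is the request body, and optional metadata travels in headers such as `Title` and `Priority`. The server URL and topic name in the usage note are placeholders, and the example assumes a topic that accepts unauthenticated publishes.

```python
import urllib.request


def build_ntfy_request(base_url: str, topic: str, message: str,
                       title: str = "", priority: str = "default") -> urllib.request.Request:
    """Build the POST that publishes one message to an ntfy topic."""
    headers = {"Priority": priority}
    if title:
        headers["Title"] = title  # keep header values ASCII; the body can be any UTF-8
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{topic}",
        data=message.encode("utf-8"),
        headers=headers,
        method="POST",
    )


def publish(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A script would call something like `publish(build_ntfy_request("https://ntfy.example.lan", "homelab", "Nightly dump finished"))`. Splitting "build" from "send" keeps the part worth testing free of network access.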

Pairdrop

Pairdrop is AirDrop for the local network: any device on the LAN can discover others and transfer files peer-to-peer through the browser. No account, no cloud relay, no app install. I use it constantly for moving files between phone, laptop, and desktop without thinking about it.

Home automation: Home Assistant

Home Assistant is the one thing in the stack that runs on bare OS rather than Docker. It’s Home Assistant OS on a dedicated Raspberry Pi 5, which is a deliberate choice: the add-on ecosystem, hardware device support, and the supervisor layer all work better outside a container. Trying to run it in Docker introduced enough friction with USB devices and networking that the clean answer was to give it its own machine.

Frigate runs as a Home Assistant add-on rather than a standalone container, and that placement matters: it gives Frigate direct access to the Hailo 8 AI HAT on the Pi 5 for hardware-accelerated object detection on camera streams. Running it as an add-on keeps the integration tight and avoids the networking gymnastics that come with trying to expose a hardware accelerator across container boundaries.

It integrates back into the rest of the stack through Ntfy: automations that need to notify me fire an HTTP call to the same notification server everything else uses. One less thing to configure separately.

What’s next

One thing I’m actively removing is n8n. On paper it’s a great fit — nice UI, trivial to set up, an enormous library of nodes. In practice, for the automations I actually run, it’s massively oversized. A tool that big has a way of becoming its own maintenance surface, and when I look at what I’m using it for, most of it is simple enough to replace with a script and an HTTP call to Ntfy. Sometimes the right answer is less, not more.

The stack has been stable enough that I’ve been iterating on individual services rather than adding new ones. The pieces that took the most time to get right were the CI/CD pipeline (getting Woodpecker, the local registry, and WUD to work as a coherent unit) and Paperless-AI (tuning the prompts so document classification is actually useful rather than just technically running).

If any of this is useful as a starting point, most of these services have reasonable official documentation and active communities. The architecture isn’t novel; it’s mostly standard self-hosting patterns assembled with some thought about how the parts should talk to each other.

Designing Single-Purpose Agents Instead of One Big Automation Script

Wed, 17 Jun 2026 10:00:00 +0100

“Agent” has become one of those words that means everything and nothing this year. Before it was a hype term, I’d already ended up with a small flock of them in my homelab. Not because I was chasing a trend, but because I kept hitting the same wall every time I tried to write One Big Script: it grew a dozen unrelated responsibilities, and a bug in one of them risked taking down all of them.

In short: I prefer many small, single-purpose automation agents over one large script because each agent has a narrow job, a clear output contract, and an independent schedule. The system stays maintainable because agents communicate through JSON artifacts, one notification channel, and one dashboard.

So instead, every recurring chore in my homelab is its own small, independently-scheduled program. There turned out to be more of them than I expected once I actually sat down and counted.

A note for muggles: the repo behind all this is named hogwarts, and every agent gets sorted to match. Once you start naming services after wizards, it turns out you owe each one an in-character job description, whether it asked for one or not.

The standing watch. Four observers poll continuously and report into one correlator every five minutes. This is the layer that exists so I find out about a problem before it becomes a 3am page instead of after:

Argus Filch watches running Docker containers for restart loops, failed healthchecks, and containers that just quietly vanish.
Astronomy Tower polls Prometheus for firing alerts, down scrape targets, and recording rules that stopped working without telling anyone.
Marauder’s Map scans the UniFi network for offline devices, WAN failover events, and firewall rules that drifted open.
Mad-Eye’s Watch tracks TLS certificate expiry across configured endpoints. Constant vigilance: a warning at 30 days, a critical at 7.
The Headmaster is the one role on this list that isn’t single-purpose by design. Its entire job is reading what the other four decided was worth reporting and correlating that into one status, surfaced as an incident only when it’s actually worth one.
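To show how small one of these observers can be, here is a sketch of the certificate check behind the 30-day and 7-day thresholds. It is not the actual Mad-Eye's Watch code: the function names are hypothetical, and treating the thresholds as inclusive is an assumption.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the peer certificate's notAfter string, e.g. 'Jun 23 10:00:00 2026 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before the certificate expires (negative once it has)."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def classify(days_left: float, warn_days: int = 30, critical_days: int = 7) -> str:
    """Map remaining days onto ok / warning / critical."""
    if days_left <= critical_days:
        return "critical"
    if days_left <= warn_days:
        return "warning"
    return "ok"
```

One caveat with this approach: the default SSL context refuses to complete the handshake with a certificate that has already expired or is signed by an unknown CA, so `fetch_not_after` raises instead of returning. An observer built on it would need to report that exception as a critical finding rather than crash.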

The daily and weekly chores. These run on their own cron schedules and never talk to each other directly:

Molly’s Cupboard reviews the Home Assistant entity list weekly: unavailable entities, missing or duplicate names, disabled automations. (Molly Weasley: keeps the household running, judges your clutter lovingly.)
Rita’s Desk is the RSS morning digest: feeds in, previous day’s articles out, ranked against persistent tag scores I vote on. Deterministic by design, no LLM in the loop. (Never met a headline she wouldn’t print, but at least she always sources it.)
Kreacher’s Kitchen plans the week’s meals from my recipe library and a couple of trusted cooking sites. (Grumbles the entire time, still gets dinner on the table.)
The Library picks a tech topic every night, gathers sources, and writes a 5-minute digest plus a 15-20 minute deep dive. (Lives in the package manifest as research-digest, but it spends every night in the Restricted Section, so the Library it is.)
Madam Pince’s Catalogue lists every running container and cross-checks it against the service directories in the infra repo, flagging any container that has no matching documentation. (A very particular librarian: every book gets catalogued, or it gets confiscated.)
Dobby’s Rounds is the homelab’s free elf: weekly housekeeping that prunes old snapshots, reports, and state files before they pile up.
O.W.L.s is the daily infrastructure audit: config drift, open ports, compliance. Read-only and deliberately paranoid about it. (Ordinary Wizarding Level exams: thorough, exhausting, and not interested in your excuses.)
Auror Office is the daily cross-domain security digest, correlating O.W.L.s’ findings with auth logs, Docker posture, and the network observers above into one report. (No badge, but it does go looking for dark wizards. I’ve written about how this one and O.W.L.s work together in more detail elsewhere.)
…and others, including media management, recommendations, and a handful more in the same spirit. Small enough that listing every one of them would be its own blog post.

Thirteen-plus names, just as many jobs. Outside of the two correlators built specifically to know about everyone else (the Headmaster and the Auror Office), not one of them needs to care that the rest exist.

The three conventions that make this work

None of these agents are individually clever. What makes the flock manageable is that they all obey the same three small contracts:

1. One artifact format. Every agent writes its result as JSON (and often a companion Markdown note) to its own outbox/latest/ path. A “latest” pointer plus a timestamped archive, every time. No agent reads another agent’s outbox directly. If something needs cross-referencing, that’s a different, explicitly-correlating agent’s job, not an implicit dependency.

2. One notification channel. Every agent that needs to tell me something pushes through the same ntfy topic convention, with a deep link back into wherever the full detail lives. I don’t maintain five different alerting integrations; I maintain one, and every agent is a thin client of it.

3. One aggregation point. A single dashboard reads everyone’s outbox/latest/ and renders it. It doesn’t collect anything itself. It has no Docker access, no Home Assistant credentials, no API keys. It’s a pure read layer over JSON files other things produced. That’s the only place in the whole system that’s allowed to know all the agents exist.

That’s the entire integration surface. Three conventions, and I can add another agent tomorrow without touching any of the existing ones.
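To make the artifact and aggregation conventions concrete, here is a minimal sketch of both sides of the contract: an agent writing its result, and the dashboard reading everyone’s latest. The file name report.json, the archive/ directory, and both function names are assumptions; the post only specifies an outbox/latest/ path plus a timestamped archive.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_artifact(agent_dir: Path, result: dict) -> Path:
    """Archive a timestamped copy, then refresh outbox/latest/report.json."""
    now = datetime.now(timezone.utc)
    payload = json.dumps(
        {"generated_at": now.isoformat(), **result}, indent=2, sort_keys=True
    )

    archive = agent_dir / "outbox" / "archive"   # hypothetical layout
    latest = agent_dir / "outbox" / "latest"
    archive.mkdir(parents=True, exist_ok=True)
    latest.mkdir(parents=True, exist_ok=True)

    archived = archive / f"{now.strftime('%Y%m%dT%H%M%S%fZ')}.json"
    archived.write_text(payload, encoding="utf-8")

    # Write to a temp file and rename it into place, so a reader never
    # sees a half-written "latest" report.
    fd, tmp_name = tempfile.mkstemp(dir=latest, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(payload)
    os.replace(tmp_name, latest / "report.json")
    return archived


def read_all_latest(agents_root: Path) -> dict:
    """Dashboard side: load every agent's latest report, keyed by agent directory."""
    reports = {}
    for path in sorted(agents_root.glob("*/outbox/latest/report.json")):
        agent_name = path.parts[-4]  # <agent>/outbox/latest/report.json
        reports[agent_name] = json.loads(path.read_text(encoding="utf-8"))
    return reports
```

The reader needs nothing but filesystem access to the outboxes, which is what lets the dashboard stay free of credentials.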

Why decompose instead of consolidate

The obvious objection: isn’t five small things more to maintain than one big thing? In my experience, no. For the same reason a set of small services usually beats a monolith at work.

A bug in the Trakt pagination of Peeves (one of the media-management agents) cannot break Molly’s Home Assistant checks, because they don’t share a process, a deploy, or a schedule. I can test each one in complete isolation with a fixture file instead of live credentials. I can hand any single agent’s directory to a contributor - human or an AI coding agent - and they have everything they need to understand and change it, without first having to load the others into their head. And when I retire one (Peeves only matters because I still have a media server; that won’t be true forever), deleting it is deleting a directory, not untangling a shared module.

This is the same lesson as service boundaries and team topologies at any reasonably-sized engineering org: the interface between components should be small, explicit, and boring, and almost all of the design effort should go into keeping it that way. Not into making any individual component clever. The cleverness, if there is any, belongs inside one agent’s narrow walls, where it can’t leak.

The boring plumbing is the point

None of the agents above is doing anything technically hard. RSS parsing, a REST API client, a cron job - this is all stuff any of us could write in an afternoon. The actual design work was deciding, up front, that “outbox JSON + one notification channel + one dashboard” would be the entire contract between them, and then refusing to let any agent reach around it.

That discipline is cheap when you only have one agent. It’s the only thing that keeps five (or fifteen) from turning back into the One Big Script I was trying to avoid in the first place.

Backing Up the One Credential That Can't Be Wrong

Mon, 15 Jun 2026 10:00:00 +0100

Most things in my homelab can fail and I shrug. A container restarts, a dashboard is stale for an hour, a media file gets deleted by mistake: annoying, recoverable, fine. The password vault is not in that category. If it’s wrong, or gone, or merely unreachable at the wrong moment, I lose access to everything else at once. It’s the one piece of infrastructure that earns the extra paranoia.

In short: my password vault backup strategy keeps three copies that survive different failure modes: the primary vault, a live self-hosted Vaultwarden mirror, and an offline KeePass archive. The important part is not the number of backups, but making sure they do not all fail for the same reason.

So instead of “back it up somewhere,” I sat down and asked the question I’d ask at work for any single point of failure: which specific failure does each copy need to survive? That question is what actually shaped the design. Not “more backups.”

Three copies, three different failure modes

The vault lives day-to-day in Dashlane. Around it, a script keeps two more copies current:

A dated, offline KeePass .kdbx archive. The script exports the vault, converts it to KeePass format, and syncs the file to the NAS over rsync. Opening this file needs nothing else to be true: no Dashlane account, no Vaultwarden instance, no network. KeePassXC plus a master passphrase is the entire recovery path.
A live Vaultwarden mirror. The script wipes and re-imports a self-hosted Vaultwarden instance on every run, so it never drifts and never accumulates stale duplicates. Unlike the .kdbx, this one behaves like a normal vault app day-to-day. Useful if Dashlane itself is the thing that’s unavailable.
The original, in Dashlane. Still the primary, still the one I actually use day to day.

Each of these answers a different question. If Dashlane has an outage or I get locked out of the account, Vaultwarden keeps working normally. If my entire home network and every service on it is down, or I’m on a borrowed machine with nothing installed, the .kdbx file plus a passphrase I’ve memorized is the whole recovery procedure. No SSH keys, no app, no account, no network. If the NAS itself dies, the live Vaultwarden mirror (running elsewhere) and Dashlane are both still fine.

That’s the test I’d apply to any redundancy claim: two backups that die from the same root cause aren’t two backups, they’re one backup with extra steps. These three don’t share a single point of failure with each other.
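That test can be written down as a small table and checked mechanically. The sketch below is only one reading of the scenarios described above: the failure names and the mapping of which failure takes out which copy are assumptions for illustration, not an inventory from the actual setup.

```python
# Hypothetical model: for each copy, the failures that would make it unusable.
BROKEN_BY = {
    "dashlane": {"dashlane_outage", "account_lockout"},
    "vaultwarden_mirror": {"home_network_down"},
    "kdbx_archive": {"nas_dead"},
}


def survivors(failure: str) -> list[str]:
    """Copies that are still usable when the given failure happens."""
    return sorted(
        copy for copy, causes in BROKEN_BY.items() if failure not in causes
    )


def shared_failures() -> list[str]:
    """Failures that would leave no usable copy at all."""
    all_failures = set().union(*BROKEN_BY.values())
    return sorted(f for f in all_failures if not survivors(f))
```

If shared_failures() ever returns a non-empty list, two or more copies are really one backup with extra steps.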

The part that matters more than the architecture

Diagrams of “three copies in three places” are easy to draw and easy to get wrong in the implementation details, and the details are where a vault backup script can quietly turn into a liability instead of a safety net.

The script exports the raw vault to CSV in plaintext before converting it. There’s no way around that: the conversion tool needs plaintext input. So the entire design constraint became: that plaintext must never outlive the script’s own execution. It’s written to a private temp directory, and a shell trap deletes it on every exit path: success, failure, or interruption, not just the happy path. The master passphrase that protects the .kdbx is treated as its own tier-zero secret: it lives outside the repo, outside the backup, memorized or written down somewhere physically safe, because it’s the one credential that unlocks the credential store.

It’s a small thing, a trap ... EXIT instead of cleanup code only at the bottom of the script. But it’s exactly the kind of detail that’s invisible when it works and catastrophic when it doesn’t. I’d rather the script crash and still clean up after itself than ship a feature and leave a plaintext export sitting in a temp folder because someone hit Ctrl-C at the wrong moment.
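The real script uses a shell trap. The same idea can be sketched in Python with a context manager, shown here as an analogue rather than a copy of that script. The function name and the export and convert callables are hypothetical stand-ins for the real export and conversion steps.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def with_private_export(
    export: Callable[[Path], None],
    convert: Callable[[Path], None],
) -> None:
    """Run export, then convert; the plaintext is removed on every exit path."""
    # TemporaryDirectory creates a directory only the current user can access.
    with tempfile.TemporaryDirectory(prefix="vault-export-") as tmp:
        plaintext = Path(tmp) / "export.csv"
        export(plaintext)            # writes the plaintext CSV
        os.chmod(plaintext, 0o600)   # tighten the file itself as well
        convert(plaintext)           # reads the CSV, writes the .kdbx elsewhere
    # Leaving the with-block deletes the directory whether convert() returned,
    # raised an exception, or was interrupted by Ctrl-C (KeyboardInterrupt).
```

Two limits are worth stating: a SIGTERM would need a signal handler that raises before this cleanup runs, which the sketch leaves out, and nothing in either language can clean up after SIGKILL or a power cut.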

Threat-model your one credential that can’t be wrong

Every system has at least one of these. The credential, the key, the record that everything else assumes is correct and available. For most of us it’s a password vault; for some it might be a signing key or a recovery seed.

The exercise worth doing isn’t “add more backups.” It’s listing the ways that one thing could become unavailable or wrong, and checking that your copies don’t all fail for the same reason. If they do, you’ve built redundancy theater, not redundancy.

Running a Personal SOC: Bringing Production Security Practices Home

Fri, 12 Jun 2026 10:00:00 +0100

At work, nobody questions why we have logging, alerting, and a daily look at what changed overnight. At home, the same network runs a NAS, a media stack, Home Assistant, and a handful of containers. And for years my only “security monitoring” was noticing something was broken.

So I built myself a small, read-only security operations setup for the homelab: a daily audit script and a cross-domain digest agent that correlates it with everything else running on the network. Nothing here is novel security research. The interesting part is which production habits turned out to be worth carrying home, and which ones I deliberately left at the office.

Two layers, not one

The setup is split into two pieces with different jobs.

The daily audit is the boring, deterministic layer. Once a day it collects, locally and read-only:

listening sockets, flagging anything bound to a wildcard interface
Docker/container posture: privileged containers, dangerous bind mounts, host network mode, dangerous capabilities
systemd service drift against an expected allowlist
a local secrets/config hygiene scan (path, line, and pattern only - never the matched value)
cached apt list --upgradable, optionally enriched with a trivy fs scan
whether the monitoring artifacts and timers it depends on are actually present

It writes a deterministic Markdown digest and a JSON report to a local outbox. That’s it. No remediation, no service restarts, no firewall changes. The rule I gave it was “assume breach, trust no single signal, change nothing during observation.”
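The “path, line, and pattern only” rule for the hygiene scan is easy to get wrong, so here is a minimal sketch of what it could look like. The two patterns and the function name are assumptions for illustration; the actual scan’s pattern list is not described in the post.

```python
import re
from pathlib import Path

# Hypothetical patterns; a real scan would carry a longer, tuned list.
PATTERNS = {
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "password_assignment": re.compile(r"(?i)\bpassword\s*[=:]\s*\S+"),
}


def scan_file(path: Path) -> list[dict]:
    """Report path, line number, and pattern name. Never the matched text."""
    findings = []
    text = path.read_text(encoding="utf-8", errors="replace")
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                # Deliberately no "match" or "line_text" field: the report
                # must be safe to store, notify on, and summarize.
                findings.append(
                    {"path": str(path), "line": lineno, "pattern": name}
                )
    return findings
```

Because the finding carries no secret material, the report itself never becomes a second place the secret lives.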

The SOC agent is the correlation layer on top. It runs once a day and pulls together SSH auth logs (brute-force detection, unexpected successful logins, elevated sudo activity), Docker security posture, UniFi network signals (unknown MAC addresses, active IPS/DPI/flood/scan/rogue alarms), and re-surfaces anything security-relevant the daily audit already found. Everything gets a severity from ok up through critical, and the result is written as an Obsidian note with YAML frontmatter. So a year of these becomes a searchable, taggable incident timeline instead of a folder of text files nobody opens.
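As a sketch of the output side, this is roughly how a severity roll-up and an Obsidian note with YAML frontmatter could be produced. The intermediate severity levels, the frontmatter keys, the tag names, and the finding fields are all assumptions; the post only names ok and critical as the two ends of the scale.

```python
from datetime import date

# Assumed ordering, lowest to highest.
SEVERITIES = ["ok", "info", "warning", "critical"]


def overall_severity(findings: list[dict]) -> str:
    """The note's severity is the worst severity among its findings."""
    if not findings:
        return "ok"
    return max((item["severity"] for item in findings), key=SEVERITIES.index)


def render_note(day: date, findings: list[dict]) -> str:
    """Render a Markdown note with YAML frontmatter, worst findings first."""
    severity = overall_severity(findings)
    lines = [
        "---",
        f"date: {day.isoformat()}",
        f"severity: {severity}",
        "tags:",
        "  - soc-digest",
        f"  - severity/{severity}",
        "---",
        "",
        f"# SOC digest {day.isoformat()}",
        "",
    ]
    ranked = sorted(
        findings, key=lambda item: SEVERITIES.index(item["severity"]), reverse=True
    )
    for item in ranked:
        lines.append(f"- [{item['severity']}] {item['id']}: {item['title']}")
    return "\n".join(lines) + "\n"
```

Putting severity and date in the frontmatter is what makes a year of notes queryable by property or tag instead of by filename.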

Where the LLM fits - and where it doesn’t

Both agents can optionally hand their findings to a local Ollama model to write a short narrative summary on top of the facts. This is the part I was most careful about, because it’s the part most people get backwards.

The model never sees raw logs, full inventories, or matched secrets. It only sees compact finding titles, IDs, severities, and evidence keys. It doesn’t decide what’s a finding; the deterministic analyzers do that. And if Ollama is unreachable or returns something unusable, the deterministic digest ships as-is. The LLM is a narrator, never the source of truth.
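A minimal sketch of that boundary, with the model call abstracted as a plain callable so that nothing here depends on a particular Ollama client API. The function names and the shape of a finding (an evidence dict alongside id, title, and severity) are assumptions for illustration.

```python
from typing import Callable, Optional

ALLOWED_KEYS = ("id", "title", "severity")


def compact(findings: list[dict]) -> list[dict]:
    """Keep only what the model may see: IDs, titles, severities, evidence keys."""
    out = []
    for finding in findings:
        item = {key: finding[key] for key in ALLOWED_KEYS if key in finding}
        # Evidence keys only: the values (log lines, matched text) stay behind.
        item["evidence_keys"] = sorted(finding.get("evidence", {}).keys())
        out.append(item)
    return out


def build_digest(
    findings: list[dict],
    deterministic_digest: str,
    narrate: Optional[Callable[[list[dict]], str]] = None,
) -> str:
    """Prepend a narrative summary if one is available, else ship the facts."""
    if narrate is None:
        return deterministic_digest
    try:
        summary = narrate(compact(findings))
    except Exception:
        return deterministic_digest   # model unreachable: ship as-is
    if not isinstance(summary, str) or not summary.strip():
        return deterministic_digest   # unusable output: same fallback
    return f"{summary.strip()}\n\n{deterministic_digest}"
```

The narrator can only add text on top of the digest. It has no path to remove a finding or change a severity.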

That’s the same boundary I’d want on a production alerting pipeline: detection logic stays deterministic and testable, and generative summarization sits strictly downstream of it, never upstream.

The part that’s just operational discipline

A few choices here have nothing to do with security theory and everything to do with habits from running things in production:

Output paths are permissioned, not just “private by convention.” Reports get 0600, runtime directories 0700.
Stale data is treated as a finding, not silently trusted. Upstream reports older than a configured threshold are flagged rather than quietly re-surfaced as current.
Notifications are a tap-through, not the whole story. A push notification carries the headline and a link into the actual note - useful at a glance, but the record of truth lives in the note, not in a chat history that scrolls away.
Everything is replayable offline. Both agents accept a fixture file in place of live collection, so I can test a new analyzer rule against a known input before it ever touches my real network.
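Two of those habits are small enough to sketch directly: writing reports with explicit permissions, and turning a stale upstream report into a finding. The function names, the finding ID, and the severity chosen for staleness are assumptions, and the permission bits are only meaningful on POSIX systems.

```python
import os
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional


def write_private(path: Path, text: str) -> None:
    """Write a report as 0600 inside a 0700 runtime directory."""
    path.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
    # mkdir's mode is filtered by umask and ignored if the directory exists,
    # so set it explicitly as well.
    os.chmod(path.parent, 0o700)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    os.chmod(path, 0o600)  # also tightens a file that already existed


def staleness_finding(
    generated_at: datetime, now: datetime, max_age: timedelta
) -> Optional[dict]:
    """Return a finding if an upstream report is older than max_age, else None."""
    age = now - generated_at
    if age <= max_age:
        return None
    hours = age.total_seconds() / 3600
    return {
        "id": "stale-upstream-report",   # hypothetical finding ID
        "severity": "warning",           # assumed severity
        "title": f"Upstream report is {hours:.1f}h old",
    }
```

Passing now in as an argument, rather than reading the clock inside the function, is what keeps the staleness rule replayable against a fixture.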

What it actually buys me

Mostly, it buys me the same thing it buys at work: I notice drift before it becomes an incident, instead of after. A container that quietly picked up a privileged flag, a port that got published wider than intended, an unfamiliar MAC address on the network. These are exactly the kind of small, boring facts that are easy to miss and easy to detect deterministically.

It’s not a real SOC. There’s no 24/7 coverage, no incident response retainer, and the threat model is “don’t get owned by something dumb,” not “defend against a motivated attacker.” But the muscle is the same one I use at work: write the check once, make it boring and deterministic, let a model help you read the output, and keep a record you can actually search six months later.

If you’re already doing this professionally, the homelab version costs you an evening and pays you back the first time you catch something you would have otherwise missed entirely.