April 29, 2026 · 11 min read

From paper to production: building zt-backup-kit

How a research paper on ransomware resilience in Linux NAS environments turned into an open-source backup toolkit anyone can deploy in an evening.

Backup · Disaster Recovery · Ransomware · Linux · Zero Trust · Open Source · Research

A few months ago I co-authored a research paper that argued something most sysadmins already feel in their bones: the backup is not the problem anymore — getting it back, alive, under pressure, is the problem. The paper went deep on a “Secure Pull” architecture for ransomware resilience in Linux NAS environments, with measured RTO and RPO numbers from a controlled lab setup. It was a satisfying piece of academic work.

But papers don’t run on cron at 2 a.m. So I built the tool.

This post is about that journey — from the research that started it, through the design choices that came out of it, to zt-backup-kit: an open-source toolkit that is the practical, opinionated implementation of the ideas the paper explored. I want to talk about what the research actually said, what surprised me when I tried to turn it into something a sysadmin could deploy in one evening, and what I’d do differently next time.

If you only want the tool, the GitHub repo is at github.com/ShanukaDilan/zt-backup-kit and the permanent DOI is 10.5281/zenodo.19849290. If you want the story behind it, keep reading.

The dilemma the paper set out to solve

The original paper, *Optimizing Recovery Objectives (RTO & RPO) in Secure Linux NAS Environments: A Design Science Approach to Ransomware Resilience*¹, started from an observation that’s almost cliché in the security industry but worth restating: modern human-operated ransomware operators no longer just encrypt your production data. They actively hunt your backups before they trigger encryption. Industry telemetry consistently shows the majority of impactful ransomware incidents include explicit backup-deletion steps in the attacker’s playbook.

This breaks the classic “3-2-1” backup rule in a particular way. The rule itself is fine — three copies, two media types, one off-site — but it was written for a threat model where attackers were assumed to be opportunistic, external, and unprivileged. Today’s reality is closer to *insider threat with admin credentials*. If your production server can write to your backup repository, an attacker on production can delete your backup repository.

The paper framed this as a tension between two competing requirements:

1. **Recovery speed (RTO)** — backups must be quickly retrievable when needed.
2. **Resilience** — backups must be inaccessible to a compromised server.

A local NAS gives you (1) but tends to fail (2): if the server can mount the share to write nightly snapshots, ransomware on the server can encrypt the share. Cloud cold storage gives you (2) but fails (1): pulling hundreds of GB back over a typical link takes hours, not minutes.

The paper’s contribution was an architectural inversion — the “Secure Pull” model — where a separate vault host pulls data from the production server over a one-way SSH channel on a schedule. The production server has no credentials for the vault. Even with full root compromise on production, the historical snapshots on the vault remain untouched and restorable.
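To make the inversion concrete, here is a minimal sketch of what the vault side of a Secure Pull could look like. This is illustrative, not the kit’s actual scripts: the host name, paths, schedule, and the rsync-then-restic pattern are my assumptions. The property that matters is that only the vault holds an SSH key; production holds no credentials at all.

```shell
# Hypothetical crontab on the VAULT host. The vault initiates every
# connection; a restricted, read-only SSH key accepted by the production
# host is the only trust relationship, and it points the wrong way for
# an attacker sitting on production.
#
# RESTIC_PASSWORD_FILE=/root/.restic-pass would be set in the real
# crontab environment.

# Every 15 minutes: pull a mirror of the data, then snapshot it locally.
*/15 * * * *  rsync -a --delete -e "ssh -i /root/.ssh/pull_key" \
                backup-reader@prod.example.com:/var/www/ /vault/mirror/www/ \
              && restic -r /vault/repo backup /vault/mirror/www
```

Because production cannot even list the vault’s snapshots, full root compromise on production leaves the snapshot history intact; any retention policy (`restic forget`) runs on the vault alone.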

The lab evaluation reported some numbers that, frankly, made me want to stop using anything else:

- **RPO of 15 minutes** with the pull architecture, versus 60 minutes for the cloud-only baseline (limited by upload bandwidth).
- **RTO under 2 minutes** for a 10 GB dataset over Gigabit LAN, versus 35 minutes for cloud-based restoration.
- **100% data integrity** maintained when simulated ransomware compromised the production server with root privileges. Standard push-based NAS configurations suffered total data loss in the same scenario¹.

That last result is what made me want to build something. The push-based NAS lost everything. The pull-based vault lost nothing. Same hardware, same backup engine, same encryption — different direction of trust.

Why I didn’t just publish the lab scripts

Research code is research code. You write it to prove a point in a paper. It hardcodes paths to your test environment, assumes a specific distribution of files, prints diagnostics that only mean something to you, and treats every error as a fatal exception because the lab is sterile.

A real sysadmin’s environment is the opposite of sterile. Files have weird permissions because some user named gayathri runs a Moodle install that nobody fully remembers setting up. The cron environment is missing half your $PATH. Half your “config files” are actually symlinks pointing at /usr/share/... because the package manager put them there. Email notifications need to come from a specific account, signed with DKIM, sent through a specific relay. None of this is in the paper.

So when I sat down to build a usable tool from the paper’s ideas, I gave myself a set of constraints that go well beyond what the research needed to prove:

1. **It must run unattended from cron without surprises.** No interactive prompts in the automated path. A hardened `$PATH` so binaries resolve under cron’s minimal environment. `flock`-based concurrency protection so a slow run doesn’t trample a fast one. Cron is where backup tools die quietly; the tool has to survive cron.
2. **It must distinguish kinds of failure.** Restic uses different exit codes: 0 for full success, 3 for “the snapshot was saved but some files couldn’t be read”, and others for hard failures. A naive script treats anything non-zero as broken; a tool used in production needs to tell you “everything’s fine, but your gayathri user has a `.bash_history` you can’t read” without crying wolf.
3. **It must protect itself from itself.** The default `restore.sh` mode should never overwrite the live source. Restoring to `/tmp/restore-...` with a confirmation prompt is the right primitive, not a flag that defaults to in-place restore. People panic during recovery; tools shouldn’t help them turn a lost-file event into a lost-everything event.
4. **It must work on a borrowed laptop.** This was the hardest design constraint, because it forced the question: what does the recovery procedure look like when you don’t have the original admin, the original server, or even your own machine? That became `emergency-restore.sh`, a bootstrap script that installs Restic and rclone on a clean Linux/macOS/WSL device, walks through OAuth setup, and pulls the latest snapshot to the user’s home directory. It’s the “I’m at home, the server is on fire, my passphrase is on a piece of paper in my wallet” scenario.
5. **It must come with a documented recovery procedure.** A backup nobody has ever restored is not a backup. The repo ships with a printable disaster-recovery runbook with three scenarios (single-file recovery, full server rebuild, emergency self-service restore) and a quarterly testing checklist. Most open-source backup tools stop at “here’s how to back up.” That’s the easy half.
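The first two constraints combine into a surprisingly small amount of wrapper code. Here is a minimal sketch of the pattern; the lock path is hypothetical, and the backup step is stubbed with one of restic’s documented exit codes so the handling is visible (the kit’s real scripts do considerably more):

```shell
#!/bin/sh
# Harden PATH: cron gives you almost nothing.
PATH=/usr/local/bin:/usr/bin:/bin; export PATH

# Single-instance lock: a slow run must not trample the next one.
LOCK="${TMPDIR:-/tmp}/zt-backup.lock"     # hypothetical lock path
exec 9>"$LOCK"
if ! flock -n 9; then
  echo "previous run still active, skipping" >&2
  exit 0
fi

# In the real script this line is: restic backup /var/www
# Here a stand-in returns 3, restic's "partial success" exit code.
run_backup() { return 3; }
run_backup
rc=$?

# Distinguish kinds of failure instead of treating any non-zero as doom.
case $rc in
  0) echo "OK: snapshot complete" ;;
  3) echo "WARN: snapshot saved, but some files were unreadable" ;;
  *) echo "FAIL: restic exited with $rc" >&2 ;;
esac
```

The point of the `case` is exactly the “don’t cry wolf” requirement: exit code 3 becomes a warning line in the report email, not a red alert.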

What changed when theory met cron

A few things I assumed during the research turned out to be wrong, or at least incomplete, once the tool was running daily against real data.

**The OAuth quota cliff.** The paper measured cloud transfers in a controlled environment. In production with default rclone OAuth credentials, the very first cron run against Google Drive failed with HTTP 403: Quota exceeded. It turns out rclone’s default OAuth client is shared globally across every rclone user worldwide; a few heavy users elsewhere can poison the well for everyone. The fix is creating your own Google Cloud project and OAuth credentials — straightforward, but invisible in the academic literature because labs don’t share quotas with strangers. The kit’s installation guide now spends a full section on this².

**Permission-denied is the common case, not the edge case.** I assumed that if a backup user has read access to /var/www, it can read everything inside. In practice, web servers are full of files dropped by random users during one-off interventions, with mode 600, owned by users who left the team three years ago. The kit ships with a default exclusion list — bash histories, caches, build artifacts, OS junk — and a granular exit-code handler so partial success doesn’t masquerade as failure. The list grew empirically; every entry corresponds to something I hit in the wild.

**The “credentials archive” is a research artifact disguised as operations.** The Secure Pull architecture is great while everything is running. But what about the human? In a real disaster you may not be at the office, may not have your laptop, may not have anyone who knows the restic password. The kit ships with a helper that bundles the encryption password and rclone OAuth config into a single GPG-encrypted archive, documented in the runbook with explicit instructions about where to store the archive (USB drive in office safe, personal email, second physical location) and where to store its passphrase (on paper, with the runbook, not in the same place as the archive). This was a feature the paper didn’t have to consider because it never modeled humans needing to recover.
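The pattern behind the credentials archive is simple enough to sketch. This is not the kit’s exact helper: the file names, the stand-in secret, and the choice of GPG symmetric encryption here are my assumptions.

```shell
#!/bin/sh
# Bundle the two secrets a stranger would need to restore: the restic
# password and the rclone config. Encrypt the bundle with a passphrase
# that lives on paper, apart from the archive itself.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

bundle="$workdir/bundle"
mkdir "$bundle"
printf 'example-restic-password\n' > "$bundle/restic-password.txt"  # stand-in
# A real helper would also copy ~/.config/rclone/rclone.conf into $bundle.

tar -C "$bundle" -czf "$workdir/credentials.tar.gz" .
gpg --batch --pinentry-mode loopback --symmetric --cipher-algo AES256 \
    --passphrase 'stand-in passphrase' \
    -o credentials.tar.gz.gpg "$workdir/credentials.tar.gz"
echo "wrote credentials.tar.gz.gpg"
```

Anyone holding the `.gpg` file and the paper passphrase can recover the secrets with `gpg -d` on a clean machine, which is exactly the borrowed-laptop scenario the runbook documents.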

What I’d say to someone reading the paper now

Read the paper for the architecture argument. The Pull model is, I think, genuinely the right design for small-to-mid environments that can afford one extra always-on host. The paper’s RTO/RPO numbers are real and reproducible.

But don’t stop there. Open the kit, read the scripts, and notice all the operational concerns the paper had no reason to include — the flock locking, the multi-line MIME emails, the numfmt for human-readable sizes, the mktemp cleanup traps, the granular exit-code interpretation. Those aren’t in the academic contribution because they aren’t novel. They’re in the kit because *backup tools live or die on operational plumbing.*
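Two of those plumbing pieces fit in one small sketch (illustrative, not the kit’s actual code):

```shell
#!/bin/sh
# mktemp + trap: scratch space that cleans itself up even when the
# script fails partway through.
scratch=$(mktemp -d)
trap 'rm -rf "$scratch"' EXIT

# numfmt: turn raw byte counts from restic/du into human-readable
# sizes for the report email.
bytes=10737418240
numfmt --to=iec "$bytes"    # prints 10G
```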

If you’re building anything similar — academic prototype to production tool — my biggest advice is: document the recovery procedure first. Not last, not second. First. Write the runbook before the script. Force yourself to articulate, in plain prose, exactly what someone with none of your context would do at 2 a.m. on a bad day. If you can’t write that runbook, your tool isn’t done. It doesn’t matter how clever the architecture is.

What’s next

zt-backup-kit is at v0.1.0. It works in production for the use cases I care about — Linux web servers backed up to Google Drive or local NAS, cron-driven, with the documented disaster recovery procedure tested. The v0.1 status reflects that I expect rough edges as more people use it, not that the core is shaky.

The roadmap, in priority order:

1. **Pull-mode setup scripts.** Right now the kit ships with the simpler push-mode setup. The Pull architecture from the paper deserves its own first-class setup flow with vault-host SSH key provisioning, scheduling on the vault rather than the source, and the explicit threat-model documentation that justifies it.
2. **Healthchecks.io / ntfy.sh integrations.** Email reports are great, but a missing email is a silent failure. A heartbeat ping to a service like Healthchecks gives you “I should have heard from this by now” alerting for free.
3. **Native B2 / S3 backend examples.** The kit supports anything Restic does, but I want to add tested-and-documented configurations for the non-Google clouds, particularly Backblaze B2, which is dramatically cheaper for backup workloads.
4. **More distro testing.** Currently exercised on Ubuntu and Debian. Fedora, Rocky, Alpine, and Arch should all work in theory but haven’t been verified.

If you use it, open an issue when something annoys you. If you build something interesting on top of it, send a PR. The whole project is MIT licensed and lives at github.com/ShanukaDilan/zt-backup-kit.
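The heartbeat idea from the roadmap is a one-liner appended to the cron job. A sketch, assuming a hypothetical wrapper script path; the UUID is a placeholder you get from your Healthchecks project, and hc-ping.com is Healthchecks’ ping endpoint:

```shell
# Hypothetical cron entry: ping Healthchecks only if the backup succeeded.
# Healthchecks then alerts you when the ping *doesn't* arrive on schedule,
# which catches the silent-failure case email cannot.
30 2 * * *  /usr/local/bin/zt-backup.sh && curl -fsS -m 10 --retry 3 \
              https://hc-ping.com/<your-uuid-here> > /dev/null
```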

If you’d like to cite it (in research, in a corporate runbook, anywhere), the permanent DOI is 10.5281/zenodo.19849290.

And if you’d like to read the paper that started all of this, here it is:

Gomas, A.S.D., & Rathnayake, R.M.N.B. (2026). Optimizing Recovery Objectives (RTO & RPO) in Secure Linux NAS Environments: A Design Science Approach to Ransomware Resilience. *Asian Journal of Social Science and Management Technology*, 8(1), 82–94.

The paper is the why. The kit is the how. They’re better together than either is alone.

Footnotes

  1. Gomas, A.S.D., & Rathnayake, R.M.N.B. (2026). Optimizing Recovery Objectives (RTO & RPO) in Secure Linux NAS Environments: A Design Science Approach to Ransomware Resilience. *Asian Journal of Social Science and Management Technology*, 8(1), 82–94.

  2. zt-backup-kit installation guide, §8 — Recommended: create your own Google Cloud OAuth project. https://github.com/ShanukaDilan/zt-backup-kit/blob/main/docs/INSTALL.md


Dilan Gomas

HCI Researcher & Web Architect at Sabaragamuwa University of Sri Lanka