Storage Persistence and Cold-Backup Execution Plan
Date: 2026-04-04
Owner: Backend / Platform
Scope: local runtime persistence + Google Drive cold backup
1. Problem
- Local disk is capacity-constrained for long retention.
- Runtime needs fast local state, while operations need off-site recoverability.
- Google Drive is the selected cold backup target, with explicit quota and API-limit handling.
2. Operating Principles
- Hot storage and cold storage must stay separate.
- Do not mount Google Drive as live runtime storage.
- Package first, then upload, so that many small files do not overwhelm the Drive API.
- Encrypt before off-site transfer.
- Keep backup/restore jobs auditable with deterministic state transitions.
3. Reference Architecture
- Hot plane:
  - local SQLite/Postgres for runtime data
- Backup plane:
  - snapshot builder (DB dump + required config export)
  - package (tar.zst/tar.gz)
  - encryption + dedup (restic)
  - remote sync (rclone crypt -> Google Drive)
- Control plane:
  - policy service
  - job scheduler + retention manager
  - restore prepare/commit
  - audit/event logging
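
The backup plane above can be read as three commands run in order. The sketch below only builds the argument lists, so it is safe to inspect without any of the tools installed. The function name, directory layout, remote name `gdrive-crypt:backups`, and the `8M` bandwidth value are assumptions for illustration, not the service's actual configuration.

```python
from pathlib import Path


def build_pipeline_commands(
    snapshot_dir: Path,
    staging_dir: Path,
    restic_repo: Path,
    rclone_remote: str,
    bwlimit: str = "8M",  # hypothetical default for the upload bandwidth cap
) -> list[list[str]]:
    """Return argv lists for package -> restic backup -> rclone sync."""
    archive = staging_dir / f"{snapshot_dir.name}.tar.zst"
    return [
        # 1. Package the snapshot into a single archive (GNU tar with zstd).
        ["tar", "--zstd", "-cf", str(archive), "-C", str(snapshot_dir), "."],
        # 2. Encrypt and deduplicate into a local restic repository.
        ["restic", "-r", str(restic_repo), "backup", str(archive)],
        # 3. Sync the repository to an rclone remote (assumed to be a crypt
        #    remote layered over Google Drive), with a bandwidth limit.
        ["rclone", "sync", str(restic_repo), rclone_remote, "--bwlimit", bwlimit],
    ]
```

Each list could be passed to `subprocess.run(cmd, check=True)` in sequence, stopping at the first failure. restic reads its repository password from `RESTIC_PASSWORD` or `RESTIC_PASSWORD_FILE`, so no secret needs to appear on the command line.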
4. Private Control Surface
- Local summary read
- Policy read/write
- Backup job run/list/detail
- Restore prepare/commit/cancel
Exact admin route names stay in the private operator docs. The public plan only documents the capability groups and audit requirements.
5. Data Model (MVP)
- storage_policies
- backup_jobs
- backup_artifacts
- restore_jobs
- storage_audit_events
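
As a rough illustration of how these five tables might relate, here is a minimal SQLite sketch. Only the table names come from this plan. Every column, type, and foreign key is an assumption made for the example.

```python
import sqlite3

# Table names are from the plan; all columns below are hypothetical.
SCHEMA = """
CREATE TABLE IF NOT EXISTS storage_policies (
    id INTEGER PRIMARY KEY,
    rpo_hours INTEGER NOT NULL,
    local_keep_snapshots INTEGER NOT NULL,
    remote_retention_days INTEGER NOT NULL,
    updated_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS backup_jobs (
    id INTEGER PRIMARY KEY,
    state TEXT NOT NULL,
    started_at TEXT,
    finished_at TEXT
);
CREATE TABLE IF NOT EXISTS backup_artifacts (
    id INTEGER PRIMARY KEY,
    backup_job_id INTEGER NOT NULL REFERENCES backup_jobs(id),
    sha256 TEXT NOT NULL,
    size_bytes INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS restore_jobs (
    id INTEGER PRIMARY KEY,
    backup_artifact_id INTEGER NOT NULL REFERENCES backup_artifacts(id),
    state TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS storage_audit_events (
    id INTEGER PRIMARY KEY,
    job_kind TEXT NOT NULL,   -- 'backup' or 'restore'
    job_id INTEGER NOT NULL,
    event TEXT NOT NULL,
    created_at TEXT NOT NULL
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the MVP tables if they do not exist yet."""
    conn.executescript(SCHEMA)
```

Keeping audit events in their own append-only table, keyed by job kind and job id, is one way to make every state change traceable without rewriting job rows.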
6. Default Policy
- RPO: 24h
- Local retention: keep last 2-3 snapshots
- Remote retention: rolling 30 days
- Upload bandwidth cap: enabled by default
- Daily upload budget guard: enabled
- One active backup job lock: enabled
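
The defaults above could be captured in a small immutable record with a normalization step that clamps out-of-range values. This is a sketch under assumptions: the class name, field names, and clamping bounds are hypothetical, and the default of 3 local snapshots is simply the upper end of the 2-3 range.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StoragePolicy:
    """Hypothetical policy record; defaults mirror the list above."""
    rpo_hours: int = 24
    local_keep_snapshots: int = 3
    remote_retention_days: int = 30
    bandwidth_cap_enabled: bool = True
    daily_upload_budget_enabled: bool = True
    single_active_job_lock: bool = True


def normalize_policy(policy: StoragePolicy) -> StoragePolicy:
    """Clamp numeric fields into the assumed valid ranges."""
    return replace(
        policy,
        rpo_hours=max(1, policy.rpo_hours),
        # Keep local retention inside the 2-3 snapshot window.
        local_keep_snapshots=min(3, max(2, policy.local_keep_snapshots)),
        remote_retention_days=max(1, policy.remote_retention_days),
    )
```

For example, `normalize_policy(StoragePolicy(local_keep_snapshots=10))` yields a policy that keeps 3 local snapshots, while the default policy passes through unchanged.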
6.1 Current implementation progress (2026-04-07)
- Storage backup service already enforces policy normalization and guards for:
  - remote encryption requirement
  - daily upload budget
  - optional bandwidth cap
  - single active backup lock
- Backup and restore jobs are persisted with deterministic state transitions, and are available via admin APIs.
- Upload surfaces (template/profile-pack submit) now include idempotency replay + conflict handling to reduce duplicate submissions before they enter backup scope.
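
One common way to get deterministic state transitions is an explicit allow-list of moves, with everything else rejected. The sketch below shows that idea for backup jobs. The state names and the transition table are assumptions for illustration, not the states the service actually uses.

```python
# Hypothetical backup job states and the moves allowed between them.
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "queued": {"running", "cancelled"},
    "running": {"succeeded", "failed"},
    "succeeded": set(),   # terminal
    "failed": set(),      # terminal
    "cancelled": set(),   # terminal
}


def transition(current: str, target: str) -> str:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown state: {current}")
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

Because terminal states have no outgoing moves, a job that has already succeeded can never be flipped back to running, which keeps the audit trail unambiguous.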
7. Drive Constraints (Handled Explicitly)
- Daily upload hard-limit guard.
- Chunked + resumable transfer through rclone.
- Plain DB/PII upload is forbidden; encrypted path required.
- API pressure reduced by packaged artifacts + dedup snapshots.
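
The first and third constraints above lend themselves to a pre-upload check that runs before any bytes leave the host. A minimal sketch follows. The function name, parameter names, and violation codes are hypothetical, and the budget value is left to the caller rather than assumed.

```python
def upload_violations(
    artifact_bytes: int,
    uploaded_today_bytes: int,
    daily_budget_bytes: int,
    is_encrypted: bool,
) -> list[str]:
    """Return the reasons an upload must be refused (empty list = allowed)."""
    violations: list[str] = []
    # Plain DB/PII upload is forbidden: only encrypted artifacts may leave.
    if not is_encrypted:
        violations.append("plaintext_upload_forbidden")
    # Hard daily limit: refuse if this artifact would push us over budget.
    if uploaded_today_bytes + artifact_bytes > daily_budget_bytes:
        violations.append("daily_upload_budget_exceeded")
    return violations
```

Returning every violation rather than failing on the first one makes the resulting audit event more useful, since an operator sees all the reasons a job was blocked at once.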
8. Reliability Guardrails
- Backup job success-rate SLI.
- Restore-prepare validation success-rate SLI.
- Storage pressure controls:
  - high watermark blocks new full backups
  - warning events and critical-only fallback snapshots
- Scheduled canary restore drills.
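
The SLIs and the storage pressure controls above reduce to simple arithmetic over job counts and disk usage. The sketch below is one possible reading: the threshold values (0.70 and 0.85), the mode names, and the choice to fall back to critical-only snapshots above the high watermark are all assumptions.

```python
import shutil
from typing import Optional


def success_rate(succeeded: int, total: int) -> Optional[float]:
    """SLI as a fraction in [0, 1]; None when there were no jobs to measure."""
    if total == 0:
        return None
    return succeeded / total


def disk_used_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is currently used."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def backup_mode(
    used_fraction: float,
    warn_watermark: float = 0.70,   # hypothetical threshold
    high_watermark: float = 0.85,   # hypothetical threshold
) -> str:
    """Pick a backup mode from current disk pressure."""
    if used_fraction >= high_watermark:
        # Full backups are blocked; only a critical-only snapshot may run.
        return "critical_only"
    if used_fraction >= warn_watermark:
        # Full backup is still allowed, but a warning event should be emitted.
        return "full_with_warning"
    return "full"
```

A scheduler could call `backup_mode(disk_used_fraction("/"))` before starting each job and record the chosen mode in the audit log.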
9. Main Risks and Controls
- Risk: key custody failure causes unrecoverable backups
  Control: key SOP + restore drill gate.
- Risk: packaging bursts local disk usage
  Control: staging quota checks + stream packaging where possible.
- Risk: backup marked successful but not restorable
  Control: restore-prepare validation before release gate.
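
For the last risk, one inexpensive piece of restore-prepare validation is to confirm that a fetched artifact matches the size and checksum recorded when it was created. The sketch below assumes those two values were stored with the artifact. The function names are hypothetical, and a real check would likely also test-decrypt and test-load the dump.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large artifacts do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_prepare_ok(path: Path, expected_sha256: str, expected_size: int) -> bool:
    """Return True only if the artifact exists and matches size and checksum."""
    if not path.is_file():
        return False
    # Cheap size check first; skip hashing when it already fails.
    if path.stat().st_size != expected_size:
        return False
    return sha256_of(path) == expected_sha256
```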