From 56ba0c3cf7829861851216075e6f1bb3ff12df8a Mon Sep 17 00:00:00 2001
From: Jean-Paul Calderone <exarkun@twistedmatrix.com>
Date: Mon, 8 Nov 2021 12:49:26 -0500
Subject: [PATCH] start of a design doc for backup/recovery

---
 docs/source/designs/backup-recovery.rst | 129 ++++++++++++++++++++++++
 1 file changed, 129 insertions(+)
 create mode 100644 docs/source/designs/backup-recovery.rst

diff --git a/docs/source/designs/backup-recovery.rst b/docs/source/designs/backup-recovery.rst
new file mode 100644
index 0000000..b2dec2d
--- /dev/null
+++ b/docs/source/designs/backup-recovery.rst
@@ -0,0 +1,129 @@
+ZKAP Database Backup / Recovery
+===============================
+
+*The goal is to do the least design we can get away with while still making a quality product.*
+*Think of this as a tool to help define the problem, analyze solutions, and share results.*
+*Feel free to skip sections that you don't think are relevant*
+*(but say that you are doing so).*
+*Delete the bits in italics*
+
+**Contacts:** Jean-Paul Calderone
+**Date:** 2021-11-08
+
+This is a design for a system in which *another component* can perform consistent backups of the internal ZKAPAuthorizer database which can be used to recover that database in the event primary storage of that database is lost.
+The system does *not* allow ZKAPAuthorizer to maintain backups *on its own*.
+
+Rationale
+---------
+
+The internal ZKAPAuthorizer database is used to store information that is valuable to its owner.
+This includes secrets necessary to construct ZKAPs.
+It may also include unredeemed or partially redeemed vouchers and information about problems spending some ZKAPs.
+
+This database is the canonical storage for this information.
+That is,
+if it is lost then it is not likely that it will be possible to recreate it.
+
+The premise of ZKAPAuthorizer is that ZKAPs are a scarce resource.
+It follows that unnecessary loss of ZKAPs is to be avoided.
+
+After the system described here is delivered to users it will be possible for users to recover all of the valuable information in the ZKAPAuthorizer database.
+This is true even if the entire system holding that database is lost,
+*as long as* the user has executed a basic backup workflow at least one time.
+
+User Stories
+------------
+
+Recovery
+~~~~~~~~
+
+**Category:** must
+
+As a user of ZKAPs who has lost the original device on which I installed Tahoe-LAFS with ZKAPAuthorizer
+I want to be able a install a new instance of Tahoe-LAFS with ZKAPAuthorizer to recover all of my ZKAPs
+so that I can use all of the storage that I paid for before I lost my device.
+
+**Acceptance Criteria:**
+
+  * 100% of storage-time which was paid for at the time of the loss is recovered
+  * Recovery is not impacted by the exact time of the failure that prompts it.
+  * The recovery workflow is integrated into the backup/recovery workflow for all other grid-related secrets.
+
+    * In particular, no extra steps are required for ZKAP or voucher recovery.
+
+  * Only the holder of the recovery key can recover the storage-time.
+  * Wallclock time to complete recovery is not increased.
+  * At least 500 GiB-months of unused storage-time can be recovered.
+  * At least 50 GiB-months of error-state ZKAPs can be recovered.
+  * At least 100 vouchers can be recovered.
+
+Backed Up ZKAPs
+~~~~~~~~~~~~~~~
+
+**Category:** must
+
+As a user of ZKAPs
+I want newly purchased ZKAPs to be backed up automatically
+so that I can use the system without always worrying about whether I have protected my investment in the system.
+
+**Acceptance Criteria:**
+
+  * All of the recovery criteria can be satisfied.
+  * The backup workflow is integrated into the backup/recovery workflow for all other grid-related secrets.
+
+    * In particular, no extra steps are required for ZKAP or voucher backup.
+
+*Gather Feedback*
+-----------------
+
+*It might be a good idea to stop at this point & get feedback to make sure you're solving the right problem.*
+
+Alternatives Considered
+-----------------------
+
+*What we've considered.*
+*What trade-offs are involved with each choice.*
+*Why we've chosen the one we did.*
+
+Detailed Implementation Design
+------------------------------
+
+*Focus on:*
+
+* external and internal interfaces
+* how externally-triggered system events (e.g. sudden reboot; network congestion) will affect the system
+* scalability and performance
+
+Data Integrity
+~~~~~~~~~~~~~~
+
+*If we get this wrong once, we lose forever.*
+*What data does the system need to operate on?*
+*How will old data be upgraded to meet the requirements of the design?*
+*How will data be upgraded to future versions of the implementation?*
+
+Security
+~~~~~~~~
+
+*What threat model does this design take into account?*
+*What new attack surfaces are added by this design?*
+*What defenses are deployed with the implementation to keep those surfaces safe?*
+
+Backwards Compatibility
+~~~~~~~~~~~~~~~~~~~~~~~
+
+*What existing systems are impacted by these changes?*
+*How does the design ensure they will continue to work?*
+
+Performance and Scalability
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+*How will performance of the implementation be measured?*
+
+*After measuring it, record the results here.*
+
+Further Reading
+---------------
+
+*Links to related things.*
+*Other designs, tickets, epics, mailing list threads, etc.*
-- 
GitLab