Disaster Recovery vs Business Continuity: The Distinction That Matters
Across the Web3 sector, the terms business continuity plan and disaster recovery plan are used interchangeably, which leads to a critical operational gap. Most crypto firms that have invested in formal security documentation have a Business Continuity Plan (BCP). Far fewer have a tested Disaster Recovery Plan (DRP) that specifically addresses the technical recovery of signing infrastructure and key material after a compromise. The distinction is not semantic.
A BCP addresses how an organisation continues to deliver its functions across a broad range of disruptive events: personnel loss, premises unavailability, supply chain failures, and communications outages. It covers alternative working arrangements, escalation contacts, and the prioritisation of critical processes during a period of degraded capability. The business continuity planning discipline provides the organisational framework within which a DRP sits, but it does not, by itself, answer the operational question that matters most to a crypto firm under active attack: how do we recover control of our signing infrastructure?
A DRP is a technical recovery document. It specifies, at the procedure level, how compromised keys are rotated or replaced, how multi-signature configurations are restored after a signer is removed, how hardware security modules are re-initialised after a tamper event, how backup key material is accessed and used, and what the step-by-step sequence is for returning signing infrastructure to a verified clean state. Where the BCP is strategic and organisational, the DRP is operational and technical.
"Most Web3 firms have a BCP document gathering dust in a shared drive. Almost none have a DRP that has been tested against their actual signing infrastructure. That gap becomes catastrophic the moment a key is confirmed compromised."
The People, Process, Technology framework is relevant to understanding why this gap persists. BCPs tend to be produced as governance artefacts, written by risk or compliance functions and approved by boards. They address the people and process dimensions of continuity. DRPs require deep technical knowledge of the specific custody architecture in use: the technology dimension that governance teams frequently cannot specify without input from the engineering and security functions. When the people who understand the technology are not involved in writing the recovery document, the document cannot contain the procedures that are actually needed.
Closing the BCP-DRP gap requires the security team, engineering leadership, and key custodians to work together on the DRP as a technical document, not a governance exercise. The resulting DRP should be granular enough that a competent engineer who was not present during original setup could execute the recovery procedures from the document alone. If the DRP depends on undocumented institutional knowledge held by one person, it is not a recovery document: it is a single point of failure.
The First 24 Hours After a Key Compromise
The first 24 hours following a suspected key compromise are the most consequential period in the entire incident lifecycle. The decisions made and the actions taken, or not taken, during this window determine whether the incident is contained or escalates into a total loss. Without a pre-agreed playbook, teams default to investigation before containment, a sequencing error that consistently allows attackers to continue exploiting compromised access while the team is still trying to understand what happened.
The correct sequencing is: contain first, investigate second. The moment a key compromise is suspected, the priority is to prevent further use of the potentially compromised signing authority. Rotate or freeze the affected key, revoke associated sessions and API credentials, and isolate any systems that may have been accessed using the compromised material. This containment step should be executed before the full scope of the incident is understood, because the cost of rotating a key that turns out not to have been compromised is trivial relative to the cost of leaving a genuinely compromised key active while the investigation proceeds.
The incident response plan should specify the containment procedures for each class of key compromise: hot wallet key exposure, cold wallet signer compromise, protocol upgrade key compromise, and infrastructure credential compromise. Each class has a different containment procedure and a different urgency profile. Hot wallet key exposure demands immediate action measured in minutes. Cold storage compromise has more lead time because the keys are not actively used for routine operations, but the response is no less critical.
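One way to keep the per-class procedures from drifting into prose that nobody can execute is to hold them as structured data and check that every class is covered. The sketch below is illustrative only: the class names follow the four classes listed above, while the step wording and the minute targets are hypothetical placeholders that a real DRP would replace with values from its own architecture.

```python
from dataclasses import dataclass
from enum import Enum


class CompromiseClass(Enum):
    HOT_WALLET_KEY = "hot wallet key exposure"
    COLD_WALLET_SIGNER = "cold wallet signer compromise"
    PROTOCOL_UPGRADE_KEY = "protocol upgrade key compromise"
    INFRA_CREDENTIAL = "infrastructure credential compromise"


@dataclass(frozen=True)
class ContainmentProcedure:
    start_within_minutes: int   # hypothetical urgency target
    steps: tuple[str, ...]      # ordered containment actions


# Placeholder values: a real DRP fills these in from its own architecture.
PLAYBOOK: dict[CompromiseClass, ContainmentProcedure] = {
    CompromiseClass.HOT_WALLET_KEY: ContainmentProcedure(
        15, ("freeze or rotate the exposed key",
             "revoke associated sessions and API credentials",
             "isolate systems accessed with the key")),
    CompromiseClass.COLD_WALLET_SIGNER: ContainmentProcedure(
        240, ("suspend the signer's authority",
              "secure remaining backup material",
              "schedule signer replacement")),
    CompromiseClass.PROTOCOL_UPGRADE_KEY: ContainmentProcedure(
        30, ("pause upgrade execution if the protocol allows it",
             "rotate the upgrade key")),
    CompromiseClass.INFRA_CREDENTIAL: ContainmentProcedure(
        60, ("revoke the credential",
             "invalidate active sessions",
             "isolate affected hosts")),
}


def classes_without_procedure() -> list[CompromiseClass]:
    """Return compromise classes the playbook does not yet cover."""
    return [c for c in CompromiseClass if c not in PLAYBOOK]
```

Running classes_without_procedure() as part of a DRP review gives a quick signal when a new class of key has been introduced without a matching containment procedure.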
Evidence preservation must happen in parallel with, not after, containment. Before any remediation steps that involve system rebuilds, credential resets, or log rotation, capture forensic copies of affected systems, export available logs, and document the current state of on-chain activity. The evidence captured in the immediate aftermath of an incident is often the only basis on which the scope and attack vector can later be determined. Destroying evidence inadvertently during hasty remediation is a common and costly mistake.
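A simple way to make captured evidence defensible later is to record a cryptographic hash of every exported file at the moment of capture. The following is a minimal sketch using only the Python standard library; the function names, the manifest layout, and the collected_by field are assumptions made for illustration, not a forensic standard.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large disk images do not exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(evidence_dir: Path, collected_by: str) -> dict:
    """Describe every file under evidence_dir with its size and SHA-256."""
    files = sorted(p for p in evidence_dir.rglob("*") if p.is_file())
    return {
        "collected_by": collected_by,
        "collected_at_utc": datetime.now(timezone.utc).isoformat(),
        "files": [
            {
                "path": str(p.relative_to(evidence_dir)),
                "bytes": p.stat().st_size,
                "sha256": sha256_of(p),
            }
            for p in files
        ],
    }
```

The manifest is only useful if it is stored somewhere the affected systems cannot modify, so that a later hash mismatch reliably indicates that the evidence changed after capture.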
Out-of-band communication is essential from the first moment of incident declaration. If the incident involves a compromised internal system, the assumption must be that communications on that system are observable by the attacker. The incident response team should convene on a pre-agreed out-of-band channel: a separate Signal group, a secondary communication platform, or physical co-location if circumstances permit. The pre-agreed channel must be established and tested before an incident, not set up during one.
Legal counsel should be engaged within the first few hours. Legal counsel with crypto expertise can advise on notification obligations, the handling of evidence, and communications with affected users and regulators. Early engagement prevents the team from taking actions that could later create legal liability, such as making premature public statements about the scope or cause of an incident before the facts are established.
By the end of the first 24 hours, the team should have achieved: containment of the immediate threat, preservation of forensic evidence, a preliminary scope assessment, an established secure communication channel, engagement of legal counsel, and a clear picture of which recovery procedures need to be executed in the following days. Any outcome that remains incomplete at the 24-hour mark represents a gap in the incident response capability that the DRP should address.
Key Recovery Procedures: Multi-Sig Restoration
Key recovery after a signer compromise depends entirely on the custody architecture in place at the time of the incident. This is why the DRP must be written against the specific configuration of the organisation's signing infrastructure, not against a generic template. The procedures for recovering a 3-of-5 multi-sig arrangement differ from those for recovering a threshold signature scheme, which differ again from recovering HSM-based custody. Generic recovery procedures are insufficient.
For multi-signature wallet recovery after a single signer compromise, the procedure has two steps. First, confirm that the number of remaining uncompromised signers meets or exceeds the signature threshold for the wallet. Second, if the threshold is met, have the uncompromised signers authorise a transaction that removes the compromised signer key and adds a replacement key generated in a controlled key ceremony. The replacement key generation must follow the same procedural standards as the original key generation: air-gapped generation, documented ceremony, witnessed by multiple authorised personnel, and immediate secure storage of backup material.
The critical pre-condition for multi-sig recovery is that the remaining signers are genuinely uncompromised. If the incident investigation has not yet established the full attack vector, the assumption must be that additional signers may be affected. In practice, many multi-sig compromises involve multiple signers being targeted simultaneously through the same attack vector, typically social engineering of multiple individuals within the same organisation. The DRP should specify the minimum investigation steps required before uncompromised signers are allowed to proceed with recovery signing, to avoid the scenario where recovery signing is performed using signers who are themselves compromised.
When the number of confirmed compromised signers meets or exceeds the signature threshold, or the remaining uncompromised signers can no longer reach it, recovery is not possible through the normal multi-sig mechanism. In the first case the attacker can sign alone; in the second, nobody can assemble a trustworthy quorum. Either scenario requires protocol-level intervention: a governance vote, an emergency admin function if one exists in the protocol architecture, or, in extreme cases, coordination with the broader community or relevant counterparties. The existence and operation of any emergency override mechanisms should be documented in the DRP, including the governance procedures that authorise their use and the conditions under which they can be invoked.
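The threshold arithmetic behind this decision is simple enough to write down, which makes it easy to check under pressure. The sketch below is an illustration, not part of any wallet's API: the function and enum names are hypothetical, and it treats every signer not confirmed compromised as clean, which the investigation must establish first.

```python
from enum import Enum


class RecoveryPath(Enum):
    ROTATE_VIA_MULTISIG = "clean signers can replace the compromised key"
    ATTACKER_HAS_QUORUM = "compromised signers meet the threshold"
    NO_CLEAN_QUORUM = "clean signers cannot meet the threshold"


def recovery_path(total_signers: int, threshold: int, compromised: int) -> RecoveryPath:
    """Classify an m-of-n multi-sig after `compromised` signers are lost."""
    if not 1 <= threshold <= total_signers:
        raise ValueError("threshold must be between 1 and total_signers")
    if not 0 <= compromised <= total_signers:
        raise ValueError("compromised must be between 0 and total_signers")

    clean = total_signers - compromised
    if compromised >= threshold:
        # The attacker can authorise transactions without any clean signer.
        return RecoveryPath.ATTACKER_HAS_QUORUM
    if clean < threshold:
        # Neither side can sign: recovery needs a protocol-level mechanism.
        return RecoveryPath.NO_CLEAN_QUORUM
    return RecoveryPath.ROTATE_VIA_MULTISIG


print(recovery_path(5, 3, 1).name)  # → ROTATE_VIA_MULTISIG
print(recovery_path(5, 3, 3).name)  # → ATTACKER_HAS_QUORUM
print(recovery_path(5, 4, 2).name)  # → NO_CLEAN_QUORUM
```

The 4-of-5 case with two compromised signers is worth noting: the attacker cannot sign, but neither can the firm, so a high threshold narrows the margin for rotating keys through the wallet itself.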
Threshold signature scheme (TSS) recovery follows different procedures because key material is distributed differently. A TSS does not produce individual keys that can be individually revoked: the key material exists as distributed shares. Recovery requires that the threshold of remaining uncompromised participants execute a re-sharing ceremony, generating new key shares for all participants and invalidating the old shares. The re-sharing ceremony should be documented in the DRP with the same rigour as the initial key generation ceremony.
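The idea that shares can be replaced while the underlying key stays the same is easier to see in a toy model. The sketch below uses Shamir secret sharing with a proactive refresh (adding shares of a random polynomial whose constant term is zero). It is a deliberately simplified stand-in, not a TSS protocol: it uses a single trusted dealer and reconstructs the secret in the clear, both of which a production TSS is designed to avoid. The field modulus and function names are assumptions chosen for illustration.

```python
import secrets

PRIME = 2**127 - 1  # field modulus for this toy example (a Mersenne prime)


def _eval_poly(coeffs: list[int], x: int) -> int:
    """Evaluate a polynomial (lowest-degree coefficient first) at x mod PRIME."""
    result = 0
    for coeff in reversed(coeffs):
        result = (result * x + coeff) % PRIME
    return result


def split(secret: int, threshold: int, participants: int) -> dict[int, int]:
    """Create one share per participant; any `threshold` shares recover the secret."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    return {i: _eval_poly(coeffs, i) for i in range(1, participants + 1)}


def refresh(shares: dict[int, int], threshold: int) -> dict[int, int]:
    """Re-randomise all shares without changing the secret they encode."""
    zero_poly = [0] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    return {i: (s + _eval_poly(zero_poly, i)) % PRIME for i, s in shares.items()}


def reconstruct(shares: dict[int, int]) -> int:
    """Lagrange interpolation at x = 0 over the supplied shares."""
    secret = 0
    for i, share in shares.items():
        num, den = 1, 1
        for j in shares:
            if j != i:
                num = (num * -j) % PRIME
                den = (den * (i - j)) % PRIME
        secret = (secret + share * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

After a refresh, any threshold-sized set of new shares still recovers the same secret, while a set that mixes an old share with new ones does not, which is the property that makes a stolen pre-ceremony share useless once the old shares have been destroyed.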
The cold storage policy should include explicit provisions for recovery scenarios. If cold storage keys are held in geographically distributed physical locations, the DRP must specify the procedure for accessing those locations under emergency conditions, including the authorisation required, the documentation of access, and the chain of custody for any backup material retrieved during recovery.
Rebuilding Signing Infrastructure After an Incident
Returning to normal operations after a key compromise is not simply a matter of rotating the affected keys. The signing infrastructure itself, the machines, operating systems, software dependencies, network configurations, and access controls that constitute the signing environment, must be treated as potentially compromised until proven otherwise. Signing infrastructure rebuild is the process of constructing a verified clean environment from which the organisation can resume signing operations with confidence.
The starting assumption for any infrastructure rebuild after a confirmed or suspected compromise is that the affected systems cannot be trusted to return to clean state through patching or remediation. The correct approach is to rebuild from verified clean images, not to attempt to clean a system that may have been modified by an adversary in ways that are not fully understood. The DRP should specify the process for obtaining and validating clean base images for all components of the signing infrastructure.
The rebuild sequence follows a logical dependency order. Before any signing operations resume, the following must be verified: the new signing environment has been built from clean, verified components; network segmentation and access controls are in place on the new environment; logging and monitoring are active and outputs are being collected to a write-protected destination; and the new signing keys have been generated in a controlled ceremony and loaded into the new environment. Skipping any of these steps to accelerate the return to operation risks a repeat of the compromise.
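The four verification conditions above can be encoded as an explicit gate so that the decision to resume signing is never made from memory. This is a minimal sketch; the field names are hypothetical labels for the four conditions, and how each one is verified is left to the DRP.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RebuildStatus:
    built_from_verified_components: bool
    segmentation_and_access_controls_in_place: bool
    logging_to_write_protected_destination: bool
    new_keys_generated_in_ceremony_and_loaded: bool


def outstanding_checks(status: RebuildStatus) -> list[str]:
    """Names of the verification steps that have not yet been signed off."""
    return [f.name for f in fields(status) if not getattr(status, f.name)]


def may_resume_signing(status: RebuildStatus) -> bool:
    """Signing resumes only when every verification step is complete."""
    return not outstanding_checks(status)


status = RebuildStatus(True, True, False, True)
print(may_resume_signing(status))   # → False
print(outstanding_checks(status))   # → ['logging_to_write_protected_destination']
```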
Treasury security controls must be re-verified as part of the infrastructure rebuild. All accounts with access to treasury assets should be reviewed, credentials rotated, and access logs from the incident period examined for unauthorised activity before treasury operations resume. The rebuild process is an opportunity to apply any security improvements that were identified as gaps during the incident, so that the rebuilt infrastructure is more resilient than the compromised predecessor.
Hardware security considerations are particularly important during a rebuild. If the compromise involved physical access to signing hardware, or if the attack vector is not fully understood, the compromised hardware itself should not be reused in the new signing environment. New hardware should be sourced, and its integrity verified before deployment. For HSM-based environments, re-initialisation procedures specified by the HSM vendor should be followed, including the generation of new HSM administrator credentials and the re-establishment of the HSM's security domain.
The rebuild process should be documented in detail as it is executed, both to support the forensic investigation and to produce an updated operational runbook that reflects the architecture of the new environment. The documentation produced during a rebuild is often the most accurate and complete technical documentation the organisation holds, because it is written against the actual state of the infrastructure rather than an idealised design document that may not reflect operational reality.
RTO and RPO for Web3 Firms: Setting Recovery Objectives
Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are the two fundamental metrics that define the organisation's recovery requirements and determine the investment needed in recovery capability. Most Web3 firms have not formally defined these metrics for their signing infrastructure, which means their DRP has no quantified success criteria and no basis for measuring whether recovery capability is adequate.
RTO defines the maximum acceptable time between an incident being declared and the organisation returning to operational signing capability. For a DeFi protocol with continuous user activity, an extended period during which the protocol cannot be operated or emergency functions cannot be executed represents both financial and reputational damage. The appropriate RTO depends on the nature of the protocol, its user dependencies, and the business impact of extended downtime. A protocol that operates time-sensitive liquidation mechanisms has a shorter appropriate RTO than a protocol that supports only periodic governance votes.
Setting a realistic RTO requires mapping the technical recovery steps against the time required to execute them. If the key recovery and infrastructure rebuild process realistically takes 72 hours under best-case conditions, an RTO of 24 hours is unachievable without changes to the recovery architecture: pre-staged clean infrastructure, more distributed backup key access, or dedicated recovery personnel who can operate around the clock. The RTO should be set at a level that is both operationally meaningful and technically achievable.
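The mapping of recovery steps to time can be kept as a small calculation that is rerun whenever the procedures change. The sketch below assumes the steps run strictly in sequence; the step names and individual durations are hypothetical and were chosen only so that they add up to the 72-hour example above.

```python
from datetime import timedelta


def total_recovery_time(steps: dict[str, timedelta]) -> timedelta:
    """Sum step durations, assuming they are executed one after another."""
    return sum(steps.values(), timedelta())


def rto_shortfall(steps: dict[str, timedelta], rto: timedelta) -> timedelta:
    """How far the estimated recovery time overshoots the RTO (zero if it fits)."""
    return max(total_recovery_time(steps) - rto, timedelta())


# Hypothetical best-case durations totalling 72 hours.
steps = {
    "containment and scoping": timedelta(hours=4),
    "replacement key ceremony": timedelta(hours=20),
    "signing environment rebuild": timedelta(hours=36),
    "verification and sign-off": timedelta(hours=12),
}

print(total_recovery_time(steps))                  # → 3 days, 0:00:00
print(rto_shortfall(steps, timedelta(hours=24)))   # → 2 days, 0:00:00
```

A non-zero shortfall points at the steps that dominate the total, which is where pre-staged infrastructure or additional personnel would have the most effect.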
RPO defines the maximum acceptable data or state loss between the incident and the recovery point. In a crypto context, the on-chain state is preserved by the blockchain itself, so the RPO for on-chain transaction history is effectively zero. The RPO concern in a crypto firm's DRP is primarily for off-chain systems: databases that record off-chain state, monitoring and alerting infrastructure, key management metadata, audit logs, and operational data that is not replicated on-chain. Each of these systems requires an explicit RPO commitment backed by a tested backup and restoration procedure.
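An RPO commitment is only meaningful if something checks it. As a rough sketch, the age of the newest restorable backup for each off-chain system can be compared with its RPO; the system names and RPO values below are hypothetical, and the check assumes each recorded backup has actually been restore-tested.

```python
from datetime import datetime, timedelta, timezone


def systems_outside_rpo(
    rpo: dict[str, timedelta],
    last_good_backup: dict[str, datetime],
    now: datetime,
) -> list[str]:
    """List systems whose newest restorable backup is older than their RPO.

    A system with no recorded backup is always reported.
    """
    outside = []
    for system, limit in rpo.items():
        taken_at = last_good_backup.get(system)
        if taken_at is None or now - taken_at > limit:
            outside.append(system)
    return outside


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rpo = {
    "audit logs": timedelta(minutes=15),
    "key management metadata": timedelta(hours=1),
}
backups = {
    "audit logs": now - timedelta(minutes=5),
    "key management metadata": now - timedelta(hours=3),
}
print(systems_outside_rpo(rpo, backups, now))  # → ['key management metadata']
```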
The relationship between RTO and the cost of recovery capability investment is important for budget discussions with leadership. Shorter RTOs require more investment: pre-staged hot standby environments, redundant key material accessible to multiple recovery personnel, and dedicated recovery staffing. The cost of achieving a given RTO should be weighed against the business impact of the corresponding downtime period. This analysis, documented in the DRP, provides the commercial justification for recovery capability investment and creates accountability for the recovery time commitment.
Regular review of RTO and RPO commitments is necessary as the protocol scales. An RTO that was adequate for a protocol with modest total value locked may be inadequate for the same protocol at ten times the scale. The DRP should specify the conditions, whether measured by TVL thresholds, user count, or transaction volume, that trigger a review of recovery objectives and the investment required to meet them.
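Review triggers of this kind can be expressed as threshold bands, with a review required whenever a metric has moved into a higher band since the objectives were last reviewed. The metric names and threshold values in this sketch are hypothetical examples, not recommended figures.

```python
from bisect import bisect_right


def bands_crossed(value: float, thresholds: list[float]) -> int:
    """Count how many thresholds the value has reached or passed."""
    return bisect_right(sorted(thresholds), value)


def metrics_triggering_review(
    at_last_review: dict[str, float],
    current: dict[str, float],
    thresholds: dict[str, list[float]],
) -> list[str]:
    """Metrics that have entered a higher band since the last review."""
    return [
        metric
        for metric, levels in thresholds.items()
        if bands_crossed(current[metric], levels)
        > bands_crossed(at_last_review[metric], levels)
    ]


# Hypothetical trigger levels.
thresholds = {
    "tvl_usd": [10e6, 100e6, 1e9],
    "daily_transactions": [1_000, 10_000],
}
at_last_review = {"tvl_usd": 40e6, "daily_transactions": 2_000}
current = {"tvl_usd": 120e6, "daily_transactions": 5_000}

print(metrics_triggering_review(at_last_review, current, thresholds))  # → ['tvl_usd']
```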
Testing Your Disaster Recovery Plan Before You Need It
An untested DRP is not a recovery capability. It is a document. The difference between a recovery plan and a recovery capability is the testing programme that validates whether the procedures work as designed, under the conditions in which they will need to be executed.
DRP testing has three forms, each with a different scope and resource requirement. Document review and walkthrough involves the relevant team members reading through the DRP procedures and verbally tracing the steps, identifying gaps, ambiguities, or dependencies that are not captured in the document. This is the lowest-cost form of testing and should be conducted every time the DRP is updated. It does not validate whether the procedures actually work, but it identifies obvious gaps before they surface in a real incident.
Tabletop exercises present the team with a realistic scenario, such as a hot wallet key suspected compromised, and require them to work through the response and recovery procedures verbally in a facilitated session. Tabletop exercises surface decision-making gaps, role ambiguities, and communication failures that document review alone does not reveal. They build familiarity with the procedures among the people who will need to execute them. Tabletop exercises should be conducted at least twice per year, against different scenarios each time, and should include the full range of stakeholders: technical recovery personnel, legal counsel, communications, and senior leadership.
Technical recovery drills involve actually executing the recovery procedures against a test or staging environment that mirrors the production signing infrastructure. This is the most resource-intensive form of testing but the only one that validates whether the procedures work at the technical level. A technical drill that fails to complete within the RTO identifies a gap between the designed recovery capability and the achievable recovery time. At least one full technical recovery drill should be conducted annually, and the results should be documented and reviewed by leadership.
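The testing cadence is easy to let slip, so it helps to track it mechanically. The sketch below flags test types that are overdue against the intervals described above; treating "twice per year" as one exercise every 183 days is an approximation made here, and the function and dictionary names are illustrative.

```python
from datetime import date, timedelta

# "Twice per year" is approximated as at most 183 days between exercises.
MAX_INTERVAL = {
    "tabletop exercise": timedelta(days=183),
    "technical recovery drill": timedelta(days=365),
}


def overdue_tests(last_run: dict[str, date], today: date) -> list[str]:
    """Test types never run, or last run longer ago than their interval."""
    overdue = []
    for test_type, interval in MAX_INTERVAL.items():
        last = last_run.get(test_type)
        if last is None or today - last > interval:
            overdue.append(test_type)
    return overdue


last_run = {
    "tabletop exercise": date(2024, 1, 10),
    "technical recovery drill": date(2023, 3, 1),
}
print(overdue_tests(last_run, date(2024, 6, 1)))  # → ['technical recovery drill']
```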
Each test or drill should produce a written outcome report that identifies what worked, what failed, and what actions are required to close the gaps identified. Those action items should be tracked to completion and the DRP updated to reflect any procedural changes made as a result. A testing programme that does not result in DRP updates is not improving recovery capability.
The human factors dimension of DRP testing is frequently underweighted. Recovery procedures that work smoothly in a planned drill may fail under the psychological pressure of a real incident. Training the team to execute procedures under simulated pressure, by introducing time constraints or partial information into the tabletop scenario, builds the psychological resilience required for effective performance in a real incident. The ability to follow a procedure calmly and sequentially when under pressure is a skill that is developed through practice, not an innate one.
Regulatory Notification Obligations After a Crypto Breach
The regulatory landscape for crypto firms has matured significantly in recent years, and the notification obligations following a security breach are now a formal legal requirement in most major jurisdictions. Failure to comply with notification timelines adds regulatory risk to an already serious incident, and the penalties for non-compliance can be substantial. The DRP must include an explicit regulatory notification section that maps the firm's jurisdictions of operation to the applicable notification requirements.
In the EU, crypto-asset service providers authorised under the Markets in Crypto-Assets Regulation (MiCA) are subject to the incident reporting regime of the Digital Operational Resilience Act (DORA). A major ICT-related incident must be reported to the competent authority in stages, with the initial notification due within four hours of the incident being classified as major and no later than 24 hours after the firm becomes aware of it. Incidents that affect the confidentiality, integrity, or availability of the firm's systems, or that result in financial loss to users, are likely to meet the significance threshold. The notification must include the nature of the incident, the systems affected, and the remediation actions being taken.
In the UK, firms regulated by the Financial Conduct Authority are subject to the FCA's operational resilience framework, which includes notification requirements for operational incidents. Firms operating under an e-money licence, a payment institution licence, or as a cryptoasset business registered under the Money Laundering Regulations may have additional notification timelines depending on the nature of the incident.
Data protection obligations are separate from financial regulatory obligations and may be triggered simultaneously. If a key compromise involved unauthorised access to personal data, GDPR notification requirements apply: the supervisory authority must be notified without undue delay and, where feasible, within 72 hours of the firm becoming aware of the breach, and affected individuals must be notified without undue delay if the breach is likely to result in a high risk to their rights and freedoms. The 72-hour clock runs from the point of awareness, not from the point at which the full scope is established, which means that notification can fall due while the investigation is still in progress.
Notification to affected users presents a communications challenge that the DRP should address. User notification must be factually accurate, proportionate to the established scope of the incident, and legally reviewed before publication. Premature notification that overstates the scope, or delayed notification that violates regulatory timelines, both create additional legal exposure. The DRP should specify who is responsible for drafting user communications, who must approve them, and what the review and publication process is during an incident.
Cross-border operations add complexity to the notification picture. A firm operating across multiple jurisdictions may face concurrent notification obligations with different timelines, different content requirements, and different competent authorities. Pre-mapping these obligations in the DRP, ideally with legal counsel's input, prevents the situation where compliance obligations are discovered for the first time during an incident. Each jurisdiction in which the firm operates should have its notification requirements documented, with the contact details of the relevant competent authority and a template notification structure that can be adapted to the specific incident.
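Once the obligations have been pre-mapped, turning them into concrete deadlines at the moment of awareness is a small calculation. In the sketch below, only the 72-hour GDPR window is taken from the text; the insurer entry is a hypothetical placeholder standing in for whatever the firm's policy specifies, and the structure and names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Obligation:
    recipient: str
    jurisdiction: str
    window: timedelta  # time allowed after the firm becomes aware


# Only the GDPR window comes from the text above; the insurer entry is a
# hypothetical placeholder to be replaced with the actual policy term.
OBLIGATIONS = [
    Obligation("data protection supervisory authority", "EU (GDPR)", timedelta(hours=72)),
    Obligation("cyber insurer", "policy-defined", timedelta(hours=48)),
]


def notification_schedule(aware_at: datetime, obligations=OBLIGATIONS):
    """Return (deadline, obligation) pairs, earliest deadline first."""
    schedule = [(aware_at + o.window, o) for o in obligations]
    return sorted(schedule, key=lambda pair: pair[0])


aware_at = datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc)
for deadline, obligation in notification_schedule(aware_at):
    print(deadline.isoformat(), obligation.recipient)
```

Keeping the awareness timestamp in UTC avoids ambiguity when the incident team and the competent authorities sit in different time zones.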
Insurance notification is a further obligation that firms with cyber insurance cover must not overlook. Most cyber insurance policies specify notification timelines for covered incidents, and failure to notify within the policy window may affect coverage. The DRP should identify the insurer, the policy number, the notification timeline, and the designated contact for incident reporting.
Frequently Asked Questions
What is the difference between a crypto disaster recovery plan and a business continuity plan?
A business continuity plan addresses how an organisation continues to operate across a broad range of disruptive events, covering people, processes, communications, and alternative operating arrangements. A disaster recovery plan is a subset of the BCP that focuses specifically on restoring technical infrastructure and systems to operational state after a disruptive event. In a crypto context, the DRP addresses the specific technical recovery of signing infrastructure, key material, and on-chain access, which the BCP alone does not cover at the required operational depth.
What should happen in the first 24 hours after a suspected key compromise?
The first 24 hours should follow a pre-defined playbook: initiate emergency key rotation or freezing of affected signing authority, assemble the incident response team and establish out-of-band communications, preserve all logs and evidence before any remediation steps that might overwrite them, notify legal counsel, begin a preliminary scope assessment to determine which keys and systems are affected, and assess whether regulatory notification obligations are triggered. Acting without a plan in this window is the primary cause of escalating losses.
How do you recover a multi-sig wallet after a signer compromise?
Multi-sig recovery after a signer compromise involves: confirming that the number of remaining uncompromised signers meets or exceeds the signature threshold, using the uncompromised signers to authorise a transaction that removes the compromised signer key from the signing set and adds a new, freshly generated key, and auditing all transactions signed during the period of potential compromise. If the uncompromised signers cannot reach the threshold, or the compromised signers can, recovery is significantly more complex and may require protocol-level intervention.
What are realistic RTO and RPO targets for a Web3 firm?
Recovery Time Objective defines how long the organisation can tolerate being unable to sign transactions or operate its protocol. For DeFi protocols with continuous user activity, an RTO of hours rather than days is appropriate. Recovery Point Objective defines the maximum acceptable data or state loss. In a crypto context, RPO is often effectively zero for on-chain state since the blockchain preserves all historical transactions, but off-chain systems require explicit RPO commitments backed by tested backup and restoration procedures.
How often should a crypto disaster recovery plan be tested?
At minimum, a tabletop exercise should be conducted twice per year, and a full technical recovery drill should be run at least annually. The DRP should also be reviewed and updated after every significant change to signing infrastructure, key custody arrangements, team composition, or protocol architecture. Untested recovery procedures are not reliable recovery procedures: the drill is the mechanism that validates whether the plan actually works.