The disaster recovery and business continuity industry’s mantra is “Test! Test! Test! And when you’re done…test some more!” There is another parallel theme that says, “There is no such thing as a failed test.”
Philosophically, we disagree with both of these self-fulfilling prophecies.
In fact, we believe that the vast majority of traditional tests are failures. Consider that it takes most companies weeks and sometimes months to prepare for a test, and even then, the test is predicated on specially created data and artificial “DR-only” jobs or processes. Then, even after all the preparation, consider that only a small subset of the production environment is exercised during the test. This is the state of most organizations’ disaster recovery capabilities. Many of the organizations who have followed this traditional industry model are now in their tenth or twentieth year of testing…and they have yet to exercise their production environments end-to-end! Now extrapolate from this worrisome reality and estimate the likelihood of a successful recovery from an unanticipated event that disrupts the entire production environment without time to prepare in advance.
We believe that this model is fundamentally flawed, and wish to present a different long-term objective: eliminate the need for as much testing as possible! We recognize that in all likelihood, this goal will never be achieved in its entirety…nor should it be. However, we do believe that whenever possible, assuming practical constraints, the recovery architecture should utilize an “active-configured” model. We coined the term active-configured to describe a production-recovery model where the recovery side is “pre-specified” and “pre-configured”. By pre-specified we mean that the actual target recovery environment is specifically known in all of its detail and its availability is guaranteed (or at least “repurpose-able”) at time of disaster. By pre-configured we mean that the target environment is fully configured for its recovery role before time of disaster. With an active-configured model, the vast majority of the most common and frequent testing activities can be eliminated and the more difficult end-to-end objectives can become the focus.
While the initial reaction is often that this approach is too costly for many organizations, we have proven repeatedly that many of the most critical infrastructure, applications and processes can enjoy an active-configured architecture for surprisingly little cost.
Nevertheless, even with an active-configured architecture, there is a need for ongoing testing and we favor a Life-Cycle testing approach during which there are always two concurrent, active testing cycles: the Short-Term cycle for the current test and the Long-Term program cycle.
Short-Term Test Cycle
The short-term cycle should consist of four distinct exercises that build upon one another to increase the likelihood of success and to maximize the productivity of precious testing resources (staff, test time, hardware availability, etc.). The first exercise is the unit pre-test, during which individual scripts and procedures run on test or development environments at the home location. While this exercise does not actually prove any significant level of recoverability, it does provide a simple way to validate component procedures on a readily available environment without consuming valuable dedicated test resources. The second exercise is the notification / response test, which confirms the ability to assess, notify, mobilize and deploy. The third exercise is the scenario test, which enhances the recovery teams’ ability to react effectively to a wide range of scenarios and a wide range of variables. The fourth exercise is the system and application test, which proves the ability to pragmatically recover hardware, applications and data to the business’s stated RTOs and RPOs. Finally, for each test, there is the post-test review, which ensures that all procedures are updated based on actual exercise results.
Unit Pretests (physical test)
Unit pretesting provides an invaluable, low-cost way to maximize test resources and increase focus on the more challenging end-to-end synchronization, cross-connectivity and inter-dependency issues that too often are never exercised for lack of time. Unit pretesting can be conducted repeatedly, at will, and need not burn precious whole-environment test time. It can be used to validate alternative techniques to save time or improve reliability. And, it can be used to keep procedures evergreen as the environment changes between the “big tests”. In many organizations, complete end-to-end process streams can be pre-tested on a relatively small hardware footprint. The advantage of having this level of recoverability proven before larger tests are even started is invaluable.
Notification / Response Exercise (physical and tabletop)
Too often notification exercises are oversimplified and devolve into a mechanical exercise of running through a call list or pushing the “notify all” button. While it is important to know that your contact information is accurate and up to date, and that the right people know how to use your notification system, the real objectives are mobilization and deployment, not just notification. In addition to the mechanics of how to contact staff, a comprehensive notification test must exercise management’s ability to decide: who to contact based on a wide range of impact scenarios, when to contact them based on the severity of the event, when to mobilize them based on the evolving facts of the situation and where to deploy based on the geographical nature of the event.
Four distinct steps are required in order to test the full range of potential notification, mobilization and deployment responses: a general notification step (physical), an immediate response step (table-top), a communications step (table-top) and a business unit activation step (table-top). The general notification step will exercise the ability to use the call procedures, wallet cards and/or notification systems to contact recovery staff and stakeholders in a timely manner. This step proves the accuracy of the contact information and the functionality of the call-tree and/or notification system procedures. The immediate response step will exercise the response team’s ability to quickly evaluate a complicated disaster event, to determine which areas of the organization have been or will be impacted, and to place the appropriate individuals in the correct state of readiness—standby, mobilization, deployment or stand-down. The communications step exercises the ability of the crisis communications team to quickly and accurately determine when to send communiqués, which ones to send, which interest groups to send them to and which communication vehicles to use. The final step involves the business recovery teams and will exercise their ability to activate their business units and begin bridging efforts while the recovery response is being mounted.
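The readiness decision exercised in the immediate response step can be pictured as a small decision rule. The following is a minimal sketch only: the function name `readiness_for`, the 1-to-5 severity scale and the thresholds are all assumptions made for illustration, and a real plan would define its own criteria.

```python
from enum import Enum


class Readiness(Enum):
    """The four readiness states named in the immediate response step."""
    STAND_DOWN = "stand-down"
    STANDBY = "standby"
    MOBILIZATION = "mobilization"
    DEPLOYMENT = "deployment"


def readiness_for(impacted: bool, severity: int, facts_confirmed: bool) -> Readiness:
    """Pick a readiness state for one team.

    Hypothetical rule, for illustration only:
    - severity is on an assumed 1 (minor) to 5 (catastrophic) scale
    - facts_confirmed means the evolving facts are firm enough to commit staff
    """
    if not impacted:
        return Readiness.STAND_DOWN
    if severity <= 2:
        return Readiness.STANDBY
    if not facts_confirmed:
        # Serious event, but the situation is still unclear: assemble, do not travel yet.
        return Readiness.MOBILIZATION
    return Readiness.DEPLOYMENT


if __name__ == "__main__":
    print(readiness_for(impacted=True, severity=4, facts_confirmed=False).value)
```

A table-top exercise can walk the response team through several such combinations and compare the team's decisions with what the plan prescribes.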
Scenario Exercises (Table-Top)
A scenario-based testing methodology should be used for table-top exercises. Through this model, a general set of scenarios is developed wherein activation of the recovery plan is decided. Once the general scenario is defined, more specific injects are revealed to portray any number of potential real-world variations (e.g., a data center outage with or without loss of life, a work area that is accessible or not, regional or local disturbances). In this manner, it becomes possible to test the whole plan or a subset of the plan repeatedly from a wide variety of perspectives, while focusing on the same original criteria. This “perspective-based reading” allows technical groups, user groups, executive groups, administrative groups and even outside groups to each respond to the same general set of plans and procedures (as applicable to each group) but to make recommendations based on their own understanding and experience as to how those plans should be executed to resolve the scenarios in question. Other testing methodologies tend to focus on “defect detection” in a general case, or the finding of flaws in a process. The inherent and fatal mistake of these models is revealed when they fail to account for the extremely wide variety of possible causes for plan activation. Scenario tests can be as big or small in terms of participants as desired. They can address part of the plan or all of it. They can include internal staff or be extended to vendors and municipal authorities. They can be matter-of-fact discussions or elaborate role-playing events. In all cases, they are specifically scripted, orchestrated, “timelined” and rehearsed.
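To show how quickly a single general scenario fans out once injects are applied, here is a short illustrative sketch. The helper name `expand_scenario` and the inject names are hypothetical; only the example variations themselves come from the text above.

```python
from itertools import product
from typing import Dict, Iterator, List


def expand_scenario(general: str, injects: Dict[str, List[object]]) -> Iterator[dict]:
    """Yield one variation per combination of inject values."""
    names = list(injects)
    for combo in product(*(injects[name] for name in names)):
        variation = {"scenario": general}
        variation.update(zip(names, combo))
        yield variation


if __name__ == "__main__":
    # Inject names below are assumptions chosen to mirror the examples in the text.
    variations = list(expand_scenario(
        "Data center outage",
        {
            "loss_of_life": [True, False],
            "work_area_accessible": [True, False],
            "extent": ["local", "regional"],
        },
    ))
    print(len(variations), "variations of one general scenario")
```

In practice the exercise designer would choose a handful of these variations per session rather than run them all.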
System and Application Exercises (physical)
These exercises are intended to verify the accuracy and completeness of the detailed technical recovery procedures. Any deficiencies are recorded and subsequent corrections implemented. Testing of systems and applications ensures not only that personnel are familiar with the plan procedures and that the recovery steps are accurate, but also that the procedures actually work on the target recovery equipment. Every procedure in the plan is tested thoroughly and only included in the master plan after formal confirmation of technical accuracy.
System and application tests should ensure that entire data flows, not just single applications, are completely recoverable. In testing an entire data flow, we are testing every application that a single business process utilizes, making sure that every piece of that data/process stream will be functional in ‘disaster mode’. Following recovery, an entire set of test transactions is fed into those applications with known entry- and exit-states. In this way, application teams can be confident that data is being transformed in the predicted way and that the entire recovery effort is not only successful, but repeatable.
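The idea of feeding test transactions with known entry- and exit-states through a recovered data flow can be sketched as follows. This is an illustration under stated assumptions, not a prescribed tool: the two stages (`price_order`, `post_to_ledger`) and the field names are invented stand-ins for the real applications in a business process.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, int]
Stage = Callable[[Record], Record]


def price_order(rec: Record) -> Record:
    """Hypothetical first application in the flow: compute the order total."""
    out = dict(rec)
    out["total_cents"] = rec["quantity"] * rec["unit_cents"]
    return out


def post_to_ledger(rec: Record) -> Record:
    """Hypothetical second application: post the total to a ledger balance."""
    out = dict(rec)
    out["ledger_balance_cents"] = rec["opening_balance_cents"] + rec["total_cents"]
    return out


def run_flow(stages: List[Stage], entry: Record) -> Record:
    """Pass one transaction through every stage of the data flow, in order."""
    state = entry
    for stage in stages:
        state = stage(state)
    return state


def verify_flow(stages: List[Stage], cases: List[Tuple[Record, Record]]) -> List[int]:
    """Return the indexes of test transactions whose exit state is not as expected."""
    return [i for i, (entry, expected) in enumerate(cases)
            if run_flow(stages, entry) != expected]


if __name__ == "__main__":
    entry = {"quantity": 3, "unit_cents": 250, "opening_balance_cents": 1000}
    expected = dict(entry, total_cents=750, ledger_balance_cents=1750)
    failures = verify_flow([price_order, post_to_ledger], [(entry, expected)])
    print("failed cases:", failures)
```

Because the same cases can be replayed after every recovery, an empty failure list on repeated runs is what supports the claim that the recovery is repeatable and not merely successful once.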
System and application testing also involves user groups as much as possible during all phases of the recovery. Rather than recovering simply the IT components, full system and application testing involves user requirements as well. It is not enough to assume that users and processes are functional just because the servers are recovered. The connectivity of users from multiple locations to recovered hardware and data, the ability of users to initiate transactions and access data, and the communication of progress to users must all be tested as an integral part of the entire testing process.
Post Test Reviews (physical and table-top)
The post-test review closes the loop for the complete test cycle. This is where all of the findings and shortcomings of each exercise are remediated into the plan documents in a continuous improvement process. This process assumes that each exercise is conducted “from the plan”, meaning that actual, documented plan steps are executed as opposed to shooting from the hip just to get the job done. The only effective way to document these findings is with an independent “test recorder” who fully understands all aspects of testing but who also has the time available to watch the proceedings and document them proactively.
Long-Term Life-Cycle Testing
The challenge of testing is that the production environment typically changes faster than the ability to test it. The result is that instead of each test being bigger and more aggressive than the previous one, the same general scope is repeated endlessly, albeit for the latest version of hardware, software and applications.
The implementation of a Long-Term Testing Life-Cycle will help ensure that the original investment of developing the recovery capability is protected and that test resources are maximized by following an organized, progressive cycle. Life-Cycle testing is a natural extension of the testing conducted during and immediately following plan development. It is required on a periodic basis, typically two to three times per year, to (a) verify accuracy and refine plan action steps, (b) routinely reconfirm task durations and streamline the recovery process, and (c) continually reinforce the understanding of the plan and enhance the recovery team members’ ability to respond. The Long-Term Testing Life-Cycle helps ensure that each test builds upon the proven capabilities of previous tests by constantly and consistently increasing both the objectives and scope of each test.
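Reconfirming task durations from one test to the next lends itself to a simple comparison. The sketch below is illustrative only: the task names and figures are invented, and it assumes recovery tasks run one after another, so a plan with parallel tasks would need a critical-path calculation instead of a plain sum.

```python
from datetime import timedelta
from typing import Dict, List

Durations = Dict[str, timedelta]


def total_recovery_time(durations: Durations) -> timedelta:
    """Sum the task durations (assumes strictly sequential tasks)."""
    return sum(durations.values(), timedelta())


def meets_rto(durations: Durations, rto: timedelta) -> bool:
    """True when the measured recovery time is within the stated RTO."""
    return total_recovery_time(durations) <= rto


def regressed_tasks(previous: Durations, current: Durations) -> List[str]:
    """Tasks present in both tests that took longer in the current test."""
    return sorted(task for task, took in current.items()
                  if task in previous and took > previous[task])


if __name__ == "__main__":
    # Hypothetical measurements from two consecutive tests.
    previous = {"restore_database": timedelta(hours=3),
                "rebuild_servers": timedelta(hours=2)}
    current = {"restore_database": timedelta(hours=2, minutes=30),
               "rebuild_servers": timedelta(hours=2, minutes=15),
               "reconnect_users": timedelta(minutes=30)}
    print("within RTO:", meets_rto(current, rto=timedelta(hours=6)))
    print("slower than last test:", regressed_tasks(previous, current))
```

Tracking these figures across tests gives the post-test review concrete evidence of where the recovery process has been streamlined and where it has slipped.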
Once the objective of progressive testing is accepted, it fosters a focus on continuous improvement. Costs of testing are soon replaced with investments in technologies and techniques that in turn inherently improve recovery capabilities. When combined with an ongoing needs definition process that accurately quantifies recovery requirements and requisites, RTOs and RPOs can be improved through design and planning instead of spending.