With the rapid development of cloud computing, more and more critical information systems run in the cloud. As users and enterprises across industries rely increasingly on network applications and data, sudden disasters such as fires, floods, earthquakes, regional power outages, or deliberate sabotage can severely affect an enterprise's data and business operations, causing loss of important information, service interruption, economic loss, and loss of customers.
Therefore, to ensure business continuity and data reliability for information systems in the cloud, Huawei provides disaster recovery solutions for cloud computing, so that key data is not lost when a disaster occurs and system services resume as quickly as possible.
A disaster tolerance system consists of two or more functionally identical systems established at geographically separate sites, with health monitoring and function switching between them. When one system stops working because of an incident (such as fire, flood, earthquake, or deliberate sabotage), the entire application can be switched to another system so that it continues to work normally.
The disaster recovery system needs relatively complete data protection and disaster recovery functions to preserve data integrity and business continuity when the production center cannot work normally. In that case the disaster recovery center takes over in the shortest possible time and restores normal operation of the business system, so as to minimize the loss.
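The health monitoring and switching behavior described above can be sketched roughly as follows. This is a minimal illustration, not Huawei's implementation: the `Site` and `FailoverMonitor` names, the `is_healthy` probe, and the threshold of three consecutive failures are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Site:
    """A hypothetical site; `is_healthy` stands in for a real health probe."""
    name: str
    is_healthy: Callable[[], bool]


class FailoverMonitor:
    """Switch the active site after several consecutive failed health checks."""

    def __init__(self, production: Site, standby: Site, max_failures: int = 3):
        self.active = production
        self.standby = standby
        self.max_failures = max_failures  # assumed threshold, not from the source
        self.failures = 0

    def check_once(self) -> str:
        """Run one health check and return the name of the site now serving."""
        if self.active.is_healthy():
            self.failures = 0
            return self.active.name
        self.failures += 1
        if self.failures >= self.max_failures and self.standby.is_healthy():
            # Swap roles: the standby takes over, the failed site becomes standby.
            self.active, self.standby = self.standby, self.active
            self.failures = 0
        return self.active.name


def demo():
    """Simulate a production outage and record which site serves each check."""
    state = {"production": True, "dr": True}
    production = Site("production", lambda: state["production"])
    dr = Site("dr", lambda: state["dr"])
    monitor = FailoverMonitor(production, dr, max_failures=3)
    history = [monitor.check_once()]
    state["production"] = False  # simulate a disaster at the production center
    for _ in range(3):
        history.append(monitor.check_once())
    return history


print(demo())
```

Requiring several consecutive failures before switching is one common way to avoid failing over on a single transient probe error; a real system would also need to handle data synchronization and failback, which this sketch omits.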
Evaluation indicators for disaster recovery systems
The main purpose of a disaster recovery system is to prevent business interruption when a disaster occurs. So when a disaster occurs, what do users care about most?
The following criteria come from SHARE 78, the internationally recognized assessment standard for disaster recovery systems. Users can apply them as indicators to measure and select disaster recovery solutions:
- Scope of backup/restore.
- The status of the disaster recovery plan.
- The distance between the business center and the disaster recovery center.
- How the business center and the disaster recovery center are connected.
- How data is transmitted between the two centers.
- How much data loss is acceptable.
- How updated data is kept current at the disaster recovery center.
- The ability of the disaster recovery center to start the disaster recovery process.
Therefore, the design of a disaster recovery system is driven mainly by these user needs. Because the funds users can invest are limited, it is clearly difficult to reach level 6 of disaster tolerance with a small budget.
The system we design can only minimize the outage duration and recover as much data as possible under the existing conditions. These two outcomes are also how the quality of a disaster recovery design is measured. In practice, the design process focuses on two indicators: RTO and RPO.
RPO (Recovery Point Objective):
The recovery point objective is expressed in terms of time: it is the point in time to which the system and data must be recovered after a disaster occurs. RPO marks the maximum amount of data loss that the system can tolerate. The less data loss the system can tolerate, the smaller the RPO.
RTO (Recovery Time Objective):
The recovery time objective is also expressed in terms of time: it is the time allowed for the information system or business function to be restored after a disaster occurs. RTO marks the maximum service outage that the system can tolerate. The more urgent the system's services, the smaller the RTO.
RPO concerns data loss, while RTO concerns service loss. Both must be set according to the requirements of each business, after risk analysis and business impact analysis.
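As a rough illustration of how the two indicators are checked against an actual incident, the sketch below computes the data-loss window and the downtime from three timestamps and compares them with the targets. The function names, the timestamps, and the target values are all made up for the example.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recoverable_copy: datetime, disaster: datetime) -> timedelta:
    """Time between the last recoverable copy and the disaster (compared with RPO)."""
    return disaster - last_recoverable_copy


def downtime(disaster: datetime, service_restored: datetime) -> timedelta:
    """Time between the disaster and service restoration (compared with RTO)."""
    return service_restored - disaster


def meets_objectives(last_recoverable_copy: datetime, disaster: datetime,
                     service_restored: datetime,
                     rpo: timedelta, rto: timedelta) -> dict:
    """Report whether an incident stayed within the RPO and RTO targets."""
    return {
        "rpo_met": data_loss_window(last_recoverable_copy, disaster) <= rpo,
        "rto_met": downtime(disaster, service_restored) <= rto,
    }


# Hypothetical incident: last copy at 09:45, disaster at 10:00, service back at 11:30.
result = meets_objectives(
    last_recoverable_copy=datetime(2024, 1, 1, 9, 45),
    disaster=datetime(2024, 1, 1, 10, 0),
    service_restored=datetime(2024, 1, 1, 11, 30),
    rpo=timedelta(minutes=30),  # assumed target
    rto=timedelta(hours=1),     # assumed target
)
print(result)
```

In this made-up incident, 15 minutes of data are lost, which is within the 30-minute RPO, but the 90-minute outage exceeds the 1-hour RTO.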
A good disaster tolerance system should meet users' needs as fully as possible, but its design is often constrained by a variety of conditions, such as available technology, current network conditions, user intent, and the nature of the user's business. By far the most decisive factor, however, is the cost of building the disaster recovery system.
Disaster recovery system construction process
According to the disaster tolerance system construction model, the construction process is divided into four stages: analysis, strategy formulation, project implementation, and testing/exercise/maintenance. Each stage is explained below:
1. Analysis stage
Once management has formally approved the project and personnel and resources are secured, first collect information about business processes, the supporting technical infrastructure, disaster types, and so on. Then conduct business impact analysis and risk analysis to determine the likely impact of interruptions and anticipated disasters. The results of the analysis are used to determine the criticality of each business, its required recovery time, and the degree of data loss that can be tolerated.
2. Strategy formulation stage
In this stage, combine the analysis results above with the enterprise's investment plan for disaster recovery to formulate its short-term and long-term disaster recovery strategies and goals, and define a set of preliminary candidate solutions.
These candidates are then analyzed further against the various constraints. Unsuitable solutions are eliminated, and the remaining ones are submitted to the evaluation team, which selects the most suitable disaster recovery solution after a thorough, detailed review.
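The elimination step can be pictured as a simple filter over the candidate solutions. The sketch below is only an illustration under assumed inputs: the `Candidate` fields, the solution names, and every figure are hypothetical, and a real review would weigh many more factors than RPO, RTO, and cost.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A hypothetical candidate solution with its estimated characteristics."""
    name: str
    rpo_minutes: float  # estimated worst-case data loss
    rto_minutes: float  # estimated worst-case downtime
    cost: float         # estimated construction cost, arbitrary units


def shortlist(candidates: List[Candidate], max_rpo: float, max_rto: float,
              budget: float) -> List[Candidate]:
    """Drop candidates that miss any requirement; order the rest by cost."""
    suitable = [
        c for c in candidates
        if c.rpo_minutes <= max_rpo and c.rto_minutes <= max_rto and c.cost <= budget
    ]
    return sorted(suitable, key=lambda c: c.cost)


# Illustrative figures only; none of these numbers come from the source.
candidates = [
    Candidate("nightly off-site backup", rpo_minutes=1440, rto_minutes=2880, cost=10),
    Candidate("asynchronous replication", rpo_minutes=15, rto_minutes=120, cost=60),
    Candidate("synchronous replication", rpo_minutes=0, rto_minutes=30, cost=150),
    Candidate("active-active sites", rpo_minutes=0, rto_minutes=5, cost=400),
]
remaining = shortlist(candidates, max_rpo=30, max_rto=240, budget=200)
print([c.name for c in remaining])
```

With these made-up requirements, the nightly backup is eliminated for missing the RPO and the active-active option for exceeding the budget, leaving two solutions for the evaluation team to review.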
3. Project implementation stage
Based on the selected disaster tolerance solution, integrate the enterprise's relevant resources, determine the disaster tolerance system architecture and the disaster recovery plan, and achieve the required disaster tolerance goals through technical means and services.
4. Test/exercise/maintenance stage
Any plan must undergo continuous testing and revision to keep pace with the enterprise's evolving needs.
At the same time, training and testing familiarize the enterprise's internal personnel with their roles in the disaster recovery process, so that recovery can be carried out in an orderly manner when a disaster actually occurs.