How to ensure hardware redundancy if one server is broken?

How to ensure hardware redundancy if one server is broken? If one of the Tioga Pass servers breaks, can we replace it with a standard x86 machine in an emergency?

Last Update:2020-08-14
Version:001
Language:en

The Rapid.Space redundancy concept is to achieve resiliency through multiple data centers rather than through redundancy within a single data center.

Many things can break in a cloud system:

servers;
internet transit;
electricity;
building.

No matter how much hardware redundancy is in place for servers, internet transit or electricity, it will never be enough if the data center building is destroyed by fire or flood, something that does happen from time to time to conventional public clouds.

The safest approach for resiliency is thus to deploy applications in multiple data centers, possibly based on different sources of electricity or internet transit. This is called data center redundancy. This is how Nexedi has been deploying ERP5 on Rapid.Space: there are always copies of the production ERP in one or two other sites, ready to take over from the main site in case of failure. This feature, called "resiliency stack", is part of the collection of sample buildout scripts of SlapOS.
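
The takeover decision described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the SlapOS resiliency stack itself: the `Site` class, the `choose_active_site` helper and the site names are hypothetical, and the health status is passed in as plain data instead of being probed over the network.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Site:
    """One deployment of the application (hypothetical model)."""
    name: str
    healthy: bool


def choose_active_site(sites: Sequence[Site]) -> Optional[Site]:
    """Return the first healthy site in priority order, or None if all are down.

    The first entry is treated as the main site; the others are standby copies.
    """
    for site in sites:
        if site.healthy:
            return site
    return None


# Illustrative data: the main site is down, so the first standby takes over.
sites = [
    Site("main-site", healthy=False),
    Site("standby-1", healthy=True),
    Site("standby-2", healthy=True),
]
active = choose_active_site(sites)
print(active.name if active else "no site available")
```

Keeping the sites in an ordered list makes the priority explicit: the main site serves traffic whenever it is healthy, and a standby copy is selected only when every site ahead of it has failed.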

Data center redundancy does not mean that one should not also try to achieve high availability within each data center. When applications are deployed on multiple servers with appropriate high availability software (e.g. Linbit, OpenSVC, repman), the crash of one server does not affect the whole system.
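
A rough calculation shows how the two levels of redundancy combine. The sketch below assumes that failures are independent, which is a simplification, and the availability figures are made up for illustration; they are not measurements of Rapid.Space or of any other cloud. The `combined_availability` helper is a hypothetical name.

```python
def combined_availability(availability: float, replicas: int) -> float:
    """Availability of `replicas` independent copies, each up with probability `availability`.

    The service is down only when every copy is down at the same time.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - availability) ** replicas


# Illustrative figures only (assumptions, not measurements).
server_availability = 0.99   # one server
site_survival = 0.999        # building, power and transit of one data center

# Two servers in one data center: still capped by the site itself.
one_site = site_survival * combined_availability(server_availability, 2)

# The same two-server setup replicated in two independent data centers.
two_sites = combined_availability(one_site, 2)

print(f"one site:  {one_site:.6f}")
print(f"two sites: {two_sites:.6f}")
```

Under these assumptions, adding servers inside one data center can never raise availability above that of the site itself, whereas a second data center removes that ceiling.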

One should also note that even though Rapid.Space hardware is based on OCP standards, nothing prevents extending the Rapid.Space infrastructure with any kind of hardware, including traditional x86_64 servers or even other architectures (ARM, PowerPC, etc.), for further resiliency.