High Availability Solutions in Fault Tolerant Systems

Availability is a way to measure the durability of a system. Availability can be defined as the length of time a system is actually functioning (or service operational), divided by the length of time the system could have been working.

What are Typical Availability Levels?

Systems are usually segmented into availability levels by their number of “nines,” and further described using terms like “highly-available” and “fault-tolerant.” If a system is available 99% of the time (two nines), that then means it’s not available 1% of the time. In any given year with 525,600 available minutes, you can expect a “two nines” system to be down for 5,256 of those minutes, or for about 88 hours or 4 days. Depending on your particular cost of downtime, this can be expensive.

Systems operating at higher “four nines” and “five nines” average availability levels are often called “highly-available” or “fault-tolerant” systems with "seven nines" representing the evolution of fault tolerance in intelligent, predictive platforms.

What are Common Methods Used to Increase Availability?

There are several time-tested methods companies use to improve availability, ranging from improving system reliability and resilience, implementing backup and recovery procedures, or deploying redundant clusters (physical or virtual) with failover services.

1. Using Reliable and Resilient Systems

One way to improve availability is to use more reliable systems. The more rugged and reliable your system, the less likely it is to break down. The less it breaks down, the longer it keeps running, and by definition, the longer it’s available.

A related way to increase availability is to implement a more resilient system – one that can bounce back quickly from a setback. By reducing the time it takes to repair the system and resume services, you are lowering downtime and increasing overall availability. What’s interesting is that if a system can bounce back quickly every time, then it matters less how often it breaks.

2. Implementing Backup and Recovery

Reliability and resilience have their limits though. In many cases, it’s not just system availability, but data protection and data integrity that you also have to worry about.

Companies taking a more holistic approach to availability will often backup their data on a regular basis and keep spare systems in inventory. If their production systems experience a catastrophic failure, they restart services on their spare systems, recovering the data they need from their archives.

Setting up backup and recovery services requires some skill. And the time to recover can vary, from a few hours to a few days, depending on the applications, the amount of data, and the availability of spare parts.

3. Using Native and Virtual Clustering and Failover Services

For some companies, resuming services after a few hours or a few days may be acceptable. But those with higher relative downtime costs need a more resilient approach, for both their applications and data.

Clustering and failover uses the same principle as backup and recovery, but shortens the time to recover services by doing some things in advance, like replicating systems so that they’re ready to resume at a moment’s notice. Several systems are combined and data is shared by these redundant systems. Typically, one system acts as the primary, providing users with access to applications and data, while a secondary system acts as a backup, either staying dormant until needed (passive) or running other applications (active). In the event of a failure to the primary system, the application will “failover” to the secondary system and resume running there, so long as the connections to shared data are established.

With the emergence of virtualization technologies, clustering and failover concepts have been extended to virtual systems. Today, virtualization and clustering technologies are being used to combine physical systems and failover applications running on virtual machines (VM), taking advantage of VM portability.

‍

Penguin Solutions offers a wide variety of Edge Computing solutions covering the full availability spectrum. From software only products like everRun, to complete solutions like ztC Endurance, ztC Edge, and ftServer that include hardware, software, and services, help customers easily and affordably deliver highly-available and fault-tolerant workloads. Contact us today to discuss your Edge Computing needs.

Talk to the Experts at
Penguin Solutions

At Penguin, our team designs, builds, deploys, and manages high-performance, high-availability HPC & AI enterprise solutions, empowering customers to achieve their breakthrough innovations.

Reach out today and let's discuss your infrastructure solution project needs.

What is Availability?

What are Typical Availability Levels?

What are Common Methods Used to Increase Availability?

1. Using Reliable and Resilient Systems

2. Implementing Backup and Recovery

3. Using Native and Virtual Clustering and Failover Services

Related Articles

Talk to the Experts at
Penguin Solutions

Solving complexity. Accelerating results.

Get in touch

Partners

Company

What are Typical Availability Levels?

What are Common Methods Used to Increase Availability?

1. Using Reliable and Resilient Systems

2. Implementing Backup and Recovery

3. Using Native and Virtual Clustering and Failover Services

Related Articles

Talk to the Experts atPenguin Solutions

Solving complexity. Accelerating results.

Get in touch

Partners

Company

Talk to the Experts at
Penguin Solutions