Discover how to minimize the impact of system failures with fault tolerance strategies. Learn practical steps and examples to boost resilience and protect revenue.
What can be done when equipment or a system fails? Even when a failure does occur, preparing in advance can limit the damage and minimize the impact on revenue. This article introduces specific measures that improve resilience against failures, along with concrete examples.
The term fault tolerance is sometimes used when talking about problems that occur in equipment or systems. What does fault tolerance mean, and in what situations is it used?
Any equipment or system will inevitably experience some kind of accident or trouble over a long period of use.
Trace any system through its operating environment and you will eventually reach a physical device, and the parts built into that device deteriorate over time. As long as physical equipment is involved, some kind of hardware problem will eventually occur, even if the software has no defects.
Such accidents and problems, or malfunctions due to aging, will cause equipment and systems to fail.
For example, when a system failure occurred at Japan Airlines (JAL) in February 2022, recovery took about 10 hours. Automatic check-in machines and reservation services used for boarding procedures became unusable, and many flights across the country were delayed. The announced cause was a failure of a server used in the connection infrastructure system.
In this way, even systems operated at a high level can experience failures due to physical factors.
So, what kind of preparations should we make, knowing that a failure will occur at some point? This is where the idea of fault tolerance comes in.
Fault tolerance is the ability of a device or system to maintain its function and continue operating when a failure occurs, or the mechanism that provides this ability. It is increased by providing a backup system or a function that can handle the problem, so that operation continues even if some components stop working.
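As a rough illustration of the backup idea, the following Python sketch tries a primary component first and falls back to a standby when the primary fails. The names call_with_failover, primary, and backup are hypothetical and chosen only for this example. It is a minimal sketch under those assumptions, not a description of any particular product.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when neither the primary nor any backup could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful result."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed: {errors}")


def primary() -> str:
    # Hypothetical primary server that has suffered a hardware failure.
    raise ConnectionError("primary server is down")


def backup() -> str:
    # Hypothetical standby server that is still healthy.
    return "response from backup"


# The caller still gets an answer even though the primary is down.
print(call_with_failover([primary, backup]))
```

The caller does not need to know which component answered. That is the sense in which the function is maintained even though one component has stopped working.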
In this sense, fault tolerance is also described as resilience against failures.
Two related terms have meanings close to fault tolerance: "fault avoidance" and "high availability."
Fault avoidance is the ability to avoid failures, that is, to prevent them from occurring in the first place. Reliability is increased through sufficient testing and maintenance so that failures themselves do not occur. It aims at continued operation through a different approach from fault tolerance. In some cases, fault avoidance is built into the product design itself.
Availability is the degree to which equipment or a system can be used when it is needed, and high availability refers to a state in which that degree is high. In other words, a highly available product is one that can be used continuously for a long time. To achieve high availability, it is effective to address fault tolerance and fault avoidance at the same time. By making failures unlikely to occur and preparing measures to maintain operation even if one does occur, equipment and systems can be kept available.
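Availability is commonly expressed as the fraction of time a system is usable. The Python sketch below computes it in two standard ways: from observed uptime and downtime, and from mean time between failures (MTBF) and mean time to repair (MTTR). The figures are illustrative assumptions: a hypothetical system with a single 10-hour outage in one year (the same length as the outage described earlier) and made-up MTBF and MTTR values.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the observed period during which the system was usable."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system with one 10-hour outage and no other downtime in a year.
downtime = 10.0
print(f"{availability(HOURS_PER_YEAR - downtime, downtime):.4%}")

# Illustrative values only: failing less often (higher MTBF) and recovering
# faster (lower MTTR) both raise availability.
print(f"{availability_from_mtbf(1000, 10):.4%}")  # baseline
print(f"{availability_from_mtbf(2000, 10):.4%}")  # fewer failures
print(f"{availability_from_mtbf(1000, 5):.4%}")   # faster recovery
```

In these terms, fault avoidance roughly corresponds to raising MTBF, while fault tolerance reduces the effective downtime per failure. This is one way to see why addressing both together is effective.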
Improving fault tolerance has the following benefits:
BCP (Business Continuity Plan) is an initiative to maintain the continuity of business activities, including a company's funds and employees.
It is now becoming common knowledge around the world that it is essential to prepare measures in case a company's business activities are halted due to a disaster, terrorist attack, or large-scale failure. Increasing the fault tolerance of equipment and systems is one BCP measure, as it helps prevent business activities from being halted. If equipment or systems are important to a company's business activities, the importance of fault tolerance also increases.
If a system is in place that allows operations to continue, and downtime can be kept to a minimum through a rapid response when a failure does occur, the company's credibility can be maintained. This helps avoid situations in which credibility is damaged and business opportunities are lost.
The biggest goal for a company is to secure profits, and equipment and systems are ultimately used to generate those profits. If that equipment or those systems stop operating, profits are lost for as long as they are stopped. In other words, establishing a system to continue operation is an effort to maximize profits.
Fault tolerance is mainly used as an IT term, and some people may think of it as applying only to software. However, fault tolerance means preparing for all causes of faults and is not limited to software. Let's consider how to improve fault tolerance using some examples.
Data centers often store programs and databases that operate important systems for client companies, and even a few seconds of service interruption can result in significant losses.
For this reason, fault tolerance is extremely important in order to continue operations or minimize downtime in the unlikely event of a problem.
The following measures can be considered to improve the fault tolerance of data centers:
As such, it is important to take measures on both the software and hardware sides.
Let's consider the fault tolerance of industrial robots used on manufacturing lines.
Industrial robots are equipped with many sensors, and many models use the sensor information to visualize the operating status and the state of the robot itself. In addition, models with machine vision, which use AI to interpret information captured by cameras and optical devices and process it as instructed, are becoming increasingly common.
In this way, IoT has become an indispensable component of industrial robots. In this setting, the appropriate fault tolerance measures depend on where the information obtained by the robot is sent for processing.
If the system for operating the robot is in the cloud, it is possible that the operation of the robot will be hindered by interruptions or delays in communication to the cloud.
In the operation of industrial robots, the following measures will improve fault tolerance.
Among the measures mentioned here, adopting a distributed system operation is an important measure in terms of maintaining the continuity of operations.
Edge computing is one example of a mechanism that provides distributed processing. It treats the on-site end of the network, closest to the equipment, as the network's edge and performs processing there as well as at the center.
By dividing data between what is best processed on a terminal at the edge and what should be stored in the cloud, it provides high-speed processing and real-time information. In addition, because processing is distributed, it can continue at the edge even if communication with the cloud is interrupted or delayed, which prepares the system for failures.
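The division of work between edge and cloud can be sketched in Python as follows. The time-critical decision is made locally, while records are buffered and uploaded whenever the cloud is reachable. EdgeNode, FakeCloud, and the 80 °C stop threshold are all hypothetical names and values invented for this illustration. This is a minimal sketch, not a reference design.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class EdgeNode:
    """Makes control decisions locally and forwards records to the cloud when possible."""

    def __init__(self, send_to_cloud: Callable[[Dict], None], max_buffer: int = 1000):
        self._send_to_cloud = send_to_cloud
        # Bounded buffer: once full, appending discards the oldest record.
        self._pending: Deque[Dict] = deque(maxlen=max_buffer)

    def handle_reading(self, temperature_c: float) -> str:
        # The time-critical decision is made at the edge and never waits on the cloud.
        action = "stop" if temperature_c > 80.0 else "run"  # hypothetical threshold
        self._pending.append({"temperature_c": temperature_c, "action": action})
        self.flush()
        return action

    def flush(self) -> int:
        """Upload buffered records in order; stop at the first connection failure."""
        sent = 0
        while self._pending:
            try:
                self._send_to_cloud(self._pending[0])
            except ConnectionError:
                break  # cloud unreachable: keep the records and retry later
            self._pending.popleft()
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        return len(self._pending)


class FakeCloud:
    """Stand-in for a cloud endpoint whose connectivity can be switched off."""

    def __init__(self) -> None:
        self.online = True
        self.received: List[Dict] = []

    def upload(self, record: Dict) -> None:
        if not self.online:
            raise ConnectionError("cloud unreachable")
        self.received.append(record)


cloud = FakeCloud()
node = EdgeNode(cloud.upload)

cloud.online = False               # simulate a communication failure
print(node.handle_reading(85.0))   # the robot still gets a local decision
print(node.pending)                # the record waits in the local buffer

cloud.online = True                # connectivity returns
node.flush()
print(len(cloud.received))         # buffered data has reached the cloud
```

The bounded buffer is a deliberate trade-off in this sketch: during a long outage the oldest records are dropped so that the edge device does not run out of memory. A real deployment might persist them to local storage instead.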
In this way, distributed system operation increases the possibility of continued operation.
Fault tolerance means having a mechanism in place to maintain operation even if a problem occurs with equipment or systems.
Up until now, development of IT products and systems has included improvements to their fault tolerance. Now that IoT has become a fundamental technology for industry and daily life, improving fault tolerance is an essential element.
Fault tolerance, which has been important in large-scale data centers and infrastructure systems, is spreading to a variety of fields. In the future, fault tolerance is expected to become important not only in centralized systems such as data centers, but also in on-site systems in a wide range of fields, such as manufacturing and logistics.
To improve fault tolerance, it is necessary to consider the fault tolerance of the platform used in the field. Distributed system operation is an essential measure when considering fault tolerance. Edge computing, which performs processing using a distributed structure, is likely to become an indispensable technology in the future.
Please also read this article:
Availability in Edge Computing | Stratus Blog