RAS: Reliability, Availability and Serviceability. What does this term really mean, and how does it affect your company’s IT environment?
Reliability: Reliability refers to the ability of your IT system to consistently perform according to its specifications. Features should be in place to uncover and report faults in your computing equipment. A reliable computing system, whether it’s your data servers or your data storage equipment, should be able to detect corrupted data. Instead of continuing to use bad data and worsening the problem, it alerts you and, if your system is advanced enough, begins to repair the problem.
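As a rough illustration of detecting corrupted data rather than silently using it, here is a minimal Python sketch. It assumes a CRC-32 checksum is stored alongside each record; the function names and the sample record are hypothetical, and real servers and storage systems typically do this kind of checking in hardware or firmware (ECC memory, for example).

```python
import zlib


def checksum(payload: bytes) -> int:
    """Compute a CRC-32 checksum to store alongside the data."""
    return zlib.crc32(payload)


def read_checked(payload: bytes, expected_crc: int) -> bytes:
    """Return the payload only if it still matches its stored checksum."""
    if zlib.crc32(payload) != expected_crc:
        # Refuse to hand back bad data; raise so the caller can be alerted.
        raise ValueError("data corruption detected: checksum mismatch")
    return payload


record = b"customer record 1042"
stored_crc = checksum(record)

# Simulate a value that was altered on disk after the checksum was stored.
corrupted = b"customer record 1043"
try:
    read_checked(corrupted, stored_crc)
except ValueError as err:
    print(f"ALERT: {err}")
```

The point is the behavior, not the particular checksum: the fault is surfaced as an alert instead of being passed along to whatever uses the data next.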
Availability: Availability is the amount of time your system is actually running compared with the amount of time it is expected or required to run. It can also be expressed as average downtime per week, month or year, or as total downtime for a given week, month or year. Availability features allow your IT system to remain up and running even when failures or data corruption are found. Salespeople will often describe availability in terms of degraded performance (for instance, they will say your server can operate at 80% throughput capacity even if two data ports are down). A highly available system can disable a malfunctioning part and continue operating at a reduced but still manageable capacity. A poorly available system might crash completely and become non-operational.
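To make the arithmetic concrete, the following Python sketch converts between the two ways of stating availability described above: a ratio of actual to required running time, and downtime per year. The function names and the sample figures are illustrative assumptions, not numbers from any particular product.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability(uptime_hours: float, required_hours: float) -> float:
    """Fraction of the required time the system was actually running."""
    if required_hours <= 0:
        raise ValueError("required_hours must be positive")
    return uptime_hours / required_hours


def downtime_hours_per_year(avail: float) -> float:
    """Express an availability fraction as hours of downtime per year."""
    return (1.0 - avail) * HOURS_PER_YEAR


# A system required around the clock that was down 8.76 hours in the year.
a = availability(HOURS_PER_YEAR - 8.76, HOURS_PER_YEAR)
print(f"availability: {a:.3%}")
print(f"downtime: {downtime_hours_per_year(a):.2f} hours per year")
```

Seeing both forms side by side is useful when comparing quotes: a percentage that sounds very high can still correspond to several hours of downtime per year.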
Serviceability: Serviceability addresses the ease with which a system can be repaired, maintained or serviced without critically impacting your ability to use the system while this is happening. Maintenance and repair should affect your system as little as possible, and a more serviceable IT environment can handle this work without causing disruption for users.
Some of the more commonly used methods to build RAS capabilities include:
- Redundant power: Especially in large storage systems, AC power is drawn from two separate electrical circuits to avoid a complete power-down should one circuit in the building fail. In addition, many data storage systems are delivered with two internal batteries that are powerful enough to last through a complete data recovery process should AC power fail from all external sources.
- Duplication: Redundant systems and components are designed in so that there is no single point of failure that could cause loss of access to data.
- Data backup: A key component of a fully enabled RAS environment is regular backup of your data to tape or disk library storage systems. Should something catastrophic happen to your primary IT environment, you still have a way to recreate the customer or business data that is vital to your company’s operations and move forward.
- Power-on replacement: Being able to “hot swap” electrical components while the larger server or storage system is still running is a development of the past 10 years that has drastically increased both availability and serviceability in the IT world. Individual faulty components like communication cards or power supplies can be powered down on their own (for instance, when a tape library needs repair), swapped for fully functioning ones and then restarted to become part of the larger, functioning system again.
- Over-engineering: Particularly in the area of physical failure (vibration, moving parts, temperature and voltage among others), over-designing to stay well within failure parameters creates a more robust and adaptable system. Designing systems to specifications better than the minimum requirements allows for occasional exceptions, such as a machine briefly exposed to too much vibration or an accidental temperature spike in the office, that would otherwise cause an immediate failure.
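The benefit of redundant power and duplicated components can be estimated with a simple calculation. The Python sketch below assumes that the redundant copies fail independently of one another and that any single working copy is enough to keep the system running; both are simplifying assumptions, and the function name and figures are illustrative only.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of a group of redundant copies of one component.

    Assumes independent failures, and that the group is up as long as
    at least one copy is up.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The group is down only when every copy is down at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


for n in (1, 2, 3):
    print(f"{n} power supply(ies): {parallel_availability(0.99, n):.4%}")
```

Under these assumptions, a second copy of a component that is up 99% of the time brings the pair to roughly 99.99%. In practice, shared causes such as a building-wide outage make failures less independent than this, which is one reason separate circuits and internal batteries are used together.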
RAS is a term originally coined at IBM to describe the robustness and superiority of its products. It has since been adopted across the electronics world and can be used to discuss computing products with almost anyone these days. Be sure you’re asking questions about the RAS side of any environment being designed, maintained or operated for your company’s IT needs.