Redundancy Does Not Imply Fault Tolerance
- Authors
- Aishwarya Ganesan; Andrea C. Arpaci-Dusseau; Ramnatthan Alagappan; Remzi H. Arpaci-Dusseau
- Source
- ACM Transactions on Storage. 13:1-33
- Subject
- Computer science
Distributed computing
020206 networking & telecommunications
020207 software engineering
Fault tolerance
02 engineering and technology
Data loss
Hardware and Architecture
Software fault tolerance
Distributed data store
Data_FILES
0202 electrical engineering, electronic engineering, information engineering
Redundancy (engineering)
Data Corruption
Unavailability
Cloud storage
- ISSN
- 1553-3093
1553-3077
We analyze how modern distributed storage systems behave in the presence of file-system faults such as data corruption and read and write errors. We characterize eight popular distributed storage systems and uncover numerous problems related to file-system fault tolerance. We find that modern distributed systems do not consistently use redundancy to recover from file-system faults: a single file-system fault can cause catastrophic outcomes such as data loss, corruption, and unavailability. We also find that these outcomes stem from fundamental problems in file-system fault handling that are common across many systems. Our results have implications for the design of next-generation fault-tolerant distributed and cloud storage systems.