Sunday, September 20, 2015

Review of "The Google File System"

The problem here is certainly real: most (all?) existing networked file systems simply were not designed to handle data at the scale of Google and other big data use cases. A new way of storing this data, with fundamentally different considerations, was definitely necessary. 

The solution's main idea is to have a single master server that handles all metadata and coordination (but stores no file data), plus many (up to thousands of) chunkservers that actually store the data, replicated across multiple chunkservers. A slightly relaxed consistency model and the lack of a full POSIX filesystem API allow for a simpler, more efficient file system under their assumptions about workload.
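To make the division of labor concrete, here is a minimal sketch (in Go, with hypothetical type and field names, not the paper's actual interfaces) of the read path this design implies: the client asks the master only for metadata (a chunk handle and replica locations), then fetches the bytes directly from a chunkserver.

```go
package main

import "fmt"

// Hypothetical types illustrating the split between metadata (master)
// and data (chunkservers). Names here are assumptions for illustration.
type chunkHandle uint64

type master struct {
	chunks map[string]map[int64]chunkHandle // (file, chunk index) -> handle
	locs   map[chunkHandle][]string         // handle -> chunkserver addresses
}

// lookup returns metadata only; the master never serves file data itself.
func (m *master) lookup(path string, chunkIndex int64) (chunkHandle, []string) {
	h := m.chunks[path][chunkIndex]
	return h, m.locs[h]
}

const chunkSize = 64 << 20 // 64 MB chunks, as described in the paper

func read(m *master, path string, offset int64) {
	chunkIndex := offset / chunkSize
	handle, replicas := m.lookup(path, chunkIndex)
	// The client caches this metadata and then contacts one of the
	// replicas directly, keeping the master off the data path.
	fmt.Printf("read %s@%d: chunk %d from one of %v\n", path, offset, handle, replicas)
}

func main() {
	m := &master{
		chunks: map[string]map[int64]chunkHandle{"/logs/web.log": {0: 42}},
		locs:   map[chunkHandle][]string{42: {"cs1:7777", "cs2:7777", "cs3:7777"}},
	}
	read(m, "/logs/web.log", 1<<20)
}
```

The key design point this sketch highlights is that the master handles only small, cacheable metadata requests, which is what lets a single machine coordinate thousands of chunkservers.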

This solution is different because it makes very different assumptions about both the workload and the system than previous file systems did. Generally, file systems attempt to implement the full suite of POSIX file operations, provide fairly strong consistency, assume that machines fail infrequently, and attempt to optimize for a wide range of cases. GFS assumes that machines fail often, and makes strong assumptions about how the file system will be used: mostly append-only writes, very large files, few random reads, etc.

A fundamental trade-off here is flexibility vs. speed/efficiency/scalability. GFS gives up a lot in terms of flexibility as compared to a normal file system, optimizing for a very specific use case, which allows for a significant reduction in complexity and an increase in scalability and efficiency. There is also a trade-off of simplicity vs. efficiency in having a single master node rather than some sort of distributed coordination protocol (the master can become a bottleneck).

I do think this paper will be influential in 10 years, and in fact it already has proved to be: HDFS is still very widely used in industry and is based primarily on this paper.
