Similarity-Based Efficiency Improvements with Workload Awareness for Distributed Storage Systems
Access status:
USyd Access
Type
Thesis
Thesis type
Doctor of Philosophy
Author/s
Zhang, Binqi
Abstract
Data storage systems play an important role in the cloud. In the era of big data, new applications such as data lakes, artificial intelligence and the Internet of Things are all generating massive amounts of data. To improve storage efficiency, data deduplication and compression technologies are commonly used. Applying them in primary storage systems usually comes at a cost: a performance penalty. Most existing methods work well for secondary storage but have overlooked improvements for primary storage. In addition, traditional deduplication and compression approaches have not paid enough attention to the differing characteristics of the data itself. Traditional chunk-based deduplication is limited in that it may fail to identify content similarity even when two files differ only in very small parts. Delta compression can be a good complementary space-saving method when deduplication does not work well. Prior work has focused on improving the performance of the delta compression algorithm itself. Exploiting temporal and spatial locality is another way to improve the performance of data compression and deduplication in a storage system. Through our evaluation, we find that content-level similarity measures, such as similar photo tags, correlate to some degree with data compressibility. Based on these findings, we propose similarity-based efficiency improvements that are aware of some system-level and workload attributes. First, binary similarity-based routing algorithms in a primary distributed storage system can optimise the hit rate of the underlying deduplication firmware in the SSD. Second, selectively applying compute-intensive delta compression may achieve the best return on the computing resources used. Finally, the new system design could prompt more research into data-specialised storage systems.
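The limitation of chunk-based deduplication noted in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis: it assumes fixed-size chunking (the simplest variant; content-defined chunking reduces the shift problem but still misses changes inside a chunk), and the helper names `chunk_fingerprints`, `dedup_hit_ratio` and `delta_literal_bytes` are hypothetical. The delta measure uses Python's `difflib` as a stand-in for a real delta encoder.

```python
import hashlib
from difflib import SequenceMatcher


def chunk_fingerprints(data: bytes, chunk_size: int = 8) -> list[str]:
    """Hash each fixed-size chunk (assumed chunking scheme, for illustration)."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def dedup_hit_ratio(base: bytes, new: bytes, chunk_size: int = 8) -> float:
    """Fraction of chunks in `new` whose fingerprint already exists in `base`."""
    known = set(chunk_fingerprints(base, chunk_size))
    new_fps = chunk_fingerprints(new, chunk_size)
    return sum(fp in known for fp in new_fps) / len(new_fps)


def delta_literal_bytes(base: bytes, new: bytes) -> int:
    """Bytes of `new` that cannot be copied from `base` (a rough proxy for delta size)."""
    matcher = SequenceMatcher(None, base, new, autojunk=False)
    return sum(
        j2 - j1
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
        if tag in ("replace", "insert")
    )


# Two "files" that differ by a single byte inserted at the front.
base = bytes(range(256))
new = b"X" + base

hit_ratio = dedup_hit_ratio(base, new)      # every chunk boundary is shifted
literal = delta_literal_bytes(base, new)    # only the inserted byte is new
print(f"dedup hit ratio: {hit_ratio:.2f}, delta literal bytes: {literal}")
```

In this toy case the one-byte insertion shifts every chunk boundary, so no fingerprint matches, while the delta view needs to store only the single inserted byte. This is the kind of near-duplicate data for which delta compression complements deduplication.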
Date
2017-08-30
Licence
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
Faculty of Engineering and Information Technologies, School of Information Technologies
Awarding institution
The University of Sydney