Design of global data deduplication for a scale-out distributed storage system

Myoungwon Oh, Sejin Park, Jungyeon Yoon, Sangjae Kim, Kang Won Lee, Sage Weil, Heon Y. Yeom, Myoungsoo Jung

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Scale-out distributed storage systems can sustain balanced growth in both capacity and performance on an on-demand basis. However, storing and managing the large volumes of content generated by the explosion of data remains a challenge. One promising way to mitigate these big-data issues is data deduplication, which removes redundant data across the many nodes of the storage system. Nevertheless, it is non-trivial to apply a conventional deduplication design to scale-out storage, for the following reasons. First, chunk lookup for deduplication does not scale or extend as well as the underlying storage system does. Second, managing the metadata associated with deduplication requires extensive design and implementation changes to the existing distributed storage system. Lastly, the data processing and additional I/O traffic imposed by deduplication can significantly degrade the performance of scale-out storage. To address these challenges, we propose a new deduplication method that is highly scalable and compatible with existing scale-out storage. Specifically, our method employs a double-hashing algorithm that leverages the hashes already used by the underlying scale-out storage, which addresses the limits of current fingerprint hashing. In addition, our design integrates the metadata of the file system and deduplication into a single object, and it controls the deduplication rate online, based on post-processing, by being aware of system demands. We implemented the proposed deduplication method on an open-source scale-out storage system. The experimental results show that our design can save more than 90% of total storage space under diverse standard storage workloads, while offering the same or similar performance compared to conventional scale-out storage.
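The double-hashing idea in the abstract can be illustrated with a minimal sketch: the content fingerprint (first hash) doubles as the object name, and the storage system's existing placement hash (second hash) then maps that name to a node, so no separate fingerprint index is required. The node names, hash choices, and `write_chunk` helper below are illustrative assumptions, not the paper's actual implementation.

```python
import hashlib

# Hypothetical cluster of 8 storage nodes; names are illustrative only.
NODES = [f"node-{i}" for i in range(8)]

def fingerprint(chunk: bytes) -> str:
    """First hash: a content fingerprint that identifies duplicate chunks."""
    return hashlib.sha256(chunk).hexdigest()

def place(object_id: str, nodes: list) -> str:
    """Second hash: a stand-in for the placement hash the scale-out
    storage already uses to map an object name to a node."""
    h = int(hashlib.md5(object_id.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

def write_chunk(chunk: bytes, store: dict) -> str:
    """Deduplicated write: the fingerprint is used as the object name,
    so identical chunks always land on the same node and are stored once,
    with no dedicated fingerprint-lookup table."""
    oid = fingerprint(chunk)
    node = place(oid, NODES)
    store.setdefault(node, {})
    if oid not in store[node]:  # a duplicate chunk adds no new copy
        store[node][oid] = chunk
    return oid

store = {}
a = write_chunk(b"hello world", store)
b = write_chunk(b"hello world", store)  # duplicate write
assert a == b                                            # same object id
assert sum(len(v) for v in store.values()) == 1          # stored only once
```

Because placement is derived deterministically from the content fingerprint, any node can locate a chunk's home without consulting a central index, which is what makes the lookup as scalable as the underlying storage.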

Original language: English
Title of host publication: Proceedings - 2018 IEEE 38th International Conference on Distributed Computing Systems, ICDCS 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1063-1073
Number of pages: 11
ISBN (Electronic): 9781538668719
DOIs: https://doi.org/10.1109/ICDCS.2018.00106
Publication status: Published - 2018 Jul 19
Event: 38th IEEE International Conference on Distributed Computing Systems, ICDCS 2018 - Vienna, Austria
Duration: 2018 Jul 2 - 2018 Jul 5

Publication series

Name: Proceedings - International Conference on Distributed Computing Systems
Volume: 2018-July

Other

Other: 38th IEEE International Conference on Distributed Computing Systems, ICDCS 2018
Country: Austria
City: Vienna
Period: 18/7/2 - 18/7/5


All Science Journal Classification (ASJC) codes

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

Oh, M., Park, S., Yoon, J., Kim, S., Lee, K. W., Weil, S., ... Jung, M. (2018). Design of global data deduplication for a scale-out distributed storage system. In Proceedings - 2018 IEEE 38th International Conference on Distributed Computing Systems, ICDCS 2018 (pp. 1063-1073). (Proceedings - International Conference on Distributed Computing Systems; Vol. 2018-July). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICDCS.2018.00106