RDFChain: Chain centric storage for scalable join processing of RDF graphs using mapreduce and HBase

Pilsik Choi, Jooik Jung, Kyong Ho Lee

Research output: Contribution to journal › Conference article

9 Citations (Scopus)

Abstract

As massive amounts of linked open data become available in RDF, scalable storage and efficient retrieval using MapReduce have been actively studied. Most previous research focuses on reducing the number of MapReduce jobs needed to process join operations in SPARQL queries. However, these approaches still incur the cost of the shuffle phase because they rely on reduce-side joins. In this paper, we propose RDFChain, which supports scalable storage and efficient retrieval of large volumes of RDF data using a combination of MapReduce and HBase, a NoSQL storage system. Because the proposed storage schema of RDFChain reflects all possible join patterns of queries, it reduces the number of storage accesses according to the join pattern of a query. In addition, the proposed cost-based map-side join of RDFChain reduces the number of map jobs, since it uses statistics to process as many joins as possible in a single map job.
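The contrast between reduce-side and map-side joins mentioned in the abstract can be illustrated with a minimal sketch. It is not the RDFChain schema or its cost-based algorithm, which this record does not describe. It assumes a toy in-memory list of triples, with a plain dictionary standing in for a keyed store such as an HBase table. The names `TRIPLES`, `build_index`, and `map_side_join` are hypothetical.

```python
from collections import defaultdict

# Toy RDF triples (subject, predicate, object); all data here is hypothetical.
TRIPLES = [
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "globex"),
    ("acme", "locatedIn", "seoul"),
    ("globex", "locatedIn", "sydney"),
]


def build_index(triples, predicate):
    """Index triples with the given predicate by subject.

    The dictionary is an assumed stand-in for a keyed lookup in a store
    such as HBase; it is not the storage schema proposed in the paper.
    """
    index = defaultdict(list)
    for s, p, o in triples:
        if p == predicate:
            index[s].append(o)
    return index


def map_side_join(triples, first_pred, second_pred):
    """Join (?x first_pred ?y) with (?y second_pred ?z) inside the map step."""
    index = build_index(triples, second_pred)
    results = []
    for s, p, o in triples:  # each iteration plays the role of one map() call
        if p != first_pred:
            continue
        # A keyed lookup replaces the shuffle and grouping that a
        # reduce-side join would need to bring matching ?y values together.
        for z in index.get(o, []):
            results.append((s, o, z))
    return results


joined = map_side_join(TRIPLES, "worksFor", "locatedIn")
```

Here `joined` holds one tuple per matching chain, such as `("alice", "acme", "seoul")`. Because the second pattern is resolved by lookup inside the map step, no intermediate records need to be shuffled to reducers for this join.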

Original language: English
Pages (from-to): 249-252
Number of pages: 4
Journal: CEUR Workshop Proceedings
Volume: 1035
Publication status: Published - 2013 Jan 1
Event: 12th International Semantic Web Conference, ISWC 2013 - Sydney, Australia
Duration: 2013 Oct 23 → …

Fingerprint

Processing
Costs
Statistics

All Science Journal Classification (ASJC) codes

  • Computer Science(all)

Cite this

@article{5a75d829e71c4da292c400d51d92cf34,
title = "RDFChain: Chain centric storage for scalable join processing of RDF graphs using mapreduce and HBase",
author = "Pilsik Choi and Jooik Jung and Lee, {Kyong Ho}",
year = "2013",
month = "1",
day = "1",
language = "English",
volume = "1035",
pages = "249--252",
journal = "CEUR Workshop Proceedings",
issn = "1613-0073",
publisher = "CEUR-WS",

}

RDFChain: Chain centric storage for scalable join processing of RDF graphs using mapreduce and HBase. / Choi, Pilsik; Jung, Jooik; Lee, Kyong Ho.

In: CEUR Workshop Proceedings, Vol. 1035, 01.01.2013, p. 249-252.

TY - JOUR

T1 - RDFChain

T2 - Chain centric storage for scalable join processing of RDF graphs using mapreduce and HBase

AU - Choi, Pilsik

AU - Jung, Jooik

AU - Lee, Kyong Ho

PY - 2013/1/1

Y1 - 2013/1/1

UR - http://www.scopus.com/inward/record.url?scp=84908669508&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84908669508&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:84908669508

VL - 1035

SP - 249

EP - 252

JO - CEUR Workshop Proceedings

JF - CEUR Workshop Proceedings

SN - 1613-0073

ER -