An identification framework for print-scan books in a large database

Sanghoon Lee, Jongyoo Kim, Sanghoon Lee

Research output: Contribution to journal (Article)

4 Citations (Scopus)

Abstract

In this paper, we propose an identification framework for detecting copyright infringement in the form of illegally distributed print-scan books in a large database. The framework comprises the following main stages: image pre-processing, feature vector extraction, clustering and indexing, and hierarchical search. The image pre-processing stage provides methods for alleviating the distortions induced by a scanner or digital camera. From the pre-processed image, we generate feature vectors that are robust against distortion. To improve clustering performance in a large database, we use a clustering method based on parallel distributed computing with Hadoop MapReduce. In addition, to store the clustered feature vectors efficiently and minimize search time, we investigate an inverted index for feature vectors. Finally, we implement a two-step hierarchical search to achieve fast and accurate online identification. In a simulation, the proposed identification framework is shown to be accurate and robust in the presence of print-scan distortions. The processing time analysis in a parallel computing environment demonstrates that the proposed framework extends to massive data. In the matching performance analysis, we find empirically and theoretically that, in terms of query time, the optimal number of clusters scales with O(N) for N print-scan books.
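The indexing and search stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the feature vectors, the distance measure, or the index layout, so the sketch assumes small numeric tuples, Euclidean distance, and fixed centroids standing in for the output of the MapReduce clustering stage. The names `nearest`, `build_inverted_index`, and `hierarchical_search`, and all sample data, are hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def nearest(point, candidates):
    """Return the position of the candidate closest to `point`."""
    return min(range(len(candidates)), key=lambda i: dist(point, candidates[i]))


def build_inverted_index(vectors, centroids):
    """Map each cluster id to the (book_id, vector) pairs assigned to it."""
    index = defaultdict(list)
    for book_id, vec in vectors.items():
        index[nearest(vec, centroids)].append((book_id, vec))
    return index


def hierarchical_search(query, centroids, index):
    """Two-step search: pick the nearest cluster, then the nearest member."""
    # Step 1 (coarse): compare the query against the centroids only.
    cluster_id = nearest(query, centroids)
    # Step 2 (fine): compare the query against that cluster's members only.
    members = index.get(cluster_id, [])
    if not members:
        return None
    book_id, _ = min(members, key=lambda m: dist(query, m[1]))
    return book_id


# Assumed toy data: two centroids and three indexed books.
centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = {
    "book_a": (0.5, 0.2),
    "book_b": (9.8, 10.1),
    "book_c": (1.0, 1.5),
}
index = build_inverted_index(vectors, centroids)

print(hierarchical_search((9.5, 9.9), centroids, index))  # book_b
print(hierarchical_search((0.9, 1.4), centroids, index))  # book_c
```

In this sketch the second step touches only the members of one cluster, which is where an index keyed by cluster id can reduce query time compared with scanning every stored vector.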

Original language: English
Pages (from-to): 33-54
Number of pages: 22
Journal: Information Sciences
Volume: 396
DOI: 10.1016/j.ins.2017.02.001
Publication status: Published - 2017 Aug 1

Fingerprint

Feature Vector
Parallel Computing
Preprocessing
Processing
Clustering
MapReduce
Digital Camera
Digital cameras
Distributed computer systems
Number of Clusters
Parallel processing systems
Distributed Computing
Scanner
Clustering Methods
Indexing
Performance Analysis
Query
Minimise
Framework
Data base

All Science Journal Classification (ASJC) codes

  • Software
  • Control and Systems Engineering
  • Theoretical Computer Science
  • Computer Science Applications
  • Information Systems and Management
  • Artificial Intelligence

Cite this

Lee, Sanghoon; Kim, Jongyoo; Lee, Sanghoon. An identification framework for print-scan books in a large database. In: Information Sciences. 2017; Vol. 396. pp. 33-54.
@article{8a0a6bb95b2c4f9980a2376cda2d589c,
title = "An identification framework for print-scan books in a large database",
author = "Sanghoon Lee and Jongyoo Kim and Sanghoon Lee",
year = "2017",
month = "8",
day = "1",
doi = "10.1016/j.ins.2017.02.001",
language = "English",
volume = "396",
pages = "33--54",
journal = "Information Sciences",
issn = "0020-0255",
publisher = "Elsevier Inc.",

}

TY - JOUR

T1 - An identification framework for print-scan books in a large database

AU - Lee, Sanghoon

AU - Kim, Jongyoo

AU - Lee, Sanghoon

PY - 2017/8/1

Y1 - 2017/8/1

UR - http://www.scopus.com/inward/record.url?scp=85013677846&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85013677846&partnerID=8YFLogxK

U2 - 10.1016/j.ins.2017.02.001

DO - 10.1016/j.ins.2017.02.001

M3 - Article

VL - 396

SP - 33

EP - 54

JO - Information Sciences

JF - Information Sciences

SN - 0020-0255

ER -