Shape-based retrieval of CNV regions in read coverage data

Sangkyun Hong, Jeehee Yoon, Dongwan Hong, Unjoo Lee, Baeksop Kim, Sanghyun Park

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)


This study proposes a novel copy number variation (CNV) detection method, CNV-shape, based on variations in the shape of read coverage data obtained from millions of short reads aligned to a reference sequence. The method applies two transforms, the mean shift transform and the mean slope transform, to extract the shape of a CNV more precisely from real human data, which are vulnerable to experimental and biological noise. The mean shift transform produces a preliminary estimate of the CNVs by statistically evaluating moving averages of the read coverage data. The mean slope transform extracts candidate CNVs by filtering out non-stationary sub-regions from each of the primary CNVs pre-estimated by the mean shift transform. Each candidate CNV is then merged with its neighbours, depending on a merging score, to be finally identified as a putative CNV. The merging score is the ratio of the number of positions with non-zero mean shift transform values to the total length of the region spanning two neighbouring candidate CNVs and the interval between them. The method was validated experimentally with simulated data and real human data. The simulated data, with coverage ranging from 1X to 10X, were generated for various sampling sizes and p-values. Five individual human genomes were used as real human data. The results show that relatively small CNVs (>1 kbp) can be detected from low-coverage (>1.7X) data. The results also show that, in contrast to conventional methods, CNV-shape achieved a performance improvement from 8.18 to 87.90%. The outcomes suggest that the proposed method is very effective in reducing the noise inherent in real data as well as in detecting CNVs of various sizes and types.
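The merging step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the half-open `(start, end)` region representation, the function names `merging_score` and `merge_candidates`, the greedy left-to-right merge order, and the `threshold` value of 0.8 are all assumptions made for illustration. Only the score definition (non-zero mean shift positions divided by the length of the span covering both candidates and the interval between them) comes from the text.

```python
from typing import List, Sequence, Tuple

# Assumption: a region is a half-open interval [start, end) of coverage positions.
Region = Tuple[int, int]


def merging_score(mean_shift: Sequence[float], left: Region, right: Region) -> float:
    """Ratio of non-zero mean shift positions to the total length of the span
    covering both candidate regions and the interval between them."""
    start = left[0]
    end = max(left[1], right[1])
    span = end - start
    if span <= 0:
        raise ValueError("regions must be ordered and non-empty")
    nonzero = sum(1 for value in mean_shift[start:end] if value != 0)
    return nonzero / span


def merge_candidates(
    mean_shift: Sequence[float],
    candidates: List[Region],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> List[Region]:
    """Greedily merge neighbouring candidates whose merging score reaches the threshold."""
    merged: List[Region] = []
    for region in sorted(candidates):
        if merged and merging_score(mean_shift, merged[-1], region) >= threshold:
            # Absorb the neighbour and the interval between the two regions.
            merged[-1] = (merged[-1][0], max(merged[-1][1], region[1]))
        else:
            merged.append(region)
    return merged


# Toy mean shift output: zero means "no significant shift" at that position.
toy_mean_shift = [0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0]
toy_candidates = [(1, 4), (5, 7), (12, 14)]
putative_cnvs = merge_candidates(toy_mean_shift, toy_candidates)
```

In this toy input the first two candidates are separated by a single zero position, so their score is 5/6 and they merge under the assumed threshold. The third candidate lies behind a long run of zeros, scores 7/13 against the merged region, and stays separate.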

Original language: English
Pages (from-to): 254-276
Number of pages: 23
Journal: International Journal of Data Mining and Bioinformatics
Issue number: 3
Publication status: Published - 2014

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Biochemistry, Genetics and Molecular Biology(all)
  • Library and Information Sciences

