Extracting logical structures from HTML tables

Yeon Seok Kim, Kyong Ho Lee

Research output: Contribution to journal › Article

17 Citations (Scopus)

Abstract

While HTML is mainly designed for the visual rendering of Web documents, XML is widely accepted as a standard format to process and manage information. In particular, it can embed the information of logical structures. However, in order to utilize XML, the logical structures of HTML tables should first be extracted and transformed into XML representations. This paper presents an efficient method for the process, which consists of two phases: area segmentation and structure analysis. The area segmentation cleans up tables and segments them into attribute and value areas by checking visual and semantic coherency. The hierarchical structure between attribute and value areas is then analyzed and transformed into an XML representation using a proposed table model. Experimental results with 1180 HTML tables show that the proposed method performs better than conventional methods, resulting in an average accuracy of 86.7%.
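The two phases named in the abstract can be illustrated with a toy pipeline. This is only a minimal sketch under strong assumptions, not the authors' method: it treats a first row made entirely of `<th>` cells as the attribute area and every later non-empty row as the value area, and it skips the visual and semantic coherency checks and the hierarchical table model that the paper proposes. The names `TableGrid`, `segment`, `to_xml`, `html_table_to_xml`, and the output elements `record` and `field` are all hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableGrid(HTMLParser):
    """Collect table cells as rows of (tag, text) pairs."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # [tag, text parts] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self._cell = [tag, []]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            text = "".join(self._cell[1]).strip()
            self.rows[-1].append((self._cell[0], text))
            self._cell = None


def segment(rows):
    """Toy stand-in for area segmentation (assumption, not the paper's rule).

    Drops rows whose cells are all empty, then treats the first row as the
    attribute area if it consists only of <th> cells.
    """
    rows = [r for r in rows if any(text for _, text in r)]
    if not rows or not all(tag == "th" for tag, _ in rows[0]):
        raise ValueError("this sketch expects a first row made of <th> cells")
    attributes = [text for _, text in rows[0]]
    values = [[text for _, text in r] for r in rows[1:]]
    return attributes, values


def to_xml(attributes, values):
    """Toy stand-in for structure analysis: pair each value with its attribute."""
    table = ET.Element("table")
    for row in values:
        record = ET.SubElement(table, "record")
        for name, value in zip(attributes, row):
            field = ET.SubElement(record, "field", name=name)
            field.text = value
    return ET.tostring(table, encoding="unicode")


def html_table_to_xml(html):
    parser = TableGrid()
    parser.feed(html)
    return to_xml(*segment(parser.rows))
```

Calling `html_table_to_xml` on a table whose header row uses `<th>` yields one `record` element per data row. Tables with row headers, spanning cells, or nested attributes would need the kind of analysis the paper describes and are out of scope for this sketch.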

Original language: English
Pages (from-to): 296-308
Number of pages: 13
Journal: Computer Standards and Interfaces
Volume: 30
Issue number: 5
DOI: 10.1016/j.csi.2007.08.006
Publication status: Published - 2008 Jul 1

Fingerprint

HTML
XML
World Wide Web
Values
Semantics
segmentation

All Science Journal Classification (ASJC) codes

  • Software
  • Hardware and Architecture
  • Law

Cite this

@article{a279617388694df989b59bdf210e467f,
title = "Extracting logical structures from HTML tables",
abstract = "While HTML is mainly designed for the visual rendering of Web documents, XML is widely accepted as a standard format to process and manage information. In particular, it can embed the information of logical structures. However, in order to utilize XML, the logical structures of HTML tables should first be extracted and transformed into XML representations. This paper presents an efficient method for the process, which consists of two phases: area segmentation and structure analysis. The area segmentation cleans up tables and segments them into attribute and value areas by checking visual and semantic coherency. The hierarchical structure between attribute and value areas is then analyzed and transformed into an XML representation using a proposed table model. Experimental results with 1180 HTML tables show that the proposed method performs better than conventional methods, resulting in an average accuracy of 86.7{\%}.",
author = "Kim, {Yeon Seok} and Lee, {Kyong Ho}",
year = "2008",
month = "7",
day = "1",
doi = "10.1016/j.csi.2007.08.006",
language = "English",
volume = "30",
pages = "296--308",
journal = "Computer Standards and Interfaces",
issn = "0920-5489",
publisher = "Elsevier",
number = "5",

}

Extracting logical structures from HTML tables. / Kim, Yeon Seok; Lee, Kyong Ho.

In: Computer Standards and Interfaces, Vol. 30, No. 5, 01.07.2008, p. 296-308.

TY - JOUR
T1 - Extracting logical structures from HTML tables
AU - Kim, Yeon Seok
AU - Lee, Kyong Ho
PY - 2008/7/1
Y1 - 2008/7/1
AB - While HTML is mainly designed for the visual rendering of Web documents, XML is widely accepted as a standard format to process and manage information. In particular, it can embed the information of logical structures. However, in order to utilize XML, the logical structures of HTML tables should first be extracted and transformed into XML representations. This paper presents an efficient method for the process, which consists of two phases: area segmentation and structure analysis. The area segmentation cleans up tables and segments them into attribute and value areas by checking visual and semantic coherency. The hierarchical structure between attribute and value areas is then analyzed and transformed into an XML representation using a proposed table model. Experimental results with 1180 HTML tables show that the proposed method performs better than conventional methods, resulting in an average accuracy of 86.7%.
UR - http://www.scopus.com/inward/record.url?scp=42949133788&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=42949133788&partnerID=8YFLogxK
U2 - 10.1016/j.csi.2007.08.006
DO - 10.1016/j.csi.2007.08.006
M3 - Article
AN - SCOPUS:42949133788
VL - 30
SP - 296
EP - 308
JO - Computer Standards and Interfaces
JF - Computer Standards and Interfaces
SN - 0920-5489
IS - 5
ER -