Detecting tables in Web documents

Yeon Seok Kim, Kyong Ho Lee

Research output: Contribution to journal, Article

5 Citations (Scopus)

Abstract

TABLE tags in HTML (Hypertext Markup Language) documents are widely used both for laying out Web documents and for describing genuine tables that carry relational information. As a prerequisite for information extraction from the Web, this paper presents an efficient method for sophisticated table detection. The proposed method consists of two phases: preprocessing and attribute-value relation extraction. During preprocessing, some genuine and non-genuine tables are filtered out by a set of rules derived from a careful examination of the general characteristics of various HTML tables. The remaining tables are classified in the attribute-value relation extraction phase. Specifically, a value area is extracted and checked for syntactic coherency. Furthermore, the method looks for semantic coherency between the attribute area and the value area of a table. Experimental results on 11,477 TABLE tags from 1,393 HTML documents show that the method outperforms previous work, achieving a precision of 97.54% and a recall of 99.22%.
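The two-phase idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the actual rules, so the size-based filter, the cell typing, the 0.8 threshold, and every function name below are hypothetical choices made for illustration. The sketch also assumes well-formed markup with explicit closing tags, treats the first row as the attribute area, and leaves out the semantic coherency check entirely.

```python
import re
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables
        self._stack = []   # tables currently open (handles nesting)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append({"rows": [], "cell": None})
        elif self._stack:
            table = self._stack[-1]
            if tag == "tr":
                table["rows"].append([])
            elif tag in ("td", "th"):
                if not table["rows"]:      # cell without an enclosing <tr>
                    table["rows"].append([])
                table["cell"] = []

    def handle_endtag(self, tag):
        if not self._stack:
            return
        table = self._stack[-1]
        if tag in ("td", "th") and table["cell"] is not None:
            table["rows"][-1].append(" ".join(table["cell"]))
            table["cell"] = None
        elif tag == "table":
            self._stack.pop()
            self.tables.append(table["rows"])

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack and self._stack[-1]["cell"] is not None:
            self._stack[-1]["cell"].append(text)


def passes_preprocessing(rows):
    """Phase 1 (hypothetical rule): reject tables too small to relate anything."""
    return len(rows) >= 2 and len(rows[0]) >= 2


def cell_type(text):
    """Coarse syntactic type of a cell; a stand-in for a richer type system."""
    if not text:
        return "empty"
    if re.fullmatch(r"[-+]?[\d,]*\.?\d+%?", text):
        return "number"
    return "text"


def is_syntactically_coherent(rows, threshold=0.8):
    """Phase 2 (hypothetical check): each value column should hold one type."""
    header, values = rows[0], rows[1:]
    width = len(header)
    if width == 0 or any(len(row) != width for row in values):
        return False
    coherent = sum(
        1 for col in range(width)
        if len({cell_type(row[col]) for row in values}) == 1
    )
    return coherent / width >= threshold


def detect_genuine_tables(html):
    """Return the tables that survive both illustrative phases."""
    collector = TableCollector()
    collector.feed(html)
    collector.close()
    return [
        rows for rows in collector.tables
        if passes_preprocessing(rows) and is_syntactically_coherent(rows)
    ]
```

Calling `detect_genuine_tables` on a page that contains a single-cell layout table and a small header-plus-values table would keep only the latter. A real detector would need many more rules, plus the semantic comparison between attribute and value areas that the abstract describes.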

Original language: English
Pages (from-to): 745-757
Number of pages: 13
Journal: Engineering Applications of Artificial Intelligence
Volume: 18
Issue number: 6
DOIs: 10.1016/j.engappai.2005.01.009
Publication status: Published - 2005 Sep 1

Fingerprint

HTML
World Wide Web
Syntactics
Semantics

All Science Journal Classification (ASJC) codes

  • Control and Systems Engineering
  • Artificial Intelligence
  • Electrical and Electronic Engineering

Cite this

@article{c60e514a8f914d4f9a67c8d619419cdb,
title = "Detecting tables in Web documents",
author = "Kim, {Yeon Seok} and Lee, {Kyong Ho}",
year = "2005",
month = "9",
day = "1",
doi = "10.1016/j.engappai.2005.01.009",
language = "English",
volume = "18",
pages = "745--757",
journal = "Engineering Applications of Artificial Intelligence",
issn = "0952-1976",
publisher = "Elsevier Limited",
number = "6",

}

Detecting tables in Web documents. / Kim, Yeon Seok; Lee, Kyong Ho.

In: Engineering Applications of Artificial Intelligence, Vol. 18, No. 6, 01.09.2005, p. 745-757.

TY - JOUR

T1 - Detecting tables in Web documents

AU - Kim, Yeon Seok

AU - Lee, Kyong Ho

PY - 2005/9/1

Y1 - 2005/9/1

UR - http://www.scopus.com/inward/record.url?scp=20144366320&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=20144366320&partnerID=8YFLogxK

U2 - 10.1016/j.engappai.2005.01.009

DO - 10.1016/j.engappai.2005.01.009

M3 - Article

VL - 18

SP - 745

EP - 757

JO - Engineering Applications of Artificial Intelligence

JF - Engineering Applications of Artificial Intelligence

SN - 0952-1976

IS - 6

ER -