Abstract
TABLE tags in HTML (Hypertext Markup Language) documents are widely used both for laying out Web documents and for describing genuine tables that carry relational information. As a prerequisite for information extraction from the Web, this paper presents an efficient method for sophisticated table detection. The proposed method consists of two phases: preprocessing and attribute-value relation extraction. During preprocessing, some genuine and non-genuine tables are filtered out using a set of rules devised from a careful examination of the general characteristics of various HTML tables. The remaining tables are classified in the attribute-value relation extraction phase. Specifically, a value area is extracted and checked for syntactic coherency. The method then looks for semantic coherency between the attribute area and the value area of the table. Experimental results on 11,477 TABLE tags from 1,393 HTML documents show that the method outperforms previous work, achieving a precision of 97.54% and a recall of 99.22%.
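
The two-phase flow described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the rules in `prefilter`, the coarse cell types in `cell_type`, the 0.8 threshold, and the choice to treat the first row as the attribute area are all hypothetical. The semantic coherency check between the attribute area and the value area is not sketched.

```python
import re
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect each TABLE as rows of cell text and flag tables that nest others."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, inner tables before their parents
        self._stack = []   # tables currently open, innermost last
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._stack:
                self._stack[-1]["nested"] = True
            self._stack.append({"rows": [], "nested": False})
            self._in_cell = False
        elif tag == "tr" and self._stack:
            self._stack[-1]["rows"].append([])
            self._in_cell = False
        elif tag in ("td", "th") and self._stack and self._stack[-1]["rows"]:
            self._stack[-1]["rows"][-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            self.tables.append(self._stack.pop())
            self._in_cell = False
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._stack:
            self._stack[-1]["rows"][-1][-1] += data


def prefilter(table):
    """Phase 1 (hypothetical rules): return False for obvious layout tables,
    or None when the table must go on to phase 2."""
    rows = [r for r in table["rows"] if r]
    if table["nested"]:
        return False  # assumed rule: a table wrapping another table is layout
    if len(rows) < 2 or max(len(r) for r in rows) < 2:
        return False  # assumed rule: need at least 2 rows and 2 columns
    return None


def cell_type(text):
    """Map a cell to a coarse syntactic type (illustrative categories only)."""
    text = text.strip()
    if not text:
        return "empty"
    if re.fullmatch(r"[-+]?[\d,]*\.?\d+%?", text):
        return "number"
    return "text"


def value_area_is_coherent(rows, threshold=0.8):
    """Phase 2 (syntactic part only): every column of the assumed value area
    (all rows after the first) must be dominated by one cell type."""
    values = rows[1:]
    width = min(len(r) for r in values)
    for col in range(width):
        types = [cell_type(r[col]) for r in values]
        dominant = max(set(types), key=types.count)
        if types.count(dominant) / len(types) < threshold:
            return False
    return True


def detect_genuine_tables(html):
    """Return one boolean per TABLE tag, in the order the tables are closed."""
    collector = TableCollector()
    collector.feed(html)
    results = []
    for table in collector.tables:
        decision = prefilter(table)
        if decision is None:
            rows = [r for r in table["rows"] if r]
            decision = value_area_is_coherent(rows)
        results.append(decision)
    return results
```

With such coarse types, a layout table whose cells are all free text would still pass the phase 2 check, so a realistic detector would need finer-grained types and the semantic check that the abstract mentions.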
Original language | English |
---|---|
Pages (from-to) | 745-757 |
Number of pages | 13 |
Journal | Engineering Applications of Artificial Intelligence |
Volume | 18 |
Issue number | 6 |
DOIs | |
Publication status | Published - Sep 2005 |
All Science Journal Classification (ASJC) codes
- Control and Systems Engineering
- Artificial Intelligence
- Electrical and Electronic Engineering