TEXT: Automatic Template Extraction from Heterogeneous Web Pages

Authors
Kim, Chul Yun; Shim, Kyuseok
Issue Date
Apr-2011
Publisher
IEEE COMPUTER SOC
Citation
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, v.23, no.4, pp. 612-626
Pages
15
Journal Title
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Volume
23
Number
4
Start Page
612
End Page
626
URI
https://scholarworks.sookmyung.ac.kr/handle/2020.sw.sookmyung/147782
DOI
10.1109/TKDE.2010.140
ISSN
1041-4347
1558-2191
Abstract
The World Wide Web is the most useful source of information. To achieve high publishing productivity, the webpages of many websites are automatically populated by filling common templates with contents. The templates give readers easy access to the contents through consistent structures. However, for machines, the templates are considered harmful because the irrelevant terms they contain degrade the accuracy and performance of web applications. Thus, template detection techniques have recently received much attention as a way to improve the performance of search engines and of the clustering and classification of web documents. In this paper, we present novel algorithms for extracting templates from a large number of web documents that are generated from heterogeneous templates. We cluster the web documents based on the similarity of the underlying template structures in the documents, so that the template for each cluster is extracted simultaneously. We develop a novel goodness measure, together with a fast approximation of it, for clustering and provide a comprehensive analysis of our algorithm. Our experimental results with real-life data sets confirm the effectiveness and robustness of our algorithm compared with state-of-the-art template detection algorithms.
Appears in Collections
Division of ICT Convergence Engineering > IT Engineering Major > 1. Journal Articles
