

Using linguistic features to automatically extract web page title
Affiliation:1. Department of Civil, Environmental, Aerospace, and Material Engineering, Polytechnic School, University of Palermo, Italy, Viale delle Scienze, Ed 8, 90128 Palermo, ITALY;2. Department of Energy, Information Engineering and Mathematical Models, Polytechnic School, University of Palermo, Italy, Viale delle Scienze, Ed 8, 90128 Palermo, ITALY;1. Instituto Universitario para el Desarrollo Tecnológico y la Innovación en Comunicaciones, Universidad de Las Palmas de Gran Canaria, Las Palmas 35017, Spain;2. Dipartimento di Informatica, Università degli Studi di Bari, Bari 70126, Italy
Abstract: Existing methods for extracting titles from HTML web pages mostly rely on visual and structural features. However, this approach fails on service-based web pages because advertisements are often given more visual emphasis than the main headlines. To improve on the current state of the art, we propose a novel method that combines statistical features, linguistic knowledge, and text segmentation. Using an annotated English corpus, we learn the morphosyntactic characteristics of known titles and define part-of-speech tag patterns that help to extract candidate phrases from the web page. To evaluate the proposed method, we compared results on two datasets, Titler and Mopsi, and evaluated the extracted features using four classifiers: Naïve Bayes, k-NN, SVM, and clustering. Experimental results show that the proposed method outperforms the solution used by Google, improving results from 0.58 to 0.85 on the Titler corpus and from 0.43 to 0.55 on the Mopsi dataset, and offers a readily available solution for the title extraction problem.
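The idea of using part-of-speech tag patterns to pull candidate phrases from a page can be illustrated with a minimal sketch. This is not the authors' implementation: the pattern (optional adjectives followed by one or more nouns), the function name `candidate_phrases`, and the hand-tagged example sentence are all assumptions made for illustration, and the abstract does not state which patterns were actually learned. The input is assumed to be already tokenized and tagged with Penn Treebank tags.

```python
import re

# Hypothetical tag pattern (an assumption, not taken from the paper):
# zero or more adjectives (JJ) followed by one or more nouns (NN, NNS, NNP, NNPS).
# The lookbehind forces a match to begin at the start of a tag.
TITLE_PATTERN = re.compile(r"(?<![^ ])(?:JJ )*(?:NNP?S? )+")


def candidate_phrases(tagged, pattern=TITLE_PATTERN):
    """Return phrases whose tag sequence matches the pattern.

    `tagged` is a list of (token, tag) pairs, e.g. from any POS tagger.
    """
    # Encode the tag sequence as one string, each tag followed by a space.
    tag_string = "".join(tag + " " for _, tag in tagged)

    # Map the character offset where each tag starts to its token index.
    offset_to_index = {}
    offset = 0
    for index, (_, tag) in enumerate(tagged):
        offset_to_index[offset] = index
        offset += len(tag) + 1
    offset_to_index[offset] = len(tagged)

    phrases = []
    for match in pattern.finditer(tag_string):
        first = offset_to_index[match.start()]
        last = offset_to_index[match.end()]
        phrases.append(" ".join(token for token, _ in tagged[first:last]))
    return phrases


# Hand-tagged example text (illustrative only).
example = [
    ("Fresh", "JJ"),
    ("pizza", "NN"),
    ("delivery", "NN"),
    ("in", "IN"),
    ("Helsinki", "NNP"),
]

print(candidate_phrases(example))
```

In a fuller pipeline along the lines the abstract describes, candidates like these would then be scored with statistical features and passed to a classifier; that step is omitted here.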
Keywords:
This article is indexed in ScienceDirect and other databases.