An optimized approach for massive web page classification using entity similarity based on semantic network
Affiliation: 1. Key Lab of Big Data Security and Intelligent Processing, Institute of Computer Technology, School of Computer Science & Technology, School of Software, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China; 2. The Third Research Institute of the Ministry of Public Security, Shanghai, 201204, China; 3. Department of Information Systems and Cyber Security, The University of Texas at San Antonio, San Antonio, TX 78249-0631, USA; 1. Center for Sustainable Development of the Semi-Arid, Federal University of Campina Grande, Sumé/PB CEP 58540-000, Brazil; 2. Management Engineering Department, Universidade Federal de Pernambuco, Cx. Postal 7462, Recife/PE CEP 50630-970, Brazil; 1. ENEA: Italian National Agency for New Technologies, Energy and Sustainable Economic Development, Rome, Italy; 2. Università Politecnica delle Marche, Italy; 3. Center for Polymer Studies, Boston University, Boston, Massachusetts, United States; 4. University of Tor Vergata, Rome, Italy
Abstract: With the development of mobile technology, users' browsing habits are gradually shifting from pure information retrieval to active recommendation. Mapping users' interests to web content through classification has become increasingly difficult as the volume and variety of web pages grow. Some large news portals and social media companies hire more editors to label new concepts and words, and use computing servers with larger memory to handle massive document classification based on traditional supervised or semi-supervised machine learning methods. This paper presents an optimized algorithm for massive web page classification using semantic networks such as Wikipedia and WordNet. We use the Wikipedia data set and initialize a few category entity words as class words. A weight estimation algorithm based on the depth and breadth of the Wikipedia network is used to calculate the class weight of all Wikipedia Entity Words. A kinship-relation association based on the content similarity of entities is proposed to mitigate the imbalance problem that arises when a category node inherits probability from multiple parent nodes. Keywords are extracted from the title and main text of a web page using N-grams matched against Wikipedia Entity Words, and a Bayesian classifier is used to estimate the page class probability. Experimental results show that the proposed method achieves good scalability, robustness, and reliability for massive web pages.
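The last two steps of the abstract (N-gram keyword matching against entity words, then a Bayesian estimate of the page class) can be illustrated with a minimal sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the function names `extract_entities` and `classify`, the toy `entity_class_weights` table (standing in for the class weights that the paper derives from the Wikipedia network), the uniform class prior, and the additive smoothing constant. It is not the authors' implementation.

```python
import math
import re


def extract_entities(text, vocab, max_n=3):
    """Greedy longest-match of word N-grams (up to max_n words) against an entity vocabulary."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest N-gram first so "machine learning" wins over "machine".
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in vocab:
                found.append(candidate)
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return found


def classify(entities, weights, classes, smoothing=1e-3):
    """Naive-Bayes-style scoring: uniform prior times per-entity class weights (assumed form)."""
    log_scores = {}
    for c in classes:
        score = math.log(1.0 / len(classes))  # uniform prior, an assumption
        for e in entities:
            # Smoothing keeps a zero weight from producing log(0).
            score += math.log(weights[e].get(c, 0.0) + smoothing)
        log_scores[c] = score
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


# Hypothetical toy weights; the paper estimates these from the Wikipedia network.
entity_class_weights = {
    "machine learning": {"technology": 0.90, "sports": 0.00},
    "neural network": {"technology": 0.85, "sports": 0.05},
    "football": {"technology": 0.05, "sports": 0.95},
}
classes = ["technology", "sports"]

page_text = "Neural network models advance machine learning research"
page_entities = extract_entities(page_text, entity_class_weights.keys())
probs = classify(page_entities, entity_class_weights, classes)
print(page_entities, probs)
```

In this sketch the vocabulary lookup is a plain dictionary membership test, so extraction cost grows linearly with page length; how the paper itself organizes the entity vocabulary is not stated in the abstract.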
Keywords: Web page classification; Semantic network; Kinship-relation association; Entity class probability; Hereditary weight
This article is indexed in ScienceDirect and other databases.