
Illegitimate Website Detection Based on Multi-Dimensional Features (基于多维度特征的不良网站检测)
Cite this article: TIAN Shuang-Zhu, CHEN Yong, YAN Zhi-Wei, LI Xiao-Dong. Illegitimate Website Detection Based on Multi-Dimensional Features[J]. Computer Systems & Applications (计算机系统应用), 2017, 26(2): 207-211.
Authors: TIAN Shuang-Zhu (田双柱), CHEN Yong (陈勇), YAN Zhi-Wei (延志伟), LI Xiao-Dong (李晓东)
Affiliations: University of Chinese Academy of Sciences, Beijing 100049, China; Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; National Engineering Laboratory of Internet Domain Name Management Technology, China Internet Network Information Center, Beijing 100190, China
Abstract: At present, illegitimate websites are detected mainly with machine learning methods that use web page content, such as the URL (Uniform Resource Locator), keywords, and images, as features. However, the creators of such websites evade detection by changing URLs, avoiding common illegitimate keywords, and hiding images from search crawlers, so content-based methods miss some sites. To detect these websites more accurately, this paper proposes features related to registration and resolution and builds a detection model with the most mainstream machine learning methods. When the model is used to predict on a new data set, the results show that detection based on resolution and registration features can effectively identify the evasive illegitimate websites described above within a collection of websites, and can still accurately identify ordinary illegitimate websites. This study offers another approach to research on illegitimate website detection.

Keywords: resolution; registration; illegitimate website; detection
Received: 2016/5/17
Revised: 2016/6/27