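Document Sensitive Information Inference Method Based on Semi-supervised Clustering
Cite this article: SU Ying-bin, DU Xue-hui, XIA Chun-tao, CAO Li-feng, CHEN Hua-cheng. Document Sensitive Information Inference Method Based on Semi-supervised Clustering[J]. Computer Science (计算机科学), 2015, 42(10): 132-137.
Authors: SU Ying-bin (苏赢彬), DU Xue-hui (杜学绘), XIA Chun-tao (夏春涛), CAO Li-feng (曹利峰), CHEN Hua-cheng (陈华成)
Affiliations (in author order): authors 1 to 4, PLA Information Engineering University, Zhengzhou 450001, and State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450001; author 5, 73503 PLA Troop, Fuzhou 350018
Funding: Supported by the National High Technology Research and Development Program of China (863 Program), project 2012AA012704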
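Abstract: Sensitive information leaked through aggregation and inference across multiple documents is both high-risk and hard to detect. To address this problem, a document sensitive-information inference method based on semi-supervised clustering is proposed. First, to obtain high-quality constraint information at low time cost, a novel second-order constraint active learning algorithm is designed; it selects the sample points with the greatest uncertainty to generate the most informative constraint closures. Then, by introducing the constraint information into DBSCAN, a new semi-supervised clustering algorithm is proposed that effectively resolves the boundary ambiguity problem of DBSCAN and improves document clustering accuracy. Finally, based on the semi-supervised clustering results, a possibility measure of sensitive information is computed over similar documents. Experiments show that the semi-supervised clustering algorithm improves accuracy markedly and that the inference method can effectively infer sensitive information.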

Keywords: Semi-supervised clustering; DBSCAN; Active learning; Sensitive information; Fuzzy mathematics; Inference method
Received: 2014-10-18
Revised: 2015-01-05

Sensitive Information Inference Method Based on Semi-supervised Document Clustering
SU Ying-bin, DU Xue-hui, XIA Chun-tao, CAO Li-feng and CHEN Hua-cheng. Sensitive Information Inference Method Based on Semi-supervised Document Clustering[J]. Computer Science, 2015, 42(10): 132-137.
Authors: SU Ying-bin, DU Xue-hui, XIA Chun-tao, CAO Li-feng and CHEN Hua-cheng
Affiliation (in author order): authors 1 to 4, PLA Information Engineering University, Zhengzhou 450001, China, and State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450001, China; author 5, 73503 PLA Troop, Fuzhou 350018, China
Abstract: Sensitive information leakage caused by multi-document aggregation and inference is high-risk and highly concealed. To address this problem, a sensitive information inference method based on semi-supervised document clustering is proposed. First, a new second-order constraint active learning algorithm is designed, which obtains high-quality constraints in little time by choosing the most uncertain, and therefore most informative, data points. Then, a new semi-supervised clustering algorithm combining the constraints with DBSCAN is proposed, which effectively resolves the fuzzy cluster boundaries of DBSCAN and improves the precision of document clustering. Finally, a possibility measure of sensitive information over similar documents is calculated from the results of the semi-supervised clustering. The experiments show that the precision of the semi-supervised clustering improves significantly and that the inference method can infer sensitive information effectively.
Keywords:Semi-supervised clustering  DBSCAN  Active learning  Sensitive information  Fuzzy math  Inference method
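
The abstract's first step, choosing the most uncertain samples and turning the answers into constraints, can be illustrated with generic entropy-based uncertainty sampling. This is only a minimal sketch under stated assumptions, not the paper's second-order constraint algorithm, whose details the abstract does not give. The soft membership values, the representative document indices, and the function names `entropy`, `most_uncertain`, and `constraints_from_answer` are all hypothetical.

```python
from math import log


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * log(p) for p in probs if p > 0)


def most_uncertain(memberships):
    """Index of the sample whose soft cluster membership has the highest entropy."""
    return max(range(len(memberships)), key=lambda i: entropy(memberships[i]))


def constraints_from_answer(query, representatives, matched):
    """Turn one oracle answer into pairwise constraints (assumed scheme).

    `representatives[k]` is a document already known to belong to cluster k,
    and `matched` is the cluster the oracle assigned to `query`.
    Returns (must_link_pairs, cannot_link_pairs).
    """
    must = [(query, representatives[matched])]
    cannot = [(query, r) for k, r in enumerate(representatives) if k != matched]
    return must, cannot


# Hypothetical soft memberships of four documents over three clusters.
memberships = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],  # closest to uniform, so the most uncertain
    [0.60, 0.30, 0.10],
    [1.00, 0.00, 0.00],
]
query = most_uncertain(memberships)

# Hypothetical representatives (document indices 10, 11, 12) for the three
# clusters; suppose the oracle says the queried document belongs to cluster 2.
must_link, cannot_link = constraints_from_answer(query, [10, 11, 12], matched=2)
```

Must-link and cannot-link pairs of this kind are the usual form in which constraint information is handed to a semi-supervised clustering algorithm; how the paper injects them into DBSCAN is not specified in the abstract.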
This article is indexed in Wanfang Data (万方数据) and other databases.