Mass data cleaning algorithm based on extended tree-like knowledge base
Citation: YAN Cai-rong, SUN Gui-ning, GAO Nian-gao. Mass data cleaning algorithm based on extended tree-like knowledge base[J]. Computer Engineering and Applications (计算机工程与应用), 2010, 46(28): 146-148. DOI: 10.3778/j.issn.1002-8331.2010.28.041
Authors: YAN Cai-rong, SUN Gui-ning, GAO Nian-gao
Affiliation: 1. School of Computer, Donghua University, Shanghai 201620, China; 2. Triman Information & Technology Ltd., Shanghai 200040, China
Abstract: To address the limitations of traditional knowledge base representations, an extended tree-like knowledge base is built by decomposing and recomposing domain knowledge. Each leaf node corresponds to a concrete knowledge instance, called atomic knowledge, while each non-leaf node corresponds only to a knowledge concept. A data cleaning algorithm based on this knowledge base is also proposed: for the nodes the user selects, it extracts the atomic knowledge, analyzes the relations among the extracted items, eliminates duplicates, builds an atomic knowledge sequence ordered by processing weight, and then cleans the data according to that sequence. Experiments show that the algorithm effectively optimizes user requests, reduces the number of passes over the mass data, and markedly improves cleaning efficiency.
Received: 2009-03-02
Revised: 2009-04-22
Keywords: domain knowledge; knowledge base; data cleaning; mass data
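
The steps named in the abstract (extract atomic knowledge from the selected nodes, remove duplicates, order by weight, then clean) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no data structures or weighting scheme, so the `Node` class, the `weight` field (lower runs first), deduplication by leaf name, and the single pass over the records are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, str]


@dataclass
class Node:
    """Hypothetical tree node; only leaves carry a rule (atomic knowledge)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    rule: Optional[Callable[[Record], Record]] = None  # set on leaves only
    weight: int = 0  # assumed processing weight; lower runs first


def atomic_knowledge(node: Node) -> List[Node]:
    """Collect the leaves under a node (the node itself if it is a leaf)."""
    if not node.children:
        return [node]
    leaves: List[Node] = []
    for child in node.children:
        leaves.extend(atomic_knowledge(child))
    return leaves


def build_sequence(selected: Iterable[Node]) -> List[Node]:
    """Extract leaves of all selected nodes, drop duplicates, sort by weight."""
    unique: Dict[str, Node] = {}
    for node in selected:
        for leaf in atomic_knowledge(node):
            unique.setdefault(leaf.name, leaf)  # assumption: same name = same knowledge
    return sorted(unique.values(), key=lambda leaf: leaf.weight)


def clean(records: Iterable[Record], selected: Iterable[Node]) -> List[Record]:
    """Apply the whole sequence to each record in one pass over the data."""
    sequence = build_sequence(selected)
    cleaned = []
    for record in records:
        for leaf in sequence:
            if leaf.rule is not None:
                record = leaf.rule(record)
        cleaned.append(record)
    return cleaned


# Two concepts that share one leaf, so the shared rule should run only once.
trim = Node("trim_name", rule=lambda r: {**r, "name": r["name"].strip()}, weight=1)
upper = Node("upper_city", rule=lambda r: {**r, "city": r["city"].upper()}, weight=2)
text = Node("text", children=[trim, upper])
address = Node("address", children=[upper])

rows = [{"name": " Li ", "city": "shanghai"}]
print([leaf.name for leaf in build_sequence([address, text])])
print(clean(rows, [address, text]))
```

Because the sequence is built once before the data is read, selecting several overlapping concepts still costs one traversal of the records in this sketch, which is one plausible reading of the abstract's claim about reducing the number of passes.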
This article is indexed in VIP (维普), Wanfang Data (万方数据), and other databases.