Mass data cleaning algorithm based on extended tree-like knowledge base

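Cite this article:	YAN Cai-rong, SUN Gui-ning, GAO Nian-gao. Mass data cleaning algorithm based on extended tree-like knowledge base[J]. Computer Engineering and Applications (计算机工程与应用), 2010, 46(28): 146-148. DOI: 10.3778/j.issn.1002-8331.2010.28.041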

Authors:	YAN Cai-rong (燕彩蓉), SUN Gui-ning (孙圭宁), GAO Nian-gao (高念高)

Affiliations:	1. School of Computer, Donghua University, Shanghai 201620; 2. Triman Information & Technology Ltd., Shanghai 200040

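Abstract:	To address the limitations of traditional knowledge base representations, domain knowledge is decomposed and recombined to build a knowledge base with an extended tree structure, in which leaf nodes correspond to concrete knowledge instances, called atomic knowledge, and non-leaf nodes correspond only to knowledge concepts. A corresponding data cleaning algorithm is also proposed: based on the user's selection, it automatically extracts the atomic knowledge, analyzes it, eliminates duplicates, builds an atomic knowledge sequence ordered by processing weight, and then cleans the data with each item of the sequence in turn. Experiments show that the algorithm effectively optimizes user requests, reduces the number of passes over mass data, and markedly improves the efficiency of cleaning mass data.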
Keywords:	domain knowledge; knowledge base; data cleaning; mass data
Received:	2009-03-02
Revised:	2009-04-22
Mass data cleaning algorithm based on extended tree-like knowledge base

YAN Cai-rong, SUN Gui-ning, GAO Nian-gao. Mass data cleaning algorithm based on extended tree-like knowledge base[J]. Computer Engineering and Applications, 2010, 46(28): 146-148. DOI: 10.3778/j.issn.1002-8331.2010.28.041

Authors:	YAN Cai-rong, SUN Gui-ning, GAO Nian-gao

Affiliation:	1. School of Computer, Donghua University, Shanghai 201620, China; 2. Triman Information & Technology Ltd., Shanghai 200040, China

Abstract:	By analyzing the limitations of traditional knowledge base structures, an extended tree-like knowledge base is built by decomposing and recomposing domain knowledge. Each leaf node of the tree is linked to a knowledge instance, called atomic knowledge, and each non-leaf node is linked to a knowledge concept. Based on this knowledge base, a data cleaning algorithm is proposed. It first extracts the atomic knowledge of the selected nodes, then analyzes their relations, deletes duplicate objects, builds an atomic knowledge sequence ordered by weight, and finally cleans the data according to the sequence. Experiments show that using the algorithm to optimize users' requests markedly reduces the number of scans over mass data and clearly improves data cleaning efficiency.

Keywords:	domain knowledge; knowledge base; data cleaning; mass data
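
The abstract gives only the outline of the algorithm, so the following is a minimal Python sketch of that outline, not the authors' implementation. The `Node` class, the `rule_id` field used to detect duplicates, the "higher weight runs first" ordering, and the single-pass `clean` loop are all assumptions made for illustration; the paper itself may define weights, duplicate detection, and the cleaning loop differently.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Record = Dict[str, str]


@dataclass
class Node:
    """Hypothetical knowledge-tree node.

    Concept (non-leaf) nodes carry only a name and children. Atomic (leaf)
    nodes also carry a rule id, a processing weight, and a cleaning function.
    """
    name: str
    children: List["Node"] = field(default_factory=list)
    rule_id: Optional[str] = None
    weight: int = 0
    apply: Optional[Callable[[Record], Record]] = None

    def is_atomic(self) -> bool:
        return self.apply is not None


def collect_atomic(node: Node) -> List[Node]:
    """Return every atomic node at or below `node`, depth-first."""
    if node.is_atomic():
        return [node]
    leaves: List[Node] = []
    for child in node.children:
        leaves.extend(collect_atomic(child))
    return leaves


def build_sequence(selected: List[Node]) -> List[Node]:
    """Extract atomic rules from the user's selection, drop duplicates
    by rule_id, and sort by descending weight (assumed ordering)."""
    unique: Dict[Optional[str], Node] = {}
    for node in selected:
        for leaf in collect_atomic(node):
            unique.setdefault(leaf.rule_id, leaf)
    return sorted(unique.values(), key=lambda leaf: -leaf.weight)


def clean(records: Iterable[Record], sequence: List[Node]) -> Iterator[Record]:
    """Traverse the data once, applying the whole rule sequence per record."""
    for record in records:
        for leaf in sequence:
            record = leaf.apply(record)
        yield record


# Two concepts share the same atomic rule `trim`; it should run only once.
trim = Node("trim", rule_id="trim", weight=2,
            apply=lambda r: {k: v.strip() for k, v in r.items()})
upper_city = Node("upper_city", rule_id="upper_city", weight=1,
                  apply=lambda r: {**r, "city": r["city"].upper()})
name_rules = Node("name rules", children=[trim])
address_rules = Node("address rules", children=[trim, upper_city])

sequence = build_sequence([name_rules, address_rules])
cleaned = list(clean([{"name": " Li ", "city": " shanghai "}], sequence))
```

Merging the rules of all selected concepts before touching the data is what lets the sketch read each record once, rather than once per selected concept, which is the kind of saving in scan count the abstract reports.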
This article is indexed in databases including VIP and Wanfang Data.