Research on Entity Matching Methods for Multi-Source Heterogeneous Data

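Cite this article:	WANG Lingyang, CHEN Qinkuang, SHOU Lidan, CHEN Ke. Research on Entity Matching Methods for Multi-Source Heterogeneous Data[J]. Computer Engineering and Applications, 2019, 55(19): 87-95. DOI: 10.3778/j.issn.1002-8331.1807-0153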

Authors:	WANG Lingyang, CHEN Qinkuang, SHOU Lidan, CHEN Ke

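Affiliations:	College of Computer Science and Technology, Zhejiang University, Hangzhou 310000; College of Computer Science and Technology, Zhejiang University, Hangzhou 310000; Key Laboratory of Big Data Intelligent Computing, Zhejiang University, Hangzhou 310027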

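Funding:	National Key Research and Development Program of China; National Natural Science Foundation of China; National Natural Science Foundation of China; Natural Science Foundation of Zhejiang Province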

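Abstract:	In recent years, many researchers have proposed solutions to the entity matching problem for multi-source heterogeneous data. However, almost all of these methods perform entity matching under semantic frameworks such as RDFS or OWL and therefore lack generality. Moreover, the prevailing approach to entity matching across multiple data sources is to convert the problem into multiple groups of pairwise matching problems between data sources; performing pairwise matching directly has excessively high computational complexity and does not analyze the problem from a global, multi-source perspective. Starting from these problems, this paper proposes an entity matching method that exploits the name, attribute, and context information commonly present in entities to build multiple indexes, shrinking the computation space while generating high-quality candidate sets. It also defines a method for measuring entity similarity that effectively determines whether an entity pair matches. Based on the weights of the edges between entities and their mutual exclusion relations, it further proposes an optimization algorithm based on graph partitioning that partitions the entities into sets of equivalent entities. Experiments on real data for the brand and person categories in the business domain, crawled from the Internet, show that the method achieves good results.
Keywords:	entity matching; knowledge base; multi-source heterogeneous data; graph partitioning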
Research of Entity Matching Based on Multiple Heterogeneous Data

WANG Lingyang, CHEN Qinkuang, SHOU Lidan, CHEN Ke. Research of Entity Matching Based on Multiple Heterogeneous Data[J]. Computer Engineering and Applications, 2019, 55(19): 87-95. DOI: 10.3778/j.issn.1002-8331.1807-0153

Authors:	WANG Lingyang, CHEN Qinkuang, SHOU Lidan, CHEN Ke

Affiliation:	1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310000, China; 2. Key Laboratory of Big Data Intelligent Computing of Zhejiang Province, Zhejiang University, Hangzhou 310027, China

Abstract:	In recent years, many researchers have proposed solutions to the entity matching problem for multi-source heterogeneous data. However, these methods usually focus on entity matching under semantic frameworks such as RDFS or OWL, so they are not general. In addition, when matching entities across multiple data sources, most current methods reduce the task to a series of pairwise matching problems between two data sources. This approach has high computational complexity and does not analyze the problem from a global, multi-source perspective. To address these issues, this paper proposes an entity matching method that uses the names, attributes, and context information commonly present in entities to construct multiple indexes, which shrink the computation space and generate high-quality candidate sets. The paper also defines a method for calculating entity similarity, which effectively determines whether an entity pair matches. Based on the edge weights and mutual exclusion relations between entities, it proposes an optimization algorithm based on graph partitioning that groups equivalent entities into the same set. Experiments are conducted on real-world datasets of the brand and person categories in the business domain, crawled from the Internet, and the results show that the method performs well.

Keywords:	entity matching; knowledge base; multiple heterogeneous data; graph partitioning
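
The three stages named in the abstract (index-based candidate generation, pairwise similarity, and grouping of equivalent entities under mutual exclusion) can be illustrated with a toy sketch. This is not the authors' algorithm: the abstract gives no formulas, so everything below is an assumption made for illustration, including the sample records, the token inverted index, the equal-weight Jaccard score, the 0.5 threshold, the rule that two entities from the same source are mutually exclusive, and the greedy merge that stands in for the paper's graph-partitioning optimization.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (entity id, data source, name, attributes).
ENTITIES = [
    ("a1", "A", "Acme Coffee", {"country": "US", "founded": "1999"}),
    ("b1", "B", "ACME Coffee Co", {"country": "US", "founded": "1999"}),
    ("c1", "C", "Acme Coffee Company", {"country": "US"}),
    ("a2", "A", "Acme Tea", {"country": "UK", "founded": "2005"}),
    ("b2", "B", "Zenith Motors", {"country": "DE"}),
]


def tokens(name):
    """Lowercased whitespace tokens of a name."""
    return set(name.lower().split())


def jaccard(x, y):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def candidate_pairs(entities):
    """Blocking via an inverted index: only entities that share at least
    one name token are ever compared."""
    index = defaultdict(set)
    for eid, _, name, _ in entities:
        for tok in tokens(name):
            index[tok].add(eid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def similarity(e1, e2):
    """Assumed score: equal-weight mix of name and attribute overlap."""
    name_sim = jaccard(tokens(e1[2]), tokens(e2[2]))
    attr_sim = jaccard(set(e1[3].items()), set(e2[3].items()))
    return 0.5 * name_sim + 0.5 * attr_sim


def group(entities, threshold=0.5):
    """Greedy stand-in for graph partitioning: merge the strongest edges
    first, and never place two entities from the same source in one group
    (the assumed mutual exclusion rule)."""
    by_id = {e[0]: e for e in entities}
    edges = []
    for a, b in candidate_pairs(entities):
        score = similarity(by_id[a], by_id[b])
        if score >= threshold:
            edges.append((score, a, b))
    edges.sort(key=lambda t: (-t[0], t[1], t[2]))

    cluster = {eid: {eid} for eid in by_id}  # id -> shared set object
    for _, a, b in edges:
        ca, cb = cluster[a], cluster[b]
        if ca is cb:
            continue  # already in the same group
        merged = ca | cb
        sources = [by_id[i][1] for i in merged]
        if len(sources) != len(set(sources)):
            continue  # merging would violate mutual exclusion
        for i in merged:
            cluster[i] = merged
    return sorted({tuple(sorted(c)) for c in cluster.values()})


print(group(ENTITIES))
# [('a1', 'b1', 'c1'), ('a2',), ('b2',)]
```

In this sketch the index limits comparison to 6 candidate pairs instead of all 10, and a global view enters only through the exclusion check at merge time; a real partitioning objective would weigh all edges of a component jointly.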
This article is indexed in Wanfang Data and other databases.