Web大数据环境下的不一致跨源数据发现
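Cite this article:余伟, 李石君, 杨莎, 胡亚慧, 刘晶, 丁永刚, 王骞. Web大数据环境下的不一致跨源数据发现[J]. 计算机研究与发展, 2015, 52(2): 295-308. DOI: 10.7544/issn1000-1239.2015.20140224
Authors:余伟  李石君  杨莎  胡亚慧  刘晶  丁永刚  王骞
Affiliations:1. Computer School, Wuhan University, Wuhan 430079; 2. College of Computer Science and Technology, Hankou University, Wuhan 430212; 3. Air Force Early Warning Academy, Wuhan 430070 (yuwei@whu.edu.cn)
Funding:National Natural Science Foundation of China; Fundamental Research Funds for the Central Universities; Natural Science Foundation of Hubei Province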
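Abstract:Data inconsistency across different Web data sources is a pervasive problem that seriously undermines the credibility and quality of the Internet. Current research on data inconsistency concentrates on traditional database applications; little work addresses the consistency of cross-source Web big data, which is diverse in type, complex in structure, rapidly changing, and massive in volume. Targeting the multi-source, heterogeneous nature of cross-source Web data and the 5V characteristics of Web big data, this work builds a unified data extraction algorithm and a Web object data model from three aspects: site structure, feature data, and knowledge rules. It studies the inconsistency characteristics of different types of Web data and establishes an inconsistency classification model, a consistency constraint mechanism, and an algebraic system of inconsistency inference operations. On the basis of this theoretical framework for cross-source Web data consistency, it implements automatic discovery of inconsistent Web data through constraint-rule detection and statistical deviation analysis and, combining the characteristics of these two methods, proposes an automatic discovery algorithm for inconsistent Web data based on hierarchical probabilistic judgment on the Hadoop MapReduce architecture. The framework is evaluated on big data from multiple B2C e-commerce sources on the Hadoop platform and compared with a traditional architecture and with other methods; the experimental results show that the method is accurate and efficient.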

Keywords:Web big data  Web data mining  data consistency  Web data management  data quality assessment  cross-source data analysis

Automatically Discovering of Inconsistency Among Cross-Source Data Based on Web Big Data
Yu Wei, Li Shijun, Yang Sha, Hu Yahui, Liu Jing, Ding Yonggang, Wang Qian. Automatically Discovering of Inconsistency Among Cross-Source Data Based on Web Big Data[J]. Journal of Computer Research and Development, 2015, 52(2): 295-308. DOI: 10.7544/issn1000-1239.2015.20140224
Authors:Yu Wei  Li Shijun  Yang Sha  Hu Yahui  Liu Jing  Ding Yonggang  Wang Qian
Affiliation:1. Computer School, Wuhan University, Wuhan 430079; 2. College of Computer Science and Technology, Hankou University, Wuhan 430212; 3. Air Force Early Warning Academy, Wuhan 430070
Abstract:Data inconsistency is pervasive on the Web and has gravely affected the quality of Web information. Current research on data inconsistency focuses mainly on traditional database applications; there is little consistency research on diverse, complex, rapidly changing and abundant Web big data. Given the multi-source, heterogeneous nature of Web data and the 5V features of big data, we present a unified data extraction algorithm and a Web object data model based on three aspects: website structure, characteristic data and knowledge rules. We study and classify the features of data inconsistency, and establish an inconsistency classification model, a consistency constraint mechanism and an algebraic system for inconsistency inference. Then, based on this theory of cross-source Web data consistency, we develop methods for automatically discovering inconsistent Web data via constraint rule detection and statistical deviation analysis. Combining the characteristics of the two methods, we propose an algorithm for automatically discovering inconsistent Web data that uses hierarchical probabilistic judgment on the Hadoop MapReduce architecture. The framework is applied to big data from multiple B2C e-commerce sources on the Hadoop platform and compared with a traditional architecture and with other methods. The experimental results demonstrate the accuracy and efficiency of the method.
Keywords:Web big data  Web data mining  data consistency  Web data management  data quality assessment  cross-source analysis
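
The abstract names two detection routes, constraint rule detection and statistical deviation analysis, applied per object across sources in a MapReduce setting. The sketch below only illustrates that general idea in plain Python and is not the authors' algorithm: the record layout, the "price must be positive" rule, the modified z-score test and its 3.5 threshold, and all function names are assumptions introduced here, and the map, shuffle and reduce steps are simulated in memory rather than run on Hadoop. The paper's hierarchical probabilistic judgment is replaced by a simple two-stage check (rule first, then deviation).

```python
from collections import defaultdict
from statistics import median

# Hypothetical cross-source records: (source, product_id, attribute, value).
RECORDS = [
    ("shopA", "p1", "price", 99.0),
    ("shopB", "p1", "price", 101.0),
    ("shopC", "p1", "price", 100.0),
    ("shopD", "p1", "price", 180.0),
    ("shopA", "p2", "price", 20.0),
    ("shopB", "p2", "price", -5.0),
    ("shopC", "p2", "price", 20.5),
]


def map_phase(records):
    """Emit ((product_id, attribute), (source, value)) pairs."""
    for source, product_id, attribute, value in records:
        yield (product_id, attribute), (source, value)


def shuffle(pairs):
    """Group mapped pairs by key, as a MapReduce shuffle would."""
    groups = defaultdict(list)
    for key, item in pairs:
        groups[key].append(item)
    return groups


def violates_rule(attribute, value):
    """Illustrative constraint rule (assumed): a price must be positive."""
    return attribute == "price" and value <= 0


def reduce_phase(key, items, threshold=3.5):
    """Flag sources by the rule check first, then by statistical deviation."""
    _, attribute = key
    flagged = {}
    valid = []
    for source, value in items:
        if violates_rule(attribute, value):
            flagged[source] = "rule"
        else:
            valid.append((source, value))
    # Deviation test needs a few rule-abiding values to be meaningful.
    if len(valid) >= 3:
        values = [v for _, v in valid]
        med = median(values)
        mad = median(abs(v - med) for v in values)
        for source, value in valid:
            # Modified z-score based on the median absolute deviation.
            if mad > 0 and 0.6745 * abs(value - med) / mad > threshold:
                flagged[source] = "deviation"
    return flagged


def find_inconsistencies(records):
    """Run the simulated map, shuffle and reduce steps over all records."""
    result = {}
    for key, items in shuffle(map_phase(records)).items():
        flagged = reduce_phase(key, items)
        if flagged:
            result[key] = flagged
    return result


if __name__ == "__main__":
    for key, flagged in find_inconsistencies(RECORDS).items():
        print(key, flagged)
```

Grouping by (object, attribute) keeps each reduce call independent, which is what would let such a check be distributed; a real deployment would also need entity resolution across sources, which this sketch assumes has already been done.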
This article is indexed in CNKI, Wanfang Data, and other databases.