
Discovery of functional dependencies in university data based on affinity propagation clustering and TANE algorithms
Original title (Chinese): 基于近邻传播聚类和TANE算法的高校数据中函数依赖的发现
Cite this article: HUANG Yongxin, TANG Xuefei. Discovery of functional dependencies in university data based on affinity propagation clustering and TANE algorithms[J]. Journal of Computer Applications (计算机应用), 2020, 40(1): 90-95. DOI: 10.11772/j.issn.1001-9081.2019061050
Authors: HUANG Yongxin (黄永鑫), TANG Xuefei (唐雪飞)
Affiliation: School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China
Funding: National Key Research and Development Program of China (2017YFB1401303); Sichuan Science and Technology Program (2017GZ0192).
Abstract: To address two problems in real-world data quality checking at universities, namely missing values in the datasets and functional dependencies that are discovered in small numbers and with low accuracy, a functional dependency discovery method for university data that combines the Affinity Propagation (AP) clustering algorithm with the TANE algorithm (APTANE) is proposed. First, column profiling is performed on the Chinese-language fields in the dataset, and each Chinese field value is represented by a corresponding numerical value. Next, the AP clustering algorithm is used to fill in the missing values in the dataset. Finally, the TANE algorithm is used to automatically discover the non-trivial, minimal functional dependencies in the processed dataset. Experimental results show that, after a real university dataset is repaired with the AP clustering algorithm, the number of functional dependencies discovered increases to 80 compared with applying an automatic functional dependency discovery algorithm directly. The functional dependencies discovered after missing-value filling also represent the relationships between fields more accurately, which reduces the workload of domain experts and improves the quality of the data held by universities.

Keywords: university informationization; data quality; Affinity Propagation (AP) clustering algorithm; functional dependency; TANE
Received: 2019-06-21
Revised: 2019-09-05
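
The first and last steps described in the abstract (encoding field values as numbers, then searching for non-trivial, minimal functional dependencies) can be illustrated with a small sketch. This is not the authors' APTANE implementation: it uses a brute-force level-wise search rather than TANE's partition-based pruning, it ignores dependencies with an empty left-hand side, and it assumes the missing values have already been filled, so the AP clustering step is omitted. The function names and the toy table (`student_id`, `major`, `college`) are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def encode_column(values):
    """Map each distinct non-missing value to an integer code; None stays None."""
    codes = {}
    return [None if v is None else codes.setdefault(v, len(codes)) for v in values]


def holds(rows, lhs, rhs):
    """Return True if rows that agree on every lhs column also agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def discover_fds(rows, n_cols):
    """Brute-force level-wise search for minimal, non-trivial FDs lhs -> rhs."""
    found = []
    for size in range(1, n_cols):
        for lhs in combinations(range(n_cols), size):
            for rhs in range(n_cols):
                if rhs in lhs:
                    continue  # trivial dependency: rhs is already part of lhs
                # Not minimal if a proper subset of lhs already determines rhs.
                if any(set(l) < set(lhs) and r == rhs for l, r in found):
                    continue
                if holds(rows, lhs, rhs):
                    found.append((lhs, rhs))
    return found


# Hypothetical toy table, assumed complete (no missing values left).
columns = {
    "student_id": ["s1", "s2", "s3", "s4"],
    "major": ["SE", "SE", "CS", "Net"],
    "college": ["Info", "Info", "Comp", "Info"],
}
names = list(columns)
rows = list(zip(*(encode_column(columns[n]) for n in names)))
fds = [([names[c] for c in lhs], names[rhs])
       for lhs, rhs in discover_fds(rows, len(names))]
for lhs, rhs in fds:
    print(", ".join(lhs), "->", rhs)
```

On this toy table the search reports `student_id -> major`, `student_id -> college`, and `major -> college`. A candidate such as `student_id, major -> college` is skipped because a proper subset of its left-hand side already determines `college`, which is what the minimality requirement means.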
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).