Applying machine learning techniques for scaling out data quality algorithms in cloud computing environments

Authors:	Dimas Cassimiro Nascimento, Carlos Eduardo Pires, Demetrio Gomes Mestre

Abstract:	Deduplication is the task of identifying the entities in a data set that refer to the same real-world object. Over the last decades, this problem has been widely investigated, and many techniques have been proposed to improve the efficiency and effectiveness of deduplication algorithms. As data sets become larger, such algorithms may create critical bottlenecks in memory usage and execution time. In this context, cloud computing environments have been used for scaling out data quality algorithms. In this paper, we investigate the efficacy of different machine learning techniques for scaling out virtual clusters for the execution of deduplication algorithms under predefined time restrictions. We also propose specific heuristics (Best Performing Allocation, Probabilistic Best Performing Allocation, Tunable Allocation, Adaptive Allocation, and Sliced Training Data) which, together with the machine learning techniques, are able to tune the virtual cluster estimations as demands fluctuate over time. The experiments we have carried out using data sets of multiple scales have provided many insights regarding the adequacy of the considered machine learning algorithms and proposed heuristics for tackling cloud computing provisioning.

Keywords:
This article is indexed in SpringerLink and other databases.
