Healthcare data analysis has recently become an attractive research topic. Data gathering is the first step in data analysis and processing. During data collection, errors may occur due to human mistakes, device faults, or noise in the transmission process. Correct treatment of missing data and outliers preserves the dataset size and improves model performance. This paper provides two enhanced algorithms for handling missing values and outliers in big datasets. The main idea is to divide the dataset into its classes, or to cluster it using k-means++, then calculate the average value of each part, and finally replace the missing data and outliers with the mean value of the corresponding part. The proposed imputation and outlier-handling algorithms are tested on the Pima Indian diabetic dataset, which contains 2768 patients divided into 952 diabetic patients and 1816 controls. Four classifiers (Random Forest, Decision Tree, Support Vector Machine, and Naïve Bayes) are used to evaluate the effect of the proposed algorithms. The results show that the proposed algorithms improve classification accuracy by 8% and decrease the RMSE by 17% compared with Deep Learning (DL), one of the most powerful algorithms used for repairing missing data.
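The core idea of replacing missing values and outliers with the mean of their own class can be illustrated with a minimal sketch. It is not the paper's implementation. Several details are assumptions made only for illustration: the function name `repair_by_class_mean` is hypothetical, missing entries are encoded as `None`, outliers are flagged with the Tukey IQR rule (the abstract does not state which rule is used), and the mean is taken over the in-range values of each class. For unlabeled data, the class labels could instead be cluster assignments, for example from k-means++.

```python
from statistics import mean, quantiles


def repair_by_class_mean(rows, labels, k=1.5):
    """Replace missing values (None) and outliers with the per-class column mean.

    Assumptions (not taken from the source): missing entries are None,
    outliers are flagged with the Tukey IQR rule using fence factor k,
    and the mean is computed over the in-range values of that class.
    """
    repaired = [list(r) for r in rows]  # copy so the input is left untouched
    n_cols = len(rows[0])
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        for j in range(n_cols):
            observed = [rows[i][j] for i in idx if rows[i][j] is not None]
            if len(observed) < 2:
                continue  # too few values to estimate quartiles
            q1, _, q3 = quantiles(observed, n=4)
            lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
            inliers = [v for v in observed if lo <= v <= hi]
            fill = mean(inliers or observed)  # fall back if nothing is in range
            for i in idx:
                v = rows[i][j]
                if v is None or not (lo <= v <= hi):
                    repaired[i][j] = fill
    return repaired


# Toy single-feature example (made-up numbers, not from the Pima dataset).
rows = [[100.0], [102.0], [98.0], [101.0], [99.0], [500.0], [None],
        [150.0], [152.0], [148.0], [None], [151.0], [149.0]]
labels = [0] * 7 + [1] * 6
repaired = repair_by_class_mean(rows, labels)
```

In this toy input, the value 500.0 and the missing entry in class 0 are both replaced by the class-0 mean of the remaining values, while the missing entry in class 1 receives the class-1 mean. Computing the mean per class rather than over the whole dataset keeps the repaired values close to the distribution of the group they belong to.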