Similar Literature

20 similar documents found.
1.

Fraudulent online sellers often collude with reviewers to garner fake reviews for their products. This practice undermines buyers' trust in product reviews and can reduce the effectiveness of online markets, so accurate detection of fake reviews is critical. In this study, we investigate several preprocessing and text-based feature-extraction methods, together with machine learning classifiers (both single and ensemble models), to build a fake review detection system. Because fake reviews are far outnumbered by genuine ones in product review data, we examine per-class results in detail in addition to the overall results. Our preliminary analysis shows that, owing to this imbalance, per-class accuracies diverge sharply (e.g., 1.3% for the fake review class versus 99.7% for the genuine review class) even though the overall accuracy looks promising (around 89.7%). We propose two dynamic random sampling techniques, compatible with text-based feature-extraction methods, to address this class imbalance. Our results indicate that both techniques improve the accuracy of the fake review class: on balanced datasets, it rises to a maximum of 84.5% and 75.6% for random under- and over-sampling, respectively, while the accuracy for genuine reviews drops to 75% and 58.8%, respectively. We also find that, on smaller datasets, the Adaptive Boosting ensemble model outperforms single classifiers, whereas on larger datasets the improvement from ensemble models is insignificant compared with the best results obtained by single classifiers.
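The paper's two dynamic sampling techniques are not detailed here, but the plain random under- and over-sampling they build on can be sketched in a few lines (the helper names and toy review data are illustrative, not from the paper):

```python
import random

def undersample(majority, minority, seed=0):
    """Randomly drop majority-class samples until the classes are balanced."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + minority

def oversample(majority, minority, seed=0):
    """Randomly duplicate minority-class samples until the classes are balanced."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

# Toy data: 9 genuine reviews vs. 1 fake review (labels: 1 genuine, 0 fake).
genuine = [("looks great, works fine", 1)] * 9
fake = [("best product ever, buy now!!!", 0)]
under = undersample(genuine, fake)   # 1 genuine + 1 fake
over = oversample(genuine, fake)     # 9 genuine + 9 fake
```

Under-sampling discards majority information, over-sampling repeats minority points; the trade-off matches the paper's observation that each gains minority accuracy at some cost to genuine-review accuracy.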


2.
Dynamic ensemble selection of classifiers is an effective approach to classifying label-imbalanced data. However, the technique is prone to overfitting, owing to the lack of regularization and its dependence on the local geometry of the data. In this study, focusing on binary imbalanced data classification, we propose a novel dynamic ensemble method, the adaptive ensemble of classifiers with regularization (AER), to overcome these limitations. The method tackles overfitting from the perspective of implicit regularization: it leverages the properties of stochastic gradient descent to obtain the minimum-norm solution, thereby achieving regularization, and it interpolates the ensemble weights using the global geometry of the data to further prevent overfitting. According to our theoretical proofs, the seemingly complicated AER paradigm, in addition to its regularization capabilities, actually reduces the asymptotic time and memory complexities of several other algorithms. We evaluate AER on seven benchmark imbalanced datasets from the UCI machine learning repository and one artificially generated GMM-based dataset with five variations. The results show that the proposed algorithm outperforms the major existing algorithms on multiple metrics in most cases, and two hypothesis tests (McNemar's and Wilcoxon's) further verify the statistical significance. In addition, the method has other desirable properties, such as a particular advantage on highly imbalanced data, and it pioneers research on regularization for dynamic ensemble methods.

3.
Classifying non-stationary and imbalanced data streams involves two major challenges: concept drift and class imbalance. Concept drift refers to changes in the underlying function being learnt, and class imbalance to a vast difference between the numbers of instances in different classes. Class imbalance hampers the performance of most classifiers. Previous methods for classifying non-stationary and imbalanced data streams have mainly been batch solutions, in which the classification model is trained on a chunk of data. Here, we propose two online classifiers, both one-layer neural networks, that handle class imbalance with two separate cost-sensitive strategies: the first incorporates a fixed misclassification cost matrix, the second an adaptive one. The proposed classifiers are evaluated on 3 synthetic and 8 real-world datasets, and the results show statistically significant improvements on imbalanced-data metrics.
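A minimal sketch of the fixed-cost strategy, assuming a cost-weighted perceptron-style update (the cost values, learning rate, and toy stream are illustrative; the paper's exact update rule may differ):

```python
def train_cost_sensitive(stream, costs, lr=0.1, dim=2):
    """Online one-layer learner: each mistake triggers a perceptron-style update
    scaled by the misclassification cost of the true class, so errors on the
    minority class (higher cost) move the weights more."""
    w, b = [0.0] * dim, 0.0
    for x, y in stream:                               # y in {0, 1}; 1 = minority
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        pred = 1 if score > 0 else 0
        if pred != y:
            step = lr * costs[y] * (1.0 if y == 1 else -1.0)
            w = [wi + step * xi for wi, xi in zip(w, x)]
            b += step
    return w, b

# Toy stream: minority class (1) costs 5x more to misclassify than class 0.
stream = [([1.0, 1.0], 1), ([-1.0, -1.0], 0)] * 3
w, b = train_cost_sensitive(stream, costs={1: 5.0, 0: 1.0})
```

An adaptive variant would recompute `costs` from the running class frequencies instead of fixing them up front.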

4.
5.
The problem of limited minority-class data arises in many class-imbalanced applications but has received little attention. Synthetic over-sampling, a popular family of class-imbalance learning methods, can introduce considerable noise when minority-class data are limited, because the synthetic samples are not i.i.d. samples of the minority class. Most sophisticated synthetic sampling methods tackle this problem by denoising or by generating samples more consistent with the ground-truth data distribution, but their assumptions about the true noise or the ground-truth distribution may not hold. To adapt synthetic sampling to limited minority-class data, the proposed Traso framework treats synthetic minority-class samples as an additional data source and exploits transfer learning to transfer knowledge from them to the minority class. As an implementation, the TrasoBoost method first generates synthetic samples to balance the class sizes. Then, in each boosting iteration, the weights of misclassified synthetic samples decrease while those of misclassified original samples increase; weights are otherwise unchanged. Misclassified synthetic samples are treated as potential noise and thus have less influence in subsequent iterations. Moreover, the weights of minority-class instances change more than those of majority-class instances, making them more influential, and only the original data are used to estimate the error rate, keeping it immune to noise. Finally, since the synthetic samples are highly related to the minority class, all the weak learners are aggregated for prediction. Experimental results show that TrasoBoost outperforms many popular class-imbalance learning methods.
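The boosting-round reweighting described above can be sketched as follows (the down/up factors and the function name are illustrative choices, not the paper's actual update formulas):

```python
def update_weights(weights, is_synthetic, misclassified, down=0.7, up=1.4):
    """One boosting round of transfer-style reweighting: misclassified synthetic
    samples are down-weighted (treated as likely noise), misclassified original
    samples are up-weighted, and correctly classified samples keep their weight."""
    new = []
    for w, syn, miss in zip(weights, is_synthetic, misclassified):
        if not miss:
            new.append(w)
        elif syn:
            new.append(w * down)   # shrink influence of suspect synthetic points
        else:
            new.append(w * up)     # focus on hard original examples
    total = sum(new)
    return [w / total for w in new]  # renormalise to a distribution
```

A fuller implementation would also apply larger factors to minority-class originals and estimate the round's error rate on original data only, as the abstract describes.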

6.
The term positive unlabeled learning refers to the binary classification problem in the absence of negative examples. When only positive and unlabeled instances are available, semi-supervised classification algorithms cannot be applied directly, so new algorithms are required. One such algorithm is the positive naive Bayes (PNB), an adaptation of the naive Bayes induction algorithm that does not require negative instances. In this work we propose two ways of enhancing this algorithm. On the one hand, we take the idea behind PNB one step further, proposing a procedure for building more complex Bayesian classifiers in the absence of negative instances, and present a new algorithm (positive tree augmented naive Bayes, PTAN) for obtaining tree augmented naive Bayes models in the positive unlabeled domain. On the other hand, we propose a new Bayesian approach to the a priori probability of the positive class that models the uncertainty over this parameter by means of a Beta distribution. This approach is applied to both PNB and PTAN, yielding two further algorithms. The four algorithms are compared empirically on positive unlabeled learning problems based on real and synthetic databases. The results suggest that, when the predictor variables are not conditionally independent given the class, extending PNB to more complex networks improves classification performance; they also show that our Bayesian treatment of the a priori probability of the positive class can improve the results obtained by PNB and PTAN.
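A minimal sketch of the core PNB-style estimation step for Bernoulli features, assuming a known positive-class prior: P(f|negative) is recovered from the mixture identity P(f|unlabeled) = prior·P(f|positive) + (1−prior)·P(f|negative). Function and variable names are illustrative:

```python
def pnb_feature_probs(pos_counts, unl_counts, n_pos, n_unl, prior_pos, eps=1e-9):
    """Per-feature Bernoulli estimates for a positive-naive-Bayes-style model.
    P(f|neg) is solved from P(f|unl) = prior*P(f|pos) + (1-prior)*P(f|neg)."""
    features = set(pos_counts) | set(unl_counts)
    p_pos = {f: pos_counts.get(f, 0) / n_pos for f in features}
    p_unl = {f: unl_counts.get(f, 0) / n_unl for f in features}
    p_neg = {}
    for f in features:
        est = (p_unl[f] - prior_pos * p_pos[f]) / (1.0 - prior_pos)
        p_neg[f] = min(max(est, eps), 1.0 - eps)  # clip into (0, 1)
    return p_pos, p_neg
```

The clipping step matters in practice: with finite samples the mixture identity can yield estimates outside [0, 1]. The paper's second contribution replaces the fixed `prior_pos` with a Beta distribution over it.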

7.
In this paper we consider the induction of rule-based classifiers from imbalanced data, where one class (the minority class) is under-represented relative to the remaining majority classes. The minority class is usually of primary interest, yet most rule-based classifiers are biased towards the majority classes and have difficulty correctly recognising the minority class. We discuss the sources of these difficulties, related both to the data characteristics and to the algorithms themselves. Among the data-related problems we focus on the role of small disjuncts, class overlap, and the presence of noisy examples. We then show that standard techniques for inducing rule-based classifiers, such as sequential covering, top-down rule induction, and the usual classification strategies, were created under the assumption of a balanced class distribution, and we explain why they are biased towards the majority classes. Some modifications of rule-based classifiers have already been introduced, but they usually address individual problems. We therefore propose a novel algorithm, BRACID, which addresses the issues associated with imbalanced data more comprehensively. Its main characteristics include a hybrid representation of rules and single examples, bottom-up rule learning, and a local classification strategy using nearest rules. BRACID has been evaluated in experiments on several imbalanced datasets. The results show that it significantly outperforms the well-known rule-based classifiers C4.5rules, RIPPER, PART, CN2, and MODLEM, as well as related classifiers such as RISE and k-NN. Moreover, it is comparable to or better than approaches specialized for imbalanced data, such as generalizations of rule algorithms or the combination of SMOTE + ENN preprocessing with PART. Finally, it improves the support of minority-class rules, leading to better recognition of minority-class examples.

8.
Classification of weld flaws with imbalanced class data
This paper presents the results of our investigation of the imbalanced data problem in the classification of different types of weld flaws, a multi-class classification problem. The one-against-all scheme is adopted to carry out multi-class classification, and three algorithms (minimum distance, nearest neighbours, and fuzzy nearest neighbours) are employed as the classifiers. The effectiveness of 22 data preprocessing methods for dealing with imbalanced data is evaluated on eight criteria to determine whether any method dominates the others. The test results indicate that: (1) nearest-neighbour classifiers outperform the minimum distance classifier; (2) some data preprocessing methods do not improve any criterion, and their effects vary from one classifier to another; (3) the combination of the AHC_KM preprocessing method with the 1-NN classifier is the best, producing the best performance on six of the eight evaluation criteria; and (4) the most difficult weld flaw type to recognise is the crack.
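The one-against-all scheme with a nearest-neighbour base scorer can be sketched as below (the distance-weighted scoring rule and the toy weld-flaw labels are illustrative assumptions, not the paper's exact setup):

```python
def nn_score(x, data):
    """1-NN binary scorer: sign from the nearest neighbour's label,
    magnitude shrinking with its distance."""
    nearest = min(data, key=lambda item: sum((a - b) ** 2 for a, b in zip(x, item[0])))
    dist = sum((a - b) ** 2 for a, b in zip(x, nearest[0])) ** 0.5
    return (1.0 if nearest[1] == 1 else -1.0) / (1.0 + dist)

def one_vs_all_predict(x, train, classes, base_score):
    """One-against-all: relabel the data for each class-vs-rest problem and
    return the class whose binary scorer is most confident."""
    best, best_score = None, float("-inf")
    for c in classes:
        relabeled = [(xi, 1 if yi == c else 0) for xi, yi in train]
        s = base_score(x, relabeled)
        if s > best_score:
            best, best_score = c, s
    return best

# Toy weld-flaw features: one training point per flaw type.
train = [((0.0, 0.0), 'crack'), ((5.0, 5.0), 'porosity'), ((10.0, 0.0), 'slag')]
pred = one_vs_all_predict((0.5, 0.5), train, ['crack', 'porosity', 'slag'], nn_score)
```

The per-class preprocessing studied in the paper would be applied to each `relabeled` binary problem before scoring.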

9.
Research status of imbalanced data classification: a survey
Imbalanced data are widespread in real-world applications and pose a challenge to the machine learning field; how to handle them effectively has become a new research focus. This paper surveys the research status of this emerging area, including its latest topics, methods, and results.

10.
11.
Most existing data-stream classification algorithms assume a balanced class distribution, yet in real data-stream environments the distribution is often imbalanced and accompanied by concept drift. To address both problems, a new ensemble-based classification algorithm for imbalanced data streams is proposed. First, a hybrid sampling method balances the data set before the model is trained; then a weighting-and-elimination strategy for the base classifiers handles concept drift, improving classification performance. Comparative experiments against classic data-stream classification algorithms on synthetic and real data sets show that the proposed algorithm outperforms the others overall in streams containing concept drift and class imbalance.
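The weighting-and-elimination step for the base classifiers can be sketched as a pruned weighted vote (the threshold and function names are illustrative assumptions):

```python
def weighted_vote(classifiers, weights, x, min_weight=0.1):
    """Prune base classifiers whose weight has decayed below min_weight,
    then let the survivors cast weight-scaled votes."""
    votes = {}
    for clf, w in zip(classifiers, weights):
        if w < min_weight:
            continue                      # eliminated from the ensemble
        label = clf(x)
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

clfs = [lambda x: 0, lambda x: 1, lambda x: 1]
prediction = weighted_vote(clfs, [0.9, 0.05, 0.5], x=None)  # middle voter pruned
```

Under concept drift, the weights of classifiers trained on stale concepts decay, so pruning them keeps the ensemble aligned with the current distribution.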

12.
The storage and labeling of industrial data incur significant costs during the development of defect detection algorithms. Active learning alleviates these costs by selecting the most informative of the available unlabeled data. Existing active learning methods for image segmentation focus on natural and medical images, give less attention to industrial images, and rarely address imbalanced data. To solve these problems, we propose an active learning framework for selecting informative data for defect segmentation under imbalanced data. In the initialization stage, the framework uses self-supervised learning to select initial data containing more defect examples, thereby mitigating the cold-start problem. In the iterative stage, the main body of the framework consists of a segmentation learner and a reconstruction learner, which use supervised learning to further improve the framework's ability to select informative data. Experimental results on public and self-owned datasets show that the framework can save 70% of the required storage space and greatly reduce labeling cost. The intersection-over-union values show that the framework can match the effect of labeling the whole dataset while labeling only part of it.

13.
An over-sampling algorithm based on preliminary classification for learning from imbalanced data sets
韩慧  王路  温明  王文渊 《计算机应用》2006,26(8):1894-1897
To improve the classification performance of the minority class in imbalanced data sets, an over-sampling algorithm based on preliminary classification is proposed. First, the test set is pre-classified so as to retain as much useful information about the majority class as possible; then the samples predicted as minority class in the pre-classification are classified again, effectively improving minority-class performance. Using data sets from the University of California, Irvine, the proposed algorithm is compared experimentally with the synthetic minority over-sampling technique (SMOTE) and with under-sampling. The results show that the proposed algorithm outperforms both other algorithms on the classification performance of the minority and the majority class alike.

14.
A new over-sampling algorithm based on negative immunity for imbalanced data
陶新民  徐晶 《控制与决策》2010,25(6):867-872
To improve classification performance on imbalanced data sets, an over-sampling algorithm based on negative immunity is proposed. The algorithm uses negative immunity to cover the minority-class sample space, taking the centres of the generated detectors as artificially generated minority-class samples. Because the algorithm generates minority samples from majority-class information, it avoids the lack of spatial representativeness that affects the artificial samples produced by the synthetic minority over-sampling technique (SMOTE). Experiments comparing the algorithm with SMOTE and its improved variants show that it not only effectively improves the classification performance of the minority class but also significantly improves overall classification performance.

15.
Hu Li  Ye Wang  Hua Wang  Bin Zhou 《World Wide Web》2017,20(6):1507-1525
Imbalanced streaming data are commonly encountered in real-world data mining and machine learning applications and have attracted much attention in recent years. In practice, imbalance and streaming normally occur together, yet little research has studied the two jointly. In this paper, we propose a multi-window ensemble learning method for classifying imbalanced streaming data. Three types of window store, respectively, the current batch of instances, the latest minority instances, and the ensemble classifier. The ensemble consists of a set of the latest sub-classifiers together with the instances used to train each of them. All sub-classifiers are weighted before predicting the class labels of newly arriving instances, and a new sub-classifier is trained only when the precision falls below a predefined threshold. Extensive experiments on synthetic and real-world datasets demonstrate that the new approach classifies imbalanced streaming data efficiently and effectively, and generally outperforms existing approaches.
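The window that retains the latest minority instances can be sketched as a fixed-size buffer (the class and method names are illustrative; the paper's other two windows hold the current batch and the ensemble itself):

```python
from collections import deque

class MinorityWindow:
    """Fixed-size window that keeps only the latest minority-class instances,
    so each new sub-classifier can train on a rebalanced batch."""
    def __init__(self, size):
        self.buf = deque(maxlen=size)

    def observe(self, x, y, minority_label=1):
        if y == minority_label:
            self.buf.append((x, y))

    def rebalance(self, batch):
        """Current batch plus the stored minority instances."""
        return list(batch) + list(self.buf)
```

Because `deque(maxlen=size)` silently evicts the oldest entries, the window tracks drift in the minority class without any explicit bookkeeping.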

16.
Automatic in silico synthesis of metabolic pathways can substantially reduce the cost of wet laboratories. To achieve this, the first essential step is predicting whether two metabolites are transformable. We study the problems of predicting whether one metabolite can be transformed into another and of computationally synthesizing a metabolic pathway, casting both as classifying features of metabolite pairs into transformable and non-transformable classes. This study makes two main contributions: (1) two new feature schemes for representing the transformability of each metabolite pair from 2D and 3D compound structural features, namely the features projected onto their first principal component and the averaged features; and (2) a modified imbalanced-data handling method that balances the data by adding synthetic boundary data from the different classes. On the E. coli reference pathways, the proposed features with feature selection and our imbalanced-data handling approach perform better than other methods on several metrics. Our significant feature group can achieve high classification correctness for computational pathway synthesis: in pathway recovery with a group of neural network models, 19 pathways were significantly recovered by our feature group at a recovery ratio of at least 0.5, whereas the compared feature group yielded only four significantly recovered pathways.

17.
18.
Traditional classification algorithms require a large number of labelled examples from all the predefined classes, which are generally difficult and time-consuming to obtain. Furthermore, data uncertainty is prevalent in many real-world applications, such as sensor networks, market analysis, and medical diagnosis. In this article, we explore classification on uncertain data when only positive and unlabelled examples are available, and propose an algorithm to build a naive Bayes classifier from positive and unlabelled examples with uncertainty. The algorithm requires the prior probability of the positive class, which is generally difficult for the user to provide in practice, so two approaches are proposed to avoid this user-specified parameter: one searches for an appropriate value on a validation set, the other estimates it directly. Our extensive experiments show that both approaches achieve satisfactory classification performance on uncertain data. In addition, by exploiting the uncertainty in the dataset, our algorithm can achieve better classification performance than traditional naive Bayes, which ignores uncertainty when handling uncertain data.

19.
In many real applications, data are not all available at the same time, or it is not affordable to process them all in one batch; rather, instances arrive sequentially in a stream. Streaming data introduce new challenges for the machine learning community, since difficult decisions have to be made. The problem addressed in this paper is that of classifying incoming instances for which one attribute arrives only after a given delay. In this formulation, many open issues arise, such as how to classify the incomplete instance, whether to wait for the delayed attribute before classifying at all, and when and how to update a reference set. Three strategies are proposed that address these issues differently, and, orthogonally to them, three classifiers of different characteristics are used. Keeping the online learning strategies independent of the classifiers simplifies system design and contrasts with the common alternative of carefully crafting an ad hoc classifier. To assess learning under these strategies and classifiers, they are compared using learning curves and final classification errors on fifteen data sets. The results indicate that learning in this stringent streaming, delayed-attribute context can succeed even with simple online strategies. Furthermore, active strategies generally behave better than more conservative passive ones. Regarding the classifiers, simple instance-based classifiers such as the well-known nearest neighbour may outperform more elaborate ones such as support vector machines, especially if some measure of classification confidence is considered in the process.
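One simple strategy in this spirit, classifying immediately with the delayed attribute imputed by its running mean and revising the prediction once the true value arrives, might look like this (the class and the imputation choice are illustrative assumptions, not necessarily one of the paper's three strategies):

```python
class DelayedAttributeClassifier:
    """Classify an incomplete instance right away by imputing the delayed
    attribute with its running mean; re-classify when the value arrives."""
    def __init__(self, classify):
        self.classify = classify          # any function over a complete feature list
        self.total, self.count = 0.0, 0   # running statistics of the delayed attribute

    def predict_early(self, partial_x):
        mean = self.total / self.count if self.count else 0.0
        return self.classify(partial_x + [mean])

    def predict_final(self, partial_x, delayed):
        self.total += delayed
        self.count += 1
        return self.classify(partial_x + [delayed])

clf = DelayedAttributeClassifier(lambda x: 1 if sum(x) > 1 else 0)
```

Early predictions improve as the running mean accumulates observations of the delayed attribute.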

20.
Zhang  Yong  Liu  Bo  Cai  Jing  Zhang  Suhua 《Neural computing & applications》2016,28(1):259-267

The extreme learning machine for single-hidden-layer feedforward neural networks has been widely applied to imbalanced data learning thanks to its fast training. Ensemble approaches can effectively improve classification performance by combining several weak learners according to some rule. In this paper, a novel ensemble approach based on the weighted extreme learning machine is proposed for the imbalanced data classification problem, with the weight of each base learner in the ensemble optimized by a differential evolution algorithm. Experimental results on 12 datasets show that the proposed method achieves better classification performance than both the simple vote-based ensemble and the non-ensemble method.

