Similar Documents
 20 similar documents retrieved (search time: 125 ms)
1.
Data arising in many real-world domains are often multi-class and imbalanced. In multi-class imbalanced classification, problems such as class overlap, noise, and multiple minority classes reduce classifier capability, and effectively addressing multi-class imbalance has become an important research topic in machine learning and data mining. Based on recent literature on multi-class imbalanced classification methods, this paper analyses and summarizes the field from two perspectives, data preprocessing and algorithm-level classification methods, and examines all algorithms in detail in terms of their advantages, disadvantages, and the data sets used. Among data preprocessing methods, over-sampling, under-sampling, hybrid sampling, and feature selection are introduced, and the performance of algorithms evaluated on the same data sets is compared. Algorithm-level classification methods are introduced and analysed from three aspects: base-classifier optimization, ensemble learning, and multi-class decomposition techniques. Finally, future research directions in multi-class imbalanced data classification are summarized.
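To make the two method families surveyed above concrete, here is a minimal sketch (my own illustration, not from the paper) that combines a data-level step (random over-sampling) with an algorithm-level multi-class decomposition (one-vs-rest). It assumes scikit-learn and imbalanced-learn are installed, and the dataset and class weights are synthetic stand-ins.

```python
# Minimal sketch: data-level resampling + one-vs-rest decomposition for
# multi-class imbalanced data. Assumes scikit-learn and imbalanced-learn.
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# A synthetic 3-class imbalanced problem (90% / 7% / 3%).
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           weights=[0.9, 0.07, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Data-level: random over-sampling of the minority classes.
X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)
print("before:", Counter(y_tr), "after:", Counter(y_bal))

# Algorithm-level: one-vs-rest decomposition of the multi-class task.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_bal, y_bal)
print("balanced accuracy:", balanced_accuracy_score(y_te, clf.predict(X_te)))
```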

2.
胡顺仁  欧阳 《计算机科学》2004,31(3):190-191
Dependency relationships between classes are important for object-oriented system analysis, design, and testing. This paper first defines and explains inter-class dependency relationships and subdivides them into data dependencies and method dependencies. On this basis, inter-class dependencies are measured: two metrics, the dependency degree and the depended-on degree, are proposed and used to determine the size of a class.
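A toy sketch of the two metrics (my own simplification, not the authors' formulation): given each class's combined data and method dependencies on other classes, it counts how many classes a class depends on (dependency degree) and how many classes depend on it (depended-on degree). The class names and dependency table are hypothetical.

```python
# Toy illustration of "dependency degree" and "depended-on degree" between
# classes; the dependency table below is hypothetical.
from collections import defaultdict

# class -> set of classes it depends on (union of data and method dependencies)
depends_on = {
    "Order":    {"Customer", "Product"},
    "Invoice":  {"Order", "Customer"},
    "Customer": set(),
    "Product":  set(),
}

dependency_degree = {c: len(targets) for c, targets in depends_on.items()}

depended_on_degree = defaultdict(int)
for c, targets in depends_on.items():
    for t in targets:
        depended_on_degree[t] += 1

for c in depends_on:
    print(f"{c}: depends on {dependency_degree[c]}, "
          f"depended on by {depended_on_degree[c]}")
```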

3.
Measurement and Analysis of Dependencies between Object Classes   (total citations: 4; self-citations: 1; citations by others: 4)
Dependency relationships between classes are important for object-oriented system analysis, design, and testing. This paper first defines and explains inter-class dependency relationships and subdivides them into data dependencies and method dependencies. On this basis, two metrics, the dependency degree and the depended-on degree, are proposed, together with a rigorous semantic analysis and explanation. Finally, an algorithm that uses these two metrics to determine the size of a class is presented.

4.
This paper proposes to apply machine learning techniques to predict students’ performance on two real-world educational data-sets. The first data-set is used to predict the response of students with autism while they learn a specific task, whereas the second one is used to predict students’ failure at a secondary school. The two data-sets suffer from two major problems that can negatively impact the ability of classification models to predict the correct label: class imbalance and class noise. A series of experiments have been carried out to improve the quality of the training data, and hence improve prediction results. In this paper, we propose two noise filter methods to eliminate the noisy instances from the majority class located inside the borderline area. Our methods combine the over-sampling SMOTE technique with the thresholding technique to balance the training data and choose the best boundary between classes. Then we apply a noise detection approach to identify the noisy instances. We have used the two data-sets to assess the efficacy of class-imbalance approaches as well as both proposed methods. Results for different classifiers show that the AUC scores improve significantly when the two proposed methods are combined with existing class-imbalance techniques.
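A hedged sketch of the general recipe described above: SMOTE balancing followed by a simple k-NN based filter that drops majority instances lying in minority-dominated (borderline) regions. This is not the authors' exact noise filter or thresholding procedure, just the overall idea; it assumes scikit-learn and imbalanced-learn, and the data are synthetic.

```python
# Sketch: SMOTE balancing + a simple k-NN filter that removes majority
# instances whose neighbourhood is dominated by the minority class.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           flip_y=0.05, random_state=1)

# 1) Balance the training data with SMOTE.
X_bal, y_bal = SMOTE(random_state=1).fit_resample(X, y)

# 2) Filter: a majority instance (label 0) is treated as borderline noise if
#    most of its k nearest neighbours belong to the minority class.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_bal)
_, idx = nn.kneighbors(X_bal)
neigh_labels = y_bal[idx[:, 1:]]              # skip the point itself
minority_ratio = (neigh_labels == 1).mean(axis=1)
keep = ~((y_bal == 0) & (minority_ratio > 0.5))

X_clean, y_clean = X_bal[keep], y_bal[keep]
print("kept", keep.sum(), "of", len(y_bal), "instances")
```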

5.
Feature extraction is an important component of pattern classification and speech recognition. Extracted features should discriminate classes from each other while being robust to environmental conditions such as noise. For this purpose, several feature transformations have been proposed, which can be divided into two main categories: data-dependent transformations and classifier-dependent transformations. The drawback of data-dependent transformations is that their optimization criteria differ from the measure of classification error, which can potentially degrade the classifier’s performance. In this paper, we propose a framework to optimize data-dependent feature transformations such as PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis) and HLDA (Heteroscedastic LDA) using minimum classification error (MCE) as the main objective. The classifier itself is based on a Hidden Markov Model (HMM). In our proposed HMM minimum classification error technique, the transformation matrices are modified to minimize the classification error for the mapped features, and the dimension of the feature vector is not changed. To evaluate the proposed methods, we conducted several experiments on the TIMIT phone recognition and the Aurora2 isolated word recognition tasks. The experimental results show that the proposed methods improve the performance of the PCA, LDA and HLDA transformations for mapping Mel-frequency cepstral coefficients (MFCCs).
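As a baseline for the transforms discussed above, the following sketch fits a plain LDA projection of feature vectors; the paper's contribution is to further re-tune such a matrix with an MCE objective inside an HMM recognizer, which is not reproduced here. The 39-dimensional features are synthetic stand-ins for MFCC vectors, and scikit-learn is assumed.

```python
# Plain LDA projection of MFCC-like feature vectors (the data-dependent
# baseline that an MCE-style method would further adjust).
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 39-dimensional vectors (a common MFCC + deltas layout), 10 phone classes.
X, y = make_classification(n_samples=5000, n_features=39, n_informative=20,
                           n_classes=10, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=9).fit(X, y)  # at most C-1 dims
X_proj = lda.transform(X)
print(X_proj.shape)          # (5000, 9)

# lda.scalings_ holds the projection directions; an MCE-based method would
# modify this matrix to directly minimise classification error.
print(lda.scalings_.shape)
```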

6.
Kernel functions play a central role in kernel methods, so over the years the optimization of kernel functions has been a promising research area. Ideally, Fisher discriminant criteria can be used as objective functions to optimize the kernel function and enlarge the margin between different classes. Unfortunately, Fisher criteria are optimal only in the case that all the classes are generated from underlying multivariate normal distributions with a common covariance matrix but different means, and each class is expressed by a single cluster. Due to these assumptions, Fisher criteria are obviously not a suitable choice as a kernel optimization rule in some applications, such as multimodally distributed data. To solve this problem, many improved discriminant criteria (DC) have recently been developed. Therefore, to apply these discriminant criteria to kernel optimization, in this paper we propose a unified kernel optimization framework based on a data-dependent kernel function, which can use any discriminant criterion formulated in a pairwise manner as its objective function. Under this kernel optimization framework, to employ different discriminant criteria, one only has to change the corresponding affinity matrices, without resorting to any complex derivations in feature space. Experimental results on several benchmark data sets demonstrate the efficiency of our method.
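For reference, one common "data-dependent kernel" construction is a conformal rescaling of a base kernel, k(x, z) = q(x) · k0(x, z) · q(z), where q(·) is expanded over a set of data points. The sketch below (my own illustration) builds such a kernel with placeholder coefficients; optimizing those coefficients against a discriminant criterion, which is the topic of the paper, is not shown.

```python
# Data-dependent kernel as a conformal rescaling of a base RBF kernel:
# k(x, z) = q(x) * k0(x, z) * q(z),  q(x) = a0 + sum_i a_i * k1(x, e_i).
# The coefficients a_i are placeholders; the paper optimises them against
# pairwise discriminant criteria.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                        # training data
E = X[rng.choice(len(X), 20, replace=False)]         # expansion points
alpha0, alpha = 1.0, rng.normal(scale=0.1, size=20)  # placeholder coefficients

def q(Z, gamma=0.5):
    """Data-dependent factor q(z) for each row of Z."""
    return alpha0 + rbf_kernel(Z, E, gamma=gamma) @ alpha

def data_dependent_kernel(A, B, gamma_base=0.1):
    k0 = rbf_kernel(A, B, gamma=gamma_base)          # base kernel
    return np.outer(q(A), q(B)) * k0

K = data_dependent_kernel(X, X)
print(K.shape)   # (200, 200), symmetric and data-dependent
```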

7.
Deep models are extremely data hungry. Their success is driven by the availability of large amounts of training data. For semantic segmentation tasks on aerial and satellite imagery, a major dilemma at present is that they still rely heavily on manual labelling of data. Among these tasks, the semantic segmentation of roads is special because it is possible to use auxiliary data, such as GPS track data, to label data automatically. For a better understanding of this possibility, this paper rethinks some basic issues of labelling approaches for roads.

We experimentally investigated the unavoidable class imbalance problem in road segmentation tasks through simulated and real datasets and quantitatively show that class imbalance has a serious detrimental impact on deep models’ generalization performance. We also observed that this detrimental impact even outweighs the benefits of strictly annotating roads: expanding road labels can give deep networks better segmentation accuracy, even though the segmentation boundary is no longer the edge of the road. We think this is because the impact of class imbalance far outweighs the sensitivity of DNNs to the true road edges. This finding supports the use of centreline-based approaches in place of edge-based approaches in some applications, for more cost-effective solutions.

We propose a guided Random Sample Consensus (RANSAC) algorithm to determine the optimal expansion ratio of road labels. On this basis, we further propose a general framework that combines two networks to achieve better performance than the state of the art obtained by using either network alone. We attribute this to the alleviation of the class imbalance problem, because simply cascading the two networks did not improve accuracy in our experiments. We believe this work is enlightening for studies of road segmentation.
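The label-expansion effect on class imbalance can be illustrated with a simple morphological dilation of a binary road mask, as in the sketch below (my own illustration; the paper's guided RANSAC procedure for choosing the expansion ratio is not reproduced). It assumes NumPy and SciPy; the mask is a made-up one-pixel centreline.

```python
# Illustration of road-label expansion and its effect on class imbalance:
# dilate a thin (centreline-like) road mask and compare foreground ratios.
import numpy as np
from scipy.ndimage import binary_dilation

mask = np.zeros((256, 256), dtype=bool)
mask[:, 128] = True                      # a 1-pixel-wide "road" centreline

for width in (0, 2, 4, 8):               # expansion in pixels on each side
    expanded = binary_dilation(mask, iterations=width) if width else mask
    ratio = expanded.mean()              # fraction of road pixels
    print(f"expansion {width:>2}px  road fraction {ratio:.4f}  "
          f"imbalance ~1:{(1 - ratio) / ratio:.0f}")
```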

8.
Group decision making plays an important role in various fields of management decision and economics. In this paper, we develop two methods for hesitant fuzzy multiple criteria group decision making with group consensus, in which all the experts use hesitant fuzzy decision matrices (HFDMs) to express their preferences. The aim of this paper is to present two novel consensus models applied in different group decision making situations, which are composed of consensus checking processes, consensus-reaching processes, and selection processes. All the experts make their own judgments on each alternative over multiple criteria by hesitant fuzzy sets, and then the aggregation of each hesitant fuzzy set under each criterion is calculated by the aggregation operators. Furthermore, we can calculate the distance between any two aggregations of hesitant fuzzy sets, from which the deviation between any two experts is obtained. After introducing the consensus measure, we develop two kinds of consensus-reaching procedures and then propose two step-by-step algorithms for hesitant fuzzy multiple criteria group decision making. A numerical example concerning the selection of selling approaches for ‘Trade-Ins’ at Apple Inc. is provided to illustrate and verify the developed approaches. In this example, the methods, which aim to reach a high consensus among all the experts before the selection process, can prevent some experts’ preference values from being too high or too low. After modifying the previous preference information using our consensus measures, the result of the selection process is much more reasonable.
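The distance-then-consensus step can be sketched as follows (my own illustration, not the paper's specific operators or consensus-reaching loop): a standard normalised Hamming distance between two hesitant fuzzy elements, with the shorter element extended by repeating its maximum value, and a simple consensus degree built from the pairwise deviations. The expert evaluations are made up.

```python
# Hedged sketch: normalised Hamming distance between hesitant fuzzy elements
# (sets of membership values) and a simple consensus degree built from it.
def hfe_distance(h1, h2):
    L = max(len(h1), len(h2))
    a = sorted(h1) + [max(h1)] * (L - len(h1))   # optimistic extension
    b = sorted(h2) + [max(h2)] * (L - len(h2))
    a, b = sorted(a), sorted(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / L

# Three experts' hesitant evaluations of one alternative under one criterion.
experts = [[0.6, 0.7], [0.5, 0.6, 0.8], [0.3, 0.4]]

pairs = [(i, j) for i in range(len(experts)) for j in range(i + 1, len(experts))]
avg_dev = sum(hfe_distance(experts[i], experts[j]) for i, j in pairs) / len(pairs)
consensus = 1 - avg_dev
print(f"average deviation {avg_dev:.3f}, consensus degree {consensus:.3f}")
```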

9.
Classification of imbalanced data sets is one of the hot research topics in machine learning. In recent years, researchers have proposed many theories and algorithms to improve the performance of traditional classification techniques on imbalanced data sets; among them, using a threshold-determination criterion to set the threshold of a neural network is an important approach. Commonly used threshold-determination criteria have certain drawbacks, such as being unable to maximize the classification accuracy of the minority and majority classes at the same time, or being overly biased towards the accuracy of the majority class. This paper therefore proposes a new threshold-determination criterion under which the classification accuracy of the minority and majority classes can be optimized simultaneously, unaffected by the class proportions of the samples. A classifier is trained by combining a neural network with a genetic algorithm, with the new criterion serving both as the threshold-selection condition and as the evaluation standard of the classifier, and good results are obtained.
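To illustrate the general idea of a threshold-determination criterion (not the paper's specific criterion, which the abstract does not fully specify, and without its GA training), the sketch below picks the output threshold that maximises the geometric mean of the per-class accuracies, so neither class dominates. It assumes scikit-learn and uses synthetic data.

```python
# Sketch of threshold selection on a classifier's scores: choose the cut-off
# that maximises the G-mean of minority and majority accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_val)[:, 1]

best_t, best_gmean = 0.5, -1.0
for t in np.linspace(0.01, 0.99, 99):
    pred = (scores >= t).astype(int)
    tpr = (pred[y_val == 1] == 1).mean()   # minority-class accuracy
    tnr = (pred[y_val == 0] == 0).mean()   # majority-class accuracy
    g = np.sqrt(tpr * tnr)
    if g > best_gmean:
        best_t, best_gmean = t, g

print(f"chosen threshold {best_t:.2f}, G-mean {best_gmean:.3f}")
```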

10.
We suggest an optimization approach based on cluster-based undersampling to select appropriate instances. This approach can solve the data imbalance problem and thereby support knowledge extraction that improves the performance of existing data mining techniques. Although data mining techniques, among various big data analytics technologies, have been successfully applied and proven in terms of classification performance in domains such as marketing, accounting, and finance, the data imbalance problem has been regarded as one of the most important issues to be considered. We examined the effectiveness of a hybrid method using a clustering technique and genetic algorithms based on an artificial neural network model to balance the proportion between the minority class and the majority class. The objective of this paper is to construct the training dataset best suited both to decreasing data imbalance and to improving classification accuracy. We extract a properly balanced dataset composed of optimal or near-optimal instances for the artificial neural network model. The main contribution of the proposed method is that we extract explorative knowledge based on recognition of the data structure and categorize instances through the clustering technique while performing simultaneous optimization for the artificial neural network modeling. In addition, the rule-format knowledge representation makes it easy to understand why particular instances are selected, increasing the expressive power of the instance-selection criteria. The proposed method is successfully applied to the bankruptcy prediction problem using financial data in which the proportion of small- and medium-sized bankrupt firms in the manufacturing industry is extremely small compared to that of non-bankrupt firms.
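A minimal sketch of the cluster-based undersampling idea (my own illustration): cluster the majority class with k-means and keep only the instance nearest each centroid, so the reduced majority set matches the minority size. The paper's GA-driven instance optimisation and ANN model are not reproduced; scikit-learn and synthetic data are assumed.

```python
# Cluster-based undersampling: one k-means cluster per kept majority instance.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_maj, X_min = X[y == 0], X[y == 1]

k = len(X_min)                                   # one cluster per kept instance
km = KMeans(n_clusters=k, n_init=5, random_state=0).fit(X_maj)

# For each cluster, keep the majority instance closest to the centroid.
kept = []
for c in range(k):
    members = np.where(km.labels_ == c)[0]
    d = np.linalg.norm(X_maj[members] - km.cluster_centers_[c], axis=1)
    kept.append(members[d.argmin()])

X_bal = np.vstack([X_maj[kept], X_min])
y_bal = np.concatenate([np.zeros(len(kept)), np.ones(len(X_min))])
print("balanced training set:", X_bal.shape, "per class:", len(kept), len(X_min))
```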

11.
In this paper, a novel inverse random under-sampling (IRUS) method is proposed for the class imbalance problem. The main idea is to severely under-sample the majority class, thus creating a large number of distinct training sets. For each training set we then find a decision boundary which separates the minority class from the majority class. By combining the multiple designs through fusion, we construct a composite boundary between the majority class and the minority class. The proposed methodology is applied to 22 UCI data sets, and experimental results indicate a significant increase in performance when compared with many existing class-imbalance learning methods. We also present promising results for multi-label classification, a challenging research problem in many modern applications such as music, text and image categorization.
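A hedged sketch of the IRUS idea: draw many small random subsets of the majority class (each smaller than the minority class), train one classifier per subset, and fuse the ensemble by averaging scores. The paper's exact fusion rule and base learners may differ; scikit-learn and synthetic data are assumed.

```python
# Inverse random under-sampling sketch: many severely under-sampled majority
# subsets, one classifier each, fused by score averaging.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rng = np.random.default_rng(0)
maj_idx, min_idx = np.where(y_tr == 0)[0], np.where(y_tr == 1)[0]
subset_size = len(min_idx) // 2        # majority subsets smaller than minority

models = []
for _ in range(30):                    # many distinct training sets
    sub = rng.choice(maj_idx, subset_size, replace=False)
    idx = np.concatenate([sub, min_idx])
    models.append(LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx]))

fused = np.mean([m.predict_proba(X_te)[:, 1] for m in models], axis=0)
print("ensemble AUC:", round(roc_auc_score(y_te, fused), 3))
```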

12.
Streamline computation in a very large vector field data set represents a significant challenge due to the nonlocal and data-dependent nature of streamline integration. In this paper, we conduct a study of the performance characteristics of hybrid parallel programming and execution as applied to streamline integration on a large, multicore platform. With multicore processors now prevalent in clusters and supercomputers, there is a need to understand the impact of these hybrid systems in order to make the best implementation choice. We use two MPI-based distribution approaches based on established parallelization paradigms, parallelize over seeds and parallelize over blocks, and present a novel MPI-hybrid algorithm for each approach to compute streamlines. Our findings indicate that the work sharing between cores in the proposed MPI-hybrid parallel implementation results in much improved performance and consumes less communication and I/O bandwidth than a traditional, nonhybrid distributed implementation.

13.
Relational models are the most common representation of structured data, and acyclic database theory is important in relational databases. In this paper, we propose a method for constructing a Bayesian network structure from the dependencies implied by multiple relational schemas. Based on acyclic database theory and its relationship with probabilistic networks, we construct the Bayesian network structure starting from the implied independence information instead of mining database instances. We first give a method to find the maximum harmoniousness subset of the multi-valued dependencies on an acyclic schema, so that as much conditional-independence information as possible is retained. Further, targeting multi-relational environments, we discuss the properties of join graphs of multiple 3NF database schemas, so that the dependencies between separate relational schemas can be obtained. In addition, given a cyclic join dependency, a transformation from cyclic to acyclic database schemas is proposed by finding a minimal acyclic augmentation. An applied example shows that the proposed methods are feasible.

14.
Context: Blocking bugs are bugs that prevent other bugs from being fixed. Previous studies show that blocking bugs take approximately two to three times longer to be fixed compared to non-blocking bugs. Objective: Automatically predicting blocking bugs early on, so that developers are aware of them, can help reduce their impact or avoid them altogether. However, a major challenge when predicting blocking bugs is that only a small proportion of bugs are blocking bugs, i.e., there is an unequal distribution between blocking and non-blocking bugs. For example, in Eclipse and OpenOffice, only 2.8% and 3.0% of bugs are blocking bugs, respectively. We refer to this as the class imbalance phenomenon. Method: In this paper, we propose ELBlocker to identify blocking bugs given training data. ELBlocker first randomly divides the training data into multiple disjoint sets and builds a classifier for each disjoint set. Next, it combines these multiple classifiers and automatically determines an appropriate imbalance decision boundary to differentiate blocking bugs from non-blocking bugs. With the imbalance decision boundary, a bug report will be classified as a blocking bug when its likelihood score is larger than the decision boundary, even if its likelihood score is low. Results: To examine the benefits of ELBlocker, we perform experiments on 6 large open source projects – namely Freedesktop, Chromium, Mozilla, Netbeans, OpenOffice, and Eclipse – containing a total of 402,962 bugs. We find that ELBlocker achieves F1 and EffectivenessRatio@20% scores of up to 0.482 and 0.831, respectively. On average across the 6 projects, ELBlocker improves the F1 and EffectivenessRatio@20% scores over the state-of-the-art method proposed by Garcia and Shihab by 14.69% and 8.99%, respectively. Statistical tests show that the improvements are significant and the effect sizes are large. Conclusion: ELBlocker can help deal with the class imbalance phenomenon and improve the prediction of blocking bugs. ELBlocker achieves a substantial and statistically significant improvement over the state-of-the-art methods, i.e., Garcia and Shihab’s method, SMOTE, OSS, and Bagging.
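A sketch of the ELBlocker-style recipe described above (not the authors' implementation): split the training data into disjoint subsets, build one classifier per subset, average their likelihood scores, and tune an imbalance-aware decision boundary, here by maximising F1 on held-out data. Feature extraction from bug reports and the exact boundary search are omitted; scikit-learn and synthetic data are assumed.

```python
# Disjoint-partition ensemble + tuned imbalance decision boundary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, weights=[0.97, 0.03], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

n_parts = 5
parts = np.array_split(np.random.default_rng(0).permutation(len(y_tr)), n_parts)
models = [RandomForestClassifier(n_estimators=100, random_state=i)
          .fit(X_tr[p], y_tr[p]) for i, p in enumerate(parts)]

# Average the ensemble's likelihood scores on validation data.
scores = np.mean([m.predict_proba(X_val)[:, 1] for m in models], axis=0)

# Imbalance decision boundary: typically well below 0.5 for rare classes.
thresholds = np.linspace(0.05, 0.95, 19)
best = max(thresholds, key=lambda t: f1_score(y_val, scores >= t))
print("chosen boundary:", round(best, 2),
      "F1:", round(f1_score(y_val, scores >= best), 3))
```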

15.
It is well known that software defect prediction is one of the most important tasks for software quality improvement. The use of defect predictors allows test engineers to focus on defective modules, so that testing resources can be allocated effectively and quality assurance costs reduced. For within-project defect prediction (WPDP), there should be sufficient data within a company to train a prediction model. Without such local data, cross-project defect prediction (CPDP) is feasible since it uses data collected from similar projects in other companies. Software defect datasets have the class imbalance problem, which increases the difficulty for the learner to predict defects. In addition, the impact of imbalanced data on the real performance of models can be hidden by the performance measures chosen. We investigate whether class imbalance learning can be beneficial for CPDP. In our approach, the asymmetric misclassification cost and the similarity weights obtained from distributional characteristics are closely associated to guide the appropriate resampling mechanism. We performed the A-statistic effect size test to evaluate the magnitude of the improvement, and used the Wilcoxon rank-sum test for statistical significance. The experimental results show that our approach can provide higher prediction performance than both the existing CPDP technique and the existing class imbalance technique.

16.
Classifying non-stationary and imbalanced data streams encompasses two important challenges, namely concept drift and class imbalance. Concept drift refers to changes in the underlying function being learnt, and class imbalance is a vast difference between the numbers of instances in different classes of the data. Class imbalance is an obstacle for the efficiency of most classifiers. Previous methods for classifying non-stationary and imbalanced data streams mainly focus on batch solutions, in which the classification model is trained using a chunk of data. Here, we propose two online classifiers. The classifiers are one-layer neural networks. In the proposed classifiers, class imbalance is handled with two separate cost-sensitive strategies: the first incorporates a fixed misclassification cost matrix, and the second an adaptive one. The proposed classifiers are evaluated on 3 synthetic and 8 real-world datasets. The results show statistically significant improvements in imbalanced-data metrics.
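A sketch of an online, cost-sensitive one-layer classifier (my own illustration of the fixed-cost variant; the adaptive variant would update the costs as the stream evolves): logistic regression trained one instance at a time, with the gradient scaled by a class-dependent misclassification cost. The stream and costs are made up.

```python
# Online cost-sensitive one-layer classifier (prequential evaluation).
import numpy as np

rng = np.random.default_rng(0)
d = 10
w, b = np.zeros(d), 0.0
lr = 0.1
cost = {0: 1.0, 1: 10.0}        # penalise errors on the rare class more

def stream(n):
    """Synthetic imbalanced stream: ~5% positives."""
    for _ in range(n):
        y = int(rng.random() < 0.05)
        x = rng.normal(loc=1.5 * y, size=d)
        yield x, y

correct = {0: 0, 1: 0}; total = {0: 0, 1: 0}
for x, y in stream(20000):
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))      # test-then-train
    total[y] += 1; correct[y] += int((p >= 0.5) == y)
    grad = cost[y] * (p - y)                    # cost-weighted gradient
    w -= lr * grad * x
    b -= lr * grad

print({c: round(correct[c] / total[c], 3) for c in (0, 1)})  # per-class accuracy
```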

17.
Many approaches have been described for the parallel loop scheduling problem on shared-memory systems, but little work has been done on the data-dependent loop scheduling problem (nested loops with loop-carried dependencies). In this paper, we propose a general model for the data-dependent loop scheduling problem on distributed- as well as shared-memory systems. In order to achieve load balancing with low runtime scheduling and communication overhead, our model is based on a loop task graph and the notion of critical path. In addition, we develop a heuristic algorithm based on our model and on genetic algorithms to test the reliability of the model. We test our approach on different scenarios and benchmarks. The results are very encouraging and suggest a future parallel compiler implementation based on our model.
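The critical-path notion used by such a model can be illustrated as the longest path through a loop task graph, computed in topological order. The sketch below uses Python's standard-library graphlib (3.9+); the task costs and dependencies are hypothetical and the GA-based scheduler is not shown.

```python
# Critical-path length of a (hypothetical) loop task graph.
from graphlib import TopologicalSorter

cost = {"A": 3, "B": 2, "C": 4, "D": 1, "E": 5}
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["D"]}

order = list(TopologicalSorter(preds).static_order())

finish = {}
for t in order:
    start = max((finish[p] for p in preds[t]), default=0)
    finish[t] = start + cost[t]

critical_length = max(finish.values())
print("critical path length:", critical_length)   # 3 + 4 + 1 + 5 = 13
```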

18.
The performance and development review (PADR) evaluation in a company is a complex group decision-making problem that is influenced by multiple and conflicting objectives. The complexity of the PADR evaluation problem is often due to the difficulty of determining the degrees to which an alternative satisfies the criteria. In this paper, we present hesitant fuzzy multiple criteria group decision-making methods for PADR evaluation. We first develop some operations based on the Einstein operations. Then, we propose some aggregation operators to aggregate hesitant fuzzy elements, and the relationships between our proposed operators and existing ones are discussed in detail. Furthermore, the procedure of multicriteria group decision making based on the proposed operators is given in a hesitant fuzzy environment. Finally, a practical example of PADR evaluation in a company is provided to illustrate the developed method.
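For reference, the Einstein sum and product extended to hesitant fuzzy elements (applied pairwise to all combinations of membership values) look like the sketch below. This is a generic illustration of the kind of operation the paper's aggregation operators are built from; the paper's exact weighted operators are not reproduced.

```python
# Einstein sum and product on hesitant fuzzy elements (sets of membership
# values in [0, 1]), applied to all pairs of values from the two elements.
from itertools import product

def einstein_sum(h1, h2):
    return sorted({round((a + b) / (1 + a * b), 4) for a, b in product(h1, h2)})

def einstein_product(h1, h2):
    return sorted({round((a * b) / (1 + (1 - a) * (1 - b)), 4)
                   for a, b in product(h1, h2)})

h1, h2 = [0.3, 0.5], [0.4, 0.6, 0.7]
print("h1 (+) h2 =", einstein_sum(h1, h2))
print("h1 (x) h2 =", einstein_product(h1, h2))
```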

19.
We present methods to store and access templates of data arrays in parallel processors with shuffle-exchange-type interconnection networks. For this purpose, we define the class of composite linear permutations. In our method, each element of the data array is stored in the memory module determined by applying a suitable composite linear permutation on its indices. Simple necessary and sufficient criteria to avoid memory conflicts in the access of important templates such as row, column, main diagonal, and square block are given based on the composite linear permutation involved. The criteria so derived also specify the set of permutations to be realized by an interconnection network to avoid network conflicts. In particular, we give the criteria to be satisfied by a scheme of the proposed class to avoid network conflicts during the access of templates, when shuffle-exchange-type networks are used. Almost all the previously known scrambled storage methods are special cases in the class of storage methods presented in this paper.

20.
Preference analysis is an important task in multi-criteria decision making. Rough set theory has been successfully extended to deal with preference analysis by replacing equivalence relations with dominance relations. Existing studies involving preference relations cannot capture the uncertainty present in numerical and fuzzy criteria. In this paper, we introduce a method to extract fuzzy preference relations from samples characterized by numerical criteria. The fuzzy preference relations are incorporated into a fuzzy rough set model, which leads to a fuzzy-preference-based rough set model. The measure of attribute dependency in Pawlak’s rough set model is generalized to compute the relevance between criteria and decisions. The definitions of upward dependency, downward dependency and global dependency are introduced. Algorithms for computing attribute dependency and reducts are proposed and experimentally evaluated on two publicly available data sets.
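One common way to extract a fuzzy preference relation from a numerical criterion is a logistic function of the value difference between two samples, as sketched below (my own illustration; the rough-set approximations, dependency measures and reduct algorithms built on top of such relations are not shown, and the steepness parameter k is a made-up value).

```python
# Extracting an upward fuzzy preference relation from one numerical criterion:
# R_up[i, j] is the degree to which sample i is preferred to sample j.
import numpy as np

values = np.array([0.2, 0.5, 0.55, 0.9])     # one numerical criterion
k = 8.0                                      # steepness of the preference

diff = values[:, None] - values[None, :]
R_up = 1.0 / (1.0 + np.exp(-k * diff))       # upward fuzzy preference relation

np.set_printoptions(precision=2, suppress=True)
print(R_up)      # ~1 when values[i] >> values[j], ~0.5 when they are equal
```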
