Similar Literature
Found 20 similar documents (search time: 15 ms)
1.
Contemporary biological technologies produce extremely high-dimensional data sets from which to design classifiers, with 20,000 or more potential features being commonplace. In addition, sample sizes tend to be small. In such settings, feature selection is an inevitable part of classifier design. Heretofore, there have been a number of comparative studies of feature selection, but they have either considered settings with much smaller dimensionality than those occurring in current bioinformatics applications or constrained their study to a few real data sets. This study compares some basic feature-selection methods in settings involving thousands of features, using both model-based synthetic data and real data. It defines distribution models involving different numbers of markers (useful features) versus non-markers (useless features) and different kinds of relations among the features. Under this framework, it evaluates the performance of feature-selection algorithms for different distribution models and classifiers. Both the classification error and the number of discovered markers are computed. Although the results clearly show that none of the considered feature-selection methods performs best across all scenarios, there are some general trends relative to sample size and relations among the features. For instance, the classifier-independent univariate filter methods show similar trends. Filter methods such as the t-test have better or similar performance compared with wrapper methods on harder problems, although this improved performance is usually accompanied by significant peaking. Wrapper methods perform better when the sample size is sufficiently large. ReliefF, the classifier-independent multivariate filter method, performs worse than the univariate filter methods in most cases; however, ReliefF-based wrapper methods show performance similar to their t-test-based counterparts.
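For illustration, here is a minimal sketch of the univariate t-test filter discussed above, assuming two-class data; the function name, parameters, and the synthetic marker/non-marker split are illustrative stand-ins, not the study's actual protocol:

```python
# A minimal sketch of a univariate t-test filter, assuming two-class data;
# names and the synthetic setup are illustrative, not from the study.
import numpy as np
from scipy.stats import ttest_ind

def select_by_ttest(X, y, k=50):
    """Rank features by |t| between the two classes and keep the top k."""
    t, _ = ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=False)
    return np.argsort(-np.abs(t))[:k]          # indices of the k best features

# Synthetic data: the first 20 features are "markers" shifted between classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
y = rng.integers(0, 2, size=100)
X[y == 1, :20] += 1.0                          # useful features
print(select_by_ttest(X, y, k=20))
```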

2.
In fault detection systems, a massive amount of data gathered over the life-cycle of equipment is often used to learn models or classifiers that aim at diagnosing different kinds of errors or failures. Among this huge quantity of information, some features (or sets of features) are more correlated with one kind of failure than with another. The presence of irrelevant features can degrade the performance of the classifier, so to improve the performance of a detection system, feature selection is a key step. We propose in this paper an algorithm named STRASS, which aims at detecting relevant features for classification purposes. In certain cases, when there exists a strong correlation between some features and the associated class, conventional feature selection algorithms fail to select the most relevant features. To cope with this problem, the STRASS algorithm uses the k-way correlation between features and the class to select relevant features. To assess the performance of STRASS, we apply it to simulated data collected from the Tennessee Eastman chemical plant simulator. The Tennessee Eastman process (TEP) has been used in many fault detection studies, and three specific faults are not well discriminated by conventional algorithms. The results obtained by STRASS are compared to those obtained with reference feature selection algorithms. We show that the features selected by STRASS always improve the performance of a classifier compared to the whole set of original features, and that the resulting classification is better than with most of the other feature selection algorithms.
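The STRASS algorithm itself is not reproduced here; the sketch below only illustrates its motivating point — univariate relevance scores can miss features that are only jointly (k-way) informative about the class — using mutual information on an XOR-style class variable:

```python
# Illustrative sketch (not the STRASS algorithm): univariate relevance scores
# can miss features that are only jointly informative about the class.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(1)
x1 = rng.integers(0, 2, 2000)
x2 = rng.integers(0, 2, 2000)
y = x1 ^ x2                                   # class depends on the pair only
X = np.column_stack([x1, x2])

print(mutual_info_classif(X, y, discrete_features=True))  # ~0 per feature alone
print(mutual_info_score(x1 * 2 + x2, y))                  # joint (2-way) MI is high
```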

3.
To reduce the computational complexity of ensemble feature selection methods, a neural network ensemble classification method based on rough set reduction is proposed. The method first obtains stable attribute reducts with strong generalization ability through a dynamic reduction technique that combines genetic-algorithm-based reduct computation with resampling. Then, BP networks are designed on the different reducts as base classifiers for the ensemble, and, following the idea of selective ensembles, a search strategy is used to find the ensemble network with the best generalization performance. Finally, classification is performed by majority voting over the neural network ensemble. The method was validated in a classification experiment on Landsat 7-band remote sensing imagery of a certain region. Because rough set reduction filters out a large number of feature subsets with poor classification performance, compared with traditional ensemble feature selection methods, the method incurs little time overhead, has low computational complexity, and delivers satisfactory classification performance.

4.
To reduce the computational complexity of ensemble feature selection methods, a neural network ensemble classification method based on rough set reduction is proposed. The method first obtains stable attribute reducts with strong generalization ability through a dynamic reduction technique that combines genetic-algorithm-based reduct computation with resampling. Then, BP networks are designed on the different reducts as base classifiers for the ensemble, and, following the idea of selective ensembles, a search strategy is used to find the ensemble network with the best generalization performance. Finally, classification is performed by majority voting over the neural network ensemble. The method was validated in a classification experiment on Landsat 7-band remote sensing imagery of a certain region. Because rough set reduction filters out a large number of feature subsets with poor classification performance, compared with traditional ensemble feature selection methods, the method incurs little time overhead, has low computational complexity, and delivers satisfactory classification performance.
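A minimal sketch of the ensemble step described above, assuming the attribute reducts have already been computed; the reduct index sets, network sizes, and helper names below are illustrative:

```python
# Sketch: one BP network (MLP) per attribute reduct, combined by majority vote.
import numpy as np
from scipy.stats import mode
from sklearn.neural_network import MLPClassifier

def train_reduct_ensemble(X, y, reducts):
    """Train one base classifier per reduct (each sees only its own columns)."""
    return [MLPClassifier(hidden_layer_sizes=(16,), max_iter=500).fit(X[:, r], y)
            for r in reducts]

def predict_majority(models, reducts, X):
    """Combine base classifiers by majority voting."""
    votes = np.stack([m.predict(X[:, r]) for m, r in zip(models, reducts)])
    return mode(votes, axis=0, keepdims=False).mode

# Example: three hypothetical reducts over a 7-band feature vector.
reducts = [[0, 1, 3], [2, 4, 5], [1, 5, 6]]
# models = train_reduct_ensemble(X_train, y_train, reducts)
# y_pred = predict_majority(models, reducts, X_test)
```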

5.
In cluster analysis, one of the most challenging problems is determining the number of clusters in a data set, which is a basic input parameter for most clustering algorithms. To solve this problem, many algorithms have been proposed for either numerical or categorical data sets. However, these algorithms are not very effective for a mixed data set containing both numerical and categorical attributes. To overcome this deficiency, a generalized mechanism is presented in this paper by integrating Rényi entropy and complement entropy. The mechanism is able to uniformly characterize within-cluster entropy and between-cluster entropy and to identify the worst cluster in a mixed data set. In order to evaluate clustering results for mixed data, an effective cluster validity index is also defined in this paper. Furthermore, by introducing a new dissimilarity measure into the k-prototypes algorithm, we develop an algorithm to determine the number of clusters in a mixed data set. The performance of the algorithm has been studied on several synthetic and real-world data sets. Comparisons with other clustering algorithms show that the proposed algorithm is more effective in detecting the optimal number of clusters and generates better clustering results.
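For reference, here are small sketches of the two entropy measures the paper integrates; the formulas themselves are standard (Rényi entropy and complement entropy), while their combination into the validity index follows the paper and is not reproduced here:

```python
# Standard formulas only; how they are integrated follows the paper.
import numpy as np

def renyi_entropy(values, alpha=2.0):
    """Rényi entropy H_a = log(sum p^a) / (1 - a) of a discrete attribute."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return np.log((p ** alpha).sum()) / (1.0 - alpha)

def complement_entropy(values):
    """Complement entropy E = sum p * (1 - p); larger for more mixed clusters."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float((p * (1.0 - p)).sum())

print(renyi_entropy(["a", "a", "b", "c"]), complement_entropy(["a", "a", "b", "c"]))
```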

6.
Person re-identification addresses the task of retrieving a target pedestrian across images captured by different cameras, and is a highly challenging research topic in computer vision. Traditional person re-identification methods that rely on hand-crafted features perform poorly and lack robustness, and cannot keep up with an information age of explosively growing data. In recent years, with the emergence of large-scale pedestrian datasets and the rapid development of deep learning, person re-identification research has achieved many outstanding results. This paper reviews supervised learning approaches, whose performance is approaching saturation, and ...

7.
As the number of Weibo bot accounts continues to grow, their detection has become a hot topic in the data mining field. Most existing research on Weibo bot detection uses crawled data to train and validate models on small datasets in which bots and ordinary users are balanced, which is a limitation under realistic conditions where the sample distribution is imbalanced. Resampling is a common technique for classification on imbalanced datasets. To investigate the effect of resampling on supervised bot-detection algorithms, this paper, based on real data from the 微热点 (Micro Hotspot) data mining competition, proposes a Weibo bot detection framework that incorporates resampling, and comprehensively evaluates the classification performance of seven supervised learning algorithms on an imbalanced validation set using multiple evaluation metrics on top of five different sampling strategies. Experimental results show that models trained on small balanced samples suffer a substantial drop in Recall under realistic conditions, whereas the resampling-based framework greatly improves the bot detection rate: NearMiss undersampling markedly increases an algorithm's Recall, while ADASYN oversampling improves its G-mean. In general, attributes such as a Weibo user's posting time, posting region, and posting interval are important features for distinguishing normal users from bots. Resampling adjusts the feature distribution that the machine learning algorithms rely on, thereby achieving better predictive performance.
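A minimal sketch of the resampling step using the imbalanced-learn package; the feature matrix and bot ratio are placeholders, not the competition data:

```python
# Sketch of NearMiss undersampling and ADASYN oversampling with imbalanced-learn.
import numpy as np
from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import ADASYN
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

X = np.random.rand(5000, 20)                      # placeholder features
y = (np.random.rand(5000) < 0.05).astype(int)     # ~5% bots: imbalanced labels

X_nm, y_nm = NearMiss().fit_resample(X, y)        # undersample majority class
X_ad, y_ad = ADASYN().fit_resample(X, y)          # synthesize minority samples

clf = RandomForestClassifier().fit(X_nm, y_nm)
print(recall_score(y, clf.predict(X)))            # Recall on the imbalanced set
```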

8.
Feature selection is the basic pre-processing task of eliminating irrelevant or redundant features by investigating the complicated interactions among features in a feature set. Due to its critical role in classification and computational time, it has attracted researchers' attention for the last five decades. However, it still remains a challenge. This paper proposes a binary artificial bee colony (ABC) algorithm for feature selection problems, developed by integrating evolutionary-based similarity search mechanisms into an existing binary ABC variant. The performance of the proposed algorithm is demonstrated by comparing it with some well-known variants of the particle swarm optimization (PSO) and ABC algorithms, including standard binary PSO, new-velocity-based binary PSO, quantum-inspired binary PSO, discrete ABC, modification-rate-based ABC, angle-modulated ABC, and genetic algorithms, on 10 benchmark datasets. The results show that the proposed algorithm can obtain higher classification performance on both training and test sets, and can eliminate irrelevant and redundant features more effectively than the other approaches. Note that all the algorithms used in this paper, except for standard binary PSO and GA, are employed for the first time in feature selection.

9.
To address the limited ability of existing review-based recommendation algorithms to extract textual features and latent information, a deep learning recommendation algorithm based on an attention mechanism is proposed. Separate review-text representations are built for users and items; a bidirectional gated recurrent unit extracts contextual dependencies in the text to obtain textual feature representations, and an attention mechanism is introduced to capture user preferences and item attributes more accurately. The two sets of latent features generated from the user and item review data are each passed through fully connected layers, then merged into a common vector space for rating prediction, producing the recommendation result. Experiments on two public datasets, Yelp and Amazon, show that the proposed algorithm achieves better recommendation performance than competing algorithms.
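A minimal PyTorch sketch of the review encoder described above (a BiGRU with additive attention pooling, followed by a fully connected rating predictor); all layer sizes and names are illustrative assumptions:

```python
# Sketch: BiGRU over word embeddings + attention pooling for review text.
import torch
import torch.nn as nn

class ReviewEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.att = nn.Linear(2 * hidden, 1)        # additive attention scorer

    def forward(self, tokens):                     # tokens: (batch, seq_len)
        h, _ = self.gru(self.emb(tokens))          # (batch, seq_len, 2*hidden)
        w = torch.softmax(self.att(h), dim=1)      # attention weights over words
        return (w * h).sum(dim=1)                  # attended text representation

# Merge user/item text features in one vector space for rating prediction.
enc_u, enc_i = ReviewEncoder(10000), ReviewEncoder(10000)
predictor = nn.Linear(2 * 2 * 32, 1)
u = enc_u(torch.randint(0, 10000, (8, 50)))
i = enc_i(torch.randint(0, 10000, (8, 50)))
rating = predictor(torch.cat([u, i], dim=-1))     # (8, 1) predicted scores
```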

10.
The scientific research community has reached a stage of maturity where its strong need for high-performance computing has diffused into the everyday algorithms of engineering and industry. In efforts to satisfy this need, parallel computers provide an efficient and economical way to solve large-scale and/or time-constrained problems. As a consequence, the end-users of these systems have a vested interest in defining the asymptotic time complexity of parallel algorithms to predict their performance on a particular parallel computer. The asymptotic parallel time complexity of data-dependent algorithms depends on the number of processors, the data size, and other parameters. Discovering these other parameters is a challenging problem and the key to obtaining a good estimate of the performance order. Good examples of these types of applications are sorting algorithms, searching algorithms, and solvers of the traveling salesman problem (TSP). This article encompasses all the knowledge discovery aspects of the problem of defining the asymptotic parallel time complexity of data-dependent algorithms. The knowledge discovery methodology begins by designing a considerable number of experiments and measuring their execution times. Then, an interactive and iterative process explores the data in search of patterns and/or relationships, detecting parameters that affect performance. Once the key parameters characterising time complexity are known, it becomes possible to form hypotheses, restart the process, and produce a subsequently improved time complexity model. Finally, the methodology predicts the performance order for new data sets on a particular parallel computer by numerical instantiation of the model. As a case study, a global pruning traveling salesman problem implementation (GP-TSP) has been chosen to analyze the influence of indeterminism on performance prediction for data-dependent parallel algorithms, and also to show the usefulness of the defined knowledge discovery methodology. The hypotheses generated to define the asymptotic parallel time complexity of the TSP were corroborated one by one. The experimental results confirm the expected capability of the proposed methodology; the predicted performance time orders agreed rather well with real execution times (on the order of 85%).
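A sketch of the model-fitting step of such a methodology: hypothesize a parametric complexity form and fit it to measured execution times. The form c·n^a/p below is an illustrative hypothesis, not the paper's TSP model:

```python
# Sketch: fit a hypothesized parallel time-complexity model to measurements.
import numpy as np
from scipy.optimize import curve_fit

def model(np_pair, c, a):
    n, p = np_pair
    return c * n ** a / p                      # hypothesized time complexity

n = np.array([10, 12, 14, 10, 12, 14], float)  # problem sizes
p = np.array([1, 1, 1, 4, 4, 4], float)        # number of processors
t = 0.002 * n ** 2.1 / p                       # measured times (synthetic here)

(c, a), _ = curve_fit(model, (n, p), t, p0=(0.01, 2.0))
print(f"T(n, p) ~ {c:.4f} * n^{a:.2f} / p")    # recovered performance order
```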

11.
Support vector machines for spam categorization
We study the use of support vector machines (SVM) in classifying e-mail as spam or nonspam by comparing it to three other classification algorithms: Ripper, Rocchio, and boosting decision trees. These four algorithms were tested on two different data sets: one in which the number of features was constrained to the 1000 best features, and another in which the dimensionality was over 7000. SVM performed best when using binary features. For both data sets, boosting trees and SVM had acceptable test performance in terms of accuracy and speed. However, SVM had significantly less training time.
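A minimal sketch of the binary-feature setup, with a toy corpus as a placeholder and scikit-learn's LinearSVC standing in for the SVM implementation used in the study:

```python
# Sketch: binary word-presence features + a linear SVM for spam filtering.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

emails = ["win money now", "meeting at noon", "cheap money win", "lunch today?"]
labels = [1, 0, 1, 0]                              # 1 = spam, 0 = nonspam

vec = CountVectorizer(binary=True)                 # binary features, not counts
X = vec.fit_transform(emails)
clf = LinearSVC().fit(X, labels)
print(clf.predict(vec.transform(["free money"])))  # expected to flag as spam
```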

12.
The ability to predict a student's performance could be useful in a great number of ways associated with university-level distance learning. Students' key demographic characteristics and their marks on a few written assignments can constitute the training set for a supervised machine learning algorithm. The learning algorithm could then predict the performance of new students, thus becoming a useful tool for identifying likely poor performers. The scope of this work is to compare some state-of-the-art learning algorithms. Two experiments were conducted with six algorithms, trained using data sets provided by the Hellenic Open University. Among other significant conclusions, it was found that the Naïve Bayes algorithm is the most appropriate for the construction of a software support tool: it has more than satisfactory accuracy, its overall sensitivity is extremely satisfactory, and it is the easiest algorithm to implement.
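A minimal sketch of the prediction task, assuming demographics and assignment marks are already encoded numerically; the data below is synthetic, not the Hellenic Open University records:

```python
# Sketch: Naive Bayes on demographic features + assignment marks.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = np.column_stack([
    rng.integers(20, 60, 200),                # age
    rng.integers(0, 2, 200),                  # gender (encoded)
    rng.uniform(0, 10, (200, 2)),             # marks on two written assignments
])
y = (X[:, 2] + X[:, 3] > 10).astype(int)      # pass/fail (synthetic rule)

print(cross_val_score(GaussianNB(), X, y, cv=5).mean())  # predictive accuracy
```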

13.
Efficient mining of association rules in distributed databases
Many sequential algorithms have been proposed for the mining of association rules. However, very little work has been done on mining association rules in distributed databases. A direct application of sequential algorithms to distributed databases is not effective, because it requires a large amount of communication overhead. In this study, an efficient algorithm called DMA (Distributed Mining of Association rules) is proposed. It generates a small number of candidate sets and requires only O(n) messages for support-count exchange per candidate set, where n is the number of sites in a distributed database. The algorithm has been implemented on an experimental testbed, and its performance has been studied. The results show that DMA has superior performance when compared with the direct application of a popular sequential algorithm to distributed databases.
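A toy sketch of the support-count exchange idea: each site reports one local count per candidate itemset (O(n) messages for n sites), and global frequency is decided from the sum. This illustrates the message pattern only, not the full DMA algorithm:

```python
# Sketch: O(n) support-count exchange across n sites for one candidate itemset.
def local_support(transactions, itemset):
    """Count transactions at one site that contain the candidate itemset."""
    return sum(itemset <= t for t in transactions)

sites = [                                        # n = 3 local databases
    [{"a", "b"}, {"a", "c"}, {"b", "c"}],
    [{"a", "b", "c"}, {"a", "b"}],
    [{"b", "c"}, {"a", "b"}],
]
candidate = {"a", "b"}
counts = [local_support(db, candidate) for db in sites]    # one message per site
total = sum(counts)
print(total, total >= 0.5 * sum(len(db) for db in sites))  # global frequency test
```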

14.
Recent advancements in microarray technology permit simultaneous monitoring of the expression levels of a large set of genes across a number of time points. Extracting knowledge from such a huge volume of microarray gene expression data requires computational analysis. Clustering is one of the important data mining tools for analyzing such microarray data and grouping similar genes into clusters, and researchers have proposed a number of clustering algorithms for this purpose. In this article, an attempt has been made to improve the performance of fuzzy clustering by combining it with a support vector machine (SVM) classifier. A recently proposed real-coded variable-string-length genetic-algorithm-based clustering technique and an iterated version of fuzzy C-means clustering have been utilized for this purpose. The performance of the proposed clustering scheme has been compared with that of some well-known existing clustering algorithms and their SVM-boosted versions on one simulated and six real-life gene expression data sets. A statistical significance test based on analysis of variance (ANOVA), followed by a posteriori Tukey-Kramer multiple comparison test, has been conducted to establish the statistical significance of the superior performance of the proposed clustering scheme. Moreover, the biological significance of the clustering solutions has been established.

15.
Given a large set of potential features, it is usually necessary to find a small subset with which to classify. The task of finding an optimal feature set is inherently combinatorial, and therefore suboptimal algorithms are typically used to find feature sets. If feature selection is based directly on classification error, then a feature-selection algorithm must base its decisions on error estimates. This paper addresses the impact of error estimation on feature selection using two performance measures: comparison of the true error of the optimal feature set with the true error of the feature set found by a feature-selection algorithm, and the number of features among the truly optimal feature set that appear in the feature set found by the algorithm. The study considers seven error estimators applied to three standard suboptimal feature-selection algorithms and exhaustive search, and it considers three different feature-label model distributions. It draws two conclusions for the cases considered: (1) depending on the sample size and the classification rule, feature-selection algorithms can produce feature sets whose corresponding classifiers possess errors far in excess of the classifier corresponding to the optimal feature set; and (2) for small samples, differences in performance among the feature-selection algorithms are less significant than the performance differences among the error estimators used to implement the algorithms. Moreover, keeping in mind that results depend on the particular classifier-distribution pair, for the error estimators considered in this study, bootstrap and bolstered resubstitution usually outperform cross-validation, and bolstered resubstitution usually performs as well as or better than bootstrap.
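For reference, here is a sketch of one estimator family compared above, the 0.632 bootstrap, which blends resubstitution error with out-of-bag bootstrap error; the classifier choice and number of replicates are illustrative:

```python
# Sketch of the 0.632 bootstrap error estimator.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

def bootstrap_632(X, y, clf, B=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    resub = 1.0 - clf.fit(X, y).score(X, y)            # resubstitution error
    oob_errs = []
    for _ in range(B):
        idx = rng.integers(0, n, n)                    # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)          # out-of-bag points
        if oob.size and len(np.unique(y[idx])) > 1:
            oob_errs.append(1.0 - clf.fit(X[idx], y[idx]).score(X[oob], y[oob]))
    return 0.368 * resub + 0.632 * np.mean(oob_errs)

# Example on synthetic two-class Gaussian data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(1, 1, (30, 5))])
y = np.repeat([0, 1], 30)
print(bootstrap_632(X, y, LDA()))
```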

16.
In recent years, methods of feature selection have been increasingly emphasized as venues for reducing cost and shortening computation time in data mining. This study utilizes the electromagnetism-like mechanism (EM) as a wrapper approach to feature selection. Birbil and Fang proposed EM in 2003; it uses the attraction-repulsion mechanism of electromagnetism theory to ascertain the optimal solution. Although EM has been applied to optimization in continuous space and, in a small number of studies, to discrete problems, it has not been applied to feature selection. In this study, EM combined with the 1-nearest-neighbor (1NN) classifier was applied to feature selection and classification. The total force exerted on a particle was evaluated to determine which features are to be selected, and the most crucial features were selected according to the minimum misclassification rate attained by 1NN. An unknown datum is then classified by 1NN based on the chosen reduced model. To estimate the effectiveness of the proposed method, a numerical experiment was conducted using several data sets with diverse sizes, features, separability, and classes. Experimental results indicated that the proposed method outperformed other well-known algorithms in not only balanced classification accuracy but also efficiency of feature selection. Lastly, an actual case concerning gestational diabetes mellitus was used to demonstrate the workability of the proposed method.
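A sketch of the wrapper objective described above: a candidate solution is a binary feature mask scored by the 1NN misclassification rate. The EM search loop itself is omitted; only the evaluation step is shown:

```python
# Sketch: wrapper evaluation of a binary feature mask with a 1-NN classifier.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def mask_error(X, y, mask):
    """Misclassification rate of 1-NN on the selected feature subset."""
    if not mask.any():
        return 1.0                                   # empty subset: worst score
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=1),
                          X[:, mask.astype(bool)], y, cv=5).mean()
    return 1.0 - acc

X = np.random.rand(100, 8)
y = (X[:, 0] > 0.5).astype(int)
print(mask_error(X, y, np.array([1, 0, 0, 0, 0, 0, 0, 0])))
```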

17.
田丹, 臧守雨, 涂斌斌. 《图学学报》 (Journal of Graphics), 2021, 42(5): 755-761
Due to frequent occlusion, scale variation, boundary effects, and other factors, object tracking often fails to achieve the expected results; moreover, traditional feature extraction strategies can also hurt tracking robustness. To address these problems, a correlation tracking algorithm with spatial adjustment and sparse constraints is proposed. Traditional features are effectively fused with deep features to adapt to changes in target appearance. The peak-to-sidelobe ratio is used to decide whether the target is occluded during tracking: if occlusion occurs, a sparse regularization constraint is applied to the filter to improve the model's robustness to occlusion; if not, the filter coefficients are penalized through Gaussian spatial adjustment to suppress boundary effects. Experiments were conducted on five standard video sequences from the OTB dataset covering challenging factors such as severe occlusion and scale variation, comparing the tracking results of the algorithm qualitatively and quantitatively with four state-of-the-art algorithms. The qualitative analysis compares results with respect to the main challenge factors of each video sequence, and the quantitative analysis compares tracker performance by center location error and overlap rate. The experimental results show that the algorithm is more robust to the above challenge factors.
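For context, here is a minimal single-template correlation filter (MOSSE-style) together with the peak-to-sidelobe ratio used as an occlusion test. This is a generic sketch, not the paper's algorithm; the regularizer and the sidelobe exclusion window are illustrative choices:

```python
# Generic correlation-filter sketch in the Fourier domain, with PSR check.
import numpy as np

def train_filter(patch, target, lam=1e-2):
    """Solve for the filter H* with G = F . H* (ridge-regularized)."""
    F, G = np.fft.fft2(patch), np.fft.fft2(target)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def respond(H, patch):
    """Correlation response map for a new image patch."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(patch)))

def psr(resp, exclude=5):
    """Peak-to-sidelobe ratio: low values suggest occlusion."""
    peak = resp.max()
    py, px = np.unravel_index(resp.argmax(), resp.shape)
    side = resp.copy()
    side[max(0, py - exclude):py + exclude, max(0, px - exclude):px + exclude] = np.nan
    return (peak - np.nanmean(side)) / np.nanstd(side)

patch = np.random.rand(64, 64)
target = np.zeros((64, 64)); target[32, 32] = 1.0   # desired response peak
H = train_filter(patch, target)
print(psr(respond(H, patch)))
```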

18.
Extreme learning machine (ELM) has been an important research topic over the last decade due to its high efficiency, easy implementation, unification of classification and regression, and unification of binary and multi-class learning tasks. Despite integrating these advantages, existing ELM algorithms cannot directly handle the case where some features of the samples are missing or unobserved, which is very common in practical applications. The work in this paper fills this gap by proposing an absent ELM (A-ELM) algorithm to address the above issue. Observing that some structural characteristics of a portion of packed malware instances hold unreasonable values, we cast the packed-executable identification task as an absence learning problem, which can be efficiently addressed via the proposed A-ELM algorithm. Extensive experiments have been conducted on six UCI data sets and a packed data set to evaluate the performance of the proposed algorithm. As indicated, the proposed A-ELM algorithm is superior to other imputation algorithms and existing state-of-the-art ones.
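For background, here is a sketch of a standard ELM (not the absent-feature A-ELM variant): a random hidden layer with output weights solved in closed form via the pseudoinverse:

```python
# Sketch of a basic extreme learning machine: random features + least squares.
import numpy as np

class ELM:
    def __init__(self, n_hidden=100, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, T):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)            # random hidden-layer map
        self.beta = np.linalg.pinv(H) @ T           # closed-form output weights
        return self

    def predict(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta

# Usage: elm = ELM().fit(X_train, T_onehot); scores = elm.predict(X_test)
```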

19.
A multilevel relaxation algorithm for simultaneous localization and mapping
This paper addresses the problem of simultaneous localization and mapping (SLAM) by a mobile robot. An incremental SLAM algorithm is introduced that is derived from multigrid methods used for solving partial differential equations. The approach improves on the performance of previous relaxation methods for robot mapping, because it optimizes the map at multiple levels of resolution. The resulting algorithm has an update time that is linear in the number of estimated features for typical indoor environments, even when closing very large loops, and offers advantages in handling nonlinearities compared with other SLAM algorithms. Experimental comparisons with alternative algorithms using two well-known data sets and mapping results on a real robot are also presented.

20.
The leading partitional clustering technique, k-modes, is one of the most computationally efficient clustering methods for categorical data. However, in k-modes-type algorithms, clustering performance depends on the initial cluster centers, and the number of clusters needs to be known or given in advance. This paper proposes a novel initialization method for categorical data that can be applied to k-modes-type algorithms. The proposed method can not only obtain good initial cluster centers but also provides a criterion for finding candidates for the number of clusters. The performance and scalability of the proposed method have been studied on real data sets. The experimental results illustrate that the proposed method is effective and can be applied to large data sets owing to its linear time complexity with respect to the number of data points.
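A usage sketch with the kmodes package, whose 'Cao' option implements a density-based initialization for categorical data in the spirit of such methods; the package is assumed to be installed (pip install kmodes), and the toy data is illustrative:

```python
# Sketch: k-modes clustering of categorical data with a non-random initializer.
import numpy as np
from kmodes.kmodes import KModes

X = np.array([["a", "x"], ["a", "y"], ["b", "x"], ["b", "y"], ["a", "x"]])
km = KModes(n_clusters=2, init="Cao", n_init=1)   # density-based seeding
labels = km.fit_predict(X)
print(labels, km.cluster_centroids_)              # assignments and modes
```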
