Similar Documents
20 similar documents found (search time: 15 ms)
1.
Associative classification achieves high classification accuracy and strong adaptability, but because the classifier is built from a set of high-confidence rules, it sometimes overfits. This paper proposes Associative Classification based on rule Interestingness (ACIR). It extends the TD-FP-growth algorithm to mine the training set efficiently and generate interesting rules satisfying minimum support and minimum confidence, then selects a small rule set by pruning to build the classifier. During rule pruning, rule interestingness is used to evaluate rule quality, jointly considering a rule's predictive accuracy and the interestingness of the items it contains. Experimental results show that the method outperforms See5, CBA, and CMAR in classification accuracy, and offers good interpretability and scalability.

2.
Associative classification achieves high classification accuracy and strong scalability, but because the classifier is built from high-confidence rules, it can overfit. This approach therefore starts from the frequent itemsets mined with FP-growth and computes the minimum difference degree between each frequent itemset and a test instance, i.e. how well a classification rule matches the test data, then assigns the class label with the smallest minimum difference to the instance. Experimental results show that the algorithm is more accurate than earlier algorithms such as CBA (Classification-Based Association), CMAR (Classification based on Multiple Association Rules), and CPAR (Classification based on Predictive Association Rules). Its drawback is that this accuracy gain comes at the cost of a very large matrix for storing the frequent itemsets, which incurs considerable system overhead.
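The minimum-difference matching step described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the names (`difference`, `classify`) and the representation of itemsets and instances as Python sets are assumptions.

```python
# Illustrative sketch of minimum-difference matching: for each class, find how
# far its closest frequent itemset (mined, e.g., with FP-growth) is from the
# test instance, and assign the class with the smallest gap.

def difference(itemset, instance):
    """Number of items in the rule antecedent that the test instance lacks."""
    return len(itemset - instance)

def classify(frequent_items_by_class, instance):
    """Assign the class label whose frequent itemsets differ least."""
    best_label, best_diff = None, float("inf")
    for label, itemsets in frequent_items_by_class.items():
        d = min(difference(s, instance) for s in itemsets)
        if d < best_diff:
            best_label, best_diff = label, d
    return best_label

frequent = {
    "yes": [{"a", "b"}, {"a", "c"}],
    "no":  [{"d", "e"}],
}
print(classify(frequent, {"a", "b", "x"}))  # {a, b} matches fully -> yes
```

Storing one such itemset collection per class is where the large frequent-itemset matrix mentioned in the abstract comes from.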

3.
In informatization assessment, traditional associative classification algorithms cannot discover short rules first, and their classification accuracy depends heavily on rule ordering. To address this, an associative classification algorithm based on subset support and multi-rule classification is proposed: the training set is grouped by the class attribute, association rules are mined using subset support, and the test set is classified by computing the average class support. Experimental results show that the algorithm outperforms traditional methods in both rule discovery and classification accuracy.

4.
This paper proposes an associative classification method based on attribute importance. Unlike traditional algorithms, it generates class association rules according to the importance of attributes, and its classifier construction removes the arbitrariness of the CBA algorithm's choice among rules with equal support and confidence. Experimental results show that, compared with traditional associative classification algorithms, the resulting classification rules have lower complexity and yield better classification performance.

5.
Mining class association rules (CARs) is an essential but time-intensive task in Associative Classification (AC). A number of algorithms have been proposed to speed up the mining process. However, sequential algorithms are not efficient for mining CARs in large datasets, while existing parallel algorithms require communication and collaboration among computing nodes, which introduces high synchronization costs. This paper addresses these drawbacks by proposing three efficient approaches for mining CARs in large datasets, relying on parallel computing. To date, this is the first study to implement an algorithm for parallel mining of CARs on a computer with a multi-core processor architecture. The proposed parallel algorithm is theoretically proven to be faster than existing parallel algorithms. The experimental results also show that it outperforms a recent sequential algorithm in mining time.

6.
Discretization of continuous data is one of the important pre-processing tasks in data mining and knowledge discovery. Generally speaking, discretization can lead to improved predictive accuracy of induction algorithms, and the obtained rules are normally shorter and more understandable. In this paper, we present the Class-Attribute Coherence Maximization (CACM) algorithm and the Efficient-CACM algorithm. We have compared the performance of our algorithms with the most relevant discretization algorithm, the Fast Class-Attribute Interdependence Maximization (Fast-CAIM) discretization algorithm (Kurgan and Cios, 2003). Empirical evaluation of our algorithms and Fast-CAIM on 12 well-known datasets shows that ours generate superior discretization schemes, which can significantly improve the classification performance of the C4.5 and RBF-SVM classifiers. As to the execution time of discretization, ours also prove faster than the Fast-CAIM algorithm, with the Efficient-CACM algorithm having the shortest execution time.

7.
Website phishing is considered one of the crucial security challenges for the online community due to the massive number of online transactions performed on a daily basis. Website phishing can be described as mimicking a trusted website to obtain sensitive information from online users, such as usernames and passwords. Blacklists, whitelists and the utilisation of search methods are examples of solutions to minimise the risk of this problem. One intelligent approach based on data mining, called Associative Classification (AC), seems a potential solution that may effectively detect phishing websites with high accuracy. According to experimental studies, AC often extracts classifiers containing simple "If-Then" rules with a high degree of predictive accuracy. In this paper, we investigate the problem of website phishing using a developed AC method called Multi-label Classifier based Associative Classification (MCAC) to assess its applicability to the phishing problem. We also want to identify features that distinguish phishing websites from legitimate ones. In addition, we survey intelligent approaches used to handle the phishing problem. Experimental results using real data collected from different sources show that AC, particularly MCAC, detects phishing websites with higher accuracy than other intelligent algorithms. Further, MCAC generates new hidden knowledge (rules) that other algorithms are unable to find, and this has improved its classifiers' predictive performance.

8.
Associative classification (AC), which is based on association rules, has shown great promise over many other classification techniques. To implement AC effectively, we need to tackle the problem of the very large search space of candidate rules during the rule discovery process and incorporate the discovered association rules into the classification process. This paper proposes a new approach that we call artificial immune system-associative classification (AIS-AC), which is based on AIS, for mining association rules effectively for classification. Instead of massively searching for all possible association rules, AIS-AC only finds a subset of association rules that are suitable for effective AC, in an evolutionary manner. In this paper, we also evaluate the performance of the proposed AIS-AC approach on large datasets. The performance results show that the proposed approach is efficient in dealing with the complexity of the rule search space, while at the same time achieving good classification accuracy. This is especially important for mining association rules from large datasets, in which the search space of rules is huge.

9.
Associative classification and many of its refinements struggle to achieve both high overall accuracy and good performance on minority classes. To address this, ACCS, an improved associative classification algorithm based on independent mining under per-class support thresholds, is proposed. Its main features are: (1) a class support threshold is set for each class according to its size in the training set, and each class's association rules are mined independently under its own threshold, so that minority classes generate more high-confidence rules; (2) rules with equal confidence are ordered by class support, raising the priority of minority-class rules; (3) unknown instances are predicted with a new rule measure that combines confidence and lift. Experiments on several datasets show that, compared with various improved associative classification algorithms, ACCS achieves higher overall accuracy and also performs well on the minority classes of imbalanced data.
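The three ACCS ideas can be sketched roughly as below. Two details are assumptions, not taken from the paper: that equal-confidence rules are ordered by ascending class support (so minority-class rules come first), and that the combined rule measure is the product of confidence and lift.

```python
# Hedged sketch of ACCS-style rule ranking and prediction.

from dataclasses import dataclass

@dataclass
class Rule:
    antecedent: frozenset
    label: str
    confidence: float
    class_support: float
    lift: float

def rank(rules):
    """Confidence first; among ties, smaller class support ranks higher,
    which favors minority-class rules (an assumed reading of the paper)."""
    return sorted(rules, key=lambda r: (-r.confidence, r.class_support))

def predict(rules, instance):
    """Score matching rules with a confidence-times-lift measure (assumed)."""
    matching = [r for r in rules if r.antecedent <= instance]
    if not matching:
        return None
    return max(matching, key=lambda r: r.confidence * r.lift).label
```

Per-class support thresholds would enter upstream of this sketch, during rule mining, by lowering the minimum support for small classes.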

10.
An improved associative classification algorithm
Associative classification is a major classification method in data mining, but traditional associative classification algorithms build the classifier from confidence alone, which hurts classification accuracy. This paper proposes an improvement that, after selecting high-confidence rules for the classifier, gives priority to short rules when classifying. Experimental results show that the improved algorithm outperforms the traditional algorithm in both classification accuracy and classifier size.
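A minimal sketch of the "short rules first" selection described above; the rule representation and the tie-break by higher confidence are illustrative assumptions, not the paper's specification.

```python
# Among rules that match the instance and clear the confidence threshold,
# prefer the shortest antecedent; break length ties by higher confidence.

def pick_rule(rules, instance, min_conf=0.8):
    candidates = [r for r in rules
                  if r["conf"] >= min_conf and r["items"] <= instance]
    if not candidates:
        return None
    return min(candidates, key=lambda r: (len(r["items"]), -r["conf"]))

rules = [
    {"items": frozenset({"a", "b"}), "conf": 0.95, "label": "x"},
    {"items": frozenset({"a"}),      "conf": 0.85, "label": "y"},
]
print(pick_rule(rules, {"a", "b"})["label"])  # shorter rule wins -> y
```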

11.
Text classification, the core problem of text mining, has become an important topic in natural language processing. Short-text classification, owing to its sparsity, real-time requirements, and informality, is one of the pressing problems within text classification. In certain scenarios, short texts carry a great deal of implicit semantics, which makes mining implicit semantic features from limited text challenging. Existing methods for short-text classification mainly use traditional machine learning or deep learning algorithms, but such models are complex to build, labor-intensive, and inefficient. Moreover, short texts contain little useful information and are highly colloquial, which places high demands on a model's feature-learning capability. To address these issues, the KAeRCNN model is proposed, which extends the TextRCNN model with knowledge awareness and a dual attention mechanism. Knowledge awareness comprises knowledge-graph entity linking and knowledge-graph embedding, introducing external knowledge to obtain semantic features, while the dual attention mechanism improves the model's efficiency at extracting useful information from short texts. Experimental results show that KAeRCNN significantly outperforms traditional machine learning algorithms in classification accuracy, F1 score, and practical effectiveness. Its performance and adaptability were validated: accuracy reaches 95.54% and F1 reaches 0.901; compared with four traditional machine learning algorithms, accuracy improves by about 14% and F1 by about 13% on average. Compared with TextRCNN, KAeRCNN improves accuracy by about 3%...

12.
Classification is one of the key issues in medical diagnosis. In this paper, a novel approach to pattern classification tasks is presented. This model is called the Associative Memory based Classifier (AMBC). Throughout the experimental phase, the proposed algorithm is applied to help diagnose diseases; in particular, it is applied to seven different diagnostic problems in the medical field. The performance of the proposed model is validated by comparing the classification accuracy of AMBC against the performance achieved by twenty other well-known algorithms. Experimental results show that AMBC achieved the best performance in three of the seven pattern classification problems in the medical field. It should also be noted that our proposal achieved the best classification accuracy averaged over all datasets.

13.
This paper proposes FGCMAR, an algorithm for discovering global classification rules in a distributed multi-database environment. FGCMAR runs the CMAR algorithm at each site to build a local frequent-pattern tree, transmits conditional pattern bases among sites to form a global conditional frequent-pattern tree, and finally mines that tree to obtain the global classification rules. The algorithm effectively reduces network traffic and improves mining efficiency. Theoretical analysis and experimental results show that it is effective and feasible.

14.
Classification is one of the important tasks in data mining. The probabilistic neural network (PNN) is a well-known and efficient approach for classification. The objective of the work presented in this paper is to build on this approach to develop an effective method for classification problems that can find high-quality solutions (with respect to classification accuracy) at a high convergence speed. To achieve this objective, we propose a method that hybridizes the firefly algorithm with simulated annealing (denoted as SFA), where simulated annealing is applied to control the randomness step inside the firefly algorithm while optimizing the weights of the standard PNN model. We also extend our work by investigating the effectiveness of using Lévy flight within the firefly algorithm (denoted as LFA) to better explore the search space and by integrating SFA with Lévy flight (denoted as LSFA) in order to improve the performance of the PNN. The algorithms were tested on 11 standard benchmark datasets. Experimental results indicate that the LSFA shows better performance than the SFA and LFA. Moreover, when compared with other algorithms in the literature, the LSFA is able to obtain better results in terms of classification accuracy.

15.
Associative classification is a new classification approach integrating association mining and classification. It has become a significant tool for knowledge discovery and data mining. However, high-order association mining is time-consuming when the number of attributes becomes large. The recent development of the AdaBoost algorithm indicates that boosting simple rules can often achieve better classification results than using complex rules. In view of this, we apply the AdaBoost algorithm to an associative classification system for both learning-time reduction and accuracy improvement. In addition to exploring many advantages of the boosted associative classification system, this paper also proposes a new weighting strategy for voting among multiple classifiers.

16.
The substitution of missing values, also called imputation, is an important data preparation task for data mining applications. Imputation algorithms have traditionally been compared in terms of the similarity between imputed and original values. However, this traditional approach, sometimes referred to as prediction ability, does not allow inferring the influence of imputed values on the ultimate modeling tasks (e.g., classification). Based on extensive experimental work, we study the influence on classification problems of five nearest-neighbor based imputation algorithms (KNNImpute, SKNN, IKNNImpute, KMI and EACImpute) and two simple algorithms widely used in practice (Mean Imputation and Majority Method). In order to assess these algorithms experimentally, missing values were simulated on six datasets by means of two missingness mechanisms: Missing Completely at Random (MCAR) and Missing at Random (MAR). The latter allows the probability of missingness to depend on observed data but not on missing data, whereas the former occurs when the distribution of missingness does not depend on the observed data either. The quality of the imputed values is assessed by two measures: prediction ability and classification bias. Experimental results show that IKNNImpute outperforms the other algorithms under the MCAR mechanism. KNNImpute, SKNN and EACImpute, in turn, provided the best results under the MAR mechanism. Finally, our experiments also show that the best prediction results (in terms of mean squared error) do not necessarily yield less classification bias.
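As a point of reference for the methods compared above, here is a toy sketch of Mean Imputation and of a minimal KNNImpute-style fill (nearest complete row by Euclidean distance over observed features, k = 1). It is a simplification for illustration, not the implementations evaluated in the study.

```python
import math

def mean_impute(column):
    """Replace missing entries (None) with the column mean."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def knn_impute_row(row, complete_rows):
    """Fill a row's missing entries from the nearest complete row (k = 1),
    measuring distance only over the row's observed features."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2
                             for x, y in zip(a, b) if x is not None))
    nearest = min(complete_rows, key=lambda r: dist(row, r))
    return [n if v is None else v for v, n in zip(row, nearest)]

print(mean_impute([1.0, None, 3.0]))                           # -> [1.0, 2.0, 3.0]
print(knn_impute_row([1.0, None], [[1.0, 5.0], [10.0, 9.0]]))  # -> [1.0, 5.0]
```

The study's distinction between prediction ability and classification bias is about evaluating such imputed values: closeness to the true values versus the downstream effect on a classifier.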

17.
A discretization algorithm based on Class-Attribute Contingency Coefficient
Discretization algorithms have played an important role in data mining and knowledge discovery. They not only produce a concise summarization of continuous attributes to help the experts understand the data more easily, but also make learning more accurate and faster. In this paper, we propose a static, global, incremental, supervised and top-down discretization algorithm based on the Class-Attribute Contingency Coefficient. Empirical evaluation of seven discretization algorithms on 13 real datasets and four artificial datasets showed that the proposed algorithm could generate a better discretization scheme that improved the accuracy of classification. As to the execution time of discretization, the number of generated rules, and the training time of C5.0, our approach also achieved promising results.

18.
Building a highly-compact and accurate associative classifier
Associative classification has attracted significant research attention in recent years due to its advantage of rule forms with satisfactory accuracy. However, the rules in associative classifiers derived from typical association rule mining (e.g., Apriori-type) may easily become too numerous to understand, and may even be redundant or conflicting. To deal with these issues, a recently proposed approach (i.e., GARC) appears superior to other existing approaches (e.g., C4.5-type, NN, SVM, CBA) in two respects: its classification accuracy is equally satisfactory, and the generated classifier is compact, constituted by far fewer rules. Along this line of methodological thinking, this paper presents a novel GARC-type approach, namely GEAR, to build an associative classifier with three distinctive and desirable features. First, the rules in the GEAR classifier are more intuitively appealing; second, GEAR's classification accuracy is improved or at least as good as others'; and third, the GEAR classifier is significantly more compact in size. In doing so, a number of notions including rule redundancy and the compact set are provided, together with related properties that can be incorporated into the rule mining process as algorithmic pruning strategies. The experimental results with benchmark datasets also reveal that GEAR effectively outperforms GARC and other approaches.

19.
To counter the effect of concept drift on classification in data streams, a diversity and accuracy weighting ensemble classification algorithm (DAWE) is proposed. Unlike existing ensemble methods, DAWE considers both measures, diversity and accuracy: a classifier's accuracy on the latest data block and its diversity within the ensemble are linearly weighted to measure that classifier's value to the current ensemble, and this value measure drives the base-classifier replacement strategy. DAWE was compared with state-of-the-art algorithms in MOA on both real and synthetic data. The experiments show that the method is effective: its average accuracy over all datasets exceeds that of the other algorithms, and it handles concept drift in data-stream mining well.
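The linear weighting at the heart of DAWE can be sketched as follows. The weight `alpha` and the choice of pairwise disagreement as the diversity measure are assumptions for illustration; the paper's exact formulation may differ.

```python
# Sketch of a DAWE-style value measure: a linear combination of a base
# classifier's accuracy on the newest data block and its diversity
# (average disagreement) with the other ensemble members.

def disagreement(preds_a, preds_b):
    """Fraction of instances on which two classifiers disagree."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def value(accuracy, own_preds, other_preds_list, alpha=0.5):
    """Linearly weight accuracy on the latest block against diversity."""
    diversity = (sum(disagreement(own_preds, p) for p in other_preds_list)
                 / max(len(other_preds_list), 1))
    return alpha * accuracy + (1 - alpha) * diversity
```

In a replacement strategy, the ensemble member with the lowest `value` would be the candidate to drop when a new classifier trained on the latest block arrives.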

20.
余海, 李斌, 王培霞, 贾荻, 王永吉. Journal of Computer Applications (《计算机应用》), 2016, 36(12): 3448-3453
Source-code comments are an important part of software. Researchers often need to generate and analyze comments manually or with automated methods, and comment quality is also typically assessed by hand, which is inefficient and subjective. To address this, comment evaluation criteria are first constructed from four aspects: the comment's format, linguistic form, content, and relatedness to the code. Based on these criteria, a comment-quality assessment method using an ensemble of classification algorithms is proposed. The method brings machine learning and natural language processing techniques into comment-quality assessment, using classification algorithms to grade comments into four levels: unqualified, qualified, good, and excellent. Combining the basic classification algorithms further improves the results: the ensemble's accuracy and F1 score are roughly 20 percentage points higher than those of any single classification algorithm, and all metrics except the macro-averaged F1 exceed 70%. Experimental results show that the proposed method is well suited to comment-quality assessment.
