Found 20 similar documents (search time: 0 ms)
1.
Frank S.C. Tseng 《Information Sciences》2010,180(22):4263-4894
Existing parallel algorithms for association rule mining either incur a large inter-site communication cost or require a large amount of space to maintain the local support counts of many candidate sets. This study proposes a de-clustering approach for distributed architectures that eliminates the inter-site communication cost for most of the influential association rule mining algorithms. To de-cluster the database into similar partitions, an efficient algorithm is developed to approximate the shortest spanning path (SSP) that links the transaction data together. The SSP obtained is then used to evenly de-cluster the transaction data into subgroups. The proposed approach guarantees that all subgroups are similar to each other and to the original group. Experimental results show that data size and the number of items are the only two factors that determine the performance of de-clustering. Additionally, based on the approach, most of the influential association rule mining algorithms can be implemented in a distributed architecture to obtain a drastic increase in speed without losing any frequent itemsets. Furthermore, the data distribution in each de-clustered participant is almost the same as that of a single site, which implies that the proposed approach can be regarded as a sampling method for distributed association rule mining. Finally, the experimental results show that originally inadequate mining results can be improved to an almost perfect level.
2.
3.
Abolfazl Tazaree Amir-Masud Eftekhari-Moghadam Saeedeh Sajjadi-Ghaem-Maghami 《Multimedia Tools and Applications》2014,69(3):921-949
One of the major challenges in content-based information retrieval and machine learning is to build a so-called “semantic classifier” that can effectively and efficiently classify semantic concepts in a large database. This paper deals with semantic image classification based on hierarchical Fuzzy Association Rule (FAR) mining in an image database. Intuitively, an association rule is a unique and significant combination of image features and a semantic concept, which determines the degree of correlation between the features and the concept. The main idea behind this approach is that any visual concept in an image has some associated features, so there are strong correlations between the concepts and their corresponding features. Regardless of the semantic gap, an image concept appears when the corresponding features emerge in an image, and vice versa. Specifically, this paper’s contribution is to propose a novel Fuzzy Association Rule that improves on traditional association rules. Moreover, it establishes a hierarchical fuzzy rule base in the training phase and sets up a corresponding fuzzy inference engine to classify images in the testing phase. The presented approach is independent of image segmentation and can be applied to multi-label images. Experimental results on a database of 6000 general-purpose images demonstrate the superiority of the proposed algorithm.
4.
DNA microarray technology is a high-throughput technology that evaluates the expression of thousands of genes simultaneously under different experimental conditions. Analysis of gene expression data reveals that only a few important genes, not all, are responsible for diseases. However, DNA microarray data sets usually contain multiple missing values, so selecting important genes from an incomplete data set may be erroneous and lead to misclassification in disease prediction. In this paper we propose an integrated framework that first imputes the missing values and then, to achieve maximum accuracy in classifying patients, designs a classifier that selects genes using the complete microarray data set. Here, functionally similar genes are used to estimate the missing values, unlike existing distance similarity measures based on gene expression values. However, functionally similar genes may differ in their protein production capacity, so the degree of similarity varies from gene to gene. This problem is addressed by a novel method that imputes the missing values using the concept of fuzzy similarity. After imputation, the continuous gene expression matrix is discretized using fuzzy sets to distinguish the activation levels of different genes. The proposed fuzzy importance factor (FIf) of each gene represents its activation level, or protein production capacity, in both the disease and normal classes. The importance of each gene is evaluated while optimizing the number of rules in the fuzzy classifier depending on the FIf. The proposed methodology is demonstrated on nine cancer data sets and compared with state-of-the-art methods. Analysis of the experimental results reveals that the proposed framework is able to classify diseased and normal patients with improved accuracy.
5.
6.
A two-stage hybrid model for data classification and rule extraction is proposed. The first stage uses a Fuzzy ARTMAP (FAM) classifier with Q-learning (known as QFAM) for incremental learning of data samples, while the second stage uses a Genetic Algorithm (GA) for rule extraction from QFAM. Given a new data sample, the resulting hybrid model, known as QFAM-GA, is able to predict the target class of the data sample and to give a fuzzy if-then rule that explains the prediction. To reduce the network complexity, a pruning scheme using Q-values is applied to reduce the number of prototypes generated by QFAM. A ‘don't care’ technique is employed to minimize the number of input features using the GA. A number of benchmark problems are used to evaluate the effectiveness of QFAM-GA in terms of test accuracy, noise tolerance and model complexity (number of rules and total rule length). The results are comparable with, if not better than, those of many other models reported in the literature. The main significance of this research is a usable and useful intelligent model (i.e., QFAM-GA) for data classification in noisy conditions, capable of yielding a set of explanatory rules with minimum antecedents. In addition, QFAM-GA is able to maximize accuracy and minimize model complexity simultaneously. The empirical outcomes demonstrate the potential impact of QFAM-GA in practical environments, i.e., providing domain users with an accurate prediction and a concise justification for it, thereby allowing them to adopt QFAM-GA as a useful decision support tool in their decision-making processes.
7.
Fuzzy association rule mining algorithm based on TD-FP-growth Total citations: 1, self-citations: 0, citations by others: 1
An algorithm based on TD-FP-growth is proposed for mining fuzzy association rules. First, three kinds of t-norm operators, together with the implication operators they induce, are used to calculate the support of fuzzy frequent items and the implication degree of rules, so that the mined association rules can express the logical semantics of certainty and graduality between fuzzy items. Then, with each transaction's unique identifier as the key, the membership degree of each transaction with respect to the fuzzy item denoted by each FP-tree node is stored in a hash table, which adapts TD-FP-growth to the mining of fuzzy frequent items; the time and space complexity of the algorithm are also analyzed. Finally, experimental results show that the algorithm is more time-efficient than the Apriori-based fuzzy frequent item mining algorithm.
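As a rough illustration of how a t-norm can aggregate membership degrees into a fuzzy support value, the sketch below averages the t-norm of an itemset's memberships over all transactions. The abstract does not name its three t-norms, so the minimum, product and Łukasiewicz operators used here are assumed, commonly used choices; the function names, item labels and sample data are hypothetical, and neither the implication degree nor the TD-FP-growth tree is shown.

```python
from functools import reduce

# Three common t-norms (assumed; the abstract does not say which ones it uses).
def t_min(a, b):
    return min(a, b)

def t_product(a, b):
    return a * b

def t_lukasiewicz(a, b):
    return max(0.0, a + b - 1.0)

def fuzzy_support(transactions, itemset, t_norm):
    """Average, over all transactions, of the t-norm of the itemset's memberships.

    Each transaction is a dict mapping a fuzzy item to its membership degree;
    items missing from a transaction count as membership 0. The itemset must
    be non-empty.
    """
    total = 0.0
    for memberships in transactions:
        degrees = [memberships.get(item, 0.0) for item in itemset]
        total += reduce(t_norm, degrees)
    return total / len(transactions)

# Hypothetical sample data: two transactions, two fuzzy items.
transactions = [
    {"age.young": 0.8, "income.high": 0.5},
    {"age.young": 0.4, "income.high": 1.0},
]
itemset = ["age.young", "income.high"]

for name, t_norm in [("min", t_min), ("product", t_product), ("lukasiewicz", t_lukasiewicz)]:
    print(name, fuzzy_support(transactions, itemset, t_norm))
```

The choice of t-norm changes how strictly partial memberships are combined: on this sample the minimum gives the largest support and the Łukasiewicz operator the smallest.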
8.
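A novel intrusion detection method based on boosting RBF neural networks is proposed. Combining fuzzy clustering with neural network techniques, the FORBF algorithm is proposed, which combines an improved FCM algorithm with the OLS algorithm. To improve the generalization ability of the RBF neural network, the Boosting method is used to build a network ensemble. The neural network is trained and simulated on the “KDD Cup 1999 Data” network connection data set, achieving a high detection rate and a low false alarm rate.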
9.
Association rules play a vital role in the field of knowledge data discovery. Numerous rules have to be processed and plotted based on the ranges in the schema. The steps in this process depend on the user's queries. Previously, several approaches have been proposed to reduce the work and improve the filtering process. However, they have limitations in preprocessing time and filtering rate. In this article, an improved fuzzy weighted-iterative concept is introduced to overcome these limitations, based on the user request and visualization of the discovered rules. The initial step combines user knowledge with post-processing to make use of semantics. This step is followed by framing rule schemas to satisfy and anticipate unexpected rules according to user expectations. Preparing the above developments can be imagined by the use of yet another clever method of study. Standards on guidelines are recognized by the average learning professionals.
10.
In this paper, we introduce a new adaptive rule-based classifier for multi-class classification of biological data that addresses several problems in classifying such data: overfitting, noisy instances and class imbalance. It is well known that rules are an interesting way of representing data in a human-interpretable form. The proposed rule-based classifier combines the random subspace and boosting approaches with an ensemble of decision trees to construct a set of classification rules without involving global optimisation. The classifier uses the random subspace approach to avoid overfitting, the boosting approach to classify noisy instances, and the ensemble of decision trees to deal with the class-imbalance problem. The classifier uses two popular classification techniques: decision tree and k-nearest-neighbor algorithms. Decision trees are used for evolving classification rules from the training data, while k-nearest-neighbor is used for analysing the misclassified instances and removing vagueness between contradictory rules. It runs a series of k iterations to develop a set of classification rules from the training data and, in a boosting-like manner, pays more attention to the misclassified instances in the next iteration. This paper focuses in particular on obtaining an optimal ensemble classifier that helps improve the prediction accuracy of DNA variant identification and classification. The performance of the proposed classifier is compared with that of well-established machine learning and data mining algorithms on genomic data (148 Exome data sets) of Brugada syndrome and 10 real benchmark life sciences data sets from the UCI (University of California, Irvine) machine learning repository. The experimental results indicate that the proposed classifier has exemplary classification accuracy on different types of biological data.
Overall, the proposed classifier offers good prediction accuracy for the classification of new DNA variants, where noisy and misclassified variants are optimised to increase test performance.
11.
In the rapidly changing financial market, investors always have difficulty in deciding the right time to trade. In order to enhance investment profitability, investors desire a decision support system. The proposed artificial intelligence methodology provides investors with the ability to learn the associations among different parameters. After the associations are extracted, investors can apply the rules in their decision support systems. In this work, the model is built with the ultimate goal of predicting the level of the Hang Seng Index in Hong Kong. The movement of the Hang Seng Index, which is associated with other economic indices including the gross domestic product (GDP) index, the consumer price index (CPI), the interest rate, and the export value of goods from Hong Kong, is learnt by the proposed method. The case study shows that the proposed method is a feasible way to provide decision support for investors who may not be able to identify the hidden rules between the Hang Seng Index and other economic indices.
12.
13.
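To address the information loss and low efficiency that arise when traditional association rule mining algorithms are applied to large-scale continuous databases, an improved fuzzy association rule mining algorithm, called F-ARMVLQD, is presented. The algorithm uses a fuzzy means clustering algorithm to solve the "sharp boundary" problem between discretized attribute intervals. It also introduces a directed acyclic graph and byte vectors to improve the efficiency of computing frequent itemsets, and it draws on the strengths of the partition algorithm to avoid the frequent disk operations otherwise incurred when mining such databases; the whole algorithm needs to scan the database only twice. Experimental results show that the algorithm executes more efficiently than traditional algorithms.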
14.
An ACS-based framework for fuzzy data mining Total citations: 1, self-citations: 0, citations by others: 1
Tzung-Pei Hong Ya-Fang Tung Shyue-Liang Wang Min-Thai Wu Yu-Lung Wu 《Expert systems with applications》2009,36(9):11844-11852
Data mining is often used to find interesting and meaningful patterns in huge databases. It may generate different kinds of knowledge, such as classification rules, clusters and association rules, among others. Much research has been done on data mining, most of it focused on mining from binary-valued data. Fuzzy data mining was thus proposed to discover fuzzy knowledge from linguistic or quantitative data. Recently, ant colony systems (ACS) have been successfully applied to optimization problems. However, little work has been done on applying ACS to fuzzy data mining. This work thus proposes an ACS-based framework for fuzzy data mining. In the framework, the membership functions are first encoded into binary bits and then fed into the ACS to search for the optimal set of membership functions. The problem is thus transformed into a multi-stage graph, with each route representing a possible set of membership functions. When the termination condition is reached, the best membership function set (the one with the highest fitness value) can be used to mine fuzzy association rules from a database. Finally, experiments are conducted to compare the proposed framework with other approaches and to show its performance.
15.
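With the development of the tourism industry, mining the intrinsic, implicit correlations between traveler types and environmental factors from massive travel data is an effective way to analyze the state of the tourism market and to predict the impact on related industries. Considering the characteristics of travel data and the limitations of existing constraint methods, a parallel association rule mining algorithm based on relation-extension path constraints is proposed. The algorithm incorporates the MapReduce parallel mechanism and generates transaction sets under the relation-extension path constraint, which improves subsequent parallel efficiency. It also uses a parallel method to improve the level-wise search of the Apriori algorithm, bringing a "second" efficiency gain, so that tourism development trends can be grasped better and faster and macro-level tourism policies can be adjusted.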
16.
Dewan Md. Farid Li Zhang Alamgir Hossain Chowdhury Mofizur Rahman Rebecca Strachan Graham Sexton Keshav Dahal 《Expert systems with applications》2013,40(15):5895-5906
It is challenging to use traditional data mining techniques to deal with real-time data stream classification. Existing mining classifiers need to be updated frequently to adapt to changes in data streams. To address this issue, in this paper we propose an adaptive ensemble approach for classification and novel class detection in concept-drifting data streams. The proposed approach uses traditional mining classifiers and updates the ensemble model automatically so that it represents the most recent concepts in the data streams. For novel class detection, we rely on the idea that data points belonging to the same class should be close to each other and far apart from data points belonging to other classes. If a data point is well separated from the existing data clusters, it is identified as a novel class instance. We tested the performance of the proposed stream classification model against that of existing mining algorithms using real benchmark datasets from the UCI (University of California, Irvine) machine learning repository. The experimental results show that our approach offers great flexibility and robustness in novel class detection under concept drift and outperforms traditional classification models in challenging real-life data stream applications.
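The separation idea described above can be sketched as a simple distance test. This is only a minimal illustration under stated assumptions, not the authors' procedure: each existing cluster is summarized by a hypothetical (centroid, radius) pair, and a point counts as "well separated" when it lies outside every cluster's radius. The function name, the radius-based threshold and the sample clusters are all assumptions.

```python
import math

def is_novel(point, clusters):
    """Return True if `point` lies outside the radius of every known cluster.

    `clusters` is a list of (centroid, radius) pairs; this summary and the
    radius threshold are assumptions made for illustration only.
    """
    for centroid, radius in clusters:
        if math.dist(point, centroid) <= radius:
            return False  # close enough to an existing cluster
    return True

# Hypothetical clusters learned from earlier chunks of the stream.
clusters = [((0.0, 0.0), 1.0), ((5.0, 5.0), 1.5)]

print(is_novel((0.5, 0.5), clusters))    # inside the first cluster
print(is_novel((10.0, 10.0), clusters))  # far from both clusters
```

In practice a stream classifier would likely wait for several such outliers that are also close to one another before declaring a new class, since a single distant point may simply be noise.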
17.
Elicitation of classification rules by fuzzy data mining Total citations: 1, self-citations: 0, citations by others: 1
Yi-Chung Hu Gwo-Hshiung Tzeng 《Engineering Applications of Artificial Intelligence》2003,16(7-8):709-716
Data mining techniques can be used to find potentially useful patterns in data and to ease the knowledge acquisition bottleneck in building prototype rule-based systems. Based on the partition methods presented in the simple-fuzzy-partition-based method (SFPBM) proposed by Hu et al. (Comput. Ind. Eng. 43(4) (2002) 735), the aim of this paper is to propose a new fuzzy data mining technique consisting of two phases to find fuzzy if–then rules for classification problems: one to find frequent fuzzy grids by using a pre-specified simple fuzzy partition method to divide each quantitative attribute, and the other to generate fuzzy classification rules from the frequent fuzzy grids. To improve the classification performance of the proposed method, we incorporate the adaptive rules proposed by Nozaki et al. (IEEE Trans. Fuzzy Syst. 4(3) (1996) 238) into our method to adjust the confidence of each classification rule. Regarding classification generalization ability, the simulation results on the iris data demonstrate that the proposed method may effectively derive fuzzy classification rules from training samples.
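To make the idea of dividing a quantitative attribute with a simple fuzzy partition more concrete, the sketch below splits an attribute range into k evenly spaced triangular fuzzy sets and returns the membership of a value in each. The symmetric triangular shape is an assumed, commonly used form rather than a detail confirmed by the abstract, and the function name and sample range are hypothetical.

```python
def triangular_partition(x, lo, hi, k):
    """Membership of x in each of k evenly spaced triangular fuzzy sets on [lo, hi].

    Set i is centered at lo + i * width, where width = (hi - lo) / (k - 1),
    and its membership falls linearly to 0 at the neighboring centers.
    Requires k >= 2 and hi > lo. The triangular shape is an assumption.
    """
    width = (hi - lo) / (k - 1)
    return [max(0.0, 1.0 - abs(x - (lo + i * width)) / width) for i in range(k)]

# Hypothetical attribute with range [0, 10] divided into 3 fuzzy sets
# (for example "small", "medium", "large").
print(triangular_partition(2.5, 0.0, 10.0, 3))
print(triangular_partition(5.0, 0.0, 10.0, 3))
```

With this shape, the memberships of any value inside the range sum to 1, so each value is shared between at most two adjacent fuzzy sets; a fuzzy grid is then a combination of one such set per attribute.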
18.
Fuzzy cognitive maps (FCMs) are one of the representative techniques for developing scenarios that include future concepts and issues, as well as their causal relationships. The technique, initially dependent on deductive modeling of expert knowledge, suffered from inherent limitations of scope and subjectivity. Although this limitation has been partially addressed by the recent emergence of inductive modeling, inductive modeling uses retrospective, historical data that often misses trend-breaking developments. Addressing this issue, the paper suggests using futuristic data, a collection of future-oriented opinions extracted from online communities with large participation, in scenario building. Because futuristic data is both large in scope and prospective in nature, we believe a methodology based on this kind of data addresses the problems of subjectivity and myopia suffered by previous modeling techniques. To this end, text mining (TM) and the latent semantic analysis (LSA) algorithm are applied to extract scenario concepts from futuristic data in textual documents, and the fuzzy association rule mining (FARM) technique is used to identify their causal weights based on if-then rules. To illustrate the utility of the proposed approach, a case study of electric vehicles is conducted. The suggested approach can improve the effectiveness and efficiency of scanning knowledge for scenario development.
19.
We previously proposed a decision tree classifier named MMC (multi-valued and multi-labeled classifier). MMC is known for its capability of classifying large multi-valued and multi-labeled data. Aiming to improve the accuracy of MMC, this paper develops another classifier named MMDT (multi-valued and multi-labeled decision tree). MMDT differs from MMC mainly in attribute selection. MMC attempts to split a node into child nodes whose records approach the same multiple labels. It basically measures the average similarity of the labels of each child node to determine the goodness of each splitting attribute. MMDT, in contrast, uses another measuring strategy that considers not only the average similarity of the labels of each child node but also the average appropriateness of those labels. The new measuring strategy takes a scoring approach to obtain a look-ahead measure of the accuracy contribution of each attribute's split. The experimental results show that MMDT improves on the accuracy of MMC.
20.
Data classification is an important topic in the field of data mining due to its wide applications. A number of related methods have been proposed based on well-known learning models such as decision trees or neural networks. Although data classification has been widely discussed, relatively few studies have explored temporal data classification. Most existing research has focused on improving classification accuracy by using statistical models, neural networks, or distance-based methods. However, these methods cannot explain the classification results to users. In many research settings, such as microarray gene expression, users prefer interpretable classification information over a classifier that merely has high accuracy. In this paper, we propose a novel pattern-based data mining method, namely classify-by-sequence (CBS), for classifying large temporal datasets. The main methodology behind CBS is the integration of sequential pattern mining with probabilistic induction. CBS has the merit of being simple to implement, and its pattern-based architecture can supply clear classification information to users. In experimental evaluation, CBS was shown to deliver classification results with high accuracy on two real time-series datasets. In addition, we designed a simulator to evaluate the performance of CBS on datasets with different characteristics. The experimental results show that CBS can discover hidden patterns and classify data effectively by utilizing the mined sequential patterns.