Found 20 similar documents (search time: 24 ms)
1.
This paper proposes a methodology for text mining relying on the classical knowledge discovery loop, with a number of adaptations.
First, texts are indexed and prepared to be processed by frequent itemset levelwise search. Association rules are then extracted
and interpreted, with respect to a set of quality measures and domain knowledge, under the control of an analyst. The article
includes an experiment on a real-world text corpus on molecular biology.
3.
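To apply all-weighted association rule mining to query expansion, this paper proposes a query-expansion-oriented algorithm for mining all-weighted inter-term association rules based on multiple pruning strategies, which greatly improves mining efficiency. It also proposes a new query expansion model and a method for computing expansion-term weights that makes those weights more reasonable. On this basis, a new query expansion algorithm based on local feedback is proposed: it uses the all-weighted association rule mining algorithm to automatically mine, from the top-ranked documents of the initial retrieval, all-weighted association rules related to the original query, builds a rule base, and extracts from it the expansion terms related to the original query. Experimental results show that the retrieval performance of the query expansion algorithm is indeed improved: compared with existing query expansion algorithms, its average precision is clearly higher at the same recall levels.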
4.
Quantitative association rule (QAR) mining has been recognized as an influential research problem over the last decade due to
the popularity of quantitative databases and the usefulness of association rules in real life. Unlike boolean association
rules (BARs), which only consider boolean attributes, QARs consist of quantitative attributes which contain much richer information
than the boolean attributes. However, the combination of these quantitative attributes and their value intervals always gives
rise to the generation of an explosively large number of itemsets, thereby severely degrading the mining efficiency. In this
paper, we propose an information-theoretic approach to avoid unrewarding combinations of both the attributes and their value
intervals being generated in the mining process. We study the mutual information between the attributes in a quantitative
database and devise a normalization on the mutual information to make it applicable in the context of QAR mining. To indicate
the strong informative relationships among the attributes, we construct a mutual information graph (MI graph), whose edges
are attribute pairs that have normalized mutual information no less than a predefined information threshold. We find that
the cliques in the MI graph represent a majority of the frequent itemsets. We also show that frequent itemsets that do not
form a clique in the MI graph are those whose attributes are not informatively correlated to each other. By utilizing the
cliques in the MI graph, we devise an efficient algorithm that significantly reduces the number of value intervals of the
attribute sets to be joined during the mining process. Extensive experiments show that our algorithm speeds up the mining
process by up to two orders of magnitude. Most importantly, we are able to obtain most of the high-confidence QARs, whereas
the QARs that are not returned by MIC are shown to be less interesting.
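The MI-graph construction described in this abstract can be illustrated with a minimal sketch. The abstract does not give the normalization, so dividing by the larger of the two attribute entropies is an assumption made here for illustration, as are the function names and the toy table; this is not the paper's algorithm.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discretized attribute values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long value lists."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def normalized_mi(xs, ys):
    """ASSUMPTION: normalize by max(H(X), H(Y)); the paper may use another form."""
    denom = max(entropy(xs), entropy(ys))
    return mutual_information(xs, ys) / denom if denom > 0 else 0.0


def mi_graph_edges(table, threshold):
    """Return attribute pairs whose normalized MI is at least the threshold.

    table: dict mapping attribute name -> list of discretized values.
    """
    names = sorted(table)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if normalized_mi(table[a], table[b]) >= threshold
    ]


# Hypothetical toy table: age and income move together, noise is independent.
table = {
    "age": [1, 1, 2, 2],
    "income": [1, 1, 2, 2],
    "noise": [1, 2, 1, 2],
}
edges = mi_graph_edges(table, threshold=0.5)
```

Only the strongly related pair survives the threshold, which is the intuition behind restricting the search to cliques of the MI graph.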
5.
The paper focuses on the adaptive relational association rule mining problem. Relational association rules are a particular type of association rule that describes frequent relations occurring between the features characterizing the instances within a data set. We aim at re-mining a previously mined object set when the feature set characterizing the objects increases. An adaptive relational association rule method, based on the discovery of interesting relational association rules, is proposed. This method, called ARARM (Adaptive Relational Association Rule Mining), adapts the set of rules that was established by mining the data before the feature set changed, preserving completeness. We aim to reach the result more efficiently than by running the mining algorithm again from scratch on the feature-extended object set. Experiments testing the method's performance on several case studies are also reported. The obtained results highlight the efficiency of the ARARM method and confirm the potential of our proposal.
6.
Identifying irregular file system permissions in large, multi-user systems is challenging due to the complexity of gaining structural understanding from large volumes of permission information. This challenge is exacerbated when file system permissions are allocated in an ad-hoc manner as new access rights are required, and when access rights become redundant as users change job roles or terminate employment. These factors make it challenging to identify what can be classed as an irregular file system permission, as well as whether such a permission exposes a vulnerability. The current way of finding such irregularities is to perform an exhaustive audit of the permission distribution; however, this requires expert knowledge and a significant amount of time. In this paper, a novel method of modelling file system permissions, which can be used by association rule mining techniques to identify irregular permissions, is presented. This results in the creation of an object-centric model as a by-product. The technique is then implemented and tested on Microsoft’s New Technology File System (NTFS) permissions. Empirical observations are derived by making comparisons with expert knowledge to determine the effectiveness of the proposed technique on five diverse real-world directory structures extracted from different organisations. The results demonstrate that the technique is able to correctly identify irregularities with an average accuracy rate of 91%, minimising the reliance on expert knowledge. Experiments are also performed on synthetic directory structures, which demonstrate an accuracy rate of 95% when irregular permissions constitute 1% of the total number. This is a significant contribution, as it creates the possibility of identifying vulnerabilities without prior knowledge of how file system permissions are implemented within a directory structure.
7.
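Support-based association rule mining algorithms cannot find itemsets that are infrequent but have high utility, while utility-based association rule mining misses itemsets whose utility is not high but which occur fairly frequently and whose product of support and utility (the motivation) is large. This paper formulates the motivation-based association rule mining problem and proposes a bottom-up mining algorithm, HM-miner. Motivation combines the advantages of support and utility and can measure both the statistical importance and the semantic importance of an itemset. HM-miner prunes using the upper-bound property of motivation and can effectively mine high-motivation itemsets.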
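The measure in this abstract, the product of an itemset's support and its utility, can be sketched by brute force as follows. This is not HM-miner and does no upper-bound pruning; the utility definition (quantity times unit profit, summed over the transactions containing the itemset), the function names, and the toy data are assumptions for illustration only.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if all(i in t for i in itemset))
    return hits / len(transactions)


def utility(itemset, transactions, profit):
    """ASSUMED utility: quantity * unit profit, summed over supporting transactions."""
    total = 0.0
    for t in transactions:
        if all(i in t for i in itemset):
            total += sum(t[i] * profit[i] for i in itemset)
    return total


def motivation(itemset, transactions, profit):
    """Support multiplied by utility, as described in the abstract."""
    return support(itemset, transactions) * utility(itemset, transactions, profit)


def high_motivation_itemsets(transactions, profit, threshold, max_size=2):
    """Enumerate all itemsets up to max_size and keep those above the threshold."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            score = motivation(combo, transactions, profit)
            if score >= threshold:
                result[combo] = score
    return result


# Hypothetical data: each transaction maps item -> purchased quantity.
transactions = [
    {"bread": 2, "milk": 1},
    {"bread": 1},
    {"milk": 1, "caviar": 1},
    {"bread": 1, "milk": 2},
]
profit = {"bread": 1, "milk": 2, "caviar": 50}
selected = high_motivation_itemsets(transactions, profit, threshold=10)
```

In this toy data the rare but profitable item clears the threshold while the frequent cheap items do not, which shows how the measure trades frequency against utility.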
9.
In the rapidly changing financial market, investors often have difficulty deciding the right time to trade. To enhance investment profitability, investors desire a decision support system. The proposed artificial intelligence methodology gives investors the ability to learn the associations among different parameters. After the associations are extracted, investors can apply the rules in their decision support systems. In this work, the model is built with the ultimate goal of predicting the level of the Hang Seng Index in Hong Kong. The movement of the Hang Seng Index, which is associated with other economic indices including the gross domestic product (GDP) index, the consumer price index (CPI), the interest rate, and the export value of goods from Hong Kong, is learnt by the proposed method. The case study shows that the proposed method is a feasible way to provide decision support for investors who may not be able to identify the hidden rules between the Hang Seng Index and other economic indices.
10.
Association rules play a vital role in the field of knowledge data discovery. Numerous rules have to be processed and plotted based on the ranges in the schema. This step of the process depends on the user's queries. Previously, several projects have been proposed to reduce the work and improve the filtration process. However, they have some limitations in preprocessing time and filtration rate. In this article, an improved fuzzy weighted-iterative concept is introduced to overcome these limitations, based on the user request and on visualization of the discovered rules. The initial step includes the mix of client learning with posthandling to use the semantics. The above advance was trailed by surrounding rule schemas to fulfill and anticipate unpredictable guidelines dependent on client desires. Preparing the above developments can be imagined by the use of yet another clever method of study. Standards on guidelines are recognized by the average learning professionals.
11.
Mining association rules plays an important role in data mining and knowledge discovery since it can reveal strong associations between items in databases. Nevertheless, an important problem with traditional association rule mining methods is that they can generate a huge number of association rules depending on how parameters are set. Users, however, are often only interested in finding the strongest rules, and do not want to go through a large number of rules or wait for these rules to be generated. To address those needs, algorithms have been proposed to mine the top-k association rules in databases, where users can directly set a parameter k to obtain the k most frequent rules. However, a major issue with these techniques is that they remain very costly in terms of execution time and memory. To address this issue, this paper presents a novel algorithm named ETARM (Efficient Top-k Association Rule Miner) to efficiently find the complete set of top-k association rules. The proposed algorithm integrates two novel candidate pruning properties to reduce the search space more effectively. These properties are applied during the candidate selection process to identify items that should not be used to expand a rule, based on its confidence, and thus reduce the number of candidates. An extensive experimental evaluation on six standard benchmark datasets shows that the proposed approach outperforms the state-of-the-art TopKRules algorithm in terms of both runtime and memory usage.
12.
This paper introduces a new approach to the problem of data sharing among multiple parties without disclosing the data between the parties. Our focus is data sharing among parties involved in a data mining task. We study how to share private or confidential data in the following scenario: multiple parties, each having a private data set, want to collaboratively conduct association rule mining without disclosing their private data to each other or to any other party. To tackle this demanding problem, we develop a secure protocol for multiple parties to conduct the desired computation. The solution is distributed, i.e., there is no central, trusted party having access to all the data. Instead, we define a protocol using homomorphic encryption techniques to exchange the data while keeping it private.
13.
Most algorithms for association rule mining are variants of the basic Apriori algorithm (Agrawal and Srikant, Fast algorithms for mining association rules in databases, in: Proceedings of the 20th International Conference on Very Large Data Bases (VLDB’94), Santiago, Chile, 1994, pp. 487–499). One characteristic of these Apriori-based algorithms is that candidate itemsets are generated in rounds, with the size of the itemsets incremented by one per round. The number of database scans required by Apriori-based algorithms thus depends on the size of the biggest frequent itemsets. In this paper, we devise a more general candidate set generation algorithm, LGen, which generates candidate itemsets of multiple sizes during each database scan. We present an algorithm, FindLarge, which uses LGen to find frequent itemsets. We show that, given a reasonable set of suggested frequent itemsets, FindLarge can significantly reduce the number of I/O passes required. In the best cases, only two passes are sufficient to discover all the frequent itemsets, irrespective of the size of the biggest ones. Two I/O-saving algorithms, namely DIC and Pincer-Search, are compared with FindLarge in a series of experiments. We discuss the conditions under which FindLarge significantly outperforms the others in terms of I/O efficiency.
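The round-by-round behaviour described in this abstract can be seen in a minimal sketch of the basic levelwise (Apriori-style) search. It is not LGen or FindLarge; the function name, the pass counter, and the toy transactions are illustrative assumptions. The counter makes visible that the number of database scans grows with the size of the largest frequent itemset.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Levelwise frequent-itemset search.

    Returns (frequent, passes): a dict mapping each frequent itemset to its
    support, and the number of scans made over the transaction list.
    """
    db = [frozenset(t) for t in transactions]
    n = len(db)
    candidates = {frozenset([item]) for t in db for item in t}
    frequent = {}
    passes = 0
    k = 1
    while candidates:
        # One scan of the database counts all size-k candidates.
        passes += 1
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent, passes


# Hypothetical toy database.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
frequent, passes = apriori(transactions, min_support=0.4)
```

Here the largest frequent itemset has three items, so the search needs three scans; generating candidates of several sizes per scan, as the abstract proposes, is aimed at reducing exactly this count.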
14.
In this paper, we present an efficient computer-aided mass classification method for digitized mammograms using association rule mining, which performs benign–malignant classification on regions of interest that contain a mass. One of the major mammographic characteristics for mass classification is texture. Association rule mining (ARM) exploits this important factor to classify the mass as benign or malignant. The statistical textural features used in characterizing the masses are mean, standard deviation, entropy, skewness, kurtosis and uniformity. The main aim of the method is to increase the effectiveness and efficiency of the classification process in an objective manner, so as to reduce the number of false-positive malignancies. Correlated association rule mining is proposed for classifying the marked regions as benign or malignant; it achieves 98.6% sensitivity and 97.4% specificity, which is very promising compared with the radiologist’s sensitivity of 75%.
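The six statistical texture features named in this abstract can be computed from the gray-level histogram of a region of interest. The sketch below uses the common first-order definitions; the exact formulas used in the paper are not given in the abstract, so the normalization of skewness and kurtosis by powers of the standard deviation, the function name, and the sample pixels are assumptions.

```python
from collections import Counter
from math import log2, sqrt


def texture_features(pixels):
    """First-order texture statistics of a flat list of gray levels."""
    n = len(pixels)
    # Normalized histogram: probability of each gray level in the region.
    hist = {level: count / n for level, count in Counter(pixels).items()}
    mean = sum(z * p for z, p in hist.items())

    def central_moment(k):
        return sum((z - mean) ** k * p for z, p in hist.items())

    std = sqrt(central_moment(2))
    return {
        "mean": mean,
        "std": std,
        "entropy": -sum(p * log2(p) for p in hist.values()),
        # ASSUMPTION: standardized third and fourth moments.
        "skewness": central_moment(3) / std ** 3 if std > 0 else 0.0,
        "kurtosis": central_moment(4) / std ** 4 if std > 0 else 0.0,
        "uniformity": sum(p * p for p in hist.values()),
    }


# Hypothetical region of interest: half black, half white pixels.
features = texture_features([0, 0, 255, 255])
```

In a rule-mining pipeline such features would typically be discretized into intervals before being treated as items; that step is not shown here.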
15.
Traditional temporal association rule mining algorithms cannot dynamically update the temporal association rules within the valid time interval as data increase. In this paper, a new algorithm called incremental fuzzy temporal association rule mining using fuzzy grid table (IFTARMFGT) is proposed by combining the advantages of boolean matrices with incremental mining. First, multivariate time series data are transformed into discrete fuzzy values that contain the time intervals and fuzzy membership. Second, to improve the mining efficiency, the concept of boolean matrices is introduced into the fuzzy membership to generate a fuzzy grid table for mining the frequent itemsets. Finally, building on the Fast UPdate (FUP) algorithm, fuzzy temporal association rules are incrementally mined and updated without repeatedly scanning the original database, by considering the lifespan of each item and inheriting the information from previous mining results. The experiments show that our algorithm provides better efficiency and interpretability in mining temporal association rules than other algorithms.
16.
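This paper introduces the common theory of association rules, studies the standard Apriori algorithm, makes useful improvements to address its shortcomings, proposes a new weighted association rule mining algorithm, and analyzes its main characteristics. Applying the algorithm to e-commerce data mining and comparing it with the standard Apriori algorithm demonstrates the effectiveness of the new weighted association rule mining algorithm.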
17.
Association Rule Mining (ARM) can be considered a combinatorial problem whose purpose is to extract the correlations between items in sizeable datasets. The numerous polynomial exact algorithms already proposed for ARM are ill-suited to large databases, especially those found on the web. Treating the datasets as a large search space, intelligent algorithms have been used to find high-quality rules and solve the ARM problem. This paper deals with a cooperative multi-swarm bat algorithm for association rule mining. It is based on the bat-inspired algorithm adapted to the rule discovery problem (BAT-ARM). The latter suffers from an absence of communication between bats in the population, which lessens the exploration of the search space. However, it has a powerful rule generation process which leads to perfect local search. Therefore, to maintain a good trade-off between diversification and intensification, our proposed approach introduces cooperative strategies between the swarms that have already proved their efficiency in multi-swarm optimization algorithms (Ring, Master-slave). Furthermore, we introduce a new topology called Hybrid that merges the Ring strategy with the Master-slave plan developed in our earlier work [23]. A series of experiments is carried out on nine well-known datasets in the ARM field, and the performance of the proposed approach is evaluated and compared with that of other recently published methods. The results show a clear superiority of our proposal over similar approaches in terms of time and rule quality. The analysis also shows competitive outcomes in terms of quality when compared with multi-objective optimization methods.
18.
In recent years, manufacturing processes have become more and more complex, and meeting high-yield target expectations and quickly identifying root-cause machinesets, the most likely sources of defective products, have also become essential issues. In this paper, we first define the root-cause machineset identification problem of analyzing correlations between combinations of machines and the defective products. We then propose the Root-cause Machine Identifier (RMI) method, which uses association rule mining to solve the problem efficiently and effectively. The experimental results on real datasets show that the actual root-cause machinesets are almost always ranked in the top 10 by the proposed RMI method.
19.
Privacy preservation in distributed databases is an active area of research. With the advancement of technology, massive amounts of data are continuously being collected and stored in distributed database applications. Indeed, temporal associations and correlations among items in large transactional datasets of distributed databases can help in many business decision-making processes. One such task is mining frequent itemsets and computing their association rules, which is a nontrivial issue. In a typical situation, multiple parties may wish to collaborate to extract interesting global information, such as frequent associations, without revealing their respective data to each other. This may be particularly useful in applications such as retail market basket analysis, medical research, and academia. In the proposed work, we aim to find frequent items and to develop a global association rule model based on the genetic algorithm (GA). The GA is used because of its inherent features, such as robustness with respect to local maxima/minima and its domain-independent nature as a search technique for large spaces, to find exact or approximate solutions to optimization and search problems. For privacy preservation of the data, the concept of a trusted third party with two offsets is used. The data are first anonymized at the local party's end, and then the aggregation and global association are done by the trusted third party. The proposed algorithms address various types of partitions, such as horizontal, vertical, and arbitrary.