Found 20 similar documents (search time: 15 ms)
1.
In this paper, we develop a semi-supervised regression algorithm to analyze data sets which contain both categorical and numerical attributes. This algorithm partitions the data sets into several clusters and at the same time fits a multivariate regression model to each cluster. This framework allows one to incorporate both multivariate regression models for numerical variables (supervised learning methods) and k-mode clustering algorithms for categorical variables (unsupervised learning methods). The estimates of regression models and k-mode parameters can be obtained simultaneously by minimizing a function which is the weighted sum of the least-square errors in the multivariate regression models and the dissimilarity measures among the categorical variables. Both synthetic and real data sets are presented to demonstrate the effectiveness of the proposed method.
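As a rough illustration of the objective described in this abstract, one possible form is sketched below. The notation is an assumption and is not taken from the paper: $x_i$ and $y_i$ are the numerical predictors and responses of record $i$, $c_{ij}$ its $j$-th categorical attribute, $u_{ik} \in \{0,1\}$ a cluster membership indicator, $B_k$ the regression coefficients of cluster $k$, $z_k$ the mode of cluster $k$, $\delta$ the simple matching dissimilarity, and $\gamma$ the weight between the two terms.

```latex
\min_{U,\,B,\,Z}\;
\sum_{k=1}^{K}\sum_{i=1}^{n} u_{ik}
\Bigl( \lVert y_i - B_k^{\top} x_i \rVert^{2}
     + \gamma \sum_{j=1}^{m} \delta(c_{ij}, z_{kj}) \Bigr),
\qquad
\delta(a,b) =
\begin{cases}
0, & a = b \\
1, & a \neq b
\end{cases}
```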
2.
Clustering consists of partitioning a set of objects into disjoint and homogeneous clusters. For many years, clustering methods have been applied in a wide variety of disciplines and scientific areas. Traditionally, clustering methods deal with numerical data, i.e. objects represented by a conjunction of numerical attribute values. However, commercial and scientific databases now usually contain categorical data, i.e. objects represented by categorical attributes. In this paper we present a dissimilarity measure that is capable of dealing with tree-structured categorical data. It can therefore be used to extend the various versions of the very popular k-means clustering algorithm to such data. We discuss how such an extension can be achieved. Moreover, we show empirically that the proposed dissimilarity measure is accurate compared to other well-known (dis)similarity measures for categorical data.
3.
4.
Hui-Ling Hu 《Information Sciences》2008,178(19):3683-3696
5.
In this paper, we propose an efficient graph-based mining (GBM) algorithm for mining the frequent trajectory patterns in a spatial-temporal database. The proposed method comprises two phases. First, we scan the database once to generate a mapping graph and trajectory information lists (TI-lists). Then, we traverse the mapping graph in a depth-first search manner to mine all frequent trajectory patterns in the database. By using the mapping graph and TI-lists, the GBM algorithm can localize support counting and pattern extension in a small number of TI-lists. Moreover, it utilizes the adjacency property to reduce the search space. Therefore, our proposed method can efficiently mine the frequent trajectory patterns in the database. The experimental results show that it outperforms the Apriori-based and PrefixSpan-based methods by more than one order of magnitude.
6.
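Research on and improvement of the Apriori algorithm for association rule mining   Total citations: 7, self-citations: 1, citations by others: 6
Association rule mining is an important task in data mining research; it aims to discover interesting associations in transaction databases. Apriori is the classic algorithm for association rule mining. However, Apriori suffers from inefficient candidate itemset generation and frequent scans of the data. This paper analyzes the principle and efficiency of the Apriori algorithm, points out some of its shortcomings, and proposes an improved algorithm, Apriori_LB. Based on a new data structure, the algorithm improves the join method used to generate candidate itemsets. After describing Apriori_LB in detail, the paper analyzes and compares Apriori and Apriori_LB. Experimental results show that the improved Apriori_LB algorithm outperforms Apriori, and the improvement is especially pronounced when mining transaction databases with a small minimum support or a small number of items.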
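For orientation, the baseline that this abstract criticizes can be sketched as follows. This is a minimal version of classic Apriori (join, prune, then one counting pass over the data per level), not the Apriori_LB algorithm, whose data structure the abstract does not specify. The function name `apriori` and its parameters are illustrative choices.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items in one pass.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = list(frequent)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        # Counting step: one full scan of the transactions per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result
```

The repeated scan in the counting step and the quadratic join are the two costs the abstract refers to as frequent data scans and inefficient candidate generation.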
7.
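Incremental frequent itemset mining is a current research hotspot. The FP-Growth-based Pre-FUFP algorithm handles updates of frequent patterns effectively, but it must traverse the FP-tree recursively, which makes it inefficient. This paper proposes the Pre-FIUT algorithm, which introduces the frequent items ultrametric tree (FIUT) structure to make frequent itemset mining more efficient. Built on FIUT, Pre-FIUT determines frequent itemsets by checking the support of the leaf nodes of the frequent items ultrametric tree, and combines this with the concept of sub-frequent itemsets to mine frequent itemsets incrementally. Experiments show that Pre-FIUT scans and updates data quickly, uses memory sensibly, and obtains frequent itemsets accurately.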
8.
9.
Finding centric local outliers in categorical/numerical spaces   Total citations: 2, self-citations: 0, citations by others: 2
Jeffrey Xu Yu, Weining Qian, Hongjun Lu, Aoying Zhou 《Knowledge and Information Systems》2006,9(3):309-338
Outlier detection techniques are widely used in many applications such as credit-card fraud detection, monitoring criminal
activities in electronic commerce, etc. These applications attempt to identify outliers as noises, exceptions, or objects
around the border. The existing density-based local outlier detection assigns the degree to which an object is an outlier
in a numerical space. In this paper, we propose a novel mutual-reinforcement-based local outlier detection approach. Instead
of detecting local outliers as noise, we attempt to identify local outliers in the center, where they are similar to some
clusters of objects on one hand, and are unique on the other. Our technique can be used for bank investment to identify a
unique body, similar to many good competitors, in which to invest. We attempt to detect local outliers in categorical, ordinal
as well as numerical data. In categorical data, the challenge is that there are many similar but different ways to specify
relationships among the data items. Our mutual-reinforcement-based approach is stable, with similar but different user-defined
relationships. Our technique can reduce the burden for users to determine the relationships among data items, and find the
explanations why the outliers are found. We conducted extensive experimental studies using real datasets.
Jeffrey Xu Yu received his B.E., M.E. and Ph.D. in computer science, from the University of Tsukuba, Japan, in 1985, 1987 and 1990, respectively.
Jeffrey Xu Yu was a research fellow in the Institute of Information Sciences and Electronics, University of Tsukuba (Apr.
1990–Mar. 1991), and held teaching positions in the Institute of Information Sciences and Electronics, University of Tsukuba
(Apr. 1991–July 1992) and in the Department of Computer Science, Australian National University (July 1992–June 2000). Currently
he is an Associate Professor in the Department of Systems Engineering and Engineering Management, Chinese University of Hong
Kong. His major research interests include data mining, data stream mining/processing, XML query processing and optimization,
data warehouse, on-line analytical processing, and design and implementation of database management systems.
Weining Qian is currently an assistant professor of computer science at Fudan University, Shanghai, China. He received his M.S. and Ph.D.
degrees in computer science from Fudan University in 2001 and 2004, respectively. He was supported by a Microsoft Research
Fellowship when he was doing the research presented in this paper, and he is supported by the Shanghai Rising Star Program.
His research interests include data mining for very large databases, data stream query processing and mining and peer-to-peer
computing.
Hongjun Lu received his B.Sc. from Tsinghua University, China, and M.Sc. and Ph.D. from the Department of Computer Science, University
of Wisconsin–Madison. He worked as an engineer in the Chinese Academy of Space Technology, and a principal research scientist
in the Computer Science Center of Honeywell Inc., Minnesota, USA (1985–1987), and a professor at the School of Computing of
the National University of Singapore (1987–2000), and is a full professor of the Hong Kong University of Science and Technology.
His research interests are in data/knowledge-base management systems with an emphasis on query processing and optimization,
physical database design, and database performance. Hongjun Lu is currently a trustee of the VLDB Endowment, an associate
editor of the IEEE Transactions on Knowledge and Data Engineering (TKDE), and a member of the review board of the Journal
of Database Management. He served as a member of the ACM SIGMOD Advisory Board in 1998–2002.
Aoying Zhou, born in 1965, is currently a professor of computer science at Fudan University, Shanghai, China. He received his Bachelor's
and Master's degrees in Computer Science from Sichuan University in Chengdu, Sichuan, China in 1985 and 1988, respectively, and
a Ph.D. degree from Fudan University in 1993. He has served as a member or chair of the program committees for many international
conferences such as VLDB, ER, DASFAA, and WAIM. His papers have been published in ACM SIGMOD, VLDB, ICDE and some international
journals. His research interests include data mining and knowledge discovery, XML data management, web query and searching,
data stream analysis and processing and peer-to-peer computing.
10.
《Expert systems with applications》2014,41(10):4505-4512
Node-list and N-list, two novel data structures proposed in recent years, have been proven to be very efficient for mining frequent itemsets. The main problem with these structures is that both need to encode each node of a PPC-tree with a pre-order and a post-order code. This makes them memory-consuming and inconvenient for mining frequent itemsets. In this paper, we propose Nodeset, a more efficient data structure, for mining frequent itemsets. Nodesets require only the pre-order (or post-order) code of each node, which saves half the memory compared with N-lists and Node-lists. Based on Nodesets, we present an efficient algorithm called FIN to mine frequent itemsets. To evaluate the performance of FIN, we have conducted experiments to compare it with PrePost and FP-growth, two state-of-the-art algorithms, on a variety of real and synthetic datasets. The experimental results show that FIN performs well in both running time and memory usage.
11.
A data stream is a massive and unbounded sequence of data elements that are continuously generated at a fast speed. Compared with traditional approaches, data mining in data streams is more challenging since several extra requirements need to be satisfied. In this paper, we propose a mining algorithm for finding frequent itemsets over a transactional data stream. Unlike most existing algorithms, our method is based on the theory of Approximate Inclusion–Exclusion. Without incrementally maintaining an overall synopsis of the stream, we can approximate the itemsets’ counts from certain kept information and a count-bounding technique. Some additional techniques are designed and integrated into the algorithm to improve performance. The performance of the proposed algorithm is tested and analyzed through a series of experiments.
12.
A core issue in association rule extraction in data mining is finding the frequent patterns in a database of operational transactions. If these patterns are discovered, decision making and strategy setting in organizations can be carried out with greater precision. A frequent pattern is a pattern seen in a significant number of transactions. Because such data are unbounded and produced at high speed, they cannot be stored in memory, so techniques are needed that process them online and find repetitive patterns. Several mining methods have been proposed in the literature that attempt to efficiently extract a complete or a closed set of different types of frequent patterns from a dataset. In this paper, a method based on Cellular Learning Automata (CLA) is presented for mining frequent itemsets. The proposed method is compared with the Apriori, FP-Growth and BitTable methods, and it is concluded that it mines frequent itemsets in less running time. The experiments are conducted on several experimental data sets with different values of minsup for all the algorithms, including the presented method. The results point to the effectiveness of the proposed method.
13.
Mining frequent itemsets is an essential problem in data mining and plays an important role in many data mining applications. In recent years, some itemset representations based on node sets have been proposed, which have been shown to be very efficient for mining frequent itemsets. In this paper, we propose DiffNodeset, a novel and more efficient itemset representation, for mining frequent itemsets. Based on the DiffNodeset structure, we present an efficient algorithm, named dFIN, to mine frequent itemsets. To achieve high efficiency, dFIN finds frequent itemsets using a set-enumeration tree with a hybrid search strategy and, in some cases, directly enumerates frequent itemsets without candidate generation. To evaluate the performance of dFIN, we have conducted extensive experiments to compare it against existing leading algorithms on a variety of real and synthetic datasets. The experimental results show that dFIN is significantly faster than these leading algorithms.
14.
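Self-organizing neural network for categorical data   Total citations: 1, self-citations: 2, citations by others: 1
As an excellent tool for clustering and dimensionality reduction, the self-organizing map (SOM, Self-Organizing Feature Maps) has been widely applied. Its shortcoming is that it is suited only to numerical data, which is inadequate for data mining applications that often need to handle categorical-valued data or mixed numeric and categorical-valued data. This paper proposes a new overlap-based distance function and uses it in SOM training. Experimental results show that good clustering results can be obtained without increasing time or space overhead.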
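The abstract does not give the paper's distance function. As a point of reference only, the textbook overlap (simple matching) dissimilarity for categorical records can be sketched as below; the name `overlap_distance` and the normalization by the number of attributes are assumptions, and the paper's own function may differ.

```python
def overlap_distance(x, y):
    """Fraction of attributes on which two categorical records disagree."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    if not x:
        return 0.0
    # Each mismatching attribute contributes 1; matching attributes contribute 0.
    return sum(a != b for a, b in zip(x, y)) / len(x)
```

A distance of this kind can stand in for the Euclidean distance when picking the best-matching unit, since it needs no ordering on the attribute values.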
15.
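Discovering association rules is an important aspect of data mining, and generating frequent itemsets is one of its key steps. This paper proposes an algorithm for fast frequent itemset mining based on an orthogonal (cross-linked) list. The algorithm scans the database only once, makes full use of existing information to generate frequent itemsets, and does not need to store candidate itemsets. Comparison with several other algorithms shows that it performs better.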
16.
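Analysis and study of maximal frequent itemset mining algorithms   Total citations: 2, self-citations: 0, citations by others: 2
This paper introduces the basics of frequent itemset mining, uses worked examples to analyze and compare two maximal frequent itemset mining algorithms, and points out the limitations of maximal frequent itemset mining algorithms. It then describes the characteristics of such algorithms and ways to optimize them.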
17.
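Pei Guying 《Automation & Instrumentation》2009,(5):16-18
Discovering association rules is an important problem in data mining, and its core is frequent pattern mining. The commonly used Apriori algorithm must scan the database many times and generates a large number of candidate itemsets, which is very costly. This paper adopts an association mining algorithm based on a Boolean matrix, which scans the database only once and does not need a join step to generate candidate itemsets, thereby improving efficiency. An example shows that it is an effective association rule mining method.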
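The general idea of mining from a Boolean matrix can be sketched as follows. This is an assumed illustration, not the paper's algorithm: each item's column of the transaction-by-item matrix is stored as a Python integer bitmask, built in a single scan, and the support of an itemset is the number of set bits in the AND of its columns. The function names are hypothetical.

```python
def build_item_columns(transactions):
    """Single scan: map each item to an int whose bit i is set
    if transaction i contains that item (one Boolean column per item)."""
    columns = {}
    for i, t in enumerate(transactions):
        for item in set(t):
            columns[item] = columns.get(item, 0) | (1 << i)
    return columns


def frequent_itemsets(transactions, min_support):
    """Return {tuple_of_items: support_count}, using only column ANDs."""
    columns = build_item_columns(transactions)
    items = sorted(
        item for item, col in columns.items()
        if bin(col).count("1") >= min_support
    )
    result = {}

    def extend(prefix, prefix_col, start):
        # Depth-first extension: AND the prefix column with each later item.
        for idx in range(start, len(items)):
            col = prefix_col & columns[items[idx]]
            support = bin(col).count("1")
            if support >= min_support:
                itemset = prefix + (items[idx],)
                result[itemset] = support
                extend(itemset, col, idx + 1)

    all_rows = (1 << len(transactions)) - 1
    extend((), all_rows, 0)
    return result
```

After the initial scan, the transactions are never read again; all support counts come from the in-memory columns, which is the property the abstract emphasizes.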
18.
《Expert systems with applications》2014,41(6):2914-2938
Multilevel knowledge in transactional databases plays a significant role in real-life market basket analysis. Many researchers have mined hierarchical association rules and proposed various approaches. However, some of the existing approaches produce many multilevel and cross-level association rules that fail to convey quality information. From this large number of redundant association rules, it is extremely difficult to extract any meaningful information. There also exist some approaches that mine minimal association rules, but these have many shortcomings due to their naïve approaches. In this paper, we focus on the need to generate hierarchical minimal rules that provide maximal information. An algorithm is proposed to derive minimal multilevel association rules and cross-level association rules. Our work makes significant contributions to mining the minimal cross-level association rules, which express the mixed relationship between the generalized and specialized views of the transaction itemsets. We are the first to design an efficient algorithm using a closed itemset lattice-based approach, which can mine the most relevant minimal cross-level association rules. The parent–child relationship of the lattices is exploited while mining cross-level closed itemset lattices. We have extensively evaluated the efficiency of the proposed algorithm using a variety of real-life datasets and a large number of experiments. The proposed algorithm significantly outperformed the existing related work in an extensive performance comparison.
19.
A new concise representation of frequent itemsets using generators and a positive border   Total citations: 2, self-citations: 2, citations by others: 0
A complete set of frequent itemsets can get undesirably large due to redundancy when the minimum support threshold is low
or when the database is dense. Several concise representations have been previously proposed to eliminate the redundancy.
Generator based representations rely on a negative border to make the representation lossless. However, the number of itemsets
on a negative border sometimes even exceeds the total number of frequent itemsets. In this paper, we propose to use a positive
border together with frequent generators to form a lossless representation. A positive border is usually orders of magnitude
smaller than its corresponding negative border. A set of frequent generators plus its positive border is always no larger
than the corresponding complete set of frequent itemsets, thus it is a true concise representation. The generalized form of
this representation is also proposed. We develop an efficient algorithm, called GrGrowth, to mine generators and positive
borders as well as their generalizations. The GrGrowth algorithm uses the depth-first-search strategy to explore the search
space, which is much more efficient than the breadth-first-search strategy adopted by most of the existing generator mining
algorithms. Our experiment results show that the GrGrowth algorithm is significantly faster than level-wise algorithms for
mining generator based representations, and is comparable to the state-of-the-art algorithms for mining frequent closed itemsets.
Guimei Liu
20.
Frequent itemset mining is an important problem in the data mining area with a wide range of applications. Many decision support systems need to support online interactive frequent itemset mining, which is a challenging task because frequent itemset mining is a computation-intensive, repetitive process. One solution is to precompute frequent itemsets. In this paper, we propose a compact disk-based data structure, the CFP-tree, to store precomputed frequent itemsets on a disk to support online mining requests. The CFP-tree structure effectively utilizes the redundancy in frequent itemsets to save space. The compression ratio of a CFP-tree can be as high as several thousand or even higher. Efficient algorithms for retrieving frequent itemsets from a CFP-tree, as well as efficient algorithms to construct and maintain a CFP-tree, are developed. Our performance study demonstrates that with a CFP-tree, frequent itemset mining requests can be responded to promptly.