Similar Documents
19 similar documents found.
1.
A New Algorithm for Mining Frequent Patterns Directly in a Trans-Tree   (total citations: 5; self-citations: 1; citations by others: 5)
Fan Ming, Wang Bingzheng. Computer Science (《计算机科学》), 2003, 30(8): 117-120
Frequent pattern mining plays an essential role in many important data mining tasks. FP-growth is a very efficient algorithm for frequent pattern mining. However, it still suffers from creating conditional FP-trees separately and recursively during the mining process. In this paper, we propose a new algorithm, called Least-Item-First Pattern Growth (LIFPG), for mining frequent patterns. LIFPG mines frequent patterns directly in the Trans-tree without using any additional data structures. The key idea is that the least item is always considered first when growing the current pattern. In this way, a conditional sub-tree can be created directly in the Trans-tree by adjusting node-links and re-counting the counts of some nodes. Experiments show that, in comparison with FP-growth, our algorithm is about four times faster and uses half as much memory; it also has good time and space scalability with respect to the number of transactions, and performs very well on dense datasets.
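A minimal sketch of the least-item-first idea (my own illustration, not the authors' Trans-tree implementation): frequent items are ordered by ascending support so that the least frequent item is considered first when a pattern is grown, mirroring the bottom-up order used by FP-growth-style algorithms. The transactions, threshold, and helper name below are hypothetical.

from collections import Counter

# Hypothetical illustration (not the authors' Trans-tree code): order frequent
# items by ascending support so the least frequent item is grown first.
def least_item_first_order(transactions, min_support):
    counts = Counter(item for t in transactions for item in t)
    frequent = {item: c for item, c in counts.items() if c >= min_support}
    # items with the smallest support come first in the growth order
    return sorted(frequent, key=lambda item: frequent[item])

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a"}]
print(least_item_first_order(transactions, min_support=2))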

2.
Previous research has presented convincing arguments that a frequent pattern mining algorithm should mine not all frequent patterns but only the closed ones, because the latter lead not only to a more compact yet complete result set but also to better efficiency. Upon discovery of frequent closed XML query patterns, indexing and caching can be effectively adopted for query performance enhancement. Most previous algorithms for finding frequent patterns basically introduced a straightforward generate-and-test strategy. In this paper, we present SOLARIA*, an efficient algorithm for mining frequent closed XML query patterns without candidate maintenance and costly tree-containment checking. An efficient sequence mining algorithm is used to discover frequent tree-structured patterns, replacing expensive containment testing with cheap parent-child checking on sequences. SOLARIA* deeply prunes unrelated search space during frequent pattern enumeration with a parent-child relationship constraint. Through a thorough experimental study on various real-life data, we demonstrate the efficiency and scalability of SOLARIA* over the previously known alternative. SOLARIA* is also linearly scalable with respect to the size of XML queries.

3.
Mining frequent patterns from datasets is one of the key successes of data mining research. Currently, most studies focus on data sets in which the elements are independent, such as the items in a market basket. However, objects in the real world often have close relationships with each other. How to extract frequent patterns from these relations is the objective of this paper. The authors use graphs to model the relations, and select a simple type for analysis. Combining graph theory with algorithms for generating frequent patterns, a new algorithm called Topology, which can mine these graphs efficiently, is proposed. The performance of the algorithm is evaluated through experiments with synthetic datasets and real data. The experimental results show that Topology does the job well. At the end of this paper, potential improvements are discussed.

4.
Mining frequent itemsets from large databases has played an essential role in many data mining tasks. It is also important to maintain the discovered frequent itemsets for these data mining tasks when the database is updated. All maintenance algorithms proposed so far for discovered frequent itemsets work only with a fixed minimum support, the same one used to obtain the discovered frequent itemsets. That is, users cannot change the minimum support even if the new results are unsatisfactory. In this paper two new complementary algorithms, FMP (First Maintaining Process) and RMP (Repeated Maintaining Process), are proposed to maintain discovered frequent itemsets when new transaction data are added to a transaction database. Both algorithms allow users to change the minimum support for the maintenance process. FMP is used for the first maintaining process, and when the result derived from FMP is unsatisfactory, RMP is performed repeatedly until satisfactory results are obtained. The proposed algorithms reuse the previous results to cut down the cost of maintenance. Extensive experiments have been conducted to assess the performance of the algorithms. The experimental results show that the proposed algorithms are very effective compared with previous mining and maintenance algorithms for maintaining discovered frequent itemsets.

5.
Mining frequent itemsets has emerged as a fundamental problem in data mining and plays an essential role in many important data mining tasks. In this paper, we propose a novel vertical data representation called N-list, which originates from an FP-tree-like coding prefix tree called PPC-tree that stores crucial information about frequent itemsets. Based on the N-list data structure, we develop an efficient mining algorithm, PrePost, for mining all frequent itemsets. The efficiency of PrePost comes from three sources. First, N-lists are compact, since transactions with common prefixes share the same nodes of the PPC-tree. Second, the counting of itemset supports is transformed into the intersection of N-lists, and the complexity of intersecting two N-lists can be reduced to O(m + n) by an efficient strategy, where m and n are the cardinalities of the two N-lists respectively. Third, PrePost can in some cases directly find frequent itemsets without generating candidate itemsets, by making use of the single-path property of N-lists. We have experimentally evaluated PrePost against four state-of-the-art algorithms for mining frequent itemsets on a variety of real and synthetic datasets. The experimental results show that the PrePost algorithm is the fastest in most cases. Even though the algorithm consumes more memory when the datasets are sparse, it is still the fastest one.
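The linear-time intersection can be pictured as a merge of two lists sorted by pre-order position. The sketch below is a simplified illustration under the usual PPC-tree encoding (each N-list entry is a (pre-order, post-order, count) triple, and a node is an ancestor of another if its pre-order is smaller and its post-order is larger); the function names are my own, not taken from the paper.

# Simplified sketch of an O(m + n) N-list intersection (illustration only,
# assuming entries are (pre_order, post_order, count) triples sorted by pre_order).
def intersect_nlists(nl1, nl2):
    result = []          # N-list of the longer itemset
    i = j = 0
    while i < len(nl1) and j < len(nl2):
        pre1, post1, _ = nl1[i]
        pre2, post2, count2 = nl2[j]
        if pre1 < pre2 and post1 > post2:
            # nl1[i] is an ancestor of nl2[j]: credit the descendant's count to it
            if result and result[-1][0] == pre1:
                result[-1] = (pre1, post1, result[-1][2] + count2)
            else:
                result.append((pre1, post1, count2))
            j += 1
        elif pre1 < pre2:
            i += 1        # nl1[i] closes before nl2[j] starts; move on in nl1
        else:
            j += 1        # nl2[j] precedes nl1[i]; no ancestor of it remains in nl1
    return result

def support(nlist):
    # the support of an itemset is the sum of the counts in its N-list
    return sum(count for _, _, count in nlist)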

6.
Finding correlated sequential patterns in large sequence databases is one of the essential tasks in data mining, since a huge number of sequential patterns are usually mined but it is hard to find the ones that are actually correlated. Different real applications also require different kinds of data analysis. In previous mining approaches, after the sequential patterns are mined, patterns with weak affinity are still found, even with a high minimum support. In this paper, a new framework is suggested for mining weighted support affinity patterns, in which an objective measure, sequential ws-confidence, is developed to detect correlated sequential patterns with weighted support affinity. To efficiently prune weak affinity patterns, it is proved that the ws-confidence measure satisfies the anti-monotone and cross weighted support properties, which can be applied to eliminate sequential patterns with dissimilar weighted support levels. Based on this framework, a weighted support affinity pattern mining algorithm (WSMiner) is suggested. The performance study shows that WSMiner is efficient and scalable for mining weighted support affinity patterns.

7.
Developing an efficient algorithm that can maintain discovered information as a database changes is quite important in data mining. Many proposed algorithms focus on a single level and do not utilize previously mined information in incrementally growing databases. In the past, we proposed an incremental mining algorithm for maintaining multiple-level association rules as new transactions were inserted. Deletion of records in databases is, however, commonly seen in real-world applications. In this paper, we thus attempt to extend our previous approach to solve this issue. The concept of pre-large itemsets is used to reduce the need for rescanning original databases and to save maintenance costs. A pre-large itemset is not truly large, but promises to be large in the future. A lower support threshold and an upper support threshold are used to realize this concept. The two user-specified thresholds make the pre-large itemsets act as a gap that keeps small itemsets from directly becoming large in the updated database when transactions are deleted. A new algorithm is thus proposed, based on this concept, to maintain discovered multiple-level association rules under deletion of records. The proposed algorithm does not need to rescan the original database until a certain number of records have been deleted, and can thus save much maintenance time.
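A minimal sketch of the two-threshold idea (my own illustration; the threshold values and helper name are hypothetical, and the paper's full maintenance procedure over deleted records is not shown): an itemset is large, pre-large, or small depending on where its support falls relative to the upper and lower thresholds.

# Hypothetical illustration of the pre-large concept (not the paper's full
# deletion-maintenance algorithm): classify an itemset by its support ratio.
def classify_itemset(support_count, db_size, lower_threshold, upper_threshold):
    support = support_count / db_size
    if support >= upper_threshold:
        return "large"       # truly frequent
    if support >= lower_threshold:
        return "pre-large"   # not large yet, but may become large as the database changes
    return "small"

# e.g. 45 occurrences in 1,000 transactions with thresholds of 3% and 5%
print(classify_itemset(45, 1000, lower_threshold=0.03, upper_threshold=0.05))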

8.
Web usage mining is the application of data mining techniques to large web log databases in order to extract usage patterns. However, most previous studies on usage pattern discovery focus only on mining intra-transaction associations, i.e., the associations among items within the same user's transactions. A cross-transaction association rule describes the association relationships among different users' transactions. In this paper, the closure property of frequent itemsets is used to mine cross-transaction association rules from web log databases: the set of frequent closed itemsets determines the complete set of frequent itemsets exactly and is usually much smaller. We give the basic notion of frequent cross-transaction closed itemsets and prove the related theory. An efficient algorithm, MFCCPS (Mining Frequent Cross-Transaction Closed Pageviews Sets), is designed and implemented. Finally, an extensive experimental study on two synthetic datasets shows that our approach outperforms previous methods.
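A minimal sketch of the closure property the paper relies on (my own illustration, not the MFCCPS algorithm): a frequent itemset is closed when no proper superset has the same support, which is why the closed itemsets alone are enough to recover every frequent itemset and its support.

# Hypothetical illustration of frequent closed itemsets (not the MFCCPS code):
# an itemset is closed iff no proper superset has exactly the same support.
def closed_itemsets(frequent):
    """frequent: dict mapping frozenset itemsets to their support counts."""
    return {
        itemset: sup
        for itemset, sup in frequent.items()
        if not any(itemset < other and other_sup == sup
                   for other, other_sup in frequent.items())
    }

frequent = {
    frozenset({"a"}): 4,
    frozenset({"b"}): 3,
    frozenset({"a", "b"}): 3,   # {"b"} is not closed: {"a","b"} has the same support
}
print(closed_itemsets(frequent))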

9.
Graphics processing units (GPUs) have an SIMD architecture and have recently been widely used as powerful general-purpose co-processors for the CPU. In this paper, we investigate efficient GPU-based data cubing, because the most frequent operation in data cube computation is aggregation, an expensive operation well suited for SIMD parallel processors. The H-tree is a hyper-linked tree structure used in both top-k H-cubing and the stream cube. Fast H-tree construction, update and real-time query response are crucial in many OLAP applications. We design highly efficient GPU-based parallel algorithms for these H-tree based data cube operations. This is made possible by effective methods, such as parallel primitives for segmented data and efficient memory access patterns, that achieve load balance on the GPU while hiding memory access latency. As a result, our GPU algorithms can often achieve more than an order of magnitude speedup compared with their sequential counterparts on a single CPU. To the best of our knowledge, this is the first attempt to develop parallel data cubing algorithms on graphics processors.

10.
Mining sequential patterns from large databases has been recognized by many researchers as an attractive task of data mining and knowledge discovery. Previous algorithms scan the databases many times, which is often intolerable because of the very large size of the databases. In this paper, the authors introduce an effective algorithm for mining sequential patterns from large databases. In the algorithm, the original database is not used at all for counting the support of sequences after the first pass. Rather, a tidlist structure generated in the previous pass is used for this purpose, based on set intersection operations, avoiding multiple scans of the databases.
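A minimal sketch of support counting with tidlists (my own illustration with made-up data; for simplicity it treats patterns as unordered itemsets, whereas the paper mines sequential patterns): the support of a longer pattern is obtained by intersecting the tidlists of its components instead of rescanning the database.

from functools import reduce

# Hypothetical illustration (not the paper's algorithm): each item keeps a
# tidlist, the set of transaction ids containing it; the support of a pattern
# is the size of the intersection of its items' tidlists.
tidlists = {
    "a": {1, 2, 3, 5},
    "b": {2, 3, 4},
    "c": {1, 3, 5},
}

def support(pattern, tidlists):
    return len(reduce(set.intersection, (tidlists[item] for item in pattern)))

print(support(["a", "b"], tidlists))       # 2 -> transactions 2 and 3
print(support(["a", "b", "c"], tidlists))  # 1 -> transaction 3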

11.
In this paper, we report our success in building efficient scalable classifiers by exploring the capabilities of modern relational database management systems (RDBMS). In addition to high classification accuracy, the unique features of the approach include its high training speed, linear scalability, and simplicity of implementation. More importantly, the major computation required by the approach can be implemented using standard functions provided by a modern relational DBMS. Besides, with an effective rule pruning strategy, the algorithm proposed in this paper can produce a compact set of classification rules. The results of experiments conducted for performance evaluation and analysis are presented.

12.
Classification is an important technique in data mining. The decision trees built by most existing classification algorithms commonly feature over-branching, which leads to poor efficiency in the subsequent classification period. In this paper, we present a new value-oriented classification method, based on the concepts of frequent-pattern-node and exceptive-child-node, which aims at building accurate, properly sized decision trees while reducing over-branching as much as possible. The experiments show that, using relevance analysis as pre-processing, our classification method can greatly reduce over-branching in decision trees, and does so more effectively and efficiently than other algorithms, without loss of accuracy.

13.
In this paper, we study the problem of efficiently computing k-medians over high-dimensional and high-speed data streams. The focus of this paper is on minimizing CPU time to handle high-speed data streams, on top of the requirements of high accuracy and small memory. Our work is motivated by the following observation: the existing algorithms have similar approximation behaviors in practice, even though they make noticeably different worst-case theoretical guarantees. The underlying reason is that, in order to achieve a high approximation level with the smallest possible memory, they need rather complex techniques to maintain a sketch, along the time dimension, by using existing off-line clustering algorithms. Those clustering algorithms cannot guarantee the optimal clustering result over data segments in a data stream and accumulate errors over segments, which makes most algorithms behave the same in terms of approximation level in practice. We propose a new grid-based approach which divides the entire data set into cells (not along the time dimension). We can achieve a high approximation level based on a novel concept called (1 - ε)-dominant. We further extend the method to the data stream context by leveraging a density-based heuristic and frequent item mining techniques over data streams. We only need to apply an existing clustering algorithm once to compute k-medians, on demand, which reduces CPU time significantly. We conducted extensive experimental studies and show that our approaches outperform other well-known approaches.

14.
The recent increase in HyperText Transfer Protocol (HTTP) traffic on the World Wide Web (WWW) has generated an enormous amount of log records on Web server databases. Applying Web mining techniques to these server log records can discover potentially useful patterns and reveal user access behaviors on the Web site. In this paper, we propose a new approach for mining user access patterns for predicting Web page requests, which consists of two steps. First, the Minimum Reaching Distance (MRD) algorithm is applied to find the distances between the Web pages. Second, the association rule mining technique is applied to form a set of predictive rules, and the MRD information is used to prune the results of the association rule mining process. Experimental results from a real Web data set show that our approach improves on the existing Markov-model approach in precision, recall, and the reduction of user browsing time.

Mei-Ling Shyu received her Ph.D. degree from the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN in 1999, and three Master's degrees in Computer Science, Electrical Engineering, and Restaurant, Hotel, Institutional, and Tourism Management from Purdue University. She has been an Associate Professor in the Department of Electrical and Computer Engineering (ECE) at the University of Miami (UM), Coral Gables, FL, since June 2005. Prior to that, she was an Assistant Professor in ECE at UM dating from January 2000. Her research interests include data mining, multimedia database systems, multimedia networking, database systems, and security. She has authored and co-authored more than 120 technical papers published in various prestigious journals, refereed conference/symposium/workshop proceedings, and book chapters. She is or was the guest editor of several journal special issues. Choochart Haruechaiyasak received his Ph.D. degree from the Department of Electrical and Computer Engineering, University of Miami, in 2003 with the Outstanding Departmental Graduating Student award from the College of Engineering. After receiving his degree, he joined the National Electronics and Computer Technology Center (NECTEC), located in Thailand Science Park, as a researcher in the Information Research and Development Division (RDI). His current research interests include data/text/Web mining, Natural Language Processing, Information Retrieval, Search Engines, and Recommender Systems. He is currently leading a small group of researchers and programmers to develop an open-source search engine for the Thai language. One of his objectives is to promote the use of data mining technology and other advanced applications of Information Technology in Thailand. He is also a visiting lecturer for Data Mining, Artificial Intelligence and Decision Support Systems courses at many universities in Thailand. Shu-Ching Chen received his Ph.D. from the School of Electrical and Computer Engineering at Purdue University, West Lafayette, IN, USA in December 1998. He also received Master's degrees in Computer Science, Electrical Engineering, and Civil Engineering from Purdue University. He has been an Associate Professor in the School of Computing and Information Sciences (SCIS), Florida International University (FIU) since August 2004. Prior to that, he was an Assistant Professor in SCIS at FIU dating from August 1999. His main research interests include distributed multimedia database systems and multimedia data mining. Dr. Chen has authored and co-authored more than 140 research papers in journals, refereed conference/symposium/workshop proceedings, and book chapters. In 2005, he was awarded the IEEE Systems, Man, and Cybernetics Society's Outstanding Contribution Award. He was also awarded a University Outstanding Faculty Research Award from FIU in 2004, an Outstanding Faculty Service Award from SCIS in 2004, and an Outstanding Faculty Research Award from SCIS in 2002.

15.
With the growing popularity of the World Wide Web, a large volume of user access data has been gathered automatically by Web servers and stored in Web logs. Discovering and understanding user behavior patterns from log files can provide Web personalized recommendation services. In this paper, a novel clustering method for log files, called Clustering large Weblog based on Key Path Model (CWKPM), is presented; it is based on a user browsing key path model and is used to obtain user behavior profiles. Compared with the previous Boolean model, the key path model considers the major features of users' access to the Web: it is ordered, contiguous, and may contain duplicates. Moreover, for clustering, it has fewer dimensions. The analysis and experiments show that CWKPM is an efficient and effective approach for clustering large and high-dimensional Web logs.

16.
Frequent itemset mining was initially proposed, and has been studied extensively, in the context of association rule mining. In recent years, several studies have also extended its application to transaction or document clustering. However, most frequent-itemset-based clustering algorithms need to first mine a large intermediate set of frequent itemsets in order to identify a subset of the most promising ones that can be used for clustering. In this paper, we study how to directly find a subset of high-quality frequent itemsets that can be used as a concise summary of the transaction database and to cluster the categorical data. By exploring key properties of the subset of itemsets that we are interested in, we propose several search space pruning methods and design an efficient algorithm called SUMMARY. Our empirical results show that SUMMARY runs very fast even when the minimum support is extremely low and scales very well with respect to the database size, and, surprisingly, as a pure frequent itemset mining algorithm it is very effective in clustering the categorical data and summarizing dense transaction databases.

Jianyong Wang received the Ph.D. degree in computer science in 1999 from the Institute of Computing Technology, Chinese Academy of Sciences. Since then, he has worked as an assistant professor in the Department of Computer Science and Technology, Peking (Beijing) University, in the areas of distributed systems and Web search engines, and has visited the School of Computing Science at Simon Fraser University, the Department of Computer Science at the University of Illinois at Urbana-Champaign, and the Digital Technology Center and the Department of Computer Science at the University of Minnesota, mainly working in the area of data mining. He is currently an associate professor in the Department of Computer Science and Technology at Tsinghua University, P.R. China. George Karypis received his Ph.D. degree in computer science at the University of Minnesota and is currently an associate professor in the Department of Computer Science and Engineering at the University of Minnesota. His research interests span the areas of parallel algorithm design, data mining, bioinformatics, information retrieval, applications of parallel processing in scientific computing and optimization, sparse matrix computations, parallel preconditioners, and parallel programming languages and libraries. His research has resulted in the development of software libraries for serial and parallel graph partitioning (METIS and ParMETIS), hypergraph partitioning (hMETIS), parallel Cholesky factorization (PSPASES), collaborative filtering-based recommendation algorithms (SUGGEST), clustering high-dimensional datasets (CLUTO), and finding frequent patterns in diverse datasets (PAFI). He has coauthored over ninety journal and conference papers on these topics and a book titled "Introduction to Parallel Computing" (Addison Wesley, 2003, 2nd edition). In addition, he serves on the program committees of many conferences and workshops on these topics and is an associate editor of the IEEE Transactions on Parallel and Distributed Systems.

17.
Data mining can dig out valuable information from databases to assist a business in approaching knowledge discovery and improving business intelligence. Databases store large amounts of structured data, and the amount of data keeps increasing with advances in database technology and the extensive use of information systems. Despite the falling price of storage devices, it is still important to develop efficient techniques for database compression. This paper develops a database compression method that eliminates redundant data, which often exist in transaction databases. The proposed approach uses a data mining structure to extract association rules from a database. Redundant data are then replaced by means of compression rules. A heuristic method is designed to resolve conflicts between the compression rules. To prove its efficiency and effectiveness, the proposed approach is compared with two other database compression methods.

Chin-Feng Lee is an associate professor with the Department of Information Management at Chaoyang University of Technology, Taiwan, R.O.C. She received her M.S. and Ph.D. degrees in 1994 and 1998, respectively, from the Department of Computer Science and Information Engineering at National Chung Cheng University. Her current research interests include database design, image processing and data mining techniques. S. Wesley Changchien is a professor with the Institute of Electronic Commerce at National Chung-Hsing University, Taiwan, R.O.C. He received a BS degree in Mechanical Engineering (1989) and completed his MS (1993) and Ph.D. (1996) degrees in Industrial Engineering at the State University of New York at Buffalo, USA. His current research interests include electronic commerce, internet/database marketing, knowledge management, data mining, and decision support systems. Jau-Ji Shen received his Ph.D. degree in Information Engineering and Computer Science from National Taiwan University, Taipei, Taiwan, in 1988. From 1988 to 1994, he was the leader of the software group in the Institute of Aeronautics, Chung-Sung Institute of Science and Technology. He is currently an associate professor in the Department of Information Management at National Chung Hsing University, Taichung. His research areas focus on digital multimedia, databases and information security, with current emphasis on data engineering, database techniques and information security. Wei-Tse Wang received the B.A. (2001) and M.B.A. (2003) degrees in Information Management at Chaoyang University of Technology, Taiwan, R.O.C. His research interests include data mining, XML, and database compression.
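A minimal sketch of rule-based redundancy elimination (my own illustration with a made-up rule; the paper's mining structure and its conflict-resolution heuristic are not shown): a compression rule whose antecedent implies its consequent lets the implied values be dropped from matching records and restored on decompression.

# Hypothetical illustration (not the paper's method): use an association-rule-like
# compression rule to drop attribute values implied by other values, and restore
# them when decompressing. Records that contradict the rule are left untouched.
rule = ({"city": "Taipei"}, {"country": "Taiwan"})   # antecedent -> consequent

def compress(record, rule):
    antecedent, consequent = rule
    matches = lambda cond: all(record.get(k) == v for k, v in cond.items())
    if matches(antecedent) and matches(consequent):
        return {k: v for k, v in record.items() if k not in consequent}
    return record

def decompress(record, rule):
    antecedent, consequent = rule
    if all(record.get(k) == v for k, v in antecedent.items()):
        # fill in implied values only where the attribute is missing
        return {**consequent, **record}
    return record

original = {"city": "Taipei", "country": "Taiwan", "amount": 10}
compressed = compress(original, rule)
print(compressed)                                 # implied country value dropped
print(decompress(compressed, rule) == original)   # True: lossless round trip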

18.
This paper presents a metamodel for modeling system features and the relationships between features. The underlying idea of this metamodel is to employ features as first-class entities in the problem space of software and to improve the customization of software by explicitly specifying both static and dynamic dependencies between system features. In this metamodel, features are organized into hierarchical structures by refinement relationships, static dependencies between features are specified by constraint relationships, and dynamic dependencies between features are captured by interaction relationships. A first-order-logic-based method is proposed to formalize constraints and to verify constraints and customization. This paper also presents a framework for interaction classification, and an informal mapping between interactions and constraints through constraint semantics.

Hong Mei received the BSc and MSc degrees in computer science from the Nanjing University of Aeronautics and Astronautics (NUAA), China, in 1984 and 1987, respectively, and the PhD degree in computer science from Shanghai Jiao Tong University in 1992. He is currently a professor of Computer Science at Peking University, China. His current research interests include Software Engineering and Software Engineering Environments, Software Reuse and Software Component Technology, Distributed Object Technology, and Programming Languages. He has published more than 100 technical papers. Wei Zhang received the BSc in Engineering Thermophysics and the MSc in Computer Science from the Nanjing University of Aeronautics and Astronautics (NUAA), China, in 1999 and 2002, respectively. He is currently a PhD student at the School of Electronics Engineering and Computer Science of Peking University, China. His research interests include feature-oriented requirements modeling, feature-driven software architecture design and feature-oriented software reuse. Haiyan Zhao received both the BSc and MSc degrees in Computer Science from Peking University, China, and the PhD degree in Information Engineering from the University of Tokyo, Japan. She is currently an associate professor of Computer Science at Peking University, China. Her research interests include Software Reuse, Domain Engineering, Domain Specific Languages and Program Transformation.

19.
Finding centric local outliers in categorical/numerical spaces   (total citations: 2; self-citations: 0; citations by others: 2)
Outlier detection techniques are widely used in many applications, such as credit-card fraud detection and monitoring criminal activities in electronic commerce. These applications attempt to identify outliers as noise, exceptions, or objects around the border. Existing density-based local outlier detection assigns the degree to which an object is an outlier in a numerical space. In this paper, we propose a novel mutual-reinforcement-based local outlier detection approach. Instead of detecting local outliers as noise, we attempt to identify local outliers in the center, which are similar to some clusters of objects on one hand, yet unique on the other. Our technique can be used in bank investment to identify a unique body, similar to many good competitors, in which to invest. We attempt to detect local outliers in categorical and ordinal as well as numerical data. In categorical data, the challenge is that there are many similar but different ways to specify relationships among the data items. Our mutual-reinforcement-based approach is stable under similar but different user-defined relationships. Our technique can reduce the burden on users to determine the relationships among data items, and can explain why the outliers are found. We conducted extensive experimental studies using real datasets.

Jeffrey Xu Yu received his B.E., M.E. and Ph.D. in computer science from the University of Tsukuba, Japan, in 1985, 1987 and 1990, respectively. He was a research fellow in the Institute of Information Sciences and Electronics, University of Tsukuba (Apr. 1990–Mar. 1991), and held teaching positions in the Institute of Information Sciences and Electronics, University of Tsukuba (Apr. 1991–July 1992) and in the Department of Computer Science, Australian National University (July 1992–June 2000). Currently he is an Associate Professor in the Department of Systems Engineering and Engineering Management, Chinese University of Hong Kong. His major research interests include data mining, data stream mining/processing, XML query processing and optimization, data warehousing, on-line analytical processing, and the design and implementation of database management systems. Weining Qian is currently an assistant professor of computer science at Fudan University, Shanghai, China. He received his M.S. and Ph.D. degrees in computer science from Fudan University in 2001 and 2004, respectively. He was supported by a Microsoft Research Fellowship when he was doing the research presented in this paper, and he is supported by the Shanghai Rising Star Program. His research interests include data mining for very large databases, data stream query processing and mining, and peer-to-peer computing. Hongjun Lu received his B.Sc. from Tsinghua University, China, and his M.Sc. and Ph.D. from the Department of Computer Science, University of Wisconsin–Madison. He worked as an engineer in the Chinese Academy of Space Technology, as a principal research scientist in the Computer Science Center of Honeywell Inc., Minnesota, USA (1985–1987), and as a professor at the School of Computing of the National University of Singapore (1987–2000), and is a full professor at the Hong Kong University of Science and Technology. His research interests are in data/knowledge-base management systems with an emphasis on query processing and optimization, physical database design, and database performance. Hongjun Lu is currently a trustee of the VLDB Endowment, an associate editor of the IEEE Transactions on Knowledge and Data Engineering (TKDE), and a member of the review board of the Journal of Database Management. He served as a member of the ACM SIGMOD Advisory Board in 1998–2002. Aoying Zhou, born in 1965, is currently a professor of computer science at Fudan University, Shanghai, China. He received his Bachelor's and Master's degrees in Computer Science from Sichuan University, Chengdu, Sichuan, China, in 1985 and 1988, respectively, and a Ph.D. degree from Fudan University in 1993. He has served as a member or chair of the program committees of many international conferences such as VLDB, ER, DASFAA, and WAIM. His papers have been published in ACM SIGMOD, VLDB, ICDE, and some international journals. His research interests include data mining and knowledge discovery, XML data management, web query and searching, data stream analysis and processing, and peer-to-peer computing.
