Similar Documents
20 similar documents found.
1.
DSM-FI: an efficient algorithm for mining frequent itemsets in data streams (Cited: 4; self-citations: 4; by others: 0)
Online mining of data streams is an important data mining problem with broad applications. However, it is also a difficult problem since the streaming data possess some inherent characteristics. In this paper, we propose a new single-pass algorithm, called DSM-FI (data stream mining for frequent itemsets), for online incremental mining of frequent itemsets over a continuous stream of online transactions. According to the proposed algorithm, each transaction of the stream is projected into a set of sub-transactions, and these sub-transactions are inserted into a new in-memory summary data structure, called SFI-forest (summary frequent itemset forest) for maintaining the set of all frequent itemsets embedded in the transaction data stream generated so far. Finally, the set of all frequent itemsets is determined from the current SFI-forest. Theoretical analysis and experimental studies show that the proposed DSM-FI algorithm uses stable memory, makes only one pass over an online transactional data stream, and outperforms the existing algorithms of one-pass mining of frequent itemsets.
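To make the projection step concrete, here is a minimal Python sketch (not the actual SFI-forest, which stores additional summary information): each incoming transaction is split into its item-suffix sub-transactions, and each sub-transaction is counted in a simple prefix tree. The TrieNode class and insert_subtransactions function are illustrative names, not from the paper.

```python
from collections import defaultdict

class TrieNode:
    """One node of a simple prefix tree holding an item count."""
    def __init__(self):
        self.count = 0
        self.children = defaultdict(TrieNode)

def insert_subtransactions(root, transaction):
    """Project a transaction into its item-suffix sub-transactions
    (e.g. [a, b, c] -> [a, b, c], [b, c], [c]) and insert each one
    into the prefix tree, incrementing counts along the path."""
    for start in range(len(transaction)):
        node = root
        for item in transaction[start:]:
            node = node.children[item]
            node.count += 1

# Usage: one pass over a toy stream of transactions.
root = TrieNode()
for txn in (["a", "b", "c"], ["a", "c"], ["b", "c"]):
    insert_subtransactions(root, txn)
print(root.children["b"].children["c"].count)  # count of the projected prefix b -> c
```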
Suh-Yin Lee

2.
In this paper, we deal with mining sequential patterns in multiple time sequences. Building on PrefixSpan, a state-of-the-art sequential pattern mining algorithm for transaction databases, we propose MILE (MIning in muLtiple sEquences), an efficient algorithm to facilitate the mining process. MILE recursively reuses the knowledge of existing patterns to avoid redundant data scanning, and therefore can effectively speed up the discovery of new patterns. Another unique feature of MILE is that it can incorporate prior knowledge of the data distribution in time sequences into the mining process to further improve performance. Extensive empirical results show that MILE is significantly faster than PrefixSpan. As MILE consumes more memory than PrefixSpan, we also present a solution that trades some time efficiency for reduced memory usage in memory-constrained environments.
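For context, the PrefixSpan step MILE builds on is prefix projection: each sequence is reduced to the suffix following the first occurrence of the current prefix item. A minimal sketch of that step, restricted to single-item prefixes over plain symbol sequences (MILE's reuse of previously mined patterns is not modeled):

```python
def project(sequences, item):
    """Return the suffix of each sequence after the first occurrence
    of `item`; sequences that do not contain `item` are dropped."""
    projected = []
    for seq in sequences:
        if item in seq:
            projected.append(seq[seq.index(item) + 1:])
    return projected

# Usage: project a small database on prefix 'a'.
db = [list("abcab"), list("acb"), list("bca")]
print(project(db, "a"))  # [['b', 'c', 'a', 'b'], ['c', 'b'], []]
```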
Xingquan Zhu

3.
A novel approach for process mining based on event types (Cited: 2; self-citations: 0; by others: 2)
Despite the omnipresence of event logs in transactional information systems (cf. WFM, ERP, CRM, SCM, and B2B systems), historic information is rarely used to analyze the underlying processes. Process mining aims at improving this by providing techniques and tools for discovering process, control, data, organizational, and social structures from event logs, i.e., the basic idea of process mining is to diagnose business processes by mining event logs for knowledge. Given its potential and challenges, it is no surprise that process mining has recently become a vibrant research area. In this paper, a novel approach for process mining based on two event types, i.e., START and COMPLETE, is proposed. Information about the start and completion of tasks can be used to explicitly detect parallelism. The algorithm presented in this paper overcomes some of the limitations of existing algorithms such as the α-algorithm (e.g., short loops) and therefore enhances the applicability of process mining.
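A minimal sketch of how start and completion events expose parallelism: within a single case, two tasks whose [start, complete] intervals overlap must have executed concurrently. This shows only the idea, not the proposed algorithm, and it assumes each task occurs at most once per case.

```python
def overlapping_tasks(events):
    """events: list of (task, event_type, timestamp) for one case,
    with event_type in {"START", "COMPLETE"} and each task occurring
    at most once. Returns task pairs whose execution intervals overlap."""
    intervals = {}
    for task, etype, ts in events:
        start, end = intervals.get(task, (None, None))
        if etype == "START":
            intervals[task] = (ts, end)
        else:
            intervals[task] = (start, ts)

    tasks = sorted(intervals)
    parallel = set()
    for i, a in enumerate(tasks):
        for b in tasks[i + 1:]:
            (s1, e1), (s2, e2) = intervals[a], intervals[b]
            if s1 < e2 and s2 < e1:      # intervals overlap
                parallel.add((a, b))
    return parallel

# Usage: B and C overlap in time, so they are detected as parallel.
case = [("A", "START", 1), ("A", "COMPLETE", 2),
        ("B", "START", 3), ("C", "START", 4),
        ("B", "COMPLETE", 6), ("C", "COMPLETE", 7)]
print(overlapping_tasks(case))  # {('B', 'C')}
```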
Jiaguang Sun

4.
A complete set of frequent itemsets can get undesirably large due to redundancy when the minimum support threshold is low or when the database is dense. Several concise representations have been previously proposed to eliminate the redundancy. Generator-based representations rely on a negative border to make the representation lossless. However, the number of itemsets on a negative border sometimes even exceeds the total number of frequent itemsets. In this paper, we propose to use a positive border together with frequent generators to form a lossless representation. A positive border is usually orders of magnitude smaller than its corresponding negative border. A set of frequent generators plus its positive border is never larger than the corresponding complete set of frequent itemsets; it is thus a true concise representation. The generalized form of this representation is also proposed. We develop an efficient algorithm, called GrGrowth, to mine generators and positive borders as well as their generalizations. The GrGrowth algorithm uses a depth-first-search strategy to explore the search space, which is much more efficient than the breadth-first-search strategy adopted by most existing generator mining algorithms. Our experimental results show that the GrGrowth algorithm is significantly faster than level-wise algorithms for mining generator-based representations, and is comparable to the state-of-the-art algorithms for mining frequent closed itemsets.
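A frequent generator is an itemset none of whose proper subsets has the same support. A minimal sketch of that check, assuming all supports are already available in a dictionary (GrGrowth itself computes them far more efficiently with its depth-first search):

```python
from itertools import combinations

def is_generator(itemset, support):
    """itemset: frozenset of items; support: dict mapping frozensets
    to support counts. An itemset is a generator if every proper
    subset has strictly higher support."""
    s = support[itemset]
    for k in range(len(itemset)):
        for subset in combinations(itemset, k):
            if support[frozenset(subset)] == s:
                return False
    return True

# Usage with toy supports: {a, b} is a generator; {a, c} is not,
# because it has the same support as its subset {a}... here {c}? No:
# it matches the support of {a}? See the counts below: {a, c} == {c}.
support = {frozenset(): 4,
           frozenset("a"): 3, frozenset("b"): 3, frozenset("c"): 2,
           frozenset("ab"): 2, frozenset("ac"): 2}
print(is_generator(frozenset("ab"), support))  # True
print(is_generator(frozenset("ac"), support))  # False ({a,c} has the same support as {c})
```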
Guimei Liu

5.
Querying live media streams is a challenging problem that is becoming an essential requirement in a growing number of applications. Research in multimedia information systems has addressed and made good progress in dealing with archived data. Meanwhile, research in stream databases has received significant attention for querying alphanumeric symbolic streams. The lack of a data model capable of representing different multimedia data in a declarative way, hiding the media heterogeneity, and providing reasonable abstractions for querying live multimedia streams poses the challenge of how to make the best use of data from video, audio, and other media sources in various applications. In this paper we propose a system that directly captures media streams from sensors and automatically generates more meaningful feature streams that can be queried by a data stream processor. The system provides an effective combination of extensible digital processing techniques and general data stream management research. Together with other query techniques developed in related data stream management research, our system can be used in application areas where diverse live media sensors are deployed for surveillance, disaster response, live conferencing, telepresence, etc.
Bin Liu

6.
This paper describes security and privacy issues for multimedia database management systems, where multimedia data includes text, images, audio, and video. It discusses access control for such systems and outlines their security policies and security architectures. Privacy problems that result from multimedia data mining are also discussed.
Bhavani Thuraisingham

7.
Association rule mining is an important data mining activity and has received substantial attention in the literature. It is a computationally and I/O intensive task. In this paper, we propose a solution approach for mining optimized fuzzy association rules of different orders. We also propose an approach to define membership functions for all the continuous attributes in a database by using clustering techniques. Although single-objective genetic algorithms are used extensively, they tend to degenerate the solution. In our approach, extraction and optimization of fuzzy association rules are performed together using a multi-objective genetic algorithm with objectives such as fuzzy support, fuzzy confidence, and rule length. The effectiveness of the proposed approach is tested on a computer activity dataset, to analyze the performance of a multiprocessor system, and on network audit data, to detect anomaly-based intrusions. Experiments show that the proposed method is efficient in many scenarios.
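As an illustration of the fuzzy objectives such an algorithm would evaluate, the following sketch computes fuzzy support and fuzzy confidence using the common min t-norm over membership degrees; the membership functions and attribute names are hypothetical, and the clustering and genetic components of the proposed approach are not modeled.

```python
def fuzzy_support(records, fuzzy_sets):
    """records: list of dicts mapping attribute -> crisp value.
    fuzzy_sets: dict mapping attribute -> membership function.
    Fuzzy support = average over records of the minimum membership
    degree across the rule's attributes (min t-norm)."""
    total = 0.0
    for rec in records:
        degrees = [mf(rec[attr]) for attr, mf in fuzzy_sets.items()]
        total += min(degrees)
    return total / len(records)

def fuzzy_confidence(records, antecedent, consequent):
    """Fuzzy confidence = fuzzy support(antecedent and consequent)
    divided by fuzzy support(antecedent)."""
    both = dict(antecedent, **consequent)
    return fuzzy_support(records, both) / fuzzy_support(records, antecedent)

# Usage with hypothetical ramp-shaped membership functions.
high_cpu = lambda x: max(0.0, min(1.0, (x - 50) / 30))     # "cpu is high"
slow_resp = lambda x: max(0.0, min(1.0, (x - 200) / 100))  # "response is slow"
data = [{"cpu": 90, "resp": 320}, {"cpu": 55, "resp": 210}, {"cpu": 30, "resp": 150}]
print(fuzzy_confidence(data, {"cpu": high_cpu}, {"resp": slow_resp}))
```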
V. S. Ananthanarayana

8.
Nowadays data mining plays an important role in decision making. Since many organizations do not possess in-house data mining expertise, it is beneficial to outsource data mining tasks to external service providers. However, most organizations hesitate to do so due to concerns about the loss of business intelligence and customer privacy. In this paper, we present a Bloom filter based solution that enables organizations to outsource their association rule mining tasks while protecting their business intelligence and customer privacy. Our approach achieves high precision in data mining at the cost of additional storage. This research was supported by the U.S. National Science Foundation under grants CCR-0310974 and IIS-0546027.
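For illustration, a minimal Bloom filter over item identifiers, showing the insert and membership-test mechanics the approach relies on; the paper's actual encoding of transactions and mined rules for outsourcing is more involved than this sketch.

```python
import hashlib

class BloomFilter:
    """A simple Bloom filter: k hash positions per element in an
    m-bit array; membership tests may yield false positives but
    never false negatives."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

# Usage: encode a transaction's items before shipping them to the provider.
bf = BloomFilter()
for item in ("milk", "bread", "butter"):
    bf.add(item)
print("bread" in bf, "beer" in bf)  # True, (almost certainly) False
```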
Ling Qiu (corresponding author), Yingjiu Li, Xintao Wu

9.
On Detecting Spatial Outliers (Cited: 1; self-citations: 1; by others: 0)
The ever-increasing volume of spatial data has greatly challenged our ability to extract useful but implicit knowledge from them. As an important branch of spatial data mining, spatial outlier detection aims to discover the objects whose non-spatial attribute values are significantly different from the values of their spatial neighbors. These objects, called spatial outliers, may reveal important phenomena in a number of applications including traffic control, satellite image analysis, weather forecast, and medical diagnosis. Most of the existing spatial outlier detection algorithms mainly focus on identifying single attribute outliers and could potentially misclassify normal objects as outliers when their neighborhoods contain real spatial outliers with very large or small attribute values. In addition, many spatial applications contain multiple non-spatial attributes which should be processed altogether to identify outliers. To address these two issues, we formulate the spatial outlier detection problem in a general way, design two robust detection algorithms, one for single attribute and the other for multiple attributes, and analyze their computational complexities. Experiments were conducted on a real-world data set, West Nile virus data, to validate the effectiveness of the proposed algorithms.
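A minimal sketch of the neighborhood-difference test that underlies most single-attribute spatial outlier detectors (and that the robust algorithms in this paper refine): compare each object's value with the mean of its spatial neighbors and flag large standardized differences.

```python
import numpy as np

def spatial_outliers(values, neighbors, threshold=2.0):
    """values: one non-spatial attribute value per object.
    neighbors: dict mapping object index -> list of neighbor indices.
    Flags objects whose difference from their neighborhood mean is
    more than `threshold` standard deviations from the typical difference."""
    values = np.asarray(values, dtype=float)
    diffs = np.array([values[i] - values[neighbors[i]].mean()
                      for i in range(len(values))])
    z = (diffs - diffs.mean()) / diffs.std()
    return np.where(np.abs(z) > threshold)[0]

# Usage: object 2 is far above its neighbors.
vals = [10, 11, 50, 12, 9, 11]
nbrs = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0]}
print(spatial_outliers(vals, nbrs))  # [2]
```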
Feng Chen (corresponding author)

10.
Mining of music data is one of the most important problems in multimedia data mining. In this paper, two research issues in mining music data are discussed: online mining of music query streams and change detection in music query streams. First, we propose an efficient online algorithm, FTP-stream (Frequent Temporal Pattern mining of streams), to mine all frequent melody structures over sliding windows of music melody sequence streams. An effective bit-sequence representation is used in the proposed algorithm to reduce the time and memory needed to slide the windows. An effective list structure is developed in the FTP-stream algorithm to overcome the performance bottleneck of 2-candidate generation. Experiments show that the proposed FTP-stream algorithm needs only about half the memory required by the original melody sequence data and scans the music query stream only once. After mining frequent melody structures, we develop a simple online algorithm, MQS-change (changes of Music Query Streams), to detect changes of frequent melody structures in the current user-centered music query streams. Two music melody structures (set of chord-sets and string of chord-sets) are maintained and four melody structure changes (positive burst, negative burst, increasing change, and decreasing change) are monitored in a new summary data structure, MSC-list (a list of Music Structure Changes). Experiments show that the MQS-change algorithm is an effective online method to detect changes of music melody structures over continuous music query streams.
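A minimal sketch of the bit-sequence idea: each item keeps one bit per position of the current sliding window, so sliding the window is a shift-and-mask operation and support counting is a population count. The chord-set structures and the 2-candidate list of FTP-stream are not modeled; class and method names are illustrative.

```python
class BitSequenceWindow:
    """Bit-sequence bookkeeping over a sliding window of size w:
    bit j of bitmaps[item] is set if the item occurred in the
    j-th most recent sequence of the window (bit 0 = newest)."""
    def __init__(self, w):
        self.w = w
        self.mask = (1 << w) - 1
        self.bitmaps = {}

    def slide(self, items):
        """Shift every bitmap by one position and record the new sequence."""
        for key in list(self.bitmaps):
            self.bitmaps[key] = (self.bitmaps[key] << 1) & self.mask
        for item in items:
            self.bitmaps[item] = self.bitmaps.get(item, 0) | 1

    def support(self, *items):
        """Number of window positions containing all given items."""
        combined = self.mask
        for item in items:
            combined &= self.bitmaps.get(item, 0)
        return bin(combined).count("1")

# Usage: window of 3 sequences.
win = BitSequenceWindow(w=3)
for seq in ({"C", "G"}, {"C"}, {"C", "G", "Am"}):
    win.slide(seq)
print(win.support("C"), win.support("C", "G"))  # 3 2
```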
Hua-Fu Li

11.
Recently, a new class of data mining methods, known as privacy preserving data mining (PPDM) algorithms, has been developed by the research community working on security and knowledge discovery. The aim of these algorithms is the extraction of relevant knowledge from large amounts of data while protecting sensitive information at the same time. Several data mining techniques incorporating privacy protection mechanisms have been developed that allow one to hide sensitive itemsets or patterns before the data mining process is executed. Privacy preserving classification methods, instead, prevent a miner from building a classifier that is able to predict sensitive data. Additionally, privacy preserving clustering techniques have recently been proposed, which distort sensitive numerical attributes while preserving general features for clustering analysis. A crucial issue is to determine which of these privacy-preserving techniques better protect sensitive information. However, this is not the only criterion with respect to which these algorithms can be evaluated. It is also important to assess the quality of the data resulting from the modifications applied by each algorithm, as well as the performance of the algorithms. There is thus a need to identify a comprehensive set of criteria with respect to which to assess the existing PPDM algorithms and determine which algorithm meets specific requirements. In this paper, we present a first evaluation framework for estimating and comparing different kinds of PPDM algorithms. We then apply our criteria to a specific set of algorithms and discuss the evaluation results we obtain. Finally, some considerations about future work and promising directions in the context of privacy preservation in data mining are discussed. The work reported in this paper has been partially supported by the EU under the IST Project CODMINE and by the sponsors of CERIAS. Editor: Geoff Webb.
Elisa Bertino (corresponding author), Igor Nai Fovino, Loredana Parasiliti Provenza

12.
Multi-objective optimization has played a major role in solving problems where two or more conflicting objectives need to be simultaneously optimized. This paper presents a Multi-Objective grammar-based genetic programming (MOGGP) system that automatically evolves complete rule induction algorithms, which in turn produce both accurate and compact rule models. The system was compared with a single objective GGP and three other rule induction algorithms. In total, 20 UCI data sets were used to generate and test generic rule induction algorithms, which can be now applied to any classification data set. Experiments showed that, in general, the proposed MOGGP finds rule induction algorithms with competitive predictive accuracies and more compact models than the algorithms it was compared with.
Gisele L. Pappa

13.
In this paper, we seek a method that can serve more users in a video-on-demand (VoD) system based on MPEG-4 object streams. The object segmentation performed on MPEG-4 videos can be exploited to reduce re-transmission of the same objects, and the saved bandwidth can then be used to serve more users. However, some thresholds must first be analyzed so that the quality of service (QoS) requested by users remains acceptable while unnecessary object transmissions are reduced. Based on the defined thresholds, we propose a dynamic adjustment algorithm to coordinate the object streams between the server and clients. The server not only allocates network bandwidth but also adjusts previously allocated QoS appropriately, using degrading and upgrading strategies based on the current network status. Finally, simulations show that our method performs better than three other methods owing to its flexibility in adapting to the network status.
Yin-Fu Huang

14.
One in a million: picking the right patterns (Cited: 7; self-citations: 6; by others: 1)
Constrained pattern mining extracts patterns based on their individual merit. Usually this results in far more patterns than a human expert or a machine learning technique could make use of. Often different patterns or combinations of patterns cover a similar subset of the examples, thus being redundant and carrying no new information. To remove the redundant information contained in such pattern sets, we propose two general heuristic algorithms, Bouncer and Picker, for selecting a small subset of patterns. We identify several selection techniques for use in these general algorithms and evaluate them on several data sets. The results show that both techniques succeed in severely reducing the number of patterns, while at the same time apparently retaining much of the original information. Additionally, the experiments show that reducing the pattern set indeed improves the quality of classification results. Both results show that the developed solutions are well suited to the goals we aim at.
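A minimal sketch of the coverage-based selection idea behind such heuristics: greedily keep the pattern that adds the most still-uncovered examples and stop once nothing new is covered. The actual Bouncer and Picker algorithms use several selection techniques beyond this simple greedy criterion.

```python
def greedy_pick(patterns):
    """patterns: dict mapping pattern name -> set of covered example ids.
    Repeatedly pick the pattern adding the most uncovered examples."""
    selected, covered = [], set()
    remaining = dict(patterns)
    while remaining:
        best = max(remaining, key=lambda p: len(remaining[p] - covered))
        gain = remaining[best] - covered
        if not gain:            # every remaining pattern is redundant
            break
        selected.append(best)
        covered |= gain
        del remaining[best]
    return selected

# Usage: p2 and p3 become redundant once p1 is chosen.
pats = {"p1": {1, 2, 3, 4}, "p2": {2, 3}, "p3": {3, 4}, "p4": {5, 6}}
print(greedy_pick(pats))  # ['p1', 'p4']
```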
Albrecht Zimmermann (corresponding author)

15.
Clustering multidimensional sequences in spatial and temporal databases (Cited: 3; self-citations: 2; by others: 1)
Many environmental, scientific, technical or medical database applications require effective and efficient mining of time series, sequences or trajectories of measurements taken at different time points and positions forming large temporal or spatial databases. Particularly the analysis of concurrent and multidimensional sequences poses new challenges in finding clusters of arbitrary length and varying number of attributes. We present a novel algorithm capable of finding parallel clusters in different subspaces and demonstrate our results for temporal and spatial applications. Our analysis of structural quality parameters in rivers is successfully used by hydrologists to develop measures for river quality improvements.
Thomas Seidl

16.
ONTRACK: Dynamically adapting music playback to support navigation (Cited: 3; self-citations: 3; by others: 0)
Listening to music on personal, digital devices whilst mobile is an enjoyable, everyday activity. We explore a scheme for exploiting this practice to immerse listeners in navigation cues. Our prototype, ONTRACK, continuously adapts audio, modifying the spatial balance and volume to lead listeners to their target destination. First we report on an initial lab-based evaluation that demonstrated the approach’s efficacy: users were able to complete tasks within a reasonable time and their subjective feedback was positive. Encouraged by these results we constructed a handheld prototype. Here, we discuss this implementation and the results of field-trials. These indicate that even with a low-fidelity realisation of the concept, users can quite effectively navigate complicated routes.
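An illustrative sketch, not the actual ONTRACK implementation, of how spatial balance and volume can be derived from the listener's heading and the bearing to the target: pan the audio toward the target's side and play it loudest when the listener is on course. Function and parameter names are assumptions.

```python
import math

def audio_cues(heading_deg, bearing_deg):
    """heading_deg: direction the listener faces; bearing_deg: direction
    to the target (both compass degrees). Returns (pan, volume) where
    pan is -1 (full left) .. +1 (full right) and volume is 0..1."""
    # Signed angle from heading to target, normalized to (-180, 180].
    delta = (bearing_deg - heading_deg + 180) % 360 - 180
    pan = math.sin(math.radians(delta))                  # left/right balance
    volume = 0.5 + 0.5 * math.cos(math.radians(delta))   # loudest when on course
    return pan, volume

# Usage: target 90 degrees to the listener's right, then straight ahead.
print(audio_cues(heading_deg=0, bearing_deg=90))  # ~ (1.0, 0.5): hard right, half volume
print(audio_cues(heading_deg=0, bearing_deg=0))   # (0.0, 1.0): centered, full volume
```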
Matt Jones (corresponding author), Steve Jones, Gareth Bradley, Nigel Warren, David Bainbridge, Geoff Holmes

17.
Making SVMs Scalable to Large Data Sets using Hierarchical Cluster Indexing (Cited: 2; self-citations: 0; by others: 2)
Support vector machines (SVMs) have been promising methods for classification and regression analysis due to their solid mathematical foundations, which include two desirable properties: margin maximization and nonlinear classification using kernels. However, despite these prominent properties, SVMs are usually not chosen for large-scale data mining problems because their training complexity is highly dependent on the data set size. Unlike traditional pattern recognition and machine learning, real-world data mining applications often involve huge numbers of data records. Thus it is too expensive to perform multiple scans over the entire data set, and it is also infeasible to keep the data set in memory. This paper presents a method, Clustering-Based SVM (CB-SVM), that maximizes SVM performance for very large data sets given a limited amount of resources, e.g., memory. CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire data set only once to provide the SVM with high-quality samples. These samples carry statistical summaries of the data and maximize the benefit of learning. Our analyses show that the training complexity of CB-SVM is quadratically dependent on the number of support vectors, which is usually much smaller than the size of the entire data set. Our experiments on synthetic and real-world data sets show that CB-SVM is highly scalable for very large data sets and very accurate in terms of classification. A preliminary version of this paper, "Classifying Large Data Sets Using SVM with Hierarchical Clusters" by H. Yu, J. Yang, and J. Han, appeared in Proc. 2003 Int. Conf. on Knowledge Discovery in Databases (KDD'03), Washington, DC, August 2003; this version substantially extends that paper and contains major new technical contributions in comparison with the conference publication.
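A simplified sketch of the cluster-then-train idea (not CB-SVM's hierarchical micro-clustering): summarize each class with a modest number of cluster centroids and train the SVM on those summaries, so training cost no longer grows with the raw data set size. Uses scikit-learn; the function and its parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def train_on_cluster_summaries(X, y, clusters_per_class=50):
    """Cluster each class separately and train an SVM on the centroids."""
    centers, labels = [], []
    for cls in np.unique(y):
        Xc = X[y == cls]
        k = min(clusters_per_class, len(Xc))
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xc)
        centers.append(km.cluster_centers_)
        labels.extend([cls] * k)
    svm = SVC(kernel="rbf", gamma="scale")
    svm.fit(np.vstack(centers), np.array(labels))
    return svm

# Usage on synthetic data: the SVM sees far fewer points than the raw set.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (5000, 2)), rng.normal(3, 1, (5000, 2))])
y = np.array([0] * 5000 + [1] * 5000)
model = train_on_cluster_summaries(X, y)
print(model.predict([[0, 0], [3, 3]]))  # [0 1]
```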
Hwanjo Yu (corresponding author), Jiong Yang, Jiawei Han, Xiaolei Li

18.
Discovering correlated spatio-temporal changes in evolving graphs (Cited: 6; self-citations: 6; by others: 0)
Graphs provide powerful abstractions of relational data, and are widely used in fields such as network management, web page analysis and sociology. While many graph representations of data describe dynamic and time evolving relationships, most graph mining work treats graphs as static entities. Our focus in this paper is to discover regions of a graph that are evolving in a similar manner. To discover regions of correlated spatio-temporal change in graphs, we propose an algorithm called cSTAG. Whereas most clustering techniques are designed to find clusters that optimise a single distance measure, cSTAG addresses the problem of finding clusters that optimise both temporal and spatial distance measures simultaneously. We show the effectiveness of cSTAG using a quantitative analysis of accuracy on synthetic data sets, as well as demonstrating its utility on two large, real-life data sets, where one is the routing topology of the Internet, and the other is the dynamic graph of files accessed together on the 1998 World Cup official website.
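For contrast with cSTAG, which optimises the two measures simultaneously, here is a minimal sketch of the single-measure baseline it improves on: blend a spatial distance between graph regions with a distance between their change time-series using a fixed weight, and hand the combined matrix to an off-the-shelf clustering algorithm. Region coordinates and change series are hypothetical inputs; scikit-learn 1.2 or later is assumed for the precomputed metric.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import AgglomerativeClustering

def cluster_regions(positions, change_series, alpha=0.5, n_clusters=2):
    """positions: (n, d) region coordinates; change_series: (n, t) amount
    of change per region per snapshot. Combine normalized spatial and
    temporal distances with weight alpha and cluster the regions."""
    d_spatial = cdist(positions, positions)
    d_temporal = cdist(change_series, change_series)
    d_spatial /= d_spatial.max()
    d_temporal /= d_temporal.max()
    combined = alpha * d_spatial + (1 - alpha) * d_temporal
    model = AgglomerativeClustering(n_clusters=n_clusters,
                                    metric="precomputed", linkage="average")
    return model.fit_predict(combined)

# Usage: two regions change together; two others follow a different trend.
pos = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
changes = np.array([[1, 2, 3], [1, 2, 3], [3, 2, 1], [3, 2, 1]])
print(cluster_regions(pos, changes))  # e.g. [0 0 1 1]
```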
Jeffrey Chan

19.
Recently, multi-objective evolutionary algorithms have been applied to improve the difficult trade-off between interpretability and accuracy of fuzzy rule-based systems. It is known that the two requirements are usually contradictory; however, these kinds of algorithms can obtain a set of solutions with different trade-offs. This contribution analyzes different application alternatives for attaining the desired accuracy/interpretability balance by maintaining the improved accuracy that a tuning of membership functions can give while trying to obtain more compact models. In this way, we propose the use of multi-objective evolutionary algorithms as a tool to obtain at least one solution improved with respect to a classic single-objective approach (a solution that dominates the one obtained by such an algorithm in terms of the system error and the number of rules). To do so, this work presents and analyzes the application of six different multi-objective evolutionary algorithms to obtain simpler and still accurate linguistic fuzzy models by performing rule selection and a tuning of the membership functions. The results on two different scenarios show that the use of expert knowledge in the algorithm design process significantly improves the search ability of these algorithms and that they are able to improve both objectives together, obtaining models that are more accurate and at the same time simpler than those of the single-objective approach.
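A minimal sketch of the Pareto-dominance test such multi-objective algorithms rely on, stated for the two objectives mentioned above (system error and number of rules), both to be minimized:

```python
def dominates(a, b):
    """a, b: (error, n_rules) tuples, both objectives minimized.
    a dominates b if it is no worse in both and strictly better in one."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def pareto_front(solutions):
    """Keep only the non-dominated solutions."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions)]

# Usage with toy (error, n_rules) candidates: (0.10, 40) is dominated.
candidates = [(0.10, 40), (0.09, 35), (0.12, 20), (0.08, 55)]
print(pareto_front(candidates))  # [(0.09, 35), (0.12, 20), (0.08, 55)]
```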
María José Gacto (corresponding author), Rafael Alcalá, Francisco Herrera

20.
On modeling software defect repair time (Cited: 2; self-citations: 2; by others: 0)
The ability to predict the time required to repair software defects is important for both software quality management and maintenance. Estimated repair times can be used to improve the reliability and time-to-market of software under development. This paper presents an empirical approach to predicting defect repair times by constructing models that use well-established machine learning algorithms and defect data from past software defect reports. We describe, as a case study, the analysis of defect reports collected during the development of a large medical software system. Our predictive models give accuracies as high as 93.44%, despite the limitations of the available data. We present the proposed methodology along with detailed experimental results, which include comparisons with other analytical modeling approaches.
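A simplified sketch of the general approach with hypothetical feature names (the paper's defect reports from the medical system naturally contain many more fields): encode attributes of past defect reports and train an off-the-shelf classifier to predict a repair-time category for new defects.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical defect-report fields; real reports would supply many more.
reports = pd.DataFrame({
    "severity":    [3, 1, 2, 3, 1, 2, 3, 2],
    "component":   ["ui", "db", "ui", "net", "db", "net", "ui", "db"],
    "n_files":     [1, 5, 2, 8, 4, 3, 1, 6],
    "repair_time": ["short", "long", "short", "long", "long", "short", "short", "long"],
})

X = pd.get_dummies(reports.drop(columns="repair_time"))  # one-hot encode component
y = reports["repair_time"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # accuracy on the held-out reports
```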
Phongphun Kijsanayothin
