Similar Documents
20 similar documents found (search time: 15 ms)
1.
DSM-FI: an efficient algorithm for mining frequent itemsets in data streams   (Total citations: 4; self: 4; by others: 0)
Online mining of data streams is an important data mining problem with broad applications. However, it is also a difficult problem, since streaming data possess some inherent characteristics that complicate mining. In this paper, we propose a new single-pass algorithm, called DSM-FI (data stream mining for frequent itemsets), for online incremental mining of frequent itemsets over a continuous stream of online transactions. According to the proposed algorithm, each transaction of the stream is projected into a set of sub-transactions, and these sub-transactions are inserted into a new in-memory summary data structure, called SFI-forest (summary frequent itemset forest), which maintains the set of all frequent itemsets embedded in the transaction data stream generated so far. Finally, the set of all frequent itemsets is determined from the current SFI-forest. Theoretical analysis and experimental studies show that the proposed DSM-FI algorithm uses stable memory, makes only one pass over an online transactional data stream, and outperforms existing one-pass algorithms for mining frequent itemsets.
Suh-Yin Lee
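A minimal Python sketch of the item-suffix projection step described in the abstract above; this is an illustration only, not the authors' DSM-FI implementation, and the `Node`, `item_suffixes`, and `insert` names are invented. Each incoming transaction is decomposed into suffix sub-transactions, which are inserted into a simple in-memory prefix tree standing in for the SFI-forest.

```python
# Illustrative sketch only (invented names), not the authors' DSM-FI code.
class Node:
    def __init__(self):
        self.count = 0
        self.children = {}                        # item -> Node

def item_suffixes(transaction):
    """Project one transaction into its item-suffix sub-transactions."""
    items = sorted(set(transaction))              # canonical item order
    return [items[i:] for i in range(len(items))]

def insert(root, sub_transaction):
    """Insert a sub-transaction into the prefix-tree summary, counting each prefix."""
    node = root
    for item in sub_transaction:
        node = node.children.setdefault(item, Node())
        node.count += 1

# Single pass over a toy stream of transactions.
root = Node()
for txn in [["a", "b", "c"], ["b", "c"], ["a", "c", "d"]]:
    for sub in item_suffixes(txn):
        insert(root, sub)

# The prefix {b, c} was seen in two transactions; the real SFI-forest additionally
# keeps header tables so that supports can be collected for non-prefix itemsets.
print(root.children["b"].children["c"].count)     # 2
```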

2.
In this paper, we propose two parallel algorithms for mining maximal frequent itemsets from databases. A frequent itemset is maximal if none of its supersets is frequent. One parallel algorithm, named distributed max-miner (DMM), requires very low communication and synchronization overhead in distributed computing systems. DMM has a local mining phase and a global mining phase. During the local mining phase, each node mines its local database to discover the local maximal frequent itemsets, which then form a set of maximal candidate itemsets for the top-down search in the subsequent global mining phase. A new prefix tree data structure is developed to facilitate the storage and counting of the global candidate itemsets of different sizes. This global mining phase using the prefix tree can work with any local mining algorithm. The other parallel algorithm, named parallel max-miner (PMM), is a parallel version of the sequential max-miner algorithm (Proc of ACM SIGMOD Int Conf on Management of Data, 1998, pp 85–93). Most existing mining algorithms discover the frequent k-itemsets on the kth pass over the database and then generate the candidate (k + 1)-itemsets for the next pass. Compared to those level-wise algorithms, PMM looks ahead at each pass and prunes more candidate itemsets by checking the frequencies of their supersets. Both DMM and PMM were implemented on a cluster of workstations, and their performance was evaluated for various cases. They demonstrate very good performance and scalability even when there are large maximal frequent itemsets (i.e., long patterns) in the databases.
Congnan Luo
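To make the maximality notion used above concrete, here is a small, self-contained Python sketch (a naive quadratic check, not DMM or PMM): a frequent itemset is maximal exactly when none of the other frequent itemsets is a proper superset of it.

```python
# Naive illustration of maximality; real miners avoid materializing all frequent itemsets.
def maximal_itemsets(frequent):
    """frequent: iterable of frequent itemsets (as sets). Returns the maximal ones."""
    frequent = [frozenset(s) for s in frequent]
    maximal = []
    for s in frequent:
        if not any(s < t for t in frequent):      # '<' is proper-subset on frozensets
            maximal.append(s)
    return maximal

freq = [{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
print(maximal_itemsets(freq))                     # only the 3-itemset {a, b, c} remains
```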

3.
Mining frequent itemsets is a popular method for finding associated items in databases. For this method, support, the co-occurrence frequency of the items which form an association, is used as the primary indicator of the association's significance. A single, user-specified support threshold is used to decide whether associations should be further investigated. Support has known problems with rare items, favors shorter itemsets, and sometimes produces misleading associations. In this paper we develop a novel model-based frequency constraint as an alternative to a single, user-specified minimum support. The constraint utilizes knowledge of the process generating transaction data by applying a simple stochastic mixture model (the NB model) which accounts for transaction data's typically highly skewed item frequency distribution. A user-specified precision threshold is used together with the model to find local frequency thresholds for groups of itemsets. Based on the constraint we develop the notion of NB-frequent itemsets and adapt a mining algorithm to find all NB-frequent itemsets in a database. In experiments with publicly available transaction databases we show that the new constraint provides improvements over a single minimum support threshold and that the precision threshold is more robust and easier to set and interpret by the user.
Michael Hahsler

4.
Recently, a new class of data mining methods, known as privacy preserving data mining (PPDM) algorithms, has been developed by the research community working on security and knowledge discovery. The aim of these algorithms is the extraction of relevant knowledge from large amounts of data while protecting sensitive information at the same time. Several data mining techniques incorporating privacy protection mechanisms have been developed that allow one to hide sensitive itemsets or patterns before the data mining process is executed. Privacy preserving classification methods, instead, prevent a miner from building a classifier able to predict sensitive data. Additionally, privacy preserving clustering techniques have recently been proposed, which distort sensitive numerical attributes while preserving general features for clustering analysis. A crucial issue is to determine which of these privacy-preserving techniques better protect sensitive information. However, this is not the only criterion by which these algorithms can be evaluated. It is also important to assess the quality of the data resulting from the modifications applied by each algorithm, as well as the performance of the algorithms. There is thus a need to identify a comprehensive set of criteria against which to assess existing PPDM algorithms and determine which algorithm meets specific requirements. In this paper, we present a first evaluation framework for estimating and comparing different kinds of PPDM algorithms. We then apply our criteria to a specific set of algorithms and discuss the evaluation results we obtain. Finally, some considerations about future work and promising directions in the context of privacy preservation in data mining are discussed. The work reported in this paper has been partially supported by the EU under the IST Project CODMINE and by the sponsors of CERIAS.
Elisa Bertino (Corresponding author)
Igor Nai Fovino
Loredana Parasiliti Provenza

5.
Mining top-K frequent itemsets from data streams   (Total citations: 1; self: 0; by others: 1)
Frequent pattern mining on data streams has attracted much interest recently. However, it is not easy for users to determine a proper frequency threshold; it is more reasonable to ask users to set a bound on the result size. We study the problem of mining the top-K frequent itemsets in data streams. We introduce a method based on the Chernoff bound with a guarantee on the output quality and also a bound on the memory usage. We also propose an algorithm based on the Lossy Counting algorithm. In most of the experiments with the two proposed algorithms, we obtain perfect solutions, and the memory space occupied by our algorithms is very small. Besides, we propose adaptations of both algorithms to handle the case in which the data are mined over a sliding window. The experiments show that the results are accurate.
Ada Wai-Chee Fu
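For reference, a compact Python sketch of the Lossy Counting idea the abstract builds on, shown here for single items rather than itemsets and without the authors' Chernoff-bound machinery: approximate counts carry an error bound, and entries whose count plus error falls below the current bucket number are pruned at bucket boundaries.

```python
# Textbook Lossy Counting over a stream of items (not the paper's top-K variant).
import math

def lossy_counting(stream, epsilon):
    counts, errors = {}, {}
    bucket_width = math.ceil(1 / epsilon)
    for n, item in enumerate(stream, start=1):
        bucket = math.ceil(n / bucket_width)
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            errors[item] = bucket - 1             # maximum possible undercount so far
        if n % bucket_width == 0:                 # end of bucket: prune weak entries
            for it in [k for k in counts if counts[k] + errors[k] <= bucket]:
                del counts[it], errors[it]
    return counts, errors                         # counts underestimate by at most epsilon * n

counts, _ = lossy_counting(list("aababcabcd") * 100, epsilon=0.01)
print(sorted(counts, key=counts.get, reverse=True)[:3])   # approximate top-3 items: a, b, c
```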

6.
A novel approach for process mining based on event types   (Total citations: 2; self: 0; by others: 2)
Despite the omnipresence of event logs in transactional information systems (cf. WFM, ERP, CRM, SCM, and B2B systems), historic information is rarely used to analyze the underlying processes. Process mining aims to improve this by providing techniques and tools for discovering process, control, data, organizational, and social structures from event logs; the basic idea of process mining is to diagnose business processes by mining event logs for knowledge. Given its potential and challenges, it is no surprise that process mining has recently become a vibrant research area. In this paper, a novel approach for process mining based on two event types, i.e., START and COMPLETE, is proposed. Information about the start and completion of tasks can be used to detect parallelism explicitly. The algorithm presented in this paper overcomes some of the limitations of existing algorithms such as the α-algorithm (e.g., short loops) and therefore enhances the applicability of process mining.
Jiaguang Sun
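A minimal Python sketch (toy event log and invented function names, not the paper's algorithm) of why START/COMPLETE pairs make parallelism explicit: two tasks of the same case ran in parallel when their [start, complete] intervals overlap.

```python
# Detect parallel task pairs per case from START/COMPLETE timestamps.
from collections import defaultdict

def parallel_pairs(events):
    """events: (case_id, task, event_type, timestamp) tuples; returns overlapping task pairs."""
    intervals = defaultdict(dict)                 # case_id -> task -> [start, complete]
    for case, task, etype, ts in events:
        intervals[case].setdefault(task, [None, None])
        intervals[case][task][0 if etype == "START" else 1] = ts

    pairs = set()
    for case, tasks in intervals.items():
        items = list(tasks.items())
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                (t1, (s1, c1)), (t2, (s2, c2)) = items[i], items[j]
                if s1 < c2 and s2 < c1:           # intervals overlap => tasks ran in parallel
                    pairs.add(frozenset((t1, t2)))
    return pairs

log = [("c1", "A", "START", 1), ("c1", "B", "START", 2),
       ("c1", "A", "COMPLETE", 3), ("c1", "B", "COMPLETE", 4),
       ("c1", "C", "START", 5), ("c1", "C", "COMPLETE", 6)]
print(parallel_pairs(log))                        # {frozenset({'A', 'B'})}
```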

7.
Nowadays data mining plays an important role in decision making. Since many organizations do not possess in-house data mining expertise, it is beneficial to outsource data mining tasks to external service providers. However, most organizations hesitate to do so because they are concerned about losing business intelligence and compromising customer privacy. In this paper, we present a Bloom filter based solution that enables organizations to outsource their tasks of mining association rules while protecting their business intelligence and customer privacy. Our approach can achieve high precision in data mining by trading off the storage requirement. This research was supported by the USA National Science Foundation grants CCR-0310974 and IIS-0546027.
Ling Qiu (Corresponding author)
Yingjiu Li
Xintao Wu
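An illustrative Python sketch of a generic Bloom filter (not the paper's specific encoding of items and rules): the outsourced data contain only hashed bit positions, and enlarging the bit array trades storage for a lower false-positive rate, which mirrors the precision/storage trade-off mentioned above.

```python
# Generic Bloom filter: membership is probabilistic, items are never stored in the clear.
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(num_bits=1024, num_hashes=3)
for item in ["milk", "bread", "beer"]:
    bf.add(item)
print(bf.might_contain("milk"), bf.might_contain("wine"))   # True, almost surely False
```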

8.
Mining of music data is one of the most important problems in multimedia data mining. In this paper, two research issues in mining music data are discussed: online mining of music query streams and change detection in music query streams. First, we propose an efficient online algorithm, FTP-stream (Frequent Temporal Pattern mining of streams), to mine all frequent melody structures over sliding windows of music melody sequence streams. An effective bit-sequence representation is used in the proposed algorithm to reduce the time and memory needed to slide the windows. An effective list structure is developed in the FTP-stream algorithm to overcome the performance bottleneck of 2-candidate generation. Experiments show that the proposed FTP-stream algorithm needs only about half the memory required by the original melody sequence data and scans the music query stream only once. After mining frequent melody structures, we develop a simple online algorithm, MQS-change (changes of Music Query Streams), to detect changes of frequent melody structures in current user-centered music query streams. Two music melody structures (sets of chord-sets and strings of chord-sets) are maintained and four melody structure changes (positive burst, negative burst, increasing change and decreasing change) are monitored in a new summary data structure, MSC-list (a list of Music Structure Changes). Experiments show that the MQS-change algorithm is an effective online method to detect changes of music melody structures over continuous music query streams.
Hua-Fu Li
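A minimal sketch of the bit-sequence idea mentioned in the abstract, simplified to plain itemsets over a transaction-sensitive sliding window; the names and window handling below are illustrative, not the FTP-stream data structures. Every item keeps one bit per window slot, the support of a pattern is the popcount of the AND of its items' bit-sequences, and sliding the window is a shift-and-mask per item.

```python
# Bit-sequence bookkeeping over a sliding window of the last WINDOW transactions.
WINDOW = 8                                        # window size in transactions

def new_window(bitmaps):
    """Shift every item's bit-sequence by one slot, dropping the oldest bit."""
    mask = (1 << WINDOW) - 1
    return {item: (bits << 1) & mask for item, bits in bitmaps.items()}

def record(bitmaps, items):
    """Mark the items of the newest transaction (lowest-order bit = newest slot)."""
    for item in items:
        bitmaps[item] = bitmaps.get(item, 0) | 1
    return bitmaps

def support(bitmaps, pattern):
    """Number of window slots in which all items of the pattern co-occur."""
    combined = (1 << WINDOW) - 1
    for item in pattern:
        combined &= bitmaps.get(item, 0)
    return bin(combined).count("1")

bitmaps = {}
for txn in [{"C", "E"}, {"C", "G"}, {"C", "E", "G"}]:
    bitmaps = record(new_window(bitmaps), txn)
print(support(bitmaps, {"C", "E"}))               # 2 (first and third transactions)
```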

9.
Current workflow management technology offers rich support for process-oriented coordination of distributed teamwork. In this paper, we evaluate the performance of an industrial workflow process in which similar tasks can be performed by various actors at many different locations. We analyzed a large workflow process log with state-of-the-art mining tools associated with the ProM framework. Our analysis leads to the conclusion that there is a positive effect on process performance when workflow actors are geographically close to each other. Our case study shows that the use of workflow technology in itself is not sufficient to level geographical barriers between team members and that additional measures are required for desirable performance.
Byungduk Jeong

10.
Exploiting maximal redundancy to optimize SQL queries   (Total citations: 1; self: 1; by others: 0)
Detecting and dealing with redundancy is a ubiquitous problem in query optimization, which manifests itself in many areas of research such as materialized views, multi-query optimization, and query-containment algorithms. In this paper, we focus on intra-query redundancy, i.e., redundancy present within a single query. We present a method to detect the maximal redundancy present between a main (outer) query block and a subquery block. We then use the method for query optimization, introducing query plans and a new operator that take full advantage of the redundancy discovered. Our approach can deal with redundancy in a wider spectrum of queries than existing techniques. We show experimental evidence that our approach works under certain conditions and compares favorably to existing optimization techniques when applicable.
Antonio Badia

11.
ONTRACK: Dynamically adapting music playback to support navigation   (Total citations: 3; self: 3; by others: 0)
Listening to music on personal, digital devices whilst mobile is an enjoyable, everyday activity. We explore a scheme for exploiting this practice to immerse listeners in navigation cues. Our prototype, ONTRACK, continuously adapts audio, modifying the spatial balance and volume to lead listeners to their target destination. First we report on an initial lab-based evaluation that demonstrated the approach’s efficacy: users were able to complete tasks within a reasonable time and their subjective feedback was positive. Encouraged by these results we constructed a handheld prototype. Here, we discuss this implementation and the results of field-trials. These indicate that even with a low-fidelity realisation of the concept, users can quite effectively navigate complicated routes.
Matt Jones (Corresponding author)
Steve Jones
Gareth Bradley
Nigel Warren
David Bainbridge
Geoff Holmes
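A toy Python sketch of the kind of continuous audio adaptation described above; the mapping is invented for illustration and is not ONTRACK's actual scheme. The bearing to the destination, relative to the listener's heading, is turned into a left/right stereo balance, and the distance into an overall volume.

```python
# Hypothetical mapping from (heading, bearing, distance) to stereo gains.
import math

def audio_cue(listener_heading_deg, bearing_to_target_deg, distance_m, max_range_m=500.0):
    """Return (left_gain, right_gain) in [0, 1] steering the listener toward the target."""
    # Signed angle from heading to target in (-180, 180]; positive = target to the right.
    angle = (bearing_to_target_deg - listener_heading_deg + 180.0) % 360.0 - 180.0
    pan = math.sin(math.radians(angle))           # -1 (hard left) .. +1 (hard right)
    volume = max(0.2, 1.0 - distance_m / max_range_m)   # louder as the target gets closer
    left = volume * (1.0 - pan) / 2.0
    right = volume * (1.0 + pan) / 2.0
    return round(left, 2), round(right, 2)

print(audio_cue(listener_heading_deg=0, bearing_to_target_deg=90, distance_m=100))
# target due right of the listener -> (0.0, 0.8): sound pulled fully to the right ear
```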

12.
We present a study of using camera-phones and visual-tags to access mobile services. Firstly, a user-experience study is described in which participants were both observed learning to interact with a prototype mobile service and interviewed about their experiences. Secondly, a pointing-device task is presented in which quantitative data was gathered regarding the speed and accuracy with which participants aimed and clicked on visual-tags using camera-phones. We found that participants’ attitudes to visual-tag-based applications were broadly positive, although they had several important reservations about camera-phone technology more generally. Data from our pointing-device task demonstrated that novice users were able to aim and click on visual-tags quickly (well under 3 s per pointing-device trial on average) and accurately (almost all meeting our defined speed/accuracy tradeoff of 6% error-rate). Based on our findings, design lessons for camera-phone and visual-tag applications are presented.
Eleanor Toye (Corresponding author)
Richard Sharp
Anil Madhavapeddy
David Scott
Eben Upton
Alan Blackwell

13.
Recently, multi-objective evolutionary algorithms have been applied to improve the difficult trade-off between the interpretability and accuracy of fuzzy rule-based systems. The two requirements are usually contradictory; however, these kinds of algorithms can obtain a set of solutions with different trade-offs. This contribution analyzes different application alternatives for attaining the desired accuracy/interpretability balance, maintaining the improved accuracy that a tuning of membership functions could give while trying to obtain more compact models. In this way, we propose the use of multi-objective evolutionary algorithms as a tool to obtain at least one solution that improves on a classic single-objective approach (a solution that could dominate the one obtained by such an algorithm in terms of system error and number of rules). To do that, this work presents and analyzes the application of six different multi-objective evolutionary algorithms to obtain simpler and still accurate linguistic fuzzy models by performing rule selection and a tuning of the membership functions. The results on two different scenarios show that the use of expert knowledge in the algorithm design process significantly improves the search ability of these algorithms and that they are able to improve both objectives together, obtaining models that are more accurate and at the same time simpler than those of the single-objective based approach.
María José Gacto (Corresponding author)
Rafael Alcalá
Francisco Herrera
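A small Python sketch of the dominance relation the abstract relies on (generic Pareto logic, not one of the six algorithms compared): one fuzzy model dominates another if it is no worse in both system error and number of rules and strictly better in at least one.

```python
# Generic Pareto dominance and front extraction over (error, number of rules).
def dominates(a, b):
    """a, b: (error, n_rules) tuples; both objectives are minimized."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(models):
    return [m for m in models if not any(dominates(o, m) for o in models if o is not m)]

# (mean squared error, number of rules) for some candidate fuzzy models
candidates = [(0.12, 30), (0.15, 12), (0.12, 45), (0.20, 12), (0.09, 50)]
print(pareto_front(candidates))                   # [(0.12, 30), (0.15, 12), (0.09, 50)]
```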

14.
Quantitative usability requirements are a critical but challenging, and hence often neglected, aspect of a usability engineering process. A case study is described in which quantitative usability requirements played a key role in the development of a new user interface for a mobile phone. Within the practical constraints of the project, existing methods for determining usability requirements, and for evaluating the extent to which they are met, could not be applied as such, so tailored methods had to be developed. These methods and their applications are discussed.
Timo Jokela (Corresponding author)
Jussi Koivumaa
Jani Pirkola
Petri Salminen
Niina Kantola

15.
The requirements and issues associated with computational representations for planning extend beyond those apparent in real-time control, where a substantial, existing research literature informs designers. To assist in the identification of requirements for planning representations, this paper provides two resources: (1) a theoretical foundation drawn from computer science and (2) illustrations of representations and corresponding work practice for real-time control and planning for the US Shuttle program. Together, these resources illustrate the human role in the planning process, and the need for work practices and information that combine to assist human operators in interpreting a representation that is loosely coupled to the physical world while shared among and modified by multiple participants in the planning process.
Valerie L. Shalin

16.
Multimodal support to group dynamics   (Total citations: 1; self: 1; by others: 0)
The complexity of the group dynamics occurring in small group interactions often hinders the performance of teams. The availability of rich multimodal information about what is going on during a meeting makes it possible to explore ways of supporting dysfunctional teams, ranging from facilitation to training sessions addressing both the individuals and the group as a whole. A necessary step in this direction is capturing and understanding group dynamics. In this paper, we discuss a particular scenario in which meeting participants receive multimedia feedback on their relational behaviour, as a first step towards increasing self-awareness. We describe the background and motivation for a coding scheme for annotating meeting recordings, partially inspired by Bales' Interaction Process Analysis. This coding scheme was aimed at identifying suitable observable behavioural sequences. The study is complemented by an experimental investigation of the acceptability of such a service.
Fabio Pianesi (Corresponding author)
Massimo Zancanaro
Elena Not
Chiara Leonardi
Vera Falcon
Bruno Lepri

17.
Maximum entropy based significance of itemsets   (Total citations: 7; self: 5; by others: 2)
We consider the problem of defining the significance of an itemset. We say that the itemset is significant if we are surprised by its frequency when compared to the frequencies of its sub-itemsets. In other words, we estimate the frequency of the itemset from the frequencies of its sub-itemsets and compute the deviation between the real value and the estimate. For the estimation we use Maximum Entropy and for measuring the deviation we use Kullback–Leibler divergence. A major advantage compared to the previous methods is that we are able to use richer models whereas the previous approaches only measure the deviation from the independence model. We show that our measure of significance goes to zero for derivable itemsets and that we can use the rank as a statistical test. Our empirical results demonstrate that for our real datasets the independence assumption is too strong but applying more flexible models leads to good results.
Nikolaj Tatti
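A worked Python sketch of the simplest special case the abstract contrasts with, namely the independence model: the itemset's frequency is estimated as the product of its items' marginal frequencies, and the deviation of the observed frequency is measured with a binary Kullback-Leibler divergence. The full method instead builds a maximum-entropy estimate from all sub-itemset frequencies; the function names here are invented.

```python
# Independence-model baseline for itemset significance (illustration only).
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) (observed) and Bernoulli(q) (estimated)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def independence_significance(transactions, itemset):
    n = len(transactions)
    observed = sum(itemset <= t for t in transactions) / n
    estimate = math.prod(sum(i in t for t in transactions) / n for i in itemset)
    return kl_bernoulli(observed, estimate)

txns = [{"a", "b"}, {"a", "b"}, {"a", "b"}, {"a"}, {"b"}, {"c"}, {"c"}, {"c"}]
print(round(independence_significance(txns, {"a", "b"}), 3))
# observed 3/8 vs. estimated (4/8) * (4/8) = 1/4 -> positive divergence: a surprising co-occurrence
```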

18.
Dataless Transitions Between Concise Representations of Frequent Patterns   (Total citations: 1; self: 0; by others: 1)
Solving many data mining problems requires the discovery of frequent patterns. Frequent itemsets are useful, e.g., in the discovery of association and episode rules, sequential patterns, and clusters. Nevertheless, the number of frequent itemsets is usually huge. Therefore, a number of lossless representations of frequent itemsets have recently been proposed. Two such representations, namely the closed itemsets and the generators representation, are of particular interest as they can efficiently be applied to the discovery of the most interesting non-redundant association and episode rules. On the other hand, it has been shown experimentally that other representations of frequent patterns can be more concise and more quickly extractable than these two representations, even by several orders of magnitude. Hence, such concise representations are an interesting alternative for materializing and reusing the knowledge of frequent patterns. The problem, however, is how to transform the intermediate representations into the desired ones efficiently, preferably without accessing the database. This article tackles this problem. As a result of investigating the properties of representations of frequent patterns, we offer a set of efficient algorithms for dataless transitions between them.
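A minimal Python sketch of the two representations singled out above (brute-force definitions, not the paper's dataless transition algorithms): an itemset is closed if no superset obtained by adding one item keeps its support, and it is a generator if every subset obtained by removing one item has strictly larger support; by the anti-monotonicity of support, checking these immediate neighbours is sufficient.

```python
# Brute-force checks of the closed-itemset and generator definitions.
def support(transactions, itemset):
    return sum(itemset <= t for t in transactions)

def is_closed(transactions, itemset, all_items):
    s = support(transactions, itemset)
    return all(support(transactions, itemset | {x}) < s for x in all_items - itemset)

def is_generator(transactions, itemset):
    s = support(transactions, itemset)
    return all(support(transactions, itemset - {x}) > s for x in itemset)

txns = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("c")]
items = frozenset("abc")
print(is_closed(txns, frozenset("ab"), items),    # True: adding 'c' drops the support from 3 to 2
      is_generator(txns, frozenset("ab")))        # False: removing 'b' leaves {a} with the same support 3
```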

19.
Mining Long, Sharable Patterns in Trajectories of Moving Objects   (Total citations: 1; self: 0; by others: 1)
The efficient analysis of spatio-temporal data generated by moving objects is an essential requirement for intelligent location-based services. Spatio-temporal rules can be found by constructing spatio-temporal baskets, from which traditional association rule mining methods can discover spatio-temporal rules. When the items in the baskets are spatio-temporal identifiers derived from trajectories of moving objects, the discovered rules represent frequently travelled routes. For some applications, e.g., an intelligent ridesharing application, these frequent routes are only interesting if they are long and sharable, i.e., can potentially be shared by several users. This paper presents a database projection based method for efficiently extracting such long, sharable frequent routes. The method prunes the search space by exploiting the minimum length and sharability requirements, and it avoids generating the exponential number of sub-routes of long routes. Considering alternative modelling options for trajectories leads to the development of two effective variants of the method. SQL-based implementations are described, and extensive experiments on both real-life and large-scale synthetic data show the effectiveness of the method and its variants.
Torben Bach Pedersen
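A simplified Python sketch of the two pruning requirements named in the abstract (illustrative only; the paper's method works by database projection over spatio-temporal identifiers and avoids enumerating sub-routes): a candidate route is kept only if it is long enough and occurs as a contiguous sub-route in the trajectories of enough distinct users.

```python
# Filter candidate routes by minimum length and minimum number of sharers.
def contains_subroute(trajectory, route):
    """True if `route` occurs as a contiguous subsequence of `trajectory`."""
    n, m = len(trajectory), len(route)
    return any(trajectory[i:i + m] == route for i in range(n - m + 1))

def sharable_routes(candidates, trajectories, min_length, min_sharers):
    """trajectories: user_id -> list of road-segment ids; candidates: list of routes."""
    kept = []
    for route in candidates:
        if len(route) < min_length:
            continue                              # too short to be worth sharing
        sharers = sum(contains_subroute(t, route) for t in trajectories.values())
        if sharers >= min_sharers:
            kept.append((route, sharers))
    return kept

trajs = {"u1": ["s1", "s2", "s3", "s4"], "u2": ["s0", "s2", "s3", "s4"], "u3": ["s2", "s3"]}
print(sharable_routes([["s2", "s3", "s4"], ["s2", "s3"]], trajs, min_length=3, min_sharers=2))
# [(['s2', 's3', 's4'], 2)] -- the short route is pruned even though three users share it
```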

20.
The paper reflects on the unique experience of social and technological development in Lithuania since the regaining of independence, as a newly reshaped society constructing a distinctive, globally competitive IST-based model. It presents the Lithuanian pattern of integrating different experiences and relations between generations in implementing complex information-society approaches. The resulting programme is broadly linked to the Lisbon objectives of the European Union. The experience of transitional countries in Europe, each different but facing some common problems, may be useful to developing countries in Africa.
Arunas Augustinaitis (Corresponding author)
Richard Ennals
Egle Malinauskiene
Rimantas Petrauskas

