共查询到20条相似文献,搜索用时 31 毫秒
1.
Online mining of data streams is an important data mining problem with broad applications. However, it is also a difficult
problem since the streaming data possess some inherent characteristics. In this paper, we propose a new single-pass algorithm,
called DSM-FI (data stream mining for frequent itemsets), for online incremental mining of frequent itemsets over a continuous
stream of online transactions. According to the proposed algorithm, each transaction of the stream is projected into a set
of sub-transactions, and these sub-transactions are inserted into a new in-memory summary data structure, called SFI-forest
(summary frequent itemset forest) for maintaining the set of all frequent itemsets embedded in the transaction data stream
generated so far. Finally, the set of all frequent itemsets is determined from the current SFI-forest. Theoretical analysis
and experimental studies show that the proposed DSM-FI algorithm uses stable memory, makes only one pass over an online transactional
data stream, and outperforms the existing algorithms of one-pass mining of frequent itemsets.
相似文献
Suh-Yin LeeEmail: |
2.
Mining top-K frequent itemsets from data streams 总被引:1,自引:0,他引:1
Frequent pattern mining on data streams is of interest recently. However, it is not easy for users to determine a proper frequency
threshold. It is more reasonable to ask users to set a bound on the result size. We study the problem of mining top K frequent itemsets in data streams. We introduce a method based on the Chernoff bound with a guarantee of the output quality
and also a bound on the memory usage. We also propose an algorithm based on the Lossy Counting Algorithm. In most of the experiments
of the two proposed algorithms, we obtain perfect solutions and the memory space occupied by our algorithms is very small.
Besides, we also propose the adapted approach of these two algorithms in order to handle the case when we are interested in
mining the data in a sliding window. The experiments show that the results are accurate.
相似文献
Ada Wai-Chee FuEmail: |
3.
Finding maximum-length repeating patterns in music databases 总被引:1,自引:0,他引:1
Ioannis Karydis Alexandros Nanopoulos Yannis Manolopoulos 《Multimedia Tools and Applications》2007,32(1):49-71
This paper introduces the problem of discovering maximum-length repeating patterns in music objects. A novel algorithm is
presented for the extraction of this kind of patterns from a melody music object. The proposed algorithm discovers all maximum-length
repeating patterns using an “aggressive” accession during searching, by avoiding costly repetition frequency calculation and
by examining as few as possible repeating patterns in order to reach the maximum-length repeating pattern(s). Detailed experimental
results illustrate the significant performance gains due to the proposed algorithm, compared to an existing baseline algorithm.
相似文献
Yannis Manolopoulos (Corresponding author)Email: |
4.
Hua-Fu Li 《Multimedia Systems》2011,17(3):237-245
Effective and efficient mining of music structure patterns from music query data is one of the most interesting issues of
multimedia data mining. In this paper, we introduce a new kind of pattern, called emerging melody structure (EMS), for knowledge
discovery from music melody streams. EMSs are defined as music data items with melody strings whose support increase significantly
from one sliding window to another window from streaming melody sequences. The discovered EMS can be used to predict the future
trend of online music style recommendation, to personalize the Web service of music downloading priority, for music composers
to compose new music or for service provider to collect more similar music. Therefore, an efficient data mining approach,
called MEMSA (Mining Emerging Melody Structure Algorithm), is proposed to discover all EMSs from streaming music query data over sliding
windows. In the framework of MEMSA, a prefix tree-based data structure, called EMS-tree (Emerging Melody Structure tree), is constructed for maintaining temporal EMSs effectively. Experimental results show that
the proposed method MEMSA is an efficient algorithm for mining all EMSs from streaming melody sequences efficiently. 相似文献
5.
Efficient mining of maximal frequent itemsets from databases on a cluster of workstations 总被引:2,自引:2,他引:0
In this paper, we propose two parallel algorithms for mining maximal frequent itemsets from databases. A frequent itemset
is maximal if none of its supersets is frequent. One parallel algorithm is named distributed max-miner (DMM), and it requires very low communication and synchronization overhead in distributed computing systems. DMM has the
local mining phase and the global mining phase. During the local mining phase, each node mines the local database to discover
the local maximal frequent itemsets, then they form a set of maximal candidate itemsets for the top-down search in the subsequent
global mining phase. A new prefix tree data structure is developed to facilitate the storage and counting of the global candidate
itemsets of different sizes. This global mining phase using the prefix tree can work with any local mining algorithm. Another
parallel algorithm, named parallel max-miner (PMM), is a parallel version of the sequential max-miner algorithm (Proc of ACM SIGMOD Int Conf on Management of Data, 1998,
pp 85–93). Most of existing mining algorithms discover the frequent k-itemsets on the kth pass over the databases, and then generate the candidate (k + 1)-itemsets for the next pass. Compared to those level-wise algorithms, PMM looks ahead at each pass and prunes more candidate
itemsets by checking the frequencies of their supersets. Both DMM and PMM were implemented on a cluster of workstations, and
their performance was evaluated for various cases. They demonstrate very good performance and scalability even when there
are large maximal frequent itemsets (i.e., long patterns) in databases.
相似文献
Congnan LuoEmail: |
6.
A novel approach for process mining based on event types 总被引:2,自引:0,他引:2
Lijie Wen Jianmin Wang Wil M. P. van der Aalst Biqing Huang Jiaguang Sun 《Journal of Intelligent Information Systems》2009,32(2):163-190
Despite the omnipresence of event logs in transactional information systems (cf. WFM, ERP, CRM, SCM, and B2B systems), historic
information is rarely used to analyze the underlying processes. Process mining aims at improving this by providing techniques
and tools for discovering process, control, data, organizational, and social structures from event logs, i.e., the basic idea
of process mining is to diagnose business processes by mining event logs for knowledge. Given its potential and challenges
it is no surprise that recently process mining has become a vivid research area. In this paper, a novel approach for process
mining based on two event types, i.e., START and COMPLETE, is proposed. Information about the start and completion of tasks
can be used to explicitly detect parallelism. The algorithm presented in this paper overcomes some of the limitations of existing
algorithms such as the α-algorithm (e.g., short-loops) and therefore enhances the applicability of process mining.
相似文献
Jiaguang SunEmail: |
7.
A new concise representation of frequent itemsets using generators and a positive border 总被引:2,自引:2,他引:0
A complete set of frequent itemsets can get undesirably large due to redundancy when the minimum support threshold is low
or when the database is dense. Several concise representations have been previously proposed to eliminate the redundancy.
Generator based representations rely on a negative border to make the representation lossless. However, the number of itemsets
on a negative border sometimes even exceeds the total number of frequent itemsets. In this paper, we propose to use a positive
border together with frequent generators to form a lossless representation. A positive border is usually orders of magnitude
smaller than its corresponding negative border. A set of frequent generators plus its positive border is always no larger
than the corresponding complete set of frequent itemsets, thus it is a true concise representation. The generalized form of
this representation is also proposed. We develop an efficient algorithm, called GrGrowth, to mine generators and positive
borders as well as their generalizations. The GrGrowth algorithm uses the depth-first-search strategy to explore the search
space, which is much more efficient than the breadth-first-search strategy adopted by most of the existing generator mining
algorithms. Our experiment results show that the GrGrowth algorithm is significantly faster than level-wise algorithms for
mining generator based representations, and is comparable to the state-of-the-art algorithms for mining frequent closed itemsets.
相似文献
Guimei LiuEmail: |
8.
As music can be represented symbolically, most of the existing methods extend some string matching algorithms to retrieve musical patterns in a music database. However, not all retrieved patterns are perceptually significant because some of them are, in fact, inaudible. Music is perceived in groupings of musical notes called streams. The process of grouping musical notes into streams is called stream segregation. Stream-crossing musical patterns are perceptually insignificant and should be pruned from the retrieval results. This can be done if all musical notes in a music database are segregated into streams and musical patterns are retrieved from the streams. Findings in auditory psychology are utilized in this paper, in which stream segregation is modelled as a clustering process and an adapted single-link clustering algorithm is proposed. Supported by experiments on real music data, streams are identified by the proposed algorithm with considerable accuracy.
相似文献
Man Hon WongEmail: |
9.
Querying live media streams is a challenging problem that is becoming an essential requirement in a growing number of applications.
Research in multimedia information systems has addressed and made good progress in dealing with archived data. Meanwhile,
research in stream databases has received significant attention for querying alphanumeric symbolic streams. The lack of a
data model capable of representing different multimedia data in a declarative way, hiding the media heterogeneity and providing
reasonable abstractions for querying live multimedia streams poses the challenge of how to make the best use of data in video,
audio and other media sources for various applications. In this paper we propose a system that enables directly capturing
media streams from sensors and automatically generating more meaningful feature streams that can be queried by a data stream
processor. The system provides an effective combination between extendible digital processing techniques and general data
stream management research. Together with other query techniques developed in related data stream management streams, our
system can be used in those application areas where multifarious live media senors are deployed for surveillance, disaster
response, live conferencing, telepresence, etc.
相似文献
Bin LiuEmail: |
10.
Three dimensional human motions recorded by motion capture and hand gestures recorded by using data gloves generate variable-length
data streams. These data streams usually have dozens of attributes, and have different variations for similar motions. To
segment and recognize motion streams, a classification-based approach is proposed in this paper. Classification feature vectors
are extracted by utilizing singular value decompositions (SVD) of motion data. The extracted feature vectors capture the dominating
geometric structures of motion data as revealed by SVD. Multi-class support vector machine (SVM) classifiers with class probability
estimates are explored for classifying the feature vectors in order to segment and recognize motion streams. Experiments show
that the proposed approach can find patterns in motion data streams with high accuracy.
相似文献
B. PrabhakaranEmail: |
11.
Mining Long, Sharable Patterns in Trajectories of Moving Objects 总被引:1,自引:0,他引:1
The efficient analysis of spatio-temporal data, generated by moving objects, is an essential requirement for intelligent location-based
services. Spatio-temporal rules can be found by constructing spatio-temporal baskets, from which traditional association rule
mining methods can discover spatio-temporal rules. When the items in the baskets are spatio-temporal identifiers and are derived
from trajectories of moving objects, the discovered rules represent frequently travelled routes. For some applications, e.g.,
an intelligent ridesharing application, these frequent routes are only interesting if they are long and sharable, i.e., can
potentially be shared by several users. This paper presents a database projection based method for efficiently extracting
such long, sharable frequent routes. The method prunes the search space by making use of the minimum length and sharable requirements
and avoids the generation of the exponential number of sub-routes of long routes. Considering alternative modelling options
for trajectories, leads to the development of two effective variants of the method. SQL-based implementations are described,
and extensive experiments on both real life- and large-scale synthetic data show the effectiveness of the method and its variants.
相似文献
Torben Bach PedersenEmail: |
12.
RRSi: indexing XML data for proximity twig queries 总被引:2,自引:2,他引:0
Twig query pattern matching is a core operation in XML query processing. Indexing XML documents for twig query processing
is of fundamental importance to supporting effective information retrieval. In practice, many XML documents on the web are
heterogeneous and have their own formats; documents describing relevant information can possess different structures. Therefore
some “user-interesting” documents having similar but non-exact structures against a user query are often missed out. In this
paper, we propose the RRSi, a novel structural index designed for structure-based query lookup on heterogeneous sources of XML documents supporting
proximate query answers. The index avoids the unnecessary processing of structurally irrelevant candidates that might show
good content relevance. An optimized version of the index, oRRSi, is also developed to further reduce both space requirements and computational complexity. To our knowledge, these structural
indexes are the first to support proximity twig queries on XML documents. The results of our preliminary experiments show
that RRSi and oRRSi based query processing significantly outperform previously proposed techniques in XML repositories with structural heterogeneity.
相似文献
Vincent T. Y. NgEmail: |
13.
A Schedule-based Pathfinding Algorithm for Transit Networks Using Pattern First Search 总被引:1,自引:0,他引:1
Ruihong Huang 《GeoInformatica》2007,11(2):269-285
The lack of effective and efficient schedule-based pathfinding algorithms for transit networks has limited the application
of GIS in transit trip planning services. This paper introduces a schedule-based path finding algorithm for transit networks.
Based on a pattern-centered spatiotemporal transit network model, the algorithm searches the network by following route patterns.
A pattern is a spatial layout of a route in transit terminology. A route usually has many patterns to serve various locations
at different times. This path search algorithm is significantly different from traditional shortest path algorithms that are
based on adjacent node search. By establishing a set of lemmas and theorems the paper proves that paths generated by the PFS
algorithm are schedule-coordinated fastest paths for trips with given constraints. After analyzing computation and database
query complexities of the algorithm the paper indicates that the PFS is efficient in computation and database query. Finally,
effectiveness and efficiency of the algorithm are demonstrated by implementations in GIS-based online transit trip planners
in Wisconsin, US.
相似文献
Ruihong HuangEmail: |
14.
Association Rule Mining is one of the important data mining activities and has received substantial attention in the literature.
Association rule mining is a computationally and I/O intensive task. In this paper, we propose a solution approach for mining optimized fuzzy association rules of different orders.
We also propose an approach to define membership functions for all the continuous attributes in a database by using clustering
techniques. Although single objective genetic algorithms are used extensively, they degenerate the solution. In our approach,
extraction and optimization of fuzzy association rules are done together using multi-objective genetic algorithm by considering
the objectives such as fuzzy support, fuzzy confidence and rule length. The effectiveness of the proposed approach is tested
using computer activity dataset to analyze the performance of a multi processor system and network audit data to detect anomaly
based intrusions. Experiments show that the proposed method is efficient in many scenarios.
相似文献
V. S. AnanthanarayanaEmail: |
15.
Ira Assent Ralph Krieger Boris Glavic Thomas Seidl 《Knowledge and Information Systems》2008,16(1):29-51
Many environmental, scientific, technical or medical database applications require effective and efficient mining of time
series, sequences or trajectories of measurements taken at different time points and positions forming large temporal or spatial
databases. Particularly the analysis of concurrent and multidimensional sequences poses new challenges in finding clusters
of arbitrary length and varying number of attributes. We present a novel algorithm capable of finding parallel clusters in
different subspaces and demonstrate our results for temporal and spatial applications. Our analysis of structural quality
parameters in rivers is successfully used by hydrologists to develop measures for river quality improvements.
相似文献
Thomas SeidlEmail: |
16.
Scalable landmark recognition using EXTENT 总被引:1,自引:1,他引:0
We have proposed the EXTENT system for automated photograph annotation using image content and context analysis. A key component
of EXTENT is a Landmark recognition system called LandMarker. In this paper, we present the architecture of LandMarker. The content of a query photograph is analyzed and compared against a database of sample landmark images, to recognize any
landmarks it contains. An algorithm is presented for comparing a query image with a sample image. Context information may
be used to assist landmark recognition. Also, we show how LandMarker deals with scalability to allow recognition of a large number of landmarks. We have implemented a prototype of the system,
and present empirical results on a large dataset.
相似文献
Arun QamraEmail: |
17.
Protecting business intelligence and customer privacy while outsourcing data mining tasks 总被引:1,自引:1,他引:0
Nowadays data mining plays an important role in decision making. Since many organizations do not possess the in-house expertise
of data mining, it is beneficial to outsource data mining tasks to external service providers. However, most organizations
hesitate to do so due to the concern of loss of business intelligence and customer privacy. In this paper, we present a Bloom
filter based solution to enable organizations to outsource their tasks of mining association rules, at the same time, protect
their business intelligence and customer privacy. Our approach can achieve high precision in data mining by trading-off the
storage requirement.
This research was supported by the USA National Science Foundation Grants CCR-0310974 and IIS-0546027.
相似文献
Ling Qiu (Corresponding author)Email: |
Yingjiu LiEmail: |
Xintao WuEmail: |
18.
Various techniques have been developed for different query types in content-based image retrieval systems such as sampling
queries, constrained sampling queries, multiple constrained sampling queries, k-NN queries, constrained k-NN queries, and multiple localized k-NN queries. In this paper, we propose a generalized query model suitable for expressing queries of different types, and investigate
efficient processing techniques for this new framework. We exploit sequential access and data sharing by developing new storage
and query processing techniques to leverage inter-query concurrency. Our experimental results, based on the Corel dataset,
indicate that the proposed optimization can significantly reduce average response time in a multiuser environment, and achieve
better retrieval precision and recall compared to two recent techniques.
相似文献
Ning YuEmail: |
19.
Bhavani Thuraisingham 《Multimedia Tools and Applications》2007,33(1):13-29
This paper describes security and privacy issues for multimedia database management systems. Multimedia data includes text,
images, audio and video. It describes access control for multimedia database management systems and describes security policies
and security architectures for such systems. Privacy problems that result from multimedia data mining are also discussed.
相似文献
Bhavani ThuraisinghamEmail: |
20.
Michael Hahsler 《Data mining and knowledge discovery》2006,13(2):137-166
Mining frequent itemsets is a popular method for finding associated items in databases. For this method, support, the co-occurrence
frequency of the items which form an association, is used as the primary indicator of the associations's significance. A single,
user-specified support threshold is used to decided if associations should be further investigated. Support has some known
problems with rare items, favors shorter itemsets and sometimes produces misleading associations.
In this paper we develop a novel model-based frequency constraint as an alternative to a single, user-specified minimum support.
The constraint utilizes knowledge of the process generating transaction data by applying a simple stochastic mixture model
(the NB model) which allows for transaction data's typically highly skewed item frequency distribution. A user-specified precision
threshold is used together with the model to find local frequency thresholds for groups of itemsets. Based on the constraint
we develop the notion of NB-frequent itemsets and adapt a mining algorithm to find all NB-frequent itemsets in a database.
In experiments with publicly available transaction databases we show that the new constraint provides improvements over a
single minimum support threshold and that the precision threshold is more robust and easier to set and interpret by the user.
相似文献
Michael HahslerEmail: |