共查询到20条相似文献,搜索用时 46 毫秒
1.
Mining frequent itemsets is a popular method for finding associated items in databases. For this method, support, the co-occurrence
frequency of the items which form an association, is used as the primary indicator of the associations's significance. A single,
user-specified support threshold is used to decided if associations should be further investigated. Support has some known
problems with rare items, favors shorter itemsets and sometimes produces misleading associations.
In this paper we develop a novel model-based frequency constraint as an alternative to a single, user-specified minimum support.
The constraint utilizes knowledge of the process generating transaction data by applying a simple stochastic mixture model
(the NB model) which allows for transaction data's typically highly skewed item frequency distribution. A user-specified precision
threshold is used together with the model to find local frequency thresholds for groups of itemsets. Based on the constraint
we develop the notion of NB-frequent itemsets and adapt a mining algorithm to find all NB-frequent itemsets in a database.
In experiments with publicly available transaction databases we show that the new constraint provides improvements over a
single minimum support threshold and that the precision threshold is more robust and easier to set and interpret by the user.
相似文献
2.
A complete set of frequent itemsets can get undesirably large due to redundancy when the minimum support threshold is low
or when the database is dense. Several concise representations have been previously proposed to eliminate the redundancy.
Generator based representations rely on a negative border to make the representation lossless. However, the number of itemsets
on a negative border sometimes even exceeds the total number of frequent itemsets. In this paper, we propose to use a positive
border together with frequent generators to form a lossless representation. A positive border is usually orders of magnitude
smaller than its corresponding negative border. A set of frequent generators plus its positive border is always no larger
than the corresponding complete set of frequent itemsets, thus it is a true concise representation. The generalized form of
this representation is also proposed. We develop an efficient algorithm, called GrGrowth, to mine generators and positive
borders as well as their generalizations. The GrGrowth algorithm uses the depth-first-search strategy to explore the search
space, which is much more efficient than the breadth-first-search strategy adopted by most of the existing generator mining
algorithms. Our experiment results show that the GrGrowth algorithm is significantly faster than level-wise algorithms for
mining generator based representations, and is comparable to the state-of-the-art algorithms for mining frequent closed itemsets.
相似文献
3.
Frequent pattern mining on data streams is of interest recently. However, it is not easy for users to determine a proper frequency
threshold. It is more reasonable to ask users to set a bound on the result size. We study the problem of mining top K frequent itemsets in data streams. We introduce a method based on the Chernoff bound with a guarantee of the output quality
and also a bound on the memory usage. We also propose an algorithm based on the Lossy Counting Algorithm. In most of the experiments
of the two proposed algorithms, we obtain perfect solutions and the memory space occupied by our algorithms is very small.
Besides, we also propose the adapted approach of these two algorithms in order to handle the case when we are interested in
mining the data in a sliding window. The experiments show that the results are accurate.
相似文献
4.
In this paper, we propose two parallel algorithms for mining maximal frequent itemsets from databases. A frequent itemset
is maximal if none of its supersets is frequent. One parallel algorithm is named distributed max-miner (DMM), and it requires very low communication and synchronization overhead in distributed computing systems. DMM has the
local mining phase and the global mining phase. During the local mining phase, each node mines the local database to discover
the local maximal frequent itemsets, then they form a set of maximal candidate itemsets for the top-down search in the subsequent
global mining phase. A new prefix tree data structure is developed to facilitate the storage and counting of the global candidate
itemsets of different sizes. This global mining phase using the prefix tree can work with any local mining algorithm. Another
parallel algorithm, named parallel max-miner (PMM), is a parallel version of the sequential max-miner algorithm (Proc of ACM SIGMOD Int Conf on Management of Data, 1998,
pp 85–93). Most of existing mining algorithms discover the frequent k-itemsets on the kth pass over the databases, and then generate the candidate ( k + 1)-itemsets for the next pass. Compared to those level-wise algorithms, PMM looks ahead at each pass and prunes more candidate
itemsets by checking the frequencies of their supersets. Both DMM and PMM were implemented on a cluster of workstations, and
their performance was evaluated for various cases. They demonstrate very good performance and scalability even when there
are large maximal frequent itemsets (i.e., long patterns) in databases.
相似文献
5.
Mining of music data is one of the most important problems in multimedia data mining. In this paper, two research issues of
mining music data, i.e., online mining of music query streams and change detection of music query streams, are discussed.
First, we proposed an efficient online algorithm, FTP-stream ( Frequent Temporal Pattern mining of streams), to mine all frequent melody structures over sliding windows of music melody sequence streams. An effective bit-sequence
representation is used in the proposed algorithm to reduce the time and memory needed to slide the windows. An effective list
structure is developed in the FTP-stream algorithm to overcome the performance bottleneck of 2-candidate generation. Experiments
show that the proposed algorithm FTP-stream only needs a half of memory requirement of original melody sequence data, and
just scans the music query stream once. After mining frequent melody structures, we developed a simple online algorithm, MQS-change
( changes of Music Query Streams), to detect the changes of frequent melody structures in current user-centered music query streams. Two music melody
structures (set of chord-sets and string of chord-sets) are maintained and four melody structure changes (positive burst,
negative burst, increasing change and decreasing change) are monitored in a new summary data structure, MSC-list (a list of Music Structure Changes). Experiments show that the MQS-change algorithm is an effective online method to detect the changes of music melody
structures over continuous music query streams.
相似文献
6.
Recently, a new class of data mining methods, known as privacy preserving data mining (PPDM) algorithms, has been developed by the research community working on security and knowledge discovery. The aim of these
algorithms is the extraction of relevant knowledge from large amount of data, while protecting at the same time sensitive
information. Several data mining techniques, incorporating privacy protection mechanisms, have been developed that allow one
to hide sensitive itemsets or patterns, before the data mining process is executed. Privacy preserving classification methods,
instead, prevent a miner from building a classifier which is able to predict sensitive data. Additionally, privacy preserving
clustering techniques have been recently proposed, which distort sensitive numerical attributes, while preserving general
features for clustering analysis. A crucial issue is to determine which ones among these privacy-preserving techniques better
protect sensitive information. However, this is not the only criteria with respect to which these algorithms can be evaluated.
It is also important to assess the quality of the data resulting from the modifications applied by each algorithm, as well
as the performance of the algorithms. There is thus the need of identifying a comprehensive set of criteria with respect to
which to assess the existing PPDM algorithms and determine which algorithm meets specific requirements.
In this paper, we present a first evaluation framework for estimating and comparing different kinds of PPDM algorithms. Then,
we apply our criteria to a specific set of algorithms and discuss the evaluation results we obtain. Finally, some considerations
about future work and promising directions in the context of privacy preservation in data mining are discussed.
*The work reported in this paper has been partially supported by the EU under the IST Project CODMINE and by the Sponsors of
CERIAS.
Editor: Geoff Webb
相似文献
7.
In this paper, we deal with mining sequential patterns in multiple time sequences. Building on a state-of-the-art sequential
pattern mining algorithm PrefixSpan for mining transaction databases, we propose MILE ( MIning in mu Ltiple s Equences), an efficient algorithm to facilitate the mining process. MILE recursively utilizes the knowledge of existing patterns
to avoid redundant data scanning, and therefore can effectively speed up the new patterns’ discovery process. Another unique
feature of MILE is that it can incorporate prior knowledge of the data distribution in time sequences into the mining process
to further improve the performance. Extensive empirical results show that MILE is significantly faster than PrefixSpan. As
MILE consumes more memory than PrefixSpan, we also present a solution to trade time efficiency in memory constrained environments.
相似文献
8.
In this paper, we present an Inverse Multi-Objective Robust Evolutionary (IMORE) design methodology that handles the presence
of uncertainty without making assumptions about the uncertainty structure. We model the clustering of uncertain events in
families of nested sets using a multi-level optimization search. To reduce the high computational costs of the proposed methodology
we proposed schemes for (1) adapting the step-size in estimating the uncertainty, and (2) trimming down the number of calls
to the objective function in the nested search. Both offline and online adaptation strategies are considered in conjunction
with the IMORE design algorithm. Design of Experiments (DOE) approaches further reduce the number of objective function calls
in the online adaptive IMORE algorithm. Empirical studies conducted on a series of test functions having diverse complexities
show that the proposed algorithms converge to a set of Pareto-optimal design solutions with non-dominated nominal and robustness
performances efficiently.
相似文献
9.
Support vector machines (SVMs) have been promising methods for classification and regression analysis due to their solid mathematical
foundations, which include two desirable properties: margin maximization and nonlinear classification using kernels. However,
despite these prominent properties, SVMs are usually not chosen for large-scale data mining problems because their training
complexity is highly dependent on the data set size. Unlike traditional pattern recognition and machine learning, real-world
data mining applications often involve huge numbers of data records. Thus it is too expensive to perform multiple scans on
the entire data set, and it is also infeasible to put the data set in memory. This paper presents a method, Clustering-Based SVM (CB-SVM), that maximizes the SVM performance for very large data sets given a limited amount of resource, e.g., memory. CB-SVM applies
a hierarchical micro-clustering algorithm that scans the entire data set only once to provide an SVM with high quality samples.
These samples carry statistical summaries of the data and maximize the benefit of learning. Our analyses show that the training
complexity of CB-SVM is quadratically dependent on the number of support vectors, which is usually much less than that of
the entire data set. Our experiments on synthetic and real-world data sets show that CB-SVM is highly scalable for very large
data sets and very accurate in terms of classification.
A preliminary version of the paper, “ Classifying Large Data Sets Using SVM with Hierarchical Clusters”, by H. Yu, J. Yang, and J. Han, appeared in Proc. 2003 Int. Conf. on Knowledge Discovery in Databases (KDD'03), Washington, DC, August 2003. However, this submission has substantially extended the previous paper and contains new and
major-value added technical contribution in comparison with the conference publication.
相似文献
10.
This paper describes security and privacy issues for multimedia database management systems. Multimedia data includes text,
images, audio and video. It describes access control for multimedia database management systems and describes security policies
and security architectures for such systems. Privacy problems that result from multimedia data mining are also discussed.
相似文献
11.
Many environmental, scientific, technical or medical database applications require effective and efficient mining of time
series, sequences or trajectories of measurements taken at different time points and positions forming large temporal or spatial
databases. Particularly the analysis of concurrent and multidimensional sequences poses new challenges in finding clusters
of arbitrary length and varying number of attributes. We present a novel algorithm capable of finding parallel clusters in
different subspaces and demonstrate our results for temporal and spatial applications. Our analysis of structural quality
parameters in rivers is successfully used by hydrologists to develop measures for river quality improvements.
相似文献
12.
Because of the rapid growth in information and communication technologies, a company’s data may be spread over several continents.
For an effective decision-making process, knowledge workers need data, which may be geographically spread in different locations.
In such circumstances, multi-database mining plays a major role in the process of extracting knowledge from different data
sources. In this paper, we have proposed a new methodology for synthesizing high-frequency rules from different data sources,
where data source weight has been calculated on the basis of their transaction population. We have also proposed a new method
for calculating global confidence. Our goal in synthesizing local patterns to obtain global patterns is that, the support
and confidence of synthesized patterns must be very nearly same if all the databases are integrated and mono-mining has been
done. Experiments conducted clearly establish that the proposed method of synthesizing high-frequency rules fairly meets the
stipulation.
相似文献
13.
A new method for data hiding in H.264/AVC streams is presented. The proposed method exploits the IPCM encoded macroblocks
during the intra prediction stage in order to hide the desired data. It is a blind data hiding scheme, i.e. the message can
be extracted directly from the encoded stream without the need of the original host video. Moreover, the method exhibits the
useful property of reusing the compressed stream for hiding different data numerous times without considerably affecting either
the bit-rate or the perceptual quality. This property allows data hiding directly in the compressed stream in real time. The
method perfectly suits to covert communication and content authentication applications .
相似文献
14.
Constrained pattern mining extracts patterns based on their individual merit. Usually this results in far more patterns than
a human expert or a machine leaning technique could make use of. Often different patterns or combinations of patterns cover
a similar subset of the examples, thus being redundant and not carrying any new information. To remove the redundant information
contained in such pattern sets, we propose two general heuristic algorithms—Bouncer and Picker—for selecting a small subset
of patterns. We identify several selection techniques for use in this general algorithm and evaluate those on several data
sets. The results show that both techniques succeed in severely reducing the number of patterns, while at the same time apparently
retaining much of the original information. Additionally, the experiments show that reducing the pattern set indeed improves
the quality of classification results. Both results show that the developed solutions are very well suited for the goals we
aim at.
相似文献
15.
Nowadays data mining plays an important role in decision making. Since many organizations do not possess the in-house expertise
of data mining, it is beneficial to outsource data mining tasks to external service providers. However, most organizations
hesitate to do so due to the concern of loss of business intelligence and customer privacy. In this paper, we present a Bloom
filter based solution to enable organizations to outsource their tasks of mining association rules, at the same time, protect
their business intelligence and customer privacy. Our approach can achieve high precision in data mining by trading-off the
storage requirement.
This research was supported by the USA National Science Foundation Grants CCR-0310974 and IIS-0546027.
相似文献
16.
This paper describes the simulated car racing competition that was arranged as part of the 2007 IEEE Congress on Evolutionary
Computation. Both the game that was used as the domain for the competition, the controllers submitted as entries to the competition
and its results are presented. With this paper, we hope to provide some insight into the efficacy of various computational
intelligence methods on a well-defined game task, as well as an example of one way of running a competition. In the process,
we provide a set of reference results for those who wish to use the simplerace game to benchmark their own algorithms. The paper is co-authored by the organizers and participants of the competition.
相似文献
17.
Querying live media streams is a challenging problem that is becoming an essential requirement in a growing number of applications.
Research in multimedia information systems has addressed and made good progress in dealing with archived data. Meanwhile,
research in stream databases has received significant attention for querying alphanumeric symbolic streams. The lack of a
data model capable of representing different multimedia data in a declarative way, hiding the media heterogeneity and providing
reasonable abstractions for querying live multimedia streams poses the challenge of how to make the best use of data in video,
audio and other media sources for various applications. In this paper we propose a system that enables directly capturing
media streams from sensors and automatically generating more meaningful feature streams that can be queried by a data stream
processor. The system provides an effective combination between extendible digital processing techniques and general data
stream management research. Together with other query techniques developed in related data stream management streams, our
system can be used in those application areas where multifarious live media senors are deployed for surveillance, disaster
response, live conferencing, telepresence, etc.
相似文献
18.
Assumptions are frequently made during requirements analysis of a system about the trustworthiness of its various components
(including human components). These trust assumptions, whether implicit or explicit, affect the scope of the analysis, derivation
of security requirements, and in some cases how functionality is realized. This paper presents trust assumptions in the context
of analysis of security requirements. A running example shows how trust assumptions can be used by a requirements engineer
to help define and limit the scope of analysis and to document the decisions made during the process. The paper concludes
with a case study examining the impact of trust assumptions on software that uses the secure electronic transaction specification.
相似文献
19.
This paper introduces the problem of discovering maximum-length repeating patterns in music objects. A novel algorithm is
presented for the extraction of this kind of patterns from a melody music object. The proposed algorithm discovers all maximum-length
repeating patterns using an “aggressive” accession during searching, by avoiding costly repetition frequency calculation and
by examining as few as possible repeating patterns in order to reach the maximum-length repeating pattern(s). Detailed experimental
results illustrate the significant performance gains due to the proposed algorithm, compared to an existing baseline algorithm.
相似文献
20.
The results of an online questionnaire for Spanish-speaking Internet users with visual impairment are presented in this work.
The research was carried out during 2002. The data collected shows that Internet is a very valuable tool for visually impaired
people, as it allows them to access written information in an autonomous and instantaneous manner and to communicate with
others through individual email and discussion lists. However, all users found frequent and important problems of accessibility
in many web pages that prevent them from having complete access to the information and functionalities of those websites.
相似文献
|