首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
The concept of support is central to data mining. While the definition of support in transaction databases is intuitive and simple, that is not the case in graph datasets and databases. Most mining algorithms require the support of a pattern to be no greater than that of its subpatterns, a property called anti-monotonicity, or admissibility. This paper examines the requirements for admissibility of a support measure. Support measures for mining graphs are usually based on the notion of an instance graph---a graph representing all the instances of the pattern in a database and their intersection properties. Necessary and sufficient conditions for support measure admissibility, based on operations on instance graphs, are developed and proved. The sufficient conditions are used to prove admissibility of one support measure—the size of the independent set in the instance graph. Conversely, the necessary conditions are used to quickly show that some other support measures, such as weighted count of instances, are not admissible. *Partially supported by the KITE consortium under contract to the Israeli Ministry of Trade and Industry, and by the Paul Ivanier Center for Robotics and Production Management.  相似文献   

2.
Clustering is a classic problem in the machine learning and pattern recognition area, however a few complications arise when we try to transfer proposed solutions in the data stream model. Recently there have been proposed new algorithms for the basic clustering problem for massive data sets that produce an approximate solution using efficiently the memory, which is the most critical resource for streaming computation. In this paper, based on these solutions, we present a new model for clustering clickstream data which applies three different phases in the data processing, and is validated through a set of experiments.  相似文献   

3.
We study the Weighted t-Uniform Sparsest Cut (Weighted t-USC) and other related problems. In an instance of the Weighted t-USC problem, a parameter t and an undirected graph G=(V,E) with edge-weights w:ER0 and vertex-weights η:VR+ are given. The goal is to find a vertex set SV with |S|t while minimizing w(S,V\S)/η(S), where w(S,V\S) is the total weight of the edges with exactly one endpoint in S and η(S)=vSη(v). For this problem, we present a (O(logt),1+ϵ) factor bicriteria approximation algorithm. Our algorithm outperforms the current best algorithm when t=no(1). We also present better approximation algorithms for Weighted ρ-Unbalanced Cut and Min–Max k-Partitioning problems.  相似文献   

4.
Privacy-preserving is a major concern in the application of data mining techniques to datasets containing personal, sensitive, or confidential information. Data distortion is a critical component to preserve privacy in security-related data mining applications, such as in data mining-based terrorist analysis systems. We propose a sparsified Singular Value Decomposition (SVD) method for data distortion. We also put forth a few metrics to measure the difference between the distorted dataset and the original dataset and the degree of the privacy protection. Our experimental results using synthetic and real world datasets show that the sparsified SVD method works well in preserving privacy as well as maintaining utility of the datasets. Shuting Xu received her PhD in Computer Science from the University of Kentucky in 2005. Dr. Xu is presently an Assistant Professor in the Department of Computer Information Systems at the Virginia State University. Her research interests include data mining and information retrieval, database systems, parallel, and distributed computing. Jun Zhang received a PhD from The George Washington University in 1997. He is an Associate Professor of Computer Science and Director of the Laboratory for High Performance Scientific Computing & Computer Simulation and Laboratory for Computational Medical Imaging & Data Analysis at the University of Kentucky. His research interests include computational neuroinformatics, data miningand information retrieval, large scale parallel and scientific computing, numerical simulation, iterative and preconditioning techniques for large scale matrix computation. Dr. Zhang is associate editor and on the editorial boards of four international journals in computer simulation andcomputational mathematics, and is on the program committees of a few international conferences. His research work has been funded by the U.S. National Science Foundation and the Department of Energy. He is recipient of the U.S. National Science Foundation CAREER Award and several other awards. Dianwei Han received an M.E. degree from Beijing Institute of Technology, Beijing, China, in 1995. From 1995to 1998, he worked in a Hitachi company(BHH) in Beijing, China. He received an MS degree from Lamar University, USA, in 2003. He is currently a PhD student in the Department of Computer Science, University of Kentucky, USA. His research interests include data mining and information retrieval, computational medical imaging analysis, and artificial intelligence. Jie Wang received the masters degree in Industrial Automation from Beijing University of Chemical Technology in 1996. She is currently a PhD student and a member of the Laboratory for High Performance Computing and Computer Simulation in the Department of Computer Science at the University of Kentucky, USA. Her research interests include data mining and knowledge discovery, information filtering and retrieval, inter-organizational collaboration mechanism, and intelligent e-Technology.  相似文献   

5.
In this paper we consider discrete-time linear positive systems, that is systems defined by a pair (A,B) of non-negative matrices. We study the reachability of such systems which in this case amounts to the freedom of steering the state in the positive orthant by using non-negative control sequences. This problem was solved recently [Canonical forms for positive discrete-time linear control systems, Linear Algebra Appl., 310 (2000) 49]. However we derive here necessary and sufficient conditions for reachability in a simpler and more compact form. These conditions are expressed in terms of particular paths in the graph which is naturally associated with the system.  相似文献   

6.
Several fast algorithms for clustering very large data sets have been proposed in the literature, including CLARA, CLARANS, GAC-R3, and GAC-RARw. CLARA is a combination of a sampling procedure and the classical PAM algorithm, while CLARANS adopts a serial randomized search strategy to find the optimal set of medoids. GAC-R3 and GAC-RARw exploit genetic search heuristics for solving clustering problems. In this research, we conducted an empirical comparison of these four clustering algorithms over a wide range of data characteristics described by data size, number of clusters, cluster distinctness, cluster asymmetry, and data randomness. According to the experimental results, CLARANS outperforms its counterparts both in clustering quality and execution time when the number of clusters increases, clusters are more closely related, more asymmetric clusters are present, or more random objects exist in the data set. With a specific number of clusters, CLARA can efficiently achieve satisfactory clustering quality when the data size is larger, whereas GAC-R3 and GAC-RARw can achieve satisfactory clustering quality and efficiency when the data size is small, the number of clusters is small, and clusters are more distinct and symmetric.  相似文献   

7.
8.
Internet measured data collected via passive measurement are analyzed to obtain localization information on nodes by clustering (i.e., grouping together) nodes that exhibit similar network path properties. Since traditional clustering algorithms fail to correctly identify clusters of homogeneous nodes, we propose the NetCluster novel framework, suited to analyze Internet measurement datasets. We show that the proposed framework correctly analyzes synthetically generated traces. Finally, we apply it to real traces collected at the access link of Politecnico di Torino campus LAN and discuss the network characteristics as seen at the vantage point.  相似文献   

9.
The Hypercube Segmentation problem was recently introduced by Kleinberg et al. [J. ACM 51 (2004) 263-280], along with several algorithms that select each segment's prototype vector from the segment. The algorithms were shown to have an approximation ratio of at least . We show that a lemma used in this proof is tight, and that the asymptotic approximation ratio of no algorithm of this type can exceed 5/6≈0.833.  相似文献   

10.
One of the major challenges in data mining is the extraction of comprehensible knowledge from recorded data. In this paper, a coevolutionary-based classification technique, namely COevolutionary Rule Extractor (CORE), is proposed to discover classification rules in data mining. Unlike existing approaches where candidate rules and rule sets are evolved at different stages in the classification process, the proposed CORE coevolves rules and rule sets concurrently in two cooperative populations to confine the search space and to produce good rule sets that are comprehensive. The proposed coevolutionary classification technique is extensively validated upon seven datasets obtained from the University of California, Irvine (UCI) machine learning repository, which are representative artificial and real-world data from various domains. Comparison results show that the proposed CORE produces comprehensive and good classification rules for most datasets, which are competitive as compared with existing classifiers in literature. Simulation results obtained from box plots also unveil that CORE is relatively robust and invariant to random partition of datasets.  相似文献   

11.
Modular decomposition of graphs is a powerful tool for designing efficient algorithms for problems on graphs such as Maximum Weight Stable Set (MWS) and Maximum Weight Clique. Using this tool we obtain O(n·m) time algorithms for MWS on chair- and xbull-free graphs which considerably extend an earlier result on bull- and chair-free graphs by De Simone and Sassano (the chair is the graph with vertices a,b,c,d,e and edges ab,bc,cd,be, and the xbull is the graph with vertices a,b,c,d,e,f and edges ab,bc,cd,de,bf,cf). Moreover, our algorithm is robust in the sense that we do not have to check in advance whether the input graphs are indeed chair- and xbull-free.  相似文献   

12.
In privacy-preserving data mining (PPDM), a widely used method for achieving data mining goals while preserving privacy is based on k-anonymity. This method, which protects subject-specific sensitive data by anonymizing it before it is released for data mining, demands that every tuple in the released table should be indistinguishable from no fewer than k subjects. The most common approach for achieving compliance with k-anonymity is to replace certain values with less specific but semantically consistent values. In this paper we propose a different approach for achieving k-anonymity by partitioning the original dataset into several projections such that each one of them adheres to k-anonymity. Moreover, any attempt to rejoin the projections, results in a table that still complies with k-anonymity. A classifier is trained on each projection and subsequently, an unlabelled instance is classified by combining the classifications of all classifiers.Guided by classification accuracy and k-anonymity constraints, the proposed data mining privacy by decomposition (DMPD) algorithm uses a genetic algorithm to search for optimal feature set partitioning. Ten separate datasets were evaluated with DMPD in order to compare its classification performance with other k-anonymity-based methods. The results suggest that DMPD performs better than existing k-anonymity-based algorithms and there is no necessity for applying domain dependent knowledge. Using multiobjective optimization methods, we also examine the tradeoff between the two conflicting objectives in PPDM: privacy and predictive performance.  相似文献   

13.
Data mining is crucial in many areas and there are ongoing efforts to improve its effectiveness in both the scientific and the business world. There is an obvious need to improve the outcomes of mining techniques such as clustering and other classifiers without abandoning the standard mining tools that are popular with researchers and practitioners alike. Currently, however, standard tools do not have the flexibility to control similarity relations between attribute values, a critical feature in improving mining-clustering results. The study presented here introduces the Similarity Adjustment Model (SAM) where adjusted Fuzzy Similarity Functions (FSF) control similarity relations between attribute values and hence ameliorate clustering results obtained with standard data mining tools such as SPSS and SAS. The SAM draws on principles of binary database representation models and employs FSF adjusted via an iterative learning process that yields improved segmentation regardless of the choice of mining-clustering algorithm. The SAM model is illustrated and evaluated on three common datasets with the standard SPSS package. The datasets were run with several clustering algorithms. Comparison of “Naïve” runs (which used original data) and “Fuzzy” runs (which used SAM) shows that the SAM improves segmentation in all cases.  相似文献   

14.
15.
Data mining involves nontrivial process of extracting knowledge or patterns from large databases. Genetic Algorithms are efficient and robust searching and optimization methods that are used in data mining. In this paper we propose a Self-Adaptive Migration Model GA (SAMGA), where parameters of population size, the number of points of crossover and mutation rate for each population are adaptively fixed. Further, the migration of individuals between populations is decided dynamically. This paper gives a mathematical schema analysis of the method stating and showing that the algorithm exploits previously discovered knowledge for a more focused and concentrated search of heuristically high yielding regions while simultaneously performing a highly explorative search on the other regions of the search space. The effective performance of the algorithm is then shown using standard testbed functions and a set of actual classification datamining problems. Michigan style of classifier was used to build the classifier and the system was tested with machine learning databases of Pima Indian Diabetes database, Wisconsin Breast Cancer database and few others. The performance of our algorithm is better than others.  相似文献   

16.
In this paper we introduce a method called CL.E.D.M. (CLassification through ELECTRE and Data Mining), that employs aspects of the methodological framework of the ELECTRE I outranking method, and aims at increasing the accuracy of existing data mining classification algorithms. In particular, the method chooses the best decision rules extracted from the training process of the data mining classification algorithms, and then it assigns the classes that correspond to these rules, to the objects that must be classified. Three well known data mining classification algorithms are tested in five different widely used databases to verify the robustness of the proposed method.  相似文献   

17.
A crucial issue related to data mining on time-series is that of training period duration. The training horizon used impacts the nature of rules obtained and their predictability over time. Longer training horizons are generally sought, in order to discern sustained patterns with robust training data performance that extends well into the predictive period. However, in dynamic environments patterns that persist over time may be unavailable, and shorter-term patterns may hold higher predictive ability, albeit with shorter predictive periods. Such potentially useful shorter-term patterns may be lost when the training duration covers much longer periods. Too short a training duration can, of course, be susceptible to over-fitting to noise. We conduct experiments using different training horizons with daily-data for the S&P500 index and report the sensitivity of the performance of the obtained rules with respect to the training durations. We show that while the performance of the rules in the training period is important for inducing the “best” rules, it is not indicative of their performance in the test-period and propose alternative measures that can be used to help identify the appropriate training durations.  相似文献   

18.
A couple of approximate inversion techniques are presented which provide a parallel enhancement to several iterative methods for solving linear systems arising from the discretization of boundary value problems. In particular, the Jacobi, Gauss‐Seidel, and successive overrelaxation methods can be improved substantially in a parallel environment by the extensions considered. A special case convergence proof is presented. The use of our approximate inverses with the preconditioned conjugate gradient method is examined and comparisons are made with some recently proposed algorithms in this area that also employ approximate inverses. The methods considered are compared under sequential and parallel hardware assumptions.  相似文献   

19.
During the last decade artificial immune systems have drawn much of the researchers’ attention. All the work that has been done allowed to develop many interesting algorithms which come in useful when solving engineering problems such as data mining and analysis, anomaly detection and many others. Being constantly developed and improved, the algorithms based on immune metaphors have some limitations, though. In this paper we elaborate on the concept of a novel artificial immune algorithm by considering the possibility of combining the clonal selection principle and the well known K-means algorithm. This novel approach and a new way of performing suppression (based on the usefulness of the evolving lymphocytes) in clonal selection result in a very effective and stable immune algorithm for both unsupervised and supervised learning. Further improvements to the cluster analysis by means of the proposed algorithm, immune K-means, are introduced. Different methods for clusters construction are compared, together with multi-point cluster validity index and a novel strategy based on minimal spanning tree (mst) and a analysis of the midpoints of the edges of the (mst). Interesting and useful improvements of the proposed approach by means of negative selection algorithms are proposed and discussed.  相似文献   

20.
Sensor node localization in mobile ad-hoc sensor networks is a challenging problem. Often, the anchor nodes tend to line up in a linear fashion in a mobile sensor network when nodes are deployed in an ad-hoc manner. This paper discusses novel node localization methods under the conditions of collinear ambiguity of the anchors. Additionally, the work presented herein also describes a methodology to fuse data available from multiple sensors for improved localization performance under conditions of collinear ambiguity. In this context, data is first acquired from multiple sensors sensing different modalities. The data acquired from each sensor is used to compute attenuation models for each sensor. Subsequently, a combined multi-sensor attenuation model is developed. The fusion methodology uses a joint error optimization approach on the multi-sensor data. The distance between each sensor node and anchor is itself computed using the differential power principle. These distances are used in the localization of sensor nodes under the condition of collinear ambiguity of anchors. Localization error analysis is also carried out in indoor conditions and compared with the Cramer–Rao lower bound. Experimental results on node localization using simulations and real field deployments indicate reasonable improvements in terms of localization accuracy when compared to methods likes MLAR and MGLR.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号