Found 20 similar documents (search time: 0 ms)
1.
Privacy preserving clustering on horizontally partitioned data (total citations: 3; self-citations: 0; citations by others: 3)
Data mining has been a popular research area for more than a decade due to its vast spectrum of applications. However, the popularity and wide availability of data mining tools have also raised concerns about the privacy of individuals. The aim of privacy preserving data mining researchers is to develop data mining techniques that can be applied to databases without violating the privacy of individuals. Privacy preserving techniques for various data mining models have been proposed, initially for classification on centralized data and then for association rules in distributed environments. In this work, we propose methods for constructing the dissimilarity matrix of objects from different sites in a privacy preserving manner. The matrix can be used for privacy preserving clustering as well as database joins, record linkage and other operations that require pair-wise comparison of individual private data objects horizontally distributed across multiple sites. We show the communication and computation complexity of our protocol by conducting experiments over synthetically generated and real datasets. Each experiment is also performed for a baseline protocol with no privacy protection, so that comparing the two protocols shows the overhead that security and privacy add.
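As a rough illustration of what the dissimilarity matrix in this abstract contains, the following sketch builds one in the clear, in the spirit of a baseline with no privacy protection. It is not the paper's protocol: the function names, the use of Euclidean distance, and the two-site setup are all assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity_matrix(site_a, site_b):
    """Pairwise dissimilarity matrix over the union of two sites' records.

    Hypothetical non-private baseline: both sites reveal their records, so
    every pairwise distance can be computed directly. Rows and columns
    follow the order site_a records first, then site_b records.
    """
    objects = list(site_a) + list(site_b)
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = euclidean(objects[i], objects[j])
            matrix[i][j] = d  # the matrix is symmetric,
            matrix[j][i] = d  # so fill both halves at once
    return matrix


# Horizontally partitioned data: same attributes, different records per site.
site_a = [(0.0, 0.0), (3.0, 4.0)]
site_b = [(6.0, 8.0)]
matrix = dissimilarity_matrix(site_a, site_b)
```

The entries that pair a record from one site with a record from the other are the ones that require revealing private data in this baseline, which is the part a privacy preserving protocol would have to compute differently.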
2.
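To address the low security and low mining efficiency of privacy preserving association rule mining algorithms on vertically distributed data, a privacy preserving association rule mining algorithm is proposed. The algorithm uses a new dot-product protocol that hides the original input information by introducing an inverse matrix and random numbers, which gives it good security. It mines maximal frequent itemsets instead of all frequent itemsets and uses a depth-first traversal strategy combined with various pruning strategies, which markedly speeds up frequent itemset generation and greatly reduces computational cost. Experimental results show that mining efficiency is greatly improved.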
3.
As the total amount of traffic data in networks has been growing at an alarming rate, there is currently a substantial body of research that attempts to mine traffic data with the purpose of obtaining useful information. For instance, there are some investigations into the detection of Internet worms and intrusions by discovering abnormal traffic patterns. However, since network traffic data contain information about the Internet usage patterns of users, network users’ privacy may be compromised during the mining process. In this paper, we propose an efficient and practical method that preserves privacy during sequential pattern mining on network traffic data. In order to discover frequent sequential patterns without violating privacy, our method uses the N-repository server model, which operates as a single mining server, and the retention replacement technique, which changes the answer to a query probabilistically. In addition, our method accelerates the overall mining process by maintaining meta tables at each site so as to determine quickly whether candidate patterns have ever occurred at the site. Extensive experiments with real-world network traffic data demonstrated the correctness and efficiency of the proposed method.
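The idea of changing an answer probabilistically can be sketched as follows. This is a generic illustration of retention replacement, not the method of the paper above: the function names, the uniform replacement distribution, and the retention probability `p` are assumptions for the example.

```python
import random


def retention_replace(value, domain, p, rng):
    """Keep the true value with probability p; otherwise replace it with a
    value drawn uniformly from the domain (assumed replacement distribution)."""
    if rng.random() < p:
        return value
    return rng.choice(domain)


def estimate_true_fraction(observed_fraction, p, domain_size):
    """Estimate the true fraction f of a value from its perturbed fraction.

    Under the uniform-replacement assumption,
        observed = p * f + (1 - p) / domain_size,
    so f is recovered by inverting that expression.
    """
    return (observed_fraction - (1.0 - p) / domain_size) / p


# Usage: perturb yes/no answers, then estimate the original share of "yes".
rng = random.Random(0)
truth = [True] * 300 + [False] * 700
perturbed = [retention_replace(v, [True, False], 0.8, rng) for v in truth]
observed = sum(perturbed) / len(perturbed)
estimate = estimate_true_fraction(observed, 0.8, 2)
```

Each individual answer is deniable because it may have been replaced, while aggregate counts can still be estimated; a lower `p` gives more privacy at the cost of noisier estimates.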
4.
A novel multiseed nonhierarchical data clustering technique (total citations: 5; self-citations: 0; citations by others: 5)
Chaudhuri D. Chaudhuri B.B. 《IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics》1997,27(5):871-876
Clustering techniques such as K-means and Forgy, as well as their improved version ISODATA, group data around one seed point for each cluster. It is well known that these methods do not work well if the shape of the cluster is elongated or nonconvex. We argue that for an elongated or nonconvex shaped cluster, more than one seed is needed. In this paper a multiseed clustering algorithm is proposed. A density based representative point selection algorithm is used to choose the initial seed points. To assign several seed points to one cluster, a novel technique guided by a minimal spanning tree is proposed. Also, a border point detection algorithm is proposed for detecting the shape of the cluster. This border in turn indicates whether the cluster is elongated or not. Experimental results show the efficiency of this clustering technique.
5.
The Journal of Supercomputing - In recent years, the data exchange among service providers and users has increased tremendously. Various organizations like banking sectors, health as well...
6.
7.
Tung-Shou Chen Wei-Bin Lee Jeanne Chen Yuan-Hung Kao Pei-Wen Hou 《The Journal of supercomputing》2013,66(2):907-917
Privacy Preserving Data Mining (PPDM) can prevent private data from disclosure in data mining. However, current PPDM methods damage the values of the original data, so knowledge from the mined data cannot be verified against the original data. In this paper, we combine concepts and techniques from reversible data hiding to propose a reversible privacy preserving data mining scheme that solves the irrecoverability problem of PPDM. In the proposed privacy difference expansion (PDE) method, the original data is perturbed and embedded with a fragile watermark to achieve privacy preservation and data integrity of the mined data and also to allow recovery of the original data. Experimental tests on classification accuracy, probabilistic information loss, and privacy disclosure risk are used to evaluate the efficiency of PDE for privacy preservation and knowledge verification.
8.
9.
Polygons provide natural representations for many types of geospatial objects, such as countries, buildings, and pollution hotspots. Thus, polygon-based data mining techniques are particularly useful for mining geospatial datasets. In this paper, we propose a polygon-based clustering and analysis framework for mining multiple geospatial datasets that have inherently hidden relations. In this framework, polygons are first generated from multiple geospatial point datasets by using a density-based contouring algorithm called DCONTOUR. Next, a density-based clustering algorithm called Poly-SNN with novel dissimilarity functions is employed to cluster polygons to create meta-clusters of polygons. Finally, post-processing analysis techniques are proposed to extract interesting patterns and user-guided summarized knowledge from meta-clusters. These techniques employ plug-in reward functions that capture a domain expert’s notion of interestingness to guide the extraction of knowledge from meta-clusters. The effectiveness of our framework is tested in a real-world case study involving ozone pollution events in Texas. The experimental results show that our framework can reveal interesting relationships between different ozone hotspots represented by polygons; it can also identify interesting hidden relations between ozone hotspots and several meteorological variables, such as outdoor temperature, solar radiation, and wind speed.
10.
This paper presents SCALE, a fully automated transactional clustering framework. The SCALE design highlights three unique features. First, we introduce the concept of Weighted Coverage Density as a categorical similarity measure for efficient clustering of transactional datasets. The concept of weighted coverage density is intuitive and it allows the weight of each item in a cluster to be changed dynamically according to the occurrences of items. Second, we develop the weighted coverage density measure based clustering algorithm, a fast, memory-efficient, and scalable clustering algorithm for analyzing transactional data. Third, we introduce two clustering validation metrics and show that these domain specific clustering evaluation metrics are critical to capture the transactional semantics in clustering analysis. Our SCALE framework combines the weighted coverage density measure for clustering over a sample dataset with self-configuring methods. These self-configuring methods can automatically tune the two important parameters of our clustering algorithms: (1) the candidates of the best number K of clusters; and (2) the application of two domain-specific cluster validity measures to find the best result from the set of clustering results. We have conducted extensive experimental evaluation using both synthetic and real datasets and our results show that the weighted coverage density approach powered by the SCALE framework can efficiently generate high quality clustering results in a fully automated manner.
11.
Kweku-Muata Osei-Bryson 《Information Sciences》2010,180(3):414-47
Clustering is a popular non-directed learning data mining technique for partitioning a dataset into a set of clusters (i.e. a segmentation). Although there are many clustering algorithms, none is superior on all datasets, and so it is never clear which algorithm and which parameter settings are the most appropriate for a given dataset. This suggests that an appropriate approach to clustering should involve the application of multiple clustering algorithms with different parameter settings and a non-taxing approach for comparing the various segmentations that would be generated by these algorithms. In this paper we are concerned with the situation where a domain expert has to evaluate several segmentations in order to determine the most appropriate segmentation (set of clusters) based on his/her specified objective(s). We illustrate how a data mining process model could be applied to address this problem.
12.
Crime data mining: a general framework and some examples (total citations: 2; self-citations: 0; citations by others: 2)
A major challenge facing all law-enforcement and intelligence-gathering organizations is accurately and efficiently analyzing the growing volumes of crime data. Detecting cybercrime can likewise be difficult because busy network traffic and frequent online transactions generate large amounts of data, only a small portion of which relates to illegal activities. Data mining is a powerful tool that enables criminal investigators who may lack extensive training as data analysts to explore large databases quickly and efficiently. We present a general framework for crime data mining that draws on experience gained with the Coplink project, which researchers at the University of Arizona have been conducting in collaboration with the Tucson and Phoenix police departments since 1997.
13.
CLARANS: a method for clustering objects for spatial data mining (total citations: 14; self-citations: 0; citations by others: 14)
Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. To this end, this paper has three main contributions. First, it proposes a new clustering method called CLARANS, whose aim is to identify spatial structures that may be present in the data. Experimental results indicate that, when compared with existing clustering methods, CLARANS is very efficient and effective. Second, the paper investigates how CLARANS can handle not only point objects, but also polygon objects efficiently. One of the methods considered, called the IR-approximation, is very efficient in clustering convex and nonconvex polygon objects. Third, building on top of CLARANS, the paper develops two spatial data mining algorithms that aim to discover relationships between spatial and nonspatial attributes. Both algorithms can discover knowledge that is difficult to find with existing spatial data mining algorithms.
14.
15.
Today’s major search engines return ranked search results that match the keywords the user specifies. There have been many proposals to rank the search results such that they match the user’s intentions and needs more closely. Despite good advances during the past decade, this problem still requires considerable research, as the number of search results has become ever larger. We define the collection of each search result and all the Web pages that are linked to the result as a search-result drilldown. We hypothesize that by mining and analyzing the top terms in the search-result drilldown of search results, it may be possible to make each search result more meaningful to the user, so that the user may select the desired search results with higher confidence. In this paper, we describe this technique, and show the results of preliminary validation work that we have done.
16.
Shankar Sahana P. Naresh E. Agrawal Harshit 《Innovations in Systems and Software Engineering》2022,18(2):251-261
Innovations in Systems and Software Engineering - Software quality has been an important area of interest for decades in the IT sector and software firms. Defect prediction gives the tester...
17.
Charu C. Aggarwal 《Knowledge and Information Systems》2012,30(1):1-29
Data Streams have become ubiquitous in recent years because of advances in hardware technology which have enabled automated recording of large amounts of data. The primary constraint in the effective mining of streams is the large volume of data which must be processed in real time. In many cases, it is desirable to store a summary of the data stream segments in order to perform data mining tasks. Since density estimation provides a comprehensive overview of the probabilistic data distribution of a stream segment, it is a natural choice for this purpose. A direct use of density distributions can however turn out to be an inefficient storage and processing mechanism in practice. In this paper, we introduce the concept of cluster histograms, which provides an efficient way to estimate and summarize the most important data distribution profiles over different stream segments. These profiles can be constructed in a supervised or unsupervised way depending upon the nature of the underlying application. The profiles can also be used for change detection, anomaly detection, segmental nearest neighbor search, or supervised stream segment classification. Furthermore, these techniques can also be used for modeling other kinds of data such as text and categorical data. The flexibility of the tasks which can be performed from the cluster histogram framework follows from its generality in storing the historical density profile of the data stream. As a result, this method provides a holistic framework for density-based mining of data streams. We discuss and test the application of the cluster histogram framework to a variety of interesting data mining applications.
18.
This paper studies the problem of probabilistic range queries over uncertain data. Although existing solutions can support such queries, there is still room for improvement. In this paper, we first propose a novel index called S-MRST for indexing uncertain data. First, by using an irregular shape to bound uncertain data, it has a stronger space pruning ability. Second, by taking the gradient of the probability density function into consideration, S-MRST also has strong probability pruning ability. More importantly, S-MRST is a general index that can support multiple types of probabilistic queries. Theoretical analysis and extensive experimental results demonstrate the effectiveness and efficiency of the proposed index.
19.
Mehrdad Kargari Mohammad Mehdi Sepehri 《Expert systems with applications》2012,39(5):4740-4748
Clustering of retail stores in a distribution network with specific geographical limits plays an important and effective role in reducing distribution and transportation costs. In this paper, the relevant data and information for an established automotive spare-parts distribution and after-sales services company (ISACO) over a 3-year period are analyzed. Given the diversity and size of the available information, such as store locations, orders, goods, transportation vehicles, and road and traffic information, three influencing factors with specific weights are defined for the similarity function: 1. Euclidean distance, 2. Lot size, 3. Order concurrency. Based on these three factors, the similarity function is examined in 5 steps using Association Rules principles, where the stores are clustered using the k-means algorithm and similar stores are allocated to the same clusters. These steps are: 1. Similarity function based on the Euclidean distances, 2. Similarity function based on the order concurrency, 3. Similarity function based on the combination of the order concurrency and lot size, 4. Similarity function based on the combination of these three factors, and 5. Improved similarity function. The clustering operation for each of these 5 cases was carried out using R software, and the improved combinational function was chosen as the optimal clustering function. Then, the trend of each retail store was analyzed using the improved combinational function and, along with determining the priority of depot center establishment for every cluster, appropriate distribution policies were formulated for every cluster. The results of this study indicate a significant cost reduction (32%) in automotive spare-parts distribution and transportation costs.
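One way to picture a weighted similarity function over the three factors named in this abstract is sketched below. Everything specific here is hypothetical: the abstract does not give the weights, the normalizations, or the data layout, and the paper's own analysis was done in R, so the field names `location`, `lot_size`, and `order_days`, the Jaccard measure for order concurrency, and the example weights are assumptions for illustration only.

```python
import math


def combined_similarity(store_a, store_b, weights, max_distance):
    """Weighted similarity in [0, 1] between two stores (hypothetical form).

    Assumes positive lot sizes, non-empty order-day lists, and weights that
    sum to 1. Each factor is normalized to [0, 1] before weighting.
    """
    w_dist, w_lot, w_conc = weights

    # Factor 1: Euclidean distance, turned into a similarity (closer = higher).
    dist = math.dist(store_a["location"], store_b["location"])
    dist_sim = 1.0 - min(dist / max_distance, 1.0)

    # Factor 2: lot size, compared as a smaller-to-larger ratio.
    lot_a, lot_b = store_a["lot_size"], store_b["lot_size"]
    lot_sim = min(lot_a, lot_b) / max(lot_a, lot_b)

    # Factor 3: order concurrency, as Jaccard overlap of the days with orders.
    days_a, days_b = set(store_a["order_days"]), set(store_b["order_days"])
    conc_sim = len(days_a & days_b) / len(days_a | days_b)

    return w_dist * dist_sim + w_lot * lot_sim + w_conc * conc_sim


store_1 = {"location": (0.0, 0.0), "lot_size": 10, "order_days": [1, 2, 3]}
store_2 = {"location": (3.0, 4.0), "lot_size": 20, "order_days": [2, 3, 4]}
similarity = combined_similarity(store_1, store_2, (0.5, 0.3, 0.2), max_distance=10.0)
```

Setting a weight to zero reproduces the single-factor or two-factor variants that the abstract lists as separate steps, which makes the five cases easy to compare within one function.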
20.
In this paper, we make an effort to overcome the sensitivity of traditional clustering algorithms to noisy data points (noise and outliers). A novel pruning method, in terms of information theory, is therefore proposed to phase out noisy points for robust data clustering. This approach identifies and prunes the noisy points based on the maximization of mutual information against input data distributions such that the resulting clusters are least affected by noise and outliers, where the degree of robustness is controlled through a separate parameter to make a trade-off between rejection of noisy points and optimal clustered data. The pruning approach is general, and it can improve the robustness of many existing traditional clustering methods. In particular, we apply the pruning approach to improve the robustness of fuzzy c-means clustering and its extensions, e.g., fuzzy c-spherical shells clustering and kernel-based fuzzy c-means clustering. As a result, we obtain three clustering algorithms that are the robust versions of the existing ones. The effectiveness of the proposed pruning approach is supported by experimental results.