Similar Literature
20 similar documents found (search time: 15 ms)
1.
Transaction data record various information about individuals, including their purchases and diagnoses, and are increasingly published to support large-scale, low-cost studies in domains such as marketing and medicine. However, disseminating transaction data may lead to privacy breaches, as it allows an attacker to link an individual's record to their identity. Approaches that anonymize data by eliminating certain values in an individual's record or by replacing them with more general values have recently been proposed, but they often produce data of limited usefulness, because they adopt value transformation strategies that do not guarantee data utility in the intended applications and objective measures that may cause excessive data distortion. In this paper, we propose a novel approach for anonymizing data in a way that satisfies data publishers' utility requirements while incurring low information loss. To achieve this, we introduce an accurate information loss measure and an effective anonymization algorithm that explores a large part of the problem space. An extensive experimental study on click-stream and medical data demonstrates that our approach permits many times more accurate query answering than state-of-the-art methods while remaining comparable to them in efficiency.

2.
So far, data anonymization approaches based on k-anonymity and l-diversity have contributed much to privacy protection against record and attribute linkage attacks. However, the existing solutions are inefficient when applied to multimedia Big Data anonymization. This paper analyzes this problem in detail in terms of processing time, memory space, and usability, and presents two schemes to overcome the inefficiency. The first reduces processing time and space by minimizing temporary buffer usage during the anonymization process. The second constructs an early taxonomy during database design: database designers should take preliminary actions for anonymization at this early stage to alleviate the burden placed on data publishers. To evaluate the effectiveness and feasibility of these schemes, application tools based on the proposed approaches were implemented and experiments were conducted.

3.
The past two decades have seen growing interest in methods for privacy-preserving data publishing. Several models, algorithms, and system designs have been proposed in the literature to protect identities in published data. A number of these have been implemented and are successfully used in diverse applications such as medicine, supermarkets, and e-commerce. This work presents a comprehensive survey of previous research on techniques for ensuring privacy while publishing data. We identify and describe in detail the concepts, models, and algorithms related to this problem. The particular emphasis of the present work is on preserving the privacy of patient data, including demographic data, diagnosis codes, and data containing both demographics and diagnosis codes. Finally, we summarize some of the major open problems in privacy-preserving data publishing and discuss possible directions for future work in this domain.

4.
5.
Anonymization is a practical approach to protecting privacy in data. The major objective of privacy-preserving data publishing is to protect private information while keeping the data useful for intended applications, such as building classification models. In this paper, we argue that data generalization in anonymization should be determined by the classification capability of the data rather than by the privacy requirement. We use mutual information to measure the classification capability of a generalization, and propose two k-anonymity algorithms that produce anonymized tables for building accurate classification models. The algorithms generalize attributes to maximize classification capability, and then suppress values according to a privacy requirement k (IACk) or distributional constraints (IACc). Experimental results show that algorithm IACk supports more accurate classification models and is faster than a benchmark utility-aware data anonymization algorithm.
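The core idea of the abstract above, scoring candidate generalizations by how much mutual information they retain with the class label, can be sketched as follows. This is only an illustration of the measure, not the paper's IACk/IACc algorithms; the toy data and generalization levels are invented for the example.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) between two discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Toy data: ages and a binary class label.
ages = [23, 27, 34, 36, 45, 47, 52, 58]
label = ['no', 'no', 'no', 'yes', 'yes', 'yes', 'yes', 'yes']

# Two candidate generalizations of "age": decade bins vs. a coarser split.
decades = [a // 10 * 10 for a in ages]
coarse = ['<40' if a < 40 else '>=40' for a in ages]

# Keep the generalization that retains more classification capability.
best = max([decades, coarse], key=lambda g: mutual_information(g, label))
```

On this toy data the decade bins retain more information about the label than the coarse split, so they would be preferred despite being less general.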

6.
A framework for condensation-based anonymization of string data
In recent years, privacy-preserving data mining has become an important problem because of the large amount of personal data tracked by many business applications. An important method for privacy-preserving data mining is condensation. This method is often used for multi-dimensional data, where pseudo-data is generated to mask the true values of the records. However, such methods are not easily applicable to string data, since they require multi-dimensional statistics to generate the pseudo-data. String data are especially important in the privacy-preserving data-mining domain because most DNA and biological data are coded as strings. In this article, we discuss a new method for privacy-preserving mining of string data using simple template-based models. The template-based model turns out to be effective in practice, and preserves important statistical characteristics of the strings such as intra-record distances. We explore its behavior in the context of a classification application, and show that the accuracy of the application is not significantly affected by the anonymization process. This article is an extended version of the paper presented at the SIAM Conference on Data Mining, 2007 (Aggarwal and Yu, 2007).

7.
Anomaly detection is an important data mining task, aiming at the discovery of elements (known as outliers) that show significant deviation from the expected case. More specifically, given a set of objects, the problem is to return the suspicious objects that deviate significantly from typical behavior. As in clustering, applying different criteria leads to different definitions of an outlier. In this work, we focus on distance-based outliers: an object x is an outlier if fewer than k objects lie at distance at most R from x. The problem offers significant challenges in a stream-based environment, where data arrive continuously and outliers must be detected on the fly. A few research works study continuous outlier detection, but none meets the requirements of modern stream-based applications, for the following reasons: (i) they demand significant storage overhead, (ii) their efficiency is limited, and (iii) they lack flexibility, in that they assume a single configuration of the k and R parameters. In this work, we propose new algorithms for continuous outlier monitoring in data streams, based on sliding windows. Our techniques reduce the required storage overhead, are more efficient than previously proposed techniques, and offer significant flexibility with regard to the input parameters. Experiments on real-life and synthetic data sets verify our theoretical study.
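The distance-based outlier definition used above can be stated directly in code. This naive O(n²) sketch only illustrates the definition on a one-dimensional window; the paper's sliding-window algorithms exist precisely to avoid recomputing this from scratch as the stream advances.

```python
def distance_based_outliers(points, k, R):
    """Flag points that have fewer than k other points within distance R
    (the distance-based outlier definition, computed naively)."""
    outliers = []
    for i, x in enumerate(points):
        # Count neighbors of x: other points at distance at most R.
        neighbors = sum(1 for j, y in enumerate(points)
                        if i != j and abs(x - y) <= R)
        if neighbors < k:
            outliers.append(x)
    return outliers

# 1-D snapshot of a window: a tight cluster plus one far-away value.
window = [1.0, 1.1, 1.2, 1.3, 9.0]
distance_based_outliers(window, k=2, R=0.5)  # → [9.0]
```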

8.
International Journal of Information Security - The introduction of advanced metering infrastructure (AMI) smart meters has given rise to fine-grained electricity usage data at different levels of...

9.
The solution of the factor analysis problem is discussed. A method of factor analysis is developed that processes data in the form of a transaction base. It extracts rules from the given transaction bases, which generalizes the data and excludes the extracted features, thereby reducing the search space and the execution time of factor analysis. The computational complexity of the method is analyzed, and experimental investigations on practical and test problems are carried out.

10.
11.
Recent advances in AI planning have centred upon the reduction of planning to a constraint satisfaction problem (CSP) enabling the application of the efficient search algorithms available in this area. This paper continues this approach, presenting a novel technique which exploits (restriction/relaxation-based) dynamic CSP (rrDCSP) in order to further improve planner performance. Using the standard Graphplan framework, it is shown how significant efficiency gains may be obtained by viewing plan extraction as the solution of a hierarchy of such rrDCSPs. Furthermore, by using flexible constraints as a formal foundation, it is shown how the traditional Boolean notion of planning can be extended to incorporate prioritised and preference-based information. Plan extraction in this context is shown to generalise the Boolean rrDCSP approach, being systematically supported by the recently developed solution techniques for dynamic flexible CSPs (DFCSPs). The proposed techniques are evaluated via benchmark Boolean problems and a novel flexible benchmark problem. Results obtained are very encouraging.

12.
A matrix M is said to be k-anonymous if for each row r in M there are at least k - 1 other rows in M which are identical to r. The NP-hard k-Anonymity problem asks, given an n × m matrix M over a fixed alphabet and an integer s > 0, whether M can be made k-anonymous by suppressing (blanking out) at most s entries. Complementing previous work, we introduce two new "data-driven" parameterizations for k-Anonymity: the number t_in of different input rows and the number t_out of different output rows, both modeling aspects of data homogeneity. We show that k-Anonymity is fixed-parameter tractable for the parameter t_in, and that it is NP-hard even for t_out = 2 and alphabet size four. Notably, our fixed-parameter tractability result implies that k-Anonymity can be solved in linear time when t_in is a constant. Our computational hardness results also extend to the related privacy problems p-Sensitivity and ℓ-Diversity, while our fixed-parameter tractability results extend to p-Sensitivity and the usage of domain generalization hierarchies, where entries are replaced by more general data instead of being completely suppressed.
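The k-anonymity definition above is easy to check directly. A minimal sketch, using '*' as an arbitrary suppression symbol (the choice of symbol is an assumption for illustration):

```python
from collections import Counter

def is_k_anonymous(matrix, k):
    """A matrix is k-anonymous if every row occurs at least k times,
    i.e., each row is identical to at least k-1 other rows."""
    counts = Counter(tuple(row) for row in matrix)
    return all(c >= k for c in counts.values())

# Suppressing (blanking out) s = 2 entries makes this matrix 2-anonymous.
M = [['a', 'b'], ['a', 'c']]
M_suppressed = [['a', '*'], ['a', '*']]

is_k_anonymous(M, 2)             # → False: both rows are unique
is_k_anonymous(M_suppressed, 2)  # → True
```

Note that M_suppressed has t_out = 1 distinct output row, while the input had t_in = 2 distinct rows, the two parameters studied in the abstract.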

13.
In this paper, we propose a novel edge modification technique that better preserves the communities of a graph while anonymizing it. By maintaining the core number sequence of a graph (its coreness), we retain most of the information contained in the network while allowing changes in the degree sequence, i.e., obfuscating the visible data an attacker has access to. We reach a better trade-off between data privacy and data utility than existing methods by capitalizing on the slack between apparent degree (node degree) and true degree (node core number). Our extensive experiments on six diverse standard network datasets support this claim. Our framework compares our method to others used as proxies for privacy protection in the relevant literature. We demonstrate that our method leads to higher data utility preservation, especially in clustering, for the same levels of randomization and k-anonymity.
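The invariant the method maintains, the core number of each node, can be computed by standard k-core peeling. This sketch only illustrates what the core number sequence is, not the paper's edge modification algorithm; the toy graph is invented for the example.

```python
def core_numbers(adj):
    """Core number of every node in an undirected graph, given as an
    adjacency dict {node: set_of_neighbors}, via k-core peeling."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        v = min(remaining, key=lambda u: degree[u])  # peel a min-degree node
        k = max(k, degree[v])   # core numbers never decrease while peeling
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core

# A triangle a-b-c with a pendant node d attached to a: d has apparent
# degree 1 but the triangle's nodes keep core number 2.
adj = {'a': {'b', 'c', 'd'}, 'b': {'a', 'c'}, 'c': {'a', 'b'}, 'd': {'a'}}
core_numbers(adj)  # → {'d': 1, 'a': 2, 'b': 2, 'c': 2}
```

Here node a's apparent degree (3) exceeds its core number (2); that slack is what the abstract exploits to modify edges without disturbing the coreness.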

14.
In recent years, online social networks have become a part of everyday life for millions of individuals, and data analysts have found in them a fertile field for analyzing user behavior at individual and collective levels, for academic and commercial reasons. On the other hand, there are many risks for user privacy, as information a user may wish to keep private can become evident upon analysis. When data is anonymized to make it safe for publication in the public domain, however, information is inevitably lost with respect to the original version; a significant aspect of social networks is the local neighborhood of a user and its associated data. Current anonymization techniques are good at identifying and minimizing risks, but not so good at maintaining the local contextual data which relate users in a social network, so improving this aspect will have a high impact on the data utility of anonymized social networks. There is also a lack of systems which facilitate the work of a data analyst in anonymizing this type of data structure and performing controlled empirical experiments on different datasets. Hence, in the present work we address these issues by designing and implementing a sophisticated synthetic data generator together with an anonymization processor with strict privacy guarantees that takes the local neighborhood into account when anonymizing. All this is done for a complex dataset which can be fitted to a real dataset in terms of data profiles and distributions. In the empirical section we perform experiments to demonstrate the scalability of the method and the reduction in information loss with respect to approaches which do not consider the local neighborhood context when anonymizing.

15.
The concept of cloud computing has emerged as the next generation of computing infrastructure, reducing the costs associated with managing hardware and software resources. It is vital to its success that cloud computing offer efficient, flexible, and secure services. In this paper, we propose an efficient and anonymous data sharing protocol with a flexible sharing style, named EFADS, for outsourcing data onto the cloud. Through formal security analysis, we demonstrate that EFADS provides data confidentiality and data sharer anonymity without requiring any fully-trusted party. Experimental results show that EFADS is more efficient than existing competing approaches. Furthermore, the proxy re-encryption scheme we propose in this paper may be of independent interest: compared to previously reported proxy re-encryption schemes, it is the first pairing-free, anonymous, and unidirectional proxy re-encryption scheme in the standard model.

16.
17.
Association rule mining is an important research topic in data mining, and big data processing places ever higher demands on the efficiency of association rule mining algorithms, whose most time-consuming step is frequent pattern mining. To address the limited efficiency of current frequent pattern mining algorithms, this paper combines the Apriori and FP-growth algorithms and proposes IITM (interval intersection and transaction mapping), a frequent pattern mining algorithm based on intersecting transaction-mapped intervals. It scans the data set only twice to build the FP-tree, then scans the FP-tree to map each item's IDs into intervals, and grows patterns by intersecting these intervals. This avoids the efficiency loss of Apriori, which must scan the data set many times, and of FP-growth, which must iteratively build conditional FP-trees for pattern growth. Experiments on real data sets show that IITM outperforms Apriori, FP-growth, and PIETM under various support thresholds.
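The underlying idea of growing patterns by intersecting per-item transaction-ID sets can be sketched in an Eclat-style vertical miner. This is an illustration of the intersection principle only, not the IITM algorithm itself (no FP-tree or interval encoding); the toy transactions are invented for the example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Vertical frequent-itemset mining: the support of an itemset is the
    size of the intersection of its items' transaction-ID sets."""
    tids = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tids.setdefault(item, set()).add(tid)
    # Level 1: frequent single items with their tid-sets.
    frequent = {frozenset([i]): ids for i, ids in tids.items()
                if len(ids) >= min_support}
    result = dict(frequent)
    while frequent:
        size = len(next(iter(frequent))) + 1
        next_level = {}
        for a, b in combinations(frequent, 2):
            cand = a | b
            if len(cand) == size and cand not in next_level:
                ids = frequent[a] & frequent[b]   # intersect tid-sets
                if len(ids) >= min_support:
                    next_level[cand] = ids
        frequent = next_level
        result.update(next_level)
    return {itemset: len(ids) for itemset, ids in result.items()}

transactions = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}]
frequent_itemsets(transactions, min_support=2)
# Every pair is frequent (support 2), but {'a','b','c'} is not (support 1).
```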

18.
This paper describes a novel method for extracting text from document pages of mixed content. The method works by detecting pieces of text lines in small overlapping columns of width , shifted with respect to each other by image elements (good default values are: of the image width, ) and by merging these pieces in a bottom-up fashion to form complete text lines and blocks of text lines. The algorithm requires about 1.3 s for a 300 dpi image on a PC with a Pentium II CPU, 300 MHz, MotherBoard Intel440LX. The algorithm is largely independent of the layout of the document, the shape of the text regions, and the font size and style. The main assumptions are that the background be uniform and that the text sit approximately horizontally. For a skew of up to about 10 degrees no skew correction mechanism is necessary. The algorithm has been tested on the UW English Document Database I of the University of Washington and its performance has been evaluated by a suitable measure of segmentation accuracy. Also, a detailed analysis of the segmentation accuracy achieved by the algorithm as a function of noise and skew has been carried out. Received April 4, 1999 / Revised June 1, 1999

19.
This paper proposes a method for efficiently processing real-time read-only transactions in mobile broadcast environments. Several multiversion broadcast disk organizations are presented. A multiversion mechanism allows mobile read-only transactions to commit without blocking, and an optimistic approach eliminates conflicts between mobile read-only transactions and mobile update transactions. Multiversion dynamic adjustment of the serialization order avoids unnecessary transaction restarts. On the mobile host, a read-only transaction that passes backward validation can commit without being submitted to the server for processing, which reduces its response time. The proposed method was evaluated through simulation, and the results show that it outperforms other protocols.

20.
In big data applications, data privacy is one of the most pressing issues, because processing large-scale privacy-sensitive data sets often requires computation resources provisioned by public cloud services. Sub-tree data anonymization is a widely adopted scheme for anonymizing data sets for privacy preservation. Top-Down Specialization (TDS) and Bottom-Up Generalization (BUG) are two ways to fulfill sub-tree anonymization. However, existing approaches to sub-tree anonymization fall short of parallelization capability, and thereby lack scalability in handling big data in the cloud. Moreover, TDS and BUG each suffer from poor performance for certain values of the k-anonymity parameter. In this paper, we propose a hybrid approach that combines TDS and BUG for efficient sub-tree anonymization over big data. Further, we design MapReduce algorithms for the two components (TDS and BUG) to gain high scalability. Experimental evaluation demonstrates that the hybrid approach significantly improves the scalability and efficiency of the sub-tree anonymization scheme over existing approaches.
