Similar Literature
20 similar documents found (search time: 15 ms)
1.
Transaction data record various information about individuals, including their purchases and diagnoses, and are increasingly published to support large-scale and low-cost studies in domains such as marketing and medicine. However, the dissemination of transaction data may lead to privacy breaches, as it allows an attacker to link an individual's record to their identity. Approaches that anonymize data by eliminating certain values in an individual's record or by replacing them with more general values have been proposed recently, but they often produce data of limited usefulness. This is because these approaches adopt value transformation strategies that do not guarantee data utility in intended applications and objective measures that may lead to excessive data distortion. In this paper, we propose a novel approach for anonymizing data in a way that satisfies data publishers' utility requirements and incurs low information loss. To achieve this, we introduce an accurate information loss measure and an effective anonymization algorithm that explores a large part of the problem space. An extensive experimental study, using click-stream and medical data, demonstrates that our approach permits many times more accurate query answering than the state-of-the-art methods, while it is comparable to them in terms of efficiency.

2.
So far, data anonymization approaches based on k-anonymity and l-diversity have contributed much to privacy protection against record and attribute linkage attacks. However, the existing solutions are not efficient when applied to multimedia Big Data anonymization. This paper analyzes this problem in detail in terms of processing time, memory space, and usability, and presents two schemes to overcome such inefficiency. The first is to reduce processing time and space by minimizing temporary buffer usage during the anonymization process. The second is to construct an early taxonomy during database design. The idea behind this approach is that database designers should take preliminary actions for anonymization during the early stage of database design to alleviate the burden placed on data publishers. To evaluate the effectiveness and feasibility of these schemes, application tools based on the proposed approaches were implemented and experiments were conducted.

3.
4.
Anonymization is a practical approach to protecting privacy in published data. The major objective of privacy-preserving data publishing is to protect private information while the data remains useful for intended applications, such as building classification models. In this paper, we argue that data generalization in anonymization should be determined by the classification capability of the data rather than by the privacy requirement. We use mutual information to measure the classification capability of a generalization, and propose two k-anonymity algorithms that produce anonymized tables for building accurate classification models. The algorithms generalize attributes to maximize classification capability, and then suppress values according to a privacy requirement k (IACk) or distributional constraints (IACc). Experimental results show that algorithm IACk supports more accurate classification models and is faster than a benchmark utility-aware data anonymization algorithm.
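To see how mutual information can rank candidate generalizations, here is a minimal, hypothetical Python sketch: it scores each generalization level of a toy ZIP-code attribute by the mutual information it retains with the class label. The data, the star-masking scheme, and all names are illustrative assumptions, not the paper's IACk/IACc algorithms.

```python
# A minimal sketch (not the paper's algorithm): rank candidate
# generalization levels of an attribute by the mutual information
# they retain with the class label.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from two parallel label sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Toy records: ZIP codes (quasi-identifier) and a binary class label.
zips   = ["47677", "47602", "47678", "47905", "47909", "47906"]
labels = ["yes",   "yes",   "no",    "no",    "yes",   "no"]

# Two candidate generalization levels for the ZIP attribute.
level1 = [z[:4] + "*" for z in zips]    # e.g. 4767*, 4790*
level2 = [z[:3] + "**" for z in zips]   # e.g. 476**, 479**

for name, col in [("original", zips), ("level1", level1), ("level2", level2)]:
    print(name, round(mutual_information(col, labels), 3))
# A classification-capability-first anonymizer would pick the coarsest
# level whose mutual information with the class stays (near-)maximal,
# then apply suppression to meet the privacy requirement k.
```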

5.
A framework for condensation-based anonymization of string data
In recent years, privacy-preserving data mining has become an important problem because of the large amount of personal data tracked by many business applications. An important method for privacy-preserving data mining is condensation. This method is often used for multi-dimensional data, where pseudo-data is generated to mask the true values of the records. However, such methods are not easily applicable to string data, since they require multi-dimensional statistics to generate the pseudo-data. String data are especially important in the privacy-preserving data-mining domain because most DNA and biological data are coded as strings. In this article, we discuss a new method for privacy-preserving mining of string data using simple template-based models. The template-based model turns out to be effective in practice, and it preserves important statistical characteristics of the strings, such as intra-record distances. We explore its behavior in the context of a classification application, and show that the accuracy of the application is not significantly affected by the anonymization process. This article is an extended version of the paper presented at the SIAM Conference on Data Mining, 2007 (Aggarwal and Yu, 2007).

6.
International Journal of Information Security - The introduction of advanced metering infrastructure (AMI) smart meters has given rise to fine-grained electricity usage data at different levels of...

7.
The solution of the factor analysis problem is discussed. A factor analysis method that processes data in the form of a transaction base is developed. It extracts rules from the given transaction bases, which generalizes the data and therefore excludes extracted features, allowing one to reduce the search space and the execution time of factor analysis. The computational complexity of the method is analyzed, and experimental investigations on practical and test problems are carried out.

8.
9.
A matrix M is said to be k-anonymous if for each row r in M there are at least k − 1 other rows in M which are identical to r. The NP-hard k-Anonymity problem asks, given an n × m matrix M over a fixed alphabet and an integer s > 0, whether M can be made k-anonymous by suppressing (blanking out) at most s entries. Complementing previous work, we introduce two new "data-driven" parameterizations for k-Anonymity: the number t_in of different input rows and the number t_out of different output rows, both modeling aspects of data homogeneity. We show that k-Anonymity is fixed-parameter tractable for the parameter t_in, and that it is NP-hard even for t_out = 2 and alphabet size four. Notably, our fixed-parameter tractability result implies that k-Anonymity can be solved in linear time when t_in is a constant. Our computational hardness results also extend to the related privacy problems p-Sensitivity and ℓ-Diversity, while our fixed-parameter tractability results extend to p-Sensitivity and the use of domain generalization hierarchies, where entries are replaced by more general data instead of being completely suppressed.
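The matrix definition above is easy to operationalize as a checker; the Python sketch below does exactly that, with '*' standing for a suppressed entry. Deciding which at most s entries to suppress is the NP-hard part and is not attempted here.

```python
# A checker for the matrix notion of k-anonymity defined above:
# every row must occur at least k times (i.e., have k - 1 identical others).
from collections import Counter

def is_k_anonymous(matrix, k):
    counts = Counter(tuple(row) for row in matrix)
    return all(c >= k for c in counts.values())

M = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a", "b", "c"],
]
print(is_k_anonymous(M, 2))   # False: row (a, b, d) is unique

# Suppressing the whole third column (s = 3 suppressed entries) leaves
# a single distinct output row, i.e. t_out = 1:
M2 = [["a", "b", "*"]] * 3
print(is_k_anonymous(M2, 2))  # True
```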

10.
In this paper, we propose a novel edge modification technique that better preserves the communities of a graph while anonymizing it. By maintaining the core number sequence of a graph, its coreness, we retain most of the information contained in the network while allowing changes in the degree sequence, i.e., obfuscating the visible data an attacker has access to. We reach a better trade-off between data privacy and data utility than existing methods by capitalizing on the slack between apparent degree (node degree) and true degree (node core number). Our extensive experiments on six diverse standard network datasets support this claim. Our framework compares our method to others that are used as proxies for privacy protection in the relevant literature. We demonstrate that our method leads to higher data utility preservation, especially in clustering, for the same levels of randomization and k-anonymity.
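As a concrete illustration of the coreness idea, here is a hypothetical sketch using networkx: it only checks whether a candidate edge edit leaves the core-number sequence unchanged, and does not reproduce the paper's strategy for choosing edits.

```python
# Sketch: accept an anonymizing edge edit only if it preserves the
# graph's core-number (coreness) sequence, letting degrees change.
import networkx as nx

def core_sequence(G):
    return sorted(nx.core_number(G).values())

G = nx.karate_club_graph()
before = core_sequence(G)

candidate = (0, 9)       # a hypothetical edge to add for obfuscation
H = G.copy()
H.add_edge(*candidate)

if core_sequence(H) == before:
    print("edit accepted: degree sequence changes, coreness does not")
else:
    print("edit rejected: it would alter the core-number sequence")
```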

11.
The concept of cloud computing has emerged as the next generation of computing infrastructure to reduce the costs associated with the management of hardware and software resources. It is vital to its success that cloud computing be efficient, flexible, and secure. In this paper, we propose an efficient and anonymous data sharing protocol with a flexible sharing style, named EFADS, for outsourcing data onto the cloud. Through formal security analysis, we demonstrate that EFADS provides data confidentiality and data-sharer anonymity without requiring any fully trusted party. Experimental results show that EFADS is more efficient than existing competing approaches. Furthermore, the proxy re-encryption scheme we propose in this paper may be of independent interest: compared to previously reported proxy re-encryption schemes, it is the first pairing-free, anonymous, and unidirectional proxy re-encryption scheme in the standard model.

12.
In recent years, online social networks have become a part of everyday life for millions of individuals, and data analysts have found in them a fertile field for analyzing user behavior at individual and collective levels, for academic and commercial reasons. On the other hand, there are many risks for user privacy, as information a user may wish to keep private becomes evident upon analysis. When data is anonymized to make it safe for publication in the public domain, however, information is inevitably lost with respect to the original version; a significant aspect of social networks is the local neighborhood of a user and its associated data. Current anonymization techniques are good at identifying and minimizing risks, but not so good at maintaining the local contextual data that relate users in a social network, so improving this aspect will have a high impact on the data utility of anonymized social networks. There is also a lack of systems that facilitate the work of a data analyst in anonymizing this type of data structure and performing empirical experiments in a controlled manner on different datasets. Hence, in the present work we address these issues by designing and implementing a sophisticated synthetic data generator together with an anonymization processor that offers strict privacy guarantees and takes the local neighborhood into account when anonymizing. All this is done for a complex dataset that can be fitted to a real dataset in terms of data profiles and distributions. In the empirical section we perform experiments to demonstrate the scalability of the method and the reduction in information loss with respect to approaches that do not consider the local neighborhood context when anonymizing.

13.
14.
This paper describes a novel method for extracting text from document pages of mixed content. The method works by detecting pieces of text lines in small overlapping columns of fixed width, shifted with respect to each other by a fixed number of image elements, and by merging these pieces in a bottom-up fashion to form complete text lines and blocks of text lines. The algorithm requires about 1.3 s for a 300 dpi image on a PC with a 300 MHz Pentium II CPU (Intel 440LX motherboard). The algorithm is largely independent of the layout of the document, the shape of the text regions, and the font size and style. The main assumptions are that the background be uniform and that the text sit approximately horizontally; for a skew of up to about 10 degrees, no skew correction mechanism is necessary. The algorithm has been tested on the UW English Document Database I of the University of Washington and its performance has been evaluated by a suitable measure of segmentation accuracy. A detailed analysis of the segmentation accuracy achieved by the algorithm as a function of noise and skew has also been carried out.

15.
Segmenting customers by transaction data with concept hierarchy
The segmentation of customers is crucial for an organization wishing to develop appropriate promotion strategies for different clusters, and clustering customers provides an in-depth understanding of their behavior. However, previous studies have paid little attention to the similarity of different items within a transaction: lacking item categories and concept levels, the results of item-based segmentation methods are not as good as expected. By employing a concept hierarchy of items, this study proposes a segmentation methodology to identify similarities between customers. First, the dissimilarity between transaction sequences is defined. Second, we adopt a hierarchical clustering method to segment customers by their transaction data using the concept hierarchy of consumed items. After segmentation, three cluster validation indices are used to optimize the number of customer clusters. Through the comparison of normalized indices, the segmentation method proposed in this study renders better results than other traditional methods.
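A minimal sketch of this pipeline is given below, with an invented toy taxonomy: item dissimilarity falls as the least common ancestor of two items gets deeper, basket dissimilarity averages best-match item distances, and SciPy's hierarchical clustering does the segmentation. The exact dissimilarity definitions in the paper differ.

```python
# Sketch: concept-hierarchy-aware dissimilarity + hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

parent = {  # toy concept hierarchy: item/category -> parent category
    "cola": "drink", "juice": "drink", "chips": "snack", "nuts": "snack",
    "drink": "food", "snack": "food", "food": None,
}

def depth(node):
    d = 0
    while parent.get(node) is not None:
        node, d = parent[node], d + 1
    return d

def item_dist(a, b):
    """1 - depth(LCA)/max item depth; identical items have distance 0."""
    if a == b:
        return 0.0
    ancestors = set()
    x = a
    while x is not None:
        ancestors.add(x)
        x = parent.get(x)
    x = b
    while x not in ancestors:   # climb until the least common ancestor
        x = parent[x]
    return 1.0 - depth(x) / max(depth(a), depth(b))

def basket_dist(s, t):
    """Average best-match item distance from basket s to basket t."""
    return float(np.mean([min(item_dist(i, j) for j in t) for i in s]))

baskets = [{"cola", "chips"}, {"juice", "nuts"}, {"cola", "juice"}]
n = len(baskets)
D = np.array([[max(basket_dist(baskets[i], baskets[j]),
                   basket_dist(baskets[j], baskets[i]))
               for j in range(n)] for i in range(n)])
Z = linkage(squareform(D, checks=False), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))  # cluster label per customer
```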

16.
In big data applications, data privacy is one of the most pressing issues because processing large-scale privacy-sensitive data sets often requires computation resources provisioned by public cloud services. Sub-tree data anonymization is a widely adopted scheme to anonymize data sets for privacy preservation. Top-Down Specialization (TDS) and Bottom-Up Generalization (BUG) are two ways to fulfill sub-tree anonymization. However, existing approaches for sub-tree anonymization fall short of parallelization capability, thereby lacking scalability in handling big data in the cloud. Moreover, TDS and BUG each suffer from poor performance for certain values of the k-anonymity parameter. In this paper, we propose a hybrid approach that combines TDS and BUG for efficient sub-tree anonymization over big data. Further, we design MapReduce algorithms for the two components (TDS and BUG) to gain high scalability. Experimental evaluation demonstrates that the hybrid approach significantly improves the scalability and efficiency of the sub-tree anonymization scheme over existing approaches.
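For intuition, here is a single-machine sketch of the Top-Down Specialization loop; the paper's contribution is its MapReduce parallelization and the TDS+BUG combination, neither of which is shown, and the taxonomy and data are invented. Starting from the most general taxonomy cut, it keeps specializing while every induced group still holds at least k records.

```python
# Single-machine TDS sketch: specialize a taxonomy cut while k-anonymity
# (every group size >= k) still holds.
from collections import Counter

children = {"Any": ["Europe", "Asia"],
            "Europe": ["FR", "DE"], "Asia": ["CN", "JP"]}  # toy taxonomy

def groups(records, cut):
    """Group sizes when each leaf is generalized to its ancestor in `cut`."""
    up = {}
    def cover(node, rep):
        up[node] = rep
        for c in children.get(node, []):
            cover(c, rep)
    for node in cut:
        cover(node, node)
    return Counter(up[r] for r in records)

def tds(records, k):
    cut = ["Any"]
    while True:
        for node in list(cut):
            if node not in children:        # leaves cannot be specialized
                continue
            trial = [c for c in cut if c != node] + children[node]
            if all(v >= k for v in groups(records, trial).values()):
                cut = trial                  # specialization keeps k-anonymity
                break
        else:
            return cut                       # no further specialization is safe

data = ["FR", "FR", "DE", "CN", "JP", "CN"]
print(tds(data, 2))  # ['Europe', 'Asia']: neither region can split at k = 2
```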

17.
AdaBoost is a famous ensemble learning method and has achieved successful applications in many fields. The existing studies illustrate that AdaBoost easily suffe...

18.
Let us consider the following situation: t entities (e.g., hospitals) hold different databases containing different records for the same type of confidential (e.g., medical) data. They want to deliver a protected version of this data to third parties (e.g., pharmaceutical researchers), preserving in some way both the utility and the privacy of the original data. This can be done by applying a statistical disclosure control (SDC) method. One possibility is that each entity protects its own database individually, but this strategy provides less utility and privacy than a collective strategy where the entities cooperate, by means of a distributed protocol, to produce a global protected dataset. In this paper, we investigate the problem of distributed protocols for SDC protection methods. We propose a simple, efficient and secure distributed protocol for the specific SDC method of rank shuffling. We run experiments to evaluate the quality of this protocol and to compare the individual and collective strategies for protecting a distributed database. With respect to other distributed versions of SDC methods, the new protocol provides either more security or more efficiency, as we discuss throughout the paper.
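As background for the method being distributed, the sketch below shows one common single-party reading of rank shuffling (rank swapping): values are swapped with partners lying within a bounded rank distance. The window size, data, and swapping policy are assumptions; the paper's distributed protocol is not reproduced here.

```python
# Local rank-swapping sketch: each value is swapped with a partner at
# most `window` ranks away, perturbing values while roughly keeping ranks.
import random

def rank_swap(values, window, rng=None):
    rng = rng or random.Random(0)
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = list(values)
    i = 0
    while i + 1 < len(order):
        j = min(i + rng.randint(1, window), len(order) - 1)
        out[order[i]], out[order[j]] = out[order[j]], out[order[i]]
        i = j + 1
    return out

salaries = [31000, 45000, 52000, 38000, 61000, 47000, 33000]
print(rank_swap(salaries, window=2))
```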

19.
Most algorithms related to association rule mining are designed to discover frequent itemsets from a binary database; other factors such as profit, cost, or quantity are not considered there. Utility mining was thus proposed to measure the utility values of purchased items for finding high-utility itemsets from a static database. In real-world applications, however, transactions may be inserted into or deleted from a dynamic database. An existing maintenance approach for handling high-utility itemsets in dynamic databases with transaction deletion must rescan the database when necessary. In this paper, an efficient algorithm, called PRE-HUI-DEL, for updating high-utility itemsets based on the pre-large concept under transaction deletion is proposed. The pre-large concept is used to partition transaction-weighted utilization itemsets into three sets with nine cases, according to whether they have large (high), pre-large, or small transaction-weighted utilization in the original database and in the deleted transactions. Specific procedures are then applied to each case to maintain and update the discovered high-utility itemsets. Experimental results show that the proposed PRE-HUI-DEL algorithm outperforms a batch two-phase algorithm and a FUP2-based algorithm in maintaining high-utility itemsets.
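The nine-case bookkeeping is the heart of the maintenance scheme, and a minimal sketch of it follows; thresholds, total utilities, and TWU numbers are toy values, and the per-case update procedures of PRE-HUI-DEL are not shown.

```python
# Sketch of the pre-large partition: classify each itemset's
# transaction-weighted utilization (TWU) in the original database and
# in the deleted transactions, giving 3 x 3 = 9 maintenance cases.
def status(twu, total_utility, upper, lower):
    ratio = twu / total_utility
    if ratio >= upper:
        return "large"
    if ratio >= lower:
        return "pre-large"
    return "small"

UPPER, LOWER = 0.30, 0.20          # high-utility and pre-large thresholds
orig_total, del_total = 1000, 100  # total utility: original DB / deletions

itemsets = {  # itemset -> (TWU in original DB, TWU in deleted txns)
    ("a",): (400, 10),
    ("a", "b"): (250, 40),
    ("c",): (150, 5),
}
for itemset, (twu_o, twu_d) in itemsets.items():
    case = (status(twu_o, orig_total, UPPER, LOWER),
            status(twu_d, del_total, UPPER, LOWER))
    print(itemset, "->", case)   # which of the nine cases applies
```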

20.
As the amount of multimedia data increases day by day, thanks to cheaper storage devices and a growing number of information sources, machine learning algorithms are faced with large-sized datasets. When the original data is huge, small sample sizes are preferred for various applications; this is typically the case for multimedia applications. But a simple random sample may not obtain satisfactory results, because such a sample may not adequately represent the entire data set due to random fluctuations in the sampling process. The difficulty is particularly apparent when small sample sizes are needed. Fortunately, the use of a good sampling set for training can improve the final results significantly. In KDD'03 we proposed EASE, which outputs a sample based on its 'closeness' to the original sample; reported results show that EASE outperforms simple random sampling (SRS). In this paper we propose EASIER, which extends EASE in two ways. (1) EASE is a halving algorithm: to achieve the required sample ratio it starts from a suitable initial large sample and iteratively halves. EASIER, on the other hand, does away with the repeated halving by directly obtaining the required sample ratio in one iteration. (2) EASE was shown to work on the IBM QUEST dataset, which is a categorical count data set. EASIER, in addition, is shown to work on continuous data of image and audio features. We have successfully applied EASIER to image classification and audio event identification applications. Experimental results show that EASIER outperforms SRS significantly.
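To illustrate the halving idea attributed to EASE (EASIER's one-iteration shortcut is not shown), here is a hypothetical sketch: repeatedly split the sample and keep the half whose value histogram is closer to the full data's. The L1 histogram distance stands in for the paper's 'closeness' measure, which the abstract does not define.

```python
# EASE-like halving sketch: keep the half-sample whose histogram best
# matches the full dataset, until the target sampling ratio is reached.
import random
from collections import Counter

def hist_dist(sample, full_hist, n_full):
    h = Counter(sample)
    keys = set(h) | set(full_hist)
    return sum(abs(h[k] / len(sample) - full_hist[k] / n_full) for k in keys)

def halving_sample(data, target_ratio, rng=None):
    rng = rng or random.Random(0)
    full_hist, sample = Counter(data), list(data)
    while len(sample) / len(data) > target_ratio:
        rng.shuffle(sample)
        mid = len(sample) // 2
        sample = min(sample[:mid], sample[mid:],
                     key=lambda h: hist_dist(h, full_hist, len(data)))
    return sample

data = list("aaabbbbccddddddee" * 10)
print(Counter(halving_sample(data, 0.25)))  # compare against Counter(data)
```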
