Found 20 similar articles (search time: 15 ms)
1.
To protect individual privacy in data mining, when a miner collects data from respondents, the respondents should remain anonymous. The existing technique of Anonymity-Preserving Data Collection partially solves this problem, but it assumes that the data do not contain any identifying information about the corresponding respondents. On the other hand, the existing technique of Privacy-Enhancing k-Anonymization can make the collected data anonymous by eliminating the identifying information. However, it assumes that each respondent submits her data through an unidentified communication channel. In this paper, we propose k-Anonymous Data Collection, which has the advantages of both Anonymity-Preserving Data Collection and Privacy-Enhancing k-Anonymization but does not rely on their assumptions described above. We give rigorous proofs for the correctness and privacy of our protocol, and experimental results for its efficiency. Furthermore, we extend our solution to the fully malicious model, in which a dishonest participant can deviate from the protocol and behave arbitrarily.
2.
Standard algorithms for association rule mining are based on identification of frequent itemsets. In this paper, we study how to maintain privacy in distributed mining of frequent itemsets. That is, we study how two (or more) parties can find frequent itemsets in a distributed database without revealing each party’s portion of the data to the other. The existing solution for vertically partitioned data leaks a significant amount of information, while the existing solution for horizontally partitioned data only works for three parties or more. In this paper, we design algorithms for both vertically and horizontally partitioned data, with cryptographically strong privacy. We give two algorithms for vertically partitioned data; one of them reveals only the support count and the other reveals nothing. Both of them have computational overheads linear in the number of transactions. Our algorithm for horizontally partitioned data works for two parties and above and is more efficient than the existing solution.
3.
This paper presents a data mining approach for modeling the adiabatic temperature rise during concrete hydration. The model was developed based on experimental data obtained in the last thirty years for several mass concrete constructions in Brazil, including some of the largest hydroelectric power plants in operation in the world. The input of the model is a variable data set corresponding to the binder physical and chemical properties and concrete mixture proportions. The output is a set of three parameters that determine a function capable of describing the adiabatic temperature rise during concrete hydration. The comparison between experimental data and modeling results shows the accuracy of the proposed approach and that data mining is a potential tool to predict thermal stresses in the design of massive concrete structures.
4.
k-anonymity provides a measure of privacy protection by preventing re-identification of data to fewer than a group of k data items. While algorithms exist for producing k-anonymous data, the model has been that of a single source wanting to publish data. Due to privacy issues, it is common that data from different sites cannot be shared directly. Therefore, this paper presents a two-party framework along with an application that generates k-anonymous data from two vertically partitioned sources without disclosing data from one site to the other. The framework is privacy preserving in the sense that it satisfies the security definition commonly used in the Secure Multiparty Computation literature.
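As an illustration of the k-anonymity property this abstract refers to, the following minimal sketch checks whether a table is k-anonymous with respect to a chosen set of quasi-identifier columns. It is not the two-party protocol from the paper; the function name `is_k_anonymous`, the column names, and the sample records are all assumptions made for this example.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that appears in `records` appears at least k times."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


# Hypothetical, already-generalized records (illustrative values only).
records = [
    {"zip": "130**", "age": "<30", "diagnosis": "flu"},
    {"zip": "130**", "age": "<30", "diagnosis": "cold"},
    {"zip": "148**", "age": ">=40", "diagnosis": "flu"},
    {"zip": "148**", "age": ">=40", "diagnosis": "asthma"},
]
```

With these sample records, each (zip, age) combination occurs twice, so the table is 2-anonymous but not 3-anonymous on those two columns.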
5.
Data mining is most commonly used in attempts to induce association rules from transaction data. In the past, we used the fuzzy and GA concepts to discover both useful fuzzy association rules and suitable membership functions from quantitative values. The evaluation of fitness values was, however, quite time-consuming. Due to dramatic increases in available computing power and concomitant decreases in computing costs over the last decade, learning or mining by applying parallel processing techniques has become a feasible way to overcome the slow-learning problem. In this paper, we thus propose a parallel genetic-fuzzy mining algorithm based on the master–slave architecture to extract both association rules and membership functions from quantitative transactions. The master processor uses a single population, as a simple genetic algorithm does, and distributes the tasks of fitness evaluation to slave processors. The evolutionary processes, such as crossover, mutation and production, are performed by the master processor. It is very natural and efficient to run the proposed algorithm on the master–slave architecture. The time complexities of both the sequential and the parallel genetic-fuzzy mining algorithms have also been analyzed, with results favoring the proposed algorithm. When the number of generations is large, the speed-up can be nearly linear. The experimental results also show this point. Applying the master–slave parallel architecture to speed up the genetic-fuzzy data mining algorithm is thus a feasible way to overcome the low-speed fitness evaluation problem of the original algorithm.
6.
In this paper, we propose a novel face detection method based on the MAFIA algorithm. Our proposed method consists of two phases, namely, training and detection. In the training phase, we first apply Sobel's edge detection operator, a morphological operator, and thresholding to each training image, and transform it into an edge image. Next, we use the MAFIA algorithm to mine the maximal frequent patterns from those edge images and obtain the positive feature pattern. Similarly, we can obtain the negative feature pattern from the complements of the edge images. Based on the feature patterns mined, we construct a face detector to prune non-face candidates. In the detection phase, we apply a sliding window to the testing image at different scales. If a sliding window passes the face detector, it is considered a human face. The proposed method can automatically find the feature patterns that capture most facial features. By using the feature patterns to construct a face detector, the proposed method is robust to variations in race, illumination, and facial expression. The experimental results show that the proposed method has outstanding performance on the MIT-CMU dataset and comparable performance on the BioID dataset in terms of false positives and detection rate.
7.
Algorithms for feature selection in predictive data mining for classification problems attempt to select those features that are relevant, and are not redundant for the classification task. A relevant feature is defined as one which is highly correlated with the target function. One problem with the definition of feature relevance is that there is no universally accepted definition of what it means for a feature to be ‘highly correlated with the target function or highly correlated with the other features’. A new feature selection algorithm which incorporates domain specific definitions of high, medium and low correlations is proposed in this paper. The proposed algorithm conducts a heuristic search for the most relevant features for the prediction task.
8.
Mining discriminative spatial patterns in image data is an emerging subject of interest in medical imaging, meteorology, engineering, biology, and other fields. In this paper, we propose a novel approach for detecting spatial regions that are highly discriminative among different classes of three-dimensional (3D) image data. The main idea of our approach is to treat the initial 3D image as a hyper-rectangle and search for discriminative regions by adaptively partitioning the space into progressively smaller hyper-rectangles (sub-regions). We use statistical information about each hyper-rectangle to guide the selectivity of the partitioning. A hyper-rectangle is partitioned only if its attribute cannot adequately discriminate among the distinct labeled classes and it is sufficiently large for further splitting. To evaluate the discriminative power of the attributes corresponding to the detected regions, we performed classification experiments on artificial and real datasets. Our results show that the proposed method outperforms major competitors, achieving 30% and 15% better classification accuracy on synthetic and real data, respectively, while reducing by two orders of magnitude the number of statistical tests required by voxel-based approaches.
9.
This paper presents an informatics framework that applies the feature-based engineering concept to cost estimation, supported by data mining algorithms. The purpose of this research work is to provide a practical procedure for more accurate cost estimation by using the commonly available manufacturing process data associated with ERP systems. The proposed method combines linear regression and data-mining techniques, leverages the unique strengths of both, and creates a mechanism to discover cost features. The final estimation function takes the user’s confidence level in each member technique into consideration, so that the method can be phased in gradually in practice as the data mining capability is built up. A case study demonstrates the proposed framework and compares the results from empirical cost prediction and data mining. The case study results indicate that the combined method is flexible and promising for determining the costs of the example welding features. In a comparison of the empirical prediction with five different data mining algorithms, the ANN algorithm proves to be the most accurate for welding operations.
10.
To preserve client privacy in the data mining process, a variety of techniques based on random perturbation of individual
data records have been proposed recently. In this paper, we present FRAPP, a generalized matrix-theoretic framework of random
perturbation, which facilitates a systematic approach to the design of perturbation mechanisms for privacy-preserving mining.
Specifically, FRAPP is used to demonstrate that (a) the prior techniques differ only in their choices for the perturbation
matrix elements, and (b) a symmetric positive-definite perturbation matrix with minimal condition number can be identified,
substantially enhancing the accuracy even under strict privacy requirements. We also propose a novel perturbation mechanism
wherein the matrix elements are themselves characterized as random variables, and demonstrate that this feature provides significant
improvements in privacy at only a marginal reduction in accuracy. The quantitative utility of FRAPP, which is a general-purpose
random-perturbation-based privacy-preserving mining technique, is evaluated specifically with regard to association and classification
rule mining on a variety of real datasets. Our experimental results indicate that, for a given privacy requirement, either
substantially lower modeling errors are incurred as compared to the prior techniques, or the errors are comparable to those
of direct mining on the true database.
A partial and preliminary version of this paper appeared in the Proc. of the 21st IEEE Intl. Conf. on Data Engineering (ICDE),
Tokyo, Japan, 2005, pgs. 193–204.
11.
The three-mode partitioning model is a clustering model for three-way three-mode data sets that implies a simultaneous partitioning of all three modes involved in the data. In the associated data analysis, a data array is approximated by a model array that can be represented by a three-mode partitioning model of a prespecified rank, minimizing a least squares loss function in terms of differences between data and model. Algorithms have been proposed for this minimization, but their performance is not yet clear. A framework for alternating least-squares methods is described in order to offset the performance problem. Furthermore, a number of both existing and novel algorithms are discussed within this framework. An extensive simulation study is reported in which these algorithms are evaluated and compared according to sensitivity to local optima. The recovery of the truth underlying the data is investigated in order to assess the optimal estimates. The ordering of the algorithms with respect to performance in finding the optimal solution appears to change as compared to the results obtained from the simulation study when a collection of four empirical data sets have been used. This finding is attributed to violations of the implicit stochastic model underlying both the least-squares loss function and the simulation study. Support for the latter attribution is found in a second simulation study.
12.
Privacy is becoming an increasingly important issue in many data-mining applications. This has triggered the development of many privacy-preserving data-mining techniques. A large fraction of them use randomized data-distortion techniques to mask the data for preserving the privacy of sensitive data. This methodology attempts to hide the sensitive data by randomly modifying the data values often using additive noise. This paper questions the utility of the random-value distortion technique in privacy preservation. The paper first notes that random matrices have predictable structures in the spectral domain and then it develops a random matrix-based spectral-filtering technique to retrieve original data from the dataset distorted by adding random values. The proposed method works by comparing the spectrum generated from the observed data with that of random matrices. This paper presents the theoretical foundation and extensive experimental results to demonstrate that, in many cases, random-data distortion preserves very little data privacy. The analytical framework presented in this paper also points out several possible avenues for the development of new privacy-preserving data-mining techniques. Examples include algorithms that explicitly guard against privacy breaches through linear transformations, exploiting multiplicative and colored noise for preserving privacy in data mining applications.
13.
The number, variety and complexity of projects involving data mining or knowledge discovery in databases have recently increased at such a pace that aspects related to their development process need to be standardized for results to be integrated, reused and interchanged in the future. Data mining projects are quickly becoming engineering projects, and current standard processes, like CRISP-DM, need to be revisited to incorporate this engineering viewpoint. This is the central motivation of this paper, which argues that experience gained about the software development process over almost 40 years could be reused and integrated to improve data mining processes. Consequently, this paper proposes to reuse ideas and concepts underlying the IEEE Std 1074 and ISO 12207 software engineering model processes to redefine and add to the CRISP-DM process and make it a data mining engineering standard.
14.
A crucial issue related to data mining on time series is that of training period duration. The training horizon used impacts the nature of the rules obtained and their predictability over time. Longer training horizons are generally sought, in order to discern sustained patterns with robust training data performance that extends well into the predictive period. However, in dynamic environments patterns that persist over time may be unavailable, and shorter-term patterns may hold higher predictive ability, albeit with shorter predictive periods. Such potentially useful shorter-term patterns may be lost when the training duration covers much longer periods. Too short a training duration can, of course, be susceptible to over-fitting to noise. We conduct experiments using different training horizons with daily data for the S&P500 index and report the sensitivity of the performance of the obtained rules with respect to the training durations. We show that while the performance of the rules in the training period is important for inducing the “best” rules, it is not indicative of their performance in the test period, and we propose alternative measures that can be used to help identify the appropriate training durations.
15.
Time series data mining (TSDM) techniques permit exploring large amounts of time series data in search of consistent patterns and/or interesting relationships between variables. TSDM is becoming increasingly important as a knowledge management tool where it is expected to reveal knowledge structures that can guide decision making in conditions of limited certainty. Human decision making in problems related to the analysis of time series databases is usually based on perceptions like “end of the day”, “high temperature”, “quickly increasing”, “possible”, etc. Though many effective TSDM algorithms have been developed, the integration of TSDM algorithms with human decision making procedures is still an open problem. In this paper, we consider the architecture of a perception-based decision making system for time series database domains that integrates perception-based TSDM, computing with words and perceptions, and expert knowledge. The new tasks which should be solved by the perception-based TSDM methods to enable their integration in such systems are discussed. These tasks include: precisiation of perceptions, shape pattern identification, and pattern retranslation. We show how different methods developed so far in TSDM for manipulation of perception-based information can be used for the development of a fuzzy perception-based TSDM approach. This approach is grounded in computing with words and perceptions, which makes it possible to formalize human perception-based inference mechanisms. The discussion is illustrated by examples from economics, finance, meteorology, medicine, etc.
16.
Classification of intrusion attacks and normal network traffic is a challenging and critical problem in pattern recognition and network security. In this paper, we present a novel intrusion detection approach to extract both accurate and interpretable fuzzy IF-THEN rules from network traffic data for classification. The proposed fuzzy rule-based system is evolved from an agent-based evolutionary framework and multi-objective optimization. In addition, the proposed system can also act as a genetic feature selection wrapper to search for an optimal feature subset for dimensionality reduction. To evaluate the classification and feature selection performance of our approach, it is compared with some well-known classifiers as well as feature selection filters and wrappers. The extensive experimental results on the KDD-Cup99 intrusion detection benchmark data set demonstrate that the proposed approach produces interpretable fuzzy systems, and outperforms other classifiers and wrappers by providing the highest detection accuracy for intrusion attacks and a low false alarm rate for normal network traffic with a minimized number of features.
17.
Data mining can dig out valuable information from databases to assist a business in approaching knowledge discovery and improving
business intelligence. Databases store large amounts of structured data. The amount of data increases due to advanced database technology
and extensive use of information systems. Despite the price drop of storage devices, it is still important to develop efficient
techniques for database compression. This paper develops a database compression method by eliminating redundant data, which
often exist in transaction databases. The proposed approach uses a data mining structure to extract association rules from
a database. Redundant data will then be replaced by means of compression rules. A heuristic method is designed to resolve
the conflicts of the compression rules. To prove its efficiency and effectiveness, the proposed approach is compared with
two other database compression methods.
Chin-Feng Lee is an associate professor with the Department of Information Management at Chaoyang University of Technology, Taiwan, R.O.C.
She received her M.S. and Ph.D. degrees in 1994 and 1998, respectively, from the Department of Computer Science and Information
Engineering at National Chung Cheng University. Her current research interests include database design, image processing and
data mining techniques.
S. Wesley Changchien is a professor with the Institute of Electronic Commerce at National Chung-Hsing University, Taiwan, R.O.C. He received a
BS degree in Mechanical Engineering (1989) and completed his MS (1993) and Ph.D. (1996) degrees in Industrial Engineering
at State University of New York at Buffalo, USA. His current research interests include electronic commerce, internet/database
marketing, knowledge management, data mining, and decision support systems.
Jau-Ji Shen received his Ph.D. degree in Information Engineering and Computer Science from National Taiwan University at Taipei, Taiwan
in 1988. From 1988 to 1994, he was the leader of the software group in Institute of Aeronautic, Chung-Sung Institute of Science
and Technology. He is currently an associate professor of information management department in the National Chung Hsing University
at Taichung. His research areas focus on the digital multimedia, database and information security. His current research areas
focus on data engineering, database techniques and information security.
Wei-Tse Wang received the B.A. (2001) and M.B.A (2003) degrees in Information Management at Chaoyang University of Technology, Taiwan,
R.O.C. His research interests include data mining, XML, and database compression.
18.
Data mining is an important real-life application for businesses. It is critical to find efficient ways of mining large data sets. In order to benefit from the experience with relational databases, a set-oriented approach to mining data is needed. In such an approach, the data mining operations are expressed in terms of relational or set-oriented operations. Query optimization technology can then be used for efficient processing. In this paper, we describe set-oriented algorithms for mining association rules. Such algorithms imply performing multiple joins and thus may appear to be inherently less efficient than special-purpose algorithms. We develop new algorithms that can be expressed as SQL queries, and discuss optimization of these algorithms. After analytical evaluation, an algorithm named SETM emerges as the algorithm of choice. Algorithm SETM uses only simple database primitives, viz., sorting and merge-scan join. Algorithm SETM is simple, fast, and stable over the range of parameter values. It is easily parallelized and we suggest several additional optimizations. The set-oriented nature of Algorithm SETM makes it possible to develop extensions easily and its performance makes it feasible to build interactive data mining tools for large databases.
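The set-oriented idea in this abstract (counting itemsets by repeatedly joining a relation of (transaction id, pattern) rows with the base relation and filtering by support) can be sketched roughly as below. This is only an approximation in the spirit of the description, written with in-memory Python structures instead of SQL or a true merge-scan join; the function name `frequent_itemsets` and the row layout are assumptions, not the paper's definitions.

```python
from collections import Counter


def frequent_itemsets(sales, min_support):
    """sales: iterable of (tid, item) rows.
    Returns a dict mapping each frequent itemset (sorted tuple) to its support count."""
    # Base relation: one row per (tid, single-item pattern), deduplicated and sorted.
    rows = sorted({(tid, (item,)) for tid, item in sales})
    items_by_tid = {}
    for tid, (item,) in rows:
        items_by_tid.setdefault(tid, []).append(item)

    result = {}
    while rows:
        # Group by pattern and count supporting transactions.
        counts = Counter(pattern for _, pattern in rows)
        frequent = {p: c for p, c in counts.items() if c >= min_support}
        result.update(frequent)
        # Keep only rows whose pattern met the minimum support.
        rows = [(tid, p) for tid, p in rows if p in frequent]
        # Join with the base relation on tid, extending each pattern
        # with a lexicographically larger item from the same transaction.
        rows = sorted(
            (tid, p + (item,))
            for tid, p in rows
            for item in items_by_tid[tid]
            if item > p[-1]
        )
    return result
```

For example, on the rows `[(1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b"), (3, "a"), (3, "c"), (4, "b")]` with `min_support=2`, the sketch reports the three single items plus the pairs `("a", "b")` and `("a", "c")`.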
19.
As the total amount of traffic data in networks has been growing at an alarming rate, there is currently a substantial body of research that attempts to mine traffic data with the purpose of obtaining useful information. For instance, there are some investigations into the detection of Internet worms and intrusions by discovering abnormal traffic patterns. However, since network traffic data contain information about the Internet usage patterns of users, network users’ privacy may be compromised during the mining process. In this paper, we propose an efficient and practical method that preserves privacy during sequential pattern mining on network traffic data. In order to discover frequent sequential patterns without violating privacy, our method uses the N-repository server model, which operates as a single mining server, and the retention replacement technique, which changes the answer to a query probabilistically. In addition, our method accelerates the overall mining process by maintaining the meta tables in each site so as to determine quickly whether candidate patterns have ever occurred in the site or not. Extensive experiments with real-world network traffic data revealed the correctness and the efficiency of the proposed method.
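To make "changes the answer to a query probabilistically" concrete, here is a minimal sketch of one common form of retention replacement for yes/no answers: the true answer is kept with probability `p` and otherwise replaced by a fair coin flip. Whether the paper uses exactly this variant is an assumption; the function names and the uniform replacement distribution are chosen only for illustration.

```python
import random


def retention_replacement(answer, p, rng):
    """Keep the true boolean answer with probability p;
    otherwise replace it with a uniformly random boolean."""
    if rng.random() < p:
        return answer
    return rng.random() < 0.5


def estimate_true_fraction(observed_fraction, p):
    """Invert E[observed] = p * true + (1 - p) / 2 to estimate
    the fraction of true 'yes' answers from the perturbed ones."""
    return (observed_fraction - (1 - p) / 2) / p
```

The miner never sees individual true answers, yet with enough responses the aggregate frequency of a pattern can still be estimated from the perturbed answers; a smaller `p` gives more privacy but a noisier estimate.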
20.
This paper presents a system where the personal route of a user is predicted using a probabilistic model built from the historical trajectory data. Route patterns are extracted from personal trajectory data using a novel mining algorithm, Continuous Route Pattern Mining (CRPM), which can tolerate different kinds of disturbance in trajectory data. Furthermore, a client-server architecture is employed which has the dual purpose of guaranteeing the privacy of personal data and greatly reducing the computational load on mobile devices. An evaluation using a corpus of trajectory data from 17 people demonstrates that CRPM can extract longer route patterns than current methods. Moreover, the average correct rate of one-step prediction of our system is greater than 71%, and the average Levenshtein distance of continuous route prediction of our system is about 30% shorter than that of the Markov-model-based method.
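The Levenshtein distance used as an evaluation metric in this abstract can be computed between a predicted route and the actual route when both are treated as sequences of segment identifiers. The sketch below is the standard dynamic-programming edit distance; representing routes as lists of segment labels is an assumption made for this example, not a detail taken from the paper.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

A predicted route `["s1", "s2", "s4"]` compared with an actual route `["s1", "s2", "s3", "s4"]` has distance 1, since a single missing segment separates them.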