Similar Documents
Found 20 similar documents (search time: 15 ms)
1.
A flexible approach for visual data mining
The exploration of heterogeneous information spaces requires suitable mining methods as well as effective visual interfaces. Most existing systems concentrate either on mining algorithms or on visualization techniques. This paper describes a flexible framework for visual data mining which combines analytical and visual methods to achieve a better understanding of the information space. We provide several pre-processing methods for unstructured information spaces, such as flexible hierarchy generation with user-controlled refinement. Moreover, we develop new visualization techniques, including an intuitive focus+context technique for visualizing complex hierarchical graphs. A special feature of our system is a new paradigm for visualizing information structures within their frame of reference.

2.
Data mining has attracted substantial research effort during the past decade. However, little work has been reported on efficiently supporting a large number of users who issue different data mining queries periodically, as needs change and data is updated. Our work is motivated by the fact that the pattern-growth method is one of the most efficient methods for frequent pattern mining; it constructs an initial tree and mines frequent patterns on top of that tree. In this paper, we present a data mining proxy approach that reduces the I/O cost of constructing an initial tree by utilizing trees already resident in memory. The tree we construct is the smallest for a given data mining query. In addition, our proxy approach also reduces the CPU cost of mining patterns, because the cost of mining depends on the sizes of the trees. The focus of this work is constructing an initial tree efficiently. We propose three tree operations to construct a tree. With a unique coding scheme, we can efficiently project subtrees from on-disk or in-memory trees. Our performance study indicates that the data mining proxy significantly reduces both the I/O cost of constructing trees and the CPU cost of mining patterns over the constructed trees.
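To make the idea concrete, here is a minimal sketch of the kind of frequency-ordered prefix tree that pattern-growth methods build and that a proxy could keep resident in memory. All names and the coding are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

class Node:
    """One item node in a prefix tree (the 'initial tree' of pattern-growth methods)."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_tree(transactions, min_support):
    """Build a frequency-ordered prefix tree; illustrative, not the paper's coding scheme."""
    # First pass: count item frequencies across all transactions.
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    # Keep frequent items only, most frequent first (this compresses the tree).
    keep = {i for i, c in freq.items() if c >= min_support}
    root = Node(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in keep), key=lambda i: -freq[i]):
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

tree = build_tree([["a", "b"], ["b", "c"], ["a", "b", "c"]], min_support=2)
```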

3.
In this paper, we propose a hybrid multi-group approach for privacy-preserving data mining. We make two contributions. First, we propose a hybrid approach. Previous work has used either the randomization approach or the secure multi-party computation (SMC) approach. However, these two approaches have complementary features: the randomization approach is much more efficient but less accurate, while the SMC approach is less efficient but more accurate. We propose a novel hybrid approach that takes advantage of the strengths of both to balance accuracy and efficiency constraints. Compared to the two existing approaches, our proposed approach achieves much better accuracy than the randomization approach and a much lower computation cost than the SMC approach. Second, we propose a multi-group scheme that makes it flexible for the data miner to control the balance between data mining accuracy and privacy. This scheme is motivated by the fact that existing randomization schemes, which randomize data at the individual-attribute level, can produce insufficient accuracy when the number of dimensions is high. We partition attributes into groups and develop a scheme to conduct group-based randomization to achieve better data mining accuracy. To demonstrate the effectiveness of the proposed general schemes, we have implemented them for the ID3 decision tree algorithm and the association rule mining problem, and we present experimental results.
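The following is a toy sketch of one way group-based randomization could look: attributes are partitioned into groups, and each record gets one shared noise draw per group rather than independent per-attribute noise. The grouping, noise model, and parameter names are our assumptions, not the paper's scheme.

```python
import numpy as np

def group_randomize(data, groups, sigma=0.5, rng=None):
    """Perturb attributes group by group: attributes in the same group receive
    a shared noise draw per record, so intra-group structure is distorted less
    than with independent per-attribute randomization (illustrative only)."""
    rng = rng or np.random.default_rng(0)
    out = data.astype(float)
    for cols in groups:
        noise = rng.normal(0.0, sigma, size=(data.shape[0], 1))
        out[:, cols] += noise  # one noise value per record and group
    return out

X = np.arange(12.0).reshape(4, 3)
X_private = group_randomize(X, groups=[[0, 1], [2]])
```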

4.
Although the integration of engineering data within the framework of product data management systems has been successful in recent years, the holistic analysis (from a systems engineering perspective) of multi-disciplinary data, or of data based on different representations and tools, is still not realized in practice. At the same time, the application of advanced data mining techniques to complete designs is very promising and bears a high potential for synergy between different teams in the development process. In this paper, we propose shape mining as a framework to combine and analyze engineering design data across different tools and disciplines. In the first part of the paper, we introduce unstructured surface meshes as meta-design representations that enable us to apply sensitivity analysis, design concept retrieval and learning, as well as methods for interaction analysis, to heterogeneous engineering design data. We propose a new measure of relevance to evaluate the utility of a design concept. In the second part of the paper, we apply these formal methods to passenger car design. We combine data from different representations, design tools and methods for a holistic analysis of the resulting shapes. We visualize sensitivities and sensitive cluster centers (after feature reduction) on the car shape. Furthermore, we are able to identify conceptual design rules using tree induction and to create interaction graphs that illustrate the interrelation between spatially decoupled surface areas. Shape data mining is studied here for a multi-criteria aerodynamic problem, i.e. drag force and rear lift; however, the extension to quality criteria from different disciplines is straightforward as long as the meta-design representation remains applicable.
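As a rough illustration of the tree-induction step for extracting conceptual design rules, here is a minimal sketch using scikit-learn. The shape descriptors, labels, and data are entirely hypothetical stand-ins for mesh-derived features.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Rows: candidate designs; columns: hypothetical shape descriptors.
X = np.array([[0.30, 1.2], [0.35, 1.1], [0.20, 1.8], [0.22, 1.7]])
y = np.array([0, 0, 1, 1])  # 1 = low drag, 0 = high drag (made-up labels)

# Induce a shallow tree and print it as human-readable design rules.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["rear_slant", "underbody_width"]))
```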

5.
A decision-theoretic approach to data mining
In this paper, we develop a decision-theoretic framework for evaluating data mining systems that employ classification methods in terms of their utility in decision-making. The decision-theoretic model provides an economic perspective on the value of "extracted knowledge" in terms of its payoff to the organization, and suggests a wide range of decision problems that arise from this point of view. The relation between the quality of a data mining system and the amount of investment the decision maker is willing to make is formalized. We propose two ways by which independent data mining systems can be combined, and show that the combined system can be used in the organization's decision-making process to increase payoff. Examples are provided to illustrate the various concepts, and several ways of extending the proposed framework are discussed.
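A minimal sketch of the underlying idea: score a classifier not by accuracy but by expected payoff under a payoff matrix over (true class, decision) pairs. The payoff numbers and confusion counts below are invented for illustration.

```python
import numpy as np

# Payoff to the organization for each (true class, predicted class) pair.
payoff = np.array([[5.0, -2.0],    # true 0: correct accept vs. false alarm
                   [-10.0, 1.0]])  # true 1: costly miss vs. correct detection

def expected_payoff(confusion, payoff):
    """Expected payoff per case = sum over cells of joint probability * payoff."""
    joint = confusion / confusion.sum()
    return float((joint * payoff).sum())

conf_a = np.array([[80, 10], [5, 5]])  # confusion counts of system A
conf_b = np.array([[70, 20], [2, 8]])  # confusion counts of system B
print(expected_payoff(conf_a, payoff), expected_payoff(conf_b, payoff))
```

Under this view, the "better" system is the one with the higher expected payoff, which need not be the one with the higher accuracy.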

6.
Data mining can dig out valuable information from databases to assist a business in knowledge discovery and in improving business intelligence. Databases store large volumes of structured data, and the amount of data keeps increasing with advances in database technology and the extensive use of information systems. Despite the falling price of storage devices, it is still important to develop efficient techniques for database compression. This paper develops a database compression method that eliminates redundant data, which often exists in transaction databases. The proposed approach uses a data mining structure to extract association rules from a database; redundant data is then replaced by means of compression rules. A heuristic method is designed to resolve conflicts among the compression rules. To demonstrate its efficiency and effectiveness, the proposed approach is compared with two other database compression methods.

Chin-Feng Lee is an associate professor with the Department of Information Management at Chaoyang University of Technology, Taiwan, R.O.C. She received her M.S. and Ph.D. degrees in 1994 and 1998, respectively, from the Department of Computer Science and Information Engineering at National Chung Cheng University. Her current research interests include database design, image processing and data mining techniques.

S. Wesley Changchien is a professor with the Institute of Electronic Commerce at National Chung-Hsing University, Taiwan, R.O.C. He received a BS degree in Mechanical Engineering (1989) and completed his MS (1993) and Ph.D. (1996) degrees in Industrial Engineering at the State University of New York at Buffalo, USA. His current research interests include electronic commerce, internet/database marketing, knowledge management, data mining, and decision support systems.

Jau-Ji Shen received his Ph.D. degree in Information Engineering and Computer Science from National Taiwan University, Taipei, Taiwan, in 1988. From 1988 to 1994, he led the software group at the Institute of Aeronautics, Chung-Shan Institute of Science and Technology. He is currently an associate professor in the Department of Information Management at National Chung Hsing University, Taichung. His research areas focus on data engineering, database techniques, digital multimedia and information security.

Wei-Tse Wang received his B.A. (2001) and M.B.A. (2003) degrees in Information Management at Chaoyang University of Technology, Taiwan, R.O.C. His research interests include data mining, XML, and database compression.
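The following toy sketch shows the flavor of rule-based transaction compression: items implied by a high-confidence association rule are replaced by a compact rule token. The token format and the replacement policy are our assumptions, not the paper's method.

```python
def compress(transactions, rules):
    """Replace the consequent items of a matching rule with a rule token.
    Decompression would expand the token back using the rule table."""
    compressed = []
    for t in transactions:
        t = set(t)
        for rid, (lhs, rhs) in enumerate(rules):
            if lhs <= t and rhs <= t:
                t = (t - rhs) | {f"R{rid}"}  # token stands in for the implied items
        compressed.append(sorted(t))
    return compressed

# One illustrative rule: bread -> butter.
rules = [(frozenset({"bread"}), frozenset({"butter"}))]
print(compress([["bread", "butter", "jam"], ["bread", "jam"]], rules))
```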

7.
In this paper, we propose a novel face detection method based on the MAFIA algorithm. Our proposed method consists of two phases: training and detection. In the training phase, we first apply Sobel's edge detection operator, a morphological operator, and thresholding to each training image to transform it into an edge image. Next, we use the MAFIA algorithm to mine the maximal frequent patterns from those edge images and obtain the positive feature pattern. Similarly, we obtain the negative feature pattern from the complements of the edge images. Based on the mined feature patterns, we construct a face detector to prune non-face candidates. In the detection phase, we apply a sliding window to the test image at different scales; each window that passes the face detector is considered a human face. The proposed method automatically finds the feature patterns that capture most facial features. By using these feature patterns to construct a face detector, the method is robust to race, illumination, and facial expression. The experimental results show that the proposed method achieves outstanding performance on the MIT-CMU dataset and comparable performance on the BioID dataset in terms of false positives and detection rate.
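A minimal sketch of the training-phase preprocessing (Sobel gradients, a morphological operator, thresholding) using SciPy. The threshold value and the choice of dilation as the morphological operator are guesses, not the paper's settings.

```python
import numpy as np
from scipy import ndimage

def to_edge_image(gray, thresh=0.3):
    """Sobel gradient magnitude -> threshold -> dilation, yielding a binary edge image."""
    gx = ndimage.sobel(gray.astype(float), axis=1)
    gy = ndimage.sobel(gray.astype(float), axis=0)
    mag = np.hypot(gx, gy)
    mag /= mag.max() + 1e-9                        # normalize to [0, 1]
    edges = ndimage.binary_dilation(mag > thresh)  # thicken edges slightly
    return edges.astype(np.uint8)

img = np.zeros((32, 32)); img[8:24, 8:24] = 1.0   # toy image with one square
edge_img = to_edge_image(img)
```

Mining maximal frequent patterns over such binary edge images would then follow, which is the MAFIA step the paper builds on.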

8.
Visual tracking encompasses a wide range of applications in surveillance, medicine and the military. There are, however, roadblocks that hinder exploiting the full capacity of tracking technology; depending on the application, these include the computational complexity, accuracy and robustness of the tracking algorithms. In this paper, we present a grid-based tracking algorithm that drastically outperforms existing algorithms in computational efficiency, accuracy and robustness. Furthermore, by judiciously incorporating feature representation, sample generation and sample weighting, the grid-based approach accommodates contrast change, jitter, target deformation and occlusion. Tracking performance of the proposed grid-based algorithm is compared with two recent algorithms, the gradient vector flow snake tracker and the Monte Carlo tracker, in the context of leukocyte (white blood cell) tracking and UAV-based tracking. This comparison indicates that the proposed algorithm is approximately 100 times faster and, at the same time, significantly more accurate and more robust, thus enabling real-time robust tracking.
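A toy sketch of the grid-sampling idea: candidate positions come from a deterministic grid around the last known location and are weighted by template similarity. This is our illustration of grid-based sample generation and weighting, not the authors' algorithm.

```python
import numpy as np

def grid_samples(center, radius, step):
    """Deterministic grid of candidate positions around the previous location."""
    xs = range(center[0] - radius, center[0] + radius + 1, step)
    ys = range(center[1] - radius, center[1] + radius + 1, step)
    return [(x, y) for x in xs for y in ys]

def track_step(frame, template, center, radius=8, step=2):
    """Weight each grid sample by SSD similarity to the template; return the best."""
    h, w = template.shape
    best, best_score = center, -np.inf
    for (x, y) in grid_samples(center, radius, step):
        if x < 0 or y < 0:
            continue                      # skip samples outside the frame
        patch = frame[y:y + h, x:x + w]
        if patch.shape != template.shape:
            continue
        score = -np.sum((patch - template) ** 2)
        if score > best_score:
            best, best_score = (x, y), score
    return best

frame = np.zeros((64, 64)); frame[30:38, 30:38] = 1.0  # bright 8x8 target
template = np.ones((8, 8))
print(track_step(frame, template, center=(28, 28)))    # -> (30, 30)
```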

9.
10.
Data mining is most commonly used in attempts to induce association rules from transaction data. In past work, we used fuzzy and GA concepts to discover both useful fuzzy association rules and suitable membership functions from quantitative values. The fitness evaluation, however, was quite time-consuming. Due to dramatic increases in available computing power and concomitant decreases in computing costs over the last decade, learning or mining by applying parallel processing techniques has become a feasible way to overcome this slow-learning problem. In this paper, we therefore propose a parallel genetic-fuzzy mining algorithm based on the master–slave architecture to extract both association rules and membership functions from quantitative transactions. The master processor uses a single population, as a simple genetic algorithm does, and distributes the tasks of fitness evaluation to slave processors. The evolutionary operations, such as crossover, mutation and reproduction, are performed by the master processor. It is very natural and efficient to run the proposed algorithm on the master–slave architecture. The time complexities of both the sequential and parallel genetic-fuzzy mining algorithms are analyzed, with results showing the benefit of the proposed approach: when the number of generations is large, the speed-up can be nearly linear, as the experimental results confirm. Applying the master–slave parallel architecture to speed up the genetic-fuzzy data mining algorithm is thus a feasible way to overcome the low-speed fitness evaluation of the original algorithm.
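A minimal sketch of the master–slave division of labor using Python's multiprocessing: the master holds the single population and the evolutionary operators, while worker processes only evaluate fitness. The fitness function and selection step are placeholders, not the paper's genetic-fuzzy encoding.

```python
import multiprocessing as mp

def fitness(chromosome):
    """Stand-in fitness; in the paper this would score mined fuzzy rules."""
    return sum(chromosome)

def master_loop(population, generations):
    """Master keeps one population; slaves only compute fitness values."""
    with mp.Pool() as slaves:
        for _ in range(generations):
            scores = slaves.map(fitness, population)         # distributed evaluation
            ranked = [c for _, c in sorted(zip(scores, population), reverse=True)]
            population = ranked[: len(population) // 2] * 2  # toy selection/reproduction
    return population

if __name__ == "__main__":
    pop = [[i, i + 1, i + 2] for i in range(8)]
    print(master_loop(pop, generations=3))
```

Because each generation's fitness evaluations are independent, the speed-up approaches the number of slaves when evaluation dominates the runtime, which matches the near-linear speed-up reported above.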

11.
Since sport marketing is a commercial activity, precise customer and market segmentation must be investigated regularly; once a specific customer profile, segment, or pattern associated with marketing activities has been found, it helps in understanding the sport market. Such knowledge would not only help sport firms, but would also contribute to the broader field of sport customer behavior and marketing. This paper proposes using the Apriori algorithm for association rules, together with clustering analysis based on an ontology-based data mining approach, to mine customer knowledge from the database. Knowledge extracted from the data mining results is presented as knowledge patterns, rules, and maps in order to propose suggestions and solutions to the case firm, Taiwan Adidas, for possible product promotion and sport marketing.
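For reference, here is a tiny level-wise Apriori sketch of the kind used to mine association rules from transaction data; the transactions and threshold are invented, and this is not the case study's code.

```python
from itertools import combinations  # noqa: F401 (handy for rule generation)

def apriori(transactions, min_support):
    """Tiny Apriori: candidate (k+1)-itemsets are unions of frequent
    k-itemsets that differ in exactly one item; infrequent sets are pruned."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = {i for t in transactions for i in t}
    frequent, k_sets = {}, [frozenset([i]) for i in sorted(items)]
    while k_sets:
        counts = {s: sum(1 for t in transactions if s <= t) for s in k_sets}
        level = {s: c / n for s, c in counts.items() if c / n >= min_support}
        frequent.update(level)
        prev = list(level)
        k_sets = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
    return frequent

freq = apriori([["shoes", "socks"], ["shoes", "ball"],
                ["shoes", "socks", "ball"]], min_support=0.6)
print(freq)  # frequent itemsets with their supports
```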

12.
The number, variety and complexity of projects involving data mining or knowledge discovery in databases have recently increased at such a pace that aspects of their development process need to be standardized for results to be integrated, reused and interchanged in the future. Data mining projects are quickly becoming engineering projects, and current standard processes, like CRISP-DM, need to be revisited to incorporate this engineering viewpoint. This is the central motivation of this paper, which argues that experience gained with software development processes over almost 40 years can be reused and integrated to improve data mining processes. Consequently, this paper proposes to reuse ideas and concepts underlying the IEEE Std 1074 and ISO 12207 software engineering process models to redefine and extend the CRISP-DM process, making it a data mining engineering standard.

13.
We present a new approach to the problem of mining large sequences from big data. The particular problem of interest is the effective mining of long sequences from large-scale location data, practical for Reality Mining applications, which suffer from large amounts of noise and a lack of ground truth. To address this complex data, we propose an unsupervised probabilistic topic model called the distant n-gram topic model (DNTM). The DNTM is based on latent Dirichlet allocation (LDA), extended to integrate sequential information. We define the generative process for the model, derive the inference procedure, and evaluate the model on both synthetic data and real mobile phone data. We consider two mobile phone datasets containing natural human mobility patterns obtained by location sensing, the first based on GPS/wi-fi locations and the second on cell tower connections. The DNTM discovers meaningful topics on the synthetic data as well as on the two mobile phone datasets. Finally, the DNTM is compared to LDA in terms of log-likelihood on unseen data, showing the predictive power of the model. The results show that the DNTM consistently outperforms LDA as the sequence length increases.
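As a point of reference for the baseline side of that comparison, here is a minimal sketch of plain LDA on toy "location documents" with a held-out log-likelihood check, using scikit-learn. The bag-of-locations view ignores sequence order, which is precisely the limitation the DNTM addresses; the documents are invented.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy "documents": each one is a user-day of visited place IDs.
docs = ["home office office cafe", "home gym home office",
        "cafe office cafe home", "gym home gym office"]
X = CountVectorizer().fit_transform(docs)

# Plain LDA baseline: train on the first three days, score the held-out day.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X[:3])
print("held-out approx. log-likelihood:", lda.score(X[3:]))
```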

14.
In privacy-preserving data mining (PPDM), a widely used method for achieving data mining goals while preserving privacy is based on k-anonymity. This method, which protects subject-specific sensitive data by anonymizing it before release for data mining, demands that every tuple in the released table be indistinguishable from no fewer than k subjects. The most common approach to achieving k-anonymity is to replace certain values with less specific but semantically consistent ones. In this paper, we propose a different approach: partitioning the original dataset into several projections such that each of them adheres to k-anonymity. Moreover, any attempt to rejoin the projections results in a table that still complies with k-anonymity. A classifier is trained on each projection, and an unlabelled instance is subsequently classified by combining the classifications of all the classifiers. Guided by classification accuracy and k-anonymity constraints, the proposed data mining privacy by decomposition (DMPD) algorithm uses a genetic algorithm to search for an optimal feature-set partitioning. Ten separate datasets were evaluated with DMPD in order to compare its classification performance with other k-anonymity-based methods. The results suggest that DMPD performs better than existing k-anonymity-based algorithms, with no need to apply domain-dependent knowledge. Using multiobjective optimization methods, we also examine the tradeoff between the two conflicting objectives in PPDM: privacy and predictive performance.
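A minimal sketch of the k-anonymity constraint itself, which each projection must satisfy: every combination of quasi-identifier values must occur in at least k rows. The table and column names are illustrative.

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k rows;
    each projection in a decomposition can be checked the same way."""
    return int(df.groupby(quasi_identifiers).size().min()) >= k

df = pd.DataFrame({"zip": ["021*", "021*", "048*", "048*"],
                   "age": ["<30", "<30", "30+", "30+"],
                   "disease": ["flu", "cold", "flu", "flu"]})
print(is_k_anonymous(df, ["zip", "age"], k=2))  # True
```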

15.
Recent trends in collecting huge and diverse datasets have created a great challenge for data analysis. One characteristic of these gigantic datasets is that they often contain significant amounts of redundancy. The use of very large multi-dimensional data results in more noise, redundant data, and the possibility of unconnected data entities. To efficiently manipulate data represented in a high-dimensional space, and to address the impact of redundant dimensions on the final results, we propose a new technique for dimensionality reduction using copulas and the LU-decomposition (forward substitution) method. The proposed method compares favorably with existing approaches on real-world datasets taken from a machine learning repository (Diabetes, Waveform, two versions of Human Activity Recognition using Smartphones, and Thyroid) in terms of dimensionality reduction and efficiency, evaluated on statistical and classification measures.
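For the linear-algebra building block named above, here is a minimal sketch of LU decomposition followed by forward substitution with the lower-triangular factor, using SciPy. This shows only the LU/forward-substitution step, not the copula part of the method.

```python
import numpy as np
from scipy.linalg import lu

def forward_substitution(L, b):
    """Solve L y = b for lower-triangular L by sweeping rows top to bottom."""
    y = np.zeros_like(b, dtype=float)
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

A = np.array([[4.0, 3.0], [6.0, 3.0]])
b = np.array([1.0, 2.0])
P, L, U = lu(A)                       # A = P @ L @ U
y = forward_substitution(L, P.T @ b)  # y = U @ x; back substitution would finish the solve
```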

16.
Data distribution management (DDM) plays a key role in traffic control for large-scale distributed simulations. In recent years, several solutions have been devised to make DDM more efficient and adaptive to different traffic conditions; examples include the region-based, fixed grid-based, and dynamic grid-based (DGB) schemes, as well as grid-filtered region-based and agent-based DDM schemes. However, less effort has been directed toward improving the processing performance of DDM techniques. This paper presents a novel DDM scheme, the adaptive dynamic grid-based (ADGB) scheme, that optimizes DDM time through analysis of matching performance. ADGB uses an advertising scheme in which information about the target cell involved in matching subscribers to publishers is known in advance. An important concept, the distribution rate (DR), is devised; the DR represents the relative processing load and communication load generated at each federate. The DR and the matching performance are used as part of the ADGB method to select, throughout the simulation, the advertisement scheme that achieves the maximum gain with acceptable network traffic overhead. Assuming identical worst-case propagation delays, when the matching probability is high, performance estimates show that ADGB can achieve a maximum efficiency gain of 66% over the DGB scheme. The novelty of the ADGB scheme is its focus on improving performance, an important (and often forgotten) goal of DDM strategies.
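For background, here is a minimal sketch of the basic grid-based matching that the DGB and ADGB schemes refine: regions are mapped to grid cells, and a publisher matches a subscriber when their cell sets intersect. Cell size and regions are illustrative.

```python
def cells_for_region(region, cell_size):
    """All grid cells a rectangular region overlaps; a publisher and a
    subscriber match when their cell sets intersect (basic grid-based DDM)."""
    (x1, y1), (x2, y2) = region
    return {(cx, cy)
            for cx in range(int(x1 // cell_size), int(x2 // cell_size) + 1)
            for cy in range(int(y1 // cell_size), int(y2 // cell_size) + 1)}

pub = cells_for_region(((0, 0), (25, 10)), cell_size=10)   # update region
sub = cells_for_region(((20, 5), (40, 30)), cell_size=10)  # subscription region
print(bool(pub & sub))  # True: the regions share at least one cell
```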

17.
Recently, the class imbalance problem has attracted much attention from researchers in the field of data mining. When learning from imbalanced data, in which most examples are labeled as one class and only a few belong to the other, traditional data mining approaches cannot reliably predict the crucial minority instances. Unfortunately, many real-world data sets, such as health examination, inspection, credit fraud detection, spam identification and text mining data, face this situation. In this study, we present a novel model called the Information Granulation Based Data Mining Approach to tackle this problem. The proposed methodology, which imitates the human ability to process information, acquires knowledge from information granules rather than from numerical data. The method also introduces a Latent Semantic Indexing based feature extraction tool using Singular Value Decomposition to dramatically reduce the data dimensions. In addition, several data sets from the UCI Machine Learning Repository are employed to demonstrate the effectiveness of our method. Experimental results show that the method significantly increases the ability to classify imbalanced data.
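A minimal sketch of the SVD-based (LSI-style) feature extraction step using scikit-learn's truncated SVD; the documents and the number of components are invented for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# LSI-style reduction: project high-dimensional term vectors onto a few
# singular directions before learning on the (possibly imbalanced) data.
docs = ["spam offer win money", "meeting agenda minutes",
        "win free money now", "project meeting notes"]
X = TfidfVectorizer().fit_transform(docs)
X_reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(X_reduced.shape)  # (4, 2): drastically fewer dimensions
```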

18.
This paper is concerned with practices for tuning the parameters of metaheuristics. Settings such as the cooling factor in simulated annealing may greatly affect a metaheuristic's efficiency as well as its effectiveness in solving a given decision problem. However, procedures for organizing parameter calibration are scarce and commonly limited to particular metaheuristics. We argue that the parameter selection task can appropriately be addressed by means of a data mining based approach. In particular, a hybrid system is devised that employs regression models to learn suitable parameter values from past moves of a metaheuristic in an online fashion. In order to identify a suitable regression method and, more generally, to demonstrate the feasibility of the proposed approach, a case study of particle swarm optimization is conducted. Empirical results suggest that characteristics of the decision problem as well as search history data indeed embody information that allows suitable parameter values to be determined, and that this type of information can successfully be extracted by means of nonlinear regression models.
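A toy sketch of the core loop: fit a nonlinear regression model on (parameter setting, observed improvement) pairs from the search history, then score candidate settings and adopt the most promising one. The PSO parameter ranges and the synthetic response are our assumptions, not the case study's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Search history: (inertia weight, cognitive coef., social coef.) -> improvement
# per move; in the hybrid system this would come from past PSO moves.
params = rng.uniform([0.3, 0.5, 0.5], [1.0, 2.5, 2.5], size=(200, 3))
improvement = -(params[:, 0] - 0.7) ** 2 + rng.normal(0, 0.01, 200)  # toy response

# Nonlinear regression model over the history, refit as new moves arrive.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(params, improvement)

# Online step: evaluate candidate settings and pick the most promising one.
candidates = rng.uniform([0.3, 0.5, 0.5], [1.0, 2.5, 2.5], size=(50, 3))
best = candidates[np.argmax(model.predict(candidates))]
print(best)
```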

19.
Mobile computing systems usually express a user movement trajectory as a sequence of areas that capture the user's movement trace. Given a set of user movement trajectories, user movement patterns refer to the sequences of areas through which a user frequently travels. In an attempt to obtain user movement patterns for mobile applications, prior studies explore the problem of mining user movement patterns from the movement logs of mobile users. These movement logs generate a data record whenever a mobile user crosses base station coverage areas; however, this type of movement log does not otherwise exist in the system and thus generates extra overhead. By exploiting an existing log, namely call detail records, this article proposes a regression-based approach for mining user movement patterns (abbreviated as RUMP). This approach views call detail records as randomly sampled trajectory data, and thus user movement patterns are represented as movement functions. We propose algorithm LS (Large Sequence) to extract the call detail records that capture frequent user movement behaviors. By exploiting the spatio-temporal locality of continuous movements (i.e., a mobile user is likely to be in nearby areas if the time interval between consecutive calls is small), we develop algorithm TC (Time Clustering) to cluster call detail records. Then, using regression analysis, we develop algorithm MF (Movement Function) to derive movement functions. Experimental studies on both synthetic and real datasets show that RUMP is able to derive user movement functions close to the frequent movement behaviors of mobile users.
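A minimal sketch of the regression step: treat call detail records as sparse (time, location) samples and fit movement functions x(t), y(t). The records, coordinates, and polynomial degree are invented; this illustrates the idea behind algorithm MF, not its actual formulation.

```python
import numpy as np

# Call detail records as (hour of day, x, y): sparse samples of a trajectory.
cdr = np.array([[8.0, 0.0, 0.0], [9.0, 2.1, 1.0], [12.5, 5.0, 2.4],
                [17.0, 8.2, 4.1], [18.5, 9.0, 4.8]])

# Regression-derived movement functions x(t) and y(t).
fx = np.polynomial.Polynomial.fit(cdr[:, 0], cdr[:, 1], deg=2)
fy = np.polynomial.Polynomial.fit(cdr[:, 0], cdr[:, 2], deg=2)
print(fx(10.0), fy(10.0))  # estimated position at 10:00
```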

20.
Longitudinal data refer to the situation where repeated observations are available for each sampled object. Clustered data, where observations are nested in a hierarchical structure within objects (without time necessarily being involved), represent a similar situation. Methodologies that take this structure into account allow for systematic differences between objects that are not related to attributes, and for autocorrelation within objects across time periods. A standard methodology in the statistics literature for this type of data is the mixed effects model, where the differences between objects are represented by so-called "random effects" estimated from the data (population-level relationships are termed "fixed effects"; together they constitute a mixed effects model). This paper presents a methodology that combines the structure of mixed effects models for longitudinal and clustered data with the flexibility of tree-based estimation methods. We apply the resulting estimation method, called the RE-EM tree, to pricing in online transactions, showing that the RE-EM tree is less sensitive to parametric assumptions and provides improved predictive power compared to linear models with random effects and regression trees without random effects. We also apply it to a smaller data set on accident fatalities, and show that the RE-EM tree strongly outperforms a tree without random effects while performing comparably to a linear model with random effects. Extensive simulation experiments further show that the estimator improves predictive performance relative to regression trees without random effects and is comparable or superior to linear models with random effects in more general situations.
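A simplified EM-style sketch of the idea behind the RE-EM tree: alternate between fitting a regression tree to responses with the current random effects removed, and re-estimating a per-object random intercept from the tree's residuals. This reduced version uses random intercepts only and no shrinkage, so it is an illustration, not the authors' estimator.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def re_em_tree(X, y, groups, n_iter=10):
    """Alternate: (1) fit tree to y minus random effects; (2) set each object's
    random intercept to the mean residual of its observations."""
    groups = np.asarray(groups)
    b = {g: 0.0 for g in set(groups)}  # random intercept per object
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    for _ in range(n_iter):
        adj = y - np.array([b[g] for g in groups])
        tree.fit(X, adj)               # fixed-effects (population) structure
        resid = y - tree.predict(X)
        for g in b:                    # object-level means of the residuals
            b[g] = resid[groups == g].mean()
    return tree, b

X = np.array([[1.0], [2.0], [1.5], [2.5]])
y = np.array([3.0, 4.0, 1.0, 2.0])
tree, effects = re_em_tree(X, y, groups=["a", "a", "b", "b"])
print(effects)  # per-object intercepts the plain tree cannot capture
```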
