Similar Literature (20 records found)
1.
A contrast pattern is a set of items (itemset) whose frequency differs significantly between two classes of data. Such patterns describe distinguishing characteristics between datasets, are meaningful to human experts, have strong discriminating ability and can be used for powerful classifiers. Incrementally mining such patterns is very important for evolving datasets, where transactions can be either inserted or deleted and mining needs to be repeated after changes occur. When the change is small, it is undesirable to carry out mining from scratch. Rather, the set of previously mined contrast patterns should be reused where possible to compute the new patterns. A primary example of evolving data is a data stream, where the data is a sequence of continuously arriving transactions (or itemsets). In this paper, we propose an efficient technique for incrementally mining contrast patterns. Our algorithm particularly aims to avoid redundant computation which might occur due to simultaneous transaction insertion and deletion, as is the case for data streams. In an experimental study using real and synthetic data streams, we show our algorithm can be substantially faster than the previous approach.
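To make the definition concrete, here is a minimal brute-force sketch of the basic (non-incremental) contrast pattern notion: it enumerates short itemsets and keeps those whose support differs between two classes by at least a threshold. The function names and thresholds are illustrative assumptions, not the incremental algorithm the abstract describes.

```python
# A brute-force sketch of the contrast pattern notion (illustrative only;
# not the incremental algorithm described above): keep itemsets whose
# support differs between two classes by at least min_diff.
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def contrast_patterns(class_a, class_b, max_len=2, min_diff=0.4):
    items = set().union(*class_a, *class_b)
    patterns = []
    for k in range(1, max_len + 1):
        for combo in combinations(sorted(items), k):
            iset = frozenset(combo)
            diff = support(iset, class_a) - support(iset, class_b)
            if abs(diff) >= min_diff:
                patterns.append((iset, round(diff, 2)))
    return patterns

class_a = [{"a", "b"}, {"a", "b", "c"}, {"a"}]
class_b = [{"b", "c"}, {"c"}, {"b"}]
print(contrast_patterns(class_a, class_b))  # {'a'} appears only in class_a
```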

2.
Data mining involves the nontrivial process of extracting knowledge or patterns from large databases. Genetic Algorithms are efficient and robust search and optimization methods used in data mining. In this paper we propose a Self-Adaptive Migration Model GA (SAMGA), in which the population size, the number of crossover points and the mutation rate of each population are set adaptively, and the migration of individuals between populations is decided dynamically. The paper gives a mathematical schema analysis of the method, showing that the algorithm exploits previously discovered knowledge for a more focused and concentrated search of heuristically high-yielding regions while simultaneously performing a highly explorative search on the other regions of the search space. The effective performance of the algorithm is then shown using standard testbed functions and a set of real classification data mining problems. A Michigan-style classifier was used to build the classifier, and the system was tested on machine learning databases including the Pima Indians Diabetes database, the Wisconsin Breast Cancer database and a few others. The performance of our algorithm compares favourably with other approaches.
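The self-adaptive, multi-population design can be illustrated with a toy island-model GA: two populations evolve independently, each adjusting its mutation rate according to recent progress, with periodic migration of the best individual. This is a didactic sketch under loose assumptions, not the SAMGA algorithm itself.

```python
# A toy island-model GA sketch (not the paper's SAMGA): each island adapts
# its mutation rate to recent progress; the best individual migrates
# between islands every few generations.
import random

def fitness(x):                      # maximize a simple 1-D test function
    return -(x - 3.14) ** 2

def evolve(pop, mut_rate):
    pop = sorted(pop, key=fitness, reverse=True)
    parents = pop[: len(pop) // 2]
    children = [p + random.gauss(0, mut_rate) for p in parents]
    return parents + children

random.seed(0)
islands = [[random.uniform(-10, 10) for _ in range(20)] for _ in range(2)]
rates = [1.0, 1.0]

for gen in range(100):
    for i in range(2):
        best_before = max(map(fitness, islands[i]))
        islands[i] = evolve(islands[i], rates[i])
        best_after = max(map(fitness, islands[i]))
        # self-adaptation: shrink the mutation rate when progress stalls
        rates[i] *= 0.9 if best_after <= best_before else 1.1
    if gen % 10 == 0:                # periodic migration of the best individual
        for i in range(2):
            islands[1 - i].append(max(islands[i], key=fitness))

print(max(max(islands[0], key=fitness), max(islands[1], key=fitness)))
```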

3.
Although machine learning algorithms have been successfully used in many problems, their serious practical use is affected by the fact that often they cannot produce reliable and unbiased assessments of the quality of their predictions. In the last few years, several approaches for estimating the reliability or confidence of individual classifications have emerged, many of them building upon the algorithmic theory of randomness, such as (in historical order) transduction-based confidence estimation, typicalness-based confidence estimation, and transductive reliability estimation. Unfortunately, they all have weaknesses: either they are tightly bound to particular learning algorithms, or the interpretation of the reliability estimates is not always consistent with statistical confidence levels. In this paper we describe the typicalness and transductive reliability estimation frameworks and propose a joint approach that compensates for the above-mentioned weaknesses by integrating typicalness-based confidence estimation and transductive reliability estimation into a joint confidence machine. The resulting confidence machine produces confidence values in the statistical sense. We perform a series of tests with several machine learning algorithms in several problem domains, and compare our results with those of a proprietary method as well as with kernel density estimation. We show that the proposed method performs as well as proprietary methods and significantly outperforms density estimation methods. Matjaž Kukar is currently Assistant Professor in the Faculty of Computer and Information Science at the University of Ljubljana. His research interests include machine learning, data mining and intelligent data analysis, ROC analysis, cost-sensitive learning, reliability estimation, and latent structure analysis, as well as applications of data mining to medical and business problems.
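The typicalness ingredient can be sketched as a leave-one-out conformal p-value: the confidence for a candidate label is derived from the rank of its nonconformity score among calibration scores. The nearest-neighbour nonconformity measure below is a common textbook choice and an assumption on our part, not the authors' joint confidence machine.

```python
# A minimal typicalness-style (conformal) sketch: p-values per candidate
# label from nearest-neighbour nonconformity scores.
import numpy as np

def nonconformity(x, X_cal, y_cal, label):
    same = X_cal[y_cal == label]
    other = X_cal[y_cal != label]
    d_same = np.min(np.linalg.norm(same - x, axis=1))
    d_other = np.min(np.linalg.norm(other - x, axis=1))
    return d_same / (d_other + 1e-12)   # small score = typical of `label`

def p_value(x, label, X_cal, y_cal):
    a_test = nonconformity(x, X_cal, y_cal, label)
    a_cal = np.array([nonconformity(X_cal[i], np.delete(X_cal, i, 0),
                                    np.delete(y_cal, i), y_cal[i])
                      for i in range(len(X_cal))])
    return (np.sum(a_cal >= a_test) + 1) / (len(a_cal) + 1)

rng = np.random.default_rng(0)
X_cal = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y_cal = np.array([0] * 20 + [1] * 20)
x = np.array([0.2, -0.1])
print({c: round(p_value(x, c, X_cal, y_cal), 3) for c in (0, 1)})
```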

4.
Cognitive rehabilitation (CR) treatment consists of hierarchically organized tasks that require repetitive use of impaired cognitive functions in a progressively more demanding sequence. Active monitoring of the subjects' progress is therefore required, and the difficulty of the tasks must be progressively increased, always pushing the subjects to reach a goal just beyond what they can attain. There is a notable lack of well-established criteria for identifying the right tasks to propose to the patient. In this paper, the NeuroRehabilitation Range (NRR) is introduced as a means of identifying formal operational models that provide the therapist with dynamic decision-support information for assigning the most appropriate CR plan to each patient. Data mining techniques are used to build data-driven models for the NRR. The Sectorized and Annotated Plane (SAP) is proposed as a visual tool for identifying the NRR, and two data-driven methods to build the SAP are introduced and compared. An application to a specific representative cognitive task is presented. The results obtained suggest that the current clinical hypothesis about the NRR might need to be reconsidered. Prior knowledge in the area is taken into account by introducing the number of task executions and task performance into the NRR models, and a new model is proposed which outperforms the current clinical hypothesis. The NRR is introduced as a key concept providing an operational model that identifies when a patient is working within his or her Zone of Proximal Development and, consequently, experiencing maximum improvement. For the first time, data collected through a CR platform has been used to find a model for the NRR.

5.
In recent years, data mining has become one of the most popular techniques by which data owners determine their strategies. Association rule mining is a data mining approach that is widely used on traditional databases, usually to find positive association rules. However, there are other, more challenging rule mining topics, such as data stream mining and negative association rule mining. In addition, organizations increasingly want to concentrate on their own business and outsource the rest of their work. This approach, named the "database as a service" concept, provides many benefits to the data owner but at the same time raises some security problems. In this paper, a rule mining system is proposed that provides an efficient and secure solution for positive and negative association rule computation on XML data streams under the database-as-a-service concept. The system has been implemented, and several experiments with different synthetic data sets demonstrate its performance and efficiency.
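The positive/negative rule distinction reduces to simple arithmetic over supports: a negative rule X -> not-Y holds when X is frequent but Y rarely co-occurs with it. A minimal in-memory sketch follows, ignoring the XML-stream and security aspects of the proposed system; the thresholds and toy transactions are illustrative.

```python
# A small sketch of positive vs. negative association rules over plain
# in-memory itemsets (not the paper's secure XML-stream system).
def supp(items, db):
    return sum(1 for t in db if items <= t) / len(db)

def rules(db, x, y, min_supp=0.3, min_conf=0.7):
    s_x = supp(x, db)
    if s_x < min_supp:
        return []
    conf_pos = supp(x | y, db) / s_x          # confidence of x -> y
    conf_neg = 1 - conf_pos                   # confidence of x -> not-y
    out = []
    if conf_pos >= min_conf:
        out.append((f"{set(x)} -> {set(y)}", round(conf_pos, 2)))
    if conf_neg >= min_conf:
        out.append((f"{set(x)} -> NOT {set(y)}", round(conf_neg, 2)))
    return out

db = [{"bread", "beer"}, {"bread", "beer"}, {"bread", "beer"},
      {"bread", "beer"}, {"bread", "milk"}]
print(rules(db, frozenset({"bread"}), frozenset({"milk"})))
# -> the negative rule {'bread'} -> NOT {'milk'} with confidence 0.8
```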

6.
Tremendous advances in different areas of knowledge are producing vast volumes of data, a quantity so large that it has made the development of new computational algorithms necessary. Among the algorithms developed we find machine learning models and specific data mining techniques that can be useful in every area of knowledge. The use of computational tools for data analysis is increasingly required, given the need to extract meaningful information from such large volumes of data. However, there are no free-access libraries, modules, or web services that combine a broad array of analytical techniques in a user-friendly environment for non-specialist users. Those that exist raise high usability barriers for the untrained, as they usually have specific installation requirements, demand in-depth programming knowledge, or may prove expensive. As an alternative, we have developed DMAKit, a user-friendly web platform powered by DMAKit-lib, a new library implemented in Python, which facilitates the analysis of data of different kinds and origins. Our tool implements a wide array of state-of-the-art data mining and pattern recognition techniques, allowing the user to quickly build classification, prediction, or clustering models, run statistical evaluations, and perform feature analysis of different attributes in diverse datasets, without requiring any specific programming knowledge. DMAKit is especially useful for users who have large volumes of data to analyze but lack the informatics, mathematical, or statistical background to implement models. We expect this platform to provide a way to extract information and analyze patterns through data mining techniques for anyone interested in applying them, with no specific knowledge required. In particular, we present several case studies in the areas of biology, biotechnology, and biomedicine that highlight how our tool eases the labor of non-specialist users applying data analysis and pattern recognition techniques. DMAKit is available for non-commercial use as an open-access library, licensed under the GNU General Public License, version 3.0 (GPL-3.0). The web platform is publicly available at https://pesb2.cl/dmakitWeb. Demonstration and tutorial videos for the web platform are available at https://pesb2.cl/dmakittutorials/. Complete URLs for relevant content are listed in the Data Availability section.

7.
Implement web learning environment based on data mining
Qinglin Guo, Ming Zhang. 《Knowledge》, 2009, 22(6): 439-442.
The need to provide learners with web-based learning content that matches their accessibility needs and preferences, and to match learning content to users' devices, has been identified as an important issue in accessible educational environments. For a web-based, open, and dynamic learning environment, personalized support for learners becomes even more important. In order to achieve optimal efficiency in the learning process, the individual learner's cognitive learning style should be taken into account, and because different types of learners use these systems, it is necessary to provide them with individualized learning support. So far, however, the design and development of web-based learning environments for people with special abilities has been addressed mainly through hypermedia and multimedia built on educational content. In this paper, a framework for an individual web-based learning system is presented that focuses on the learner's cognitive learning process, learning patterns and activities, as well as the technology support needed. Based on a learner-centered mode and cognitive learning theory, we demonstrate the design and development of an online course that gives students learning flexibility and adaptability. The proposed framework utilizes a data mining algorithm to represent and extract the dynamic learning process and learning patterns, supporting students' deep learning, efficient tutoring, and collaboration in the web-based learning environment. Experiments show that the method is feasible for developing an individual web-based learning system, which is valuable for further, more in-depth study.

8.
Clustering is a classic problem in the machine learning and pattern recognition area; however, a few complications arise when we try to transfer proposed solutions to the data stream model. Recently, new algorithms have been proposed for the basic clustering problem on massive data sets that produce an approximate solution while using memory, the most critical resource in streaming computation, efficiently. In this paper, based on these solutions, we present a new model for clustering clickstream data which applies three different phases to the data processing and is validated through a set of experiments.
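As a hedged illustration of memory-bounded stream clustering (a generic baseline, not the paper's three-phase clickstream model), the sketch below runs online k-means: each arriving point updates only its nearest centroid, so memory stays O(k) regardless of stream length.

```python
# Memory-bounded online k-means over a stream of feature vectors
# (a generic streaming baseline, not the paper's three-phase model).
import math, random

def online_kmeans(stream, k=2):
    centroids, counts = [], []
    for x in stream:
        if len(centroids) < k:           # seed centroids from first k points
            centroids.append(list(x)); counts.append(1)
            continue
        j = min(range(k), key=lambda i: math.dist(centroids[i], x))
        counts[j] += 1
        eta = 1 / counts[j]              # decaying per-centroid learning rate
        centroids[j] = [c + eta * (xi - c) for c, xi in zip(centroids[j], x)]
    return centroids

random.seed(1)
stream = [(random.gauss(0, .5), random.gauss(0, .5)) for _ in range(500)] + \
         [(random.gauss(5, .5), random.gauss(5, .5)) for _ in range(500)]
random.shuffle(stream)
print(online_kmeans(stream))             # roughly (0, 0) and (5, 5)
```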

9.
The human visual sense has a unique status, offering a very broadband channel for information flow. Visual approaches to analysis and mining take advantage of our ability to perceive pattern and structure in visual form and to make sense of, or interpret, what we see. Visual data mining techniques have proven to be of high value in exploratory data analysis, and they also have high potential for mining large databases. In this work, we investigate and expand the area of visual data mining by proposing new visual data mining techniques for the visualization of mining outcomes.

10.
In this paper, we use data mining techniques for the automatic discovery of useful temporal abstractions in reinforcement learning. This idea is motivated by the ability of data mining algorithms to automatically discover structures and patterns when applied to large data sets. The state transitions and action trajectories of the learning agent are stored as the data sets for the data mining techniques. The proposed state clustering algorithms partition the state space into different regions; policies for reaching different parts of the space are learned separately and added to the model in the form of options (macro-actions). The main idea of the proposed action sequence mining is to search for patterns that occur frequently within the agent's accumulated experience; the mined action sequences are likewise added to the model as options. Our experiments with different data sets indicate a significant speedup of the Q-learning algorithm when it uses the options discovered by the state clustering and action sequence mining algorithms.
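The action-sequence half of the idea can be sketched as frequent n-gram counting over stored trajectories: sequences that recur across the agent's experience become candidate options. Names and thresholds below are illustrative assumptions.

```python
# Count contiguous action n-grams across stored trajectories and keep the
# frequent ones as candidate options (illustrative thresholds).
from collections import Counter

def frequent_action_sequences(trajectories, min_len=2, max_len=4, min_count=3):
    counts = Counter()
    for traj in trajectories:
        for n in range(min_len, max_len + 1):
            for i in range(len(traj) - n + 1):
                counts[tuple(traj[i:i + n])] += 1
    return [(seq, c) for seq, c in counts.items() if c >= min_count]

trajectories = [
    ["up", "up", "right", "pick"],
    ["up", "up", "right", "drop"],
    ["left", "up", "up", "right"],
]
print(frequent_action_sequences(trajectories))
# ('up', 'up', 'right') occurs in all three trajectories -> option candidate
```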

11.
Sequential pattern mining is an important data mining problem with broad applications. While current methods induce sequential patterns within a single attribute, the proposed method is able to detect them across different attributes. By incorporating these additional attributes, the sequential patterns found are richer and more informative to the user. This paper proposes a new method for inducing multi-dimensional sequential patterns using the Hellinger entropy measure, together with a number of theorems that reduce the computational complexity of the sequential pattern system. The proposed method is tested on synthesized transaction databases. Dr. Chang-Hwan Lee has been a full professor at the Department of Information and Communications at DongGuk University, Seoul, Korea since 1996. He received his B.Sc. and M.Sc. in Computer Science and Statistics from Seoul National University in 1982 and 1988, respectively, and his Ph.D. in Computer Science and Engineering from the University of Connecticut in 1994. Prior to joining DongGuk University, he worked for AT&T Bell Laboratories, Middletown, USA (1994-1995), and was a visiting professor at the University of Illinois at Urbana-Champaign (2000-2001). He is author or co-author of more than 50 refereed articles on topics such as machine learning, data mining, artificial intelligence, pattern recognition, and bioinformatics.
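For reference, the Hellinger measure at the heart of such interestingness criteria compares two discrete distributions, here the class distribution overall versus where a candidate pattern holds; a larger distance marks a more informative pattern. The sketch below shows only this ingredient, not the full multi-dimensional miner, and the example distributions are invented.

```python
# Hellinger distance between two discrete distributions: the
# interestingness ingredient only, not the full pattern miner.
import math

def hellinger(p, q):
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q))) / math.sqrt(2)

prior = [0.5, 0.5]            # class distribution over all sequences
with_pattern = [0.9, 0.1]     # class distribution where the pattern occurs
print(round(hellinger(prior, with_pattern), 3))  # larger = more informative
```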

12.
Drilling is one of the most important operations in the aeronautics industry. It is performed on aircraft wings, and its main problem is burr generation. At present, a visual inspection and manual burr elimination task is carried out after drilling and before riveting to ensure product quality; these operations increase the cost of, and the resources required by, the process. This article shows the use of data mining techniques to obtain a reliable model for detecting burr generation during high-speed drilling in dry conditions on aluminium Al 7075-T6, making it possible to eliminate the unproductive operations, optimize the process, and reduce economic cost. Furthermore, this model should later be implementable in a monitoring system that automatically detects, on-line, whether the generated burr is out of tolerance. The article explains the whole process of data analysis, from data preparation to the evaluation and selection of the final model.

13.
Instance selection aims at filtering out noisy data (or outliers) from a given training set, which not only reduces the need for storage space but can also ensure that a classifier trained on the reduced set performs as well as or better than a baseline classifier trained on the original set. However, since there are numerous instance selection algorithms, there is no single winner that is best across the datasets of different problem domains; instance selection performance is algorithm- and dataset-dependent. One main reason for this is that it is very hard to define what the outliers are across different datasets. Note also that, with a specific instance selection algorithm, over-selection may occur: too many 'good' data samples are filtered out, and the resulting classifier performs worse than the baseline. In this paper, we introduce a dual classification (DuC) approach that aims to deal with this potential drawback of over-selection. Specifically, after performing instance selection over a given training set, two classifiers are trained, one on the 'good' set and one on the 'noisy' set identified by the instance selection algorithm. For each test sample, the similarities between that sample and the data in the good and noisy sets are compared, and this comparison routes the test sample to one of the two classifiers. Experiments conducted on 50 small-scale and 4 large-scale datasets demonstrate the superior performance of the proposed DuC approach over the baseline instance selection approach.
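The DuC procedure is concrete enough to sketch end to end. Below, a simple edited-nearest-neighbour filter stands in for "an instance selection algorithm" (the paper does not prescribe one), decision trees serve as the two classifiers, and nearest-neighbour distance routes each test sample; all of these concrete choices are our assumptions.

```python
# A sketch of the DuC route-and-predict idea. The ENN-style filter,
# decision trees, and 1-NN routing are illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, y_tr, X_te, y_te = X[:250], y[:250], X[250:], y[250:]

# Instance selection: call a training point "noisy" when its 3 nearest
# neighbours (excluding itself) vote for the other class.
nn = KNeighborsClassifier(n_neighbors=4).fit(X_tr, y_tr)
neigh = nn.kneighbors(X_tr, return_distance=False)[:, 1:]  # drop self
noisy = (y_tr[neigh].mean(axis=1) >= 0.5).astype(int) != y_tr
good = ~noisy

clf_good = DecisionTreeClassifier(random_state=0).fit(X_tr[good], y_tr[good])
clf_noisy = (DecisionTreeClassifier(random_state=0).fit(X_tr[noisy], y_tr[noisy])
             if noisy.any() else clf_good)   # guard the degenerate case

def duc_predict(x):
    d_good = np.linalg.norm(X_tr[good] - x, axis=1).min()
    d_noisy = (np.linalg.norm(X_tr[noisy] - x, axis=1).min()
               if noisy.any() else np.inf)
    clf = clf_good if d_good <= d_noisy else clf_noisy   # route by similarity
    return clf.predict(x.reshape(1, -1))[0]

acc = np.mean([duc_predict(x) == t for x, t in zip(X_te, y_te)])
print(f"DuC accuracy: {acc:.2f}")
```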

14.
Most data mining algorithms assume static behavior of the incoming data. In the real world the situation is different: most continuously collected data streams are generated by dynamic processes that may change over time, in some cases drastically. A change in the underlying concept, also known as concept drift, causes a data mining model generated from past examples to become less accurate and relevant for classifying current data. Most online learning algorithms deal with concept drift by generating a new model every time a drift is detected. On the one hand, this solution ensures accurate and relevant models at all times, implying an increase in classification accuracy; on the other hand, it suffers from a major drawback, namely the high computational cost of generating new models. The problem worsens when drifts are detected frequently, and hence a compromise between computational effort and accuracy is needed. This work describes a series of incremental algorithms that are shown empirically to produce more accurate classification models than batch algorithms in the presence of concept drift, while being computationally cheaper than existing incremental methods. The proposed incremental algorithms are based on an advanced decision-tree learning methodology called "Info-Fuzzy Network" (IFN), which is capable of inducing compact and accurate classification models. The algorithms are evaluated on real-world streams of traffic and intrusion-detection data.
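The drift-detection trigger that such systems rely on can be sketched generically: track the running error rate and signal drift when it rises well above its best observed level. This is a DDM-style rule, an assumption on our part; the paper's actual contribution is the incremental IFN learner, which is not reproduced here.

```python
# A generic DDM-style drift trigger: signal drift when the running error
# rate rises well above its best observed level.
import math

class DriftDetector:
    def __init__(self, warn=2.0, drift=3.0):
        self.n = 0; self.errors = 0
        self.best = (float("inf"), 0.0)       # (p_min, s_min)
        self.warn, self.drift = warn, drift

    def update(self, mistake: bool) -> str:
        self.n += 1
        self.errors += int(mistake)
        p = self.errors / self.n              # running error rate
        s = math.sqrt(p * (1 - p) / self.n)   # its standard deviation
        if p + s < sum(self.best):
            self.best = (p, s)                # remember the best level seen
        p_min, s_min = self.best
        if p + s > p_min + self.drift * s_min:
            self.__init__(self.warn, self.drift)  # reset after drift
            return "drift"
        if p + s > p_min + self.warn * s_min:
            return "warning"
        return "in-control"

det = DriftDetector()
outcomes = [False] * 200 + [bool(i % 2) for i in range(100)]  # drift at 200
for i, m in enumerate(outcomes):
    if det.update(m) == "drift":
        print("drift detected at example", i)
        break
```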

15.
With the rapid development of the economy, our society has entered the age of networked information, generating data by the tens of thousands. Behind these varied data lies a great deal of hidden information, and the question of how to find regularities in them and discover information helpful to our lives is attracting more and more attention. Data mining, as discussed here, is the process of extracting implicit, previously unknown, yet latent and useful information and knowledge from large volumes of incomplete, noisy, fuzzy, and random real-world application data. The purpose of data mining technology is to cope with today's information explosion by providing scientific and effective means for processing massive amounts of information.

16.
The objective of this paper is to present and discuss a link mining algorithm called CorpInterlock and its application to the financial domain. The algorithm selects the largest strongly connected component of a social network and ranks its vertices using several indicators of distance and centrality. These indicators are merged with other relevant indicators in order to forecast new variables using a boosting algorithm. We applied CorpInterlock to integrate the metrics of an extended corporate interlock (the social network of directors and financial analysts) with corporate fundamental variables and analysts' predictions (consensus), and used these metrics to forecast the trend of the cumulative abnormal return and earnings surprise of S&P 500 companies. The rationale behind this approach is that the corporate interlock has a direct effect on future earnings and returns, because these variables affect directors' and managers' compensation. Financial analysts engage in what agency theory calls the "earnings game": managers want to meet the analysts' financial forecasts, and analysts want to increase their compensation or the business of the company they follow. Following the CorpInterlock algorithm, we calculated a group of well-known social network metrics and integrated them with economic variables using LogitBoost. We used the results of the CorpInterlock algorithm to evaluate several trading strategies, and observed an improvement in the Sharpe ratio (risk-adjusted return) when "long only" trading strategies used the extended corporate interlock instead of the basic corporate interlock before Regulation Fair Disclosure (FD) was adopted (1998-2001); there was no major difference among the trading strategies after 2001. Additionally, the CorpInterlock algorithm implemented with LogitBoost showed a significantly lower test error than when implemented with logistic regression. We conclude that CorpInterlock is an effective forecasting algorithm that supports profitable trading strategies. A preliminary version of this paper was presented at the Link Analysis: Dynamics and Statics of Large Networks Workshop at the International Conference on Knowledge Discovery and Data Mining (KDD) 2006.
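The graph stage of CorpInterlock, extracting the largest strongly connected component and computing centrality indicators per vertex, can be sketched with networkx; the boosting stage that merges these features with fundamentals is omitted, and the toy graph below is ours.

```python
# The graph stage only: largest strongly connected component plus
# per-vertex centrality indicators (toy graph; boosting stage omitted).
import networkx as nx

G = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "A"),  # 3-node cycle: an SCC
                ("C", "D"), ("D", "E")])             # dangling tail

core = G.subgraph(max(nx.strongly_connected_components(G), key=len))
deg = nx.degree_centrality(core)
clo = nx.closeness_centrality(core)
bet = nx.betweenness_centrality(core)

# One feature row per vertex, ready to be merged with fundamental
# variables for the forecasting stage.
features = {v: (deg[v], clo[v], bet[v]) for v in core}
print(features)
```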

17.
This paper introduces KEEL, a software tool to assess evolutionary algorithms for data mining problems of various kinds, including regression, classification, and unsupervised learning. It includes evolutionary learning algorithms based on different approaches (Pittsburgh, Michigan, and IRL) and integrates evolutionary learning techniques with different pre-processing techniques, allowing it to perform a complete analysis of any learning model in comparison with existing software tools. Moreover, KEEL has been designed with a double goal: research and education. Supported by the Spanish Ministry of Science and Technology under Projects TIN-2005-08386-C05-(01, 02, 03, 04 and 05). The work of Dr. Bacardit is also supported by the UK Engineering and Physical Sciences Research Council (EPSRC) under grant GR/T07534/01.

18.
Data mining is the non-trivial process of revealing implicit, previously unknown, and potentially useful information from the large volumes of data in a database. Visual data mining techniques are used here to find patterns in a dataset of football matches. These patterns can directly or indirectly provide useful insights into football matches and can be applied in an in-game decision support system.

19.
In the past few years, active learning has been reasonably successful and has drawn a lot of attention. However, recent active learning methods have focused on strategies in which a large unlabeled dataset has to be reprocessed at each learning iteration; as datasets grow, these strategies become inefficient, or even a tremendous computational challenge. To address these issues, we propose an effective and efficient active learning paradigm that attains a significant reduction in the size of the learning set through an a priori process that identifies and organizes a small relevant subset. Furthermore, the concomitant classification and selection processes enable the classification of a very small number of samples while still selecting the informative ones. Experimental results show that the proposed paradigm achieves high accuracy quickly and with minimal user interaction, further improving its efficiency.
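For contrast, the generic pool-based baseline being improved upon can be sketched as uncertainty sampling: at each round, the model asks the oracle to label the pool sample it is least sure about. The dataset, model, and labeling budget below are illustrative assumptions, not the paper's paradigm.

```python
# Plain pool-based uncertainty sampling: label the sample with the
# smallest decision margin at each round (generic baseline sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
labeled = np.concatenate([np.where(y == 0)[0][:5],
                          np.where(y == 1)[0][:5]]).tolist()  # 5 per class
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(20):                        # 20 oracle queries
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])
    margin = np.abs(proba[:, 1] - 0.5)     # small margin = most uncertain
    pick = pool[int(np.argmin(margin))]
    labeled.append(pick)                   # the oracle labels this sample
    pool.remove(pick)

print("accuracy on the remaining pool:", round(clf.score(X[pool], y[pool]), 3))
```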

20.
Classification is a well-studied problem in data mining. Classification performance was originally gauged almost exclusively using predictive accuracy, but as work in the field progressed, more sophisticated measures of classifier utility that better represented the value of the induced knowledge were introduced. Nonetheless, most work still ignored the cost of acquiring training examples, even though this cost impacts the total utility of the data mining process. In this article we analyze the relationship between the number of acquired training examples and the utility of the data mining process and, given the necessary cost information, we determine the number of training examples that yields the optimum overall performance. We then extend this analysis to include the cost of model induction, measured in terms of the CPU time required to generate the model. While our cost model does not take into account all possible costs, our analysis provides some useful insights and a template for future analyses using more sophisticated cost models. Because our analysis is based on experiments that acquire the full set of training examples, it cannot directly be used to find a classifier with optimal or near-optimal total utility. To address this issue we introduce two progressive sampling strategies that are empirically shown to produce classifiers with near-optimal total utility.
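A progressive sampling loop under a toy utility model (benefit = accuracy, cost = a per-example acquisition price) can be sketched as follows; both the geometric schedule and the cost constant are illustrative assumptions, not the authors' strategies.

```python
# Geometric progressive sampling under a toy utility model: stop acquiring
# once the marginal utility of more training data turns negative.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
cost_per_example = 0.00005           # utility units per acquired example

n, prev_utility = 125, -1.0
while n <= len(X):
    acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                          X[:n], y[:n], cv=5).mean()
    utility = acc - cost_per_example * n
    print(f"n={n:5d}  acc={acc:.3f}  utility={utility:.3f}")
    if utility <= prev_utility:      # marginal gain no longer pays for data
        break
    prev_utility, n = utility, n * 2 # geometric sampling schedule
```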

