Similar Documents
Found 20 similar documents (search time: 656 ms)
1.
2.
A Hybrid-Ensemble-Based Concept Drift Detection Method for Data Streams
In recent years, data stream classification has attracted wide attention, and drift detection is one of its key research problems. Existing classification models include single ensemble models and hybrid models, and their drift detection mechanisms are mostly based on idealized distribution assumptions. Single-model ensembles can amplify classification error, so their performance suffers in noisy environments, while hybrid ensemble models often struggle to achieve both high classification accuracy and good time performance. To address this, building on the simple WE ensemble framework, we construct WE-DTB, an ensemble classification method based on a hybrid of decision trees and naive Bayes, and employ the classic concept drift detection mechanisms of Hoeffding bounds and the μ test to detect concept drift and classify instances in data stream environments. Extensive experiments show that WE-DTB detects concept drift effectively while delivering good classification accuracy and time/space performance.
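The Hoeffding-bound drift test mentioned in the abstract can be sketched as a comparison of error rates between a reference window and the current window (an illustrative reconstruction, not the paper's WE-DTB code; the two-window layout and the δ value are assumptions):

```python
import math

def hoeffding_bound(value_range, delta, n):
    # epsilon such that the true mean lies within +/- epsilon of the
    # sample mean with probability 1 - delta (Hoeffding's inequality)
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def drift_detected(ref_errors, cur_errors, delta=0.05):
    # Flag drift when the error-rate gap between a reference window and
    # the current window exceeds the combined Hoeffding bounds.
    ref_rate = sum(ref_errors) / len(ref_errors)
    cur_rate = sum(cur_errors) / len(cur_errors)
    eps = hoeffding_bound(1.0, delta, len(ref_errors)) \
        + hoeffding_bound(1.0, delta, len(cur_errors))
    return (cur_rate - ref_rate) > eps
```

On a stable stream the error gap stays inside the bound and no drift is flagged; a sudden error burst in the current window triggers detection.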

3.
Important locations are the main places where people spend their daily lives, such as home and the workplace. The continued development and spread of smartphones has brought great convenience to daily life. Beyond traditional applications such as calls and Internet access, the log records automatically generated when a phone connects to base stations are also an important data source for mining user behavior patterns, for example discovering important locations. However, this work faces many challenges, including the massive scale of trajectory data, low positioning accuracy, and the diversity of mobile phone users. This paper therefore proposes a general framework to improve the usability of trajectory data. The framework contains a state-based filtering module, which improves data usability, and an important-location mining module. On top of this framework, two distributed mining algorithms are designed: GPMA (Grid-based Parallel Mining Algorithm) and SPMA (Station-based Parallel Mining Algorithm). Furthermore, to improve the accuracy and precision of the mining results, three optimizations are applied: 1) fusing multi-source data to improve result accuracy; 2) an algorithm for discovering people without a fixed workplace; and 3) an algorithm for discovering people who work at night. Theoretical analysis and experimental results show that the proposed algorithms achieve high efficiency, scalability, and higher precision.

4.
A data mining (DM) process involves multiple stages. A simple, but typical, process might include preprocessing data, applying a data mining algorithm, and postprocessing the mining results. There are many possible choices for each stage, and only some combinations are valid. Because of the large space and nontrivial interactions, both novices and data mining specialists need assistance in composing and selecting DM processes. Extending notions developed for statistical expert systems, we present a prototype intelligent discovery assistant (IDA), which provides users with 1) systematic enumerations of valid DM processes, in order that important, potentially fruitful options are not overlooked, and 2) effective rankings of these valid processes by different criteria, to facilitate the choice of DM processes to execute. We use the prototype to show that an IDA can indeed provide useful enumerations and effective rankings in the context of simple classification processes. We discuss how an IDA could be an important tool for knowledge sharing among a team of data miners. Finally, we illustrate the claims with a demonstration of cost-sensitive classification using a more complicated process and data from the 1998 KDDCUP competition.

5.
Textual databases are useful sources of information and knowledge and, if these are well utilised, then issues related to future project management and product or service quality improvement may be resolved. A large part of corporate information, approximately 80%, is available in textual data formats. Text classification techniques are well known for managing on-line sources of digital documents. The identification of key issues discussed within textual data and their classification into two different classes could help decision makers or knowledge workers to manage their future activities better. This research is relevant for most text-based documents and is demonstrated on Post Project Reviews (PPRs), which are a valuable source of information and knowledge. The application of textual data mining techniques for discovering useful knowledge and classifying textual data into different classes is a relatively new area of research. The research work presented in this paper is focused on the use of hybrid applications of text mining or textual data mining techniques to classify textual data into two different classes. The research applies clustering techniques at the first stage and Apriori Association Rule Mining at the second stage. Apriori Association Rule Mining is applied to generate Multiple Key Term Phrasal Knowledge Sequences (MKTPKS), which are later used for classification. Additionally, studies were made to improve the classification accuracies of the classifiers, i.e., C4.5, K-NN, Naïve Bayes and Support Vector Machines (SVMs). The classification accuracies were measured and the results compared with those of a single-term-based classification model. The methodology proposed could be used to analyse any free-formatted textual data, and in the current research it has been demonstrated on an industrial dataset consisting of Post Project Reviews (PPRs) collected from the construction industry. The data or information available in these reviews is codified in multiple different formats, but in the current research scenario only free-formatted text documents are examined. Experiments showed that the performance of classifiers improved through adopting the proposed methodology.
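The Apriori stage described above can be illustrated with a minimal level-wise miner over tokenized documents (a generic Apriori sketch, not the authors' MKTPKS implementation; the example documents and support threshold are invented):

```python
from itertools import combinations

def apriori_term_sets(docs, min_support):
    # docs: list of token sets; returns frequent term sets (frozensets)
    # mapped to their support counts, grown level-wise Apriori-style.
    frequent = {}
    counts = {}
    for d in docs:                       # level 1: single terms
        for t in d:
            key = frozenset([t])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_support}
    while level:
        frequent.update(level)
        items = sorted({t for s in level for t in s})
        k = len(next(iter(level))) + 1
        # candidate generation with Apriori pruning: every (k-1)-subset
        # of a candidate must itself be frequent
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in level
                             for sub in combinations(c, k - 1))]
        level = {}
        for cand in candidates:
            c = sum(1 for d in docs if cand <= d)
            if c >= min_support:
                level[cand] = c
    return frequent
```

On three toy "review" token sets, the miner keeps the term pairs supported by at least two documents and prunes the rest.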

6.
In the present day, the oversaturation of data has complicated the process of finding information from a data source. Recommender systems aim to alleviate this problem in various domains by actively suggesting selective information to potential users based on their personal preferences. Amongst these approaches, collaborative filtering based recommenders (CF recommenders), which make use of users’ implicit and explicit ratings for items, are widely regarded as the most successful type of recommender system. However, CF recommenders are sensitive to issues caused by data sparsity, where users rate very few items, or items receive very few ratings from users, meaning there is not enough data to give a recommendation. The majority of studies have attempted to solve these issues by focusing on developing new algorithms within a single domain. Recently, cross-domain recommenders that use multiple domain datasets have attracted increasing attention amongst the research community. Cross-domain recommenders assume that users who express their preferences in one domain (called the target domain) will also express their preferences in another domain (called the source domain), and that these additional preferences will improve precision and recall of recommendations to the user. The purpose of this study is to investigate the effects of various data sparsity and data overlap issues on the performance of cross-domain CF recommenders, using various aggregation functions. In this study, several different cross-domain recommenders were created by collecting three datasets from three separate domains of a large Korean fashion company and combining them with different algorithms and different aggregation approaches. The cross-recommenders that used high-performance, high-overlap domains showed significant improvement of precision and recall of recommendation when the recommendation scores of individual domains were combined using the summation aggregation function. However, the cross-recommenders that used low-performance, low-overlap domains showed little or no performance improvement in all areas. This result implies that the use of cross-domain recommenders does not guarantee performance improvement; rather, it is necessary to consider the relevant factors carefully to achieve performance improvement when using cross-domain recommenders.
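The summation aggregation function that performed best in the study can be sketched as a per-item sum of domain scores (a hedged illustration; the optional per-domain weights and the score scales are assumptions, not the study's setup):

```python
def aggregate_scores(domain_scores, weights=None):
    # domain_scores: {domain: {item: score}}; combine per-item scores
    # across domains by (optionally weighted) summation.
    weights = weights or {d: 1.0 for d in domain_scores}
    combined = {}
    for domain, scores in domain_scores.items():
        w = weights.get(domain, 1.0)
        for item, s in scores.items():
            combined[item] = combined.get(item, 0.0) + w * s
    return combined
```

An item scored in both the target and source domain accumulates evidence from each, which is exactly how a strong source domain can lift recall for sparse target-domain users.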

7.
方丁  王刚 《计算机系统应用》2012,21(7):177-181,248
With the rapid growth of Web 2.0, more and more users are willing to share their opinions and experiences on the Internet. This review information is expanding so quickly that manual methods cannot cope with collecting and processing such massive volumes of online information, so computer-based text sentiment classification has emerged, with improving classification accuracy as one of its main research goals. Since ensemble learning is an effective way to improve classification accuracy and has shown performance superior to individual classifiers in many fields, this paper proposes a text sentiment classification method based on ensemble learning. Experimental results show that three common ensemble methods, Bagging, Boosting and Random Subspace, all improve the accuracy of the base classifiers, and that across different base classifiers Random Subspace is statistically superior to Bagging and Boosting. These results further confirm the effectiveness of ensemble learning for text sentiment classification.
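The Random Subspace method, which the experiments found statistically strongest, can be sketched in a few lines (illustrative only; the 1-nearest-neighbour base learner here is a stand-in for the base classifiers actually studied, and the data is invented):

```python
import random
from collections import Counter

def train_random_subspace(X, y, n_models=11, subspace_frac=0.5, seed=7):
    # Random Subspace: each base model sees a random subset of features.
    rng = random.Random(seed)
    n_feats = len(X[0])
    k = max(1, int(n_feats * subspace_frac))
    models = []
    for _ in range(n_models):
        feats = rng.sample(range(n_feats), k)
        models.append((feats, X, y))
    return models

def predict(models, x):
    # Majority vote over base models; each base model is a 1-NN
    # classifier restricted to its own feature subspace.
    votes = []
    for feats, X, y in models:
        dists = [sum((x[f] - xi[f]) ** 2 for f in feats) for xi in X]
        votes.append(y[dists.index(min(dists))])
    return Counter(votes).most_common(1)[0][0]
```

Training each member on a different feature subset decorrelates their errors, which is the usual explanation for Random Subspace's robustness on high-dimensional text features.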

8.
As geospatial data grows explosively, there is a great demand for the incorporation of data mining techniques into a geospatial context. Association rule mining is a core technique in data mining and is a solid candidate for the associative analysis of large geospatial databases. In this article, we propose a geospatial knowledge discovery framework for automating the detection of multivariate associations based on a given areal base map. We investigate a series of geospatial preprocessing steps involving data conversion and classification so that traditional Boolean and quantitative association rule mining can be applied. Our framework has been integrated into GISs using a dynamic link library to allow the automation of both the preprocessing and data mining phases to provide greater ease of use for users. Experiments with real crime datasets quickly reveal interesting frequent patterns and multivariate associations, which demonstrates the robustness and efficiency of our approach.

9.
Classification is one of the most popular data mining techniques applied to many scientific and industrial problems. The efficiency of a classification model is evaluated by two parameters, namely the accuracy and the interpretability of the model. While most of the existing methods claim superior accuracy over others, their models are usually complex and hardly understandable for the users. In this paper, we propose a novel classification model that is based on easily interpretable fuzzy association rules and fulfils both efficiency criteria. Since the accuracy of a classification model can be largely affected by the partitioning of numerical attributes, this paper discusses several fuzzy and crisp partitioning techniques. The proposed classification method is compared to 15 previously published association rule-based classifiers by testing them on five benchmark data sets. The results show that the fuzzy association rule-based classifier presented in this paper offers a compact, understandable and accurate classification model.

10.
The CCDM 2014 data mining competition posed, on medical diagnosis data, a multi-label problem and a multi-class classification problem, both of which arise widely in real life. To handle the class imbalance and small training sets present in both problems, and drawing on the ideas of secondary learning and ensemble learning, a new learning framework, secondary ensemble learning, is proposed. The framework first runs an initial round of ensemble learning to obtain a set of high-confidence samples, adds them to the original training set, and then learns again on the enlarged training set, yielding a classifier with better generalization. The competition results show that, compared with standard ensemble learning, secondary ensemble learning achieved very good results on both problems.
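The secondary ensemble learning loop described above, i.e. pseudo-labeling high-confidence samples, enlarging the training set, and retraining, can be sketched generically (a toy nearest-centroid learner stands in for the competition's ensembles; the confidence measure and the 0.9 threshold are assumptions):

```python
def fit_centroid(X, y):
    # toy base learner: per-class mean vector
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return {c: [sum(col) / len(pts) for col in zip(*pts)]
            for c, pts in groups.items()}

def predict_conf(centroids, x):
    # (label, confidence); confidence is the margin between the two
    # nearest class centroids
    dists = sorted((sum((a - b) ** 2 for a, b in zip(cent, x)), c)
                   for c, cent in centroids.items())
    label = dists[0][1]
    if len(dists) == 1:
        return label, 1.0
    d1, d2 = dists[0][0], dists[1][0]
    return label, 1.0 - d1 / (d1 + d2 + 1e-12)

def secondary_ensemble(X_train, y_train, X_test, threshold=0.9):
    # Round 1: fit, then pseudo-label high-confidence test samples.
    m1 = fit_centroid(X_train, y_train)
    px, py = [], []
    for x in X_test:
        label, conf = predict_conf(m1, x)
        if conf >= threshold:
            px.append(x)
            py.append(label)
    # Round 2: retrain on the enlarged training set.
    return fit_centroid(X_train + px, y_train + py)
```

Ambiguous samples (low margin) are left out of the second round, which is what protects the retrained model from noisy pseudo-labels.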

11.
In spatial data mining, the spatial dimension adds substantial complexity to the data mining task. First, spatial objects are characterized by a geometrical representation and relative positioning with respect to a reference system, which implicitly define both spatial relationships and properties. Second, spatial phenomena are characterized by autocorrelation, i.e., observations of spatially distributed random variables are not location-independent. Third, spatial objects can be considered at different levels of abstraction (or granularity). The recently proposed SPADA algorithm deals with all these sources of complexity, but it offers a solution only for the task of spatial association rule discovery. In this paper the problem of mining spatial classifiers is addressed by building an associative classification framework on SPADA. We consider two alternative solutions for associative classification: a propositional and a structural method. In the former, SPADA obtains a propositional representation of training data even in spatial domains which are inherently non-propositional, thus allowing the application of traditional data mining algorithms. In the latter, the Bayesian framework is extended following a multi-relational data mining approach in order to cope with spatial classification tasks. Both methods are evaluated and compared on two real-world spatial datasets, and the results provide several empirical insights into them.

12.
In this paper we introduce a method called CL.E.D.M. (CLassification through ELECTRE and Data Mining) that employs aspects of the methodological framework of the ELECTRE I outranking method and aims at increasing the accuracy of existing data mining classification algorithms. In particular, the method chooses the best decision rules extracted from the training process of the data mining classification algorithms, and then it assigns the classes that correspond to these rules to the objects that must be classified. Three well-known data mining classification algorithms are tested on five widely used databases to verify the robustness of the proposed method.

13.
One of the known classification approaches in data mining is rule induction (RI). RI algorithms such as PRISM usually produce If-Then classifiers, which have a comparable predictive performance to other traditional classification approaches such as decision trees and associative classification. Hence, these classifiers are favourable for carrying out decisions by users and therefore they can be utilised as decision making tools. Nevertheless, RI methods, including PRISM and its successors, suffer from a number of drawbacks, primarily the large number of rules derived. This can be a burden especially when the input data is highly dimensional. Therefore, pruning unnecessary rules becomes essential for the success of this type of classifier. This article proposes a new RI algorithm that reduces the search space for candidate rules by early pruning of any irrelevant items during the process of building the classifier. Whenever a rule is generated, our algorithm updates the candidate item frequencies to reflect the discarded data examples associated with the rules derived. This makes item frequencies dynamic rather than static and ensures that irrelevant rules are deleted in preliminary stages when they do not hold enough data representation. The major benefit is a concise set of decision making rules that are easy to understand and controlled by the decision maker. The proposed algorithm has been implemented in the WEKA (Waikato Environment for Knowledge Analysis) environment and hence it can now be utilised by different types of users such as managers, researchers, students and others. Experimental results using real data from the security domain as well as sixteen classification datasets from the University of California Irvine (UCI) repository reveal that the proposed algorithm is competitive with regard to classification accuracy when compared to known RI algorithms. Moreover, the classifiers produced by our algorithm are smaller in size, which increases their applicability in practical settings.
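The dynamic item-frequency idea can be illustrated with a greedy one-condition rule learner that recounts frequencies after each rule is emitted (a PRISM-flavoured sketch, not the article's algorithm; the toy data and `min_freq` threshold are invented):

```python
def induce_rules(data, target, min_freq=2):
    # data: list of dicts holding attribute values plus the class under
    # `target`. Repeatedly pick the (attribute, value) item with the
    # best class purity, emit it as a rule, discard covered examples,
    # and recount item frequencies on the remainder so rare items are
    # pruned early (dynamic rather than static frequencies).
    rules = []
    remaining = list(data)
    while remaining:
        counts = {}
        for row in remaining:
            for attr, val in row.items():
                if attr == target:
                    continue
                entry = counts.setdefault((attr, val), [0, {}])
                entry[0] += 1
                cls = row[target]
                entry[1][cls] = entry[1].get(cls, 0) + 1
        best = None
        for (attr, val), (total, per_cls) in counts.items():
            if total < min_freq:          # prune low-frequency items
                continue
            cls, hits = max(per_cls.items(), key=lambda kv: kv[1])
            purity = hits / total
            if best is None or purity > best[0]:
                best = (purity, attr, val, cls)
        if best is None:
            break                         # nothing frequent enough left
        _, attr, val, cls = best
        rules.append(((attr, val), cls))
        remaining = [r for r in remaining if r[attr] != val]
    return rules
```

Because counting restarts on the uncovered examples each round, an item that looked frequent globally can fall below `min_freq` once its supporting rows are consumed by earlier rules.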

14.
The use of social networks has grown noticeably in recent years, and this has led to the production of enormous volumes of data. The data users generate on social media sites are very large, noisy, unstructured and dynamic. Providing a flexible framework and method applicable to all of these networks would be an ideal solution. The uncertainties arising from the complexity of decisions in recognizing tie strength among people have led researchers to seek variables that effectively capture intimacy among people. Since there are several such variables whose effectiveness is not precisely determined, and whose relations are nonlinear and complex, using data mining techniques can be considered one of the practical solutions to this problem. Several unsupervised mining methods have been applied to detecting the type of tie. Data mining can be considered one of the applicable tools for researchers in exploring the relationships among users. In this paper, the problem of tie strength prediction is modeled as a data mining problem to which different supervised and unsupervised mining methods are applicable. We propose a comprehensive study on the effects of using different classification techniques, such as decision trees and Naive Bayes, in addition to ensemble classification methods such as Bagging and Boosting, for predicting the tie strength of users of a social network. The LinkedIn social network is used as a real case study, and our experimental results are reported on data extracted from it. Several models, based on basic techniques and ensemble methods, are created and their efficiencies compared based on F-measure, accuracy, and average execution time. Our experimental results show that our profile-behavioral based model has much better accuracy in comparison with profile-data based models.

15.
Remotely sensed imagery has become increasingly important in several application domains, such as environmental monitoring, change detection, fire risk mapping and land use, to name only a few. Several advanced image classification techniques have been developed to analyze such imagery and in particular to improve the accuracy of classifying images in the context of such applications. However, most of the proposed classifiers remain a black box to users, leaving them with little to no means to explore and thus further improve the classification process, in particular for misclassified pixel samples. In this paper, we present the concepts, design and implementation of VDM-RS, a visual data mining system for classifying remotely sensed images and exploring image classification processes. The system provides users with two classes of components. First, visual components are offered that are specific to classifying remotely sensed images and provide traditional interfaces, such as a map view and an error matrix view. Second, the decision tree classifier view provides users with the functionality to trace and explore the classification process of individual pixel samples. This feature allows users to inspect how a sample has been correctly classified using the classifier, but more importantly, it also allows for a detailed exploration of the steps in which a sample has been misclassified. The integration of these features into a coherent, user-friendly system not only helps users gain more insight into the data, but also helps them better understand and subsequently improve a classifier for remotely sensed images. We demonstrate the functionality of the system's components and their interaction for classifying imagery using a hyperspectral image dataset.

16.
Research on an OLAM Model Based on OLAP and DM
OLAP (On-Line Analytical Processing) and data mining differ in both their implementation techniques and their scope of application, and they must be used in coordination to work best in decision analysis. The OLAM (OLAP Mining) model based on OLAP and DM combines data mining and OLAP within a single unified framework, enabling decision makers and analysts to examine data from different perspectives and at different levels, strengthening the power and flexibility of decision analysis. It helps discover more reasonable patterns and provides users with more support.

17.
Frequent sequential pattern mining has become one of the most important tasks in data mining. It has many applications, such as sequential analysis, classification, and prediction. How to generate candidates and how to control the combinatorially explosive number of intermediate subsequences are the most difficult problems. Intelligent systems such as recommender systems, expert systems, and business intelligence systems use only a few patterns, namely those that satisfy a number of defined conditions. Challenges include the mining of top-k patterns, top-rank-k patterns, closed patterns, and maximal patterns. In many cases, end users need to find itemsets that occur with a sequential pattern. Therefore, this paper proposes approaches for mining the top-k co-occurrence items usually found with a sequential pattern. The Naive Approach Mining (NAM) algorithm discovers top-k co-occurrence items by directly scanning the sequence database to determine the frequency of items. The Vertical Approach Mining (VAM) algorithm is based on vertical database scanning. The Vertical with Index Approach Mining (VIAM) algorithm is based on a vertical database with index scanning. VAM and VIAM use pruning strategies to reduce the search space, thus improving performance. VAM and VIAM are especially effective in mining the co-occurrence items of a long input pattern. The three algorithms were evaluated using real-world databases. The experimental results show that these algorithms perform well, especially VAM and VIAM.
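The Naive Approach Mining (NAM) strategy, a direct scan of the sequence database, can be sketched as follows (an illustrative reconstruction; restricting pattern elements to single items and the toy database are simplifying assumptions):

```python
from collections import Counter

def contains_pattern(seq, pattern):
    # True if `pattern` (a list of items) occurs as a subsequence of
    # `seq` (a list of itemsets), respecting order.
    i = 0
    for itemset in seq:
        if i < len(pattern) and pattern[i] in itemset:
            i += 1
    return i == len(pattern)

def top_k_cooccurrence(db, pattern, k):
    # NAM-style: scan every sequence; for those containing the pattern,
    # count the other items that appear alongside it.
    counts = Counter()
    for seq in db:
        if contains_pattern(seq, pattern):
            counts.update(set().union(*seq) - set(pattern))
    return counts.most_common(k)
```

VAM and VIAM reach the same answer without rescanning the whole database, by intersecting per-item vertical id-lists instead.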

18.
This paper investigates a number of computational intelligence techniques for the detection of heart disease. In particular, a comparison of six well-known classifiers is performed for the widely used Cleveland data. Further, this paper highlights the potential of an expert-judgment-based (i.e., medical-knowledge-driven) feature selection process (termed MFS), and compares it against the generally employed computational intelligence based feature selection mechanism. Also, this article recognizes that the publicly available Cleveland data becomes imbalanced when considering binary classification. Performance of classifiers, and also the potential of MFS, are investigated considering this imbalanced data issue. The experimental results demonstrate that the use of MFS noticeably improved the performance, especially in terms of accuracy, for most of the classifiers considered and for the majority of the datasets (generated by converting the Cleveland dataset for binary classification). MFS combined with the computerized feature selection process (CFS) has also been investigated and showed encouraging results, particularly for NaiveBayes, IBK and SMO. In summary, the medical knowledge based feature selection method has shown promise for use in heart disease diagnostics.

19.
It is challenging to use traditional data mining techniques to deal with real-time data stream classification. Existing mining classifiers need to be updated frequently to adapt to the changes in data streams. To address this issue, in this paper we propose an adaptive ensemble approach for classification and novel class detection in concept-drifting data streams. The proposed approach uses traditional mining classifiers and updates the ensemble model automatically so that it represents the most recent concepts in data streams. For novel class detection we consider the idea that data points belonging to the same class should be closer to each other and should be far apart from the data points belonging to other classes. If a data point is well separated from the existing data clusters, it is identified as a novel class instance. We tested the performance of this proposed stream classification model against that of existing mining algorithms using real benchmark datasets from the UCI (University of California, Irvine) machine learning repository. The experimental results show that our approach exhibits great flexibility and robustness in novel class detection in concept-drifting data streams and outperforms traditional classification models in challenging real-life data stream applications.

20.
Mining data streams is the process of extracting information from non-stopping, rapidly flowing data records to provide knowledge that is reliable and timely. Streaming data algorithms need to be one-pass and operate under strict limitations of memory and response time. In addition, the classification of streaming data requires learning in an environment where the data characteristics might change constantly. Many of the classification algorithms presented in the literature assume a 100% labeling rate, which is impractical and expensive when data records are rapidly flowing in. In this paper, a new incremental grid-density-based learning framework, the GC3 framework, is proposed to perform classification of streaming data with concept drift and limited labeling. The proposed framework uses grid density clustering to detect changes in the input data space. It maintains an evolving ensemble of classifiers to learn and adapt to the model changes over time. The framework also uses a uniform grid density sampling mechanism to obtain a uniform subset of samples for better classification performance with a lower labeling rate. The entire framework is designed to be one-pass, incremental and able to work with limited memory to perform any-time classification on demand. Experimental comparison with state-of-the-art concept drift handling systems demonstrates the GC3 framework's ability to provide high classification performance, using fewer models in the ensemble and with only 4-6% of the samples labeled. The results show that the GC3 framework is effective and attractive for use in real-world data stream classification applications.
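The uniform grid density sampling mechanism can be sketched as bucketing points into grid cells and capping the number drawn per cell (illustrative; the cell size and per-cell cap are assumptions, not GC3's parameters):

```python
import random

def grid_uniform_sample(points, cell_size, per_cell, seed=1):
    # Bucket points into grid cells, then draw at most `per_cell`
    # samples from each non-empty cell, so dense regions are not
    # over-represented in the labeled subset.
    rng = random.Random(seed)
    cells = {}
    for p in points:
        key = tuple(int(c // cell_size) for c in p)
        cells.setdefault(key, []).append(p)
    sample = []
    for pts in cells.values():
        sample.extend(rng.sample(pts, min(per_cell, len(pts))))
    return sample
```

A cluster of 50 points and a cluster of 3 points each contribute at most the cap, which is how the framework keeps the labeling budget low while still covering the whole input space.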
