期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Credit scoring for a microcredit data set using the synthetic minority oversampling technique and ensemble classifiers

Adaleta Gici&#x; Abdulhamit Subasi 《Expert Systems》2019,36(2)

Although microfinance organizations play an important role in developing economies, decision support models for microfinance credit scoring have not been sufficiently covered in the literature, particularly for microcredit enterprises. The aim of this paper is to create a three‐class model that can improve credit risk assessment in the microfinance context. The real‐world microcredit data set used in this study includes data from retail, micro, and small enterprises. To the best of the authors' knowledge, existing research on microfinance credit scoring has been limited to regression and genetic algorithms, thereby excluding novel machine learning algorithms. The aim of this research is to close this gap. The proposed models predict default events by analysing different ensemble classification methods that empower the effects of the synthetic minority oversampling technique (SMOTE) used in the preprocessing of the imbalanced microcredit data set. Initial results have shown improvement in the prediction results for certain classes when the oversampling technique with homogeneous and heterogeneous ensemble classifier methods was applied. A prediction improvement for all classes was achieved via application of SMOTE and the Consolidated Trees Construction algorithm together with Rotation Forest. To obtain a complete view of all aspects, an additional set of metrics is used in the evaluation of performance. 相似文献

2.

Learning accurate very fast decision trees from uncertain data streams

Chunquan Liang Peng Shi Zhengguo Hu 《International journal of systems science》2013,44(16):3032-3050

Most existing works on data stream classification assume the streaming data is precise and definite. Such assumption, however, does not always hold in practice, since data uncertainty is ubiquitous in data stream applications due to imprecise measurement, missing values, privacy protection, etc. The goal of this paper is to learn accurate decision tree models from uncertain data streams for classification analysis. On the basis of very fast decision tree (VFDT) algorithms, we proposed an algorithm for constructing an uncertain VFDT tree with classifiers at tree leaves (uVFDTc). The uVFDTc algorithm can exploit uncertain information effectively and efficiently in both the learning and the classification phases. In the learning phase, it uses Hoeffding bound theory to learn from uncertain data streams and yield fast and reasonable decision trees. In the classification phase, at tree leaves it uses uncertain naive Bayes (UNB) classifiers to improve the classification performance. Experimental results on both synthetic and real-life datasets demonstrate the strong ability of uVFDTc to classify uncertain data streams. The use of UNB at tree leaves has improved the performance of uVFDTc, especially the any-time property, the benefit of exploiting uncertain information, and the robustness against uncertainty. 相似文献

3.

Mining evolving data streams for frequent patterns

Pierre-Alain Laur Jean-Emile Symphor 《Pattern recognition》2007,40(2):492-503

A data stream is a potentially uninterrupted flow of data. Mining this flow makes it necessary to cope with uncertainty, as only a part of the stream can be stored. In this paper, we evaluate a statistical technique which biases the estimation of the support of patterns, so as to maximize either the precision or the recall, as chosen by the user, and limit the degradation of the other criterion. Theoretical results show that the technique is not far from the optimum, from the statistical standpoint. Experiments performed tend to demonstrate its potential, as it remains robust even under significant distribution drifts. 相似文献

4.

面向不均衡数据的融合谱聚类的自适应过采样法

下载免费PDF全文

刘金平周嘉铭贺俊宾唐朝晖徐鹏飞张国勇《智能系统学报》2020,15(4):732-739

分类是模式识别领域中的研究热点,大多数经典的分类器往往默认数据集是分布均衡的,而现实中的数据集往往存在类别不均衡问题,即属于正常/多数类别的数据的数量与属于异常/少数类数据的数量之间的差异很大。若不对数据进行处理往往会导致分类器忽略少数类、偏向多数类,使得分类结果恶化。针对数据的不均衡分布问题,本文提出一种融合谱聚类的综合采样算法。首先采用谱聚类方法对不均衡数据集的少数类样本的分布信息进行分析,再基于分布信息对少数类样本进行过采样,获得相对均衡的样本,用于分类模型训练。在多个不均衡数据集上进行了大量实验,结果表明,所提方法能有效解决数据的不均衡问题,使得分类器对于少数类样本的分类精度得到提升。相似文献

5.

基于SMOTE和深度信念网络的异常检测

沈学利覃淑娟《计算机应用》2018,38(7):1941-1945

针对现有海量非平衡数据集中少数类别样本入侵检测率低的问题,提出了一种基于合成少数类过采样技术（SMOTE）和深度信念网络（DBN）的异常检测（SMOTE-DBN）方法。首先,用SMOTE技术增加了少数类别样本的样本数;然后在预处理后的较平衡数据集上,用非监督的受限玻尔兹曼机（RBM）对预处理后的高维数据进行特征降维;其次,用反向传播（BP）算法微调模型参数,获得预处理后数据的较优低维表示;最后通过softmax分类器对较优低维数据进行分类。KDD1999数据集仿真实验表明,SMOTE优化处理能够提高模型对少数类别样本的检测率,在相同数据集上,SMOTE-DBN方法与DBN方法、支持向量机（SVM）方法相比,检测率分别提高了3.31个百分点和7.34个百分点,误报率分别降低了1.11个百分点和2.67个百分点。相似文献

6.

Reservoir of diverse adaptive learners and stacking fast hoeffding drift detection methods for evolving data streams

Ali Pesaranghader Herna Viktor Eric Paquet 《Machine Learning》2018,107(11):1711-1743

The last decade has seen a surge of interest in adaptive learning algorithms for data stream classification, with applications ranging from predicting ozone level peaks, learning stock market indicators, to detecting computer security violations. In addition, a number of methods have been developed to detect concept drifts in these streams. Consider a scenario where we have a number of classifiers with diverse learning styles and different drift detectors. Intuitively, the current ‘best’ (classifier, detector) pair is application dependent and may change as a result of the stream evolution. Our research builds on this observation. We introduce the Tornado framework that implements a reservoir of diverse classifiers, together with a variety of drift detection algorithms. In our framework, all (classifier, detector) pairs proceed, in parallel, to construct models against the evolving data streams. At any point in time, we select the pair which currently yields the best performance. To this end, we introduce the CAR measure, which is employed to balance classification, adaptation and resource utilization requirements. We further incorporate two novel stacking-based drift detection methods, namely the FHDDMS and \(\hbox {FHDDMS}_{\mathrm{add}}\) approaches. The experimental evaluation confirms that the current ‘best’ (classifier, detector) pair is not only heavily dependent on the characteristics of the stream, but also that this selection evolves as the stream flows. Further, our FHDDMS variants detect concept drifts accurately in a timely fashion while outperforming the state-of-the-art. 相似文献

7.

Scalable and efficient multi-label classification for evolving data streams

Jesse Read Albert Bifet Geoff Holmes Bernhard Pfahringer 《Machine Learning》2012,88(1-2):243-272

Many challenging real world problems involve multi-label data streams. Efficient methods exist for multi-label classification in non-streaming scenarios. However, learning in evolving streaming scenarios is more challenging, as classifiers must be able to deal with huge numbers of examples and to adapt to change using limited time and memory while being ready to predict at any point. This paper proposes a new experimental framework for learning and evaluating on multi-label data streams, and uses it to study the performance of various methods. From this study, we develop a multi-label Hoeffding tree with multi-label classifiers at the leaves. We show empirically that this method is well suited to this challenging task. Using our new framework, which allows us to generate realistic multi-label data streams with concept drift (as well as real data), we compare with a selection of baseline methods, as well as new learning methods from the literature, and show that our Hoeffding tree method achieves fast and more accurate performance. 相似文献

8.

Synthetic minority oversampling for function approximation problems

Lourdes Pelayo Scott Dick 《国际智能系统杂志》2019,34(11):2741-2768

Imbalanced data sets are a common occurrence in important machine learning problems. Research in improving learning under imbalanced conditions has largely focused on classification problems (ie, problems with a categorical dependent variable). However, imbalanced data also occur in function approximation, and far less attention has been paid to this case. We present a novel stratification approach for imbalanced function approximation problems. Our solution extends the SMOTE oversampling preprocessing technique to continuous-valued dependent variables by identifying regions of the feature space with a low density of examples and high variance in the dependent variable. Synthetic examples are then generated between nearest neighbors in these regions. In an empirical validation, our approach reduces the normalized mean-squared prediction error in 18 out of 21 benchmark data sets, and compares favorably with state-of-the-art approaches. 相似文献

9.

A framework for on-demand classification of evolving data streams 总被引：4，自引：0，他引：4

Aggarwal C.C. Han J. Jianyong Wang Yu P.S. 《Knowledge and Data Engineering, IEEE Transactions on》2006,18(5):577-589

Current models of the classification problem do not effectively handle bursts of particular classes coming in at different times. In fact, the current model of the classification problem simply concentrates on methods for one-pass classification modeling of very large data sets. Our model for data stream classification views the data stream classification problem from the point of view of a dynamic approach in which simultaneous training and test streams are used for dynamic classification of data sets. This model reflects real-life situations effectively, since it is desirable to classify test streams in real time over an evolving training and test stream. The aim here is to create a classification system in which the training model can adapt quickly to the changes of the underlying data stream. In order to achieve this goal, we propose an on-demand classification process which can dynamically select the appropriate window of past training data to build the classifier. The empirical results indicate that the system maintains an high classification accuracy in an evolving data stream, while providing an efficient solution to the classification task. 相似文献

10.

演化数据流上的连续异常检测

胡雪艳苏亮高春鸣《计算机工程与应用》2008,44(7):174-178

基于滑动窗口的异常检测是数据流挖掘研究的一个重要课题,在许多应用中数据流通常在一个分布网络上传输,解决这类问题时常采用分布计算技术,以便获得实时高质量的计算结果。对分布演化数据流上连续异常检测问题,进行形式化地阐述,提出了两个基于核密度估计的异常检测定义和算法,并通过大量真实数据集的实验,表明该算法具有良好的高效性和可扩展性,完全适应数据流应用的需求。相似文献

11.

Learning model trees from evolving data streams 总被引：2，自引：0，他引：2

Elena Ikonomovska João Gama Sašo Džeroski 《Data mining and knowledge discovery》2011,23(1):128-168

The problem of real-time extraction of meaningful patterns from time-changing data streams is of increasing importance for the machine learning and data mining communities. Regression in time-changing data streams is a relatively unexplored topic, despite the apparent applications. This paper proposes an efficient and incremental stream mining algorithm which is able to learn regression and model trees from possibly unbounded, high-speed and time-changing data streams. The algorithm is evaluated extensively in a variety of settings involving artificial and real data. To the best of our knowledge there is no other general purpose algorithm for incremental learning regression/model trees able to perform explicit change detection and informed adaptation. The algorithm performs online and in real-time, observes each example only once at the speed of arrival, and maintains at any-time a ready-to-use model tree. The tree leaves contain linear models induced online from the examples assigned to them, a process with low complexity. The algorithm has mechanisms for drift detection and model adaptation, which enable it to maintain accurate and updated regression models at any time. The drift detection mechanism exploits the structure of the tree in the process of local change detection. As a response to local drift, the algorithm is able to update the tree structure only locally. This approach improves the any-time performance and greatly reduces the costs of adaptation. 相似文献

12.

On change diagnosis in evolving data streams 总被引：1，自引：0，他引：1

Aggarwal C.C. 《Knowledge and Data Engineering, IEEE Transactions on》2005,17(5):587-600

In recent years, the progress in hardware technology has made it possible for organizations to store and record large streams of transactional data. This results in databases which grow without limit at a rapid rate. This data can often show important changes in trends over time. In such cases, it is useful to understand, visualize, and diagnose the evolution of these trends. In this paper, we introduce the concept of velocity density estimation, a technique used to understand, visualize, and determine trends in the evolution of fast data streams. We show how to use velocity density estimation in order to create both temporal velocity profiles and spatial velocity profiles at periodic instants in time. These profiles are then used in order to predict three kinds of data evolution: dissolution, coagulation, and shift. Methods are proposed to visualize the changing data trends in a single online scan of the data stream and a computational requirement which is linear in the number of data points. The visualization techniques can also be used to provide online animations which show the changes in the data characteristics while they occur. In addition, batch processing techniques are proposed in order to quantify the level of change across different combinations of dimensions. This quantification is then used in order to determine dimensional combinations with significant evolution. The techniques discussed in this paper can be easily extended to spatiotemporal data, changes in data snapshots at fixed instances in time, or any other data which has a temporal component during its evolution. 相似文献

13.

Improving semi-supervised co-forest algorithm in evolving data streams

Yi Wang Tao Li 《Applied Intelligence》2018,48(10):3248-3262

Semi-supervised learning, which uses a large amount of unlabeled data to improve the performance of a classifier when only a limited amount of labeled data is available, has become a hot topic in machine learning research recently. In this paper, we propose a semi-supervised ensemble of classifiers approach, for learning in time-varying data streams. This algorithm maintains all the desirable properties of the semi-supervised Co-trained random FOREST algorithm (Co-Forest) and extends it into evolving data streams. It assigns a weight to each example according to Poisson(1) to simulate the bootstrap sample method in data streams, which is used to keep the diversity of Random Forest. By utilizing incremental learning technology, it avoids unnecessary repetition training and improves the accuracy of base models. In addition, the ADaptive WINdowing (ADWIN2) is introduced to deal with concept drift, which makes it adapt to the varying environment. Empirical evaluation on both synthetic data and UCI data reveals that our proposed method outperforms state-of-the-art semi-supervised and supervised methods in time-varying data streams, and also achieves relatively high performance in stationary streams. 相似文献

14.

Statistical hierarchical clustering algorithm for outlier detection in evolving data streams

Krleža Dalibor Vrdoljak Boris Brčić Mario 《Machine Learning》2021,110(1):139-184

Machine Learning - Anomaly detection is a hard data analysis process that requires constant creation and improvement of data analysis algorithms. Using traditional clustering algorithms to analyse... 相似文献

15.

Very fast decision rules for classification in data streams

Petr Kosina João Gama 《Data mining and knowledge discovery》2015,29(1):168-202

相似文献

16.

Tracking clusters in evolving data streams over sliding windows 总被引：6，自引：4，他引：2

Aoying Zhou Feng Cao Weining Qian Cheqing Jin 《Knowledge and Information Systems》2008,15(2):181-214

Mining data streams poses great challenges due to the limited memory availability and real-time query response requirement. Clustering an evolving data stream is especially interesting because it captures not only the changing distribution of clusters but also the evolving behaviors of individual clusters. In this paper, we present a novel method for tracking the evolution of clusters over sliding windows. In our SWClustering algorithm, we combine the exponential histogram with the temporal cluster features, propose a novel data structure, the Exponential Histogram of Cluster Features (EHCF). The exponential histogram is used to handle the in-cluster evolution, and the temporal cluster features represent the change of the cluster distribution. Our approach has several advantages over existing methods: (1) the quality of the clusters is improved because the EHCF captures the distribution of recent records precisely; (2) compared with previous methods, the mechanism employed to adaptively maintain the in-cluster synopsis can track the cluster evolution better, while consuming much less memory; (3) the EHCF provides a flexible framework for analyzing the cluster evolution and tracking a specific cluster efficiently without interfering with other clusters, thus reducing the consumption of computing resources for data stream clustering. Both the theoretical analysis and extensive experiments show the effectiveness and efficiency of the proposed method. Aoying Zhou is currently a Professor in Computer Science at Fudan University, Shanghai, P.R. China. He won his Bachelor and Master degrees in Computer Science from Sichuan University in Chengdu, Sichuan, P.R. China in 1985 and 1988, respectively, and Ph.D. degree from Fudan University in 1993. He served as the member or chair of program committee for many international conferences such as WWW, SIGMOD, VLDB, EDBT, ICDCS, ER, DASFAA, PAKDD, WAIM, and etc. His papers have been published in ACM SIGMOD, VLDB, ICDE, and several other international journals. His research interests include Data mining and knowledge discovery, XML data management, Web mining and searching, data stream analysis and processing, peer-to-peer computing. Feng Cao is currently an R&D engineer in IBM China Research Laboratories. He received a B.E. degree from Xi'an Jiao Tong University, Xi'an, P.R. China, in 2000 and an M.E. degree from Huazhong University of Science and Technology, Wuhan, P.R. China, in 2003. From October 2004 to March 2005, he worked in Fudan-NUS Competency Center for Peer-to-Peer Computing, Singapore. In 2006, he received his Ph.D. degree from Fudan University, Shanghai, P.R. China. His current research interests include data mining and data stream. Weining Qian is currently an Assistant Professor in computer science at Fudan University, Shanghai, P.R. China. He received his M.S. and Ph.D. degree in computer science from Fudan University in 2001 and 2004, respectively. He is supported by Shanghai Rising-Star Program under Grant No. 04QMX1404 and National Natural Science Foundation of China (NSFC) under Grant No. 60673134. He served as the program committee member of several international conferences, including DASFAA 2006, 2007 and 2008, APWeb/WAIM 2007, INFOSCALE 2007, and ECDM 2007. His papers have been published in ICDE, SIAM DM, and CIKM. His research interests include data stream query processing and mining, and large-scale distributed computing for database applications. Cheqing Jin is currently an Assistant Professor in Computer Science at East China University of Science and Technology. He received his Bachelor and Master degrees in Computer Science from Zhejiang University in Hangzhou, P.R. China in 1999 and 2002, respectively, and the Ph.D. degree from Fudan University, Shanghai, P.R. China. He worked as a Research Assistant at E-business Technology Institute, the Hong Kong University from December 2003 to May 2004. His current research interests include data mining and data stream. 相似文献

17.

An adaptive algorithm for anomaly and novelty detection in evolving data streams

Mohamed-Rafik Bouguelia Slawomir Nowaczyk Amir H. Payberah 《Data mining and knowledge discovery》2018,32(6):1597-1633

In the era of big data, considerable research focus is being put on designing efficient algorithms capable of learning and extracting high-level knowledge from ubiquitous data streams in an online fashion. While, most existing algorithms assume that data samples are drawn from a stationary distribution, several complex environments deal with data streams that are subject to change over time. Taking this aspect into consideration is an important step towards building truly aware and intelligent systems. In this paper, we propose GNG-A, an adaptive method for incremental unsupervised learning from evolving data streams experiencing various types of change. The proposed method maintains a continuously updated network (graph) of neurons by extending the Growing Neural Gas algorithm with three complementary mechanisms, allowing it to closely track both gradual and sudden changes in the data distribution. First, an adaptation mechanism handles local changes where the distribution is only non-stationary in some regions of the feature space. Second, an adaptive forgetting mechanism identifies and removes neurons that become irrelevant due to the evolving nature of the stream. Finally, a probabilistic evolution mechanism creates new neurons when there is a need to represent data in new regions of the feature space. The proposed method is demonstrated for anomaly and novelty detection in non-stationary environments. Results show that the method handles different data distributions and efficiently reacts to various types of change. 相似文献

18.

Efficient unsupervised drift detector for fast and high-dimensional data streams

Souza Vinicius M. A. Parmezan Antonio R. S. Chowdhury Farhan A. Mueen Abdullah 《Knowledge and Information Systems》2021,63(6):1497-1527

相似文献

19.

Exploiting punctuation semantics in continuous data streams 总被引：4，自引：0，他引：4

Tucker P.A. Maier D. Sheard T. Fegaras L. 《Knowledge and Data Engineering, IEEE Transactions on》2003,15(3):555-568

As most current query processing architectures are already pipelined, it seems logical to apply them to data streams. However, two classes of query operators are impractical for processing long or infinite data streams. Unbounded stateful operators maintain state with no upper bound in size and, so, run out of memory. Blocking operators read an entire input before emitting a single output and, so, might never produce a result. We believe that a priori knowledge of a data stream can permit the use of such operators in some cases. We discuss a kind of stream semantics called punctuated streams. Punctuations in a stream mark the end of substreams allowing us to view an infinite stream as a mixture of finite streams. We introduce three kinds of invariants to specify the proper behavior of operators in the presence of punctuation. Pass invariants define when results can be passed on. Keep invariants define what must be kept in local state to continue successful operation. Propagation invariants define when punctuation can be passed on. We report on our initial implementation and show a strategy for proving implementations of these invariants are faithful to their relational counterparts. 相似文献

20.

Learning from evolving data streams through ensembles of random patches

Gomes Heitor Murilo Read Jesse Bifet Albert Durrant Robert J. 《Knowledge and Information Systems》2021,63(7):1597-1625

Knowledge and Information Systems - Ensemble methods represent an effective way to solve supervised learning problems. Such methods are prevalent for learning from evolving data streams. One of the... 相似文献