1.

Boosting support vector machines for imbalanced data sets (total citations: 2; self-citations: 2; citations by others: 0)

Benjamin X. Wang Nathalie Japkowicz 《Knowledge and Information Systems》2010,25(1):1-20

Real-world data mining applications must address the issue of learning from imbalanced data sets. The problem occurs when the instances of one class greatly outnumber those of the other class. Such data sets often cause a default classifier to be built due to skewed vector spaces or lack of information. Common approaches for dealing with the class imbalance problem involve modifying the data distribution or modifying the classifier. In this work, we choose to use a combination of both approaches. We use support vector machines with soft margins as the base classifier to solve the skewed vector spaces problem. We then counter the excessive bias introduced by this approach with a boosting algorithm. We found that this ensemble of SVMs makes an impressive improvement in prediction performance, not only for the majority class, but also for the minority class.
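The "default classifier" problem mentioned in this abstract can be illustrated with a small sketch. The 95:5 class split and the helper names below are illustrative assumptions, not taken from the paper. The sketch compares plain accuracy with per-class recall for a classifier that always predicts the majority class.

```python
from math import sqrt

def recall(y_true, y_pred, label):
    """Fraction of instances of `label` that were predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(1 for p in relevant if p == label) / len(relevant)

def evaluate(y_true, y_pred):
    """Return accuracy, per-class recall, and the geometric mean of the recalls."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    rec_major = recall(y_true, y_pred, 0)
    rec_minor = recall(y_true, y_pred, 1)
    return {
        "accuracy": accuracy,
        "recall_majority": rec_major,
        "recall_minority": rec_minor,
        "g_mean": sqrt(rec_major * rec_minor),
    }

# Hypothetical data: 95 majority instances (label 0), 5 minority instances (label 1).
y_true = [0] * 95 + [1] * 5
# A default classifier ignores its input and always predicts the majority class.
default_pred = [0] * 100

scores = evaluate(y_true, default_pred)
```

Here accuracy looks high while minority recall is zero, which is why accuracy alone is a poor guide on skewed data and why per-class measures are usually reported alongside it.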

2.

Node similarity in the citation graph

Wangzhong Lu J. Janssen E. Milios N. Japkowicz Yongzheng Zhang 《Knowledge and Information Systems》2007,11(1):105-129

Published scientific articles are linked together into a graph, the citation graph, through their citations. This paper explores the notion of similarity based on connectivity alone, and proposes several algorithms to quantify it. Our metrics take advantage of the local neighborhoods of the nodes in the citation graph. Two variants of link-based similarity estimation between two nodes are described, one based on the separate local neighborhoods of the nodes, and another based on the joint local neighborhood expanded from both nodes at the same time. The algorithms are implemented and evaluated on a subgraph of the citation graph of computer science in a retrieval context. The results are compared with text-based similarity, and demonstrate the complementarity of link-based and text-based retrieval.

Wangzhong Lu holds a Bachelor's degree from Hefei University of Technology (1993), and a Master's degree from Dalhousie University (2001), both in computer science. From 1993 to 1999 he worked as a developer with China National Computer Software and Technical Service Corp. in Beijing. From 2001 to 2005 he held industrial positions as a senior software architect in Atlantic Canada. He is currently with DST Systems, Charlotte, NC, as a senior data architect.

Jeannette Janssen's research area is applied graph theory. She has worked on the problem of frequency assignment in cellular and digital broadcasting networks. Her current interest is in graph theory applied to the World Wide Web and other networked information spaces. Dr. Janssen did her Master's studies at Eindhoven University of Technology in the Netherlands, and her doctorate at Lehigh University, USA. She is currently an associate professor at Dalhousie University, Canada.

Evangelos Milios received a diploma in electrical engineering from the National Technical University of Athens, and Master's and Ph.D. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology. He held faculty positions at the University of Toronto and York University. He is currently a professor of computer science at Dalhousie University, Canada, where he was Director of the Graduate Program. He has served on the committees of the ACM Dissertation Award, and the AAAI/SIGART Doctoral Consortium. He has worked on the interpretation of visual and range signals for landmark-based positioning, navigation and map construction in single- and multi-agent robotics. His current research activity is centered on Networked Information Spaces, Web information retrieval, and aquatic robotics. He is a senior member of the IEEE.

Nathalie Japkowicz is an associate professor at the School of Information Technology and Engineering of the University of Ottawa. She obtained her Ph.D. from Rutgers University, her M.Sc. from the University of Toronto, and her B.Sc. from McGill University. Prior to joining the University of Ottawa, she taught at Ohio State University and Dalhousie University. Her area of specialization is Machine Learning and her most recent research interests focused on the class imbalance problem. She made over 50 contributions in the form of journal articles, conference articles, workshop articles, magazine articles, technical reports or edited volumes.

Yongzheng Zhang obtained a B.E. in computer applications from Southeast University, China, in 1997 and a M.S. in computer science from Dalhousie University in 2002. From 1997 to 1999 he was an instructor and undergraduate advisor at Southeast University. He also worked as a software engineer in Ricom Information and Telecommunications Co. Ltd., China. He is currently a Ph.D. candidate at Dalhousie University. His research interests are in the areas of Information Retrieval, Machine Learning, Natural Language Processing, and Web Mining, particularly centered on Web Document Summarization. A paper based on his Master's thesis received the best paper award at the 2003 Canadian Artificial Intelligence conference.
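The idea of comparing two papers through their separate local neighborhoods in the citation graph can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the paper's actual metrics, so Jaccard overlap, the choice to ignore citation direction, and all function names here are hypothetical stand-ins.

```python
from collections import deque

def undirected(citations):
    """Turn a {paper: [cited papers]} mapping into an undirected adjacency map."""
    graph = {}
    for paper, cited in citations.items():
        graph.setdefault(paper, set())
        for other in cited:
            graph[paper].add(other)
            graph.setdefault(other, set()).add(paper)
    return graph

def neighborhood(graph, start, radius):
    """Nodes within `radius` hops of `start` (breadth-first search)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == radius:
            continue  # do not expand beyond the requested radius
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    return seen

def neighborhood_similarity(graph, a, b, radius=1):
    """Jaccard overlap of the two separate local neighborhoods (an assumption)."""
    na = neighborhood(graph, a, radius)
    nb = neighborhood(graph, b, radius)
    return len(na & nb) / len(na | nb)

# Hypothetical toy graph: A and B cite the same two papers, C cites something else.
citations = {"A": ["X", "Y"], "B": ["X", "Y"], "C": ["Z"]}
graph = undirected(citations)
sim_ab = neighborhood_similarity(graph, "A", "B")
sim_ac = neighborhood_similarity(graph, "A", "C")
```

Papers A and B never cite each other, yet they score as similar because their neighborhoods overlap, which is the kind of connectivity-only signal the abstract describes.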

3.

Warning: statistical benchmarking is addictive. Kicking the habit in machine learning

Chris Drummond Nathalie Japkowicz 《Journal of Experimental & Theoretical Artificial Intelligence》2013,25(1):67-80

Algorithm performance evaluation is so entrenched in the machine learning community that one could call it an addiction. Like most addictions, it is harmful and very difficult to give up. It is harmful because it has serious limitations. Yet, we have great faith in practicing it in a ritualistic manner: we follow a fixed set of rules telling us the measure, the data sets and the statistical test to use. When we read a paper, even as reviewers, we are not sufficiently critical of results that follow these rules. Here, we debate what the limitations are and how best to address them. This article may not cure the addiction, but hopefully it will be a good first step along that road.

4.

Threaded ensembles of autoencoders for stream learning

Yue Dong Nathalie Japkowicz 《Computational Intelligence》2018,34(1):261-281

Anomaly detection in streaming data is an important problem in numerous application domains. Most existing model‐based approaches to stream learning are based on decision trees due to their fast construction speed. This paper introduces streaming autoencoder (SA), a fast and novel anomaly detection algorithm based on ensembles of neural networks for evolving data streams. It is a one‐class learner, which only requires data from the positive class for training and is accurate even when anomalous training data are rare. It features an ensemble of threaded autoencoders with continuous learning capacity. Furthermore, the SA uses a 2‐step detection mechanism to ensure that real anomalies are detected with low false‐positive rates. The method is highly efficient because it processes data streams in parallel with multiple threads and alternating buffers. Our analysis shows that SA has a linear runtime and requires constant memory space. Empirical comparisons to the state‐of‐the‐art methods on multiple benchmark data sets demonstrate that the proposed method detects anomalies efficiently with fewer false alarms.

5.

Adaptive learning on mobile network traffic data

Zhen Liu Nathalie Japkowicz Deyu Tang 《Connection Science》2019,31(2):185-214

Machine learning based mobile traffic classification has become a popular topic in recent years. As mobile traffic data is dynamic in nature, a static model becomes ineffective for the task of classifying future traffic. This is known as the concept drift problem in data streams. To address it, this paper presents an adaptive mobile traffic classification method. Specifically, a method based on the fuzzy competence model is devised to detect concept drift, and a dynamic learning method is presented to update the classification model at an appropriate time, so as to adapt to an ever-changing environment. The concept drift detection method relies on the data distribution instead of the classification error rate. Furthermore, the weights of flow samples are dynamically updated and flow samples are resampled for training a new model when a concept drift is detected. Moreover, recently trained models are saved and used for classification in weighted voting. The weight of each model is updated according to the performance it obtains on the most recent flow samples. On mobile traffic data, experimental results show that our proposed method obtains a lower classification error rate with less time spent on updating models, as compared to related methods designed for handling concept drift problems.
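The weighted-voting step described in this abstract can be sketched as follows. This is a minimal illustration, assuming accuracy on a recent window as the performance measure. The threshold-rule models, the feature values and the labels are hypothetical, and neither the fuzzy competence drift detector nor the sample reweighting is shown.

```python
from collections import defaultdict

def update_weights(models, recent_samples):
    """Weight each saved model by its accuracy on the most recent labelled samples."""
    weights = []
    for model in models:
        correct = sum(1 for x, y in recent_samples if model(x) == y)
        weights.append(correct / len(recent_samples))
    return weights

def weighted_vote(models, weights, x):
    """Return the label with the largest total weight among the models' votes."""
    totals = defaultdict(float)
    for model, weight in zip(models, weights):
        totals[model(x)] += weight
    return max(totals, key=totals.get)

# Hypothetical saved models: simple threshold rules on one flow feature.
def old_model(x):
    return "video" if x > 10 else "chat"

def new_model(x):
    return "video" if x > 5 else "chat"

def stale_model(x):
    return "chat"

models = [old_model, new_model, stale_model]
# Hypothetical recent labelled flows: (feature value, true label).
recent = [(6, "video"), (7, "video"), (3, "chat"), (12, "video")]

weights = update_weights(models, recent)
label = weighted_vote(models, weights, 8)
```

The model that fits the recent flows best dominates the vote, so the ensemble follows the drift without discarding older models outright.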

6.

Parallelizing Feature Selection

Jerffeson Teixeira de Souza Stan Matwin Nathalie Japkowicz 《Algorithmica》2006,45(3):433-456

Classification is a key problem in machine learning/data mining. Algorithms for classification have the ability to predict the class of a new instance after having been trained on data representing past experience in classifying instances. However, the presence of a large number of features in training data can hurt the classification capacity of a machine learning algorithm. The Feature Selection problem involves discovering a subset of features such that a classifier built only with this subset would attain predictive accuracy no worse than a classifier built from the entire set of features. Several algorithms have been proposed to solve this problem. In this paper we discuss how parallelism can be used to improve the performance of feature selection algorithms. In particular, we present, discuss and evaluate a coarse-grained parallel version of the feature selection algorithm FortalFS. This algorithm performs well compared with other solutions and it has certain characteristics that make it a good candidate for parallelization. Our parallel design is based on the master-slave design pattern. Promising results show that this approach is able to achieve near-optimum speedups in the context of Amdahl's Law.
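For readers unfamiliar with the bound this abstract refers to, Amdahl's Law gives the best possible speedup when only part of a program can be parallelized. The sketch below states the standard formula. The function name and the example fractions are illustrative and are not results from the paper.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when `parallel_fraction` of the work scales with `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

# Hypothetical example: 90% of the run time is parallelizable.
speedup_4 = amdahl_speedup(0.9, 4)
speedup_many = amdahl_speedup(0.9, 10**9)
```

With 90% of the work parallelizable, four workers give a bound of roughly 3.08, and no number of workers can exceed a factor of 10, since the serial 10% always remains.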

7.

Supervised Versus Unsupervised Binary-Learning by Feedforward Neural Networks

Japkowicz Nathalie 《Machine Learning》2001,42(1-2):97-122

Binary classification is typically achieved by supervised learning methods. Nevertheless, it is also possible using unsupervised schemes. This paper describes a connectionist unsupervised approach to binary classification and compares its performance to that of its supervised counterpart. The approach consists of training an autoassociator to reconstruct the positive class of a domain at the output layer. After training, the autoassociator is used for classification, relying on the idea that if the network generalizes to a novel instance, then this instance must be positive, but that if generalization fails, then the instance must be negative. When tested on three real-world domains, the autoassociator proved more accurate at classification than its supervised counterpart, MLP, on two of these domains and as accurate on the third (Japkowicz, Myers, & Gluck, 1995). The paper seeks to generalize these results and concludes that, in addition to learning a concept in the absence of negative examples, 1) autoassociation is more efficient than MLP in multi-modal domains, and 2) it is more accurate than MLP in multi-modal domains for which the negative class creates a particularly strong need for specialization or the positive class creates a particularly weak need for specialization. In multi-modal domains for which the positive class creates a particularly strong need for specialization, on the other hand, MLP is more accurate than autoassociation.
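The decision rule described in this abstract, where an instance is labelled positive if the autoassociator reconstructs it well and negative otherwise, can be sketched as follows. The `toy_reconstruct` function is a hypothetical stand-in for a trained network (it simply projects a 2-D point onto the line y = x), and the threshold value is an arbitrary assumption.

```python
from math import sqrt

def reconstruction_error(x, reconstruct):
    """Euclidean distance between an instance and its reconstruction."""
    x_hat = reconstruct(x)
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))

def classify(x, reconstruct, threshold):
    """Positive if the model reconstructs x well, negative otherwise."""
    error = reconstruction_error(x, reconstruct)
    return "positive" if error <= threshold else "negative"

def toy_reconstruct(x):
    """Stand-in for a trained autoassociator (assumption): project onto y = x."""
    mean = (x[0] + x[1]) / 2
    return (mean, mean)

# Hypothetical positive class: points lying close to the line y = x.
near = classify((2.0, 2.1), toy_reconstruct, threshold=0.2)
far = classify((2.0, -1.0), toy_reconstruct, threshold=0.2)
```

Only positive examples are needed to fit the reconstruction model. The threshold is the one place where the boundary between the two classes is set.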

8.

Unknown malcode detection and the imbalance problem

Robert Moskovitch Dima Stopel Clint Feher Nir Nissim Nathalie Japkowicz Yuval Elovici 《Journal in Computer Virology》2009,5(4):295-308

The recent growth in network usage has motivated the creation of new malicious code for various purposes. Today's signature-based antiviruses are very accurate for known malicious code, but cannot detect new malicious code. Recently, classification algorithms were used successfully for the detection of unknown malicious code. However, these studies involved a test collection of limited size and the same malicious-to-benign file ratio in both the training and test sets, a situation which does not reflect real-life conditions. We present a methodology for the detection of unknown malicious code, which examines concepts from text categorization, based on n-gram extraction from the binary code and feature selection. We performed an extensive evaluation, consisting of a test collection of more than 30,000 files, in which we investigated the class imbalance problem. In real-life scenarios, the malicious file content is expected to be low, about 10% of the total files. For practical purposes, it is unclear what the corresponding percentage in the training set should be. Our results indicate that greater than 95% accuracy can be achieved through the use of a training set that has a malicious file content of less than 33.3%.
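The n-gram extraction step mentioned in this abstract can be sketched as follows. This is a minimal illustration, assuming overlapping byte-level n-grams. The sample bytes and function names are hypothetical, and the paper's feature selection step is not shown.

```python
from collections import Counter

def byte_ngrams(data, n):
    """Count every overlapping n-byte sequence in `data`."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def top_ngrams(data, n, k):
    """The k most frequent n-grams, usable as candidate features."""
    return [gram for gram, _ in byte_ngrams(data, n).most_common(k)]

# Hypothetical 7-byte fragment of a binary file.
sample = bytes.fromhex("90 90 90 e8 00 90 90")
counts = byte_ngrams(sample, 2)
most_common = top_ngrams(sample, 2, 1)
```

Each file then becomes a vector of n-gram frequencies, much like a text document represented by its word counts, which is what lets text categorization methods apply.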

9.

Special issue on discovery science

Nathalie Japkowicz Stan Matwin 《Machine Learning》2017,106(6):741-743

10.

Learning over subconcepts: Strategies for 1‐class classification

Shiven Sharma Anil Somayaji Nathalie Japkowicz 《Computational Intelligence》2018,34(2):440-467

In machine learning research and application, multiclass classification algorithms reign supreme. Their fundamental property is the reliance on the availability of data from all known categories to induce effective classifiers. Unfortunately, data from so‐called real‐world domains sometimes do not satisfy this property, and researchers use methods such as sampling to make the data more conducive to classification. However, there are scenarios in which even such explicit methods to rectify distributions fail. In such cases, 1‐class classification algorithms become the practical alternative. Unfortunately, domain complexity severely impacts their ability to produce effective classifiers. The work in this article addresses this issue and develops a strategy that allows for 1‐class classification over complex domains. In particular, we introduce the notion of learning along the lines of underlying domain concepts; an important source of complexity in domains is the presence of subconcepts, and by learning over them explicitly rather than on the entire domain as a whole, we can produce powerful 1‐class classification systems. The level of knowledge regarding these subconcepts will naturally vary by domain, and thus, we develop 3 distinct methodologies that take the amount of domain knowledge available into account. We demonstrate these over 3 real‐world domains.
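The benefit of learning over subconcepts rather than over the whole domain can be illustrated with a deliberately simple sketch. The bounding-sphere model below is a hypothetical stand-in for any 1-class learner, and the two clusters are made-up data in which the subconcept labels are assumed to be known. None of this reproduces the article's three methodologies.

```python
from math import dist

def fit_sphere(points):
    """Toy 1-class model: centroid plus the largest training distance to it."""
    dims = len(points[0])
    centroid = tuple(sum(p[i] for p in points) / len(points) for i in range(dims))
    radius = max(dist(p, centroid) for p in points)
    return centroid, radius

def accepts(sphere, x):
    """True if x falls inside the sphere, i.e. is judged to belong to the class."""
    centroid, radius = sphere
    return dist(x, centroid) <= radius

# Hypothetical target class made of two well-separated subconcepts.
subconcepts = {
    "a": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    "b": [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)],
}
all_points = [p for pts in subconcepts.values() for p in pts]

global_model = fit_sphere(all_points)                            # one model for the whole domain
local_models = [fit_sphere(pts) for pts in subconcepts.values()]  # one model per subconcept

# A point in the empty region between the two subconcepts.
query = (5.5, 5.5)
global_accepts = accepts(global_model, query)
local_accepts = any(accepts(m, query) for m in local_models)
```

The single global model accepts the point in the gap because it must stretch to cover both clusters, while the per-subconcept models reject it and still accept every training point.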