
1.

Statistical topic models for multi-label document classification Total citations: 2, self-citations: 0, citations by others: 2

Timothy N. Rubin America Chambers Padhraic Smyth Mark Steyvers 《Machine Learning》2012,88(1-2):157-208

Machine learning approaches to multi-label document classification have to date largely relied on discriminative modeling techniques such as support vector machines. A drawback of these approaches is that performance rapidly drops off as the total number of labels and the number of labels per document increase. This problem is amplified when the label frequencies exhibit the type of highly skewed distributions that are often observed in real-world datasets. In this paper we investigate a class of generative statistical topic models for multi-label documents that associate individual word tokens with different labels. We investigate the advantages of this approach relative to discriminative models, particularly with respect to classification problems involving large numbers of relatively rare labels. We compare the performance of generative and discriminative approaches on document labeling tasks ranging from datasets with several thousand labels to datasets with tens of labels. The experimental results indicate that probabilistic generative models can achieve competitive multi-label classification performance compared to discriminative methods, and have advantages for datasets with many labels and skewed label frequencies.

2.

Automated analysis and exploration of image databases: Results, progress, and challenges

Usama M. Fayyad Padhraic Smyth Nicholas Weir S. Djorgovski 《Journal of Intelligent Information Systems》1995,4(1):7-25

In areas as diverse as earth remote sensing, astronomy, and medical imaging, image acquisition technology has undergone tremendous improvements in recent years. The vast amounts of scientific data are potential treasure-troves for scientific investigation and analysis. Unfortunately, advances in our ability to deal with this volume of data in an effective manner have not paralleled the hardware gains. While special-purpose tools for particular applications exist, there is a dearth of useful general-purpose software tools and algorithms which can assist a scientist in exploring large scientific image databases. This paper presents our recent progress in developing interactive semi-automated image database exploration tools based on pattern recognition and machine learning technology. We first present a completed and successful application that illustrates the basic approach: the SKICAT system used for the reduction and analysis of a 3 terabyte astronomical data set. SKICAT integrates techniques from image processing, data classification, and database management. It represents a system in which machine learning played a powerful and enabling role, and solved a difficult, scientifically significant problem. We then proceed to discuss the general problem of automated image database exploration, the particular aspects of image databases which distinguish them from other databases, and how this impacts the application of off-the-shelf learning algorithms to problems of this nature. A second large image database is used to ground this discussion: Magellan's images of the surface of the planet Venus. The paper concludes with a discussion of current and future challenges.

3.

Statistical Themes and Lessons for Data Mining Total citations: 14, self-citations: 1, citations by others: 13

Clark Glymour David Madigan Daryl Pregibon Padhraic Smyth 《Data mining and knowledge discovery》1997,1(1):11-28

Data mining is on the interface of Computer Science and Statistics, utilizing advances in both disciplines to make progress in extracting information from large databases. It is an emerging field that has attracted much attention in a very short period of time. This article highlights some statistical themes and lessons that are directly relevant to data mining and attempts to identify opportunities where close cooperation between the statistical and computational communities might reasonably provide synergy for further progress in data analysis.

4.

Bayesian Detection of Changepoints in Finite-State Markov Chains for Multiple Sequences

Petter Arnesen Tracy Holsclaw Padhraic Smyth 《Technometrics》2016,58(2):205-213

We consider the analysis of sets of categorical sequences consisting of piecewise homogeneous Markov segments. The sequences are assumed to be governed by a common underlying process with segments occurring in the same order for each sequence. Segments are defined by a set of unobserved changepoints where the positions and number of changepoints can vary from sequence to sequence. We propose a Bayesian framework for analyzing such data, placing priors on the locations of the changepoints and on the transition matrices and using Markov chain Monte Carlo (MCMC) techniques to obtain posterior samples given the data. Experimental results using simulated data illustrate how the methodology can be used for inference of posterior distributions for parameters and changepoints, as well as the ability to handle considerable variability in the locations of the changepoints across different sequences. We also investigate the application of the approach to sequential data from an application involving monsoonal rainfall patterns. Supplementary materials for this article are available online.

5.

Modeling individual email patterns over time with latent variable models

Nicholas Navaroli Christopher DuBois Padhraic Smyth 《Machine Learning》2013,92(2-3):431-455

As digital communication devices play an increasingly prominent role in our daily lives, the ability to analyze and understand our communication patterns becomes more important. In this paper, we investigate a latent variable modeling approach for extracting information from individual email histories, focusing in particular on understanding how an individual communicates over time with recipients in their social network. The proposed model consists of latent groups of recipients, each of which is associated with a piecewise-constant Poisson rate over time. Inference of group memberships, temporal changepoints, and rate parameters is carried out via Markov Chain Monte Carlo (MCMC) methods. We illustrate the utility of the model by applying it to both simulated and real-world email data sets.

6.

Linearly Combining Density Estimators via Stacking Total citations: 1, self-citations: 0, citations by others: 1

Padhraic Smyth David Wolpert 《Machine Learning》1999,36(1-2):59-83

This paper presents experimental results with both real and artificial data combining unsupervised learning algorithms using stacking. Specifically, stacking is used to form a linear combination of finite mixture model and kernel density estimators for non-parametric multivariate density estimation. The method outperforms other strategies such as choosing the single best model based on cross-validation, combining with uniform weights, and even using the single best model chosen by "cheating" and examining the test set. We also investigate (1) how the utility of stacking changes when one of the models being combined is the model that generated the data, (2) how the stacking coefficients of the models compare to the relative frequencies with which cross-validation chooses among the models, (3) visualization of combined effective kernels, and (4) the sensitivity of stacking to overfitting as model complexity increases.

7.

Model-Based Clustering and Visualization of Navigation Patterns on a Web Site Total citations: 5, self-citations: 1, citations by others: 5

Igor Cadez David Heckerman Christopher Meek Padhraic Smyth Steven White 《Data mining and knowledge discovery》2003,7(4):399-424

We present a new methodology for exploring and analyzing navigation patterns on a web site. The patterns that can be analyzed consist of sequences of URL categories traversed by users. In our approach, we first partition site users into clusters such that users with similar navigation paths through the site are placed into the same cluster. Then, for each cluster, we display these paths for users within that cluster. The clustering approach we employ is model-based (as opposed to distance-based) and partitions users according to the order in which they request web pages. In particular, we cluster users by learning a mixture of first-order Markov models using the Expectation-Maximization algorithm. The runtime of our algorithm scales linearly with the number of clusters and with the size of the data; and our implementation easily handles hundreds of thousands of user sessions in memory. In the paper, we describe the details of our method and a visualization tool based on it called WebCANVAS. We illustrate the use of our approach on user-traffic data from msnbc.com.

8.

Analysis of Pattern Discovery in Sequences Using a Bayes Error Framework

Darya Chudova Padhraic Smyth 《Data mining and knowledge discovery》2003,7(3):273-299

In this paper we investigate the general problem of discovering recurrent patterns that are embedded in categorical sequences. An important real-world problem of this nature is motif discovery in DNA sequences. There are a number of fundamental aspects of this data mining problem that can make discovery easy or hard; we characterize the difficulty of this problem using an analysis based on the Bayes error rate under a Markov assumption. The Bayes error framework demonstrates why certain patterns are much harder to discover than others. It also explains the role of different parameters such as pattern length and pattern frequency in sequential discovery. We demonstrate how the Bayes error can be used to calibrate existing discovery algorithms, providing a lower bound on achievable performance. We discuss a number of fundamental issues that characterize sequential pattern discovery in this context, present a variety of empirical results to complement and verify the theoretical analysis, and apply our methodology to real-world motif-discovery problems in computational biology.

9.

Learning to Recognize Volcanoes on Venus Total citations: 1, self-citations: 0, citations by others: 1

Michael C. Burl Lars Asker Padhraic Smyth Usama Fayyad Pietro Perona Larry Crumpler Jayne Aubele 《Machine Learning》1998,30(2-3):165-194

Dramatic improvements in sensor and image acquisition technology have created a demand for automated tools that can aid in the analysis of large image databases. We describe the development of JARtool, a trainable software system that learns to recognize volcanoes in a large data set of Venusian imagery. A machine learning approach is used because it is much easier for geologists to identify examples of volcanoes in the imagery than it is to specify domain knowledge as a set of pixel-level constraints. This approach can also provide portability to other domains without the need for explicit reprogramming; the user simply supplies the system with a new set of training examples. We show how the development of such a system requires a completely different set of skills than are required for applying machine learning to toy world domains. This paper discusses important aspects of the application process not commonly encountered in the toy world, including obtaining labeled training data, the difficulties of working with pixel data, and the automatic extraction of higher-level features.

10.

Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

Igor V. Cadez Padhraic Smyth Geoff J. McLachlan Christine E. McLaren 《Machine Learning》2002,47(1):7-34
