Similar Documents
20 similar documents found (search time: 280 ms)
1.
We present a method for measuring similarity between source codes. We approach this task from the machine learning perspective, using character and word n-grams as features and examining different machine learning algorithms. Furthermore, we explore the contribution of latent semantic analysis to this task. We developed a corpus in order to evaluate the proposed approach. The corpus consists of around 10,000 source codes written in the Karel programming language to solve 100 different tasks. The results show that the highest classification accuracy is achieved when using a Support Vector Machines classifier, applying latent semantic analysis, and selecting word trigrams as features.
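
A minimal sketch of this kind of pipeline, assuming scikit-learn; the code snippets and task labels below are tiny placeholders, not the authors' Karel corpus:

# Hedged sketch: word n-gram TF-IDF features + latent semantic analysis (LSA)
# via truncated SVD + a linear SVM.  Data below is a toy placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

codes = [
    "move(); turnLeft(); move(); putBeeper(); move();",
    "move(); move(); turnLeft(); putBeeper();",
    "while frontIsClear(): move(); putBeeper();",
    "while frontIsClear(): move(); turnLeft();",
]
task_ids = [0, 0, 1, 1]                      # which task each source code solves

pipeline = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 3), token_pattern=r"\S+"),
    TruncatedSVD(n_components=3),            # LSA; tune n_components on real data
    LinearSVC(),
)
pipeline.fit(codes, task_ids)
print(pipeline.predict(["move(); turnLeft(); move(); putBeeper();"]))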

2.
Outlier detection algorithms are often computationally intensive because of their need to score each point in the data. Even simple distance-based algorithms have quadratic complexity. High-dimensional outlier detection algorithms such as subspace methods are often even more computationally intensive because of their need to explore different subspaces of the data. In this paper, we propose an exceedingly simple subspace outlier detection algorithm, which can be implemented in a few lines of code, whose time complexity is linear in the size of the data set, and whose space requirement is constant. We show that this outlier detection algorithm is much faster than both conventional and high-dimensional algorithms and also provides more accurate results. The approach uses randomized hashing to score data points and has a neat subspace interpretation. We provide a visual representation of this interpretability in terms of outlier sensitivity histograms. Furthermore, the approach can easily be generalized to data streams, where it provides an efficient way to discover outliers in real time. We present experimental results showing the effectiveness of the approach over other state-of-the-art methods.
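
A simplified illustration of the general idea (random subspaces plus hashing of grid cells), not the paper's exact algorithm; the data and parameters are placeholders:

# Sketch: repeatedly pick a random subspace, hash points into coarse grid cells
# there, and score each point by how sparsely populated its cells are.
import numpy as np
from collections import Counter

def subspace_hash_scores(X, n_rounds=100, bin_width=1.0, seed=None):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(n_rounds):
        dims = rng.choice(d, size=max(1, d // 2), replace=False)  # random subspace
        shift = rng.random(len(dims))                             # random grid shift
        cells = np.floor((X[:, dims] + shift) / bin_width).astype(int)
        counts = Counter(map(tuple, cells))
        # points in sparsely populated cells get higher (less negative) scores
        scores += np.array([-np.log(counts[tuple(c)]) for c in cells])
    return scores / n_rounds

X = np.vstack([np.random.default_rng(0).normal(size=(200, 5)),
               [[5.0] * 5]])                  # one obvious outlier appended
print(np.argmax(subspace_hash_scores(X)))     # expected: 200 (the appended outlier)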

3.
In this paper, we present a study of speaker discrimination using multi-classifier fusion, with a focus on the effects of feature reduction. Speaker discrimination consists of automatically distinguishing between two speakers using the vocal characteristics of their speech. A number of features are extracted using Mel Frequency Spectral Coefficients and then reduced using Relative Speaker Characteristic (RSC) along with Principal Components Analysis (PCA). Several classification methods are implemented to perform the discrimination task. Since different classifiers are employed, two fusion algorithms at the decision level, referred to as Weighted Fusion and Fuzzy Fusion, are proposed to boost classification performance. These algorithms are based on weighting the outputs of the different classifiers. Furthermore, the effects of speaker gender and feature reduction on the speaker discrimination task were also examined. The evaluation of our approaches was conducted on a subset of Hub-4 Broadcast News. The experimental results show that speaker discrimination accuracy is improved by 5–15% using the RSC–PCA feature reduction. In addition, the proposed fusion methods recorded an improvement of about 10% compared to the individual scores of the classifiers. Finally, we observed that gender has an important impact on discrimination performance.
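
A minimal sketch of decision-level weighted fusion, where each classifier's class posteriors are combined with a weight reflecting its reliability; the probabilities and weights below are illustrative, not the paper's exact scheme:

# Weighted decision-level fusion: combine each classifier's class posteriors
# with a normalized weight (e.g. proportional to its validation accuracy).
import numpy as np

def weighted_fusion(probas, weights):
    """probas: list of (n_samples, n_classes) arrays, one per classifier."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    fused = sum(w * p for w, p in zip(weights, probas))
    return fused.argmax(axis=1)

# e.g. three classifiers scoring two test utterances for speakers {0, 1}
p_svm = np.array([[0.7, 0.3], [0.4, 0.6]])
p_knn = np.array([[0.6, 0.4], [0.3, 0.7]])
p_mlp = np.array([[0.8, 0.2], [0.5, 0.5]])
print(weighted_fusion([p_svm, p_knn, p_mlp], weights=[0.9, 0.8, 0.85]))  # -> [0 1]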

4.
Bayesian belief nets (BNs) are often used for classification tasks—typically to return the most likely class label for each specified instance. Many BN-learners, however, attempt to find the BN that maximizes a different objective function—viz., likelihood rather than classification accuracy—typically by first learning an appropriate graphical structure and then finding the parameters for that structure that maximize the likelihood of the data. As these parameters may not maximize classification accuracy, “discriminative parameter learners” follow the alternative approach of seeking the parameters that maximize conditional likelihood (CL) over the distribution of instances the BN will have to classify. This paper first formally specifies this task, shows how it extends standard logistic regression, and analyzes its inherent sample and computational complexity. We then present a general algorithm for this task, ELR, that applies to arbitrary BN structures and works effectively even when given incomplete training data. Unfortunately, ELR is not guaranteed to find the parameters that optimize conditional likelihood; moreover, even the optimal-CL parameters need not have minimal classification error. This paper therefore presents empirical evidence that ELR produces effective classifiers, often superior to the ones produced by standard “generative” algorithms, especially in the common situation where the given BN structure is incorrect.
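
For intuition, conditional-likelihood parameter fitting in the simplest case reduces to logistic regression; a bare-bones gradient-ascent sketch of that special case (illustrative only, not the ELR algorithm):

# Gradient ascent on the conditional log-likelihood of labels given features,
# i.e. plain logistic regression -- the simplest case that discriminative
# BN parameter learning generalizes.  Not ELR itself.
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=500):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # P(y=1 | x, w)
        w += lr * X.T @ (y - p) / len(y)      # gradient of sum_i log P(y_i | x_i, w)
    return w

X = np.array([[1.0, -1.5], [1.0, -0.5], [1.0, 0.5], [1.0, 1.5]])  # bias + feature
y = np.array([0, 0, 1, 1])
w = fit_logistic(X, y)
print((1.0 / (1.0 + np.exp(-X @ w)) > 0.5).astype(int))           # -> [0 0 1 1]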

5.
Time series classification is relevant to many different domains, such as health informatics, finance, and bioinformatics. Due to its broad applications, researchers have developed many algorithms for this kind of task, e.g., multivariate time series classification. Among the classification algorithms, k-nearest neighbor (k-NN) classification (particularly 1-NN) combined with dynamic time warping (DTW) achieves state-of-the-art performance. The deficiency is that when the data set grows large, the time consumption of 1-NN with DTW becomes very expensive. In contrast to 1-NN with DTW, feature-based classification methods are more efficient but less effective, since their performance usually depends on the quality of hand-crafted features. In this paper, we aim to improve the performance of traditional feature-based approaches through feature learning techniques. Specifically, we propose a novel deep learning framework, multi-channels deep convolutional neural networks (MC-DCNN), for multivariate time series classification. This model first learns features from individual univariate time series in each channel and combines information from all channels as a feature representation at the final layer. Then, the learnt features are fed into a multilayer perceptron (MLP) for classification. Finally, extensive experiments on real-world data sets show that our model is not only more efficient than the state of the art but also competitive in accuracy. This study implies that feature learning is worth investigating for the problem of time series classification.
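
A minimal sketch of the multi-channel idea, assuming PyTorch: one 1-D convolutional feature extractor per univariate channel, concatenation of the learnt features, then an MLP; the layer sizes are illustrative, not the paper's architecture:

# Sketch of the multi-channel pattern: per-channel Conv1d branches, feature
# concatenation, then an MLP classifier.  Sizes are illustrative only.
import torch
import torch.nn as nn

class MCDCNNSketch(nn.Module):
    def __init__(self, n_channels=3, seq_len=64, n_classes=4):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv1d(1, 8, kernel_size=5), nn.ReLU(),
                          nn.MaxPool1d(2), nn.Flatten())
            for _ in range(n_channels)
        ])
        feat_dim = n_channels * 8 * ((seq_len - 4) // 2)
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_classes))

    def forward(self, x):                       # x: (batch, channels, seq_len)
        feats = [b(x[:, i:i + 1, :]) for i, b in enumerate(self.branches)]
        return self.mlp(torch.cat(feats, dim=1))

model = MCDCNNSketch()
print(model(torch.randn(2, 3, 64)).shape)       # -> torch.Size([2, 4])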

6.
A comparative study of staff removal algorithms (cited 1 time: 0 self-citations, 1 by others)
This paper presents a quantitative comparison of different algorithms for the removal of staff lines from music images. It contains a survey of previously proposed algorithms and suggests a new skeletonization-based approach. We define three different error metrics, compare the algorithms with respect to these metrics, and measure their robustness with respect to certain image defects. Our test images are computer-generated scores on which we apply various image deformations typically found in real-world data. In addition to modern Western music notation, our test set also includes historic music notation such as mensural notation and lute tablature. Our general approach and evaluation methodology are not specific to staff removal, but applicable to other segmentation problems as well.
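
One simple pixel-level error measure for this kind of evaluation (an illustration only, not necessarily one of the paper's three metrics) is the fraction of pixels where the algorithm's output differs from a ground-truth staff-less image:

# Illustrative pixel-level error metric for staff-removal evaluation.
import numpy as np

def pixel_error(output, ground_truth):
    output = np.asarray(output, dtype=bool)
    ground_truth = np.asarray(ground_truth, dtype=bool)
    return np.mean(output != ground_truth)     # fraction of mismatching pixels

gt  = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0]])
out = np.array([[0, 1, 0, 0],
                [0, 0, 1, 1]])
print(pixel_error(out, gt))                    # -> 0.25 (2 of 8 pixels wrong)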

7.
In this paper, we discuss a quantum approach to the all-pair multiclass classification problem. In an all-pair approach, there is one binary classification problem for each pair of classes, and so there are k(k−1)/2 classifiers for a k-class classification problem. As compared to the classical multiclass support vector machine, which can be implemented with polynomial run-time complexity, our approach exhibits an exponential speedup due to quantum computing. The quantum all-pair algorithm can also be used with other classification algorithms, and a speedup can be achieved compared to their classical counterparts.
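
The classical counterpart of the all-pair decomposition can be sketched as k(k−1)/2 binary classifiers combined by majority voting; the quantum speedup itself is not shown here, and the data below is a toy placeholder:

# All-pair (one-vs-one) decomposition: one binary SVM per pair of classes,
# predictions combined by majority voting.  Classical illustration only.
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def all_pair_fit_predict(X, y, X_test):
    classes = np.unique(y)
    votes = np.zeros((len(X_test), len(classes)), dtype=int)
    for i, j in combinations(range(len(classes)), 2):         # k(k-1)/2 pairs
        mask = np.isin(y, [classes[i], classes[j]])
        clf = SVC(kernel="linear").fit(X[mask], y[mask])
        pred = clf.predict(X_test)
        for c in (i, j):
            votes[:, c] += (pred == classes[c])
    return classes[votes.argmax(axis=1)]

X = np.array([[0.0], [0.2], [1.0], [1.2], [2.0], [2.2]])
y = np.array([0, 0, 1, 1, 2, 2])
print(all_pair_fit_predict(X, y, np.array([[0.1], [1.1], [2.1]])))  # -> [0 1 2]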

8.
Multimedia content understanding research requires a rigorous approach to deal with the complexity of the data. At the crux of this problem is the need to handle multilevel data whose structure exists at multiple scales and across data sources. A common example is modeling tags jointly with images to improve retrieval, classification and tag recommendation. Associated contextual observations, such as metadata, are rich and can be exploited for content analysis. A major challenge is the need for a principled approach to systematically incorporate associated media with the primary data source of interest. Taking a factor modeling approach, we propose a framework that can discover low-dimensional structures for a primary data source together with other associated information. We cast this task as a subspace learning problem under the framework of Bayesian nonparametrics, and thus the subspace dimensionality and the number of clusters are automatically learnt from data instead of being set a priori. Using Beta processes as the building block, we construct random measures in a hierarchical structure to generate multiple data sources and capture their shared statistical structure at the same time. The model parameters are inferred efficiently using a novel combination of Gibbs and slice sampling. We demonstrate the applicability of the proposed model in three applications: image retrieval, automatic tag recommendation and image classification. Experiments using two real-world datasets show that our approach outperforms various state-of-the-art related methods.

9.
There has been a growing interest in applying human computation – particularly crowdsourcing techniques – to assist in the solution of multimedia, image processing, and computer vision problems which are still too difficult to solve using fully automatic algorithms, and yet relatively easy for humans. In this paper we focus on a specific problem – object segmentation within color images – and compare different solutions which combine color image segmentation algorithms with human efforts, either in the form of an explicit interactive segmentation task or through an implicit collection of valuable human traces with a game. We use Click’n’Cut, a friendly, web-based, interactive segmentation tool that allows segmentation tasks to be assigned to many users, and Ask’nSeek, a game with a purpose designed for object detection and segmentation. The two main contributions of this paper are: (i) We use the results of Click’n’Cut campaigns with different groups of users to examine and quantify the crowdsourcing loss incurred when an interactive segmentation task is assigned to paid crowd-workers, comparing their results to those obtained when computer vision experts are asked to perform the same tasks. (ii) Since interactive segmentation tasks are inherently tedious and prone to fatigue, we compare the quality of the results obtained with Click’n’Cut with those obtained using a (fun, interactive, and potentially less tedious) game designed for the same purpose. We call this contribution the assessment of the gamification loss, since it refers to how much segmentation quality may be lost when we switch to a game-based approach to the same task. We demonstrate that the crowdsourcing loss is significant when using all the data points from workers, but decreases substantially (and becomes comparable to the quality of expert users performing similar tasks) after performing a modest amount of data analysis and filtering out users whose data are clearly not useful. We also show that – on the other hand – the gamification loss is significantly more severe: the quality of the results drops roughly by half when switching from a focused (yet tedious) task to a more fun and relaxed game environment.
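
Such quality gaps are commonly quantified by comparing each group's segmentation masks against expert ground truth with the Jaccard index (intersection over union); a toy sketch with placeholder masks, not the paper's data:

# Jaccard index (IoU) between binary segmentation masks, used here to compare
# two groups of masks against an expert mask.  Toy placeholder masks.
import numpy as np

def jaccard(a, b):
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

expert  = np.array([[0, 1, 1], [0, 1, 1]])
workers = np.array([[0, 1, 1], [0, 0, 1]])     # crowd-worker result (toy)
gamers  = np.array([[1, 1, 1], [0, 0, 0]])     # game-based result (toy)

print(jaccard(workers, expert))                # -> 0.75
print(jaccard(gamers, expert))                 # -> 0.4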

10.
11.
Many complex multi-target prediction problems that concern large target spaces are characterised by a need for efficient prediction strategies that avoid the computation of predictions for all targets explicitly. Examples of such problems emerge in several subfields of machine learning, such as collaborative filtering, multi-label classification, dyadic prediction and biological network inference. In this article we analyse efficient and exact algorithms for computing the top-K predictions in the above problem settings, using a general class of models that we refer to as separable linear relational models. We show how to use those inference algorithms, which are modifications of well-known information retrieval methods, in a variety of machine learning settings. Furthermore, we study the possibility of scoring items incompletely, while still retaining an exact top-K retrieval. Experimental results in several application domains reveal that the so-called threshold algorithm is very scalable, often performing many orders of magnitude more efficiently than the naive approach.
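
A compact sketch of the threshold algorithm for exact top-K retrieval under a monotone aggregation (here a plain sum of per-feature scores); the input format and names are illustrative:

# Threshold algorithm sketch: sorted access to each per-feature list in
# parallel, random access to complete each newly seen item's aggregate, and an
# early-stopping threshold from the scores at the current depth.
import heapq

def threshold_topk(score_lists, k):
    """score_lists: dict feature -> dict item -> score (random access);
    sorted access is simulated by sorting each list once."""
    sorted_lists = {f: sorted(s.items(), key=lambda kv: -kv[1])
                    for f, s in score_lists.items()}
    seen, top = set(), []                      # top: min-heap of (aggregate, item)
    for depth in range(max(len(lst) for lst in sorted_lists.values())):
        threshold = 0.0
        for f, lst in sorted_lists.items():
            item, score = lst[min(depth, len(lst) - 1)]
            threshold += score
            if item not in seen:
                seen.add(item)
                agg = sum(s.get(item, 0.0) for s in score_lists.values())
                heapq.heappush(top, (agg, item))
                if len(top) > k:
                    heapq.heappop(top)
        if len(top) == k and top[0][0] >= threshold:
            break                              # no unseen item can enter the top-K
    return sorted(top, reverse=True)

lists = {"f1": {"a": 0.9, "b": 0.8, "c": 0.1},
         "f2": {"a": 0.2, "b": 0.7, "c": 0.9}}
print(threshold_topk(lists, k=1))              # -> [(1.5, 'b')]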

12.
Dictionary learning plays a key role in image representation for classification. A multi-modal dictionary is usually learned from feature samples across different classes and shared in the feature encoding process. Ideally, each atom in the dictionary corresponds to a single class of images, while each class of images corresponds to a certain group of atoms. Image features are encoded as linear combinations of selected atoms in a given dictionary. We propose to use the elastic net as a regularizer to select atoms in feature coding and in the related dictionary learning process, which not only benefits from sparsity similar to the ℓ1 penalty but also encourages a grouping effect that helps improve image representation. Experimental results for image classification on benchmark datasets show that a dictionary learned in the proposed way outperforms state-of-the-art dictionary learning algorithms.
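
A hedged sketch of elastic-net feature coding over a fixed dictionary, here using scikit-learn's ElasticNet as the coder; the dictionary and feature vector are random placeholders:

# Encode a feature vector as an elastic-net-regularized combination of
# dictionary atoms (least-squares fit of x to D @ c plus the elastic net
# penalty, up to scaling).  Random placeholder dictionary and feature.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 256))                 # dictionary: 256 atoms of dimension 64
x = D[:, 3] + 0.5 * D[:, 17] + 0.01 * rng.normal(size=64)   # feature to encode

coder = ElasticNet(alpha=0.01, l1_ratio=0.7, fit_intercept=False, max_iter=5000)
coder.fit(D, x)
code = coder.coef_                             # sparse-ish code with a grouping effect
print(np.argsort(-np.abs(code))[:2])           # likely the atoms used above, e.g. [ 3 17]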

13.
14.
Rapid advances in image acquisition and storage technology underline the need for real-time algorithms that are capable of solving large-scale image processing and computer-vision problems. The minimum s-t cut problem, which is a classical combinatorial optimization problem, is a prominent building block in many vision and imaging algorithms such as video segmentation, co-segmentation, stereo vision, multi-view reconstruction, and surface fitting, to name a few. That is why finding a real-time algorithm which optimally solves this problem is of great importance. In this paper, we introduce Hochbaum’s pseudoflow (HPF) algorithm, which optimally solves the minimum s-t cut problem, to computer vision. We compare the performance of HPF, in terms of execution times and memory utilization, with three leading published algorithms: (1) Goldberg’s and Tarjan’s push-relabel (PRF); (2) Boykov’s and Kolmogorov’s (BK) augmenting paths; and (3) Goldberg’s partial augment-relabel. While the common practice in computer vision is to use either the BK or PRF algorithms for solving the problem, our results demonstrate that, in general, the HPF algorithm is more efficient and utilizes less memory than these three algorithms. This strongly suggests that HPF is a great option for many real-time computer-vision problems that require solving the minimum s-t cut problem.
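
Not HPF itself, but an illustration of posing a tiny vision-style labeling problem as a minimum s-t cut, solved here with networkx's generic max-flow/min-cut routine; node names and capacities are placeholders:

# Two "pixels" p1, p2 with unary terminal weights and a pairwise smoothness
# term, labeled by solving a minimum s-t cut.  Generic solver, not HPF.
import networkx as nx

G = nx.DiGraph()
G.add_edge("s", "p1", capacity=5)    # unary term: p1 prefers foreground (source side)
G.add_edge("s", "p2", capacity=1)
G.add_edge("p1", "t", capacity=1)
G.add_edge("p2", "t", capacity=4)    # unary term: p2 prefers background (sink side)
G.add_edge("p1", "p2", capacity=2)   # pairwise smoothness term
G.add_edge("p2", "p1", capacity=2)

cut_value, (fg, bg) = nx.minimum_cut(G, "s", "t")
print(cut_value, sorted(fg), sorted(bg))   # -> 4 ['p1', 's'] ['p2', 't']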

15.
The emergence of the MapReduce (MR) framework for scaling data mining and machine learning algorithms provides for Volume, while handling of Variety and Velocity needs to be skilfully crafted into algorithms. So far, scalable clustering algorithms have focused solely on Volume, taking advantage of the MR framework. In this paper we present a MapReduce algorithm—data aware scalable clustering (DASC), which is capable of handling the 3 Vs of big data by virtue of being (i) single-scan and distributed to handle Volume, (ii) incremental to cope with Velocity and (iii) versatile in handling numeric and categorical data to accommodate Variety. The DASC algorithm incrementally processes an infinitely growing data set stored on a distributed file system and delivers a quality clustering scheme while ensuring recency of patterns. An up-to-date synopsis is preserved by the algorithm for the data seen so far. Each new data increment is processed and merged with the synopsis. Since the synopsis itself may grow very large in size, the algorithm stores it as a file. This makes the DASC algorithm truly scalable. Exclusive clusters are obtained on demand by applying a connected component analysis (CCA) algorithm over the synopsis. CCA presents a subtle roadblock to effective parallelism during clustering. This problem is overcome by accomplishing the task in two stages. In the first stage, hyperclusters are identified based on prevailing data characteristics. The second stage utilizes this knowledge to determine the degree of parallelism, thereby making DASC data aware. Hyperclusters are distributed over the available compute nodes for discovering embedded clusters in parallel. This staged approach to clustering yields the dual advantage of improved parallelism and the desired complexity in the \(\mathcal{MRC}^0\) class. The DASC algorithm is empirically compared with the incremental Kmeans and Scalable Kmeans++ algorithms. Experimentation on real-world and synthetic data with approximately 1.2 billion data points demonstrates the effectiveness of the DASC algorithm. Empirical observations of DASC execution are in consonance with the theoretical analysis with respect to stability in resource utilization and execution time.
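
For intuition, the CCA stage can be pictured as a union-find pass that links synopsis entries lying within a distance threshold; a toy sketch of that step only, not the DASC algorithm:

# Toy union-find connected component analysis over synopsis entries: entries
# whose centres lie within a threshold are linked, and the resulting
# components become exclusive clusters.
from itertools import combinations

def connected_components(points, threshold):
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]      # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= threshold ** 2:
            parent[find(i)] = find(j)          # union
    return [find(i) for i in range(len(points))]

synopsis = [(0.0, 0.0), (0.3, 0.1), (5.0, 5.0), (5.2, 4.9)]
print(connected_components(synopsis, threshold=1.0))   # -> [1, 1, 3, 3] (two components)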

16.
In this paper we investigate the use of a multimodal feature learning approach, using neural-network-based models such as Skip-gram and Denoising Autoencoders, to address sentiment analysis of micro-blogging content, such as Twitter short messages, that is composed of a short text and, possibly, an image. The approach used in this work is motivated by recent advances in: i) training language models based on neural networks that have proved to be extremely efficient when dealing with web-scale text corpora, and have shown very good performance on syntactic and semantic word similarities; ii) unsupervised learning, with neural networks, of robust visual features that are recoverable from partial observations, which may be due to occlusions or to noisy and heavily modified images. We propose a novel architecture that incorporates these neural networks, testing it on several standard Twitter datasets, and showing that the approach is efficient and obtains good classification results.
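
A minimal denoising autoencoder sketch, assuming PyTorch: the model reconstructs a clean feature vector from a corrupted copy, which is the kind of robust visual feature learning the paper combines with skip-gram text features; sizes and noise level are illustrative:

# Denoising autoencoder: corrupt the input, reconstruct the clean version.
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, dim=784, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x):
        noisy = x * (torch.rand_like(x) > 0.3).float()   # randomly drop ~30% of inputs
        return self.decoder(self.encoder(noisy))

model = DenoisingAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(16, 784)                                  # placeholder image batch
loss = nn.functional.mse_loss(model(x), x)               # reconstruct the clean input
loss.backward()
opt.step()
print(float(loss))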

17.
The preservation of musical works produced in the past requires their digitalization and transformation into a machine-readable format. The processing of handwritten musical scores by computers remains far from ideal. One of the fundamental stages in carrying out this task is staff line detection. We investigate a general-purpose, knowledge-free method for the automatic detection of music staff lines based on a stable path approach. Lines affected by curvature, discontinuities, and inclination are robustly detected. Experimental results show that the proposed technique consistently outperforms well-established algorithms.
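
The core operation behind such path-based detectors can be illustrated with a simple dynamic program that finds a minimum-cost left-to-right path through an image, where dark (staff) pixels are cheap; a toy sketch, not the authors' full stable-path method:

# Minimum-cost left-to-right path through an image: the path may move up or
# down by one row per column, and dark pixels cost little.  Toy illustration.
import numpy as np

def min_cost_path(cost):
    rows, cols = cost.shape
    dp = cost.astype(float).copy()
    back = np.zeros((rows, cols), dtype=int)
    for c in range(1, cols):
        for r in range(rows):
            candidates = [r2 for r2 in (r - 1, r, r + 1) if 0 <= r2 < rows]
            best = min(candidates, key=lambda r2: dp[r2, c - 1])
            dp[r, c] = cost[r, c] + dp[best, c - 1]
            back[r, c] = best
    path = [int(np.argmin(dp[:, -1]))]
    for c in range(cols - 1, 0, -1):
        path.append(int(back[path[-1], c]))
    return path[::-1]                          # row index of the path in each column

image = np.array([[9, 9, 9, 9, 9],
                  [1, 1, 2, 1, 1],             # a dark, slightly wavy line
                  [9, 9, 1, 9, 9],
                  [9, 9, 9, 9, 9]])
print(min_cost_path(image))                    # -> [1, 1, 2, 1, 1] (follows the dark line)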

18.
In this paper we study multi-label learning with weakly labeled data, i.e., the labels of training examples are incomplete, which commonly occurs in real applications, e.g., image classification, document categorization. This setting includes, e.g., (i) semi-supervised multi-label learning, where completely labeled examples are partially known; (ii) weak label learning, where relevant labels of examples are partially known; (iii) extended weak label learning, where relevant and irrelevant labels of examples are partially known. Previous studies often expect that a learning method using weakly labeled data will improve performance, as more data are employed. This, however, is not always the case in reality, i.e., weakly labeled data may sometimes degrade the learning performance. It is desirable to learn safe multi-label predictions that will not hurt performance when weakly labeled data are involved in the learning procedure. In this work we optimize multi-label evaluation metrics (F1 score and Top-k precision) given that the ground-truth label assignment is realized by a convex combination of base multi-label learners. To cope with the infinite number of possible ground-truth label assignments, a cutting-plane strategy is adopted to iteratively generate the most helpful label assignments. The whole optimization is cast as a series of simple linear programs in an efficient manner. Extensive experiments on three weakly labeled learning tasks, namely, (i) semi-supervised multi-label learning, (ii) weak label learning and (iii) extended weak label learning, clearly show that our proposal improves the safeness of using weakly labeled data compared with many state-of-the-art methods.
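
A hedged sketch of the basic building block: a convex combination of base multi-label learners' score matrices, thresholded and evaluated with the F1 score; the weights here are fixed, not the ones the paper optimizes via cutting planes and linear programs:

# Convex combination of base multi-label score matrices, thresholded and
# scored with micro-averaged F1.  Fixed illustrative weights and toy data.
import numpy as np
from sklearn.metrics import f1_score

base_1 = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3]])   # scores of base learner 1
base_2 = np.array([[0.6, 0.4, 0.9], [0.2, 0.9, 0.1]])   # scores of base learner 2
y_true = np.array([[1, 0, 1], [0, 1, 0]])

weights = np.array([0.5, 0.5])                           # convex combination weights
combined = weights[0] * base_1 + weights[1] * base_2
y_pred = (combined >= 0.5).astype(int)
print(f1_score(y_true, y_pred, average="micro"))         # -> 1.0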

19.
Software developers, testers and customers routinely submit issue reports to software issue trackers to record the problems they face in using a software product. The issues are then directed to appropriate experts for analysis and fixing. However, submitters often misclassify an improvement request as a bug and vice versa. This costs valuable developer time. Hence, automated classification of the submitted reports would be of great practical utility. In this paper, we analyze how machine learning techniques may be used to perform this task. We apply different classification algorithms, namely naive Bayes, linear discriminant analysis, k-nearest neighbors, support vector machine (SVM) with various kernels, decision tree and random forest, separately, to classify the reports from three open-source projects. We evaluate their performance in terms of F-measure, average accuracy and weighted average F-measure. Our experiments show that random forests perform best, while SVMs with certain kernels also achieve high performance.
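
A hedged sketch of this kind of experiment, assuming scikit-learn: TF-IDF features from issue-report text and several classifiers compared by F-measure; the reports below are placeholders, not the three projects' data:

# Compare a few classifiers on bug-vs-improvement report classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

reports = ["crash when saving file", "app crashes on startup",
           "please add dark mode", "feature request: export to csv"]
labels = ["bug", "bug", "improvement", "improvement"]
test_reports, test_labels = ["crash on export", "add csv import"], ["bug", "improvement"]

for name, clf in [("naive Bayes", MultinomialNB()),
                  ("SVM (linear kernel)", SVC(kernel="linear")),
                  ("random forest", RandomForestClassifier(random_state=0))]:
    model = make_pipeline(TfidfVectorizer(), clf).fit(reports, labels)
    pred = model.predict(test_reports)
    print(name, f1_score(test_labels, pred, pos_label="bug"))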

20.
Flexible integration of multimedia sub-queries with qualitative preferences (cited 1 time: 0 self-citations, 1 by others)
Complex multimedia queries, aiming to retrieve from large databases those objects that best match the query specification, are usually processed by splitting them into a set of m simpler sub-queries, each dealing with only some of the query features. To determine the overall best-matching objects, a rule is then needed to integrate the results of such sub-queries, i.e., to globally rank the m-dimensional vectors of matching degrees, or partial scores, that objects obtain on the m sub-queries. State-of-the-art approaches all adopt as the integration rule a scoring function, such as a weighted average, that aggregates the m partial scores into an overall (numerical) similarity score, so that objects can be linearly ordered and only the highest-scored ones returned to the user. This choice, however, forces the system to compromise between the different sub-queries and can easily lead to missing relevant results. In this paper we explore the potential of a more general approach, based on the use of qualitative preferences, able to define arbitrary partial (rather than only linear) orders on database objects, so that greater flexibility is gained in shaping what the user is looking for. For the purpose of efficient evaluation, we propose two integration algorithms able to work with any (monotone) partial order (and thus also with scoring functions): MPO, which delivers objects one layer of the partial order at a time, and iMPO, which can incrementally return one object at a time and is thus also suitable for processing top-k queries. Our analysis demonstrates that using qualitative preferences pays off. In particular, using Skyline and Region-prioritized Skyline preferences for queries on a real image database, we show that the results we obtain have a precision comparable to that obtainable using scoring functions, yet they are obtained much faster, saving up to about 70% of database accesses.
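
A small sketch of a Skyline-style qualitative preference: an object is kept if no other object's vector of partial scores dominates it (at least as good on every sub-query and strictly better on at least one); object names and scores are placeholders:

# Skyline (Pareto front) over m-dimensional partial-score vectors.
def dominates(a, b):
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(objects):
    """objects: dict name -> tuple of partial scores on the m sub-queries."""
    return {o for o, s in objects.items()
            if not any(dominates(t, s) for p, t in objects.items() if p != o)}

partial_scores = {"img1": (0.9, 0.2), "img2": (0.5, 0.8),
                  "img3": (0.4, 0.7), "img4": (0.85, 0.15)}
print(sorted(skyline(partial_scores)))     # -> ['img1', 'img2']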
