1.
Solving multi-instance problems with classifier ensemble based on constructive clustering (cited 3 times: 0 self-citations, 3 citations by others)
In multi-instance learning, the training set is composed of labeled bags, each consisting of many unlabeled instances; that is, an object is represented by a set of feature vectors instead of only one feature vector. Most current multi-instance learning algorithms work by adapting single-instance learning algorithms to the multi-instance representation, while this paper proposes a new solution that goes the opposite way, that is, adapting the multi-instance representation to single-instance learning algorithms. In detail, the instances of all the bags are first collected together and clustered into d groups. Each bag is then re-represented by d binary features, where the value of the i-th feature is set to one if the bag has instances falling into the i-th group and to zero otherwise. Thus, each bag is represented by one feature vector, so that single-instance classifiers can be used to distinguish different classes of bags. By repeating the above process with different values of d, many classifiers can be generated and then combined into an ensemble for prediction. Experiments show that the proposed method works well on standard as well as generalized multi-instance problems.
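The re-representation step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not say which clustering algorithm is used, so plain k-means is assumed here, and the function names `kmeans` and `bags_to_binary_features` are hypothetical.

```python
import numpy as np

def kmeans(points, d, n_iter=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns the d cluster centres."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from d distinct instances picked at random.
    centres = points[rng.choice(len(points), size=d, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(d):
            members = points[labels == j]
            if len(members):  # keep the old centre if a cluster becomes empty
                centres[j] = members.mean(axis=0)
    return centres

def bags_to_binary_features(bags, d, seed=0):
    """Map each bag (an array of instances) to a d-dimensional 0/1 vector."""
    bags = [np.asarray(bag, dtype=float) for bag in bags]
    # Pool the instances of all bags and cluster them into d groups.
    centres = kmeans(np.vstack(bags), d, seed=seed)
    features = np.zeros((len(bags), d), dtype=int)
    for b, bag in enumerate(bags):
        dists = np.linalg.norm(bag[:, None, :] - centres[None, :, :], axis=2)
        # Feature i is 1 if at least one instance of the bag falls into group i.
        features[b, np.unique(dists.argmin(axis=1))] = 1
    return features
```

Any single-instance classifier can then be trained on the resulting vectors. Repeating the call with several values of `d` and combining the resulting classifiers would give the ensemble the abstract mentions.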
Zhi-Hua Zhou is currently Professor in the Department of Computer Science & Technology and head of the LAMDA group at Nanjing University. His main research interests include machine learning, data mining, information retrieval, and pattern recognition. He is associate editor of Knowledge and Information Systems and on the editorial boards of Artificial Intelligence in Medicine, International Journal of Data Warehousing and Mining, Journal of Computer Science & Technology, and Journal of Software. He has also been involved in various conferences.
Min-Ling Zhang received his B.Sc. and M.Sc. degrees in computer science from Nanjing University, China, in 2001 and 2004, respectively. Currently he is a Ph.D. candidate in the Department of Computer Science & Technology at Nanjing University and a member of the LAMDA group. His main research interests include machine learning and data mining, especially in multi-instance learning and multi-label learning.
2.
In multi-instance multi-label learning (MIML), each example is not only represented by multiple instances but also associated with multiple class labels. Several learning frameworks, such as traditional supervised learning, can be regarded as degenerated versions of MIML. Therefore, an intuitive way to solve an MIML problem is to identify its equivalent in one of these degenerated versions. However, this identification process loses useful information encoded in the training examples and thus impairs the learning algorithm's performance. In this paper, RBF neural networks are adapted to learn from MIML examples. Connections between instances and labels are directly exploited in the process of first-layer clustering and second-layer optimization. The proposed method demonstrates superior performance on two real-world MIML tasks.
3.
Songhe Feng, Hong Bao, Congyan Lang. Neurocomputing, 2011, 74(17): 3619-3627
Tag ranking has recently emerged as an important research topic due to its potential application to web image search. Existing tag relevance ranking approaches mainly rank the tags according to their relevance levels with respect to a given image. Nonetheless, such algorithms rely heavily on a large-scale image dataset and a proper similarity measure to retrieve semantically relevant images with multiple labels. In contrast to the existing tag relevance ranking algorithms, in this paper we propose a novel tag saliency ranking scheme, which aims to automatically rank the tags associated with a given image according to their saliency to the image content. To this end, this paper presents an integrated framework for tag saliency ranking, which combines a visual attention model with multi-instance learning to investigate the saliency ranking order of tags with respect to the given image. Specifically, tags annotated at the image level are first propagated to the region level via an efficient multi-instance learning algorithm; then, a visual attention model is employed to measure the importance of regions in the given image. Finally, tags are ranked according to the saliency values of the corresponding regions. Experiments conducted on the COREL and MSRC image datasets demonstrate the effectiveness and efficiency of the proposed framework.
4.
Local anomaly detection refers to detecting small anomalies or outliers that exist in some subsegments of events or behaviors. Such local anomalies are easily overlooked by most of the existing approaches since they are designed for detecting global or large anomalies. In this paper, an accurate and flexible three-phase framework TRASMIL is proposed for local anomaly detection based on TRAjectory Segmentation and Multi-Instance Learning. Firstly, every motion trajectory is segmented into independent sub-trajectories, and a metric with Diversity and Granularity is proposed to measure the quality of segmentation. Secondly, the segmented sub-trajectories are modeled by a sequence learning model. Finally, multi-instance learning is applied to detect abnormal trajectories and sub-trajectories which are viewed as bags and instances, respectively. We validate the TRASMIL framework in terms of 16 different algorithms built on the three-phase framework. Substantial experiments show that algorithms based on the TRASMIL framework outperform existing methods in effectively detecting the trajectories with local anomalies in terms of the whole trajectory. In particular, the MDL-C algorithm (the combination of HDP-HMM with MDL segmentation and Citation kNN) achieves the highest accuracy and recall rates. We further show that TRASMIL is generic enough to adopt other algorithms for identifying local anomalies.
5.
Applying graph theory to clustering, we propose a partitional clustering method and a clustering tendency index. The method requires no initial assumptions about the data set. The number of clusters and the partition that best fits the data set are selected according to the optimal value of the clustering tendency index.
6.
In this paper we propose a clustering algorithm to cluster data with arbitrary shapes without knowing the number of clusters in advance. The proposed algorithm has two stages. In the first stage, a neural network incorporating an ART-like training algorithm is used to cluster data into a set of multi-dimensional hyperellipsoids. In the second stage, a dendrogram is built to complement the neural network. We then use dendrograms and so-called tables of relative frequency counts to help analysts pick trustworthy clustering results from among many different candidates. Several data sets were tested to demonstrate the performance of the proposed algorithm.
7.
Multitask Bregman clustering (cited 1 time: 0 self-citations, 1 citation by others)
Traditional clustering methods deal with a single clustering task on a single data set. In some newly emerging applications, multiple similar clustering tasks are involved simultaneously. In this case, we not only desire a partition for each task, but also want to discover the relationship among clusters of different tasks. It is also expected that utilizing the relationship among tasks can improve the individual performance of each task. In this paper, we propose general approaches to extend a wide family of traditional clustering models/algorithms to multitask settings. We first formulate multitask clustering in general terms as minimizing a loss function composed of a within-task loss and a task regularization. Then, based on general Bregman divergences, the within-task loss is defined as the average Bregman divergence from a data sample to its cluster centroid. Two types of task regularization are proposed to encourage coherence among the clustering results of the tasks. Afterwards, we provide a probabilistic interpretation of the proposed formulations from the viewpoint of joint density estimation. Finally, we propose alternating procedures to solve the induced optimization problems. In such procedures, the clustering models and the relationship among clusters of different tasks are updated alternately, and the two phases boost each other. Empirical results on several real data sets validate the effectiveness of the proposed approaches.
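The shape of the objective described above can be written out as follows. This is a sketch under assumed notation, not the paper's own formulation. Every symbol is introduced here for illustration: T tasks, n_t samples x_i^{(t)} in task t, assignments z_i^{(t)}, centroids M^{(t)} = {μ_k^{(t)}}, a trade-off weight λ, and a placeholder Ω for the task regularization, whose two concrete forms the abstract does not give. Only the definition of the Bregman divergence d_φ for a strictly convex, differentiable φ is standard.

```latex
\min_{\{M^{(t)}\},\,\{z^{(t)}\}} \;
\sum_{t=1}^{T} \frac{1}{n_t} \sum_{i=1}^{n_t}
  d_\phi\!\left(x_i^{(t)},\, \mu^{(t)}_{z_i^{(t)}}\right)
\;+\; \lambda\, \Omega\!\left(M^{(1)}, \dots, M^{(T)}\right),
\qquad
d_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y),\, x - y \rangle .
```

With φ(x) = ‖x‖², d_φ reduces to the squared Euclidean distance, so each within-task term becomes the usual k-means loss.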
8.
Multi-instance multi-label (MIML) learning plays a pivotal role in artificial intelligence research. MIML learning introduces a framework in which data are described by a bag of instances associated with a set of labels. In this framework, modeling the connection between instances and labels is the challenging problem. In MIMLRBF, an RBF neural network can capture the complex relations between the instances and labels, but estimating the parameters of the RBF network is a difficult task. In this paper, the computational convergence and the modeling accuracy of the RBF network are improved. The study investigates the impact on multi-instance multi-label learning of a novel hybrid algorithm consisting of the Gases Brownian Motion Optimization (GBMO) algorithm and a gradient-based fast-converging parameter estimation method. The hybrid algorithm estimates the RBF neural network parameters (the weights, widths and centers of the hidden units) simultaneously. It uses the robustness of GBMO to search the parameter space and the efficiency of the gradient-based method. For this purpose, two real-world MIML tasks and a Corel dataset were used within a two-step experimental design. In the first step, the GBMO algorithm was used to determine the widths and centers of the network nodes. In the second step, for each molecule with fixed inputs and number of hidden nodes, the parameters were optimized by a structured nonlinear parameter optimization method (SNPOM). The findings demonstrate the superior performance of the hybrid method. Additionally, the training and testing results show that the hybrid method enhances RBF network learning more efficiently than other conventional RBF approaches and obtains better modeling accuracy than some other algorithms.
9.
Traditionally, many scientific fields require strong support for massive workflows that use multiple cores simultaneously. To support such large-scale scientific workflows, high-capacity parallel systems such as supercomputers are widely used. To increase the utilization of these systems, most schedulers use a backfilling policy based on users’ estimated runtimes. However, these estimates are found to be extremely inaccurate because users overestimate their jobs. Therefore, in this paper, an efficient machine learning approach is presented to predict the runtime of parallel applications. The proposed method is divided into three phases. The first analyzes the important features of the history log data by factor analysis. The second clusters the parallel programs based on the important features. The third builds a prediction model from the pattern similarity of the parallel program log data and estimates the runtime. In the experiments, we use workload logs from parallel systems (i.e., NASA-iPSC, LANL-CM5, SDSC-Par95, SDSC-Par96, and CTC-SP2) to evaluate the effectiveness of our approach. Comparing root-mean-square error with other techniques, experimental results show that the proposed method improves the accuracy by up to 69.56%.
10.
Evolutionary semi-supervised fuzzy clustering (cited 3 times: 0 self-citations, 3 citations by others)
For learning a classifier from labeled and unlabeled data, this paper proposes an evolutionary semi-supervised fuzzy clustering algorithm. The class label information provided by the labeled data is used to guide the evolution of each fuzzy partition on the unlabeled data, which plays the role of a chromosome. The fitness of each chromosome is evaluated with a combination of the fuzzy within-cluster variance of the unlabeled data and the misclassification error of the labeled data. The structure of the clusters obtained can be used to classify a future new pattern. The performance of the proposed approach is evaluated using two benchmark data sets. Experimental results indicate that the proposed approach can improve classification accuracy significantly compared to a classifier trained with only a small number of labeled data. It also outperforms a similar approach, SSFCM.
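A chromosome fitness of the kind described above might look like the sketch below. The abstract does not give the formula, so several things here are assumptions made for illustration: the weighted-sum form, the weight `alpha`, the fuzzifier `m`, and the convention that lower values are better. The name `chromosome_fitness` is hypothetical.

```python
import numpy as np

def chromosome_fitness(U, X_unlabeled, centres, predicted, y_labeled,
                       alpha=1.0, m=2.0):
    """Score one fuzzy partition (chromosome); lower is better in this sketch.

    U           : (c, n) membership matrix of the n unlabeled points in c clusters
    X_unlabeled : (n, p) unlabeled data
    centres     : (c, p) cluster centres
    predicted   : class labels assigned to the labeled points by the clusters
    y_labeled   : true class labels of the labeled points
    alpha, m    : assumed trade-off weight and fuzzifier (not from the source)
    """
    U = np.asarray(U, dtype=float)
    X = np.asarray(X_unlabeled, dtype=float)
    V = np.asarray(centres, dtype=float)
    # Squared distance from every centre to every unlabeled point, shape (c, n).
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    # Fuzzy within-cluster variance: sum_i sum_k u_ik^m * ||x_k - v_i||^2.
    variance = float((U ** m * d2).sum())
    # Fraction of labeled points whose predicted class is wrong.
    error = float(np.mean(np.asarray(predicted) != np.asarray(y_labeled)))
    return variance + alpha * error
```

An evolutionary search would evaluate this score for every partition in the population and favour the lower-scoring ones when selecting parents.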
11.
This study presents a new pattern recognition neural network for clustering problems, and illustrates its use for machine cell design in group technology. The proposed algorithm involves modifications of the learning procedure and resonance test of the Fuzzy ART neural network. These modifications enable the neural network to process integer values rather than binary valued inputs or the values in the interval [0, 1], and improve the clustering performance of the neural network. A two-stage clustering approach is also developed in order to obtain an informative and intelligent decision for the problem of designing a machine cell. At the first stage, we identify the part families with very similar parts (i.e., high similarity exists in their processing requirements), and the resultant part families are input to the second stage, which forms the groups of machines. Experimental studies show that the proposed approach leads to better results in comparison with those produced by the Fuzzy ART and other similar neural network classifiers.
12.
The picture fuzzy set (PFS), a generalization of the traditional fuzzy set and the intuitionistic fuzzy set, shows great promise for better adaptation than existing types of fuzzy sets to many practical problems in pattern recognition, artificial life, robotics, and expert and knowledge-based systems. An emerging research trend in PFS is the development of clustering algorithms that can exploit and investigate hidden knowledge in large collections of data. The distance measure is one of the most important tools in clustering, since it determines the degree of relationship between two objects. In this paper, we propose a generalized picture distance measure and integrate it into a novel hierarchical picture fuzzy clustering method called Hierarchical Picture Clustering (HPC). Experimental results show that the clustering quality of the proposed algorithm is better than that of the relevant existing ones.
13.
Recognized as one of the most serious security threats to current Internet infrastructure, botnets can be implemented not only through existing well-known applications, e.g. IRC, HTTP, or Peer-to-Peer, but also through unknown or novel applications, which makes botnet detection a challenging problem. Previous attempts to detect botnets mostly examine traffic content for bot commands on selected network links or set up honeypots. Traffic content, however, can be encrypted as botnets evolve, which causes content-based detection approaches to fail. In this paper, we address this issue and propose a new approach for detecting and clustering botnet traffic on large-scale network application communities. We first classify the network traffic into different applications by using traffic payload signatures. A novel decision tree model is then used to classify traffic that cannot be identified from its payload content (e.g. encrypted traffic) into known application communities. Within each community, network traffic is clustered based on n-gram features selected and extracted from the content of network flows, in order to differentiate the malicious traffic created by bots from the normal traffic generated by human beings on each specific application. We evaluate our approach with seven different traffic traces collected on three different network links, and the results show that the proposed approach successfully detects two IRC botnet traffic traces with a high detection rate and an acceptably low false alarm rate.
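As an illustration of what n-gram features over flow content can look like, the sketch below counts byte n-grams in a payload and turns them into relative frequencies. It is not the authors' feature extractor: the abstract does not specify the n-gram length, the selection procedure, or the vocabulary, so `n=2`, the caller-supplied `vocabulary`, and both function names are assumptions.

```python
from collections import Counter

def byte_ngram_counts(payload: bytes, n: int = 2) -> Counter:
    """Count every contiguous run of n bytes in the payload."""
    return Counter(payload[i:i + n] for i in range(len(payload) - n + 1))

def ngram_frequency_vector(payload: bytes, vocabulary, n: int = 2):
    """Relative frequency of each vocabulary n-gram in the payload.

    `vocabulary` is a hypothetical list of n-grams chosen beforehand,
    for example the most frequent ones seen in training flows.
    """
    counts = byte_ngram_counts(payload, n)
    total = sum(counts.values())
    # A payload shorter than n bytes has no n-grams, so the vector is all zeros.
    return [counts[gram] / total if total else 0.0 for gram in vocabulary]
```

Vectors of this kind can be fed to any clustering algorithm. Relative frequencies are used rather than raw counts so that flows of different lengths remain comparable.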
14.
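Clustering is a research hotspot in machine learning, and weakly supervised learning is an important research direction within semi-supervised learning with a wide range of application scenarios. In studying clustering and weakly supervised learning, a weakly supervised learning framework based on k labeled samples is proposed. The framework first uses clustering and clustering confidence to expand the set of labeled samples. Second, the energy function of the restricted Boltzmann machine is improved, and a restricted Boltzmann machine learning model based on k labeled samples is proposed. Finally, inference for the model is derived and the related algorithms are designed. To validate the framework and the model, comparative experiments are carried out on public datasets; the results show that the weakly supervised learning framework based on k labeled samples performs well.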
15.
Clustering analysis aims to identify inherent structures and discover useful information in large amounts of data. However, decision makers may have an insufficient understanding of the nature of the data and may not know how to set the optimal parameters for the clustering method. To overcome this drawback, this paper proposes a new entropy clustering method using adaptive learning. The proposed method considers the spread of the data to determine the adaptive threshold, with parameters optimized by adaptive learning. Four datasets from the UCI database are used as experimental data to compare the accuracy of the proposed method with that of the listed clustering methods. The experimental results indicate that the proposed method is superior to the listed methods.
16.
Clustering with constraints is a powerful method that allows users to specify background knowledge and the expected cluster properties. Significant work has explored the incorporation of instance-level constraints into non-hierarchical clustering but not into hierarchical clustering algorithms. In this paper we present a formal complexity analysis of the problem and show that constraints can be used to improve not only the quality of the resultant dendrogram but also the efficiency of the algorithms. This is particularly important since many agglomerative-style algorithms have running times that are quadratic (or faster-growing) functions of the number of instances to be clustered. We present several bounds on the improvement in the running times of algorithms obtainable using constraints.
A preliminary version of this paper appeared as Davidson and Ravi (2005b).
17.
Urban mobility impacts urban life to a great extent. To enhance urban mobility, much research has been invested in traveling time prediction: given an origin and destination, provide a passenger with an accurate estimate of how long a journey will last. In this work, we investigate a novel combination of methods from Queueing Theory and Machine Learning in the prediction process. We propose a prediction engine that, given a scheduled bus journey (route) and a ‘source/destination’ pair, provides an estimate of the traveling time while considering both historical data and real-time streams of information transmitted by buses. We propose a model that uses natural segmentation of the data according to bus stops and a set of predictors, some of which use learning while others are learning-free, to compute traveling time. Our empirical evaluation, using data from the bus network in the city of Dublin, demonstrates that the snapshot principle, taken from Queueing Theory, works well yet suffers from outliers. To overcome the outlier problem, we use Machine Learning techniques as a regulator that assists in identifying outliers and propose prediction based on historical data.
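A learning-free snapshot predictor over bus-stop segments can be sketched as below. It rests on one reading of the snapshot principle: the time the next bus needs for a segment is approximated by the time the most recent bus needed for it. The abstract does not spell out the predictor, so this reading, the log layout, and the names `latest_duration` and `snapshot_travel_time` are assumptions for illustration.

```python
def latest_duration(traversals):
    """Duration of the most recent traversal of one segment.

    traversals: list of (exit_timestamp, duration_seconds) tuples, in any order.
    """
    return max(traversals, key=lambda record: record[0])[1]

def snapshot_travel_time(segment_log, source, destination):
    """Predicted time from stop `source` to stop `destination` (indices).

    segment_log[i] holds the observed traversals of the segment that runs
    from stop i to stop i + 1. The prediction sums, over the segments on
    the way, the duration of the latest bus that crossed each of them.
    """
    return sum(latest_duration(segment_log[i])
               for i in range(source, destination))
```

Because each segment relies on a single recent observation, one delayed bus skews the estimate. That is consistent with the sensitivity to outliers reported above and motivates falling back on historical data.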
18.
Estimation theory is used to derive a new approach to the clustering problem. The new method is a unification of centroid and mode estimation, achieved by considering the effect of spatial scale on the estimator. The result is a multiresolution method which spans a range of spatial scales, giving enhanced robustness both to noise in the data and to changes of scale in the data, by using comparison between scales as a test of cluster validity. Iterative and non-iterative algorithms based on the new estimator are presented and are shown to be more accurate than simple scale-space filtering in identifying and locating the cluster centres from noisy test data. Results from a wide range of applications are used to illustrate the power and versatility of the new method.
19.
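To improve image retrieval performance, a multi-instance image retrieval method based on manifold ranking is proposed. Segmented images are represented in multi-instance form and, by defining a metric suited to images in the bag space, manifold ranking and multi-instance learning are effectively combined for image retrieval. Experimental results show that, compared with traditional retrieval methods, the proposed method markedly improves the retrieval rate and returns results that better match human visual perception.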
20.
Weiling Cai, Songcan Chen, Daoqiang Zhang. Pattern Recognition, 2009, 42(7): 1248-1259
Traditional pattern recognition generally involves two tasks: unsupervised clustering and supervised classification. When class information is available, fusing the advantages of both clustering learning and classification learning into a single framework is an important problem worthy of study. To date, most algorithms treat clustering learning and classification learning in a sequential or two-step manner, i.e., they first execute clustering learning to explore structures in the data, and then perform classification learning on top of the obtained structural information. However, such sequential algorithms cannot always guarantee simultaneous optimality for both clustering and classification learning. In fact, the clustering learning in these algorithms just aids the subsequent classification learning and does not benefit from the latter. To overcome this problem, a simultaneous learning framework for clustering and classification (SCC) is presented in this paper. SCC aims to achieve three goals: (1) acquiring robust classification and clustering simultaneously; (2) designing an effective and transparent classification mechanism; (3) revealing the underlying relationship between clusters and classes. To this end, using Bayesian theory and the cluster posterior probabilities of classes, we define a single objective function in which the clustering process is directly embedded. By optimizing this objective function, effective and robust clustering and classification results are achieved simultaneously. Experimental results on both synthetic and real-life datasets show that SCC achieves promising classification and clustering results at the same time.