Found 20 similar documents (search time: 15 ms)
In multi-instance learning, the training set is composed of labeled bags, each consisting of many unlabeled instances; that is, an object is represented by a set of feature vectors instead of a single feature vector. Most current multi-instance learning algorithms work by adapting single-instance learning algorithms to the multi-instance representation. This paper proposes a new solution that goes the opposite way: adapting the multi-instance representation to single-instance learning algorithms. In detail, the instances of all the bags are first collected together and clustered into d groups. Each bag is then re-represented by d binary features, where the value of the ith feature is set to one if the bag has instances falling into the ith group and zero otherwise. Each bag is thus represented by one feature vector, so that single-instance classifiers can be used to distinguish different classes of bags. By repeating this process with different values of d, many classifiers can be generated and then combined into an ensemble for prediction. Experiments show that the proposed method works well on standard as well as generalized multi-instance problems. Zhi-Hua Zhou is currently Professor in the Department of Computer Science & Technology and head of the LAMDA group at Nanjing University. His main research interests include machine learning, data mining, information retrieval, and pattern recognition. He is associate editor of Knowledge and Information Systems and on the editorial boards of Artificial Intelligence in Medicine, International Journal of Data Warehousing and Mining, Journal of Computer Science & Technology, and Journal of Software. He has also been involved in various conferences. Min-Ling Zhang received his B.Sc. and M.Sc. degrees in computer science from Nanjing University, China, in 2001 and 2004, respectively. Currently he is a Ph.D. candidate in the Department of Computer Science & Technology at Nanjing University and a member of the LAMDA group. His main research interests include machine learning and data mining, especially multi-instance learning and multi-label learning.
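
The re-representation step described in this abstract can be illustrated with a minimal Python sketch. It assumes the d cluster centers have already been produced by some clustering algorithm (the abstract does not say which), and it assigns each instance to its nearest center by Euclidean distance; the function names and the toy data are hypothetical, and the ensemble over different values of d is not shown.

```python
import math

def nearest_center(instance, centers):
    """Return the index of the center closest to the instance (Euclidean)."""
    return min(range(len(centers)), key=lambda i: math.dist(instance, centers[i]))

def bag_to_binary_features(bag, centers):
    """Feature i is 1 if any instance of the bag falls into group i, else 0."""
    features = [0] * len(centers)
    for instance in bag:
        features[nearest_center(instance, centers)] = 1
    return features

# Hypothetical toy data: three cluster centers (d = 3) and one bag.
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
bag = [(0.2, -0.1), (9.5, 0.4), (0.1, 0.3)]
print(bag_to_binary_features(bag, centers))  # [1, 0, 1]
```

Once every bag is mapped this way, any ordinary single-instance classifier can be trained on the resulting vectors.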

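Introducing a mechanism that exploits unlabeled examples into multi-instance learning can reduce training cost and improve the learner's generalization ability. Most current semi-supervised multi-instance learning algorithms label every instance in each bag, thereby converting multi-instance learning into a single-instance semi-supervised learning problem. Considering that the class label of a bag is determined by its instances and by the structure of the bag, this paper proposes a multi-instance learning algorithm that performs semi-supervised learning directly at the bag level. By defining a multi-instance kernel, a bag-level graph Laplacian matrix is computed from all bags (labeled and unlabeled) and used as the smoothness penalty term in the optimization objective. Finding the optimal solution in the RKHS spanned by the multi-instance kernel reduces to determining a multi-instance kernel function modified by the unlabeled data, which can be used directly in classical kernel learning methods. The algorithm was tested on experimental data sets and compared with existing algorithms. The results show that the algorithm based on the semi-supervised multi-instance kernel reaches the same accuracy as supervised learning algorithms with less training data, and that, given the same labeled data set, using unlabeled data effectively improves the learner's generalization ability.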
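
As a rough illustration of a bag-level graph Laplacian, the Python sketch below makes two assumptions that the abstract does not confirm: the multi-instance kernel is taken to be the average of an instance-level RBF kernel over all instance pairs, and the kernel values are used directly as edge weights W, with the unnormalized Laplacian L = D - W. All names and the `gamma` parameter are hypothetical.

```python
import math

def rbf(x, y, gamma=1.0):
    """Instance-level RBF kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def set_kernel(bag_a, bag_b, gamma=1.0):
    """Assumed multi-instance kernel: mean RBF value over all instance pairs."""
    total = sum(rbf(x, y, gamma) for x in bag_a for y in bag_b)
    return total / (len(bag_a) * len(bag_b))

def bag_graph_laplacian(bags, gamma=1.0):
    """Unnormalized Laplacian L = D - W over all bags, labeled or unlabeled."""
    n = len(bags)
    # Edge weights between distinct bags; no self-loops.
    W = [[0.0 if i == j else set_kernel(bags[i], bags[j], gamma)
          for j in range(n)] for i in range(n)]
    # Diagonal holds the degree, off-diagonal entries hold -W[i][j].
    return [[sum(W[i]) if i == j else -W[i][j] for j in range(n)]
            for i in range(n)]

# Hypothetical bags of one-dimensional instances.
bags = [[(0.0,), (1.0,)], [(0.5,)], [(10.0,)]]
L = bag_graph_laplacian(bags)
```

Each row of L sums to zero, so a penalty of the form f^T L f is small when bags connected by large kernel values receive similar outputs.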

Min-Ling  Zhi-Jian 《Neurocomputing》2009,72(16-18):3951
In multi-instance multi-label learning (MIML), each example is not only represented by multiple instances but also associated with multiple class labels. Several learning frameworks, such as traditional supervised learning, can be regarded as degenerated versions of MIML. An intuitive way to solve an MIML problem is therefore to identify its equivalent in one of these degenerated versions. However, this identification process loses useful information encoded in the training examples and thus impairs the learning algorithm's performance. In this paper, RBF neural networks are adapted to learn from MIML examples. Connections between instances and labels are directly exploited in the first-layer clustering and second-layer optimization. The proposed method demonstrates superior performance on two real-world MIML tasks.

Gan Rui, Yin Jian. 《计算机科学》 (Computer Science), 2012, 39(7): 144-147
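In multi-instance learning, each labeled sample in the training set is a bag composed of multiple instances, and the ultimate goal is to use this data set to train a classifier that can predict the labels of unlabeled bags. Previous studies either modify existing single-instance learning algorithms to accommodate multi-instance data, or propose new methods that mine the relationship between instances and bags and use the mined results to solve the problem. Starting from a change in the representation of bags, this paper proposes an algorithm for multi-instance learning, the concept evaluation algorithm. The algorithm first uses a clustering algorithm to group all instances into d clusters, each of which can be regarded as a concept contained in the instances. It then uses TF-IDF (Term Frequency-Inverse Document Frequency), originally developed for text retrieval, to evaluate the importance of each concept in each bag. Finally, each bag is represented as a d-dimensional vector, the concept evaluation vector, whose ith component gives the importance, within that bag, of the concept represented by the ith cluster. After this re-representation, the data set is no longer "multi-instance", so existing single-instance learning algorithms can be used to solve multi-instance learning problems efficiently.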
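
The concept evaluation vector can be sketched in Python as follows. The abstract does not state which TF-IDF variant is used, so this sketch assumes a common one: term frequency is the fraction of a bag's instances that fall in a cluster, and inverse document frequency is log(N / df), where N is the number of bags and df is the number of bags containing that cluster. The cluster assignments are assumed to be given; all names are hypothetical.

```python
import math
from collections import Counter

def concept_vectors(bag_assignments, d):
    """Map each bag (a list of cluster indices, one per instance) to a
    d-dimensional TF-IDF vector over the d concepts (clusters)."""
    n_bags = len(bag_assignments)
    # Document frequency: number of bags in which each concept occurs.
    df = [0] * d
    for assignment in bag_assignments:
        for c in set(assignment):
            df[c] += 1
    vectors = []
    for assignment in bag_assignments:
        counts = Counter(assignment)
        vec = []
        for c in range(d):
            tf = counts[c] / len(assignment) if assignment else 0.0
            idf = math.log(n_bags / df[c]) if df[c] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

# Hypothetical example: three bags, d = 3 concepts.
vectors = concept_vectors([[0, 0, 1], [1, 2], [1]], d=3)
```

In this toy example concept 1 occurs in every bag, so its IDF is zero and it contributes nothing to any vector, which mirrors how TF-IDF discounts ubiquitous terms in text retrieval.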

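To solve the multi-instance image classification problem effectively, a new multi-instance image classification method based on sparse representation is proposed. The method treats each image as a multi-instance bag and the regions of the image as the instances in the bag, and computes bag features using an instance embedding strategy. The bag feature of the image to be classified is then expressed as a sparse linear combination over the set of training-image bag features, and the sparse solution is obtained by ℓ1 optimization. Finally, a method for predicting the label of the image to be classified from the sparse coefficients is proposed. Experimental results on the Corel data set show that the proposed method achieves higher classification accuracy than other methods.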

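Multi-instance multi-label learning is a new machine learning framework. In it, each sample takes the form of a bag that consists of multiple instances and is tagged with multiple labels. Previous studies on multi-instance multi-label learning usually assume that the instances in a bag are independent and identically distributed, but this assumption is hard to guarantee in practice. To exploit the correlations among the instances in a bag, a multi-instance multi-label classification algorithm based on non-i.i.d. instances is proposed. The algorithm first builds a correlation matrix that expresses the relationships among the instances within a bag, so that each multi-instance bag is represented by one correlation matrix. It then constructs kernel functions based on correlation matrices at different scales. Finally, since the prediction of different labels corresponds to different kernel functions, multiple kernel learning is introduced to construct and train a multi-kernel SVM classifier for each label. Experimental results on image and text data sets show that the algorithm greatly improves the accuracy of multi-label classification.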

Tag ranking has recently emerged as an important research topic because of its potential applications in web image search. Existing tag relevance ranking approaches mainly rank tags according to their relevance to a given image. Such algorithms, however, rely heavily on a large-scale image dataset and a proper similarity measure to retrieve semantically relevant images with multiple labels. In contrast to existing tag relevance ranking algorithms, this paper proposes a novel tag saliency ranking scheme, which aims to automatically rank the tags associated with a given image according to their saliency with respect to the image content. To this end, the paper presents an integrated framework for tag saliency ranking that combines a visual attention model with multi-instance learning to investigate the saliency ranking order of tags with respect to the given image. Specifically, tags annotated at the image level are first propagated to the region level via an efficient multi-instance learning algorithm; a visual attention model is then employed to measure the importance of regions in the given image. Finally, tags are ranked according to the saliency values of the corresponding regions. Experiments conducted on the COREL and MSRC image datasets demonstrate the effectiveness and efficiency of the proposed framework.

A two-class classification problem is considered where the objects to be classified are bags of instances in d-space. The classification rule is defined in terms of an open d-ball. A bag is labeled positive if it meets the ball and negative otherwise. Determining the center and radius of the ball is modeled as an SVM-like margin optimization problem. Necessary optimality conditions are derived, leading to a polynomial algorithm in fixed dimension. A VNS-type heuristic is developed and experimentally tested. The methodology is extended to classification by several balls and to more than two classes.
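
The classification rule itself (not the margin optimization that finds the ball) is simple enough to sketch in Python. The center, radius and bags below are hypothetical, and "meets the ball" is read as "at least one instance lies strictly inside the open ball".

```python
import math

def bag_label(bag, center, radius):
    """Return +1 if any instance lies strictly inside the open ball, else -1."""
    return 1 if any(math.dist(x, center) < radius for x in bag) else -1

# Hypothetical ball and bags in 2-space.
center, radius = (0.0, 0.0), 1.0
print(bag_label([(3.0, 3.0), (0.5, 0.5)], center, radius))  # 1
print(bag_label([(3.0, 3.0), (1.0, 0.0)], center, radius))  # -1 (boundary point is outside an open ball)
```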

One of the industrial applications of computer vision is automatic visual inspection. In the last decade, standard supervised learning methods have been used to detect defects in different kinds of products. These methods are trained with a set of images in which every image has to be manually segmented and labeled by experts in the application domain. These manual segmentations require a large number of high-quality pixel-level delineations, which can be time-consuming and often difficult to produce. Multi-instance learning (MIL), in contrast to standard supervised classifiers, avoids this task and can therefore be trained with weakly labeled images. In this paper, we propose an approach to automatic visual inspection that uses MIL for defect detection. The approach has been tested with data from three artificial benchmark datasets and three real-world industrial scenarios: inspection of artificial teeth, weld defect detection and fishbone detection. Results show that the proposed approach can be used with weakly labeled images for defect detection in automatic visual inspection systems. The approach increases the area under the receiver-operating characteristic curve (AUC) by up to 6.3% compared with the naïve MIL approach of propagating the bag labels.
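
The naïve baseline mentioned at the end of this abstract, propagating each bag's label to all of its instances, can be sketched in Python as below. The function name and toy data are hypothetical; the resulting instance-level set would then be passed to any standard supervised classifier.

```python
def propagate_bag_labels(bags, bag_labels):
    """Give every instance the label of the bag it belongs to."""
    instances, labels = [], []
    for bag, label in zip(bags, bag_labels):
        for instance in bag:
            instances.append(instance)
            labels.append(label)
    return instances, labels

# Hypothetical bags: image patches reduced to 2-D feature vectors.
bags = [[(0.1, 0.2), (0.9, 0.8)], [(0.2, 0.1)]]
X, y = propagate_bag_labels(bags, [1, 0])
print(X)  # [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1)]
print(y)  # [1, 1, 0]
```

The weakness of this baseline is visible in the example: every instance of a positive bag is marked positive, even though only some of them may actually show a defect.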

Local anomaly detection refers to detecting small anomalies or outliers that exist in some subsegments of events or behaviors. Such local anomalies are easily overlooked by most existing approaches, which are designed to detect global or large anomalies. In this paper, an accurate and flexible three-phase framework, TRASMIL, is proposed for local anomaly detection based on TRAjectory Segmentation and Multi-Instance Learning. First, every motion trajectory is segmented into independent sub-trajectories, and a metric with Diversity and Granularity is proposed to measure the quality of the segmentation. Second, the segmented sub-trajectories are modeled by a sequence learning model. Finally, multi-instance learning is applied to detect abnormal trajectories and sub-trajectories, which are viewed as bags and instances, respectively. We validate the TRASMIL framework with 16 different algorithms built on the three-phase framework. Substantial experiments show that algorithms based on the TRASMIL framework outperform existing methods in detecting trajectories with local anomalies at the level of the whole trajectory. In particular, the MDL-C algorithm (the combination of HDP-HMM with MDL segmentation and Citation kNN) achieves the highest accuracy and recall rates. We further show that TRASMIL is generic enough to adopt other algorithms for identifying local anomalies.

Applying graph theory to clustering, we propose a partitional clustering method and a clustering tendency index. The method requires no initial assumptions about the data set. The number of clusters and the partition that best fits the data set are selected according to the optimal value of the clustering tendency index.

In this paper we propose a clustering algorithm that clusters data with arbitrary shapes without knowing the number of clusters in advance. The proposed algorithm has two stages. In the first stage, a neural network with an ART-like training algorithm is used to cluster the data into a set of multi-dimensional hyperellipsoids. In the second stage, a dendrogram is built to complement the neural network. We then use dendrograms and so-called tables of relative frequency counts to help analysts pick trustworthy clustering results from among many different ones. Several data sets were tested to demonstrate the performance of the proposed algorithm.

Multi-instance multi-label (MIML) learning plays a pivotal role in artificial intelligence research. MIML learning provides a framework in which each data object is described by a bag of instances associated with a set of labels. Within this framework, modeling the connection between instances and labels is the challenging problem. In MIMLRBF, an RBF neural network can capture the complex relations between instances and labels, but estimating the parameters of the RBF network is a difficult task. In this paper, the computational convergence and the modeling accuracy of the RBF network are improved. The study investigates the impact on MIML learning of a novel hybrid algorithm that combines the Gases Brownian Motion Optimization (GBMO) algorithm with a gradient-based, fast-converging parameter estimation method. The hybrid algorithm estimates the RBF neural network parameters (the weights, widths and centers of the hidden units) simultaneously, using the robustness of GBMO to search the parameter space and the efficiency of the gradient. Two real-world MIML tasks and a Corel dataset were used within a two-step experimental design. In the first step, the GBMO algorithm was used to determine the widths and centers of the network nodes. In the second step, for each molecule with fixed inputs and number of hidden nodes, the parameters were optimized by a structured nonlinear parameter optimization method (SNPOM). The findings demonstrate the superior performance of the hybrid method. In addition, the training and testing results show that the hybrid method enhances RBF network learning more efficiently than other conventional RBF approaches and achieves better modeling accuracy than some other algorithms.

In this paper, we derive two novel learning algorithms for time series clustering, namely for learning mixtures of Markov Models and mixtures of Hidden Markov Models. Mixture models are special latent variable models that require local search heuristics such as the Expectation Maximization (EM) algorithm, which can only provide locally optimal solutions. In contrast, we make use of spectral learning algorithms, recently popularized in the machine learning community. Under mild assumptions, spectral learning algorithms are able to estimate the parameters of latent variable models by solving systems of equations via eigendecompositions of matrices or tensors of observable moments. As such, spectral methods can be viewed as an instance of the method of moments for parameter estimation, an alternative to maximum likelihood. Their popularity stems from the fact that they provide a computationally cheap alternative to EM that is free of local optima. We conduct classification experiments on human action sequences extracted from videos, and clustering experiments on motion capture data and network traffic data, to illustrate the viability of our approach. We conclude that, in several sequence clustering applications, spectral methods are a practical and useful alternative to standard iterative techniques such as EM in terms of computational effort and solution quality.

Multitask Bregman clustering   Cited by: 1 (self-citations: 0, by others: 1)
Traditional clustering methods deal with a single clustering task on a single data set. In some newly emerging applications, multiple similar clustering tasks are involved simultaneously. In this case, we not only desire a partition for each task, but also want to discover the relationship among clusters of different tasks. It is also expected that exploiting the relationship among tasks can improve the performance of each individual task. In this paper, we propose general approaches to extend a wide family of traditional clustering models and algorithms to multitask settings. We first formulate multitask clustering in general terms as minimizing a loss function composed of a within-task loss and a task regularization. Then, based on general Bregman divergences, the within-task loss is defined as the average Bregman divergence from a data sample to its cluster centroid, and two types of task regularization are proposed to encourage coherence among the clustering results of the tasks. We further provide a probabilistic interpretation of the proposed formulations from the viewpoint of joint density estimation. Finally, we propose alternating procedures to solve the induced optimization problems. In these procedures, the clustering models and the relationship among clusters of different tasks are updated alternately, and the two phases boost each other. Empirical results on several real data sets validate the effectiveness of the proposed approaches.
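
The within-task loss described above can be made concrete with a short Python sketch. It uses the standard definition of a Bregman divergence, d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>, and instantiates it with phi(x) = ||x||^2, which yields the squared Euclidean distance. The task regularization terms are not shown because the abstract does not define them; function names and data are hypothetical.

```python
def bregman(phi, grad_phi, x, y):
    """Bregman divergence d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner

def sq_norm(x):
    """phi(x) = ||x||^2, which generates the squared Euclidean distance."""
    return sum(v * v for v in x)

def sq_norm_grad(x):
    """Gradient of ||x||^2."""
    return [2 * v for v in x]

def within_task_loss(samples, assignments, centroids, phi, grad_phi):
    """Average Bregman divergence from each sample to its cluster centroid."""
    total = sum(bregman(phi, grad_phi, x, centroids[k])
                for x, k in zip(samples, assignments))
    return total / len(samples)

# Hypothetical task: two samples assigned to a single centroid.
samples = [(0.0, 0.0), (2.0, 0.0)]
loss = within_task_loss(samples, [0, 0], [(1.0, 0.0)], sq_norm, sq_norm_grad)
print(loss)  # 1.0
```

Swapping in a different convex phi (and its gradient) changes the divergence without touching the loss function, which is what lets one formulation cover a family of clustering models.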

Many scientific fields require strong support for massive workflows that use multiple cores simultaneously. To support such large-scale scientific workflows, high-capacity parallel systems such as supercomputers are widely used. To increase the utilization of these systems, most schedulers use a backfilling policy based on the user's estimated runtime. However, these estimates are found to be extremely inaccurate because users overestimate their jobs. Therefore, in this paper, an efficient machine learning approach is presented to predict the runtime of parallel applications. The proposed method is divided into three phases. The first analyzes the important features of the history log data by factor analysis. The second clusters the parallel programs based on the important features. The third builds a prediction model from the pattern similarity of parallel program log data and estimates the runtime. In the experiments, we use workload logs from parallel systems (i.e., NASA-iPSC, LANL-CM5, SDSC-Par95, SDSC-Par96, and CTC-SP2) to evaluate the effectiveness of our approach. Comparing root-mean-square error with other techniques, the experimental results show that the proposed method improves accuracy by up to 69.56%.

Evolutionary semi-supervised fuzzy clustering   Cited by: 3 (self-citations: 0, by others: 3)
To learn a classifier from labeled and unlabeled data, this paper proposes an evolutionary semi-supervised fuzzy clustering algorithm. The class label information provided by the labeled data is used to guide the evolution of each fuzzy partition of the unlabeled data, which plays the role of a chromosome. The fitness of each chromosome is evaluated with a combination of the fuzzy within-cluster variance of the unlabeled data and the misclassification error on the labeled data. The structure of the clusters obtained can be used to classify a future new pattern. The performance of the proposed approach is evaluated using two benchmark data sets. Experimental results indicate that the proposed approach improves classification accuracy significantly compared with a classifier trained on only a small number of labeled data. It also outperforms a similar approach, SSFCM.

Fu Zhi, Wang Hongjun, Li Tianrui, Teng Fei, Zhang Ji. 《软件学报》 (Journal of Software), 2020, 31(4): 981-990
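Clustering is a hot research topic in machine learning, and weakly supervised learning is an important research direction within semi-supervised learning with a wide range of applications. Building on research into clustering and weakly supervised learning, a weakly supervised learning framework based on k labeled samples is proposed. The framework first uses clustering and clustering confidence to expand the set of labeled samples. Next, the energy function of the restricted Boltzmann machine is modified, yielding a restricted Boltzmann machine learning model based on k labeled samples. Finally, inference for the model is derived and the corresponding algorithms are designed. To validate the framework and model, comparative experiments were carried out on public data sets; the results show that the weakly supervised learning framework based on k labeled samples performs well.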

Recognized as one of the most serious security threats to current Internet infrastructure, botnets can be implemented not only with existing well-known applications (e.g. IRC, HTTP, or Peer-to-Peer) but also with unknown or creative applications, which makes botnet detection a challenging problem. Previous attempts to detect botnets mostly examine traffic content for bot commands on selected network links or set up honeypots. Traffic content, however, can be encrypted as botnets evolve, which causes content-based detection approaches to fail. In this paper, we address this issue and propose a new approach for detecting and clustering botnet traffic on large-scale network application communities. We first classify the network traffic into different applications by using traffic payload signatures. A novel decision tree model is then used to classify the traffic that the payload content leaves unknown (e.g. encrypted traffic) into known application communities. Within each community, network traffic is clustered based on n-gram features selected and extracted from the content of network flows, in order to differentiate the malicious botnet traffic created by bots from the normal traffic generated by human beings for each specific application. We evaluate our approach with seven different traffic traces collected on three different network links, and the results show that the proposed approach successfully detects two IRC botnet traffic traces with a high detection rate and an acceptably low false alarm rate.
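
To make the n-gram features concrete, here is a minimal Python sketch that counts byte-level n-grams in a flow payload. Treating n-grams as overlapping byte sequences is an assumption; the abstract does not specify the unit, the value of n, or how features are selected afterwards. The function name and sample payload are hypothetical.

```python
from collections import Counter

def byte_ngrams(payload: bytes, n: int = 2) -> Counter:
    """Count every overlapping n-byte sequence in the payload."""
    return Counter(payload[i:i + n] for i in range(len(payload) - n + 1))

# Hypothetical payload fragment.
features = byte_ngrams(b"abab", n=2)
print(features[b"ab"], features[b"ba"])  # 2 1
```

Frequency vectors built this way could then be fed to a clustering algorithm to compare flows within one application community.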

The modified fuzzy ART and a two-stage clustering approach to cell design   Cited by: 1 (self-citations: 0, by others: 1)
This study presents a new pattern recognition neural network for clustering problems and illustrates its use for machine cell design in group technology. The proposed algorithm modifies the learning procedure and resonance test of the Fuzzy ART neural network. These modifications enable the neural network to process integer values rather than binary-valued inputs or values in the interval [0, 1], and improve its clustering performance. A two-stage clustering approach is also developed in order to obtain an informative and intelligent decision for the problem of designing a machine cell. In the first stage, we identify the part families with very similar parts (i.e., parts whose processing requirements are highly similar); the resulting part families are input to the second stage, which forms the groups of machines. Experimental studies show that the proposed approach leads to better results than those produced by Fuzzy ART and other similar neural network classifiers.
