Similar Literature
20 similar documents retrieved.
1.
李绍园  姜远 《软件学报》2020,31(5):1497-1510
Traditional multi-label learning requires training data with complete, or at least partial, ground-truth labels, which are expensive and difficult to obtain. Rather than having ground-truth labels supplied by costly and scarce experts, in the crowdsourcing setting a multi-label task is distributed to multiple readily available non-expert annotators, and the learning goal is to estimate the true labels of instances from their erroneous annotations. The key problem is how to aggregate the non-expert annotations. Previous crowdsourcing studies have mostly focused on single-label tasks and ignore label correlations in multi-label tasks, while existing crowdsourcing work on multi-label tasks exploits only local label correlations, such as label co-occurrence probabilities or conditional dependencies between labels, whose estimation is highly sensitive to the number and quality of annotations. Observing that the annotations produced by multiple annotators on a multi-label task exhibit an overall low-rank structure, this paper proposes a method based on low-rank tensor correction. First, the annotations are organized into a three-dimensional tensor (instance, label, annotator), and low-rank tensor completion is applied to preprocess the collected annotations, which serves two purposes: (1) refining the existing annotations and (2) completing the missing annotations of annotators on labels they did not annotate. All annotations are then aggregated; three aggregation strategies are tested, each considering annotation confidence from a different perspective. Experimental results on real-world data verify the effectiveness of the proposed method.
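A minimal sketch of the preprocessing idea, assuming the (instance, label, annotator) tensor is completed through iterated truncated-SVD low-rank approximation of its mode-1 unfolding; the paper's actual tensor-completion algorithm and its three aggregation strategies are not reproduced, and all names and shapes below are illustrative.

```python
import numpy as np

def lowrank_complete(X, mask, rank=5, n_iters=50):
    """Fill unobserved entries of X with an iterated rank-`rank` SVD approximation.

    X    : (m, n) matrix, arbitrary values at unobserved positions
    mask : boolean (m, n) matrix, True where an annotation was actually collected
    """
    Z = np.where(mask, X, 0.0)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        Z_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # best rank-`rank` approximation
        Z = np.where(mask, X, Z_low)                   # keep the observed annotations
    return Z_low

# Toy usage: a binary annotation tensor (instances x labels x annotators),
# completed on its mode-1 unfolding (hypothetical shapes and data).
rng = np.random.default_rng(0)
n_inst, n_lab, n_ann = 30, 5, 8
T = rng.integers(0, 2, size=(n_inst, n_lab, n_ann)).astype(float)
observed = rng.random(T.shape) < 0.6
completed = lowrank_complete(T.reshape(n_inst, -1), observed.reshape(n_inst, -1))
```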

2.
Labeled training data for supervised learning-to-rank are hard to obtain, so crowdsourcing, an emerging model of aggregating work from the general public over the Internet, is introduced to perform the labeling, offering a new way around the time- and labor-intensive annotation of large training sets. The paper first introduces the crowdsourced labeling method and then proposes two personal-classifier models for quality control of the crowdsourced results, taking into account both annotator ability and task difficulty, the two main factors affecting crowdsourcing quality. RankingSVM is then trained on the resulting training set, and performance is measured on the Microsoft OHSUMED dataset under the NDCG@n criterion. Experimental results show that the crowdsourced labeling method achieves an accuracy above 95%, and the learned ranking model performs essentially on par with the RankingSVM algorithm, demonstrating the feasibility and advantages of applying crowdsourcing to learning to rank.

3.
Dynamic online task-assignment strategies have difficulty learning effectively from historical data and do not consider the effect of current decisions on future rewards. To address this, a spatial crowdsourcing task-assignment strategy based on deep reinforcement learning is proposed. First, with maximizing long-term cumulative reward as the objective, the problem is modeled as a Markov decision process from the perspective of a single crowd worker, turning task assignment into estimating the state-action value Q and finding a one-to-one matching between workers and tasks. An improved deep reinforcement learning algorithm is then trained offline on historical task data to build a predictor of Q values. Finally, during dynamic online assignment, the predicted Q values are used in real time as edge weights for the KM (Kuhn-Munkres) algorithm, yielding an assignment that maximizes global cumulative reward. Experiments on a real-world taxi trip dataset show that, when the number of workers is within a certain scale, the proposed strategy improves long-term cumulative reward.
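To make the final matching step concrete, the sketch below treats a matrix of predicted Q values as edge weights and solves the one-to-one worker/task assignment with SciPy's Hungarian-algorithm solver, which plays the same role as the KM algorithm named in the abstract; the offline-trained Q-value predictor that would produce these values is assumed and not shown.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_tasks(q_values):
    """One-to-one worker/task assignment maximizing the sum of predicted Q values.

    q_values : (n_workers, n_tasks) array produced online by the (offline-trained)
               Q-value predictor, which is not shown here.
    """
    rows, cols = linear_sum_assignment(-q_values)   # negate: the solver minimizes cost
    return list(zip(rows, cols)), q_values[rows, cols].sum()

q = np.array([[0.9, 0.2, 0.4],
              [0.3, 0.8, 0.1],
              [0.5, 0.6, 0.7]])
pairs, total_q = assign_tasks(q)   # pairs workers and tasks with maximal total Q
```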

4.
Effective recognition of named entities in Chinese microblogs is important for monitoring public opinion through microblog platforms. However, microblogs are updated rapidly, use informal language, and contain much noise, making named entity recognition costly and inefficient. To address these problems, a crowdsourcing-based method for Chinese microblog named entity recognition is proposed. The ability of each crowd worker is evaluated, and an expectation-maximization (EM) algorithm is used to learn from the estimated ability values, filtering out each annotator's noise and refining the crowdsourced annotations to determine the final named entities. Experimental results show that the method effectively improves the accuracy of named entity recognition in Chinese microblogs.
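The abstract does not spell out the EM model, so the following is a hedged sketch of a simple "one-coin" EM that alternates between estimating annotator accuracies and the posterior over true labels; the paper's actual formulation may differ (for example, it may use full per-annotator confusion matrices).

```python
import numpy as np

def one_coin_em(ann, n_classes, n_iters=30):
    """Jointly estimate annotator accuracies and true labels with one-coin EM.

    ann : (n_items, n_workers) integer matrix of crowd labels, -1 = not labeled.
    Each worker is summarized by a single accuracy value (illustrative only).
    """
    n_items, n_workers = ann.shape
    acc = np.full(n_workers, 0.8)                    # initial accuracy guesses
    for _ in range(n_iters):
        acc = np.clip(acc, 0.01, 0.99)               # keep the logs finite
        # E-step: posterior over the true label of every item
        log_post = np.zeros((n_items, n_classes))
        for w in range(n_workers):
            has = ann[:, w] >= 0
            onehot = np.eye(n_classes)[ann[has, w]]
            p = onehot * acc[w] + (1 - onehot) * (1 - acc[w]) / (n_classes - 1)
            log_post[has] += np.log(p)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: accuracy = expected agreement with the inferred labels
        for w in range(n_workers):
            has = ann[:, w] >= 0
            if has.any():
                acc[w] = post[has, ann[has, w]].mean()
    return acc, post.argmax(axis=1)

ann = np.array([[0, 0, 1],
                [1, 1, 1],
                [0, -1, 0],
                [2, 2, 1]])
acc, labels = one_coin_em(ann, n_classes=3)
```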

5.
Crowdsourcing is an emerging approach to collecting labels for datasets. Although it is inexpensive, it cannot guarantee label quality, and the labels become even less reliable when, for objective reasons, the crowd workers perform poorly. This paper therefore proposes the FA-method, a method for improving crowdsourcing quality through feature augmentation. Its basic idea is as follows: first, a small portion of the data is labeled by experts; a model trained on the crowd-labeled dataset then predicts labels for the expert set, and the predictions are appended to the expert set as new features; a model trained on this feature-augmented expert set is used to estimate, for each instance, the probability of its label being noisy as well as an upper bound on the number of noisy labels, so that instances with potentially noisy labels can be filtered out. The feature-augmentation step is then applied once more to the filtered high-quality set to further correct noise. Validation on eight UCI datasets shows that, compared with existing crowdsourced-label methods that combine noise identification and correction, the proposed method performs well both when repeated labels are few and when annotation quality is low.
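A loose sketch of the feature-augmentation step, assuming scikit-learn and a random forest as a stand-in base model: a classifier trained on the crowd-labeled set adds its class-probability predictions as extra features of the expert set. The noise-probability estimation and filtering rules described in the abstract are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def augment_expert_set(X_crowd, y_crowd, X_expert):
    """Append crowd-model class probabilities as new features of the expert set."""
    base = RandomForestClassifier(n_estimators=100, random_state=0)
    base.fit(X_crowd, y_crowd)                  # learn from the noisy crowd labels
    proba = base.predict_proba(X_expert)        # predictions on the expert instances
    return np.hstack([X_expert, proba])         # feature-augmented expert matrix
```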

6.
Graph-based semi-supervised learning methods used in multimedia research ignore correlations among videos. To address this, a video annotation algorithm based on correlation-aware kernel-mapped linear neighborhood propagation is proposed. The algorithm first computes iterative label-propagation coefficients with a kernel function, using distances adjusted by semi-supervised learning; it then uses the propagation coefficients to obtain sample representations of the low-level feature space and builds an association table between semantic concepts by modeling video correlations; finally, it constructs the neighborhood graph and iteratively propagates the information of labeled videos to unlabeled ones to complete the annotation. Experimental results show that the algorithm not only improves annotation accuracy but also compensates for the shortage of labeled video data.

7.
叶晨  王宏志  高宏  李建中 《软件学报》2020,31(4):1162-1172
Most traditional approaches clean data with machine learning algorithms. Although these methods solve part of the problem, they suffer from high computational difficulty and a lack of sufficient knowledge. In recent years, with the rise of crowdsourcing platforms, more and more studies have brought crowdsourcing into the data cleaning process, using the crowd to supply the knowledge that machine learning needs. Since crowdsourcing is paid work, it is necessary to study how to combine machine learning algorithms with crowdsourcing effectively and at low cost. This paper proposes two active-learning models that support crowdsourcing-based data cleaning, using active learning to reduce crowdsourcing cost; data cleaning of a given dataset is carried out on a real crowdsourcing platform, minimizing cost while improving data quality. Experimental results on real datasets verify the effectiveness of the proposed models.
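The abstract does not specify the two active-learning models, so the following illustrates only the generic idea of spending the crowdsourcing budget on the records the current model is least certain about, using margin-based uncertainty sampling; all names are illustrative.

```python
import numpy as np

def select_for_crowd(proba, budget):
    """Pick the records whose predicted labels the current model is least sure about.

    proba  : (n_records, n_classes) class probabilities from the current model
    budget : number of records we can afford to send to the crowdsourcing platform
    """
    sorted_p = np.sort(proba, axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]   # gap between the top two classes
    return np.argsort(margin)[:budget]           # smallest margin = most uncertain

proba = np.array([[0.55, 0.45], [0.95, 0.05], [0.60, 0.40]])
send = select_for_crowd(proba, budget=2)         # -> indices [0, 2], the least certain rows
```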

8.
Deep learning has shown clear advantages in image recognition. In the preparation stage of training an image-recognition model, building an image dataset requires humans to annotate the information in the images, a process that usually consumes considerable labor and time. To improve the efficiency of data preparation and thus speed up the creation and iteration of deep learning models, a multi-user collaborative crowdsourced image-annotation system based on a microservice architecture is proposed. By splitting the heavy annotation task into smaller subtasks, more people can participate and jointly complete the labeling. Introducing an object-storage mechanism and adopting a microservice architecture improves system performance; GitLab-based continuous integration and continuous deployment were used during development, enabling rapid iteration and deployment of the system and improving the integration efficiency of the microservice system during development.

9.
In recent years, artificial intelligence has developed rapidly, and with the growth of computing power, deep learning has become the new paradigm of AI algorithms. However, deep learning relies on large amounts of accurately labeled data; in realistic multi-class classification scenarios, constrained by annotation cost and privacy protection, such data are often hard to obtain. Inexpensive data collection methods such as mobile crowdsourcing and web crawling have therefore been widely adopted, but they inevitably introduce mislabeling, i.e., label noise. Given the strong data-fitting capacity of deep neural networks, label noise causes overfitting and severely limits the generalization ability of deep learning methods. Most existing research on label noise relies, explicitly or implicitly, on anchor points (samples that definitely belong to a particular class); in real scenarios, however, anchor points are hard to obtain, so existing solutions no longer apply. To solve this problem, this paper casts multi-class label-noise learning as a mixture proportion estimation (MPE) problem and constructs an anchor-free learning algorithm that satisfies statistical consistency. The main contributions include: (1) generalizing the existing R-MPE (regrouping-MPE) method, which applies only to the two-component MPE setting, and proposing, for the multi-component setting, an MPE method that does not rely on the irreducibility assumption, MR-MPE (multi-component oriented R-...

10.
Multi-label answer aggregation estimates the true labels of samples by fusing a large number of non-expert annotations collected through crowdsourcing. Digital cultural heritage data are costly to annotate, have many classes, and are imbalanced, which poses great challenges for multi-label answer aggregation. Previous methods mainly focus on single-label tasks and ignore label correlations in multi-label tasks; although most multi-label aggregation methods consider label correlations to some extent, they are highly sensitive to noise and outliers. To address these problems, AGR-JMF, a multi-label answer aggregation method based on adaptive graph regularization and joint low-rank matrix factorization, is proposed. First, the annotation matrix is decomposed into a clean part and a noise part; an adaptive graph regularization approach then builds the label correlation matrix from the clean annotations; finally, annotation quality, label correlations, and the similarity of annotators' behavioral attributes guide the low-rank matrix factorization to aggregate the multi-label answers. Experiments on real datasets and on a Mogao Grottoes mural dataset show that AGR-JMF clearly outperforms existing algorithms in aggregation accuracy and in identifying spammers.

11.
Traditional supervised learning requires the ground-truth labels for the training data, which can be difficult to collect in many cases. In contrast, crowdsourcing learning collects noisy annotations from multiple non-expert workers and infers the latent true labels through some aggregation approach. In this paper, we notice that existing deep crowdsourcing work does not sufficiently model worker correlations, which previous non-deep-learning approaches have, however, shown to be helpful for learning. We propose a deep generative crowdsourcing learning approach that incorporates the strengths of Deep Neural Networks (DNNs) and exploits worker correlations. The model comprises a DNN classifier as a prior and an annotation generation process. A mixture model of workers' capabilities within each class is introduced into the annotation generation process for worker correlation modeling. For an adaptive trade-off between model complexity and data fitting, we implement fully Bayesian inference. Based on the natural-gradient stochastic variational inference techniques developed for the Structured Variational AutoEncoder (SVAE), we combine variational message passing for conjugate parameters and stochastic gradient descent for DNN parameters into a unified framework for efficient end-to-end optimization. Experimental results on 22 real crowdsourcing datasets demonstrate the effectiveness of the proposed approach.

12.
Ranking items is an essential problem in recommendation systems. Since comparing two items is the simplest type of query for measuring the relevance of items, the problem of aggregating pairwise comparisons to obtain a global ranking has been widely studied. Furthermore, ranking with pairwise comparisons has recently received a lot of attention in crowdsourcing systems, where binary comparative queries can be used effectively to make assessments faster for precise rankings. In order to learn a ranking based on a training set of queries and their labels obtained from annotators, machine learning algorithms are generally used to find the ranking model that best describes the data set. In this paper, we propose a probabilistic model for learning multiple latent rankings from pairwise comparisons. Our novel model can capture multiple hidden rankings underlying the pairwise comparisons. Based on the model, we develop an efficient inference algorithm to learn multiple latent rankings, as well as an effective inference algorithm for active learning that updates the model parameters in crowdsourcing systems whenever new pairwise comparisons are supplied. A performance study with synthetic and real-life data sets confirms the effectiveness of our model and inference algorithms.
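As a concrete, if much simpler, illustration of learning from pairwise comparisons, the sketch below fits a single Bradley-Terry ranking with minorization-maximization updates; the paper's model instead recovers multiple latent rankings, which this sketch does not attempt.

```python
import numpy as np

def bradley_terry(wins, n_iters=200):
    """Fit a single Bradley-Terry ranking from pairwise comparisons via MM updates.

    wins[i, j] : how many times item i was preferred over item j (zero diagonal).
    Assumes every item wins at least one comparison and every item is compared.
    """
    n = wins.shape[0]
    w = np.ones(n)                                  # item strengths
    total = wins + wins.T                           # comparisons per pair
    for _ in range(n_iters):
        for i in range(n):
            denom = (total[i] / (w[i] + w)).sum()   # diagonal term is 0 by assumption
            w[i] = wins[i].sum() / denom
        w /= w.sum()
    return np.argsort(-w), w                        # ranking (best first) and strengths

wins = np.array([[0, 3, 4],
                 [1, 0, 2],
                 [0, 2, 0]])
order, strength = bradley_terry(wins)               # order[0] is the top-ranked item
```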

13.
Probabilistic generative models are an important approach to knowledge representation, and the probabilistic inference problem of computing likelihood functions on such models is generally intractable. Variational inference is an important deterministic approximate inference method with fast convergence and a solid theoretical foundation; with the arrival of the big-data era in particular, variational inference for probabilistic generative models has attracted great attention from both industry and academia. This paper surveys several variational inference frameworks for probabilistic generative models and their latest progress. Specifically, it first reviews the general variational inference framework and the parameter-learning process of generative models based on variational inference; then, for conditionally conjugate exponential-family distributions, it presents the variational inference framework with closed-form updates and the scalable stochastic variational inference built on it; further, for general probability distributions, it presents the black-box variational inference framework based on stochastic gradients and briefly describes the implementations of several variational inference algorithms within this framework; finally, it analyzes structured variational inference, which enriches the variational distribution in different ways to improve inference accuracy and the consistency of the approximation. Future trends of variational inference for probabilistic generative models are also discussed.
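For reference, the quantity that the surveyed frameworks maximize is the evidence lower bound (ELBO), which follows from the standard decomposition of the log marginal likelihood (this is textbook material, not a formula taken from the survey itself):

```latex
\log p(x) \;=\;
\underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z)}{q(z)}\right]}_{\mathrm{ELBO}(q)}
\;+\; \mathrm{KL}\!\left(q(z)\,\|\,p(z \mid x)\right)
\;\;\ge\;\; \mathrm{ELBO}(q).
```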

14.
Multilabel classification is an extension of conventional classification in which a single instance can be associated with multiple labels. Recent research has shown that, just like for conventional classification, instance-based learning algorithms relying on the nearest neighbor estimation principle can be used quite successfully in this context. However, since hitherto existing algorithms do not take correlations and interdependencies between labels into account, their potential has not yet been fully exploited. In this paper, we propose a new approach to multilabel classification, which is based on a framework that unifies instance-based learning and logistic regression, comprising both methods as special cases. This approach allows one to capture interdependencies between labels and, moreover, to combine model-based and similarity-based inference for multilabel classification. As will be shown by experimental studies, our approach is able to improve predictive accuracy in terms of several evaluation criteria for multilabel prediction.

15.
An Energy-Efficient Deep Learning Task Scheduling Strategy for Edge Devices
任杰  高岭  于佳龙  袁璐 《计算机学报》2020,43(3):440-452
In recent years, deep learning has performed well in many fields such as image processing and natural language processing, and mobile applications based on deep learning are developing rapidly. However, owing to the instability of mobile network conditions and limited bandwidth, cloud-based deep-model inference may suffer large response delays, seriously hurting user experience. At the same time, deep models place high demands on computation and storage and cannot be deployed directly on resource-constrained mobile devices. A new computing paradigm is therefore needed so that deep-model-based mobile applications can meet user expectations of fast response, low energy consumption, and high accuracy. This paper proposes a scheduling strategy for deep-model classification tasks oriented to edge devices. By coordinating mobile devices with edge servers, the strategy exploits both the convenience of smart mobile terminals and the strong computing power of edge servers; taking the complexity of classification tasks and user expectations into account, it dynamically deploys the deep model across mobile devices and edge servers and dynamically schedules inference tasks, thereby improving execution efficiency and reducing the inference cost of the deep learning model. Taking CNN-based image recognition as an example, experiments show that in a mobile environment, compared with the most accurate deep model, the proposed energy-efficient scheduling strategy reduces inference energy consumption by 93.2% and inference time by 91.6%, while improving accuracy by 3.88%.

16.
Crowdsourcing services have been proven efficient in collecting large amounts of labeled data for supervised learning tasks. However, the low cost of crowd workers leads to unreliable labels, a new problem for learning a reliable classifier. Although various methods have been proposed to infer the ground truth or to learn from crowd data directly, there is no guarantee that these methods work well for highly biased or noisy crowd labels. Motivated by this limitation of crowd data, in this paper we propose a novel framework for improving the performance of crowdsourcing learning tasks with some additional expert labels: we treat each labeler as a personal classifier, combine all labelers' opinions from a model-combination perspective, and summarize the evidence from crowds and experts naturally via a Bayesian classifier in the intermediate feature space formed by the personal classifiers. We also introduce active learning to our framework and propose an uncertainty sampling algorithm for actively obtaining expert labels. Experiments show that our method can significantly improve learning quality compared with methods that use crowd labels alone.

17.
Construction workplace hazard detection requires engineers to analyze scenes manually against many safety rules, which is time-consuming, labor-intensive, and error-prone. Computer vision algorithms are yet to achieve reliable discrimination of the anomalous and benign object relations underpinning safety violation detection. Recently developed deep-learning-based computer vision algorithms need tens of thousands of images, including labels of the safety rules violated, in order to train deep-learning networks that acquire spatiotemporal reasoning capacity in complex workplaces. Such training processes need human experts to label images and indicate whether the relationships between the workers, resources, and equipment in the scenes violate spatiotemporal arrangement rules for safe and productive operations. False alarms in those manual labels (labeling no-violation images as having violations) can significantly mislead the machine learning process and result in computer vision models that produce inaccurate hazard detections. Compared with false alarms, the other type of mislabel, false negatives (labeling images having violations as "no violation"), seems to have less impact on the reliability of the trained computer vision models. This paper examines a new crowdsourcing approach that achieves above 95% accuracy in labeling images of complex construction scenes having safety-rule violations, with a focus on minimizing false alarms while keeping acceptable rates of false negatives. The development and testing of this new crowdsourcing approach examine two fundamental questions: (1) how to characterize the impact of a short safety-rule training process on the labeling accuracy of non-professional image annotators, and (2) how to properly aggregate the image labels contributed by ordinary people so as to filter out false alarms while keeping an acceptable false negative rate. In designing short training sessions for online image annotators, the research team split a large number of safety rules into smaller sets of six. An online image annotator learns the six safety rules randomly assigned to him or her, and then labels workplace images as "no violation" or "violation" of certain rules among the six learned. About one hundred and twenty anonymous image annotators participated in the data collection. Finally, a Bayesian-network-based crowd consensus model aggregated these labels from annotators to obtain safety-rule violation labeling results. Experiment results show that the proposed model can achieve close to a 0% false alarm rate while keeping the false negative rate below 10%. Such image labeling performance outdoes existing crowdsourcing approaches that use majority votes for aggregating crowdsourced labels. Given these findings, the presented crowdsourcing approach sheds light on effective construction safety surveillance by integrating human risk-recognition capabilities into advanced computer vision.

18.
In traditional crowdsourcing, workers are expected to provide independent answers to tasks so as to guarantee the diversity of answers. However, recent studies have shown that the crowd is not a set of independent workers; instead, workers communicate and collaborate with one another. To pursue more reward with less effort, some workers may collude and submit duplicated answers, which damages the quality of the aggregated results. Nevertheless, little effort has been devoted to the negative impact of collusion on result inference in crowdsourcing. In this paper, we focus on the collusion-proof result inference problem for general crowdsourcing tasks on public platforms. To this end, we design a metric, the worker performance change rate, which identifies colluded answers by computing the difference in mean worker performance before and after removing duplicated answers. We then incorporate the collusion-detection result into existing result-inference methods to guarantee the quality of the aggregated results even in the presence of collusive behavior. We conducted an extensive set of evaluations of our approach on real-world and synthetic datasets. Experimental results demonstrate the superiority of our approach compared with state-of-the-art methods.
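The abstract does not give the exact formula for the worker performance change rate, so the sketch below assumes one natural reading: the relative change in a worker's performance score after the suspected duplicated answers are removed, with large shifts flagging possible colluders. Both the metric definition and the threshold are illustrative assumptions.

```python
import numpy as np

def performance_change_rate(before, after):
    """Relative change in per-worker performance after removing duplicated answers.

    before, after : per-worker performance scores (e.g. agreement with the current
    aggregated labels) computed before and after duplicate answers are removed.
    """
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    return (after - before) / np.maximum(before, 1e-12)

rates = performance_change_rate([0.90, 0.60, 0.85], [0.88, 0.35, 0.84])
suspects = np.where(np.abs(rates) > 0.2)[0]   # flag workers whose score shifted a lot
```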

19.
Crowdsourcing is an effective method to obtain large databases of manually labeled images, which is especially important for image understanding with supervised machine learning algorithms. However, for several kinds of image labeling tasks, e.g., dog breed recognition, it is hard to achieve high-quality results. Further optimization of the crowdsourcing workflow therefore mainly involves task allocation and result inference. For task allocation, we design a two-round crowdsourcing framework that contains a smart decision mechanism based on information entropy to determine whether to perform the second-round task allocation. For result inference, after quantifying the similarity of all labels, two graphical models are proposed to describe the labeling process, and corresponding inference algorithms are designed to further improve the result quality of image labeling. Extensive experiments were conducted on real-world tasks in Crowdflower and on synthetic datasets. The experimental results demonstrate the superiority of these methods in comparison with state-of-the-art methods.
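A minimal sketch of the entropy-based decision for triggering the second round, assuming the first-round votes for an image are simply counted per class; the paper's mechanism, which also quantifies label similarity, is not reproduced, and the threshold is an illustrative assumption.

```python
import numpy as np

def needs_second_round(votes, n_classes, threshold=0.5):
    """Decide whether one image should go to a second round of annotation.

    votes : list of class indices collected in the first round.
    Returns True when the empirical label distribution is still too uncertain.
    """
    counts = np.bincount(votes, minlength=n_classes).astype(float)
    p = counts / counts.sum()
    entropy = -(p[p > 0] * np.log2(p[p > 0])).sum()
    return entropy / np.log2(n_classes) > threshold   # normalized entropy

needs_second_round([2, 7, 5, 2, 1], n_classes=10)      # mixed votes -> True
```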

20.
Over the last few years, multiply-annotated data has become a very popular source of information. Online platforms such as Amazon Mechanical Turk have revolutionized the labelling process needed for any classification task, sharing the effort between a number of annotators (instead of the classical single expert). This crowdsourcing approach has introduced new challenging problems, such as handling disagreements on the annotated samples or combining the unknown expertise of the annotators. Probabilistic methods, such as Gaussian Processes (GP), have proven successful in modelling this new crowdsourcing scenario. However, GPs do not scale well with the training set size, which makes them prohibitive for medium-to-large datasets (beyond 10K training instances). This constitutes a serious limitation for current real-world applications. In this work, we introduce two scalable and efficient GP-based crowdsourcing methods that allow for processing previously prohibitive datasets. The first one is an efficient and fast approximation to a GP with a squared exponential (SE) kernel. The second allows for learning a more flexible kernel at the expense of heavier training (but is still scalable to large datasets). Since the latter is not a GP-SE approximation, it can also be considered a whole new scalable and efficient crowdsourcing method, useful for any dataset size. Both methods use Fourier features and variational inference, can predict the class of new samples, and estimate the expertise of the involved annotators. A complete set of experiments compares them with state-of-the-art probabilistic approaches on synthetic and real crowdsourcing datasets of different sizes. They stand out as the best-performing approach for large-scale problems. Moreover, the second method is competitive with the current state of the art for small datasets.
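Both methods build on random Fourier features; the sketch below shows the standard Rahimi-Recht feature map that approximates the squared-exponential kernel, which is the building block the first method relies on. The crowdsourcing-specific parts of the models (annotator expertise, variational inference) are not shown, and all parameter values are illustrative.

```python
import numpy as np

def random_fourier_features(X, n_features=200, lengthscale=1.0, seed=0):
    """Random Fourier feature map approximating the squared-exponential kernel.

    k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2)) is approximated by
    phi(x) @ phi(y), where phi is the map returned here.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / lengthscale, size=(d, n_features))   # spectral samples
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)               # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = np.random.default_rng(1).normal(size=(5, 3))
Phi = random_fourier_features(X)
approx_K = Phi @ Phi.T   # approximates the SE kernel matrix of X
```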
