Similar Documents
 20 similar documents found (search time: 31 ms)
1.
Label noise can severely degrade the performance of deep network models. To address this problem, this paper proposes a contrastive-learning-based method for image classification with noisy labels. The method comprises an adaptive threshold, a contrastive learning module, and a class-prototype-based label denoising module. First, contrastive learning maximizes the similarity between two augmented views of an image to extract robust image features. Next, a novel adaptive threshold filters the training samples, dynamically adjusting the threshold for each class according to its learning status during training. A class-prototype-based label denoising module is then introduced, which updates pseudo-labels by computing the similarity between sample feature vectors and prototype vectors, thereby avoiding the influence of label noise. Comparative experiments on the public datasets CIFAR-10 and CIFAR-100 and the real-world dataset ANIMAL10 show that, under synthetic label noise, the proposed method outperforms conventional methods. Updating pseudo-labels via the similarity between robust image features and the class prototypes reduces the negative impact of noisy labels and improves the model's noise robustness, validating its effectiveness.
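The prototype-based relabeling step described above can be sketched as follows. This is a minimal illustration, not the paper's code; the function name and the use of cosine similarity with a hard argmax are assumptions:

```python
import numpy as np

def update_pseudo_labels(features, prototypes):
    """Assign each sample the label of its most similar class prototype.

    features:   (n_samples, d) robust feature vectors from contrastive learning
    prototypes: (n_classes, d) one prototype vector per class
    """
    # Cosine similarity = dot product of L2-normalized vectors
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = f @ p.T                  # (n_samples, n_classes) similarity matrix
    return sim.argmax(axis=1)      # updated pseudo-label per sample
```

In the full method this update would replace only labels judged noisy, rather than overwriting every label unconditionally.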

2.
Unsupervised domain adaptation (UDA), which aims to use knowledge from a label-rich source domain to help learn an unlabeled target domain, has recently attracted much attention. UDA methods mainly concentrate on source classification and distribution alignment between domains in the expectation of correct target predictions. In this paper, we instead attempt to learn the target prediction directly, end to end, and develop a Self-corrected unsupervised domain adaptation (SCUDA) method with probabilistic label correction. SCUDA adopts a probabilistic label corrector to learn and correct the target labels directly. Specifically, besides the model parameters, the target pseudo-labels are also updated during learning and corrected by the anchor variable, which preserves the class candidates for each sample. Experiments on real datasets show the competitiveness of SCUDA.

3.
Most traditional video captioning models adopt an encoder-decoder framework: in the encoding stage, a convolutional neural network processes the video; in the decoding stage, a long short-term memory (LSTM) network generates the corresponding caption. Exploiting the temporal correlation and multimodal nature of video, this paper proposes a hybrid model: a hard-attention-based multimodal video captioning model. In the encoding stage, different fusion models associate the video and audio modalities; in the decoding stage, a hard attention mechanism is added on top of the LSTM to generate the video description. On the MSR-VTT (Microsoft research video to text) dataset, the hybrid model improves machine translation metrics over the baseline by 0.2% to 3.8%. The experimental results indicate that the hard-attention-based multimodal hybrid model can generate accurate descriptive captions for videos.

4.
Objective: Video grounding is an important and challenging task in video understanding: given a natural language query, it must locate the described segment in an untrimmed video. Because the feature representations of the language and video modalities differ greatly, constructing a suitable video-text multimodal representation and localizing the target segment accurately and efficiently is the key difficulty of the task. This paper focuses on optimizing the video-text multimodal representation: it uses motion information in the video to excite the motion semantics in the multimodal features, and performs video grounding in a proposal-free manner. Method: Multiple phrase features are extracted from the natural language query with self-attention and fused with the video features across modalities, yielding multiple multimodal features, each attending to a different semantic phrase. The multimodal representation is then optimized along two axes: 1) on the temporal dimension, skip-connected one-dimensional temporal convolutions model the local context of motion and align semantic phrases with video segments; 2) on the feature channels, motion excitation computes the differences between temporally adjacent multimodal feature vectors to build a channel weight distribution that responds to motion, thereby exciting the channels that encode motion information. A non-local neural network models the dependencies among the phrase-specific multimodal features, and a temporal attention pooling module fuses them into a single feature vector, from which the start and end times of the target segment are regressed. Results: The method is validated on multiple datasets. On the Charades-STA and ActivityNet Captions datasets, the mean intersection over union (mIoU) reaches 52.36% and 42.97%, respectively; Recall@1 (R@1) at IoU thresholds of 0.3, 0.5, and 0.7 reaches 73.79%, 61.16%, and 52.36% on Charades-STA, and 60.54%, 43.68%, and 25.43% on ActivityNet Captions. Compared with LGI (local-global video-text interactions), CPNet (contextual pyramid network), and other methods, the proposed method shows clear performance gains. Conclusion: For video grounding, this paper proposes motion-feature excitation to optimize the video-text multimodal representation; experiments on multiple datasets confirm that motion-excited features better capture the match between video segments and language queries.
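The motion excitation idea above, building channel weights from differences between temporally adjacent feature vectors, can be sketched minimally. This is an illustrative assumption of one plausible form (sigmoid gating over mean absolute temporal differences), not the paper's exact formulation:

```python
import numpy as np

def motion_excitation(x):
    """x: (T, C) multimodal features over T time steps and C channels.
    Channels whose values change more over time receive larger weights."""
    diff = x[1:] - x[:-1]                                 # (T-1, C) temporal differences
    w = 1.0 / (1.0 + np.exp(-np.abs(diff).mean(axis=0)))  # (C,) sigmoid channel weights
    return x * w                                          # motion-excited features
```

A static channel gets weight 0.5 (sigmoid of zero), while motion-responsive channels are amplified toward 1, pushing the representation to emphasize motion semantics.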

5.
Traditional knowledge graph (KG) representation learning focuses on the link information between entities, and its effectiveness is influenced by the complexity of the KG. In a multi-modal knowledge graph (MKG), the introduction of considerable other modal information (such as images and texts) further increases this complexity, which degrades the effectiveness of representation learning. To solve this problem, this study proposes the multi-modal knowledge graph representation learning via multi-head self-attention (MKGRL-MS) model, which improves the effectiveness of link prediction by adding rich multi-modal information to each entity. We first generate a single-modal feature vector for each entity. Then, we use multi-head self-attention to obtain the attention weights of the entity's different modal features during semantic synthesis, and thereby learn the entity's multi-modal feature representation. The new knowledge representation is the sum of the traditional knowledge representation and the entity's multi-modal feature representation. We train our model on top of two existing models and two different datasets, and verify its versatility and effectiveness on the link prediction task.
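The attention-based fusion of modal features can be sketched with a single attention head over the per-modality vectors. This is a simplified, assumed reading of the MKGRL-MS fusion step (single head, mean-pooled output), not the published implementation:

```python
import numpy as np

def modal_attention_fusion(modal_feats):
    """modal_feats: (M, d) one feature vector per modality for an entity.
    Scaled dot-product self-attention over modalities, then pooling,
    yields a single fused multi-modal entity representation."""
    d = modal_feats.shape[1]
    scores = modal_feats @ modal_feats.T / np.sqrt(d)   # (M, M) attention scores
    scores -= scores.max(axis=1, keepdims=True)         # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)             # rows sum to 1
    attended = attn @ modal_feats                       # (M, d) attended features
    return attended.mean(axis=0)                        # (d,) fused entity feature
```

The multi-head version would run several such projections in parallel and concatenate them before pooling.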

6.
In this paper, we tackle the problem of segmenting a sequence of actions out of videos. The videos contain background and actions, which are usually composed of ordered sub-actions. We refer to the sub-actions and the background as semantic units. Considering the possible overlap between two adjacent semantic units, we propose a bidirectional sliding window method to generate label distributions for various segments of the video. Each label distribution covers a certain number of semantic unit labels, representing the degree to which each label describes the video segment. The mapping from a video segment to its label distribution is then learned by a Label Distribution Learning (LDL) algorithm. Based on the LDL model, a soft video parsing method with segmental regular grammars is proposed to construct a tree structure for the video, where each leaf stands for a video clip of background or a sub-action. The proposed method shows promising results on the THUMOS'14, MSR-II and UCF101 datasets, and its computational complexity is much lower than that of the compared state-of-the-art video parsing method.

7.
With the explosive growth of online video, video memorability has become a hot research topic. Video memorability measures how memorable a video is, and computational models that automatically predict it have broad applications and prospects. Most existing work on video memorability prediction focuses on common visual features or semantic factors and does not consider the influence of depth features. This paper explores the video's depth features: after preprocessing, an existing depth estimation model extracts depth maps, and the original frames together with the depth maps are fed into a pretrained ResNet152 network to extract depth features. The TF-IDF algorithm extracts the video's semantic features, assigning different weights to words that influence memorability. The depth features, semantic features, and C3D spatiotemporal features extracted from the video content are combined by late fusion into a multimodal video memorability prediction model. Experiments on the large public dataset (VideoMem) released at MediaEval 2019 achieve a Spearman correlation of 0.545 on the short-term memorability prediction task (0.240 on the long-term task), demonstrating the model's effectiveness.
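The TF-IDF weighting used for the semantic features is standard and can be sketched directly; the smoothing-free variant below is one common formulation, and the exact smoothing choices of the paper are not specified:

```python
import math

def tf_idf(docs):
    """docs: list of token lists (e.g., caption words per video).
    Returns, per document, a dict mapping each term to tf * idf,
    so terms frequent in one caption but rare across the corpus get high weight."""
    n = len(docs)
    df = {}                                   # document frequency per term
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    weights = []
    for doc in docs:
        tf = {t: doc.count(t) / len(doc) for t in set(doc)}
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights
```

A term appearing in every caption gets idf = log(1) = 0, so only discriminative words contribute to the semantic feature.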

8.
Action recognition based solely on video data is known to be very sensitive to background activity, and also lacks the ability to discriminate complex 3D motion. With the development of commercial depth cameras, skeleton-based action recognition is becoming more and more popular. However, the skeleton-based approach is still very challenging because of the large variation in human actions and temporal dynamics. In this paper, we propose a hierarchical model for action recognition. To handle confusing motions, a motion-based grouping method is proposed, which can efficiently assign each video a group label; then, for each group, a pre-trained classifier is used for frame labeling. Unlike previous methods, we adopt a bottom-up approach that first performs action recognition for each frame. The final action label is obtained by fusing the classifications of the frames, with the effect of each frame adaptively adjusted based on its local properties. To achieve online real-time performance and suppress noise, a bag-of-words representation is used for the classification features. The proposed method is evaluated on two challenging datasets captured by a Kinect. Experiments show that our method can robustly recognize actions in real time.

9.
李群  肖甫  张子屹  张锋  李延超 《软件学报》2022,33(9):3195-3209
Video summarization is an essential task in computer vision: it summarizes a video by selecting the most informative parts of its content to generate a concise yet complete summary. The generated summary is usually either a set of representative frames (keyframes) or a shorter video stitched from key segments in temporal order. Although considerable progress has been made, existing methods suffer from a lack of temporal information and incomplete feature representations, which easily harms the correctness and completeness of the summary. To address this, this paper proposes a spatiotemporal transformation network model with three modules: an embedding layer, a feature transformation and fusion layer, and an output layer. The embedding layer embeds spatial and temporal features simultaneously; the feature transformation and fusion layer transforms and fuses the multimodal features; finally, the output layer generates the video summary via segment prediction and key-shot selection. Embedding spatial and temporal features separately compensates for existing models' insufficient representation of temporal information, and transforming and fusing multimodal features resolves the incomplete feature representation. Extensive experiments and analyses on two benchmark datasets validate the effectiveness of the model.

10.
Most video question answering (VideoQA) models embed the video and the question into a common space for answer reasoning, which makes multimodal interaction difficult and preserves video semantics poorly. To address these problems, a video captioning mechanism is proposed to obtain a textual representation of the video's semantic features, thereby avoiding multimodal interaction. The proposed method converts the video features into a descriptive text via the captioning mechanism; the text features then interact with the question features in a reading-comprehension manner for analysis, and the answer is finally inferred. Results on the MSVD-QA and MSRVTT-QA datasets show that the proposed question answering model improves answer accuracy over existing models to varying degrees, indicating that it better accomplishes the VideoQA task.

11.
张艳红  王宝会 《计算机科学》2016,43(4):252-255, 263
Social media networks contain not only data of multiple modalities (users, text, images, and videos) but also group features reflecting the interactions between different modalities. To describe social media networks better and thus serve upper-layer applications, a social media network model based on deep neural networks is proposed. The model uses a deep neural network to learn the data of each single modality, obtaining a latent feature representation for data of any modality. For two different modalities, a generative model reflecting their group features is built from a prior matrix with a Gaussian distribution and the posterior distributions of the two modalities. Experimental results show that the proposed model achieves better prediction in link analysis of the network structure and effectively describes the overall characteristics of social media networks.

12.
Parkinson's disease (PD) is the world's second most common progressive neurodegenerative disease. It is characterized by a combination of various non-motor symptoms (e.g., depression, olfactory and sleep disturbance) and motor symptoms (e.g., bradykinesia, tremor, rigidity), so the diagnosis and treatment of PD are usually complex. Some machine learning techniques automate PD diagnosis and predict clinical scores, and are promising for assessing the stage of pathology and predicting PD progression. However, existing PD research mainly focuses on single-function models (i.e., only classification or prediction) using one modality, which limits performance. In this work, we propose a novel feature selection framework based on multi-modal neuroimaging data for joint PD detection and clinical score prediction. Specifically, a unique objective function is designed to capture discriminative features, which are used to train a support vector regression (SVR) model for predicting clinical scores (e.g., sleep scores and olfactory scores) and a support vector classification (SVC) model for class label identification. We evaluate our method on a dataset of 208 subjects, including 56 normal controls (NC), 123 PD subjects, and 29 scans without evidence of dopamine deficit (SWEDD), via a 10-fold cross-validation strategy. Our experimental results demonstrate that multi-modal data effectively improves performance in disease status identification and clinical score prediction compared to a single modality. Comparative analysis reveals that the proposed method outperforms state-of-the-art methods.

13.
Objective: Facial beauty prediction studies how to give computers a human-like ability to judge or predict facial beauty. However, deep neural networks used for this task tend to overfit noisy-label samples, which harms their generalization. This paper therefore proposes a self-correcting noisy label method for facial beauty prediction. Method: The method consists of a self-training teacher model mechanism and a relabel-and-retrain mechanism. The self-training teacher model mechanism obtains a teacher model by self-training and uses it to help the student model select and train on clean samples, until the student's generalization surpasses the teacher's and the student becomes the new teacher; this process is repeated continuously. The relabel-and-retrain mechanism corrects noisy labels by comparing the maximum predicted probability with the probability predicted for the current label; the corrected data are then used to run the self-training teacher mechanism repeatedly. Results: Experiments on the large-scale facial beauty database LSFBD (large scale facial beauty database) and the SCUT-FBP5500 database show that the method reduces the negative impact of synthetic noisy labels, while achieving accuracies of 60.8% and 75.5% on the original LSFBD and SCUT-FBP5500 databases, respectively, higher than conventional methods. Conclusion: Experiments on LSFBD and SCUT-FBP5500 with synthetic noisy labels, as well as on the original databases, show that the proposed self-correcting noisy label method selects clean samples for learning while making full use of all the data, reduces the negative impact of noisy labels in facial beauty prediction, and improves prediction accuracy.
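The relabeling rule above, comparing the maximum predicted probability with the probability assigned to the current label, can be sketched as follows. The margin threshold is an assumption added for illustration; the paper only states that the two probabilities are compared:

```python
import numpy as np

def relabel(probs, labels, margin=0.2):
    """probs:  (n, K) softmax outputs of the current model
    labels: (n,) current (possibly noisy) integer labels.
    If the model's top prediction beats the current label's probability
    by more than `margin`, replace the label with the top prediction."""
    new_labels = labels.copy()
    pred = probs.argmax(axis=1)
    gap = probs.max(axis=1) - probs[np.arange(len(labels)), labels]
    confident = gap > margin
    new_labels[confident] = pred[confident]
    return new_labels
```

The corrected labels would then feed back into the next round of teacher-student self-training.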

14.
章磊敏  董建锋  包翠竹  纪守领  王勋 《软件学报》2022,33(12):4838-4850
Click-through rate (CTR) prediction is one of the key tasks in video recommender systems: based on the predicted CTR, the system can reorder recommended videos to improve recommendation quality. In recent years, with the explosive growth in the number of videos, the cold-start problem in video recommendation has become increasingly serious. To address this problem, a new video CTR prediction model is proposed, which strengthens CTR prediction by using the video's content features and contextual features; in addition, simulated cold-start training and a neighbor-based surrogate method improve the model's ability to predict CTR for new videos. The proposed model can predict CTR for both old and new videos. Experiments on two real CTR prediction datasets of TV series (Track_1_series) and movies (Track_2_movies) show that the model significantly improves CTR prediction for old videos and surpasses existing models on both datasets; for new videos, models that ignore the cold-start problem achieve an AUC of only about 0.57, while the proposed model reaches 0.645 and 0.615 on the two datasets, demonstrating better robustness to the cold-start problem.

15.
3D shape recognition has been actively investigated in the field of computer graphics. With the rapid development of deep learning, various deep models have been introduced and have achieved remarkable results. Most 3D shape recognition methods are supervised and learn only from a large amount of labeled shapes. However, obtaining such a large training set is expensive and time-consuming. In contrast to these methods, this paper studies a semi-supervised learning framework that trains a deep model for 3D shape recognition using both labeled and unlabeled shapes. Inspired by the co-training algorithm, our method iterates between a model training phase and a pseudo-label generation phase. In the model training phase, we train two deep networks based on the point cloud and multi-view representations simultaneously. In the pseudo-label generation phase, we generate pseudo-labels for the unlabeled shapes using the joint prediction of the two networks, which augments the labeled set for the next iteration. To extract more reliable consensus information from the multiple representations, we propose an uncertainty-aware consistency loss function that combines the two networks into a multimodal network. This not only encourages the two networks to give similar predictions on the unlabeled set, but also eliminates the negative influence of the large performance gap between them. Experiments on the ModelNet40 benchmark demonstrate that, with only 10% labeled training data, our approach achieves performance competitive with the results reported by supervised methods.
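The joint pseudo-label generation step can be sketched as below. The confidence threshold and the simple averaging of the two networks' outputs are assumptions for illustration; the paper's uncertainty-aware loss is more elaborate:

```python
import numpy as np

def joint_pseudo_labels(p1, p2, threshold=0.8):
    """p1, p2: (n, K) softmax outputs of the point-cloud and multi-view networks.
    Average the two predictions and keep a pseudo-label only when the joint
    confidence exceeds `threshold`. Rejected samples are marked -1."""
    joint = (p1 + p2) / 2.0
    labels = joint.argmax(axis=1)
    labels[joint.max(axis=1) < threshold] = -1   # low-confidence samples stay unlabeled
    return labels
```

Accepted pseudo-labels augment the labeled set for the next co-training iteration, while rejected samples wait for a more confident round.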

16.
Video-to-text conversion is a vital activity in the field of computer vision. In recent years, deep learning algorithms have dominated automatic text generation in English, but few research works are available for other languages. In this paper, we propose a novel encoding-decoding system that generates character-level Arabic sentences from isolated RGB videos of Moroccan sign language. The video sequence is encoded through spatiotemporal feature extraction using pose estimation models, while the video's label text is transformed into a sequence of representative vectors. The features and the label vector are joined and processed by a decoder layer to derive the final prediction. We trained the proposed system on an isolated Moroccan Sign Language dataset (MoSLD) composed of RGB videos of 125 MoSL signs. The experimental results show that the proposed model attains the best performance under several evaluation metrics.

17.
王晖  沙基昌  孙晓  陶钧 《计算机仿真》2006,23(12):148-152
For MPEG-4 FGS scalable video traffic, a first-order autoregressive model modulated by a Markov chain is adopted to model its statistical properties; comparison with simulation results for trace traffic demonstrates the model's validity. On this basis, a hierarchical rate control method based on the FGS traffic model is proposed, and its effect on the performance of the RLM layered video multicast protocol is compared in NS-2 against three typical CBR layered traffic models. The simulation results show that using CBR models to emulate MPEG-4 FGS layered traffic introduces considerable error into performance evaluation of the RLM protocol, whereas the proposed hierarchical rate control method based on the FGS traffic model simulates the performance of adaptive video multicast protocols with better accuracy.

18.
Fall incidents have been reported as the second most common cause of death, especially for elderly people, so human fall detection is necessary in smart home healthcare systems. Recently, various fall detection approaches have been proposed, among which computer-vision-based approaches offer a promising and effective way. In this paper, we propose a new framework for fall detection based on automatic feature learning. First, the extracted frames containing humans, taken from video sequences of different views, form the training set. Then, a PCANet model is trained on all samples to predict the label of every frame. Because a fall behavior spans many continuous frames, reliable fall detection should analyze not just one frame but a video sequence. Based on the trained PCANet model's per-frame predictions, an action model is further obtained by training an SVM on the predicted frame labels of the video sequences. Experiments show that the proposed method achieves reliable results compared with other commonly used methods on the multiple-cameras fall dataset, and an even better result on our own dataset, which contains more training samples.

19.
Due to the variability of the wireless channel state, video quality monitoring has become very important for guaranteeing users' Quality of Experience (QoE). QoE represents the overall perceptual quality of a service from the subjective user's perspective. However, because of the diverse characteristics of video content, the Human Visual System (HVS) cannot give the same attention to the whole scene simultaneously when viewing a video sequence. In this paper, we propose a video quality assessment model that considers the influence of fast motion and scene change. A motion change contribution factor and a scene change contribution factor are defined to quantify the characteristics of video content, which are closely related to users' QoE. Based on G.1070, the proposed model accounts for the lossy nature of video coding, the variability of practical networks, and video features. The model also has low computational complexity because the model parameters are estimated in the compressed domain, so video quality is assessed without fully decoding the video stream. The performance of the proposed model has been compared with five existing models, and the results show that our model achieves high prediction accuracy, close to human perception.

20.
Domain adaptation transfers knowledge learned on a source domain to a target domain, enabling effective model training even with little labeled data. Domain adaptation models that use pseudo-labels do not account for the influence of incorrect pseudo-labels, and their classification accuracy for samples near the decision boundary is low. To address these problems, a domain adaptation model based on a weighted classification loss and the nuclear norm is proposed. The model builds an auxiliary domain from trusted pseudo-labeled sample features and truly labeled source-domain sample features, designs a weighted classification loss function on this auxiliary domain to reduce the impact of incorrect pseudo-labels during training, and adds a batch nuclear norm maximization loss to improve classification accuracy near the decision boundary. Comparative experiments against previous models on the Office31, Office-Home, and Image-CLEFDA benchmark datasets show that the proposed model achieves higher accuracy.
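The batch nuclear norm term above can be computed directly from a batch of softmax predictions. A minimal sketch, assuming the standard definition (sum of singular values of the prediction matrix); in training, its negative would be added to the classification loss:

```python
import numpy as np

def batch_nuclear_norm(probs):
    """probs: (B, K) batch of softmax predictions.
    The nuclear norm is the sum of the matrix's singular values; maximizing it
    encourages predictions that are both confident and diverse across the batch,
    which helps samples near the decision boundary."""
    return np.linalg.svd(probs, compute_uv=False).sum()
```

A batch of confident, class-diverse one-hot predictions has a strictly larger nuclear norm than a batch of uniform (maximally uncertain) predictions, which is exactly the behavior the loss rewards.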
