Similar Documents
20 similar documents found.
1.
This paper presents our research on automatic annotation of a five-billion-word corpus of Japanese blogs with information on affect and sentiment. We first survey existing emotion blog corpora and find that no large-scale emotion corpus has been available for the Japanese language. We choose the largest blog corpus for the language and annotate it using two systems for affect analysis: ML-Ask for word- and sentence-level affect analysis and CAO for detailed analysis of emoticons. The annotated information includes affective features such as sentence subjectivity (emotive/non-emotive) and emotion classes (joy, sadness, etc.), useful in affect analysis. The annotations are also generalized on a two-dimensional model of affect to obtain information on sentence valence (positive/negative), useful in sentiment analysis. The annotations are evaluated in several ways: first, on a test set of a thousand randomly extracted sentences evaluated by over forty respondents; second, by comparing annotation statistics to other existing emotion blog corpora; finally, by applying the corpus in several tasks, such as generation of an emotion object ontology or retrieval of emotional and moral consequences of actions.
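The abstract does not give the exact rule ML-Ask and CAO use to generalize emotion classes to valence; a minimal sketch of the idea, with an illustrative (not the systems' actual) class-to-valence mapping, might look like:

```python
# Illustrative sketch: generalizing per-sentence emotion classes to a single
# valence label on a two-dimensional model of affect. The VALENCE mapping
# below is an assumption for demonstration, not the one used by ML-Ask/CAO.
VALENCE = {
    "joy": "positive", "fondness": "positive", "relief": "positive",
    "sadness": "negative", "anger": "negative", "fear": "negative",
}

def sentence_valence(emotion_classes):
    """Aggregate a sentence's emotion-class annotations into one valence label."""
    votes = [VALENCE[e] for e in emotion_classes if e in VALENCE]
    if not votes:
        return "non-emotive"
    pos, neg = votes.count("positive"), votes.count("negative")
    if pos == neg:
        return "ambivalent"
    return "positive" if pos > neg else "negative"

print(sentence_valence(["joy", "relief"]))           # positive
print(sentence_valence(["joy", "sadness", "fear"]))  # negative
```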

2.
Quality annotated resources are essential for Natural Language Processing. The objective of this work is to present a corpus of clinical narratives in French annotated for linguistic, semantic and structural information, aimed at clinical information extraction. Six annotators contributed to the corpus annotation, using a comprehensive annotation scheme covering 21 entities, 11 attributes and 37 relations. All annotators trained on a small, common portion of the corpus before proceeding independently. An automatic tool was used to produce entity and attribute pre-annotations. About a tenth of the corpus was doubly annotated, and annotation differences were resolved in consensus meetings. To ensure annotation consistency throughout the corpus, we devised harmonization tools that automatically identify annotation differences to be addressed, improving the overall corpus quality. The annotation project spanned 24 months and resulted in a corpus comprising 500 documents (148,476 tokens) annotated with 44,740 entities and 26,478 relations. The average inter-annotator agreement is 0.793 F-measure for entities and 0.789 for relations. The performance of the entity pre-annotation tool reached 0.814 F-measure when sufficient training data was available. The performance of our entity pre-annotation tool shows the value of the corpus for building and evaluating information extraction methods. In addition, we introduced harmonization methods that further improved the quality of annotations in the corpus.
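The paper's exact matching protocol is not given in the abstract; a common way to compute inter-annotator agreement as an F-measure, sketched here, is to treat one annotator as reference and the other as hypothesis, counting entities that match on exact span and type:

```python
# Sketch (not necessarily the paper's exact protocol): pairwise inter-annotator
# F-measure for entity annotation. Entities are (start, end, type) triples and
# match only on exact span and type.
def entity_f_measure(ann_a, ann_b):
    a, b = set(ann_a), set(ann_b)
    tp = len(a & b)               # entities both annotators produced
    if tp == 0:
        return 0.0
    precision = tp / len(b)       # treating A as reference, B as hypothesis
    recall = tp / len(a)
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations over the same clinical snippet:
a = {(0, 12, "Drug"), (20, 28, "Dose"), (40, 55, "Condition")}
b = {(0, 12, "Drug"), (20, 28, "Frequency"), (40, 55, "Condition")}
print(round(entity_f_measure(a, b), 3))  # 0.667
```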

3.
Automatic affect recognition in real-world environments is an important task towards natural interaction between humans and machines. In recent years, several advances have been made in determining emotional states with the use of Deep Neural Networks (DNNs). In this paper, we propose an emotion recognition system that utilizes the raw text, audio and visual information in an end-to-end manner. To capture the emotional states of a person, robust features need to be extracted from the various modalities. To this end, we utilize Convolutional Neural Networks (CNNs) and propose a novel transformer-based architecture for the text modality that can robustly capture the semantics of sentences. We develop an audio model to process the audio channel, and adopt a variation of a high resolution network (HRNet) to process the visual modality. To fuse the modality-specific features, we propose novel attention-based methods. To capture the temporal dynamics in the signal, we utilize Long Short-Term Memory (LSTM) networks. Our model is trained on the SEWA dataset of the AVEC 2017 research sub-challenge on emotion recognition, and produces state-of-the-art results in the text, visual and multimodal domains, and comparable performance in the audio case, when compared with the winning papers of the challenge that use several hand-crafted and DNN features. Code is available at: https://github.com/glam-imperial/multimodal-affect-recognition.

4.
5.
Construction of a Named Entity and Entity Relation Corpus for Chinese Electronic Medical Records
Electronic medical records (EMRs) are written by medical staff to describe the medical care of individual patients, and they contain a wealth of medical knowledge and patient health information. Research on information extraction from EMRs, such as named entity recognition and entity relation extraction, is of great significance for clinical decision support, evidence-based medicine, and personalized medical services, and the construction of an annotated corpus of named entities and entity relations is the first priority. Based on a survey of named entity and entity relation corpus construction for EMRs at home and abroad, and taking the characteristics of Chinese EMRs into account, we propose an annotation scheme for named entities and entity relations suited to Chinese EMRs. Under the guidance and with the participation of physicians, detailed annotation guidelines for named entities and entity relations were formulated, and an annotated corpus with a complete annotation scheme, relatively large scale, and high consistency was constructed. The corpus contains 992 medical record documents; inter-annotator consistency reaches 0.922 for named entities and 0.895 for entity relations. This lays a solid foundation for subsequent research on information extraction from Chinese EMRs.

6.
Event corpora are one of the foundations and key technologies for research on the extraction, representation, reasoning, and mining of event knowledge in the Semantic Web. Taking events as the unit of textual knowledge, this paper builds on LTP analysis and uses the sequential pattern mining algorithm PrefixSpan to mine part-of-speech rules of event elements from an existing small-scale corpus; it expands the trigger-word list with the Tongyici Cilin (Extended Edition) and, combined with a custom dictionary of event elements, proposes an automatic annotation method for constructing a large-scale emergency-event corpus based on the idea of multi-pass filtering with pass-by-pass refinement. In the experiments, the method is compared both with manual annotation and with Stanford CoreNLP NER, with satisfactory results.

7.
As a special group, visually impaired people (VIP) find it difficult to access and use visual information in the same way as sighted individuals. In recent years, benefiting from the development of computer hardware and deep learning techniques, significant progress has been made in assisting VIP with visual perception. However, most existing datasets are annotated for a single scenario and lack sufficient annotations of diverse obstacles to meet the realistic needs of VIP. To address this issue, we propose a new dataset called Walk On The Road (WOTR), which has nearly 190 K objects, with approximately 13.6 objects per image. Specifically, WOTR contains 15 categories of common obstacles and 5 categories of road-judging objects, covering multiple scenarios of walking on sidewalks, tactile pavings, crossings, and other locations. Additionally, we offer a series of baselines by training several advanced object detectors on WOTR. Furthermore, we propose a simple but effective PC-YOLO that obtains excellent detection results on the WOTR and PASCAL VOC datasets. The WOTR dataset is available at https://github.com/kxzr/WOTR.

8.
Effectiveness of hypermedia annotations for foreign language reading
This study first explores intermediate-level English learners' preferences for hypermedia annotations while they are engaged in reading a hypermedia text. Second, it examines whether multimedia annotations facilitate reading comprehension in the second language. The participants were 44 adult learners of English as a foreign language studying English for Academic Purposes. Data were collected through a tracking tool, a reading comprehension test, a questionnaire, and interviews. Results indicate that learners preferred visual annotations significantly more than textual and audio annotations. On the other hand, a negative relationship was found between annotation use and reading comprehension. In particular, pronunciations, audio recordings, and videos were found to affect reading comprehension negatively. However, the qualitative data revealed that the participants had positive attitudes towards annotations and hypermedia reading in general.

9.
An emotional text may be judged to belong to multiple emotion categories because it may evoke different emotions with varying degrees of intensity. For supervised emotion analysis of text, the text corpus must be annotated with emotion categories. Because emotion is a very subjective entity, producing reliable annotation is a prime requirement for developing a robust emotion analysis model, so it is wise to have the data set annotated by multiple human judges and to generate an aggregated data set, provided that the emotional responses given by the different annotators exhibit substantial agreement. In reality, multiple emotional responses to an emotional text are common, so the data set is a multilabel one in which a single data item may belong to more than one category simultaneously. This article presents a new agreement measure to compute interannotator reliability in multilabel annotation. The new reliability coefficient has been applied to measure the quality of an emotion text corpus. The procedure for generating aggregated data and some corpus cleaning techniques are also discussed.
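The article's reliability coefficient is not reproduced in the abstract; as an illustrative baseline for what multilabel agreement measures, this sketch scores two annotators by the mean Jaccard overlap of their per-item label sets:

```python
# Illustrative baseline (not the article's coefficient): multilabel agreement
# between two annotators as the mean Jaccard overlap of their label sets,
# item by item. Identical sets score 1.0, disjoint sets score 0.0.
def mean_jaccard(annot_a, annot_b):
    scores = []
    for la, lb in zip(annot_a, annot_b):
        a, b = set(la), set(lb)
        scores.append(len(a & b) / len(a | b) if a | b else 1.0)
    return sum(scores) / len(scores)

# Hypothetical multilabel emotion annotations over three texts:
a = [{"joy"}, {"sadness", "fear"}, {"anger"}]
b = [{"joy", "surprise"}, {"sadness"}, {"anger"}]
print(round(mean_jaccard(a, b), 3))  # 0.667
```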

10.
Terminology and idiomatic phrases reflect the characteristics of a text, and unsupervised extraction of such feature words supports many natural language processing tasks. This paper proposes a "clustering-verification" process: a topic model clusters the characters in a text, and naturally occurring annotation information is used to verify and filter the extracted strings, yielding an unsupervised method for obtaining a word list from unsegmented domain corpora. Through further optimization and filtering, a high-confidence feature word list rich in terminology and characteristic phrases can be obtained. In experiments on corpora from six domains, including computer science, the feature word lists extracted by this method show good stylistic and domain discrimination.

11.
Given the contemporary trend towards modular NLP architectures and multiple annotation frameworks, the existence of concurrent tokenizations of the same text represents a pervasive problem in everyday NLP practice and poses a non-trivial theoretical problem for the integration of linguistic annotations and their interpretability in general. This paper describes a solution for integrating different tokenizations using a standoff XML format, and discusses the consequences from a corpus-linguistic perspective.
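The core of the standoff idea can be sketched as follows: each tokenization layer is stored separately and refers to the immutable base text by character offsets, so concurrent tokenizations never conflict. The element and attribute names here are illustrative, not the paper's actual format:

```python
# Sketch of standoff XML for concurrent tokenizations: the base text is stored
# once, and each tokenization layer only records character offsets into it.
import xml.etree.ElementTree as ET

base_text = "don't panic"
tokenizations = {
    "tok_a": [(0, 5), (6, 11)],          # ["don't", "panic"]
    "tok_b": [(0, 2), (2, 5), (6, 11)],  # ["do", "n't", "panic"]
}

root = ET.Element("standoff")
ET.SubElement(root, "text").text = base_text
for layer, spans in tokenizations.items():
    layer_el = ET.SubElement(root, "layer", id=layer)
    for start, end in spans:
        ET.SubElement(layer_el, "token", start=str(start), end=str(end))

# Any layer resolves back against the same base text:
for tok in root.find("layer[@id='tok_b']"):
    print(base_text[int(tok.get("start")):int(tok.get("end"))])
```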

12.
Objective: Deep-model-based tracking algorithms usually require large-scale, high-quality annotated training datasets, but manually annotating video data frame by frame costs a great deal of labor and time. This paper proposes a lightweight Transformer-based video annotation algorithm (Transformer-based label network, TLNet) to efficiently annotate large-scale, sparsely annotated video datasets frame by frame. Method: The algorithm uses a Transformer model to process temporal target appearance and motion information and fuses forward and backward tracking results. A quality-assessment subnetwork screens out frames where tracking failed, which are annotated manually; a regression subnetwork refines the initial annotations of the remaining frames, outputting more precise bounding-box annotations. The algorithm generalizes well and is decoupled from the specific tracking algorithm: any existing lightweight tracker can be plugged in to achieve efficient automatic video annotation. Results: Annotations were generated for two large-scale tracking datasets. For the LaSOT (large-scale single object tracking) dataset, the automatic annotation process took only about 43 hours, and the mean intersection over union (mIoU) with the ground-truth annotations rose from 0.824 to 0.871. For the TrackingNet dataset, three tracking algorithms were retrained with the automatic annotations and tested on three datasets; the models trained with our annotations outperformed those trained with the original TrackingNet annotations. Conclusion: TLNet mines temporal target appearance and motion information, performs frame-level quality assessment of the forward and backward tracking results, and further refines the bounding boxes. Decoupled from the specific tracker and highly generalizable, it saves over 90% of manual annotation cost while efficiently generating high-quality video annotations.

13.
In this work we address the challenging case of answering count queries in web search, such as number of songs by John Lennon. Prior methods merely answer these with a single, sometimes puzzling, number or return a ranked list of text snippets with different numbers. This paper proposes a methodology for answering count queries with inference, contextualization and explanatory evidence. Unlike previous systems, our method infers final answers from multiple observations, supports semantic qualifiers for the counts, and provides evidence by enumerating representative instances. Experiments with a wide variety of queries, including an existing benchmark, show the benefits of our method and the influence of specific parameter settings. Our code, data and an interactive system demonstration are publicly available at https://github.com/ghoshs/CoQEx and https://nlcounqer.mpi-inf.mpg.de/.
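CoQEx's full pipeline is not reproduced in the abstract; the core inference step it describes (a final answer inferred from multiple observations rather than a single snippet) can be sketched minimally as follows, with all count values being illustrative only:

```python
# Minimal sketch of count-query inference: aggregate the counts observed
# across many snippets into one robust answer instead of trusting the
# top-ranked snippet. All values below are hypothetical.
from statistics import median

# Counts extracted from different snippets for a query like
# "number of songs by John Lennon":
observations = [72, 180, 229, 72, 150, 160, 72]

def infer_count(obs):
    """Median is robust to outlier snippets quoting a different quantity."""
    return median(obs)

print(infer_count(observations))  # 150
```

A real system would additionally attach semantic qualifiers (e.g. "solo songs" vs. "co-written songs") to explain why snippets disagree, which this sketch omits.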

14.
A New Model for Automatic Image Semantic Annotation
Automatic image semantic annotation based on the correspondence between low-level image features and high-level semantics is a current hot topic in image retrieval research. This paper briefly introduces an image retrieval framework based on an image semantic link network and proposes an automatic image annotation model built on that framework. By accumulating user feedback, the model learns image semantics and performs automatic annotation; image semantics and annotations are updated in real time through interaction with users. A word-sense relatedness analysis method is also proposed to remove redundant annotation words and solve the problem of annotation error propagation. Comparative experiments on the Corel image set verify the effectiveness of the method.

15.
A sememe is defined as the minimal, indivisible semantic unit of human language; the meaning of a word can be represented as a combination of several sememes. In previous work, sememes were manually annotated for words to build the linguistic knowledge base HowNet, through which sememes have been applied to a variety of natural language processing tasks. However, traditional manual annotation is time-consuming and labor-intensive, and annotation by different experts inevitably introduces the annotators' subjective biases, so that the consistency and accuracy of the annotations...

16.
Tremendous advances in different areas of knowledge are producing vast volumes of data, a quantity so large that it has made the development of new computational algorithms necessary. Among the algorithms developed, we find machine learning models and specific data mining techniques that can be useful in all areas of knowledge. The use of computational tools for data analysis is increasingly required, given the need to extract meaningful information from such large volumes of data. However, there are no free-access libraries, modules, or web services that bring together a vast array of analytical techniques in a user-friendly environment for non-specialist users. Those that exist raise high usability barriers for those untrained in the field, as they usually have specific installation requirements and require in-depth programming knowledge, or may prove expensive. As an alternative, we have developed DMAKit, a user-friendly web platform powered by DMAKit-lib, a new library implemented in Python, which facilitates the analysis of data of different kinds and origins. Our tool implements a wide array of state-of-the-art data mining and pattern recognition techniques, allowing the user to quickly build classification, prediction or clustering models, statistical evaluation, and feature analysis of different attributes in diverse datasets without requiring any specific programming knowledge. DMAKit is especially useful for users who have large volumes of data to be analyzed but do not have the informatics, mathematical, or statistical knowledge to implement models. We expect this platform to provide a way to extract information and analyze patterns through data mining techniques for anyone interested in applying them, with no specific knowledge required.
In particular, we present several case studies in the areas of biology, biotechnology, and biomedicine, where we highlight the applicability of our tool in easing the labor of non-specialist users applying data analysis and pattern recognition techniques. DMAKit is available for non-commercial use as an open-access library, licensed under the GNU General Public License, version GPL 3.0. The web platform is publicly available at https://pesb2.cl/dmakitWeb. Demonstrative and tutorial videos for the web platform are available at https://pesb2.cl/dmakittutorials/. Complete URLs for relevant content are listed in the Data Availability section.

17.
Corpus annotation is an important foundational task in corpus construction. Working with Sogou query logs, this paper exploits the structured nature of XML documents to recast corpus annotation as the rewriting of node attributes. Based on the characteristics of the corpus, a set of phrase-annotation specifications and execution principles serving the construction of a phrase dictionary for search engines was formulated, and the tag set and specifications are described in detail. Using these specifications, 145,645 query strings have been annotated, with high annotation quality.

18.
Protein structure visualization tools render images that allow the user to explore structural features of a protein. Context-specific information relating to a particular protein or protein family is, however, not easily integrated and must be uploaded from databases or provided through manual curation of input files. Protein engineers spend considerable time iteratively reviewing both literature and protein structure visualizations manually annotated with mutated residues. Meanwhile, text mining tools are increasingly used to extract specific units of raw text from scientific literature and have demonstrated the potential to support the activities of protein engineers. The transfer of mutation-specific raw-text annotations to protein structures requires integrated data processing pipelines that can coordinate information retrieval, information extraction, protein sequence retrieval, sequence alignment and mutant residue mapping. We describe the Mutation Miner pipeline designed for this purpose and present case-study evaluations of the key steps in the process. Starting with literature about mutations made to three protein families (haloalkane dehalogenase, biphenyl dioxygenase, and xylanase), we enumerate the relevant documents available for text mining analysis, the available electronic formats, and the number of mutations made to a given protein family. We review the efficiency of NLP-driven protein sequence retrieval from databases and report on the effectiveness of Mutation Miner in mapping annotations to protein structure visualizations. We highlight the feasibility and practicability of the approach. Funding project: Ontologies, the semantic web and intelligent systems for genomics. Génome Québec, 630, boul. René-Lévesque Ouest, bureau 2660, Montréal (Québec) H3B 1S6, e-mail: gqinfo@genomequebec.com

19.
In this paper, an unsupervised learning-based approach is presented for fusing bracketed exposures into high-quality images that avoids the need for interim conversion to intermediate high dynamic range (HDR) images. Because an objective quality measure, the colored multi-exposure fusion structural similarity index measure (MEF-SSIMc), is optimized to update the network parameters, unsupervised learning can be realized without using any ground-truth (GT) images. Furthermore, a reference-free gradient fidelity term is added to the loss function to recover and supplement image information in the fused image. As shown in the experiments, the proposed algorithm performs well in terms of structure, texture, and color. In particular, it maintains the order of variations in the original image brightness, suppresses edge blurring and halo effects, and produces good visual results supported by strong quantitative evaluation indicators. Our code will be publicly available at https://github.com/cathying-cq/UMEF.

20.
Large-scale labeled datasets are of key importance for the development of automatic video analysis tools as they, on the one hand, allow training of multi-class classifiers and, on the other, support the algorithms' evaluation phase. This is widely recognized by the multimedia and computer vision communities, as witnessed by the growing number of available datasets; however, research still lacks annotation tools able to meet user needs, since a lot of human concentration is necessary to generate high-quality ground truth data. Moreover, it is not feasible to collect large video ground truths, covering as many scenarios and object categories as possible, through the effort of isolated research groups alone. In this paper we present a collaborative web-based platform for video ground truth annotation. It features an easy and intuitive user interface that allows plain video annotation and instant sharing/integration of the generated ground truths, in order not only to alleviate a large part of the effort and time needed, but also to increase the quality of the generated annotations. The tool has been online for the last four months and, at the current date, we have collected about 70,000 annotations. A comparative performance evaluation has also shown that our system outperforms existing state-of-the-art methods in terms of annotation time, annotation quality and system usability.


Copyright © Beijing Qinyun Technology Development Co., Ltd. 京ICP备09084417号