This paper reports on a collaborative project, currently being carried out by the Centre for English Language Teacher Education and the Warwick Writing Programme at the University of Warwick, England, to compile a multimillion-word corpus of student writing. Since May 2001, we have collected samples of proficient written coursework produced by students at all levels and in a range of disciplines. We believe this student writing collection will eventually provide an invaluable database for researchers and writing teachers, enabling them to identify and describe systematically the characteristics of assigned work across disciplines and levels of study. Our corpus is confined to shorter assignments assessed within departments, which are the most common form of student writing but are unpublished and therefore generally unavailable to researchers. This paper describes the project and explains the rationale for developing the corpus. It also considers the corpus's potential role as a resource for research and teaching within and across subject disciplines.

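Construction and Utilization of a Comprehensive Language Knowledge Base   Total citations: 15, self-citations: 4, citations by others: 15
The scale and quality of a language knowledge base determine the success or failure of a natural language processing system. After 18 years of effort, the Institute of Computational Linguistics at Peking University has accumulated a series of sizable, high-quality language data resources: a grammatical information dictionary of contemporary Chinese, a large-scale corpus with basic annotation, a semantic dictionary of contemporary Chinese, a Chinese concept dictionary, bilingual corpora aligned at different unit levels, term banks for several specialized domains, a phrase-structure rule base for contemporary Chinese, a corpus of classical Chinese poetry, and others. This research will integrate these resources into a single comprehensive language knowledge base. Integrating different language data resources requires bridging the "gaps" between them. Besides a unified, user-friendly interface and a convenient application programming interface, the planned knowledge base will provide software tools that support knowledge mining, so that existing language data resources keep developing from primary products into deeply processed ones. It will also provide several mechanisms for knowledge dissemination and information services, so that the knowledge base can offer comprehensive, multi-level support for language information processing research, core linguistic research, and language teaching.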

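Software architecture for language engineering has gradually become one of the main research areas of language engineering. It targets general-purpose natural language applications and provides them with architecture-level reference solutions. Its research covers architecture-related computational resources, language resources, methods, and applications. In a sense, it can be regarded as a domain-specific software architecture (DSSA) for the language engineering domain. This paper outlines the history and significance of the field, describes and analyzes its basic concepts and the main current research advances, and looks ahead to further development trends.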

Research on Automatic Summarization of Internet News Texts   Total citations: 6, self-citations: 0, citations by others: 6
官礼和 《计算机工程与设计》2007,28(14):3518-3520,F0003
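This paper presents the basic approach and steps for automatic summarization of Chinese Internet news texts and discusses the sentence-splitting and word-segmentation algorithms. Drawing on four formal features of news text that matter for automatic summarization, it proposes a new summarization scheme: first, the four features are combined to assign different weights to words and sentences; then sentences are selected by weight according to a given ratio and smoothed, producing a fluent summary of reasonable quality. Experimental analysis shows that the scheme works well.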
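The weighting-and-selection scheme described above can be illustrated with a minimal sketch. The abstract does not name its four formal features, so the two used here (overlap with the title and a boost for the lead sentence) are assumptions chosen only for illustration, as are the function name `summarize` and the boost factors. Input is assumed to be already split into sentences and segmented into words, and the smoothing step is omitted.

```python
from collections import Counter
import math


def summarize(sentences, title_tokens, ratio=0.3):
    """Pick the highest-weighted sentences and return their indices in document order.

    sentences: list of token lists (already split and segmented).
    title_tokens: tokens of the headline (hypothetical feature source).
    ratio: fraction of sentences to keep.
    """
    # Word weight: term frequency, doubled for words that also occur in the title.
    tf = Counter(tok for sent in sentences for tok in sent)
    title = set(title_tokens)
    word_weight = {w: c * (2.0 if w in title else 1.0) for w, c in tf.items()}

    scores = []
    for i, sent in enumerate(sentences):
        if not sent:
            scores.append(0.0)
            continue
        # Sentence weight: mean word weight, so long sentences are not favoured.
        base = sum(word_weight[t] for t in sent) / len(sent)
        # Position feature (assumed): the lead sentence of a news item gets a boost.
        scores.append(base * (1.5 if i == 0 else 1.0))

    # Keep the requested share of sentences, at least one.
    k = max(1, math.ceil(len(sentences) * ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore document order so the extract reads naturally.
    return sorted(top)
```

Returning indices in document order rather than in score order is a small design choice that helps the extract stay readable before any smoothing is applied.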

Many of the central notions and ultimate goals of human-computer interaction (HCI) and natural language processing (NLP) are common to both disciplines. Both are concerned with communication as a core concept, and both attempt to maximize the naturalness of this communication for the end-user. A central challenge for both is choosing and adapting the appropriate form of communication for the specific user and context at hand. Despite these strong commonalities, we observe very little collaboration, cross-referencing, or even mutual knowledge between the HCI and NLP communities. Surprisingly, although their goals may be very similar, the methods and evaluation frameworks used in research and applied work in the two areas are distinct. We think that it is time to step back and reassess the potential for collaboration between the two disciplines. In this paper, we argue that importing ideas and methods from each discipline into the other can be fruitful, and we review specific areas where this is the case. We argue that cross-fertilization between HCI and NLP is desirable in wider and more fundamental ways than the design of natural language interfaces alone. The reflection presented in this paper is motivated by our own work over the last four years in a team composed of both HCI specialists and specialists in natural language generation (NLG), a subfield of NLP specifically concerned with the automatic generation of language.

Annotating Expressions of Opinions and Emotions in Language   Total citations: 3, self-citations: 0, citations by others: 3
This paper describes a corpus annotation project to study issues in the manual annotation of opinions, emotions, sentiments, speculations, evaluations and other private states in language. The resulting corpus annotation scheme is described, together with examples of its use. In addition, the manual annotation process and the results of an inter-annotator agreement study on a 10,000-sentence corpus of articles drawn from the world press are presented.

This paper describes the LT NSL system (McKelvie et al., 1996), an architecture for writing corpus processing tools. This system is then compared with two other systems which address similar issues, the GATE system (Cunningham et al., 1995) and the IMS Corpus Workbench (Christ, 1994). In particular, we address the advantages and disadvantages of an SGML approach compared with a non-SGML database approach. This revised version was published online in July 2006 with corrections to the Cover Date.

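Research on Morphological Analysis of Uyghur Nouns for Natural Language Information Processing   Total citations: 2, self-citations: 3, citations by others: 2
Nouns are one of the basic word classes in human language. Uyghur is a language with very complex morphology, and nouns are among its morphologically complex word classes. Morphological analysis of nouns is therefore very important both in grammar research and in language information processing. This paper gives a formal description and analysis of the morphological changes of Uyghur nouns (grammatical categories such as number, person, and case). It identifies the basic morphological parameters of Uyghur nouns, summarizes the rules by which these parameters combine and counts the resulting types, and explores a suffix-stripping (stemming) method for Uyghur nouns. This work will provide effective methods and new ideas for the morphological processing of Uyghur nouns.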

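Pre-trained language models such as BERT and XLNet, trained on large-scale unsupervised corpora, usually use a language modeling task based on the cross-entropy loss. The models are evaluated, however, by perplexity or by performance on downstream natural language processing tasks, so problems arise such as a mismatch between the loss function and the evaluation metrics. To address these problems, this paper proposes RL-XLNet (Reinforcement Learning-XLNet), an adversarial pre-trained language model that incorporates reinforcement learning. RL-XLNet uses adversarial training to train a generator that predicts selected words from their context, together with a discriminator that judges whether the words predicted by the generator are correct. The mutual reinforcement of generator and discriminator strengthens the generator's understanding of semantics and improves the model's learning ability. Because text generation involves a sampling step, the final loss cannot be backpropagated directly, so reinforcement learning is used to train the generator. Experiments on the General Language Understanding Evaluation benchmark (GLUE Benchmark) and the Stanford Question Answering Dataset (SQuAD 1.1) show that, compared with the existing BERT and XLNet methods, RL-XLNet has a clear advantage on several tasks: it ranks first on six GLUE tasks, second on one, and third on one, and it achieves the top F1 score on SQuAD 1.1. Given limited computing resources, the model trained on a small corpus also reaches a state-of-the-art level for the field.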
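The abstract does not say which reinforcement learning algorithm is used. One common way to train through a discrete sampling step is the REINFORCE (score-function) estimator, sketched below as an illustration only. All symbols are assumptions introduced here: $p_\theta$ is the generator, $c$ the context, $\hat{w}$ the sampled word, and $r(\hat{w}, c)$ a reward derived from the discriminator's judgement.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\hat{w} \sim p_\theta(\cdot \mid c)}
    \left[ r(\hat{w}, c)\, \nabla_\theta \log p_\theta(\hat{w} \mid c) \right]
```

The gradient flows through the log-probability of the sampled word rather than through the sample itself, which is why a non-differentiable sampling step no longer blocks training.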

张烨  聂一鸣 《智能安全》2023,2(4):100-112
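Large language models generally refer to pre-trained language models with more than ten billion parameters. Trained on large-scale corpora, they not only perform well on natural language processing problems but also show strong capabilities in various vertical domains, and they have become one of the most active research topics in artificial intelligence. This paper first reviews the development of encoder-only, encoder-decoder, and decoder-only large language models, focusing on key techniques such as pre-training and adaptation fine-tuning. It then analyzes the current state of large language model applications in fields such as healthcare, programming, and data generation, as well as the problems of computing resources and model interpretability that arise as models keep growing. Finally, from the perspective of intelligent security, it discusses the potential of the strong text understanding, processing, and generation capabilities of large language models to improve security in areas such as networks and transportation.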

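In recent years, neural networks have gradually replaced traditional machine learning as the basic models for natural language processing tasks because of their strong representational power. Classical neural network models, however, can only handle data in Euclidean space, whereas in natural language processing, discourse structure, syntax, and even sentences themselves exist as graph data. Graph neural networks have therefore attracted wide attention in the research community and have been applied successfully in many areas of natural language processing. On the application of graph neural networks in natural language processing, this paper ...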

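In recent years, deep learning has been widely applied in many fields, and pre-trained models based on deep learning have brought natural language processing into a new era. The goal of pre-training is to put the pre-trained model in a good initial state so that it performs better on downstream tasks. This paper introduces pre-training techniques and their history and surveys them in two groups according to model characteristics: traditional models based on probability and statistics, and newer models based on deep learning. It briefly analyzes the characteristics and limitations of traditional pre-trained models, focuses on deep-learning-based pre-trained models, and compares their performance on downstream tasks. It then picks out newer pre-trained models that are particularly instructive and outlines their improvement mechanisms and the performance gains they achieve on downstream tasks. Finally, it summarizes the problems that pre-trained models currently face and looks ahead to future development trends.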

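This paper proposes an algorithm for automatically extracting a lexicalized tree-adjoining grammar (LTAG) from the Penn Chinese Treebank. The main idea is to induce three types of lexicalized trees from the lexicalized treebank, then use the approach of head-driven phrase structure grammar to extract well-structured lexicalized trees from the corpus automatically, and finally filter out invalid lexicalized trees with linguistic rules. Compared with building an LTAG by hand, the algorithm needs less human effort and avoids the omissions and inconsistencies that manual construction of lexicalized trees can cause.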

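Word embedding refers to the use of machine learning methods to map each word, which lies in a high-dimensional discrete space (with dimension equal to the vocabulary size), to a real-valued vector in a low-dimensional continuous space. In many text processing tasks, word embeddings provide better semantic-level word feature representations and so make those tasks much easier. At the same time, the massive amounts of unlabeled text available in the big-data era, together with advances in machine learning led by deep learning, have made efficient word embedding techniques possible. This paper gives the definition and practical significance of word embedding and surveys several typical current approaches, including methods based on neural networks, methods based on restricted Boltzmann machines, and methods based on factorizing the word-context co-occurrence matrix. It describes in detail the mathematical definition, physical meaning, and training method of each model and compares them.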
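Methods in the co-occurrence family start from a word-by-context matrix. The sketch below, which is an illustration and not taken from the surveyed paper, builds that matrix with positive pointwise mutual information (PPMI) weighting, a common choice of weighting assumed here. The factorization step that would reduce each row to a low-dimensional vector (for example a truncated SVD) is deliberately left out to keep the example dependency-free; the function name `ppmi_vectors` and the `window` parameter are hypothetical.

```python
from collections import Counter
import math


def ppmi_vectors(corpus, window=2):
    """Build sparse PPMI rows {word: {context_word: ppmi}} from tokenized sentences."""
    # Count (word, context) pairs inside a symmetric window around each token.
    pair_counts = Counter()
    for sent in corpus:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, sent[j])] += 1

    # Marginal counts for words and for contexts.
    total = sum(pair_counts.values())
    word_counts, ctx_counts = Counter(), Counter()
    for (word, ctx), n in pair_counts.items():
        word_counts[word] += n
        ctx_counts[ctx] += n

    # PMI = log( P(w, c) / (P(w) P(c)) ); keep only positive values (PPMI).
    vectors = {}
    for (word, ctx), n in pair_counts.items():
        pmi = math.log((n * total) / (word_counts[word] * ctx_counts[ctx]))
        if pmi > 0:
            vectors.setdefault(word, {})[ctx] = pmi
    return vectors
```

Each returned row is still as wide as the vocabulary; factorizing the matrix is what turns these sparse rows into the dense low-dimensional vectors the survey calls word embeddings.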

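This paper proposes a natural language processing system model for the domain of ancient architecture. The model is used in a system that automatically generates animations of ancient buildings, where it computes domain-specific semantic results from simple Chinese descriptions. It has three parts: preprocessing, general semantic computation, and semantic computation oriented to the ancient architecture domain. Word segmentation and syntactic analysis are performed by calling Stanford University's Chinese word segmentation and parsing programs, and general semantic computation is implemented in Prolog. The model finally computes the components of an ancient building together with their assembly order, dimensions, and positions, which constitutes the domain-oriented semantic computation.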

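Automatic recognition of modal particle usage is the core problem of a knowledge base of modern Chinese modal particles. This study uses a rule-based method to investigate the recognition of modal particle usage in several corpora. Starting from the actual usage of modal particles in these corpora, it revises and improves the modal particle usage dictionary and usage rule base. Experimental data show that after these revisions the recognition accuracy of modal particle usage improves in every corpus, which makes the knowledge base more widely applicable.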

In conventional algorithms, the lack of entity information, reference, and semantic relations in current corpora leads to low precision and efficiency when constructing cross-language bilingual mappings. To solve this problem, this paper draws on natural language processing and machine translation techniques to build a parallel corpus for information extraction, based on the OntoNotes corpus, by combining automatic extraction with manual adjustment. To verify the validity of the constructed parallel corpus, a comparative experiment was carried out on it. The entity alignment rate, anaphora absence, and syntactic structure of the corpus were analysed in detail on the basis of statistics. The data set performs well in language processing and machine translation. The parallel corpus constructed in this paper yields highly precise, stable, and efficient information during bilingual mapping, and so provides an effective parallel corpus for research on bilingual mapping in machine translation.

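Language processing methods have a major impact on the performance of visual question answering (VQA) models. These methods originate in natural language processing, but as VQA developed they fell out of step with the state of the art in that field, which has hindered question understanding and answer generation in VQA. Subjectively, the root cause is that researchers underestimate the importance of language processing methods; objectively, it is the scarcity of relevant research literature. To address this, the paper analyzes the value of language processing for VQA, surveys the language processing methods used in VQA and the latest research results, and categorizes these methods, giving researchers a basis for recognizing the importance of language processing. It also discusses how natural language processing techniques can advance language processing methods in VQA and looks ahead to their future development.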

The term ‘corpus’ refers to a large volume of structured datasets containing machine-readable texts that are generated in a natural communicative setting. The explosion of social media allows individuals to spread information freely, with minimal scrutiny or filtering. As a result, the old problem of fake news has resurfaced, and it has become an important concern because of its negative impact on the community. To manage the spread of fake news, automatic recognition approaches using Artificial Intelligence (AI) and Machine Learning (ML) techniques have been investigated. ML approaches have been applied to medicinal text classification tasks and performed quite effectively, but generating the labelled training data still requires a great deal of human effort. Recent progress in Deep Learning (DL) methods appears to offer a promising solution for difficult Natural Language Processing (NLP) tasks, especially fake news detection. An automatic text classifier is highly helpful in NLP for unlocking social media data. This article focuses on the design of the Optimal Quad Channel Hybrid Long Short-Term Memory-based Fake News Classification (QCLSTM-FNC) approach, which aims to distinguish fake news from real news. To achieve this, the approach applies two steps, data pre-processing and GloVe-based word embedding, and then uses the QCLSTM model for classification. To improve the classification results of the QCLSTM model, a Quasi-Oppositional Sandpiper Optimization (QOSPO) algorithm is used to tune the hyperparameters. The proposed QCLSTM-FNC approach was validated experimentally on a benchmark dataset, where it outperformed the existing DL models it was compared with under several measures.

