
1.

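龙从军, 康才畯, 李琳, 江荻 《中文信息学报》2014,28(5):176-181

Semantic role labeling (SRL) is of great importance to natural language processing. SRL research for English and Chinese has produced many results, yet for Tibetan there are few reports on either resource construction or labeling techniques. Tibetan has relatively rich syntactic markers, which naturally divide a sentence into semantic chunks with different functions, and these chunks correspond to semantic roles to some degree. Based on this property, the paper proposes a semantic-chunk-based labeling strategy that combines rules and statistics. To implement it, the paper first classifies Tibetan semantic roles to obtain a classification scheme for labeling. It then discusses how the labeling rules are obtained, including a hand-written initial rule set and an extended rule set acquired through error-driven learning. On the statistical side, a conditional random field (CRF) model is used with effective linguistic features added. The final labeling results reach a precision of 82.78%, a recall of 85.71%, and an F-score of 83.91%.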

2.

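Word Sense Tagging of Full Pre-Qin Chinese Texts Based on Dictionary Information

张颖杰, 李斌, 陈家骏, 陈小荷 《中文信息学报》2012,26(3):65-72

Word sense disambiguation is a basic task in natural language processing, and the processing of ancient Chinese also urgently needs deep semantic annotation. Targeting pre-Qin ancient Chinese, where training corpora and semantic resources are scarce, the paper uses 《汉语大词典2.0》 as its knowledge source: the sense definitions of each entry serve as sense classes, and the example sentences of each sense serve as training data. A semi-supervised method based on support vector machines (SVM) is then used to sense-tag the full text of 《左传》. Twenty-two words, randomly selected to vary in frequency and in number of senses, were checked manually, with an average accuracy of 67%. The method can be widely applied to sense tagging of ancient Chinese where training corpora are lacking. It can supply initial results at the early stage of full-text sense tagging, provide a good base for manual annotation, remedy the incomplete definitions of traditional dictionaries, and further enrich research materials on the history of Chinese.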

3.

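Extending Semantic Knowledge Resources Based on Distant Supervision

卢达威, 王星友, 袁毓林 《中文信息学报》2016,30(6):147-155

Semantic knowledge resources embody deep linguistic theory and are an important interface between linguistic knowledge and language engineering. Taking an adjective syntactic-semantic dictionary as its object, the paper explores methods for automatically extending semantic knowledge resources. The goal is to use a large-scale corpus to extend the dictionary's word list and the corresponding syntactic patterns. Specifically, dictionary words are grouped into classes by their syntactic patterns, and each new candidate word is mapped by a classifier onto a word in the existing dictionary, which turns dictionary extension into a multi-class classification problem. The underlying principle is the similarity between the syntactic structures of dictionary words and new words in a large-scale corpus. Training data are constructed by distant supervision, which avoids extensive manual annotation. Training combines shallow machine learning methods with deep neural networks. Experimental results show that deep neural networks can learn syntactic structure information and effectively improve matching accuracy.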

4.

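A Method for Assisted Construction of a Semantic Knowledge Base of Aviation Terms

王思博, 王裴岩, 张桂平 《中文信息学报》2018,32(12):57-66

Semantic knowledge bases are foundational resources for natural language processing and are widely used in tasks such as semantic computation and semantic inference. Existing large-scale semantic knowledge bases are almost all general-purpose and lack domain-specific semantic knowledge. To address this gap, the paper proposes, based on the semantic theory of HowNet, a method for assisted construction of a semantic knowledge base of aviation terms. Following the characteristics of aviation terms, the method divides assisted construction into four key processes, and it was used to build 2,000 term concept descriptions (DEF). Finally, the effectiveness of the method is verified by comparing manually annotated inter-term similarity with inter-term similarity computed from the terms' DEFs.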

5.

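Construction and Application of a Word-Sense-Annotated Corpus of Ancient Chinese

舒蕾, 郭懿鸾, 王慧萍, 张学涛, 胡韧奋 《中文信息学报》2022,36(5):21-30

Ancient Chinese consists mainly of monosyllabic words, and polysemy is especially prominent, which makes ancient texts challenging for modern readers. To better analyze and discriminate word senses in ancient Chinese, the study designs sense-division principles for polysemous ancient Chinese words based on the linguistic facts reflected in traditional dictionaries and corpora, compiles sense-level knowledge for common monosyllabic words, and on that basis sense-annotates texts containing polysemous words. The corpus currently contains 38,700 annotated items and more than 1.176 million characters, enriching the language resources for ancient Chinese. Experiments show that a word sense discrimination algorithm based on this corpus and the BERT language model reaches an accuracy of about 80%. Further, using diachronic sense-change analysis and the induction of sense families (义族) as case studies, the paper makes a preliminary exploration of applying the corpus and word sense disambiguation techniques to linguistic research and dictionary compilation.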

6.

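Building a Knowledge Base of Chinese Verb Semantic Roles from the Perspective of Ternary Collocation

王诚文, 钱青青, 荀恩东, 邢丹, 李梦, 饶高琦 《中文信息学报》2020,34(9):19-27

Verb semantic roles have long been a focus and a difficulty in linguistics in China and abroad. In natural language processing, related language resources are gradually being built. For Chinese, most work in China has concentrated on semantic role labeling. The paper proposes a novel ternary-collocation form for representing knowledge about verb semantic roles and, building on earlier research, a classification scheme for semantic roles. Guided by this scheme, semantic roles were exhaustively identified and the related knowledge processed for Chinese verbs, in order to build a knowledge base of Chinese verb semantic roles. To date the project has examined 5,260 verbs, processed semantic roles and their introducing words (引导词) for 2,685 of them, and identified 4,307 semantic roles.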

7.

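Chinese Semantic Role Labeling Fusing Multi-Level Features

王一成, 万福成, 马宁 《智能系统学报》2020,15(1):107-113

With the rapid development of artificial intelligence and Chinese information processing, natural language processing research has moved into semantic understanding, where Chinese semantic role labeling is a core technique. In Chinese information processing, where statistical machine learning still dominates, traditional labeling methods depend heavily on how well the syntax and semantics of a sentence are parsed, which limits labeling accuracy and no longer meets current needs. To address this, the paper improves a Bi-LSTM-based baseline model for Chinese semantic role labeling: max pooling is combined in the model's post-processing stage, and multi-level linguistic features such as lexical and sentence-pattern features are incorporated during training. Several groups of experiments, together with supporting linguistic analysis, show that the targeted improvements significantly raise labeling accuracy, demonstrating that incorporating relevant linguistic features into a Bi-LSTM semantic role labeling model combined with max pooling improves labeling performance.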

8.

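Design of the Banzhida (班智达) Tibetan Tagging Dictionary (total citations: 1; self-citations: 0; citations by others: 1)

才智杰, 才让卓玛 《中文信息学报》2010,24(5):46-50

Corpus processing is a large language engineering undertaking in which word segmentation and tagging is the most basic work. The segmentation and tagging dictionary is an important component of a tagging system, and the quality of its design directly affects the speed and efficiency of segmentation and tagging. Based on the design of the State Language Commission project 《班智达藏文自动标注系统》, the paper gives the structure of the segmentation and tagging dictionary and the index lookup algorithm for it. On an 850,000-byte Tibetan test corpus, segmentation accuracy reached 99% and tagging accuracy 97%.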

9.

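Construction and Complexity Analysis of a Large-Scale Chinese Semantic Network Based on Word Vector Space

曹茂元, 陈毅东, 姚为龙, 杨雪娇 《数字社区&智能家居》2014,(32):7703-7709,7711

Current approaches to language networks at the semantic level of Chinese are limited to two methods, generation from static dictionaries and manual construction, both of which have major limitations. Starting from a semantic space generated from a large-scale corpus, and combining the rich semantic information of that space with semantic-class dictionary resources, the paper proposes a new strategy for building semantic networks based on distributional semantics. On this basis it examines the statistical properties of semantic networks built from semantic spaces of different kinds. Compared with earlier methods, the proposed method requires no manual annotation and supports automatic network construction from large-scale, dynamic corpora. Experimental results show that the semantic networks exhibit two typical properties of complex networks: the small-world effect and the scale-free property. In addition, because a semantic network describes the most essential semantic relations between words, which are not directly related to wording, usage habits, or style in different genres, some statistical properties of the network may converge once the number of nodes reaches a certain scale.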
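As an illustration of the kind of analysis this abstract describes, here is a minimal sketch of building a word network from vectors and measuring a local clustering coefficient, one of the statistics commonly used when checking for small-world behaviour. The abstract does not state how edges are created, so the cosine-similarity threshold, the function names, and the toy two-dimensional vectors are all assumptions made for illustration only.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_network(vectors, threshold):
    """Link two words when their similarity reaches the threshold (assumed rule)."""
    graph = {word: set() for word in vectors}
    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def clustering_coefficient(graph, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return 2 * links / (k * (k - 1))

# Toy vectors, invented for this example.
vectors = {
    "cat": (1.0, 0.1), "dog": (0.9, 0.2), "pet": (1.0, 0.3),
    "car": (0.1, 1.0), "bus": (0.2, 0.9),
}
graph = build_network(vectors, threshold=0.9)
cat_clustering = clustering_coefficient(graph, "cat")
```

A fuller small-world check would also compare the average shortest path length against a random graph of the same size, which is omitted here to keep the sketch short.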

10.

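Vietnamese Frame Semantic Annotation Based on a News Corpus

林丽 《中文信息学报》2013,27(6):201-209

Vietnam is an important neighbor of China, and the processing of massive Vietnamese-language information is becoming increasingly important. Drawing on research and practice in frame semantic annotation in China and abroad, and building on our work on constructing a Vietnamese news corpus and on word segmentation, part-of-speech tagging, and named entity annotation of Vietnamese text, we attempt to build a Vietnamese frame semantic knowledge base and make a preliminary exploration of applying frame semantic annotation to event extraction from Vietnamese news.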

11.

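Extracting a Chinese-English Bilingual Dictionary from Encyclopedia Corpora

王星, 单力秋, 侯磊, 于济凡, 陈吉, 陶明阳 《中文信息学报》2021,35(1):25-33

Bilingual dictionaries are a very important resource for cross-lingual natural language processing. Current extraction methods are mainly based on parallel corpora or comparable corpora, but both suffer from scarce bilingual resources when extracting new words or certain technical terms. By contrast, methods based on partially bilingual corpora draw on news or encyclopedic knowledge and can therefore handle this problem well. However, these methods currently focus on extracting from the body text and neglect the parts outside it. To address this shortcoming, the paper takes Chinese and English as an example and proposes a method for extracting a Chinese-English bilingual dictionary from encyclopedia corpora. On top of body-text extraction, the method exploits the structural features of online encyclopedias and applies five different extraction methods to the encyclopedia corpus. After merging and de-duplication, 969,308 bilingual entries are obtained. Compared with earlier methods based on partially bilingual corpora, the number of entries extracted from online encyclopedia corpora increased by 170.75%.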

12.

Exploring and exploiting a historical corpus for Arabic

Bassam Hammo, Sane Yagi, Omaima Ismail, Mohammad AbuShariah 《Language Resources and Evaluation》2016,50(4):839-861

This paper presents a historical Arabic corpus named HAC. At this early stage of the project, we report on the design, the architecture, and some of the experiments we have conducted on HAC. The corpus, and accordingly the search results, will be represented using a primary XML exchange format. This will serve as an intermediate exchange tool within the project and will allow the user to process the results offline using external tools. HAC is made up of Classical Arabic texts that cover 1600 years of language use, the Quranic text, Modern Standard Arabic texts, and a variety of monolingual Arabic dictionaries. The development of this historical corpus helps linguists and Arabic language learners explore, understand, and discover interesting knowledge hidden in millions of instances of language use. We used techniques from the field of natural language processing to process the data and a graph-based representation for the corpus. We provided researchers with an export facility to make further linguistic analysis possible.

13.

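An Algorithm for Handling Unknown Words in Part-of-Speech Tagging (total citations: 6; self-citations: 0; citations by others: 6)

张孝飞, 陈肇雄, 黄河燕, 蔡智 《中文信息学报》2003,17(5):2-6

Part-of-speech ambiguity is an important class of ambiguity that natural language understanding must resolve, and it is especially hard to handle for unknown words. Based on the hidden Markov model (HMM), the paper proposes a new method for handling unknown words in part-of-speech tagging by converting the tagging of an unknown word into the problem of estimating its lexical emission probability. Apart from a tagged monolingual corpus, the method uses no other resources (such as grammar dictionaries or grammar rules). Accuracy is about 97% in the closed test and about 95% in the open test, which is essentially good enough for practical use. Comparative results against other HMM-based tagging methods are also given and show that the proposed method considerably improves tagging accuracy.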
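To make the idea of treating unknown words through the emission probability concrete, the following is a minimal Viterbi sketch. The abstract does not say how the emission probability of an unknown word is estimated, so the per-tag fallback table `unk_emit`, the toy probabilities, and the function name are assumptions for illustration and not the authors' estimator.

```python
import math

def viterbi(words, tags, start, trans, emit, unk_emit):
    """Most likely tag sequence under an HMM with an unknown-word fallback."""
    def emission(tag, word):
        # A word seen under any tag uses the trained table; otherwise the
        # hypothetical per-tag fallback stands in for P(word | tag).
        if any(word in emit[t] for t in tags):
            return emit[tag].get(word, 0.0)
        return unk_emit[tag]

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    best = {t: log(start[t]) + log(emission(t, words[0])) for t in tags}
    back = []
    for word in words[1:]:
        prev, best, pointers = best, {}, {}
        for t in tags:
            # Best previous tag for arriving at tag t.
            p = max(tags, key=lambda s: prev[s] + log(trans[s][t]))
            best[t] = prev[p] + log(trans[p][t]) + log(emission(t, word))
            pointers[t] = p
        back.append(pointers)

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy model, invented for this example.
tags = ["N", "V"]
start = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"dogs": 0.5, "cats": 0.5}, "V": {"chase": 1.0}}
unk_emit = {"N": 0.6, "V": 0.4}

tagged = viterbi(["dogs", "chase", "zorbs"], tags, start, trans, emit, unk_emit)
```

In this toy run the out-of-vocabulary token "zorbs" is tagged from the transition out of the preceding verb together with the fallback emission, which is the general mechanism the abstract points to.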

14.

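Design and Development of a Corpus of Ancient Traditional Chinese Medicine Literature (total citations: 3; self-citations: 2; citations by others: 1)

刘耀, 段慧明, 王惠临, 周扬, 王振国, 李宏展 《中文信息学报》2008,22(4):24-30

A domain-specific corpus is an important and indispensable basis for natural language processing of domain literature, and a necessary route to a deep grasp of the content and intent of specialized texts. By analyzing the research background, the paper clarifies why natural language processing of specialized literature is necessary. Based on an analysis of the characteristics of research on specialized-literature corpora, it discusses in depth the design ideas and principles of such a corpus and studies the tagging information for word classes in the corpus. An auxiliary processing system for domain-specific corpora was developed, providing theoretical guidance and technical support for building such corpora.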

15.

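Traditional Chinese Spelling Error Detection Based on a String-Segmentation Statistical Dictionary

王勇, 顾磊 《计算机应用研究》2016,33(5)

The paper studies spelling error detection for Traditional Chinese and proposes a detection method based on a statistical dictionary of segmented character strings. The frequency of character strings in a corpus serves as the basis for detection: a statistical dictionary is built from the strings and their frequencies, and a detection algorithm based on statistical rule judgments is designed. The 1,000-sentence test set used for error detection evaluation in the Chinese spelling check task of the SIGHAN7 conference serves as the experimental test set, and the results are compared with those submitted to that conference. The experiments show that, compared with detection methods based on complex language models, the method is simple to implement yet detects errors well, achieving high accuracy and precision and a low false-alarm rate.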
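The general idea of using string frequencies as evidence of an error can be sketched as follows. This is an illustrative approximation only: the abstract does not give the segmentation scheme or the statistical rules, so the fixed-length character n-grams, the `min_count` threshold, and the three-sentence corpus are assumptions.

```python
from collections import Counter

def build_dictionary(corpus, n=2):
    """Count every character n-gram that occurs in the reference corpus."""
    counts = Counter()
    for sentence in corpus:
        for i in range(len(sentence) - n + 1):
            counts[sentence[i:i + n]] += 1
    return counts

def flag_suspects(sentence, counts, n=2, min_count=1):
    """Return start positions of n-grams seen fewer than min_count times."""
    return [
        i for i in range(len(sentence) - n + 1)
        if counts[sentence[i:i + n]] < min_count
    ]

# Toy corpus, invented for this example.
corpus = ["我們今天去學校", "他們今天去公園", "我們明天去學校"]
counts = build_dictionary(corpus)

# The last character is a deliberate misspelling of 校.
suspects = flag_suspects("我們今天去學笑", counts)
```

Each returned index marks the start of a character pair that the reference corpus never attested, which is where a reviewer or a later correction step would look first.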

16.

The parallel corpus for information extraction based on natural language processing and machine translation

Honghua He 《Expert Systems》2019,36(5)

In conventional algorithms, the lack of entity information, reference, and semantic relations in current corpora leads to low precision and efficiency when constructing cross-language bilingual mappings. To solve this problem, this paper draws on natural language processing and machine translation technology to establish a parallel corpus for information extraction based on the OntoNotes corpus, combining automatic extraction with manual adjustment. To verify the validity of the parallel corpus, a comparative experiment was carried out on it. The entity alignment rate, anaphora absence, and syntactic structure of the corpus were analysed in detail based on statistics. The data set performs well in language processing and machine translation. The parallel corpus constructed in this paper can produce highly precise, stable, and efficient information in the process of bilingual mapping, which provides an effective parallel corpus for the study of bilingual mapping in machine translation.

17.

Chinese New Word Identification: A Latent Discriminative Model with Global Features (total citations: 2; self-citations: 0; citations by others: 2)

孙晓, 黄德根, 宋海玉, 任福继 《计算机科学技术学报》2011,26(1):14-24

Chinese new words are particularly problematic in Chinese natural language processing. With the fast development of the Internet and the information explosion, it is impossible to get a complete system lexicon for applications in Chinese natural language processing, as new words outside the dictionaries are always being created. The procedures of new word identification and POS tagging are usually separated, and the features of lexical information cannot be fully used. A latent discriminative model, which combines the stren...

18.

Compilation of an idiom example database for supervised idiom identification

Chikara Hashimoto, Daisuke Kawahara 《Language Resources and Evaluation》2009,43(4):355-384

Some phrases can be interpreted in their context either idiomatically (figuratively) or literally. The precise identification of idioms is essential in order to achieve full-fledged natural language processing. Because of this, the authors of this paper have created an idiom corpus for Japanese. This paper reports on the corpus itself and the results of an idiom identification experiment conducted using the corpus. The corpus targeted 146 ambiguous idioms, and consists of 102,856 examples, each of which is annotated with a literal/idiomatic label. All sentences were collected from the World Wide Web. For idiom identification, 90 out of the 146 idioms were targeted and a word sense disambiguation (WSD) method was adopted using both common WSD features and idiom-specific features. The corpus and the experiment are both, as far as can be determined, the largest of their kinds. It was discovered that a standard supervised WSD method works well for idiom identification and it achieved accuracy levels of 89.25 and 88.86%, with and without idiom-specific features, respectively. It was also found that the most effective idiom-specific feature is the one that involves the adjacency of idiom constituents.

19.

Automated access to a large medical dictionary: online assistance for research and application in natural language processing

A T McCray, S Srinivasan 《Computers and biomedical research》1990,23(2):179-198

Online dictionaries can be important tools for research and application in natural language processing. This paper describes work with a machine-readable version of "Dorland's Illustrated Medical Dictionary". First the characteristics of the dictionary are briefly described, and then the complex process of converting the tape to an online interactive dictionary is discussed. The results of several experiments in automatically deriving information from the online dictionary are presented, and the paper ends with a discussion of the use of the online dictionary as a tool in the development of a natural language processing system designed for the biomedical domain.

20.

Dictionary word sense distinctions: An enquiry into their nature

Adam Kilgarriff 《Computers and the Humanities》1992,26(5-6):365-387

The word senses in a published dictionary are a valuable resource for natural language processing and textual criticism alike. In order that they can be further exploited, their nature must be better understood. Lexicographers have always had to decide where to say a word has one sense, where two. The two studies described here look into their grounds for making distinctions. The first develops a classification scheme to describe the commonly occurring distinction types. The second examines the task of matching the usages of a word from a corpus with the senses a dictionary provides. Finally, a view of the ontological status of dictionary word senses is presented. Adam Kilgarriff has recently completed his doctoral thesis, entitled Polysemy, available as CSRP 261, from the School of Cognitive and Computing Science, University of Sussex. He is now working on the preparation of database versions of dictionaries for language research for Longman Dictionaries.