
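Corpus Construction for Chinese Zero Anaphora from Discourse Perspective
Cite this article:KONG Fang, GE Hai-Zhu, ZHOU Guo-Dong. Corpus Construction for Chinese Zero Anaphora from Discourse Perspective[J]. Journal of Software, 2021, 32(12): 3782-3801
Authors:KONG Fang  GE Hai-Zhu  ZHOU Guo-Dong
Affiliation:Laboratory for Natural Language Processing, School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China;Jiangsu Key Laboratory of Computer Information Processing Technology, Suzhou, Jiangsu 215006, China
Funding:National Natural Science Foundation of China (61876118, 61751206);Priority Academic Program Development of Jiangsu Higher Education Institutions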
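Abstract:Zero anaphora is pervasive in Chinese. It plays an important role in many natural language processing tasks, including Chinese-English machine translation, text summarization, and reading comprehension, and it has become an active research topic in the field. This paper proposes a discourse-perspective representation scheme for Chinese zero anaphora, designed to serve discourse analysis. First, each elementary discourse unit (EDU) is examined to decide whether it contains zero elements. Next, zero elements are divided into two classes, core and modifier, according to the role they play in the EDU. Then, taking the discourse rhetorical structure tree of a paragraph as the basic unit for examining coreference, relations are divided into intra-EDU and inter-EDU according to the relative position of the antecedent and the zero element; inter-EDU relations are further divided into four classes according to the status of the antecedent: entity, event, union, and others. Finally, based on this scheme, the 325 texts shared by the Chinese Treebank (CTB), the connective-driven Chinese Discourse Treebank (CDTB), and the OntoNotes corpus were annotated for Chinese zero anaphora, yielding a Chinese zero anaphora corpus that serves discourse analysis. On the one hand, system checks indicate that the proposed scheme is sound and effective and that the corpus is of high quality. On the other hand, a complete benchmark platform for Chinese zero anaphora resolution was built, verifying from a computational standpoint that the corpus can provide the necessary support for discourse-perspective research on Chinese zero anaphora.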

Keywords:zero anaphora  corpus construction  discourse analysis  elementary discourse unit  zero element
Received:2020-05-15
Revised:2020-06-22

Corpus Construction for Chinese Zero Anaphora from Discourse Perspective
KONG Fang,GE Hai-Zhu,ZHOU Guo-Dong. Corpus Construction for Chinese Zero Anaphora from Discourse Perspective[J]. Journal of Software, 2021, 32(12): 3782-3801
Authors:KONG Fang  GE Hai-Zhu  ZHOU Guo-Dong
Affiliation:Laboratory for Natural Language Processing, School of Computer Science and Technology, Soochow University, Suzhou 215006, China;Jiangsu Key Laboratory of Computer Information Processing Technology, Suzhou 215006, China
Abstract:Zero anaphora is pervasive in Chinese and plays an important role in many natural language processing tasks, such as machine translation, text summarization, and machine reading comprehension; it has become an active research topic in the field. To better support discourse analysis, this study proposes a representation scheme for Chinese zero anaphora from the discourse perspective. First, each elementary discourse unit (EDU) is examined to determine whether it contains zero elements. Second, zero elements are divided into two categories, the core type and the modifier type, according to the role they play in the EDU. Third, the discourse rhetorical structure tree of a paragraph is taken as the basic unit for examining zero coreference. According to the positional relationship between the antecedent and the zero element, coreferential relations are classified as Intra-EDU or Inter-EDU. Inter-EDU relations are further divided into four categories according to the status of the antecedent: entity, event, union, and others. Finally, the 325 texts shared by the Chinese Treebank (CTB), the connective-driven Chinese Discourse Treebank (CDTB), and the OntoNotes corpus are annotated for Chinese zero anaphora under this scheme. System evaluation indicates that the resulting corpus is of high quality. Moreover, a complete zero anaphora resolution baseline system is built to show, from a computability perspective, that the proposed representation scheme is appropriate and effective.
Keywords:zero anaphora  corpus construction  discourse analysis  elementary discourse unit  zero pronouns
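
The annotation scheme summarized in the abstract can be pictured as a small data model. The following is a minimal sketch, not the corpus's actual file format: the class names, the field names (`edu_id`, `position`, `antecedent_edu_id`), and the validation rule are all hypothetical and chosen only to mirror the categories the abstract names (core/modifier zero elements, Intra-EDU/Inter-EDU relations, and the four antecedent classes for Inter-EDU relations).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ZeroRole(Enum):
    """Role of a zero element inside its EDU (two classes in the abstract)."""
    CORE = "core"
    MODIFIER = "modifier"


class AntecedentType(Enum):
    """Antecedent status for Inter-EDU relations (four classes in the abstract)."""
    ENTITY = "entity"
    EVENT = "event"
    UNION = "union"
    OTHERS = "others"


@dataclass(frozen=True)
class ZeroElement:
    edu_id: int      # hypothetical: index of the EDU containing the zero element
    position: int    # hypothetical: token offset of the gap inside that EDU
    role: ZeroRole


@dataclass(frozen=True)
class ZeroLink:
    zero: ZeroElement
    antecedent_edu_id: int                      # hypothetical: EDU holding the antecedent
    antecedent_type: Optional[AntecedentType] = None

    def __post_init__(self) -> None:
        # Assumption: the four-way antecedent typing applies to Inter-EDU links only.
        inter = self.antecedent_edu_id != self.zero.edu_id
        if inter and self.antecedent_type is None:
            raise ValueError("Inter-EDU link needs an antecedent type")
        if not inter and self.antecedent_type is not None:
            raise ValueError("Intra-EDU link takes no antecedent type")

    @property
    def scope(self) -> str:
        """Return 'Intra-EDU' or 'Inter-EDU' from the two EDU indices."""
        same_edu = self.antecedent_edu_id == self.zero.edu_id
        return "Intra-EDU" if same_edu else "Inter-EDU"
```

For instance, `ZeroLink(ZeroElement(2, 0, ZeroRole.CORE), 0, AntecedentType.ENTITY).scope` evaluates to `"Inter-EDU"`, because the antecedent sits in a different EDU from the zero element.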
This article is indexed in Wanfang Data and other databases.