汉语篇章微观话题结构建模与语料库构建
引用本文:奚雪峰, 褚晓敏, 孙庆英, 周国栋. 汉语篇章微观话题结构建模与语料库构建[J]. 计算机研究与发展, 2017, 54(8): 1833-1852. DOI: 10.7544/issn1000-1239.2017.20170348
作者姓名:奚雪峰  褚晓敏  孙庆英  周国栋
作者单位:1. 苏州大学计算机科学与技术学院,江苏苏州 215000;2. 苏州科技大学计算机科学与工程系,江苏苏州 215009;3. 苏州市虚拟现实智能交互及应用技术重点实验室,江苏苏州 215009 (xfxi@mail.usts.edu.cn)
基金项目:国家自然科学基金项目(61331011,61673290,61472264)
摘    要:篇章话题结构分析是自然语言理解的前沿基础,而大规模高质量的适用于汉语篇章分析的语料资源缺乏,严重制约了相关篇章话题计算模型的研究.针对上述问题,首先研究了汉语篇章话题结构的理论表示体系.分析了主述位理论、英语修辞结构理论和宾州篇章树库体系的优势,结合汉语复句句群理论以及汉语自身特点,提出了一种基于主述位理论的汉语篇章微观话题结构表示方式,并借助微观话题链构建了汉语篇章话题结构表示体系.随后,在此基础上,采用自顶向下、后向搜索的标注策略和人机结合的语料库标注方式,构建了基于篇章微观话题表示体系的汉语篇章话题结构语料库(Chinese discourse topic corpus, CDTC).CDTC共包含500个文档,对其进行了详细统计分析并展示了语料库的标注情况.与宾州篇章树库体系、广义话题结构理论的对比表明,所提篇章微观话题结构表示体系在理论上具有一定的优越性,并且符合汉语特点;一致性检验表明CDTC能够充分体现汉语篇章话题分析问题本身的难度,并能够为相关研究提供语料资源支持.

关 键 词:篇章话题结构  主位-述位理论  主位推进  话题链  语料库构建

Corpus Construction for Chinese Discourse Topic via Micro-Topic Scheme
Xi Xuefeng, Chu Xiaomin, Sun Qingying, Zhou Guodong. Corpus Construction for Chinese Discourse Topic via Micro-Topic Scheme[J]. Journal of Computer Research and Development, 2017, 54(8): 1833-1852. DOI: 10.7544/issn1000-1239.2017.20170348
Authors:Xi Xuefeng  Chu Xiaomin  Sun Qingying  Zhou Guodong
Affiliation:1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215000; 2. Department of Computer Science and Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009; 3. Virtual Reality Key Laboratory of Intelligent Interaction and Application Technology of Suzhou, Suzhou, Jiangsu 215009 (xfxi@mail.usts.edu.cn)
Funding:National Natural Science Foundation of China (61331011, 61673290, 61472264)
Abstract:Discourse topic structure analysis is a fundamental problem at the frontier of natural language understanding. The lack of large-scale, high-quality corpus resources suitable for Chinese discourse analysis has seriously restricted research on computational models of discourse topics. To address this problem, we first study the theoretical representation of Chinese discourse topic structure. Drawing on the strengths of theme-rheme theory, Rhetorical Structure Theory (RST) for English, and the Penn Discourse Treebank (PDTB) scheme, and combining them with research on Chinese complex sentences and sentence groups and with the characteristics of Chinese itself, we propose a micro-topic scheme for Chinese discourse based on theme-rheme theory and build a representation of Chinese discourse topic structure on micro-topic chains. On this basis, we adopt a top-down, backward-search annotation strategy together with a combined human-machine annotation method to construct the Chinese Discourse Topic Corpus (CDTC). CDTC contains 500 documents; we report detailed statistics on the corpus and describe its annotation. A comparison with the OntoNotes corpus and the generalized topic structure theory suggests that the micro-topic scheme has certain theoretical advantages and is consistent with the characteristics of Chinese. Finally, a consistency test shows that CDTC reflects the inherent difficulty of Chinese discourse topic analysis and can provide corpus support for related research.
Keywords:discourse topic structure  theme-rheme theory  thematic progression  topic chain  corpus construction
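The abstract reports a consistency test on the CDTC annotation but does not say which agreement measure was used. As a minimal sketch only, assuming two annotators each assign one categorical label per clause and that Cohen's kappa is the chosen measure (both are assumptions, not statements from the paper), agreement could be computed as follows. The function name `cohen_kappa` and the label values are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' labels over the same items.

    Assumption: this is an illustrative agreement measure; the paper's
    abstract does not state which statistic its consistency test uses.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginal rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: whether a clause's theme links back to an earlier
# theme ("T") or an earlier rheme ("R").
annotator_1 = ["T", "R", "T", "T"]
annotator_2 = ["T", "R", "R", "T"]
print(cohen_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label dominates the annotation.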