
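The Role of High-Frequency Maximal Crossing Ambiguity Strings in Chinese Automatic Word Segmentation
Cite this article: Sun Maosong (孙茂松), Benjamin K. Tsou (邹嘉彦). The Role of High-Frequency Maximal Crossing Ambiguity Strings in Chinese Automatic Word Segmentation [J]. Journal of Chinese Information Processing (中文信息学报), 1999, 13(1): 28-35.
Authors: Sun Maosong (孙茂松), Benjamin K. Tsou (邹嘉彦)
Affiliations: 1. State Key Laboratory of Intelligent Technology and Systems, Tsinghua University; 2. Language Information Sciences Research Centre, City University of Hong Kong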
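Abstract: Crossing ambiguity strings are an important factor limiting the accuracy of Chinese automatic word segmentation systems. This paper introduces the concept of the maximal crossing ambiguity string and divides such strings into two main types, true and pseudo. Examining a Chinese corpus of about 100 million characters, we find that the high-frequency maximal crossing ambiguity strings show considerable coverage and stability: the top 4,619 have a coverage of 59.20%, and this coverage is little affected by changes of domain. Of these, 4,279 are of the pseudo type, with a coverage as high as 53.35%. Based on this analysis, we propose a memory-based strategy for handling high-frequency maximal crossing ambiguity strings, which can effectively improve the accuracy of practical, unrestricted Chinese automatic word segmentation systems.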
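The figures in the abstract imply a few derived quantities, worked out below as plain arithmetic. The abstract does not say how the remaining 340 strings are classified, so reading them as the non-pseudo remainder is an assumption.

```latex
\begin{aligned}
4{,}619 - 4{,}279 &= 340 && \text{(top strings not of the pseudo type)} \\
59.20\% - 53.35\% &= 5.85\% && \text{(coverage contributed by those 340)} \\
4{,}279 / 4{,}619 &\approx 92.6\% && \text{(share of the top strings that are pseudo)} \\
53.35 / 59.20 &\approx 90.1\% && \text{(share of the top strings' coverage that is pseudo)}
\end{aligned}
```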

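Keywords: Chinese information processing; Chinese automatic word segmentation; high-frequency maximal crossing ambiguity strings; memory-based disambiguation strategy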

The Role of High Frequent Maximal Crossing Ambiguities in Chinese Word Segmentation
Cite this article: Sun Maosong, Zuo Zhengping, Benjamin K Tsou. The Role of High Frequent Maximal Crossing Ambiguities in Chinese Word Segmentation[J]. Journal of Chinese Information Processing, 1999, 13(1): 28-35.
Authors: Sun Maosong, Zuo Zhengping, Benjamin K Tsou
Affiliation: 1. The State Key Laboratory of Intelligent Technology and Systems, Tsinghua University, Beijing; 2. Language Information Sciences Research Centre, City University of Hong Kong
Abstract: Resolving crossing ambiguities is still an open issue in the study of Chinese word segmentation. In this paper, we first introduce the concept of maximal crossing ambiguity and then divide it into two major types, the true and the pseudo. Having examined a Chinese corpus of 100M characters, we find that the high-frequency maximal crossing ambiguities have strong coverage (the top 4,619 cover as much as 59.20%, and 4,279 of these belong to the pseudo type, with a coverage of 53.35%) and are rather stable under domain shift. Consequently, we propose a memory-based strategy for high-frequency maximal crossing ambiguities that is expected to improve the performance of practical Chinese word segmenters significantly.
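The memory-based idea can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the abstract: the function names, the toy lexicon, the example string 为人民 (which overlaps the dictionary words 为人 and 人民), and a simplified reading of "maximal" as the longest span covered by a chain of partially overlapping dictionary words. Forward maximum matching serves only as a stand-in baseline segmenter; the abstract does not name the paper's own.

```python
def crosses(a, b):
    """True if two (start, end) word spans overlap without one containing the other."""
    (s1, e1), (s2, e2) = sorted([a, b])
    return s1 < s2 < e1 < e2


def maximal_crossing_fields(text, lexicon, max_len=4):
    """Return sorted (start, end) spans covered by chains of crossing lexicon words."""
    # Every occurrence of a multi-character lexicon word in the text.
    spans = [(i, j) for i in range(len(text))
             for j in range(i + 2, min(len(text), i + max_len) + 1)
             if text[i:j] in lexicon]
    # Group spans into connected components under the crossing relation.
    groups = []
    for sp in spans:
        hit = [g for g in groups if any(crosses(sp, o) for o in g)]
        rest = [g for g in groups if all(g is not h for h in hit)]
        groups = rest + [[sp] + [o for g in hit for o in g]]
    # A component with at least two words is an ambiguity field; keep its hull.
    hulls = sorted({(min(s for s, _ in g), max(e for _, e in g))
                    for g in groups if len(g) > 1})
    # Drop any hull nested inside another so only maximal fields remain.
    return [h for h in hulls
            if not any(o != h and o[0] <= h[0] and h[1] <= o[1] for o in hulls)]


def fmm(text, lexicon, max_len=4):
    """Forward maximum matching: the baseline used outside remembered fields."""
    out, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if j - i == 1 or text[i:j] in lexicon:
                out.append(text[i:j])
                i = j
                break
    return out


def segment(text, lexicon, memory, max_len=4):
    """Segment text, using a remembered segmentation for each known ambiguity field."""
    tokens, pos = [], 0
    for s, e in maximal_crossing_fields(text, lexicon, max_len):
        tokens += fmm(text[pos:s], lexicon, max_len)
        field = text[s:e]
        # Memory lookup first; fall back to the baseline for unseen fields.
        tokens += memory.get(field) or fmm(field, lexicon, max_len)
        pos = e
    return tokens + fmm(text[pos:], lexicon, max_len)


# Hypothetical toy data, chosen only for illustration.
LEXICON = {"为人", "人民", "服务"}
MEMORY = {"为人民": ["为", "人民"]}  # assumed pseudo-ambiguity entry

print(segment("为人民服务", LEXICON, MEMORY))  # with the memory table
print(segment("为人民服务", LEXICON, {}))      # baseline only
```

With the memory table, the field 为人民 takes its stored segmentation; without it, the baseline greedily picks 为人 and strands 民. A pseudo ambiguity, as the abstract describes it, is one for which a single stored answer suffices, which is why a lookup table can handle it.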
Keywords: Chinese information processing; Chinese word segmentation; maximal crossing ambiguities with high frequency; memory-based disambiguation strategy
This article is indexed in CNKI, VIP (维普), and other databases.