
Solving Combinatorial Ambiguity in Chinese Word Segmentation Using Contextual Information
Cite this article: Xiao Yun, Sun Maosong, Benjamin K Tsou. Solving Combinatorial Ambiguity in Chinese Word Segmentation Using Contextual Information[J]. Computer Engineering and Applications, 2001, 37(19): 87-89.
Authors: Xiao Yun (1), Sun Maosong (1), Benjamin K Tsou (2)
Affiliations: 1. Tsinghua University
2. Language Information Sciences Research Centre, City University of Hong Kong
Funding: Supported by the National Key Basic Research Development Program of China (grant No. G1998030507)
Abstract: Combinatorial ambiguity has long been a difficult problem in Chinese word segmentation. This paper treats it as equivalent to word sense disambiguation (WSD). Drawing on the vector space model widely used in WSD research, and based on a detailed examination of 20 typical combinatorial ambiguities, the paper first proposes a divide-and-conquer strategy that handles these ambiguities separately according to their distribution. It then determines by experiment the key factors associated with the feature matrix: the size of the context window, whether positions within the window are distinguished, and how the weights of feature words are estimated. Finally, to address data sparseness, it uses the semantic codes of words to reduce the dimension of the feature matrix. Preliminary results show satisfactory performance, and the authors believe the model may serve as a general solution for resolving combinatorial ambiguities.

Keywords: natural language processing; Chinese computing; Chinese word segmentation; combinatorial ambiguity
Article ID: 1002-8331-(2001)19-0087-03
Revised: 1 June 2001
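
The abstract names three design factors (context window size, position sensitivity, feature weighting) and a dimension-reduction step based on semantic codes. The sketch below is only an illustration of how such a vector-space disambiguator might be wired together; it is not the authors' implementation. The ambiguous string 才能 (one word, "talent", versus 才 + 能, "only then can"), the training contexts, the labels "merged" and "split", the semantic code "H01", and all function names are hypothetical, and raw counts stand in for whatever weighting scheme the paper estimates.

```python
from collections import Counter
import math

def features(left, right, window=3, positional=True, sem_code=None):
    """Context features for one occurrence of an ambiguous string.

    left/right are the segmented words before/after the string.
    positional=True keeps left and right context as separate features.
    sem_code (hypothetical word -> semantic code map) merges words that
    share a code, which shrinks the feature space.
    """
    feats = Counter()
    for side, words in (("L", left[-window:]), ("R", right[:window])):
        for w in words:
            if sem_code is not None:
                w = sem_code.get(w, w)  # fall back to the word itself
            feats[f"{side}:{w}" if positional else w] += 1.0
    return feats

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(examples, **kw):
    """Sum the feature vectors of each class into one vector per label."""
    centroids = {}
    for left, right, label in examples:
        centroids.setdefault(label, Counter()).update(features(left, right, **kw))
    return centroids

def classify(centroids, left, right, **kw):
    """Pick the label whose class vector is closest to the new context."""
    vec = features(left, right, **kw)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Toy, made-up training contexts for the string 才能.
EXAMPLES = [
    (["他", "的"], ["很", "高"], "merged"),
    (["发挥", "自己", "的"], ["和", "智慧"], "merged"),
    (["努力", "学习"], ["成功"], "split"),
    (["只有", "这样"], ["解决", "问题"], "split"),
]
SEM_CODE = {"成功": "H01", "胜利": "H01"}  # hypothetical semantic codes

centroids = train(EXAMPLES)
print(classify(centroids, ["她", "的"], ["出众"]))
```

Training one model per ambiguous string, as done here for a single string, is one simple reading of the "divide-and-conquer" strategy. Passing the same `sem_code` mapping to both `train` and `classify` lets an unseen context word such as 胜利 match a training word that shares its code.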
This article is indexed in CNKI, Wanfang Data, and other databases.