
Automatic New Words Detection Based on Corpus and Web
Citation: LIU Jian-zhou, HE Ting-ting, LUO Chang-ri. Automatic New Words Detection Based on Corpus and Web[J]. Journal of Computer Applications, 2004, 24(7): 132-134.
Authors: LIU Jian-zhou, HE Ting-ting, LUO Chang-ri
Affiliations: 1. Department of Computer Science, Central China Normal University, Wuhan 430079, Hubei, China; School of Information Engineering, Hubei University of Technology, Wuhan 430068, Hubei, China
2. Department of Computer Science, Central China Normal University, Wuhan 430079, Hubei, China
Funding: Supported by the Natural Science Foundation of Hubei Province (2001ABB012)
Abstract: Automatic Chinese word segmentation is the foundation of Chinese information processing. At present, a major difficulty for automatic segmentation is the automatic detection of new words, especially new words that are not proper nouns. New word detection is also of great importance to the compilation of Chinese dictionaries. This paper presents a new method for automatic new word detection that uses improved forms of two parameters: mutual information and the log-likelihood ratio. The method proceeds in three phases: first, a large amount of text is downloaded from the Web to build a corpus; then multi-character words are recognized with statistical measures; finally, the candidates are checked against an existing word list to decide which are new words.
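The abstract names two association measures but does not spell out the improved forms it uses. As a point of reference only, the sketch below implements the standard definitions of pointwise mutual information and Dunning's log-likelihood ratio for a two-character string; the counts and example numbers are illustrative assumptions, not figures from the paper.

```python
import math

def pointwise_mutual_information(f_xy, f_x, f_y, n):
    """PMI of a character pair: log2( p(x,y) / (p(x) * p(y)) )."""
    return math.log2((f_xy * n) / (f_x * f_y))

def log_likelihood_ratio(f_xy, f_x, f_y, n):
    """Dunning's log-likelihood ratio (G^2) from the 2x2 contingency table
    of 'first character is x' against 'second character is y'."""
    observed = [
        [f_xy, f_x - f_xy],                  # x followed by y, x not followed by y
        [f_y - f_xy, n - f_x - f_y + f_xy],  # not-x followed by y, neither
    ]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            row = observed[i][0] + observed[i][1]
            col = observed[0][j] + observed[1][j]
            expected = row * col / n
            if observed[i][j] > 0:
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    return 2.0 * g2

# Toy counts (assumed): the pair occurs 120 times, its first character 300 times,
# its second character 150 times, in a corpus of 1,000,000 character pairs.
print(pointwise_mutual_information(120, 300, 150, 1_000_000))  # ~11.4
print(log_likelihood_ratio(120, 300, 150, 1_000_000))
```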

Keywords: multi-word unit extraction; page parsing; dynamic corpus
Article ID: 1001-9081(2004)07-0132-03

Automatic New Words Detection Based on Corpus and Web
LIU Jian-zhou, HE Ting-ting, LUO Chang-ri. Automatic New Words Detection Based on Corpus and Web[J]. Journal of Computer Applications, 2004, 24(7): 132-134.
Authors: LIU Jian-zhou, HE Ting-ting, LUO Chang-ri
Affiliation: 1. Department of Computer Science, Central China Normal University, Wuhan 430079, China (LIU Jian-zhou, HE Ting-ting, LUO Chang-ri); 2. School of Information Engineering, Hubei University of Technology, Wuhan 430068, China (LIU Jian-zhou)
Abstract: Automatic Chinese word segmentation is the basis of Chinese information processing. At present, automatic new word detection, especially the detection of new words that are not proper nouns, is a major difficulty for automatic Chinese segmentation. At the same time, automatic new word detection is very important for dictionary compilation. This paper presents a new method for new word detection that uses improved forms of two parameters: mutual information and the log-likelihood ratio. The method consists of three phases: first, download adequate web documents and build a corpus; then recognize multi-word units with statistical measures; finally, compare the candidates with an existing word list to decide which are new words. Experiments on a real corpus show that the proposed method is efficient and robust.
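A minimal end-to-end sketch of the three-phase procedure described above is given below, assuming the web text from phase 1 has already been downloaded and cleaned. The helper names, the bigram-only candidate length, the plain PMI score, and the thresholds are illustrative assumptions rather than details taken from the paper.

```python
import math
import re
from collections import Counter

def collect_counts(text):
    """Phase 2 (simplified): count character unigrams and bigrams inside
    maximal runs of Chinese characters, so candidates never cross punctuation."""
    unigrams, bigrams = Counter(), Counter()
    for run in re.findall(r"[\u4e00-\u9fff]+", text):
        unigrams.update(run)
        bigrams.update(a + b for a, b in zip(run, run[1:]))
    return unigrams, bigrams

def pmi(pair, unigrams, bigrams, n):
    """Standard pointwise mutual information for a two-character candidate."""
    return math.log2((bigrams[pair] * n) / (unigrams[pair[0]] * unigrams[pair[1]]))

def detect_new_words(text, lexicon, min_count=2, min_pmi=2.0):
    """Phase 3: keep well-associated candidates that are absent from the
    existing word list (the 'lexicon')."""
    unigrams, bigrams = collect_counts(text)
    n = max(sum(bigrams.values()), 1)
    return sorted(
        pair for pair, freq in bigrams.items()
        if freq >= min_count
        and pair not in lexicon
        and pmi(pair, unigrams, bigrams, n) >= min_pmi
    )

# Toy usage: the repeated string stands in for text downloaded in phase 1,
# and 'lexicon' stands in for the existing word list consulted in phase 3.
corpus = "网络新词识别需要大规模网络语料," * 5
print(detect_new_words(corpus, lexicon={"语料", "识别", "需要"}))
```

On such a tiny repeated string the output only illustrates the mechanics; in realistic use the candidates would cover strings longer than two characters, be scored with the improved measures, and be drawn from a large, periodically refreshed dynamic web corpus, as the keywords suggest.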
Keywords: multi-word unit extraction; page parsing; dynamic corpus
This article is indexed in CNKI, VIP (Weipu), Wanfang Data, and other databases.