
Automatic New Words Detection Based on Corpus and Web
Citation: LIU Jian-zhou, HE Ting-ting, LUO Chang-ri. Automatic New Words Detection Based on Corpus and Web[J]. Journal of Computer Applications, 2004, 24(7): 132-134.
Authors: LIU Jian-zhou, HE Ting-ting, LUO Chang-ri
Affiliations: 1. Department of Computer Science, Central China Normal University, Wuhan 430079, Hubei, China; School of Information Engineering, Hubei University of Technology, Wuhan 430068, Hubei, China
2. Department of Computer Science, Central China Normal University, Wuhan 430079, Hubei, China
Funding: Supported by the Natural Science Foundation of Hubei Province (2001ABB012)
Abstract: Automatic Chinese word segmentation is the foundation of Chinese information processing. At present, a major difficulty for automatic segmentation is the automatic detection of new words, especially new words that are not proper nouns. New word detection is also of great importance to the compilation of Chinese dictionaries. This paper presents a new method for automatic new word detection that uses improved forms of two parameters: mutual information and the log-likelihood ratio. The method proceeds in three phases: first, a large amount of text is downloaded from the Web to build a corpus; then multi-character words are recognized with statistical measures; finally, the candidates are checked against an existing word list to decide which are new words.
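The abstract names two association measures but does not spell out the improved forms it uses. As a point of reference only, the sketch below implements the standard definitions of pointwise mutual information and Dunning's log-likelihood ratio for a two-character string; the counts and example numbers are illustrative assumptions, not figures from the paper.

```python
import math

def pointwise_mutual_information(f_xy, f_x, f_y, n):
    """PMI of a character pair: log2( p(x,y) / (p(x) * p(y)) )."""
    return math.log2((f_xy * n) / (f_x * f_y))

def log_likelihood_ratio(f_xy, f_x, f_y, n):
    """Dunning's log-likelihood ratio (G^2) from the 2x2 contingency table
    of 'first character is x' against 'second character is y'."""
    observed = [
        [f_xy, f_x - f_xy],                  # x followed by y, x not followed by y
        [f_y - f_xy, n - f_x - f_y + f_xy],  # not-x followed by y, neither
    ]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            row = observed[i][0] + observed[i][1]
            col = observed[0][j] + observed[1][j]
            expected = row * col / n
            if observed[i][j] > 0:
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    return 2.0 * g2

# Toy counts (assumed): the pair occurs 120 times, its first character 300 times,
# its second character 150 times, in a corpus of 1,000,000 character pairs.
print(pointwise_mutual_information(120, 300, 150, 1_000_000))  # ~11.4
print(log_likelihood_ratio(120, 300, 150, 1_000_000))
```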

Keywords: multi-word unit extraction; page parsing; dynamic corpus
Article ID: 1001-9081(2004)07-0132-03

Automatic New Words Detection Based on Corpus and Web
LIU Jian-zhou, HE Ting-ting, LUO Chang-ri. Automatic New Words Detection Based on Corpus and Web[J]. Journal of Computer Applications, 2004, 24(7): 132-134.
Authors: LIU Jian-zhou, HE Ting-ting, LUO Chang-ri
Affiliation: 1. Department of Computer Science, Central China Normal University, Wuhan 430079, China (LIU Jian-zhou, HE Ting-ting, LUO Chang-ri); 2. School of Information Engineering, Hubei University of Technology, Wuhan 430068, China (LIU Jian-zhou)
Abstract: Automatic Chinese word segmentation is the basis of Chinese information processing. At present, automatic new word detection, especially the detection of new words that are not proper nouns, is a major difficulty for automatic Chinese segmentation. At the same time, automatic new word detection is very important for dictionary compilation. This paper presents a new method for new word detection that uses improved forms of two parameters: mutual information and the log-likelihood ratio. The method consists of three phases: first, download adequate web documents and build a corpus; then recognize multi-word units with statistical measures; finally, compare the candidates with an existing word list to decide which are new words. Experiments on a real corpus show that the proposed method is efficient and robust.
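A minimal end-to-end sketch of the three-phase procedure described above is given below, assuming the web text from phase 1 has already been downloaded and cleaned. The helper names, the bigram-only candidate length, the plain PMI score, and the thresholds are illustrative assumptions rather than details taken from the paper.

```python
import math
import re
from collections import Counter

def collect_counts(text):
    """Phase 2 (simplified): count character unigrams and bigrams inside
    maximal runs of Chinese characters, so candidates never cross punctuation."""
    unigrams, bigrams = Counter(), Counter()
    for run in re.findall(r"[\u4e00-\u9fff]+", text):
        unigrams.update(run)
        bigrams.update(a + b for a, b in zip(run, run[1:]))
    return unigrams, bigrams

def pmi(pair, unigrams, bigrams, n):
    """Standard pointwise mutual information for a two-character candidate."""
    return math.log2((bigrams[pair] * n) / (unigrams[pair[0]] * unigrams[pair[1]]))

def detect_new_words(text, lexicon, min_count=2, min_pmi=2.0):
    """Phase 3: keep well-associated candidates that are absent from the
    existing word list (the 'lexicon')."""
    unigrams, bigrams = collect_counts(text)
    n = max(sum(bigrams.values()), 1)
    return sorted(
        pair for pair, freq in bigrams.items()
        if freq >= min_count
        and pair not in lexicon
        and pmi(pair, unigrams, bigrams, n) >= min_pmi
    )

# Toy usage: the repeated string stands in for text downloaded in phase 1,
# and 'lexicon' stands in for the existing word list consulted in phase 3.
corpus = "网络新词识别需要大规模网络语料," * 5
print(detect_new_words(corpus, lexicon={"语料", "识别", "需要"}))
```

On such a tiny repeated string the output only illustrates the mechanics; in realistic use the candidates would cover strings longer than two characters, be scored with the improved measures, and be drawn from a large, periodically refreshed dynamic web corpus, as the keywords suggest.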
Keywords: multi-word unit extraction; page parsing; dynamic corpus
This article is indexed in CNKI, VIP (Weipu), Wanfang Data, and other databases.