New Word Detection Based on Large-Scale Corpus (基于大规模语料库的新词检测)

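Cite as:	崔世起, 刘群, 孟遥, 于浩, 西野文人. 基于大规模语料库的新词检测 (New Word Detection Based on Large-Scale Corpus) [J]. 计算机研究与发展 (Journal of Computer Research and Development), 2006, 43(5): 927-932.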

Authors:	崔世起, 刘群, 孟遥, 于浩, 西野文人

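Affiliations:	1. Digital Technology Laboratory, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080; Graduate School of the Chinese Academy of Sciences, Beijing 100049
2. Digital Technology Laboratory, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080
3. Fujitsu Research and Development Center Co., Ltd., Beijing 100016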

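Funding:	National Key Technologies R&D Program (国家科技攻关项目); cooperative project between the Institute of Computing Technology, Chinese Academy of Sciences, and Fujitsu Research and Development Center Co., Ltd.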

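Abstract:	As natural language evolves, new words need to be tracked quickly. This paper presents a new word detection method based on a large-scale corpus. The method first performs Chinese word segmentation on a large raw corpus collected from the Internet, then counts frequencies over the segmented text to obtain a large set of candidate new words. Targeting the common patterns of two-unit, three-unit, and four-unit new words, it uses a self-learning procedure to build three garbage lexicons and one affix lexicon, which filter garbage out of the candidates. Finally, part-of-speech filtering rules and independent word probability (IWP) filter the candidates further. On this basis, the authors implemented an Internet-based system for online new word detection, which achieved satisfactory performance. The system can already be applied to new word detection, terminology database construction, statistics on trending named entities, and dictionary compilation.
Keywords:	new word; garbage string; garbage head; garbage tail; independent word probability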
Received:	2005-03-04
Revised:	2005-08-11
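The pipeline in the abstract (segment, count repeated strings, filter garbage) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `min_freq` threshold, and the reading of the three garbage lexicons as garbage-head, garbage-tail, and garbage-string lexicons (taken from the keywords) are assumptions made for illustration. The lexicons are passed in by hand rather than learned, and the affix lexicon and part-of-speech rules are omitted.

```python
from collections import Counter


def count_candidates(segmented_sentences, max_n=4, min_freq=2):
    """Count runs of 2..max_n adjacent tokens and keep the repeated ones.

    `segmented_sentences` is a list of token lists, i.e. the output of some
    word segmenter (the segmenter itself is outside this sketch).
    """
    counts = Counter()
    for tokens in segmented_sentences:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}


def filter_garbage(candidates, garbage_heads, garbage_tails, garbage_strings):
    """Drop candidates that start with a garbage head, end with a garbage
    tail, or match a garbage string (hypothetical lexicon semantics)."""
    kept = {}
    for gram, freq in candidates.items():
        word = "".join(gram)
        if word in garbage_strings:
            continue
        if gram[0] in garbage_heads or gram[-1] in garbage_tails:
            continue
        kept[word] = kept.get(word, 0) + freq
    return kept


# Toy corpus in which the segmenter has split the unknown word "博客".
sentences = [
    ["我", "的", "博", "客"],
    ["他", "的", "博", "客"],
    ["博", "客", "很", "火"],
]
candidates = count_candidates(sentences)
new_words = filter_garbage(
    candidates,
    garbage_heads={"的"},   # function words rarely start a new word
    garbage_tails=set(),
    garbage_strings=set(),
)
print(new_words)
```

In this toy corpus the repeated strings beginning with the function word "的" are discarded, and only the candidate formed from "博" and "客" survives.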
New Word Detection Based on Large-Scale Corpus

Cui Shiqi, Liu Qun, Meng Yao, Yu Hao, Nishino Fumihito. New Word Detection Based on Large-Scale Corpus [J]. Journal of Computer Research and Development, 2006, 43(5): 927-932.

Authors:	Cui Shiqi, Liu Qun, Meng Yao, Yu Hao, Nishino Fumihito

Abstract:	New word detection is a subtask of unknown word detection. As natural languages evolve, new words need to be detected as early as possible. This paper presents an approach to detecting new words based on a large-scale corpus. The approach first segments a corpus collected from the Internet with ICTCLAS and searches for repeated strings. It then applies different filtering mechanisms, built on the rich features of various new word patterns, to separate true new words from garbage strings. Garbage strings are removed with three garbage lexicons and a suffix lexicon, all learned by the system, and good results are achieved. Finally, the experimental results are discussed; they appear promising.

Keywords:	new word; garbage string; garbage head; garbage tail; IWP (independent word probability)
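The record names independent word probability (IWP) as a filter but does not give its formula. The Python sketch below assumes a commonly used definition: the IWP of a character is the fraction of its occurrences in a segmented reference corpus where it stands alone as a one-character word, and the IWP of a string is the product over its characters. The function names, the `default` value for unseen characters, and the threshold are hypothetical, and the paper's exact formulation may differ.

```python
from collections import Counter
from math import prod


def character_iwp(segmented_sentences):
    """Estimate IWP(c) = count(c as a one-character word) / count(c)."""
    total = Counter()
    standalone = Counter()
    for tokens in segmented_sentences:
        for token in tokens:
            total.update(token)          # counts every character in the token
            if len(token) == 1:
                standalone[token] += 1
    return {ch: standalone[ch] / total[ch] for ch in total}


def string_iwp(word, iwp, default=0.5):
    """Product of per-character IWPs; `default` covers unseen characters."""
    return prod(iwp.get(ch, default) for ch in word)


def passes_iwp_filter(word, iwp, threshold):
    """Keep a candidate only if it is unlikely to be a run of independent
    one-character words (assumed direction of the filter)."""
    return string_iwp(word, iwp) < threshold


reference = [["我", "的", "书"], ["的确", "好"], ["我们", "的"]]
iwp = character_iwp(reference)
print(string_iwp("我的", iwp), passes_iwp_filter("我的", iwp, threshold=0.2))
```

Under this assumption, a candidate whose characters usually occur as independent words gets a high product and is rejected, while a candidate made of characters that rarely stand alone is kept.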
This article is indexed in databases including CNKI, VIP (维普), and Wanfang Data (万方数据).
