无词典高频字串快速提取和统计算法研究

Cite as:	韩客松, 王永成, 陈桂林. 无词典高频字串快速提取和统计算法研究[J]. 中文信息学报, 2001, 15(2): 24-31.

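Authors:	韩客松 (HAN Ke-song), 王永成 (WANG Yong-cheng), 陈桂林 (CHEN Gui-lin)

Affiliation:	School of Electronics and Information, Shanghai Jiao Tong University

Funding:	National 863 Program (863-306-ZD03-04-1)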

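Abstract:	This paper proposes a fast method for extracting and counting high-frequency strings. Using hashing, the method needs no dictionary, no corpus training, and no word segmentation; it extracts high-frequency strings from statistical information alone. After linguistically informed processing of prefixes, suffixes, and the like, the resulting high-frequency strings can serve as auxiliary information for out-of-vocabulary word handling, ambiguity resolution, and term weighting. Experiments show that the method is fast, is not constrained by the text being processed, and is highly usable on real-world text such as novels.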
Keywords:	hash technique; high-frequency strings; statistics; algorithm
Manuscript revised:	8 May 2000
Research on Fast High-frequency Strings Extracting and Statistics Algorithm with no Thesaurus

HAN Ke-song, WANG Yong-cheng, CHEN Gui-lin. Research on Fast High-frequency Strings Extracting and Statistics Algorithm with no Thesaurus[J]. Journal of Chinese Information Processing, 2001, 15(2): 24-31.

Authors:	HAN Ke-song, WANG Yong-cheng, CHEN Gui-lin

Affiliation:	School of Electronics & Information, Shanghai Jiao Tong University

Abstract:	This paper describes a fast algorithm for extracting high-frequency strings. The approach uses hashing and needs neither a corpus nor word segmentation; it extracts high-frequency strings from statistical information alone. After prefix and suffix processing, the extracted strings can serve as supplementary knowledge for handling out-of-vocabulary (unregistered) words, ambiguity resolution, and term weighting. Experimental results show that the algorithm is fast and works on arbitrary texts, and that it performs well on novels and other real-world texts.

Keywords:	Hash; high-frequency strings; statistics; algorithm
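
The abstract gives only the outline of the method (hash-based counting, no dictionary, no segmentation), so the following is a minimal sketch of that general idea rather than the authors' algorithm. It assumes Python's built-in hash table (`collections.Counter`) as the hash structure, candidate strings of 2 to 6 characters, a fixed frequency threshold, and a simple "drop a string if a longer string with the same count contains it" filter standing in for the prefix and suffix processing, which the abstract does not specify. The names `split_segments` and `high_frequency_strings` and all parameter defaults are hypothetical.

```python
from collections import Counter


def split_segments(text):
    """Yield runs of letters/digits/CJK characters, breaking at punctuation and spaces."""
    segment = []
    for ch in text:
        if ch.isalnum():          # True for CJK characters, False for punctuation
            segment.append(ch)
        elif segment:
            yield "".join(segment)
            segment = []
    if segment:
        yield "".join(segment)


def high_frequency_strings(text, min_len=2, max_len=6, min_count=3):
    """Return {string: count} for character strings seen at least min_count times.

    Assumed illustration only: thresholds and the subsumption filter are
    not taken from the paper.
    """
    counts = Counter()            # hash table keyed by candidate string
    for seg in split_segments(text):
        for n in range(min_len, max_len + 1):
            for i in range(len(seg) - n + 1):
                counts[seg[i:i + n]] += 1

    frequent = {s: c for s, c in counts.items() if c >= min_count}

    # Drop a string when a longer frequent string contains it and occurs
    # exactly as often, since the shorter one then carries no extra information.
    result = {}
    for s, c in frequent.items():
        subsumed = any(
            len(t) > len(s) and s in t and frequent[t] == c for t in frequent
        )
        if not subsumed:
            result[s] = c
    return result
```

For example, on the text `"上海交通大学在上海。上海交通大学很大。我爱上海交通大学。"` with the defaults, the sketch keeps 上海 (4 occurrences) and 上海交通大学 (3 occurrences) and discards fragments such as 海交通 that never occur outside the longer string. Counting every substring up to `max_len` costs time and memory proportional to text length times `max_len`, which is acceptable for an illustration but says nothing about the speed reported in the paper.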
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.