大规模语料中频繁模式增量发现算法 Frequent Pattern Increment Discovery Algorithm in Large-scale Corpus期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

大规模语料中频繁模式增量发现算法

引用本文：	廖豪,陈洁,谭建龙.大规模语料中频繁模式增量发现算法[J].计算机工程,2011,37(23):27-29,32.

作者姓名：	廖豪陈洁谭建龙

作者单位：	1. 中国科学院计算技术研究所,北京100190;中国科学院研究生院,北京100049 2. 中国科学院计算技术研究所,北京100190;北京邮电大学计算机学院,北京100876 3. 中国科学院计算技术研究所,北京,100190

基金项目：	国家"973"计划基金资助项目，国家自然科学基金资助项目

摘要：	提出一种适用于大规模语料的频繁模式增量发现算法。统计局部区域提取的字符串频度，对局部相对低频字符串进行剪枝。利用多模式串匹配算法，统计剪枝后局部相对高频字符串在整个语料中的频度，得到频度大于阈值的频繁模式。实验结果表明，该算法具有较低的空间复杂度和时间复杂度，内存消耗为基于后缀数组的频繁模式发现算法的20%左右。
关键词：	频繁模式增量式多模式串匹配算法后缀树后缀数组
收稿时间：	2011-06-03
Frequent Pattern Increment Discovery Algorithm in Large-scale Corpus

LIAO Hao,CHEN Jie,TAN Jian-long.Frequent Pattern Increment Discovery Algorithm in Large-scale Corpus[J].Computer Engineering,2011,37(23):27-29,32.

Authors:	LIAO Hao CHEN Jie TAN Jian-long

Affiliation:	1(1.Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China;2.Graduate University of Chinese Academy of Sciences,Beijing 100049,China;3.School of Computer Science,Beijing University of Posts and Telecommunications,Beijing 100876,China)

Abstract:	This paper presents a memory-based frequent pattern incremental discovering algorithm for large-scale corpus.It extracts strings and counts frequencies of them from local area,prunes the local relative low frequency strings,and uses multi-mode string matching algorithm to count the local relative high frequency strings in the whole corpus,eventually gets the frequent patterns that the frequency is greater than the threshold.Experimental result shows that the algorithm has a better space complexity and the highest consumption of the memory size in the process of frequent-pattern discovery is about 20% to the size of the algorithm based on suffix array.

Keywords:	frequent pattern incremental multi-pattern string matching algorithm suffix tree suffix array
本文献已被 CNKI 维普万方数据等数据库收录！
	点击此处可从《计算机工程》浏览原始摘要信息
	点击此处可从《计算机工程》下载全文

设为首页 | 免责声明 | 关于勤云 | 加入收藏