N-grams feature selection and weighting algorithm based on single-word matrix intersection (基于字矩阵交运算的n-grams特征选择加权算法)

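Cite this article:	QIU Yunfei (邱云飞), LIU Shixing (刘世兴), SHAO Liangshan (邵良杉). N-grams feature selection and weighting algorithm based on single-word matrix intersection [J]. Computer Engineering and Applications (计算机工程与应用), 2016, 52(22): 86-92.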

Authors:	QIU Yunfei (邱云飞), LIU Shixing (刘世兴), SHAO Liangshan (邵良杉)

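Affiliation:	1. School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China; 2. System Engineering Institute, Liaoning Technical University, Huludao, Liaoning 125105, China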

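Abstract:	For Chinese text, traditional n-grams feature selection and weighting algorithms (such as the sliding window method) have two shortcomings. First, a word segmentation interface must be called on every document before words can be combined into n-grams features. Second, redundant words inside n-grams cannot be removed, so redundant n-grams features interfere with useful ones and lower classification accuracy. To address both problems, the text is converted into a single-word (character) matrix, following research on the recognition of Chinese single- and double-character words. N-grams features are then obtained by applying redundancy filtering and an intersection operation to the elements of the matrix, which avoids redundant words in the n-grams features and requires no call to any word segmentation interface. Experiments on the Sogou Chinese news corpus and the NetEase text corpus show that, compared with the sliding window method and other n-grams feature selection and weighting algorithms, the proposed algorithm obtains n-grams features in less time and gives better classification results with a support vector machine (SVM).
Keywords:	Chinese single- and double-character word recognition; single-word (character) matrix; intersection; feature selection; feature weighting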
N-grams feature selection and weighting algorithm based on single-word matrix intersection

QIU Yunfei, LIU Shixing, SHAO Liangshan. N-grams feature selection and weighting algorithm based on single-word matrix intersection [J]. Computer Engineering and Applications, 2016, 52(22): 86-92.

Authors:	QIU Yunfei, LIU Shixing, SHAO Liangshan

Affiliation:	1. School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China; 2. System Engineering Institute, Liaoning Technical University, Huludao, Liaoning 125105, China

Abstract:	In Chinese text, traditional n-grams feature selection and weighting methods (the sliding window method and so on) have two shortcomings. First, word segmentation must be called before words are combined and n-grams are generated. Second, redundant words in n-grams cannot be deleted, so the redundant n-grams disturb other useful n-grams and reduce classification precision. To solve these problems, the text is transformed into a single-word matrix according to Chinese single- and double-word identification theory. Redundancy filtering and intersection on the single-word matrix avoid redundant words in the n-grams and avoid calling word segmentation on the text. Experimental results on the Sogou Chinese news corpus and the NetEase text corpus show that, compared with the sliding window method and other methods, the n-grams features produced by the n-grams feature selection and weighting algorithm based on single-word matrix intersection take less time to obtain and perform better in an SVM (Support Vector Machine).

Keywords:	Chinese single and double word recognition; single-word matrix; intersection; feature selection; feature weighting
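
For readers unfamiliar with the sliding window baseline named in the abstract, the following is a minimal sketch of the generic technique, not the authors' code and not their matrix-based method (whose details are not given on this page). It assumes the text has already been segmented into words; the function names and the sample sentence are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def sliding_window_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive tokens, in document order."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A document shorter than n yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(docs: Iterable[List[str]], n: int) -> Counter:
    """Count n-grams over a collection of already-segmented documents."""
    counts: Counter = Counter()
    for tokens in docs:
        counts.update(sliding_window_ngrams(tokens, n))
    return counts


# Hypothetical, hand-segmented input. A real pipeline would obtain these
# tokens from a word segmenter, which is the per-document step the abstract
# identifies as a cost of the sliding window approach.
tokens = ["我们", "的", "算法", "的", "结果"]
bigrams = sliding_window_ngrams(tokens, 2)
```

In this sketch every adjacent pair of words becomes a feature, including pairs built around a function word such as 的; the window itself has no step for dropping such words, which may be the kind of redundancy the abstract has in mind.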
