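N-grams feature selection and weighting algorithm based on single-word matrix intersection
Cite this article:QIU Yunfei, LIU Shixing, SHAO Liangshan. N-grams feature selection and weighting algorithm based on single-word matrix intersection[J]. Computer Engineering and Applications, 2016, 52(22): 86-92.
Authors:QIU Yunfei  LIU Shixing  SHAO Liangshan
Affiliation:1.School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China 2.System Engineering Institute, Liaoning Technical University, Huludao, Liaoning 125105, China
Abstract:For Chinese text, traditional n-grams feature selection and weighting algorithms (such as the sliding window method) have two shortcomings. First, a word segmentation interface must be called on every document before words can be combined into n-grams features. Second, redundant words inside the n-grams cannot be removed, so redundant n-grams features interfere with the useful ones and lower classification accuracy. To address these problems, the text is converted into a character matrix (the "single-word matrix" of the title), following research on the recognition of single-character and two-character Chinese words. N-grams features are then obtained by applying redundancy filtering and an intersection operation to the elements of the matrix, which keeps redundant words out of the n-grams features and requires no call to any word segmentation interface. Experiments on the Sogou Chinese news corpus and the NetEase text corpus show that, compared with the sliding window method and other n-grams feature selection and weighting algorithms, the proposed algorithm produces its n-grams features in less time and achieves better classification results with a support vector machine (SVM).

Keywords:Chinese single- and double-character word recognition  character matrix  intersection operation  feature selection  feature weighting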

N-grams feature selection and weighting algorithm based on single-word matrix intersection
QIU Yunfei, LIU Shixing, SHAO Liangshan. N-grams feature selection and weighting algorithm based on single-word matrix intersection[J]. Computer Engineering and Applications, 2016, 52(22): 86-92.
Authors:QIU Yunfei  LIU Shixing  SHAO Liangshan
Affiliation:1.School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China 2.System Engineering Institute, Liaoning Technical University, Huludao, Liaoning 125105, China
Abstract:In Chinese text, traditional n-grams feature selection and weighting methods (the sliding window method and others) have two shortcomings. First, word segmentation must be called before words can be combined into n-grams. Second, redundant words in the n-grams cannot be deleted, so the redundant n-grams disturb other useful n-grams and reduce classification accuracy. To solve these problems, the text is transformed into a single-word matrix according to the theory of Chinese single- and double-character word identification. Redundancy filtering and intersection on the elements of the single-word matrix yield n-grams that contain no redundant words, without calling word segmentation on the text. Experimental results on the Sogou Chinese news corpus and the NetEase text corpus show that, compared with the sliding window method and other n-grams feature selection and weighting methods, the n-grams features obtained by the algorithm based on single-word matrix intersection take less time to produce and perform better in an SVM (Support Vector Machine).
Keywords:Chinese single and double word recognition  single-word matrix  intersection  feature selection  feature weighting  
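
For orientation, the sliding window baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of the generic technique, not the authors' implementation and not the single-word matrix method, whose details the abstract does not give. The function name `sliding_window_ngrams` and the sample token list are assumptions made for this example; the tokens are taken to come from a separate word segmenter, which is the dependency the abstract criticizes.

```python
from typing import List, Tuple


def sliding_window_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive tokens, in order of appearance."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A window of width n slides one token at a time; a text shorter
    # than n tokens yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical pre-segmented sentence (assumed segmenter output).
tokens = ["中文", "文本", "的", "特征", "选择"]
bigrams = sliding_window_ngrams(tokens, 2)

for gram in bigrams:
    print(" ".join(gram))
```

In this sample the function word "的" ends up inside two of the four bigrams. That is one plausible reading of the "redundant words" the abstract says the baseline cannot remove.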