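A Method for Quantifying Chinese Character Association and Its Application to Text Similarity Analysis
Cite this article: Zhao Yanbin, Li Qinghua. A method for quantifying Chinese character association and its application to text similarity analysis [J]. Journal of Computer Applications (计算机应用), 2006, 26(6): 1396-1397.
Authors: Zhao Yanbin, Li Qinghua
Affiliation: School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074; National High Performance Computing Center, Wuhan, Hubei 430074
Abstract: Text similarity analysis, clustering, and classification are mostly based on feature words. Because Chinese words are not separated by delimiters, groundwork such as Chinese word segmentation and the handling of high-dimensional feature spaces inevitably incurs high computational cost. This paper explores an approach that analyzes text similarity using the relationships between Chinese characters, without using feature words. It first defines a method for quantifying the relationship between characters in a text and introduces the concept of Chinese character association measurement (汉字关联度). It then constructs a character association matrix to represent a Chinese text and designs a similarity measure for Chinese texts based on that matrix. Experimental results show that the character association measurement outperforms two-character word frequency, mutual information, and the t-test statistic. Because it requires no word segmentation, the algorithm is suitable for processing massive volumes of Chinese text.

Keywords: Chinese character association measurement; information matrix; text similarity algorithm
Article ID: 1001-9081(2006)06-1396-02
Received: 2005-12-19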
Revised: 2006-02-22

Chinese character association measurement method and its application on Chinese text similarity analysis
ZHAO Yan-bin, LI Qing-hua. Chinese character association measurement method and its application on Chinese text similarity analysis[J]. Journal of Computer Applications, 2006, 26(6): 1396-1397.
Authors:ZHAO Yan-bin  LI Qing-hua
Affiliation:1. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China; 2. National High Performance Computing Center, Wuhan, Hubei 430074, China
Abstract:Research on text similarity analysis and text clustering is mostly based on feature words. Because Chinese text has no natural delimiter between words, such methods must first deal with two problems: Chinese word segmentation and high-dimensional feature vector spaces. To reduce this complexity, a new approach to text similarity analysis is explored that uses the association between Chinese characters rather than feature words. The notion of Chinese Character Association Measurement (CCAM) is defined, and a CCAM matrix is constructed to represent Chinese text documents. A Chinese text similarity algorithm based on the CCAM matrix is then proposed. Experimental results show that CCAM performs better than mutual information, the t-test, and bi-gram frequency. Because it requires no Chinese word segmentation, the algorithm is suitable for massive Chinese text corpora.
Keywords:Chinese Character Association Measurement(CCAM)  information matrix  text similarity measurement algorithm
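
The pipeline outlined in the abstract (quantify character-to-character relations, store them in a matrix per document, compare matrices) can be illustrated with a minimal Python sketch. The abstract does not give the CCAM formula, so the cell weight used here is an assumption: the raw count of adjacent character pairs, which corresponds to the bi-gram frequency baseline rather than to CCAM itself. The function names `is_cjk`, `pair_counts`, and `cosine_similarity` are hypothetical, and cosine similarity is a stand-in for the paper's own similarity measure.

```python
from collections import Counter
from math import sqrt


def is_cjk(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def pair_counts(text: str) -> Counter:
    """Count ordered pairs of adjacent Chinese characters.

    The Counter acts as a sparse character-by-character matrix:
    key (a, b) is the cell in row a, column b.
    ASSUMPTION: raw adjacency counts stand in for the paper's CCAM weight.
    """
    counts = Counter()
    for a, b in zip(text, text[1:]):
        # Skip pairs that involve punctuation, digits, Latin letters, etc.
        if is_cjk(a) and is_cjk(b):
            counts[(a, b)] += 1
    return counts


def cosine_similarity(m1: Counter, m2: Counter) -> float:
    """Cosine similarity of two sparse matrices, treated as flat vectors.

    ASSUMPTION: cosine is used for illustration; the abstract does not
    specify the paper's similarity measure.
    """
    dot = sum(v * m2[k] for k, v in m1.items() if k in m2)
    norm1 = sqrt(sum(v * v for v in m1.values()))
    norm2 = sqrt(sum(v * v for v in m2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


if __name__ == "__main__":
    doc_a = pair_counts("文本相似性分析")
    doc_b = pair_counts("文本相似性度量")
    print(cosine_similarity(doc_a, doc_b))
```

No word segmentation step appears anywhere in the sketch, which is the property the abstract emphasizes: each document is scanned once, character by character.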
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.