
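基于专有名词优先的快速中文分词 (Fast Chinese Word Segmentation Based on Proper-Noun Priority)
Cite this article: LIANG Zhuo-ming, CHEN Ju-hua. Fast Chinese Word Segmentation Based on Proper-Noun Priority [J]. Microcomputer Development (微机发展), 2008, 18(3): 24-27.
Authors: LIANG Zhuo-ming, CHEN Ju-hua
Affiliation: Department of Computer Science, School of Information Science, Zhongshan University, Guangzhou, Guangdong 510275
Abstract: Chinese word segmentation is an important part of Chinese information processing systems. Topic-oriented information retrieval systems place special demands on both the speed and the accuracy of segmentation. This paper addresses two major questions in building the lexicon, namely where the entries come from and how they are stored, and proposes a fast Chinese word segmentation method based on proper-noun priority. Using a dictionary mechanism of first-character hashing, storage layered by word length, and binary search, the method first cuts out proper nouns so that each sentence is split into fragments, then applies forward and backward mechanical segmentation to each fragment, and finally selects the best result with a fast, effective evaluation function and adjusts it. Experiments show that on topic-oriented documents the method segments 920,000 characters per second with an accuracy of 96%, which indicates high performance for this kind of document.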
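The dictionary mechanism named in the abstract (first-character hashing, layering by word length, binary search) could be sketched roughly as below. This is only an illustration under stated assumptions: the class name `LayeredDictionary`, its methods, and the use of a Python dict as the hash table are hypothetical and are not taken from the paper, which does not give its data layout in the abstract.

```python
from bisect import bisect_left
from collections import defaultdict


class LayeredDictionary:
    """Hypothetical layout: first character -> word length -> sorted word list."""

    def __init__(self, words):
        buckets = defaultdict(lambda: defaultdict(list))
        for word in set(words):
            if word:
                # Hash on the first character, then group by length in characters.
                buckets[word[0]][len(word)].append(word)
        # Sort every layer once so that lookups can use binary search.
        self._buckets = {
            first: {length: sorted(layer) for length, layer in layers.items()}
            for first, layers in buckets.items()
        }

    def contains(self, word):
        """Return True if `word` is an entry, using binary search in its layer."""
        if not word:
            return False
        layer = self._buckets.get(word[0], {}).get(len(word))
        if not layer:
            return False
        i = bisect_left(layer, word)
        return i < len(layer) and layer[i] == word

    def max_length(self, first_char):
        """Longest entry starting with `first_char` (1 if there is none)."""
        layers = self._buckets.get(first_char)
        return max(layers) if layers else 1


lexicon = LayeredDictionary(["中文", "中文分词", "分词", "专有名词"])
print(lexicon.contains("中文分词"))  # True
print(lexicon.contains("中文分"))    # False
```

Grouping by first character and length keeps each sorted layer small, so a matcher only has to binary-search candidates of exactly the length it is currently trying.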

Keywords: Chinese word segmentation; proper nouns; dictionary mechanism
Article ID: 1673-629X(2008)03-0024-04
Revised: June 21, 2007

A Rapid Chinese Word Segmentation Method Based on Priority Special Names
LIANG Zhuo-ming, CHEN Ju-hua. A Rapid Chinese Word Segmentation Method Based on Priority Special Names [J]. Microcomputer Development, 2008, 18(3): 24-27.
Authors:LIANG Zhuo-ming  CHEN Ju-hua
Affiliation: LIANG Zhuo-ming, CHEN Ju-hua (Dept. of Computer Sci., Sch. of Info. Sci., Zhongshan Univ., Guangzhou 510275, China)
Abstract: Chinese word segmentation is a key component of Chinese information processing systems. Topic information retrieval systems have special requirements for both segmentation speed and accuracy. This paper answers two important questions in building the dictionary, how to obtain the word entries and how to organize them, and designs a rapid dictionary-based Chinese word segmentation algorithm that gives priority to special names (proper nouns). The dictionary hashes on the first character, stores entries by word length, and is searched by binary search. Sentences are first cut at special names, the remaining fragments are segmented by bidirectional maximum matching, a simple but effective scoring function selects the best result, and a final adjustment is applied. Experimental results show that the method reaches a speed of 920,000 characters per second with an accuracy of 96%, which demonstrates its high efficiency on topic information documents.
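The pipeline in the abstract (cut at special names first, run maximum matching in both directions on each fragment, keep the better candidate) can be illustrated with the short sketch below. Several parts are assumptions rather than the authors' design: the function names are hypothetical, a plain Python set stands in for the dictionary, the final adjustment step is omitted, and the scoring rule (fewer tokens, then fewer single-character tokens) is a common heuristic used here only because the abstract does not define the paper's scoring function.

```python
def forward_max_match(text, vocab, max_len):
    """Greedy longest match, scanning left to right."""
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:  # unknown single characters pass through
                out.append(piece)
                i += n
                break
    return out


def backward_max_match(text, vocab, max_len):
    """Greedy longest match, scanning right to left."""
    out, j = [], len(text)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            piece = text[j - n:j]
            if n == 1 or piece in vocab:
                out.append(piece)
                j -= n
                break
    out.reverse()
    return out


def score(tokens):
    # Assumed heuristic, not the paper's function: lower is better.
    return (len(tokens), sum(1 for t in tokens if len(t) == 1))


def split_on_proper_nouns(text, proper_nouns):
    """Return (chunk, is_proper_noun) pairs, trying longer names first."""
    ordered = sorted(proper_nouns, key=len, reverse=True)
    pieces, i, start = [], 0, 0
    while i < len(text):
        hit = next((p for p in ordered if text.startswith(p, i)), None)
        if hit:
            if start < i:
                pieces.append((text[start:i], False))
            pieces.append((hit, True))
            i += len(hit)
            start = i
        else:
            i += 1
    if start < len(text):
        pieces.append((text[start:], False))
    return pieces


def segment(text, vocab, proper_nouns):
    max_len = max((len(w) for w in vocab), default=1)
    result = []
    for chunk, is_proper in split_on_proper_nouns(text, proper_nouns):
        if is_proper:
            result.append(chunk)
            continue
        fwd = forward_max_match(chunk, vocab, max_len)
        bwd = backward_max_match(chunk, vocab, max_len)
        result.extend(min((fwd, bwd), key=score))
    return result


vocab = {"研究", "研究生", "生命", "的", "起源"}
print(segment("中山大学研究生命的起源", vocab, {"中山大学"}))
# ['中山大学', '研究', '生命', '的', '起源']
```

In this example the forward pass yields 研究生 / 命 / 的 / 起源, while the backward pass yields 研究 / 生命 / 的 / 起源; both have four tokens, and the assumed score prefers the backward result because it has fewer single-character tokens.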
Keywords:Chinese word segmentation  special name  dictionary mechanism
This article is indexed in CNKI, VIP (维普), and other databases.