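A Domain-Independent, Dictionary-Free Word Extraction Method for Chinese Documents
Cite this article: JIN Xiangyu, SUN Zhengxing, ZHANG Fuyan. 一种中文文档的非受限无词典抽词方法 [A domain-independent, dictionary-free word extraction method for Chinese documents][J]. 中文信息学报 (Journal of Chinese Information Processing), 2001, 15(6): 34-40
Authors: JIN Xiangyu, SUN Zhengxing, ZHANG Fuyan
Affiliation: State Key Laboratory for Novel Software Technology, Nanjing University; Department of Computer Science and Technology, Nanjing University
Funding: National Natural Science Foundation of China (69903006); Ministry of Education funding program for key teachers in higher education institutions (Jiaojisi [2000] No. 65); China Postdoctoral Science Foundation (Zhongboji [1997] No. 11)
Abstract:
This paper proposes a domain-independent, dictionary-free word extraction model. The model uses a self-growing algorithm to obtain the combination patterns of Chinese characters in a Chinese document, and introduces concepts such as support and confidence to filter candidate terms. Experiments show that, without dictionary support and without learning from a corpus, the algorithm extracts the medium- and high-frequency terms in a Chinese document quickly and accurately. It is suited to Chinese information processing applications that are sensitive to term frequency and also demand high computation speed, such as real-time automatic document classification systems.

Keywords: Chinese information processing; automatic word segmentation; domain-independent dictionary-free word extraction; Chinese character combination patterns
Revised: 2001-01-17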

A Domain-independent Dictionary-free Lexical Acquisition Model For Chinese Document
JIN Xiang-yu, SUN Zheng-xing, ZHANG Fu-yan. A Domain-independent Dictionary-free Lexical Acquisition Model For Chinese Document[J]. Journal of Chinese Information Processing, 2001, 15(6): 34-40
Authors: JIN Xiang-yu, SUN Zheng-xing, ZHANG Fu-yan
Affiliation: State Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University
Abstract:
A domain-independent, dictionary-free lexical acquisition model is presented in this paper. It introduces a self-increasing algorithm to acquire the co-occurrence patterns of Chinese characters, and applies criteria such as support and confidence to filter these co-occurrence patterns into lexical items. Experiments show that it acquires high-frequency lexical items effectively and efficiently, without the support of a dictionary and without supervised learning on a corpus. The proposed model is particularly suited to Chinese information processing applications that are sensitive to lexical frequency but time-critical, such as real-time automatic Chinese text classification systems.
Keywords: Chinese information processing; automatic word segmentation; domain-independent dictionary-free lexical acquisition; co-occurrence patterns of Chinese characters
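
The abstract names the ingredients (patterns grown one character at a time, filtered by support and confidence) but this record does not give their definitions. The following Python sketch is one possible reading, not the authors' algorithm. It assumes that support is the raw occurrence count of a pattern and that confidence is that count divided by the larger count of the pattern's two shorter sub-patterns, in the spirit of association-rule mining. The names `count_candidates` and `extract_terms`, and all threshold values, are hypothetical.

```python
import re
from collections import Counter


def count_candidates(segments, n, frequent):
    """Count n-character patterns whose (n-1)-character prefix is frequent.

    For n == 1 every character is counted. This prefix check is the
    "growth" step: a pattern is only extended if its prefix survived.
    """
    counts = Counter()
    for seg in segments:
        for i in range(len(seg) - n + 1):
            gram = seg[i:i + n]
            if n == 1 or gram[:-1] in frequent:
                counts[gram] += 1
    return counts


def extract_terms(text, min_support=2, min_confidence=0.8, max_len=4):
    """Return {pattern: (support, confidence)} for patterns passing both filters.

    Assumed definitions (not taken from the paper):
      support    = number of occurrences of the pattern
      confidence = support / max(count(prefix), count(suffix))
    """
    # Split on anything that is not a CJK unified ideograph, so patterns
    # never span punctuation, digits, or Latin text.
    segments = re.findall(r"[\u4e00-\u9fff]+", text)
    prev = count_candidates(segments, 1, set())
    frequent = {p for p, c in prev.items() if c >= min_support}
    terms = {}
    for n in range(2, max_len + 1):
        cur = count_candidates(segments, n, frequent)
        grown = {p for p, c in cur.items() if c >= min_support}
        for p in grown:
            support = cur[p]
            confidence = support / max(prev[p[:-1]], prev[p[1:]])
            if confidence >= min_confidence:
                terms[p] = (support, confidence)
        if not grown:
            break  # nothing left to extend
        frequent, prev = grown, cur
    return terms
```

For example, on the text `"文档分类系统。文档检索。文档管理。分类算法。分类模型。新的算法。新的模型。好的。对的。"` with the default thresholds, `extract_terms` keeps 文档, 分类, 算法, and 模型, and rejects 新的: it occurs twice, but 的 occurs four times, so its confidence under the assumed definition is only 0.5. The sketch does not remove shorter patterns that are subsumed by longer accepted ones.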
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.