Algorithm Design for Computing the Mutual Information of Four-Character Word Strings in a Large-Scale Corpus |
| |
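Cite this article: | Fang Ying (方莹), Yang Erhong (杨尔弘). Algorithm design for computing the mutual information of four-character word strings in a large-scale corpus [J]. Computer Development & Applications (电脑开发与应用), 2005, 18(1): 2-3, 6. |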
| |
Authors: | Fang Ying (方莹), Yang Erhong (杨尔弘) |
| |
Affiliation: | Shanxi University, Taiyuan 030006, China |
| |
Funding: | National Key Basic Research Program of China (973 Program) |
| |
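Abstract (Chinese): | In Chinese information processing, deciding which word strings should be entered into the word-segmentation lexicon (《分词词表》) has long been a difficult problem. Mutual information, used as a measure, reflects to some extent how tightly the components of a word string are bound together. Using the Peking University annotated corpus of the People's Daily (《人民日报》) for January 1998 as the test corpus, this paper computes mutual information to analyze how likely four-character word strings are to form words, providing a basis for deciding whether they should be included in the lexicon.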
|
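Keywords (Chinese): | mutual information; corpus; algorithm design; word list; word frequency; word segmentation; four-character word string; attributive-head structure |
Article ID: | 1003-5850(2005)01-0002-03 |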
The Algorithm Design and Realization to Calculate the Mutual Information of Four-Word-String in Large Scale Corpus |
| |
Abstract: | In Chinese information processing, judging which word strings should be included in the word-segmentation lexicon has always been a difficult problem. Mutual information is one measure for this judgement, and it reflects how tightly the different parts of a string are bound together. Based on the Peking University annotated corpus of the People's Daily for January 1998, this paper analyses the possibility of four-word strings forming words and provides a basis for determining whether such strings can be included in the lexicon. |
| |
Keywords: | mutual information; corpus; algorithm design; word list; word frequency; word segmentation; four-word string; attributive-head structure |
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.
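
The abstract describes scoring four-unit strings by mutual information but does not give the formula. As a rough illustration only, the sketch below assumes a pointwise mutual information of the form log2(P(w1 w2 w3 w4) / (P(w1) P(w2) P(w3) P(w4))) over a pre-segmented corpus. The function name `four_gram_pmi`, the input format, and this particular formula are assumptions made for the example, not the authors' published algorithm.

```python
import math
from collections import Counter


def four_gram_pmi(sentences):
    """Score every contiguous four-unit string by pointwise mutual information.

    `sentences` is a list of token lists (units may be characters or words).
    ASSUMPTION: PMI = log2(P(joint) / product of unit probabilities); the
    paper may define mutual information for four-unit strings differently.
    """
    unigrams = Counter()
    fourgrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        # Slide a window of length 4 over the sentence.
        fourgrams.update(
            tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3)
        )

    n_uni = sum(unigrams.values())
    n_four = sum(fourgrams.values())

    scores = {}
    for gram, count in fourgrams.items():
        p_joint = count / n_four
        # Probability of the string if its units occurred independently.
        p_indep = math.prod(unigrams[w] / n_uni for w in gram)
        scores[gram] = math.log2(p_joint / p_indep)
    return scores
```

A higher score suggests the units co-occur more often than chance would predict, which is the sense of "tightly bound" used in the abstract. Plain PMI of this kind is known to favor rare strings, so a practical system would likely combine it with a frequency threshold.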