A Chinese Document Format Based on Word Encoding

Cite this article:	JIAO Hui, LIU Qian, JIA Hui-bo. A Chinese document format based on word encoding [J]. Computer Science (计算机科学), 2008, 35(10): 162-164.

Authors:	JIAO Hui (焦慧), LIU Qian (刘迁), JIA Hui-bo (贾惠波)

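Affiliations:	1. Department of Precision Instruments and Mechanology, Tsinghua University, Beijing 100084; 2. Optical Memory National Engineering Research Center (光盘国家工程研究中心), Tsinghua University, Beijing 100084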

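Abstract:	The paper analyzes the root causes of the Chinese automatic word segmentation problem and the difficulties it faces, and in response proposes a word-based Chinese encoding method and a new Chinese document format. The format makes the word the smallest information carrier in Chinese text, so that text analysis can be carried out at the word level and the obstacles that automatic segmentation creates for Chinese information processing are avoided. The encoding method treats each word as a unit and assigns a code to each word. This sidesteps automatic segmentation altogether and, in particular, resolves the problem of ambiguous segmentation. The paper also proposes a new approach that uses the document format itself to handle out-of-vocabulary (unregistered) words. Automatic keyword extraction on this word-level platform was studied experimentally using statistical analysis, with good results.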
Keywords:	Chinese information processing; dictionary code; document format; automatic word segmentation
Novel Chinese Text Format Based on Word Encoding

JIAO Hui, LIU Qian, JIA Hui-bo. Novel Chinese Text Format Based on Word Encoding [J]. Computer Science, 2008, 35(10): 162-164.

Authors:	JIAO Hui, LIU Qian, JIA Hui-bo

Affiliation:	JIAO Hui (1), LIU Qian (1), JIA Hui-bo (2). 1. Department of Precision Instruments and Mechanology, Tsinghua University, Beijing 100084, China; 2. State Key Laboratory of Precision Measurement Technology and Instruments, China

Abstract:	The root causes of the Chinese automatic word segmentation problem and the difficulties it involves are analyzed. This paper presents a novel Chinese text encoding method and a new text format. In this format, words become the smallest information unit of a text, which makes segmentation unnecessary and avoids its adverse effects on Chinese information processing (CIP). The encoding assigns a code to every word rather than to every character. Encoding at the word level also resolves segmentation ambiguity. A new ...

Keywords:	Chinese information processing; word coding; text format; automatic segmentation
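
The word-level encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's actual format: the dictionary contents, the 16-bit code width, the `OOV_BASE` threshold, and all function names are assumptions made for illustration. The sketch stores one code per word, so the word boundaries chosen by the writer are preserved and a reader never has to re-segment the text. Words missing from the shared dictionary are carried in a per-document table, which is one possible reading of "using the document format" to handle unregistered words. The plain frequency count at the end is only a stand-in for the statistical keyword extraction mentioned in the abstract.

```python
import struct
from collections import Counter

# Hypothetical shared dictionary mapping each word to a 16-bit code.
BASE_DICT = {"研究": 1, "生命": 2, "研究生": 3, "命": 4, "的": 5, "起源": 6}
# Assumption: codes at or above this value index the per-document word table.
OOV_BASE = 0x8000


def encode(words, base_dict=BASE_DICT):
    """Encode a list of words into (local_table, payload_bytes)."""
    local_table = []   # unregistered words stored inside the document
    local_index = {}   # unregistered word -> its document-local code
    codes = []
    for word in words:
        if word in base_dict:
            codes.append(base_dict[word])
        else:
            if word not in local_index:
                local_index[word] = OOV_BASE + len(local_table)
                local_table.append(word)
            codes.append(local_index[word])
    # One big-endian unsigned 16-bit integer per word.
    payload = struct.pack(f">{len(codes)}H", *codes)
    return local_table, payload


def decode(local_table, payload, base_dict=BASE_DICT):
    """Recover the original word list; no segmentation step is needed."""
    reverse = {code: word for word, code in base_dict.items()}
    codes = struct.unpack(f">{len(payload) // 2}H", payload)
    return [
        local_table[code - OOV_BASE] if code >= OOV_BASE else reverse[code]
        for code in codes
    ]


def word_frequencies(local_table, payload):
    """Count words directly from the encoded document."""
    return Counter(decode(local_table, payload))


if __name__ == "__main__":
    # "研究生命的起源" is ambiguous as a character string (研究/生命 vs 研究生/命);
    # stored as word codes, only the intended reading exists.
    table, data = encode(["清华大学", "研究", "生命", "的", "起源"])
    print(table, data.hex(), decode(table, data))
```

Because the payload holds word codes rather than characters, the two readings of the same character string map to different code sequences, which is how a word-level format removes segmentation ambiguity at read time.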
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.