
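Intelligent extraction and Chinese word segmentation of Web corpus

Cite this article:	CHEN Zhan-rong, ZENG Yi-ping. Intelligent extraction and Chinese word segmentation of Web corpus[J]. Computer Engineering and Design, 2005, 26(6): 1422-1424.

Authors:	CHEN Zhan-rong, ZENG Yi-ping

Affiliations:	1. College of Information and Technology, Jinan University, Guangzhou, Guangdong 510632; 2. College of Chinese Language and Culture, Jinan University, Guangzhou, Guangdong 510632

Funding:	Humanities and Social Sciences Research Fund of the Overseas Chinese Affairs Office of the State Council (04CQBYB0011)

Abstract:	A wrapper is proposed for intelligently extracting Chinese corpus text from the Web and segmenting it into words. Without opening a website or clicking any link, the user only needs to enter a URL (Uniform Resource Locator) to obtain Chinese Web corpus text and have it segmented into words stored in a Chinese lexicon database. The overall system architecture is given, and the design principles and technical implementation of each functional module are described. Test results show that the wrapper fetches Web pages quickly and effectively and separates the Chinese text in them; its recognition rates for ambiguous sentences and new words reach 70% and 60%, respectively. It can be applied to collecting and separating Chinese vocabulary on the Web.
Keywords:	Web corpus; HTML format wrapper; Web page fetcher; word separator
Article ID:	1000-7024(2005)06-1422-03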
Intelligent extraction and Chinese word segmentation of Web corpus

CHEN Zhan-rong, ZENG Yi-ping. Intelligent extraction and Chinese word segmentation of Web corpus[J]. Computer Engineering and Design, 2005, 26(6): 1422-1424.

Authors:	CHEN Zhan-rong, ZENG Yi-ping

Affiliation:	CHEN Zhan-rong 1,ZENG Yi-ping 2

Abstract:	A wrapper for intelligent extraction of Chinese corpus text from the Web and for Chinese word segmentation is proposed. After entering a URL (Uniform Resource Locator), users can obtain Chinese Web corpus text and have it segmented into words stored in a lexicon database, without opening websites or clicking links. The system architecture is presented, and the design principles and technical implementation of each functional module are described. Test results show that the wrapper fetches Web pages quickly and separates the Chinese text in them effectively. Its recognition rates for ambiguous sentences and new words reach 70% and 60%, respectively, so it can be applied to collecting and separating Chinese vocabulary on the Web.

Keywords:	Web corpus; HTML format wrapper; Web page fetcher; word separator
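
The abstract describes a pipeline that takes a page, separates its Chinese text from the HTML, and segments that text into words. The record does not say which segmentation algorithm the authors used, so the following is only a minimal sketch of the general idea, not the paper's method: it strips markup with Python's standard `html.parser`, keeps runs of characters in the basic CJK Unified Ideographs block, and segments them with forward maximum matching, a common dictionary-based baseline chosen here as an assumption. The names `TextExtractor`, `extract_chinese_runs`, `forward_max_match`, and the toy lexicon are all hypothetical. Fetching the page from a URL is left out so the example runs offline.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


# Assumption: "Chinese text" means the basic CJK Unified Ideographs block only.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def extract_chinese_runs(html):
    """Return the runs of Chinese characters found in the page's text nodes."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    runs = []
    for chunk in parser.chunks:
        runs.extend(CHINESE_RUN.findall(chunk))
    return runs


def forward_max_match(text, lexicon, max_len=4):
    """Segment text greedily, taking the longest lexicon word at each position.

    Characters that start no known word are emitted as single-character tokens.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical sample page and toy lexicon, for illustration only.
sample_html = "<p>Web汉语语料的抽取</p><script>var x='忽略';</script>"
toy_lexicon = {"汉语", "语料", "抽取"}
for run in extract_chinese_runs(sample_html):
    print(forward_max_match(run, toy_lexicon))  # ['汉语', '语料', '的', '抽取']
```

Greedy matching of this kind cannot resolve ambiguous sentences or recognize words missing from the lexicon, which are the two cases the abstract reports recognition rates for; a real system would need additional logic for both.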
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
