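Intelligent extraction and word segmentation of Chinese Web corpora
Cite this article:CHEN Zhan-rong, ZENG Yi-ping. Intelligent extraction and word segmentation of Chinese Web corpora[J]. Computer Engineering and Design, 2005, 26(6): 1422-1424.
Authors:CHEN Zhan-rong  ZENG Yi-ping
Affiliations:1. College of Information and Technology, Jinan University, Guangzhou, Guangdong 510632
2. College of Chinese Language and Culture, Jinan University, Guangzhou, Guangdong 510632
Funding:Humanities and Social Sciences Research Fund of the Overseas Chinese Affairs Office of the State Council (04CQBYB0011)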
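Abstract:A wrapper is proposed that intelligently extracts Chinese corpus text from the Web and segments it into words. Without opening a website or clicking any links, the user only has to type in a URL (Uniform Resource Locator) to retrieve the Chinese text of a Web page and have it segmented into words that are stored in a Chinese lexicon database. The overall architecture of the system is given, and the design principles and technical implementation of each functional module are described. Test results show that the wrapper fetches Web pages and separates out their Chinese text quickly and effectively, with recognition rates of 70% for ambiguous sentences and 60% for new words, so it can be applied to collecting and separating Chinese vocabulary on the Web.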

Keywords:Web corpus  HTML format  wrapper  Web page fetcher  word separator
Article ID:1000-7024(2005)06-1422-03

Intelligent extraction and Chinese word segmentation of Web corpus
CHEN Zhan-rong,ZENG Yi-ping.Intelligent extraction and Chinese word segmentation of Web corpus[J].Computer Engineering and Design,2005,26(6):1422-1424.
Authors:CHEN Zhan-rong  ZENG Yi-ping
Affiliation:CHEN Zhan-rong 1,ZENG Yi-ping 2
Abstract:A wrapper for intelligent extraction and Chinese word segmentation of Web corpora is proposed. After entering a URL (Uniform Resource Locator), users can obtain the Chinese corpus text of a Web page and have it segmented into words stored in a Chinese lexicon database, without opening websites or clicking links. The architecture of the system is presented, and the design principles and technical implementation of each functional module are described. The results show that the wrapper fetches Web pages quickly and separates the Chinese corpus text in them effectively. The recognition rates are 70% for ambiguous sentences and 60% for new words on the Web, so the wrapper can be applied to collecting and separating new Chinese vocabulary.
Keywords:Web corpus  HTML format  wrapper  Web page fetcher  word separator
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
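
The pipeline outlined in the abstract (fetch a page by URL, isolate its Chinese text, split that text into words) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not say which segmentation algorithm the wrapper uses, so forward maximum matching against a small lexicon is swapped in here as a common baseline. The names `fetch_html`, `extract_chinese_runs`, `segment_fmm`, and the `max_len` parameter are all hypothetical, and the character range covers only the basic CJK Unified Ideographs block.

```python
import re
import urllib.request
from html.parser import HTMLParser

# Assumption: "Chinese text" means runs of characters in U+4E00..U+9FFF.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


class TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def fetch_html(url, timeout=10):
    """Download a page and decode it with the charset the server declares.

    Falls back to UTF-8 when no charset is declared (an assumption; older
    Chinese pages often use GB2312 or GBK).
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract_chinese_runs(html):
    """Return the runs of Chinese characters found in the visible text."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    runs = []
    for chunk in parser.chunks:
        runs.extend(CJK_RUN.findall(chunk))
    return runs


def segment_fmm(run, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon
    entry (up to max_len characters); fall back to a single character."""
    words = []
    i = 0
    while i < len(run):
        for size in range(min(max_len, len(run) - i), 0, -1):
            candidate = run[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words
```

With the lexicon `{"语料", "抽取"}`, `segment_fmm("语料的抽取", ...)` yields `["语料", "的", "抽取"]`. Plain maximum matching resolves ambiguity only by preferring longer entries and cannot recognize words missing from the lexicon, so the ambiguous-sentence and new-word handling reported in the abstract would need techniques beyond this sketch.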