
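Automatic Acquisition of Chinese and Uighur Corpora in a Web Environment
Cite this article:Jiang Zijin, Turgun Ibrahim, Sayidan Abulimit, Tian Shengwei. Automatic Acquisition of Chinese and Uighur Corpora in a Web Environment[J]. Computer Applications and Software, 2011, 28(12).
Authors:Jiang Zijin  Turgun Ibrahim  Sayidan Abulimit  Tian Shengwei
Affiliation:School of Information Science and Engineering, Xinjiang University, Urumqi 830046, Xinjiang
Funding:National Natural Science Foundation of China (60963017); National Social Science Foundation of China (10BTQ045); Key Project of the Xinjiang Autonomous Region University Scientific Research Program (XJEDU2009I05)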
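Abstract:Sentence-level corpora are an important resource for machine translation, but because the ways of acquiring them are limited, such corpora are both small and often concentrated in specific domains, so they struggle to meet the needs of real applications. Using anchor text information, Chinese-Uighur bilingual parallel websites are located on the web through a search engine, and all bilingual parallel pages on those sites are downloaded. Pages that contain body text are then selected, and an HTML tree is built for each page from its HTML features. A page analysis method is proposed that uses the HTML tree structure as a key feature for identifying the body content of a page, and the body text is extracted according to the similarity of body content information. The extracted text is segmented into sentences to create separate sentence-level Chinese and Uighur corpora, which will support the later construction of a sentence-level Chinese-Uighur bilingual parallel corpus.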

Keywords:bilingual parallel corpus; bilingual parallel sentence pairs; body text extraction
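
The steps named in the abstract (build an HTML tree, locate the body text, split it into sentences) can be illustrated with a minimal Python sketch. The abstract does not specify the authors' tree features or their similarity measure, so this sketch substitutes a simple heuristic that is only an assumption: it picks the element holding the most non-link text directly or in its `<p>` children. The names `extract_body` and `split_sentences` and the set of sentence terminators are likewise hypothetical, and the parser assumes reasonably well-formed HTML.

```python
import re
from html.parser import HTMLParser

SKIP = {"script", "style"}                           # never counted as body text
VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no closing tag


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # mix of str (text) and Node, in document order


class TreeBuilder(HTMLParser):
    """Builds a lightweight element tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore strays.
        node = self.cur
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.cur = node.parent

    def handle_data(self, data):
        if data.strip():
            self.cur.children.append(data.strip())


def text_of(node):
    """All text under node, skipping script and style content."""
    if node.tag in SKIP:
        return ""
    parts = [c if isinstance(c, str) else text_of(c) for c in node.children]
    return " ".join(p for p in parts if p)


def link_text_len(node):
    """Number of characters under node that sit inside <a> elements."""
    if node.tag == "a":
        return len(text_of(node))
    return sum(link_text_len(c) for c in node.children if isinstance(c, Node))


def direct_score(node):
    """Non-link text held directly by node or by its <p> children (assumed heuristic)."""
    if node.tag in SKIP or node.tag == "a":
        return 0
    score = 0
    for child in node.children:
        if isinstance(child, str):
            score += len(child)
        elif child.tag == "p":
            score += len(text_of(child)) - link_text_len(child)
    return score


def walk(node):
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from walk(child)


def extract_body(html):
    """Return the text of the highest-scoring element in the page."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return text_of(max(walk(builder.root), key=direct_score))


# Split after CJK/ASCII/Arabic-script terminators, or after "." followed by
# whitespace. \u061F is the Arabic question mark used in Uighur text.
SENT_BOUNDARY = re.compile(r"(?<=[。!?!?\u061F])\s*|(?<=\.)\s+")


def split_sentences(text):
    return [s.strip() for s in SENT_BOUNDARY.split(text) if s.strip()]
```

Calling `split_sentences(extract_body(page_html))` on each downloaded page would yield the sentence lists from which monolingual corpora could be assembled. A production system would need a more robust parser and language-aware segmentation rules; the regular expression here is deliberately naive and requires Python 3.7 or later, since it splits on empty matches.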

AUTOMATIC ACQUIRING CHINESE AND UIGHUR CORPUS LIBRARY IN WEB ENVIRONMENT
Jiang Zijin,Turgun Ibrahim,Sayidan Abulimit,Tian Shengwei.AUTOMATIC ACQUIRING CHINESE AND UIGHUR CORPUS LIBRARY IN WEB ENVIRONMENT[J].Computer Applications and Software,2011,28(12).
Authors:Jiang Zijin  Turgun Ibrahim  Sayidan Abulimit  Tian Shengwei
Affiliation:School of Information Science and Engineering, Xinjiang University, Urumqi 830046, Xinjiang, China
Abstract:Sentence-level corpora are an important resource for machine translation. However, because there are few ways to acquire them, sentence-level corpora are in short supply. Moreover, they are often concentrated in a few specific fields, so they struggle to meet the demands of real applications. In this paper, anchor text information is used with search engines to find Chinese-Uighur bilingual parallel websites, and all bilingual parallel webpages are then downloaded from them. After extracting page...
Keywords:Bilingual parallel corpus library; Bilingual parallel sentence pair; Text extraction
This article is indexed in CNKI, Wanfang Data, and other databases.