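Automatic Acquisition of Large-scale Bilingual Sentence Pairs
Cite this article: WANG Shu, ZHENG Dequan, ZHAO Tiejun. Automatic Acquisition of Large-scale Bilingual Sentence Pairs[J]. Intelligent Computer and Applications, 2012(3):72-75.
Authors: WANG Shu  ZHENG Dequan  ZHAO Tiejun
Affiliation: School of Computer Science and Technology, Harbin Institute of Technology
Funding: National 863 Program of China (2011AA01A207); National Natural Science Foundation of China (61100093)
Abstract: Mining large numbers of bilingual parallel sentence pairs from the Internet makes it possible to build large-scale bilingual resources quickly and effectively for statistical machine translation. Web data sources are divided by mining target into two types, contrast pages and parallel pages, and a method for extracting bilingual sentence pairs is proposed. First, parallel text segments are extracted from each type of page: for contrast pages, mainly by page filtering and template matching; for parallel pages, which rely on the similarity of page structure, by matching corresponding nodes. Second, the Gale-Church algorithm is used for sentence alignment to obtain parallel sentence pairs. Finally, all pairs go through a unified post-processing step. Experimental results show that the precision of parallel sentence pairs obtained from contrast pages reaches 93.3%, and that of pairs obtained from parallel pages reaches 93.5%.

Keywords: parallel sentence pair mining  sentence pair evaluation  contrast page identification  parallel page judgment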
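
The abstract does not spell out how corresponding nodes are matched on parallel pages. As a rough illustration only, the sketch below assumes that each text node is identified by its tag path and that two pages are compared by aligning their tag-path sequences with Python's standard difflib. The names text_nodes and match_nodes, the VOID tag set, and the 0.8 similarity threshold are all hypothetical and do not come from the paper.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Tags that never get a closing tag; this short list is an assumption.
VOID = {"br", "img", "hr", "meta", "link", "input"}


class TextNodeCollector(HTMLParser):
    """Collects (tag_path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.nodes = []   # (path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def match_nodes(html_a, html_b, min_ratio=0.8):
    """Pair text nodes of two pages whose tag-path sequences line up.

    Returns an empty list when the structures are too dissimilar
    (min_ratio is a hypothetical threshold, not a value from the paper).
    """
    nodes_a, nodes_b = text_nodes(html_a), text_nodes(html_b)
    paths_a = [path for path, _ in nodes_a]
    paths_b = [path for path, _ in nodes_b]
    matcher = SequenceMatcher(None, paths_a, paths_b, autojunk=False)
    if matcher.ratio() < min_ratio:
        return []
    pairs = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            pairs.append((nodes_a[block.a + k][1], nodes_b[block.b + k][1]))
    return pairs


en = "<html><body><h1>News</h1><p>Hello world.</p><p>Goodbye.</p></body></html>"
zh = "<html><body><h1>新闻</h1><p>你好,世界。</p><p>再见。</p></body></html>"
for pair in match_nodes(en, zh):
    print(pair)
```

Each returned pair is a candidate parallel text segment that would still need sentence alignment and filtering.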

Automatic Acquisition of Large-scale Bilingual Sentence Pair
WANG Shu,ZHENG Dequan,ZHAO Tiejun.Automatic Acquisition of Large-scale Bilingual Sentence Pair[J].INTELLIGENT COMPUTER AND APPLICATIONS,2012(3):72-75.
Authors:WANG Shu  ZHENG Dequan  ZHAO Tiejun
Affiliation:(School of Computer Science and Technology,Harbin Institute of Technology,Harbin 150001,China)
Abstract: Mining bilingual parallel sentence pairs from web pages is an effective way to build a large-scale bilingual corpus, which benefits statistical machine translation. This paper divides web data into two types, contrast pages and parallel pages, and describes a method for extracting parallel sentence pairs. First, parallel paragraphs are obtained using bilingual page filtering and pattern matching for contrast pages, and node matching for parallel pages. Then the Gale-Church algorithm is applied for sentence alignment. Finally, all candidate sentence pairs are filtered. The results show that the precision of sentence pairs from contrast pages is 93.3%, and that of sentence pairs from parallel pages is 93.5%.
Keywords: Parallel Sentence Mining  Sentence Score  Contrast Page Selection  Parallel Page Prediction
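
The sentence alignment step names the Gale-Church algorithm. The following is a minimal sketch of length-based alignment in that style, not the authors' implementation. It uses the alignment-type priors and variance constant published by Gale and Church (1993), and a common variant of the length-difference statistic that normalizes by the mean of the two lengths. The function names match_cost and align are hypothetical, and the length ratio c=1.0 is an assumption that would need to be re-estimated for a Chinese-English corpus.

```python
import math

# Prior probabilities of alignment types (source count, target count),
# as published by Gale & Church (1993).
PRIORS = {
    (1, 1): 0.89,
    (1, 0): 0.0099, (0, 1): 0.0099,
    (2, 1): 0.089, (1, 2): 0.089,
    (2, 2): 0.011,
}


def match_cost(len_src, len_tgt, prior, c=1.0, s2=6.8):
    """Negative log-probability of aligning two spans given their lengths."""
    mean = (len_src + len_tgt / c) / 2.0
    if mean == 0:
        return -math.log(prior)
    delta = (len_tgt - len_src * c) / math.sqrt(mean * s2)
    # Two-sided tail probability of a standard normal variable.
    tail = max(math.erfc(abs(delta) / math.sqrt(2.0)), 1e-12)
    return -math.log(prior) - math.log(tail)


def align(src, tgt, c=1.0, s2=6.8):
    """Return a list of (source_sentences, target_sentences) beads."""
    n, m = len(src), len(tgt)
    src_len = [len(s) for s in src]   # lengths in characters
    tgt_len = [len(t) for t in tgt]
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), prior in PRIORS.items():
                if i < di or j < dj or cost[i - di][j - dj] == inf:
                    continue
                total = cost[i - di][j - dj] + match_cost(
                    sum(src_len[i - di:i]), sum(tgt_len[j - dj:j]), prior, c, s2
                )
                if total < cost[i][j]:
                    cost[i][j] = total
                    back[i][j] = (di, dj)
    # Trace the cheapest path back from the end of both texts.
    beads = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    beads.reverse()
    return beads


for bead in align(
    ["Hello world.", "How are you today?"],
    ["Bonjour le monde.", "Comment allez-vous aujourd'hui ?"],
):
    print(bead)
```

Beads with an empty side (1-0 or 0-1) or with merged sentences are natural candidates for the filtering step that follows alignment.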
This article is indexed in CNKI and other databases.