
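Research on a Template Mining Method for Translation Pairs in Wikipedia
Cite this article:DUAN Jianyong, YAN Qiwei, ZHANG Mei, HU Yi. Research on a Template Mining Method for Translation Pairs in Wikipedia[J]. Journal of Chinese Information Processing, 2015, 29(2): 190-198
Authors:DUAN Jianyong  YAN Qiwei  ZHANG Mei  HU Yi
Affiliation:1. College of Information Engineering, North China University of Technology, Beijing 100144;
2. Searching Product Section, Tencent Corporation, Shanghai 200230
Funding:National Natural Science Foundation of China (61103112); Beijing Philosophy and Social Science Planning Fund (13SHC031); Beijing Young Top-Notch Talent Cultivation Program (CIT&TCD201404005); State Language Commission 12th Five-Year Plan Fund (YB125-10)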
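Abstract:Bilingual translation pairs are important in fields such as cross-language information retrieval and machine translation. In particular, the translation of proper nouns, new words, slang, and technical terms is a key factor in the performance of such systems, yet these translation pairs are hard to obtain from existing dictionaries. Drawing on the domain coverage and structural features of Wikipedia, this paper proposes a template mining method that automatically acquires high-quality Chinese-English translation pairs from Wikipedia. The method not only mines common templates effectively but also discovers complex templates that are hard to notice by hand. It consists of three steps: 1) extract translation pairs directly from the language toolbar, as heuristic knowledge for further mining; 2) mine Chinese-English translation pair templates in Wikipedia pages using the PAT-Array structure; 3) use the mined templates to automatically extract other Chinese-English translation pairs from the pages, and evaluate the templates. Experimental results show that the translation pairs found by the templates reach an accuracy of 90.4%.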

Keywords:bilingual translation pairs  Wikipedia  template mining  information extraction  

Mining Translation Pairs with Learnt Patterns from Wikipedia
DUAN Jianyong;YAN Qiwei;ZHANG Mei;HU Yi. Mining Translation Pairs with Learnt Patterns from Wikipedia[J]. Journal of Chinese Information Processing, 2015, 29(2): 190-198
Authors:DUAN Jianyong  YAN Qiwei  ZHANG Mei  HU Yi
Affiliation:1. College of Information Engineering, North China University of Technology, Beijing 100144, China;
2. Searching Product Section, Tencent Corporation, Shanghai 200230, China
Abstract:Bilingual translation pairs play an important role in many NLP applications, such as cross-language information retrieval and machine translation. The translation of proper names, out-of-vocabulary words, idioms, and technical terms is one of the key factors affecting the performance of these systems, yet such translations can hardly be found in traditional bilingual dictionaries. This paper proposes a new method that automatically extracts high-quality translation pairs from Wikipedia by exploiting its broad domain coverage and its data structure. The method learns not only common patterns but also many patterns that can hardly be found by human beings. It consists of three steps: 1) extract translation pairs from the language toolbox of Wikipedia, which serve as heuristic knowledge for the next step; 2) learn patterns of translation pairs with a PAT-Array, using the knowledge gained from the previous step; 3) automatically extract other translation pairs using the learned patterns. Experimental results show that the accuracy reaches 90.4%.
Keywords:bilingual translation pairs   Wikipedia   pattern mining   information extraction  
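
The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the PAT-Array is replaced here by a plain counter over the literal strings that surround each seed pair, template evaluation is reduced to a hypothetical `min_support` threshold, and the seed pairs, the page text, and all function names are invented for illustration.

```python
import re
from collections import Counter

# Step 1 (assumed input): seed pairs such as those read from language links.
# These two pairs are hypothetical examples, not data from the paper.
SEEDS = [("北京", "Beijing"), ("上海", "Shanghai")]

# Hypothetical page text, invented for illustration.
PAGE = "北京（Beijing）是首都。上海（Shanghai）是港口。天津（Tianjin）在北方。"

ZH = r"[\u4e00-\u9fff]+"                 # a run of CJK ideographs
EN = r"[A-Za-z][A-Za-z .'-]*[A-Za-z]"    # a simple English term


def learn_templates(text, seeds, max_gap=5):
    """Step 2 (simplified): count the (infix, suffix) strings around seed pairs.

    A template is the short string between the Chinese and English terms
    plus the single character that follows the English term.
    """
    counts = Counter()
    for zh, en in seeds:
        pattern = re.escape(zh) + r"(.{1,%d}?)" % max_gap + re.escape(en) + r"(.?)"
        for m in re.finditer(pattern, text):
            counts[(m.group(1), m.group(2))] += 1
    return counts


def extract_pairs(text, templates, min_support=2):
    """Step 3 (simplified): apply sufficiently frequent templates to the text."""
    pairs = []
    for (infix, suffix), support in templates.items():
        if support < min_support:
            continue  # crude stand-in for template evaluation
        pattern = "(%s)%s(%s)%s" % (ZH, re.escape(infix), EN, re.escape(suffix))
        for m in re.finditer(pattern, text):
            pair = (m.group(1), m.group(2))
            if pair not in pairs:
                pairs.append(pair)
    return pairs


templates = learn_templates(PAGE, SEEDS)
candidates = extract_pairs(PAGE, templates)
new_pairs = [p for p in candidates if p not in SEEDS]
print(new_pairs)
```

On the invented sample text, the only learned template is the full-width parenthesis pattern, and the only pair found beyond the seeds is 天津 / Tianjin. The greedy `ZH` expression would absorb any Chinese characters directly preceding a term, so real pages would need proper term boundary detection; the sketch sidesteps this by starting each term after a sentence break.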
This article is indexed in CNKI and other databases.