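A Study of Automatic Phrase Paraphrase Extraction from a Chinese-English Parallel Patent Corpus
Cite this article: LI Li, LIU Zhiyuan, SUN Maosong. A Study of Automatic Phrase Paraphrase Extraction from a Chinese-English Parallel Patent Corpus [J]. Journal of Chinese Information Processing (中文信息学报), 2013, 27(6): 151-158.
Authors: LI Li, LIU Zhiyuan, SUN Maosong
Affiliation: Department of Computer Science, Tsinghua University, State Key Laboratory of Intelligent Technology and Systems; Tsinghua National Laboratory for Information Science and Technology (in preparation), Beijing 100084
Funding: National Natural Science Foundation of China (61133012); National 863 Program of China (2012AA011102)
Abstract: Automatic extraction of phrase paraphrases is an important research topic in natural language processing and has been widely applied to tasks such as information retrieval, question answering, and document classification. Patent corpora, as carriers of human knowledge and technology, are rich in content, so automatically extracting phrase paraphrases from Chinese-English parallel patent corpora can help improve natural language processing tasks on technical topics. This paper extracts phrase paraphrases from a Chinese-English parallel patent corpus with a technique based on statistical machine translation and filters the extracted results with a technique based on chunk analysis. In addition, to handle extraction errors caused by alignment errors and translation ambiguity, we re-rank the extracted paraphrases by distributional similarity. Experiments show that extraction based on statistical machine translation reaches a precision of 43.20% on Chinese and 43.60% on English, and that chunk-based filtering raises the precision to 75.50% and 52.40%, respectively. Re-ranking by distributional similarity also effectively improves the extraction results.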


Automatically Extracting Phrase-level Paraphrases from Chinese-English Parallel Patents
LI Li, LIU Zhiyuan, SUN Maosong. Automatically Extracting Phrase-level Paraphrases from Chinese-English Parallel Patents [J]. Journal of Chinese Information Processing, 2013, 27(6): 151-158.
Authors: LI Li, LIU Zhiyuan, SUN Maosong
Affiliation: State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Abstract: Automatically extracting phrase-level paraphrases is an important research task in natural language processing (NLP) and has been applied in areas such as information retrieval, question answering, and document classification. Patents, as an important carrier of human knowledge and technology, contain abundant information. Hence, automatically extracting phrase-level paraphrases from Chinese-English parallel patents can benefit NLP tasks on technical topics. In this paper, we extract phrase-level paraphrases from Chinese-English parallel patents using a method based on statistical machine translation, and we use chunk parsing to verify the extracted paraphrases. Moreover, to handle errors caused by translation ambiguity and bad word alignment, we use distributional similarity to re-rank the extracted phrase-level paraphrases. In experiments, the method based on statistical machine translation achieves a precision of 43.20% on Chinese patents and 43.60% on English patents for the Top-500 results. After verification with chunk parsing, the precisions rise to 75.50% and 52.40%, respectively. The re-ranking based on distributional similarity also improves the performance significantly.
Keywords: phrase-level paraphrase; statistical machine translation; chunk parsing; distributional similarity
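
The abstract does not spell out the extraction or re-ranking formulas, so the following is only a minimal sketch of one common way such a pipeline can work. It assumes a pivot-style formulation, in which an English phrase is scored as a paraphrase of another when both translate to the same Chinese phrase, and it assumes that re-ranking interpolates that score with the cosine similarity of context vectors. The toy phrase tables, the function names `pivot_paraphrases`, `cosine`, and `rerank`, and the `weight` parameter are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy phrase tables with made-up probabilities.
# P_ZH_GIVEN_EN[e][f] stands for p(f | e); P_EN_GIVEN_ZH[f][e] for p(e | f).
P_ZH_GIVEN_EN = {
    "storage device": {"存储装置": 0.6, "存储设备": 0.4},
    "memory device": {"存储装置": 0.5, "存储器件": 0.5},
}
P_EN_GIVEN_ZH = {
    "存储装置": {"storage device": 0.5, "memory device": 0.5},
    "存储设备": {"storage device": 1.0},
    "存储器件": {"memory device": 1.0},
}


def pivot_paraphrases(phrase, p_f_given_e, p_e_given_f):
    """Score candidates as sum over pivots f of p(f | phrase) * p(cand | f)."""
    scores = defaultdict(float)
    for pivot, p_fe in p_f_given_e.get(phrase, {}).items():
        for cand, p_ef in p_e_given_f.get(pivot, {}).items():
            if cand != phrase:  # a phrase is not its own paraphrase
                scores[cand] += p_fe * p_ef
    return dict(scores)


def cosine(u, v):
    """Cosine similarity between two sparse context-count vectors (dicts)."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank(phrase, candidates, contexts, weight=0.5):
    """Assumed scheme: linear interpolation of pivot score and cosine similarity."""
    base = contexts.get(phrase, {})
    rescored = {
        cand: (1.0 - weight) * score + weight * cosine(base, contexts.get(cand, {}))
        for cand, score in candidates.items()
    }
    # Highest combined score first.
    return sorted(rescored.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    cands = pivot_paraphrases("storage device", P_ZH_GIVEN_EN, P_EN_GIVEN_ZH)
    # Hypothetical context counts gathered from monolingual text.
    ctx = {
        "storage device": {"data": 3, "write": 2},
        "memory device": {"data": 2, "write": 1},
    }
    print(rerank("storage device", cands, ctx))
```

In this sketch a candidate that shares few contexts with the original phrase drops in the ranking even when alignment noise gave it a high pivot score, which is the kind of error the abstract says re-ranking is meant to address.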