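A method for automatic new-word extraction
Cite this article: LI Ya-song, WANG Yu-long. A method for automatic new-word extraction [J]. Telecom Engineering Technics and Standardization, 2014(12): 83-86.
Authors: LI Ya-song, WANG Yu-long
Affiliations: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876; EBUPT Information Technology Co., Ltd., Beijing 100191
Funding: National 973 Program of China (No. 2013CB329102); National Natural Science Foundation of China (Nos. 61372120, 61121001); Program for Changjiang Scholars and Innovative Research Team in University (No. IRT1049); Key (Major) Project of Science and Technology Research of the Ministry of Education (No. MCM20130310); Beijing Higher Education Young Elite Teacher Project (No. YETP0473).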
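Abstract: New words now appear in web corpora continuously and in large numbers, including many coined by internet users and others arising from hot social topics. Corpora generated by social networks also contain many colloquialisms, abbreviations, and casual expressions. All of these reduce the accuracy of Chinese word segmentation. This paper proposes an automatic new-word extraction method that aims to extract new words from a given corpus accurately and quickly, to generate a domain-specific dictionary, and so to segment web corpora more accurately. The method extracts candidate words from the corpus, computes the support and confidence of each candidate, and selects new words by thresholding, thereby extracting new words from massive text accurately and quickly.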

Keywords: new-word extraction; support; confidence; dispersion; Gini index

New method for the auto-extraction of new words
LI Ya-song, WANG Yu-long. New method for the auto-extraction of new words [J]. Telecom Engineering Technics and Standardization, 2014(12): 83-86.
Authors: LI Ya-song, WANG Yu-long
Affiliation: 1 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China; 2 EBUPT Information Technology Co., Ltd., Beijing 100191, China
Abstract: Large numbers of new words are continually emerging in web text corpora. Many of them are coined by netizens or arise from hot social topics, and the corpora produced by social networking services also contain many colloquial expressions and abbreviations. Together these make Chinese word segmentation difficult. This paper proposes a new-word extraction method that aims to extract new words from a given corpus, to generate a domain dictionary, and to segment Chinese text more accurately. The method first extracts candidate words from the corpus, then calculates the support and confidence of each candidate, and finally sifts out the new words by threshold, so that new words can be extracted accurately and rapidly from large volumes of text.
Keywords: new words extraction; support; confidence; dispersion; Gini index
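
The pipeline summarized in the abstract (extract candidates, score them by support and confidence, keep those above a threshold) can be illustrated with the minimal Python sketch below. It is not the authors' implementation. The abstract does not give the formulas, so the definitions here are assumptions: support is taken as the raw corpus frequency of a character n-gram, and confidence as that frequency divided by the larger of the frequencies of the candidate's longest prefix and longest suffix. The function names, the parameters (`min_support`, `min_confidence`, `max_len`), and the sample corpus are all hypothetical, and the dispersion and Gini index measures named in the keywords are not modeled.

```python
from collections import Counter


def ngram_counts(texts, max_len=4):
    """Count every character n-gram of length 1..max_len in the corpus."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_len + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def extract_new_words(texts, known_words, min_support=3,
                      min_confidence=0.6, max_len=4):
    """Return {candidate: (support, confidence)} for candidates passing both thresholds.

    Assumed definitions (not taken from the paper):
      support    = number of occurrences of the candidate in the corpus
      confidence = support / max(count(prefix), count(suffix)),
                   where prefix/suffix drop the last/first character
    """
    counts = ngram_counts(texts, max_len)
    results = {}
    for cand, support in counts.items():
        # Skip single characters, words already in the dictionary,
        # and candidates that are too rare.
        if len(cand) < 2 or cand in known_words or support < min_support:
            continue
        prefix, suffix = cand[:-1], cand[1:]
        # Both parts occur at least as often as the candidate, so the
        # denominator is never zero.
        confidence = support / max(counts[prefix], counts[suffix])
        if confidence >= min_confidence:
            results[cand] = (support, confidence)
    return results


# Hypothetical toy corpus: "给力" occurs 3 times, "给" 4 times, "力" 3 times.
corpus = ["给力啊", "真给力", "很给力", "给他"]
new_words = extract_new_words(corpus, known_words=set())
print(new_words)  # {'给力': (3, 0.75)}
```

Taking the weaker of the two associations (the larger denominator) means a candidate is rejected when either of its parts appears frequently in other contexts. A real system would tune both thresholds on the target corpus.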
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.