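Chinese Base-Chunk Identification Based on Distributed Character Representation (基于字的分布表征的汉语基本块识别)
Cite this article: 李国臣,党帅兵,王瑞波,李济洪.基于字的分布表征的汉语基本块识别[J].中文信息学报,2014,28(6):18-25.
Authors: LI Guochen (李国臣)  DANG Shuaibing (党帅兵)  WANG Ruibo (王瑞波)  LI Jihong (李济洪)
Affiliation: 1. Department of Computer Engineering, Taiyuan Institute of Technology, Taiyuan, Shanxi 030008, China;
2. School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi 030006, China;
3. Computer Center, Shanxi University, Taiyuan, Shanxi 030006, China
Funding: National Natural Science Foundation of China (60873128); Shanxi Province Science and Technology Infrastructure Platform Construction Project (2013091003-0101)
Abstract: Chinese base-chunk identification is one of the important tasks in the automatic syntactic and semantic analysis of Chinese. Most traditional methods cast it directly as a word-level sequence labeling problem and handle it with a CRF model. Although this approach has obtained the best results in many evaluations, taking the word as the tagging unit limits it in practice, because it depends on an automatic word segmentation system and suffers from the sparsity of Chinese word features. This paper therefore presents a deep neural network model for Chinese base-chunk identification that takes the character as the tagging unit and as the original input layer. Two distributed representations of characters, C&W and word2vec, are learned with unsupervised methods and used as the initial parameters of the model's character representation layer to strengthen the training of the model parameters. Experimental results show that a five-layer neural network using word2vec distributed character representations over a [-3, 3] window achieves precision, recall, and F-measure of 80.74%, 73.80%, and 77.12%, respectively, about 5% higher than a character-based CRF. This indicates that deep neural network models are useful for Chinese base-chunk identification.

Keywords: Chinese base-chunk  distributed representation  deep neural network  sequence labeling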

Chinese Base-Chunk Identification Based on Distributed Character Representation
LI Guochen,DANG Shuaibing,WANG Ruibo,LI Jihong.Chinese Base-Chunk Identification Based on Distributed Character Representation[J].Journal of Chinese Information Processing,2014,28(6):18-25.
Authors:LI Guochen  DANG Shuaibing  WANG Ruibo  LI Jihong
Affiliation:1. Department of Computer Engineering, Taiyuan Institute of Technology, Taiyuan, Shanxi 030008, China;
2. School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi 030006, China;
3. Computer Center of Shanxi University, Taiyuan, Shanxi 030006, China
Abstract:Chinese base-chunk identification is an important task in automatic syntactic and semantic analysis. A widely used strategy is to transform it into a word-level sequence labeling problem and to handle it with models such as CRFs. Although this strategy has achieved the best results in many open evaluations, its practical application is limited by the accuracy of Chinese word segmentation systems and by the sparsity of Chinese word features. This paper therefore presents a base-chunk identification model built on a deep neural network that takes the Chinese character as both the tagging unit and the original input layer. In addition, C&W and word2vec distributed representations of Chinese characters are learned with unsupervised models and used as the initial parameters of the network's character representation layer to improve training. Experimental results show that the final identification model, a five-layer neural network with a feature window of [-3, 3] and word2vec distributed representations, achieves precision, recall, and F-measure of 80.74%, 73.80%, and 77.12%, respectively.
Keywords:Chinese base-chunk  distributed representation  deep neural network  sequence labeling  
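
The [-3, 3] character window and the reported scores can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `char_windows` and `f_measure`, the `<PAD>` symbol, and the choice to pad sentence boundaries are all assumptions made for illustration, and the F-measure is assumed to be the balanced harmonic mean of precision and recall.

```python
from typing import List

PAD = "<PAD>"  # hypothetical padding symbol for positions outside the sentence


def char_windows(sentence: str, left: int = 3, right: int = 3) -> List[List[str]]:
    """Return, for each character, the characters in its [-left, +right] window."""
    chars = list(sentence)
    # Pad both ends so that every character has a full-width context.
    padded = [PAD] * left + chars + [PAD] * right
    return [padded[i : i + left + 1 + right] for i in range(len(chars))]


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Each character of the sentence gets a 7-character context window.
windows = char_windows("汉语基本块识别")
print(windows[0])

# Recompute the F-measure from the precision and recall given in the abstract.
print(round(f_measure(80.74, 73.80), 2))
```

In a character-level tagger of the kind the abstract describes, each window would be mapped to character embeddings (for example word2vec vectors) before being fed to the network. Recomputing the F-measure from the rounded precision and recall gives about 77.1, which agrees with the reported 77.12% up to rounding.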
This article is indexed in CNKI and other databases.