Design and Research on Chinese Word Embedding Model Based on Strokes (基于笔画中文字向量模型设计与研究)
Citation: ZHAO Haoxin, YU Jingsong, LIN Jie. Design and Research on Chinese Word Embedding Model Based on Strokes[J]. Journal of Chinese Information Processing, 2019, 33(5): 17-23.
Authors: ZHAO Haoxin  YU Jingsong  LIN Jie
Affiliation: 1. School of Software and Microelectronics, Peking University, Beijing 102600, China;
2. School of Information, Renmin University of China, Beijing 100872, China
Abstract: Chinese characters have a complex two-dimensional structure that extends both horizontally and vertically. Most existing work on Chinese word embeddings stops at the character level and does not use stroke sequences to generate character embeddings. Because these models are statistical in nature, they also cannot produce high-quality vector representations for low-frequency or out-of-vocabulary characters and words. This paper proposes Stroke2Vec, a model that generates character embeddings from stroke sequences. It extends the CBOW architecture of Word2Vec by replacing the context matrix and the word-embedding matrix with a convolutional neural network (CNN) and by adding an attention mechanism. The aim is to model how strokes compose characters and to generate character embeddings directly from strokes. Stroke2Vec is compared with Word2Vec and GloVe on a named entity recognition (NER) task. It achieves an F1 score of 81.49%, outperforming Word2Vec by 1.21% and slightly outperforming GloVe by 0.21%. Combining the character embeddings produced by Stroke2Vec with those of Word2Vec gives an F1 score of 81.55% on NER.
Keywords: character embedding  stroke  continuous bag-of-words (CBOW)
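To make the architecture in the abstract more concrete, the following is a minimal sketch of the general idea: a character vector is computed from its stroke sequence by a 1-D convolution followed by attention pooling, and a CBOW-style context vector is the mean of the encoded context characters. It is not the authors' implementation. The five-class stroke coding, all dimensions, the function names `encode_char` and `cbow_context`, and the random (untrained) weights are assumptions made purely for illustration; the abstract does not specify any of them.

```python
import math
import random

# Assumption: strokes are coded into five basic classes (ids 0..4), a common
# convention. The paper's actual stroke inventory is not given in the abstract.
NUM_STROKE_TYPES = 5
STROKE_DIM = 4   # size of each stroke embedding (hypothetical)
KERNEL = 3       # convolution window over consecutive strokes (hypothetical)
CHAR_DIM = 6     # size of the resulting character vector (hypothetical)

rng = random.Random(0)  # fixed seed so the untrained weights are reproducible


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


stroke_emb = rand_matrix(NUM_STROKE_TYPES, STROKE_DIM)
conv_w = rand_matrix(CHAR_DIM, KERNEL * STROKE_DIM)  # one filter per output dim
attn_v = [rng.uniform(-0.5, 0.5) for _ in range(CHAR_DIM)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode_char(strokes):
    """Map a stroke-id sequence (values 0..4) to a character vector."""
    # Zero-pad both ends so even a one-stroke character yields several windows.
    pad = [[0.0] * STROKE_DIM] * (KERNEL - 1)
    seq = pad + [stroke_emb[s] for s in strokes] + pad
    # 1-D convolution: each window of KERNEL stroke embeddings gives CHAR_DIM features.
    feats = []
    for i in range(len(seq) - KERNEL + 1):
        window = [x for vec in seq[i:i + KERNEL] for x in vec]
        feats.append([math.tanh(dot(w, window)) for w in conv_w])
    # Attention pooling: softmax-weighted sum of the window features.
    weights = softmax([dot(attn_v, f) for f in feats])
    return [sum(w * f[d] for w, f in zip(weights, feats)) for d in range(CHAR_DIM)]


def cbow_context(context_chars):
    """CBOW-style context vector: the mean of the encoded context characters."""
    vecs = [encode_char(c) for c in context_chars]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(CHAR_DIM)]


# Illustrative stroke sequences under the assumed coding
# (0 = horizontal, 1 = vertical, 2 = left-falling, 3 = right-falling, 4 = turning).
shi = [0, 1]      # 十
ren = [2, 3]      # 人
da = [0, 2, 3]    # 大
print(encode_char(da))
print(cbow_context([shi, ren]))
```

Because the encoder depends only on the stroke sequence, any character with a known stroke order receives a vector, including characters never seen in training. This is the property the abstract highlights for low-frequency and out-of-vocabulary characters. In a trained model, the weights above would be learned with a CBOW-style objective rather than drawn at random.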