Design and Research on Chinese Word Embedding Model Based on Strokes (基于笔画中文字向量模型设计与研究)


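Cite this article:	ZHAO Haoxin, YU Jingsong, LIN Jie. Design and Research on Chinese Word Embedding Model Based on Strokes [J]. Journal of Chinese Information Processing (中文信息学报), 2019, 33(5): 17-23.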

Authors:	ZHAO Haoxin (赵浩新), YU Jingsong (俞敬松), LIN Jie (林杰)

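Affiliations:	1. School of Software and Microelectronics, Peking University, Beijing 102600, China; 2. School of Information, Renmin University of China, Beijing 100872, China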

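Abstract:	Chinese characters have a complex two-dimensional structure that extends both horizontally and vertically. Most existing work on Chinese word embeddings stops at the character level and does not use stroke sequences to generate character embeddings. Because these methods are statistical by nature, they also cannot produce high-quality vector representations for low-frequency or out-of-vocabulary characters and words. To address this, the paper proposes Stroke2Vec, a model that generates character embeddings from Chinese stroke sequences. It extends the CBOW architecture of Word2Vec, replaces the context matrix and the word embedding matrix with convolutional neural networks, and introduces an attention mechanism, aiming to model how strokes compose characters and to generate character embeddings directly from strokes. Stroke2Vec is compared with Word2Vec and GloVe on a named entity recognition (NER) task. The experimental results show that Stroke2Vec reaches an F₁ score of 81.49%, outperforming Word2Vec by 1.21% and slightly outperforming GloVe by 0.21%, while combining the character embeddings produced by Stroke2Vec with Word2Vec yields an F₁ score of 81.55% on NER.
Keywords:	character embedding; stroke; continuous bag-of-words (CBOW) model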
Design and Research on Chinese Word Embedding Model Based on Strokes

ZHAO Haoxin, YU Jingsong, LIN Jie. Design and Research on Chinese Word Embedding Model Based on Strokes [J]. Journal of Chinese Information Processing, 2019, 33(5): 17-23.

Authors:	ZHAO Haoxin, YU Jingsong, LIN Jie

Affiliation:	1.School of Software and Microelectronics, Peking University, Beijing 102600, China; 2.School of Information, Renmin University of China, Beijing 100872, China

Abstract:	Chinese characters have a complex two-dimensional structure that extends both horizontally and vertically. Most studies of Chinese word embeddings stop at the character level and do not consider stroke sequences. This paper proposes Stroke2Vec, a novel model that generates a character's embedding from its stroke sequence. The model extends the CBOW architecture of Word2Vec by using a CNN and an attention mechanism in place of the context and word embedding matrices. Stroke2Vec aims to model the rules by which strokes compose Chinese characters and to produce better character embeddings from stroke sequences alone. Compared with Word2Vec and GloVe on an NER task, the model achieves an F₁-score of 81.49%, outperforming Word2Vec by 1.21% and GloVe by 0.21%. Combining Stroke2Vec and Word2Vec embeddings yields an F₁-score of 81.55%.

Keywords:	character embedding; stroke; CBOW
