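Character-level Feature Extraction Method for Railway Text Classification
Cite this article: LU Bo-ren, HU Shi-zhe, LOU Zheng-zheng, YE Yang-dong. Character-level Feature Extraction Method for Railway Text Classification[J]. Computer Science, 2021, 48(3): 220-226.
Authors: 鲁博仁 (LU Bo-ren), 胡世哲 (HU Shi-zhe), 娄铮铮 (LOU Zheng-zheng), 叶阳东 (YE Yang-dong)
Affiliation: School of Information Engineering, Zhengzhou University, Zhengzhou 450001 (all four authors)
Funding: Young Scientists Fund of the National Natural Science Foundation of China; National Key Research and Development Program of China (project grant)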
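Abstract (translated from the Chinese): Railway text classification is of considerable practical value to the development of China's railway industry. Existing feature extraction methods for Chinese text depend on segmenting the text into words beforehand. Word segmentation is not very accurate on railway text, however, so the extracted features suffer from limitations such as insufficient semantic understanding and incomplete feature coverage. To address these problems, this paper proposes a character-level feature extraction method, CLW2V (Character Level-Word2Vec), which effectively handles the difficulties caused by the rich and highly complex specialized vocabulary of railway text. Compared with the word-based TF-IDF and Word2Vec methods, the character-based CLW2V method extracts finer-grained text features and avoids the poor feature extraction that results from the traditional methods' dependence on prior word segmentation. Experiments on a railway safety supervision and licensing dataset show that, for railway text classification, CLW2V outperforms the traditional segmentation-dependent TF-IDF and Word2Vec methods.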

Keywords (translated from the Chinese): railway short text; character-level data; feature extraction method; text classification

Character-level Feature Extraction Method for Railway Text Classification
LU Bo-ren, HU Shi-zhe, LOU Zheng-zheng, YE Yang-dong. Character-level Feature Extraction Method for Railway Text Classification[J]. Computer Science, 2021, 48(3): 220-226.
Authors: LU Bo-ren, HU Shi-zhe, LOU Zheng-zheng, YE Yang-dong
Affiliation: School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China
Abstract: Railway text classification is of great practical significance to the development of China's railway industry. Existing Chinese text feature extraction methods rely on segmenting the text into words in advance. However, word segmentation has low accuracy on railway text data, so feature extraction for railway text suffers from limitations such as inadequate semantic understanding and incomplete feature acquisition. To address these problems, a character-level feature extraction method, CLW2V (Character Level-Word2Vec), is proposed, which effectively handles the difficulties caused by the rich and highly complex specialized vocabulary in railway texts. Compared with the TF-IDF and Word2Vec methods, which are based on word-level features, the character-based CLW2V method extracts finer-grained text features and avoids the poor feature extraction caused by the traditional methods' dependence on prior segmentation. Experiments on a railway safety supervision and licensing dataset show that the CLW2V feature extraction method for railway text classification outperforms the traditional TF-IDF and Word2Vec methods that rely on word segmentation.
Keywords: Railway short text; Character level vector; Feature extraction method; Text classification
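
The abstract contrasts word-level features, which need a segmenter, with character-level features, which do not. The sketch below illustrates only that contrast and is not the authors' CLW2V implementation, whose details are not given on this page. It assumes that a character-level Word2Vec would be trained on skip-gram style (center, context) pairs built directly from characters; the function names `char_tokens` and `skipgram_pairs`, the window size, and the sample phrase are all hypothetical.

```python
from collections import Counter


def char_tokens(text):
    """Split text into single characters, dropping whitespace.

    No word segmenter is involved, so segmentation errors on
    domain-specific vocabulary cannot occur at this step.
    """
    return [ch for ch in text if not ch.isspace()]


def skipgram_pairs(tokens, window=2):
    """Build (center, context) pairs inside a symmetric window.

    These are the kind of training pairs a skip-gram model consumes;
    here the tokens are characters rather than segmented words.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical railway phrase ("turnout fault"), not taken from the paper's dataset.
sample = "道岔故障"
tokens = char_tokens(sample)
pairs = skipgram_pairs(tokens, window=1)
vocab = Counter(tokens)  # character vocabulary with counts

print(tokens)
print(pairs)
```

In a fuller pipeline, the character sequences would be passed to a Word2Vec trainer and the resulting character vectors combined into a text-level feature vector for a classifier; how CLW2V performs that combination is not stated in the abstract.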
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).