
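Construction of a Power Grid Data Dictionary Based on Hidden Markov Models and Principal Component Analysis
Authors: QIN Huan  MEN Yekun  YU Zhao  YE Kuan  HOU Yucheng  SUN Zhiyuan
Affiliation: State Grid Beijing Electric and Power Research Institute, Beijing 100075 (all authors)
Abstract: Power grid enterprises hold massive amounts of unstructured text recorded in Chinese, which contains a great deal of important reliability statistics. Mining this text manually is inefficient, and its accuracy varies from person to person. How to mine the important reliability statistics in the equipment defect texts of power grid enterprises efficiently, accurately, and intelligently is a problem that urgently needs to be solved. Based on a modified hidden Markov algorithm, this paper performs sentence splitting and word segmentation on unstructured text data collected through whole-process technical supervision, and formulates rules for the structured representation of unstructured data. Natural language processing algorithms, including principal component analysis, word vectors, and deep neural networks, are used to compute the semantic similarity of identically named terms, synonyms, and near-synonyms in existing problem-description texts, and the K-nearest neighbor algorithm is used to classify and cluster the dimension-reduced word vectors. This work resolves problems such as the difficulty of dividing sentence components in defect texts and the inability to extract numeric quantities precisely, and it produces a data dictionary for the operation and maintenance domain of the State Grid system. It provides a new technique for unstructured data mining in the power grid field and is of significance for future technical supervision work.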

Keywords: text classification  word segmentation  hidden Markov  technical supervision
Received: 2018/7/2
Revised: 2018/7/22
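
The segmentation step named in the abstract can be illustrated with a generic HMM character tagger. The sketch below is not the authors' modified algorithm, which this page does not specify; it is a minimal textbook Viterbi decoder over the common B/M/E/S tag scheme (begin, middle, end of a word, single-character word). All probabilities, the FLOOR constant, and the function names are hypothetical toy values chosen for the example; a real model would estimate them from a segmented corpus of defect records.

```python
import math

STATES = "BMES"  # Begin, Middle, End of a word, or Single-character word

# Hypothetical toy parameters, hand-set for this example only.
START = {"B": 0.6, "S": 0.4}
TRANS = {
    "B": {"M": 0.3, "E": 0.7},
    "M": {"M": 0.3, "E": 0.7},
    "E": {"B": 0.6, "S": 0.4},
    "S": {"B": 0.6, "S": 0.4},
}
EMIT = {
    "B": {"变": 0.5, "漏": 0.5},
    "M": {"压": 1.0},
    "E": {"器": 0.5, "油": 0.5},
    "S": {},
}
FLOOR = 1e-8  # assumed smoothing value for characters unseen in a state


def safe_log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(text):
    """Return the most likely BMES tag string for `text`."""
    if not text:
        return ""
    # score[s] = best log-probability of any tag path ending in state s
    score = {
        s: safe_log(START.get(s, 0)) + safe_log(EMIT[s].get(text[0], FLOOR))
        for s in STATES
    }
    back = []  # one dict of back-pointers per character after the first
    for ch in text[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + safe_log(TRANS[p].get(s, 0)))
            pointers[s] = prev
            new_score[s] = (
                score[prev]
                + safe_log(TRANS[prev].get(s, 0))
                + safe_log(EMIT[s].get(ch, FLOOR))
            )
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    tags = [state]
    for pointers in reversed(back):
        state = pointers[state]
        tags.append(state)
    return "".join(reversed(tags))


def segment(text):
    """Cut `text` after every character tagged E or S."""
    words, current = [], ""
    for ch, tag in zip(text, viterbi(text)):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    if current:  # keep a trailing fragment if the path did not end on E/S
        words.append(current)
    return words
```

With these toy parameters, segment("变压器漏油") splits the defect phrase into "变压器" (transformer) and "漏油" (oil leak). Working in log space avoids numeric underflow on long records.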

The Construction of Grid Data Dictionary Based on HMM and PCA
Authors:MEN Yekun  ZHAO Xueqian  YU Zhao  QIAN Mengdi  QIN Huan  SUN Zhiyuan  HOU Yucheng and YE Kuan
Affiliation: State Grid Beijing Electric and Power Research Institute, Beijing 100075, China (all authors)
Abstract: Power grid enterprises hold a large amount of unstructured text data recorded in Chinese, which contains a great deal of important reliability statistics. Mining this text manually is inefficient, and its accuracy varies from person to person. How to mine the important reliability statistics in the equipment defect texts of power grid enterprises efficiently, accurately, and intelligently has therefore become a problem that urgently needs to be solved. In this work, the unstructured text data collected through the whole process of technical supervision were split into sentences and segmented into words using a modified Hidden Markov Model (HMM) algorithm, and rules for the structured expression of unstructured data were formulated. Natural Language Processing (NLP) algorithms, including Principal Component Analysis (PCA), word vectors, and deep neural networks, were used to calculate the semantic similarity of identically named terms, synonyms, and near-synonyms in existing problem-description texts, and the K-Nearest Neighbor algorithm was used to classify and cluster the word vectors after dimensionality reduction. This work resolves problems such as the difficulty of dividing sentence components in defect texts and the inability to extract numeric quantities precisely, and it produces a data dictionary for the State Grid operation and maintenance field. It provides a new technique for unstructured data mining in the power grid field and is of significance for future technical supervision work.
Keywords: text categorization  segmentation  Hidden Markov Model  technical supervision
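
The similarity and classification stage described in the abstract (PCA on word vectors, semantic similarity, K-Nearest Neighbor) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the four-dimensional vectors, the term labels, and every function name are hypothetical, PCA is computed by plain power iteration with deflation, and cosine similarity is assumed as the similarity measure.

```python
import math
from collections import Counter


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pca_fit(rows, k, iters=200):
    """Return (mean, components) holding the top-k principal directions."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(k):
        v = [float(j + 1) for j in range(d)]  # fixed, asymmetric start vector
        for _ in range(iters):  # power iteration toward the top eigenvector
            w = [dot(row, v) for row in cov]
            norm = math.sqrt(dot(w, w)) or 1.0
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in cov])  # matching eigenvalue
        # Deflate so the next pass finds the next direction.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
        components.append(v)
    return mean, components


def project(vec, mean, components):
    """Center `vec` and express it in the reduced coordinate system."""
    centered = [x - m for x, m in zip(vec, mean)]
    return [dot(c, centered) for c in components]


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def knn_label(query, labelled, k=3):
    """Majority label among the k items most cosine-similar to `query`."""
    ranked = sorted(labelled, key=lambda item: cosine(query, item[0]), reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


# Hypothetical 4-dimensional word vectors for a few grid terms.
vectors = {
    "transformer": [1.0, 0.9, 0.1, 0.0],
    "main transformer": [0.9, 1.0, 0.0, 0.1],
    "oil leak": [0.1, 0.0, 1.0, 0.9],
    "oil seepage": [0.0, 0.1, 0.9, 1.0],
}
labels = {
    "transformer": "equipment",
    "main transformer": "equipment",
    "oil leak": "defect",
    "oil seepage": "defect",
}

mean, components = pca_fit(list(vectors.values()), k=2)
reduced = {term: project(vec, mean, components) for term, vec in vectors.items()}
labelled = [(reduced[term], labels[term]) for term in vectors]

# Classify an unseen term (its vector is also made up for this example).
query = project([0.1, 0.1, 0.8, 0.9], mean, components)
predicted = knn_label(query, labelled, k=3)
```

On this toy data the two near-synonyms "oil leak" and "oil seepage" stay close after reduction to two dimensions, and the unseen term is assigned to the "defect" group. In practice a numerical library would replace the hand-written power iteration.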
This article is indexed in Wanfang Data and other databases.