
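Information Extraction from Financial Announcements Based on Document Structure and Deep Learning (基于文档结构与深度学习的金融公告信息抽取)
Cite this article: HUANG Sheng, WANG Bo-bo, ZHU Jing. Information extraction from financial announcements based on document structure and deep learning[J]. Computer Engineering and Design (计算机工程与设计), 2020, 41(1): 115-121
Authors: HUANG Sheng (黄胜), WANG Bo-bo (王博博), ZHU Jing (朱菁)
Affiliations: School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065; Key Laboratory of Optical Communications and Networking, Chongqing University of Posts and Telecommunications, Chongqing 400065; Data Center, Shenzhen Securities Information Limited Company, Shenzhen, Guangdong 518000
Abstract: To address the difficulty of extracting structured data from financial announcements quickly and efficiently, an information extraction method based on document structure and a Bi-LSTM-CRF network model is proposed. A custom document structure tree generation algorithm is defined, and rules are used to extract the required node information from the tree. Local sentence rules based on information-sentence trigger words are constructed to extract the information sentences that contain structured field information. Extraction of structured field information is treated as a sequence labeling problem: a domain knowledge dictionary is added during word segmentation, and a Bi-LSTM-CRF neural network model is built to recognize field information. Experimental results show that the method supports structured information extraction from multiple types of announcements, with the average F1 scores for both information sentence extraction and field information extraction exceeding 91%, which verifies the feasibility and practicality of the method in the product business.

Keywords: announcement; information extraction; neural network; document structure tree; sequence labeling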
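
The abstract does not spell out the document structure tree algorithm, so the following is only a minimal sketch of the general idea: numbered headings are assigned a depth, a stack turns the flat list of lines into a tree, and a simple rule selects nodes by title. The heading patterns, the `Node` class, and the functions `build_tree` and `find_nodes` are all assumptions made for illustration, not the authors' implementation.

```python
import re
from dataclasses import dataclass, field

# Hypothetical heading patterns, ordered from shallowest to deepest level.
# Real announcements may use other numbering styles.
HEADING_PATTERNS = [
    re.compile(r"^[一二三四五六七八九十]+、"),    # level 1, e.g. "一、"
    re.compile(r"^（[一二三四五六七八九十]+）"),  # level 2, e.g. "（一）"
    re.compile(r"^\d+[\.、]"),                    # level 3, e.g. "1."
]


@dataclass
class Node:
    title: str
    level: int
    text: list = field(default_factory=list)      # body lines under this heading
    children: list = field(default_factory=list)  # nested headings


def heading_level(line):
    """Return the heading depth of a line, or None for body text."""
    for level, pattern in enumerate(HEADING_PATTERNS, start=1):
        if pattern.match(line):
            return level
    return None


def build_tree(lines):
    """Build a heading tree from plain-text lines using a stack of open nodes."""
    root = Node(title="ROOT", level=0)
    stack = [root]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        level = heading_level(line)
        if level is None:
            stack[-1].text.append(line)  # body text belongs to the open heading
            continue
        # Close headings at the same or a deeper level before attaching.
        while stack[-1].level >= level:
            stack.pop()
        node = Node(title=line, level=level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def find_nodes(node, keyword):
    """Rule-based lookup: collect every node whose title contains the keyword."""
    hits = [node] if keyword in node.title else []
    for child in node.children:
        hits.extend(find_nodes(child, keyword))
    return hits
```

With a tree in hand, a rule such as `find_nodes(tree, "Shareholding")` narrows the search to the section whose body text is then passed to the sentence-level rules.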

Information extraction of financial announcement based on document structure and deep learning
HUANG Sheng, WANG Bo-bo, ZHU Jing. Information extraction of financial announcement based on document structure and deep learning[J]. Computer Engineering and Design, 2020, 41(1): 115-121
Authors:HUANG Sheng  WANG Bo-bo  ZHU Jing
Affiliation: School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; Key Laboratory of Optical Communications and Networking, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; Data Center, Shenzhen Securities Information Limited Company, Shenzhen 518000, China
Abstract: Structured data in financial announcements are difficult to extract quickly and efficiently. To address this problem, an information extraction method based on document structure and a Bi-LSTM-CRF network model was proposed. A document structure tree generation algorithm was defined, and rules were used to extract the required node information from the document structure tree. A local sentence rule based on the trigger words of information sentences was constructed to extract information sentences containing structured field information. The extraction of structured field information was treated as a sequence labeling problem: a domain knowledge dictionary was added during word segmentation, and a neural network model based on Bi-LSTM-CRF was constructed to recognize field information. Experimental results show that the method supports structured information extraction from multiple types of announcements. The average F1 values of both information sentence extraction and field information extraction exceed 91%, which verifies the feasibility and practicability of the proposed method in the product business.
Keywords:announcement  information extraction  neural network  document structure tree  sequence labeling
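
To make the sequence labeling framing concrete, the sketch below shows how a tag sequence, such as one a Bi-LSTM-CRF tagger might emit, can be decoded into field values and scored with span-level F1. The BIO tagging scheme, the field labels `HOLDER`, `AMOUNT`, and `DATE`, and the function names are assumptions for illustration; the abstract does not state which scheme or labels the authors used.

```python
def decode_bio(tokens, tags):
    """Collect (label, text) spans from a BIO tag sequence (assumed scheme)."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new field begins; flush the field that was open, if any.
            if current_label:
                spans.append((current_label, "".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)  # continue the open field
        else:
            # "O" or an inconsistent "I-" tag closes the open field.
            if current_label:
                spans.append((current_label, "".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label:
        spans.append((current_label, "".join(current_tokens)))
    return spans


def f1_score(predicted, gold):
    """Span-level F1 over exact (label, text) matches.

    Spans are compared as sets, so repeated identical spans count once;
    this is a simplification for the example.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical segmented information sentence and predicted tags.
tokens = ["张三", "减持", "1000", "股"]
tags = ["B-HOLDER", "O", "B-AMOUNT", "I-AMOUNT"]
predicted = decode_bio(tokens, tags)
gold = [("HOLDER", "张三"), ("AMOUNT", "1000股"), ("DATE", "2019年")]
score = f1_score(predicted, gold)
```

In this toy case both predicted fields are correct but one gold field is missed, so precision is 1, recall is 2/3, and F1 is 0.8.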
This article is indexed in VIP (维普), Wanfang Data (万方数据), and other databases.