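Design and Implementation of a Text Extraction System Based on Plugins
Cite this article: Su Yu, Dai Shangjing, Shi Chun, Ling Qing, Wu Gang. Design and Implementation of a Text Extraction System Based on Plugins[J]. Electronic Technology, 2014(8): 32-36.
Authors: Su Yu  Dai Shangjing  Shi Chun  Ling Qing  Wu Gang
Affiliation: Department of Automation, University of Science and Technology of China, Hefei, Anhui
Abstract: For a full-text retrieval system to support searching across multiple file formats, the files to be searched must first undergo text extraction, which converts them into plain text suitable for indexing. To address multi-format text extraction, this paper designs a plugin-based text extraction system that supports multiple formats. The system identifies file types automatically by combining file extensions with magic numbers, invokes existing single-format extraction plugins through a uniform interface, and converts the encoding of the extracted plain text so that the final output encoding is uniform. For directory input, the system also uses multi-process parallelism to exploit multi-core CPUs, and applies a greedy algorithm to task allocation so that the total running time is as short as possible. The system is easy to extend and has a simple programming interface. Experimental results show that the system correctly extracts text content and metadata, and that its extraction efficiency is higher than that of open-source text extraction systems such as Apache Tika.

Keywords: text extraction  multi-format  plugins  file type identification  encoding conversion  multi-process  task allocation algorithm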
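
The abstract describes identifying file types by combining the file extension with the magic number. The following is a minimal Python sketch of that idea, not the paper's implementation: the signature table, the names `MAGIC_SIGNATURES`, `sniff_magic`, and `detect_type`, and the rule for resolving container formats are all assumptions made for illustration. The byte signatures themselves are the well-known ones for PDF, ZIP, OLE2 compound files, and RTF.

```python
from pathlib import Path

# Hypothetical signature table: leading bytes -> coarse type label.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),                          # container for docx/xlsx/pptx
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),   # container for doc/xls/ppt
    (b"{\\rtf", "rtf"),
]

# Extensions that refine an ambiguous container signature (assumed lists).
ZIP_BASED = {"docx", "xlsx", "pptx", "odt", "epub"}
OLE2_BASED = {"doc", "xls", "ppt"}


def sniff_magic(header: bytes):
    """Return the label of the first matching signature, or None."""
    for signature, label in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return label
    return None


def detect_type(path, header=None):
    """Combine extension and magic number into one type label.

    If `header` is not supplied, the first 16 bytes are read from `path`.
    """
    p = Path(path)
    ext = p.suffix.lower().lstrip(".")
    if header is None:
        with open(p, "rb") as f:
            header = f.read(16)

    magic = sniff_magic(header)
    # Container signatures are ambiguous, so the extension picks the subtype.
    if magic == "zip" and ext in ZIP_BASED:
        return ext
    if magic == "ole2" and ext in OLE2_BASED:
        return ext
    # Otherwise a recognized signature overrides a possibly wrong extension.
    if magic is not None:
        return magic
    # No signature matched: fall back to the extension alone.
    return ext or "unknown"
```

In this sketch the magic number wins when it is unambiguous, so a PDF renamed to `report.txt` is still routed to a PDF plugin, while the extension is used only to disambiguate container formats or when no signature matches.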

Design and Implementation of a Text Extraction System Based on Plugins
Su Yu, Dai Shangjing, Shi Chun, Ling Qing, Wu Gang. Design and Implementation of a Text Extraction System Based on Plugins[J]. Electronic Technology, 2014(8): 32-36.
Authors: Su Yu  Dai Shangjing  Shi Chun  Ling Qing  Wu Gang
Affiliation: Department of Automation, University of Science and Technology of China
Abstract: This paper designs a text extraction system that converts files in multiple formats to plain text; such a system plays a key role in full-text retrieval. The system is plugin-based and supports a variety of file formats. It detects file types by combining file extensions with magic numbers, calls existing single-format plugins through a uniform interface, and unifies the encoding of the output plain text. Two further features are a greedy scheduling algorithm that aims to minimize the overall running time and a multi-process implementation that takes full advantage of multiple cores. The system is easy to extend and has a simple API. Experimental results show that the system can extract the text content and metadata of supported file formats, and that it outperforms Apache Tika, an existing open-source system.
Keywords: text extraction  multi-format  plugins  file type identification  character encoding conversion  multi-process  scheduling algorithm
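
The abstract mentions a greedy scheduling algorithm for distributing extraction tasks across processes but does not spell out the rule here. One common greedy heuristic for this kind of problem is longest-processing-time-first: sort tasks by estimated cost in descending order and give each one to the currently least-loaded worker. The Python sketch below illustrates that heuristic only; the function name `greedy_assign` and the use of a per-file cost estimate (for example, file size) are assumptions, and the paper's actual algorithm may differ.

```python
import heapq


def greedy_assign(costs, n_workers):
    """Assign tasks to workers with a longest-first greedy heuristic.

    costs: dict mapping task name -> estimated cost (e.g. file size).
    Returns (buckets, loads): the task names per worker and each
    worker's total estimated cost.
    """
    # Min-heap of (current load, worker index): the root is the
    # least-loaded worker.
    heap = [(0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(n_workers)]

    # Place the most expensive tasks first so small ones can even out loads.
    for name, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        load, w = heapq.heappop(heap)
        buckets[w].append(name)
        heapq.heappush(heap, (load + cost, w))

    loads = [0] * n_workers
    for load, w in heap:
        loads[w] = load
    return buckets, loads


if __name__ == "__main__":
    # Hypothetical cost estimates for five files, split over two workers.
    example = {"a.pdf": 7, "b.doc": 5, "c.html": 4, "d.txt": 3, "e.txt": 1}
    buckets, loads = greedy_assign(example, 2)
    print(buckets, loads)
```

Each bucket could then be handed to one worker process, for instance through Python's `multiprocessing` module. The heuristic does not guarantee an optimal split, but it keeps the largest per-worker load, and therefore the estimated total running time, close to the minimum.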
This article is indexed in CNKI, VIP, and other databases.