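Research on Text Extraction from MS-DOC Files
Cite this article: HUANG Bu-gen, FU Juan. Research on text extraction from MS-DOC files[J]. Computer Engineering & Science, 2014, 36(8): 1505-1511.
Authors: HUANG Bu-gen, FU Juan
Funding: National Social Science Fund of China (13BTQ046); Public Security Technology, supported by the special fund for key discipline construction in Jiangsu higher education institutions under the 12th Five-Year Plan
Abstract: Keyword search is widely used in intelligence analysis, search engines, and computer forensics. A keyword search over MS DOC files can produce false negatives: a keyword that is clearly present is not found. The Microsoft compound document structure consists of a series of streams. Streams are stored in units of sectors, and the directory structure and the sector allocation table manage the streams and their storage space. The text of an MS DOC file is stored in the WordDocument stream, not necessarily contiguously, and the Table stream records how it is divided into blocks. A keyword may span non-adjacent sectors, and even within adjacent sectors one part of a keyword may be stored compressed while the other part is stored uncompressed. These are the causes of false negatives in keyword search. Extracting the text from the WordDocument stream according to the block information in the Table stream, converting it to a uniform encoding, and then running the keyword search avoids these false negatives.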

Keywords: compound document; text extraction; keyword; search; computer forensics
Received: 2012-12-11
Revised: 2014-08-25
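
The abstract's first cause of false negatives, a keyword straddling sectors that are not adjacent in the file, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes 512-byte sectors preceded by a 512-byte header, treats the sector allocation table as an already-parsed list, and ignores the mini stream and the DIFAT. The name `read_chain` and the sample data are hypothetical.

```python
SECTOR_SIZE = 512          # assumption: 512-byte sectors after a 512-byte header
ENDOFCHAIN = 0xFFFFFFFE    # marks the last sector of a chain
FREESECT = 0xFFFFFFFF      # marks an unallocated sector


def read_chain(data: bytes, fat: list, first_sector: int, size: int) -> bytes:
    """Reassemble a stream by following its chain in the sector allocation table."""
    out = bytearray()
    seen = set()
    sector = first_sector
    while sector != ENDOFCHAIN:
        if sector in seen or sector >= len(fat):
            raise ValueError("corrupt sector chain")
        seen.add(sector)
        offset = (sector + 1) * SECTOR_SIZE  # sector 0 starts right after the header
        out += data[offset:offset + SECTOR_SIZE]
        sector = fat[sector]
    return bytes(out[:size])


# Hypothetical file: the stream occupies sector 2 and then sector 0.
header = bytes(SECTOR_SIZE)
sector0 = b"sics" + b"y" * (SECTOR_SIZE - 4)    # second half of the stream
sector2 = b"x" * (SECTOR_SIZE - 5) + b"foren"   # first half of the stream
data = header + sector0 + bytes(SECTOR_SIZE) + sector2 + bytes(SECTOR_SIZE)
fat = [ENDOFCHAIN, FREESECT, 0, FREESECT]

stream = read_chain(data, fat, first_sector=2, size=2 * SECTOR_SIZE)
print(b"forensics" in data)    # search over the raw file bytes
print(b"forensics" in stream)  # search over the reassembled stream
```

Searching the reassembled stream rather than the raw file bytes is what lets the keyword be matched across the sector boundary.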

Research on extracting text from MS-DOC files
HUANG Bu-gen, FU Juan. Research on extracting text from MS-DOC files[J]. Computer Engineering & Science, 2014, 36(8): 1505-1511.
Authors:HUANG Bu-gen  FU Juan
Affiliation:(1.Department of Computer Information and Cyber Security,Jiangsu Police Institute,Nanjing 210012;2.Huaian Municipal Public Security Bureau,Huaian 223005,China)
Abstract:Keyword search is widely used in intelligence analysis, search engines, and computer forensics. However, a keyword search over MS DOC files may miss matches that are actually present, that is, produce false negatives. A Microsoft compound document is composed of a series of streams stored in sectors, and the streams and sectors are managed through the directory and the sector allocation table. The text of an MS DOC file is stored in the WordDocument stream, not necessarily contiguously, and the Table stream records the block information. A keyword may be stored across sectors that are not adjacent. Even when the sectors are adjacent, one part of the keyword may be stored compressed while the other part is not. These are the causes of the false negatives. To avoid them, the text is first extracted from the WordDocument stream according to the block information in the Table stream and converted to a uniform encoding, and the keyword search is then carried out on the result.
Keywords:compound document  text extraction  keyword  search  computer forensics  
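
The extraction step described in the abstract, reading blocks from the WordDocument stream and unifying their encoding, might look like the following Python sketch. It is an illustration under stated assumptions, not the authors' code. It assumes the block table has already been located in the Table stream and is laid out as a piece table: n+1 four-byte character positions followed by n eight-byte descriptors, where bit 30 of each descriptor's offset field flags compressed (8-bit) text stored at half the recorded offset. Decoding compressed text as cp1252 is an approximation. The names `parse_piece_table` and `extract_text` and the sample data are hypothetical.

```python
import struct

COMPRESSED_FLAG = 0x40000000  # assumption: bit 30 marks 8-bit (compressed) text
OFFSET_MASK = 0x3FFFFFFF      # low 30 bits hold the offset field


def parse_piece_table(plc: bytes) -> list:
    """Return (cp_start, cp_end, offset_field, compressed) for each block."""
    n = (len(plc) - 4) // 12  # total size is 4 * (n + 1) + 8 * n bytes
    cps = struct.unpack_from(f"<{n + 1}I", plc, 0)
    pieces = []
    for i in range(n):
        # Each descriptor: 2 bytes of flags, 4-byte offset field, 2 bytes of properties.
        (raw,) = struct.unpack_from("<I", plc, 4 * (n + 1) + 8 * i + 2)
        pieces.append((cps[i], cps[i + 1], raw & OFFSET_MASK, bool(raw & COMPRESSED_FLAG)))
    return pieces


def extract_text(word_document: bytes, plc: bytes) -> str:
    """Concatenate all blocks into one uniformly encoded Python string."""
    parts = []
    for cp_start, cp_end, fc, compressed in parse_piece_table(plc):
        count = cp_end - cp_start
        if compressed:
            start = fc // 2  # 8-bit text sits at half the recorded offset
            parts.append(word_document[start:start + count].decode("cp1252", errors="replace"))
        else:
            parts.append(word_document[fc:fc + 2 * count].decode("utf-16-le", errors="replace"))
    return "".join(parts)


# Hypothetical stream: "DOC" stored compressed, "文件" stored as 16-bit text.
keyword = "DOC文件"
word_document = bytearray(0x60)
word_document[0x10:0x13] = b"DOC"
word_document[0x40:0x44] = "文件".encode("utf-16-le")
word_document = bytes(word_document)
plc = (struct.pack("<3I", 0, 3, 5)
       + struct.pack("<HIH", 0, 0x20 | COMPRESSED_FLAG, 0)
       + struct.pack("<HIH", 0, 0x40, 0))

print(keyword.encode("utf-16-le") in word_document)  # raw byte search
print(keyword in extract_text(word_document, plc))   # search after extraction
```

Because the two halves of the keyword are stored in different encodings, no single byte pattern matches it in the raw stream. Decoding every block to one string type first is the design choice that removes that mismatch.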
This article is indexed in CNKI and other databases.