Unsupervised document structure analysis of digital scientific articles期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

Unsupervised document structure analysis of digital scientific articles

Authors:	Stefan Klampfl Michael Granitzer Kris Jack Roman Kern

Affiliation:	1. Know-Center GmbH, Inffeldgasse 13/VI, 8010?, Graz, Austria 3. University of Passau, Passau, Germany 4. Mendeley Ltd, London, UK 2. Knowledge Technologies Institute, Graz University of Technology, Graz, Austria

Abstract:	Text mining and information retrieval in large collections of scientific literature require automated processing systems that analyse the documents’ content. However, the layout of scientific articles is highly varying across publishers, and common digital document formats are optimised for presentation, but lack structural information. To overcome these challenges, we have developed a processing pipeline that analyses the structure a PDF document using a number of unsupervised machine learning techniques and heuristics. Apart from the meta-data extraction, which we reused from previous work, our system uses only information available from the current document and does not require any pre-trained model. First, contiguous text blocks are extracted from the raw character stream. Next, we determine geometrical relations between these blocks, which, together with geometrical and font information, are then used categorize the blocks into different classes. Based on this resulting logical structure we finally extract the body text and the table of contents of a scientific article. We separately evaluate the individual stages of our pipeline on a number of different datasets and compare it with other document structure analysis approaches. We show that it outperforms a state-of-the-art system in terms of the quality of the extracted body text and table of contents. Our unsupervised approach could provide a basis for advanced digital library scenarios that involve diverse and dynamic corpora.

Keywords:
本文献已被 SpringerLink 等数据库收录！

设为首页 | 免责声明 | 关于勤云 | 加入收藏