Similar literature
 20 similar documents found; search took 234 ms
1.
This paper presents a page segmentation method based on the approximated area Voronoi diagram. The proposed method has two characteristics: (1) The Voronoi diagram yields candidate boundaries of document components from page images with non-Manhattan layout and skew. (2) The candidates are used to estimate the intercharacter and interline gaps, so the boundaries can be selected without domain-specific parameters. Experiments on 128 images with non-Manhattan layout and skew of 0° to 45°, as well as 98 images with Manhattan layout, confirm that the method is effective for extracting body text regions and is as efficient as other methods based on connected component analysis.

2.
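A window-based skew detection method for OCR page images   Total citations: 2, self-citations: 0, citations by others: 2
Page images produced by scanning documents usually have some skew. When the skew angle is too large, it adversely affects subsequent layout analysis and character recognition. To detect the skew angle of a page image quickly and accurately while reducing computation, a skew detection method based on window transformation is proposed. The algorithm first blurs the details of the text and pictures within the window, then fits straight lines to their edges so that the skew angle of the page image can be detected quickly. Experimental results show that the method detects the skew angles of various kinds of page images quickly and accurately, and that it adapts well.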

3.
Skew estimation and page segmentation are two closely related processing stages in document image analysis. Skew estimation needs proper page segmentation, especially for document images with multiple skews, which are common in scanned images from thick bound publications in 2-up style or postal envelopes with various printed labels. Even if only a single skew is of concern for a document image, the presence of minority regions with different skews or undefined skew, such as noise, may severely affect the estimate of the dominant skew. Page segmentation, on the other hand, may need to know the exact skew angle of a page in order to work properly. This paper presents a skew estimation method with built-in skew-independent segmentation functionality that is capable of handling document images with multiple regions of different skews. It is based on the convex hulls of the individual components (i.e. the smallest convex polygon that fully contains a component) and those of the component groups (i.e. the smallest convex polygon that fully contains all the components in a group) in a document image. The proposed method first extracts the convex hulls of the components, then segments the image into groups of components according to both the spatial distances and the size similarities among the convex hulls of the components. This process not only extracts hints about the alignment of the text groups, but also separates noise and graphical components from the textual ones. To verify the proposed algorithms, the full sets of real and synthetic samples of the University of Washington English Document Image Database I (UW-I) are used. Quantitative and qualitative comparisons with some existing methods are also provided.
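
The convex hull that this abstract builds on (the smallest convex polygon containing a component) can be illustrated with a short sketch. The code below is not the authors' implementation: it uses Andrew's monotone chain algorithm, and the function name `convex_hull` and the point-tuple representation of a component are assumptions made for illustration.

```python
def convex_hull(points):
    """Return the convex hull of 2-D points in counter-clockwise order.

    Illustrative helper (hypothetical name): `points` is an iterable of
    (x, y) tuples, e.g. the foreground pixels of one connected component.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive for a left turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]
```

The hull of a component group, as described in the abstract, would be the hull of the union of the member components' points.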

4.
The document spectrum for page layout analysis   Total citations: 17, self-citations: 0, citations by others: 17
Page layout analysis is a document processing technique used to determine the format of a page. This paper describes the document spectrum (or docstrum), which is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components. The method yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks. It is advantageous over many other methods in three main ways: independence from skew angle, independence from different text spacings, and the ability to process local regions of different text orientations within the same image. Results of the method shown for several different page formats and for randomly oriented subpages on the same image illustrate the versatility of the method. We also discuss the differences, advantages, and disadvantages of the docstrum with respect to other layout methods.
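
The nearest-neighbor idea behind the docstrum can be sketched as follows. This is a simplified illustration, not the published algorithm: it only histograms the angles to each component's `k` nearest neighbors and takes the peak as the skew. The function name, the parameters `k` and `bin_width`, and the brute-force neighbor search are all assumptions.

```python
import math
from collections import Counter


def docstrum_skew_degrees(centroids, k=2, bin_width=1.0):
    """Estimate skew as the most common nearest-neighbor angle.

    `centroids` is a list of (x, y) component centers. Angles are folded
    into [-90, 90) degrees so that left and right neighbors on the same
    text line vote for the same direction.
    """
    votes = Counter()
    for i, (xi, yi) in enumerate(centroids):
        neighbours = sorted(
            (math.hypot(xj - xi, yj - yi), xj - xi, yj - yi)
            for j, (xj, yj) in enumerate(centroids)
            if j != i
        )[:k]
        for _dist, dx, dy in neighbours:
            angle = math.degrees(math.atan2(dy, dx))
            if angle >= 90.0:
                angle -= 180.0
            elif angle < -90.0:
                angle += 180.0
            votes[round(angle / bin_width)] += 1
    peak_bin, _count = votes.most_common(1)[0]
    return peak_bin * bin_width
```

The same neighbor pairs also carry distances, which is where the within-line and between-line spacings mentioned in the abstract would come from; that part is omitted here.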

5.
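Page skew detection and correction based on line continuity   Total citations: 14, self-citations: 0, citations by others: 14
During document scanning, the input document image is inevitably skewed, and layout analysis and character recognition algorithms are both very sensitive to page skew, so skew detection and correction is an important preprocessing step in document analysis. This paper proposes a skew detection method based on line continuity. It takes the center point of the bottom edge of each character connected-component bounding box as a feature point and uses the relationship between the feature points in a text line and the baseline to compute the baseline direction, which is the page skew direction. A skew correction method based on offset values is then introduced. Experiments show that the algorithm is fast and accurate.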

6.
Dengel A., Bleisinger R., Hoch R., Fein F., Hones F. Computer, 1992, 25(7): 63-67
The principles of the model-based document analysis system called ΠODA (paper interface to office document architecture), which was developed as a prototype for the analysis of single-sided business letters in German, are presented. Initially, ΠODA extracts a part-of hierarchy of nested layout objects such as text blocks, lines, and words based on their presentation on the page. Subsequently, in a step called logical labeling, the layout objects and their compositions are geometrically analyzed to identify corresponding logical objects that can be related to a human-perceptible meaning, such as sender, recipient, and date in a letter. A context-sensitive text recognition for logical objects is then applied using logical vocabularies and syntactic knowledge. As a result, ΠODA produces a document representation that conforms to the ODA international standard.

7.
When a page of a book is scanned or photocopied, textual noise (extraneous symbols from the neighboring page) and/or non-textual noise (black borders, speckles, ...) appear along the border of the document. Existing document analysis methods can handle non-textual noise reasonably well, whereas textual noise still presents a major issue for document analysis systems. Textual noise may result in undesired text in optical character recognition (OCR) output that needs to be removed afterwards. Existing document cleanup methods try to explicitly detect and remove marginal noise. This paper presents a new perspective for document image cleanup by detecting the page frame of the document. The goal of page frame detection is to find the actual page contents area, ignoring marginal noise along the page border. We use a geometric matching algorithm to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property. We evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. Further tests were run on a dataset of magazine pages and on a set of camera-captured document images. To demonstrate the benefits of using page frame detection in practical applications, we choose OCR and layout-based document image retrieval as sample applications. Experiments using a commercial OCR system show that by removing characters outside the computed page frame, the OCR error rate is reduced from 4.3 to 1.7% on the UW-III dataset. The use of page frame detection in a layout-based document image retrieval application decreases the retrieval error rates by 30%.

8.
One of the difficulties in understanding document images is document layout analysis, which is the first step in document image modeling. In this paper, a robust system that uses a multilevel-homogeneity structure in accordance with a hybrid methodology is proposed to deal with this problem. Our system consists of the following three main stages: classification, segmentation, and refinement and labeling. Unlike other page segmentation methods, the proposed system includes an efficient algorithm to detect table regions in document images. In addition, to create an effective application, the proposed system is designed to work with a variety of document languages. The proposed method was tested with the ICDAR2015 competition (RDCL-2015) and three other published datasets in different languages. The results of these tests show that the accuracy of the proposed system is superior to that of previous methods.

9.
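A document image skew detection method based on least squares   Total citations: 9, self-citations: 0, citations by others: 9
During document scanning, the input document image is inevitably skewed, and layout analysis and character recognition algorithms are both very sensitive to page skew, so skew detection and correction is an important preprocessing step in document analysis. This paper proposes a skew detection method based on least squares. It takes the center point of the bottom edge of each character connected-component bounding box as a feature point and, using the relationship between the feature points in a text line and the baseline, fits the baseline direction to the feature points by least squares; this direction is the page skew direction. The paper also introduces a fast skew correction algorithm based on line fitting. Experiments show that the algorithm is fast and accurate.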
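
The fitting step described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: the function name `baseline_skew_degrees` and the `(left, top, right, bottom)` box layout are hypothetical, the boxes are assumed to belong to a single text line already, and ordinary least squares of y on x is used.

```python
import math


def baseline_skew_degrees(boxes):
    """Fit a baseline through bottom-center points and return its angle.

    `boxes` holds (left, top, right, bottom) bounding boxes of the
    characters in one text line. The angle is measured in the coordinate
    system of the boxes (with image coordinates, y grows downward).
    """
    points = [((left + right) / 2.0, float(bottom))
              for left, _top, right, bottom in boxes]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two feature points")

    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0.0:
        raise ValueError("feature points are vertically aligned")

    slope = sxy / sxx  # least-squares slope of y = slope * x + intercept
    return math.degrees(math.atan(slope))
```

Characters with descenders (g, p, y) sit below the baseline and would pull the fit; a practical version would probably need to filter or down-weight such outliers, which this sketch does not attempt.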

10.
Parameter-free geometric document layout analysis   Total citations: 1, self-citations: 0, citations by others: 1
Automatic transformation of paper documents into electronic documents requires geometric document layout analysis at the first stage. However, variations in character font sizes, text line spacing, and document layout structures have made it difficult to design a general-purpose document layout analysis algorithm for many years. The use of some parameters has therefore been unavoidable in previous methods. The authors propose a parameter-free method for segmenting document images into maximal homogeneous regions and identifying them as texts, images, tables, and ruling lines. A pyramidal quadtree structure is constructed for multiscale analysis, and a periodicity measure is suggested to find a periodical attribute of text regions for page segmentation. To obtain robust page segmentation results, a confirmation procedure using texture analysis is applied only to ambiguous regions. Based on the proposed periodicity measure, multiscale analysis, and confirmation procedure, a robust method for geometric document layout analysis is developed that is independent of character font sizes, text line spacing, and document layout structures. The proposed method was tested on the document database from the University of Washington and the MediaTeam Document Database. The results of these tests show that the proposed method provides more accurate results than previous ones.

11.
Document layout analysis, or page segmentation, is the task of decomposing document images into different regions such as texts, images, separators, and tables. It is still a challenging problem due to the variety of document layouts. In this paper, we propose a novel hybrid method with three main stages to deal with this problem. In the first stage, the text and non-text elements are classified using a minimum homogeneity algorithm. This method combines connected component analysis with a multilevel homogeneity structure. In the second stage, a new homogeneity structure is combined with an adaptive mathematical morphology on the text document to obtain a set of text regions. In addition, on the non-text document, the non-text elements are further classified to obtain separator regions, table regions, image regions, etc. In the final stage, the region refinement and noise detection process, all regions in both the text document and the non-text document are refined to eliminate noise and obtain the geometric layout of each region. The proposed method has been tested with the dataset of the ICDAR2009 page segmentation competition and many other databases in different languages. The results of these tests show that the proposed method achieves higher accuracy than other methods.

12.
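A text image skew correction method based on an improved Hough transform   Total citations: 2, self-citations: 0, citations by others: 2
Skew introduced when a text image is scanned strongly affects subsequent page segmentation and optical character recognition (OCR). The standard Hough transform is insensitive to noise and does not depend on line continuity, but its heavy computation and low speed limit its practical use. A text image skew correction method based on an improved Hough transform is proposed. By applying different text direction extraction algorithms to images at varying resolutions and choosing a reasonable voting threshold, among other improvements to the Hough transform, the method reduces the adverse effects of image regions and character stroke thickness on skew angle determination, and it uses an offset-based method to correct page skew quickly. Experimental results show that the algorithm achieves fast, high-precision detection of text image skew over a wide range of angles and is highly practical.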
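
For orientation, the basic Hough voting that this abstract sets out to improve can be sketched as below. This is the plain transform restricted to a small angle range, not the improved variant: the variable-resolution processing and the voting threshold from the abstract are not reproduced. The function name, parameters, and the choice of scoring each angle by its fullest accumulator bin are assumptions.

```python
import math
from collections import Counter


def hough_skew_degrees(points, max_angle=15.0, angle_step=0.5, rho_step=1.0):
    """Return the candidate skew angle whose best line collects most votes.

    `points` is a list of (x, y) feature points, for example bottom-center
    points of character boxes. Candidate skews run from -max_angle to
    +max_angle degrees in steps of angle_step.
    """
    n_steps = int(round(2 * max_angle / angle_step))
    best_angle, best_votes = 0.0, -1
    for i in range(n_steps + 1):
        skew = -max_angle + i * angle_step
        # A line with direction `skew` has its normal at skew + 90 degrees.
        theta = math.radians(skew + 90.0)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        # Accumulate votes over rho = x*cos(theta) + y*sin(theta).
        accumulator = Counter(
            round((x * cos_t + y * sin_t) / rho_step) for x, y in points
        )
        votes = max(accumulator.values())
        if votes > best_votes:
            best_angle, best_votes = skew, votes
    return best_angle
```

The cost is proportional to the number of points times the number of candidate angles, which is the computational burden the abstract refers to.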

13.
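Skew correction based on page foreground and least squares   Total citations: 4, self-citations: 0, citations by others: 4
陈波, 王加俊, 吴陈. 《计算机工程》 (Computer Engineering), 2007, 33(15): 202-204
Because page layouts are complex, a skew correction method based on the page foreground and least squares is proposed. The method describes page foreground pixels with specific patterns, uses coarse pattern classification to separate out any images, graphics, and tables in the page, and merges the remaining patterns to obtain the largest text pattern structure. The baseline direction, which is the page skew direction, is then fitted by least squares from the baseline feature points contained in that structure. Experiments show that the method is effective and fast, and that the pattern structure it produces can be reused for layout analysis.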

14.
Transforming paper documents into XML format with WISDOM++   Total citations: 1, self-citations: 1, citations by others: 0
The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents on the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported. Received June 15, 2000 / Revised November 7, 2000

15.
As document sharing through the World Wide Web has been steadily increasing, the need to create hyperdocuments that are accessible and retrievable via the internet, in formats such as HTML and SGML/XML, has also been rising rapidly. Nevertheless, only a few studies have addressed the conversion of paper documents into hyperdocuments. Moreover, most of these studies have concentrated on the direct conversion of single-column document images that include only text and image objects. In this paper, we propose two methods for converting complex multi-column document images into HTML documents, and a method for generating a structured table of contents page based on the logical structure analysis of the document image. Experiments with various kinds of multi-column document images show that, by using the proposed methods, the corresponding HTML documents can be generated in the same visual layout as that of the document images, and their structured table of contents page can also be produced with the hierarchically ordered section titles hyperlinked to the contents.

16.
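Layout segmentation is an important part of layout analysis and, after extensive research, has reached a fairly mature stage. This paper improves a connected-component-based layout segmentation algorithm so that it segments fairly complex layout images quickly and effectively while reducing the segmentation errors caused by thresholds in the original algorithm. Each individual character in the text image is first expanded into a region, which makes the subsequent statistics of connected-component spacing more accurate and convenient. The image is then blurred and merged according to those spacing statistics, and the text image is segmented into connected regions. Experimental results show that the improved connected-component-based algorithm segments layouts accurately and quickly, applies widely, and is especially advantageous for fairly complex layouts.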

17.
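A layout segmentation method based on connected components   Total citations: 4, self-citations: 0, citations by others: 4
Layout segmentation is an important part of layout analysis and a widely studied research topic. This paper proposes a layout segmentation algorithm based on connected components. The document image is first deskewed, then blurred to obtain larger connected units, and the layout is then segmented and processed according to region connectivity. Experiments show that the connected-component-based algorithm segments layouts accurately and quickly and applies widely: it works not only on rectangular layouts but also gives satisfactory results on complex layouts such as non-Manhattan layouts.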
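
The blur-then-label pipeline in this abstract can be approximated by the sketch below. It is an illustration only: the abstract does not say which blurring operator is used, so horizontal run-length smearing is swapped in here, and the function names, the `max_gap` parameter, and the 4-connectivity choice are assumptions.

```python
from collections import deque


def smear_rows(image, max_gap):
    """Fill background runs of at most max_gap pixels between foreground pixels.

    `image` is a list of rows of 0/1 values; a new image is returned.
    """
    out = [row[:] for row in image]
    for row in out:
        last_fg = None
        for x, value in enumerate(row):
            if value:
                if last_fg is not None and 0 < x - last_fg - 1 <= max_gap:
                    for k in range(last_fg + 1, x):
                        row[k] = 1
                last_fg = x
    return out


def component_boxes(image):
    """Return (left, top, right, bottom) boxes of 4-connected components."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not image[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            left, top, right, bottom = x0, y0, x0, y0
            while queue:  # breadth-first flood fill of one component
                x, y = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and image[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes
```

Calling `component_boxes(smear_rows(image, max_gap))` merges characters into word- or line-sized units before labeling; a larger `max_gap` yields larger blocks.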

18.
This paper describes a novel method for extracting text from document pages of mixed content. The method works by detecting pieces of text lines in small overlapping columns of width , shifted with respect to each other by image elements (good default values are: of the image width, ) and by merging these pieces in a bottom-up fashion to form complete text lines and blocks of text lines. The algorithm requires about 1.3 s for a 300 dpi image on a PC with a 300 MHz Pentium II CPU and an Intel 440LX motherboard. The algorithm is largely independent of the layout of the document, the shape of the text regions, and the font size and style. The main assumptions are that the background is uniform and that the text sits approximately horizontally. For a skew of up to about 10 degrees, no skew correction mechanism is necessary. The algorithm has been tested on the UW English Document Database I of the University of Washington, and its performance has been evaluated by a suitable measure of segmentation accuracy. A detailed analysis of the segmentation accuracy achieved by the algorithm as a function of noise and skew has also been carried out. Received April 4, 1999 / Revised June 1, 1999

19.
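A new text image skew detection and correction algorithm   Total citations: 3, self-citations: 0, citations by others: 3
Documents may be skewed during scanning, and many character recognition and layout analysis algorithms are very sensitive to skew, so skew detection and correction of text images is an indispensable step in document analysis. A new correction method for skewed text images is proposed. The method first obtains the bounding box of the document image, takes minimizing the bounding box area as the final goal of skew correction, and uses a genetic algorithm to search for that minimum. Experimental results show that the algorithm detects the skew angle with high accuracy.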
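
The objective in this abstract, the area of the axis-aligned bounding box after rotation, can be sketched as follows. Note the substitution: the abstract searches for the minimum with a genetic algorithm, whereas this illustration uses a plain grid search over candidate angles to stay short. The function names, the angle range, and the step size are assumptions.

```python
import math


def bbox_area(points, angle_deg):
    """Area of the axis-aligned bounding box after rotating by angle_deg."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    xs = [x * cos_a - y * sin_a for x, y in points]
    ys = [x * sin_a + y * cos_a for x, y in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def estimate_skew_by_min_bbox(points, max_angle=15.0, step=0.1):
    """Grid-search the rotation that minimizes bounding box area.

    `points` are (x, y) foreground coordinates. Returns the estimated
    skew in degrees, i.e. the negative of the best correcting rotation,
    in the coordinate system of the points.
    """
    n_steps = int(round(2 * max_angle / step))
    candidates = [-max_angle + i * step for i in range(n_steps + 1)]
    best_rotation = min(candidates, key=lambda ang: bbox_area(points, ang))
    return -best_rotation
```

The area function is generally not unimodal over the full angle range, which is presumably why the abstract turns to a global search heuristic rather than a local descent.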

20.

Information extraction is a fundamental task of many business intelligence services that entail massive document processing. Understanding a document page structure in terms of its layout provides contextual support that is helpful in the semantic interpretation of the document terms. In this paper, inspired by the progress of deep learning methodologies applied to the task of object recognition, we transfer these models to the specific case of document object detection, reformulating the traditional problem of document layout analysis. Moreover, we contribute to the prior art by defining the task of instance segmentation on the document image domain. An instance segmentation paradigm is especially important in complex layouts whose contents should interact for the proper rendering of the page, e.g., the proper text wrapping around an image. Finally, we provide an extensive evaluation, both qualitative and quantitative, that demonstrates the superior performance of the proposed methodology over the current state of the art.

