Similar Documents (20 results)
1.
Biblio is an adaptive system that automatically extracts meta-data from semi-structured and structured scanned documents. Instead of using hand-coded templates or other methods manually customized for each given document format, it uses example-based machine learning to adapt to customer-defined document and meta-data types. We provide results from experiments on the recognition of document information in two document corpora: a set of scanned journal articles and a set of scanned legal documents. The first set is semi-structured, as the different journals use a variety of flexible layouts. The second set is largely free-form text based on poor-quality scans of FAX-quality legal documents. We demonstrate accuracy on the semi-structured document set roughly comparable to hand-coded systems, and much worse performance on the legal documents.

2.
The analysis of scanned documents is important in the construction of digital libraries and paperless offices. One significant challenge is coping with artifacts of photocopying and scanning. We present a series of simple techniques for handling these difficulties. Using 125 images of the University of Washington scanned documents database, we demonstrate the effectiveness of these methods in preparing the images for segmentation by a multiresolution algorithm.

3.
A Fast Method for Text Skew Detection
Skew detection is the first, and a very important, step in converting a document into digital form, because much of the subsequent processing assumes a deskewed page. This paper proposes a new skew detection and correction method with two main features: (1) it is independent of the text texture, so it handles complex cases such as mixed text and graphics and documents in which several writing directions coexist; (2) it is computationally cheap, requiring only one rotation and four partial projections of the image.
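The abstract does not spell out the one-rotation, four-partial-projection procedure, so the sketch below only illustrates the general projection-profile idea that such skew detectors build on; the function name, angle grid, and scoring are assumptions, not the paper's method.

```python
# Illustrative projection-profile skew search (not the paper's algorithm).
# Assumes a binarized page: a 2-D numpy array with non-zero pixels for ink.
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary_img, angles=np.arange(-10, 10.25, 0.25)):
    """Return the rotation angle (degrees) giving the sharpest horizontal
    projection profile, i.e. the most line-like row structure."""
    best_angle, best_score = 0.0, -np.inf
    for a in angles:
        rotated = rotate(binary_img.astype(float), a, reshape=False, order=0)
        profile = rotated.sum(axis=1)    # ink count per row
        score = np.var(profile)          # peaked profile => text lines aligned
        if score > best_score:
            best_angle, best_score = a, score
    return best_angle
```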

4.
Character groundtruth for real, scanned document images is crucial for evaluating the performance of OCR systems, training OCR algorithms, and validating document degradation models. Unfortunately, manual collection of accurate groundtruth for characters in a real (scanned) document image is not practical because (i) accuracy in delineating groundtruth character bounding boxes is not high enough, (ii) it is extremely laborious and time consuming, and (iii) the manual labor required for this task is prohibitively expensive. We describe a closed-loop methodology for collecting very accurate groundtruth for scanned documents. We first create ideal documents using a typesetting language. Next we create the groundtruth for the ideal document. The ideal document is then printed, photocopied, and scanned. A registration algorithm estimates the global geometric transformation and then performs a robust local bitmap match to register the ideal document image to the scanned document image. Finally, the groundtruth associated with the ideal document image is transformed using the estimated geometric transformation to create the groundtruth for the scanned document image. This methodology is very general and can be used for creating groundtruth for documents typeset in any language, layout, font, and style. We have demonstrated the method by generating groundtruth for English, Hindi, and FAX document images. The cost of creating groundtruth using our methodology is minimal. If character, word, or zone groundtruth is available for any real document, the registration algorithm can be used to generate the corresponding groundtruth for a rescanned version of the document.
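As a minimal sketch of the final step described above, the snippet below maps ideal-document bounding boxes into the scanned image with an already-estimated global affine transformation; the registration itself and the local bitmap matching are not shown, and all names are illustrative.

```python
# Minimal sketch of the groundtruth-transfer step (names are illustrative).
# Assumes a 2x3 affine matrix A mapping ideal-image coordinates to the
# scanned image; the registration that estimates A is not shown.
import numpy as np

def transform_groundtruth(boxes, A):
    """boxes: iterable of (x0, y0, x1, y1) character boxes in the ideal image.
    Returns axis-aligned boxes in scanned-image coordinates."""
    mapped = []
    for x0, y0, x1, y1 in boxes:
        corners = np.array([[x0, y0, 1.0], [x1, y0, 1.0],
                            [x1, y1, 1.0], [x0, y1, 1.0]]).T   # 3 x 4
        warped = A @ corners                                   # 2 x 4
        xs, ys = warped[0], warped[1]
        mapped.append((xs.min(), ys.min(), xs.max(), ys.max()))
    return mapped
```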

5.
International Journal on Document Analysis and Recognition (IJDAR) - Optical character recognition (OCR) is the process of recognizing characters automatically from scanned documents for editing,...

6.
Gao Xing, Zhang Guangyu, Lin Juncong, Liao Minghong. Multimedia Tools and Applications, 2018, 77(16): 21163-21184
Multimedia Tools and Applications - Heterogeneous paper documents (such as newspapers and magazines) are very common in our daily life. They are usually scanned and stored as images. Reading such images...

7.
The retrieval of information from scanned handwritten documents is becoming vital with the rapid increase of digitized documents, and word spotting systems have been developed to search for words within documents. These systems can be either template-matching algorithms or learning based. This paper presents a coherent learning-based Arabic handwritten word spotting system that adapts to the nature of Arabic handwriting, in which words may have no clear boundaries. Consequently, the system recognizes Pieces of Arabic Words (PAWs), then re-constructs and spots words using language models. The proposed system produced promising results for Arabic handwritten word spotting when tested on the CENPARMI Arabic documents database.

8.
The fast evolution of scanning and computing technologies in recent years has led to the creation of large collections of scanned historical documents. It is almost always the case that these scanned documents suffer from some form of degradation. Large degradations make documents hard to read and substantially deteriorate the performance of automated document processing systems. Enhancement of degraded document images is normally performed assuming global degradation models. When the degradation is large, global degradation models do not perform well. In contrast, we propose to learn local degradation models and use them in enhancing degraded document images. Using a semi-automated enhancement system, we have labeled a subset of the Frieder diaries collection (the diaries of Rabbi Dr. Avraham Abba Frieder). This labeled subset was then used to train classifiers based on lookup tables in conjunction with the approximate nearest neighbor algorithm. The resulting algorithm is highly efficient and effective. Experimental evaluation results are provided using the same collection.

9.
We present here an enhanced algorithm (e-PCP) for skew detection in scanned documents, based on the work on Piecewise Covering by Parallelograms (PCP) for robust determination of skew angles [C.-H. Chou, S.-Y. Chu, F. Chang, Estimation of skew angles for scanned documents based on piecewise covering by parallelograms, Pattern Recognition 40 (2007) 443-455]. Our algorithm achieves even better robustness in skew-angle detection than the original PCP algorithm, and we show accurate determination of skew angles in document images where the original PCP algorithm fails. Furthermore, the increased robustness is achieved with fewer computations than the originally proposed PCP algorithm. The e-PCP algorithm also outputs a confidence measure, which is important in automated systems for filtering cases where the estimated skew angle may not be very accurate and should therefore be handled by manual intervention. The proposed algorithm was tested extensively on all categories of real documents, and comparisons with the PCP method are also provided. Useful details regarding faster execution of the proposed algorithm are provided in the Appendix.

10.
A significant portion of currently available documents exists in the form of images, for instance, as scanned documents. Electronic documents produced by scanning and OCR software contain recognition errors. This paper uses an automatic approach to examine the selection and the effectiveness of searching techniques for possible erroneous terms for query expansion. The proposed method consists of two basic steps. In the first step, confused characters in erroneous words are located, and editing operations are applied to create a collection of error-grams, the basic unit of the model. The second step uses query terms and error-grams to generate additional query terms, identify appropriate matching terms, and determine the degree of relevance of retrieved document images to the user's query, based on a vector space IR model. The proposed approach has been trained on 979 document images to construct about 2,822 error-grams and tested on 100 scanned Web pages, 200 advertisements and manuals, and 700 degraded images. The performance of our method is evaluated experimentally by measuring retrieval effectiveness in terms of recall and precision. The results obtained show its effectiveness and indicate an improvement over standard methods such as vectorial systems without query expansion and 3-gram overlapping.
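A hedged sketch of the expansion idea follows: each query term is expanded with variants produced by known OCR character confusions. The confusion table, function name, and variant limit are invented for illustration; the paper's trained error-gram collection and vector-space ranking are not reproduced.

```python
# Toy OCR-confusion-based query expansion (illustrative only).
CONFUSIONS = {"l": ["1", "I"], "O": ["0"], "rn": ["m"], "e": ["c"]}  # made-up table

def expand_term(term, max_variants=20):
    """Return the term plus variants obtained by applying one confusion each."""
    variants = {term}
    for src, alts in CONFUSIONS.items():
        if src in term:
            for alt in alts:
                variants.add(term.replace(src, alt, 1))
        if len(variants) >= max_variants:
            break
    return sorted(variants)

# Example: expand_term("Ocean") also yields "Occan" and "0cean", matching
# document images where OCR misread the "e" or the "O".
```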

11.
Quickly and effectively finding similar documents in a massive document collection is an important and time-consuming problem. Existing document similarity search algorithms first build a candidate document set and then rank the candidates by relevance to find the most relevant documents. This paper proposes Hub-N, a similarity search algorithm based on document topology that casts document similarity search as a graph search problem and applies pruning techniques to narrow the set of documents that must be scanned, improving search efficiency. Experiments verify the effectiveness and feasibility of the algorithm.

12.
This paper studies the use of image recognition technology to support intelligent document applications in nuclear power enterprises. It describes the business background, the main workflow, the implementation principles, and solutions for typical application scenarios. Two scenarios are examined in detail: automated sharpness checking of scanned files based on image recognition, and automated splitting and comparison of documents based on optical character recognition. For each scenario, the paper describes the problems encountered, the principle of the solution, the design of the program functions, and the final application results. The evaluation of these results shows that image recognition technology can play an important role in intelligent document applications, and this research and practice serve as a useful exploration of intelligent document use based on image recognition.

13.
This paper presents a method for determining the up/down orientation of text in a scanned document of unknown orientation, so that it can be appropriately rotated and processed by an optical character recognition (OCR) engine. The method analyzes the “open” portions of text blobs to determine the direction in which the open portions face. By determining the respective densities of blobs opening in a pair of opposite directions (e.g., right or left), the method can establish the direction in which the text as a whole is oriented. We first describe a method for determining the up/down orientation of roman text based on the asymmetry in the openness of most roman letters in the horizontal direction. For non-roman text such as Pashto and Hebrew, we provide a method that determines a direction that is the most asymmetric, and therefore the most useful for the determination of text orientation, given a training data set of documents of known orientation. This work can be adapted for use in automated mail processing or to determine the orientation of checks in automated teller machine envelopes, scanned or copied documents, documents sent via facsimile, and digital photographs that include text (e.g., road signs, business cards, driver's licenses), among other applications.
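The sketch below is a crude proxy for the openness measure described above, not the paper's algorithm: for one binarized blob it accumulates, per row, how far one can advance from the left and from the right edge of the bounding box before hitting ink. Aggregating these two counts over all blobs on a page gives the left/right asymmetry used to decide between upright and upside-down roman text; the aggregation and decision thresholds are left out.

```python
# Crude openness proxy for a single text blob (not the paper's exact measure).
import numpy as np

def openness(bitmap):
    """bitmap: 2-D bool array for one blob (True = ink).
    Returns (open_left, open_right): background run lengths from each side
    of the bounding box before the first ink pixel, summed over rows."""
    open_left = open_right = 0
    for row in bitmap:
        ink = np.flatnonzero(row)
        if ink.size == 0:
            continue
        open_left += ink[0]
        open_right += row.size - 1 - ink[-1]
    return open_left, open_right
```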

14.
We consider the problem of locating a watermark in pages of archaic documents that have been both scanned and back-lit: the problem is of interest to codicologists in identifying and tracking paper materials. Commonly, documents of interest are worn or damaged, and all information is victim to very unfavourable signal-to-noise ratios—this is especially true of ‘hidden’ data such as watermarks and chain lines. We present an approach to recto removal, followed by highlighting of such ‘hidden’ data. The result is still of very low signal quality, and we also present a statistical approach to locate watermarks from a known lexicon of fragments. Results are presented from a comprehensively scanned nineteenth century copy of the Qur’ān. The approach has lent itself to immediate exploitation in improving known watermarks, and distinguishing between twin copies. Mr Hiary was supported by the University of Jordan in pursuing this work.

15.
Hidden Markov Models for Text Categorization in Multi-Page Documents
In the traditional setting, text categorization is formulated as a concept learning problem where each instance is a single isolated document. However, this perspective is not appropriate for many digital libraries whose contents are scanned and optically read books or magazines. In this paper, we propose a more general formulation of text categorization, allowing documents to be organized as sequences of pages. We introduce a novel hybrid system specifically designed for multi-page text documents. The architecture relies on hidden Markov models whose emissions are bags of words resulting from a multinomial word event model, as in the generative portion of the Naive Bayes classifier. The rationale behind our proposal is that taking into account the contextual information provided by the whole page sequence can help disambiguation and improve single-page classification accuracy. Our results on two datasets of scanned journals from the Making of America collection confirm the importance of using whole page sequences. The empirical evaluation indicates that the error rate (as obtained by running the Naive Bayes classifier on isolated pages) can be significantly reduced if contextual information is incorporated.
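As a minimal sketch of the decoding step this architecture implies, the snippet below runs Viterbi over per-page Naive-Bayes-style emission log-likelihoods and a category transition matrix, labeling all pages of a document jointly rather than in isolation; HMM training and the multinomial word-event model are not shown, and the array names and shapes are assumptions.

```python
# Viterbi decoding over page categories (sketch; training not shown).
import numpy as np

def viterbi(log_emissions, log_trans, log_start):
    """log_emissions: (n_pages, n_cats) per-page log P(words | category).
    log_trans: (n_cats, n_cats) log P(cat_t | cat_{t-1}).
    log_start: (n_cats,) log prior over the first page's category.
    Returns the most likely category index for each page."""
    n_pages, n_cats = log_emissions.shape
    score = log_start + log_emissions[0]
    back = np.zeros((n_pages, n_cats), dtype=int)
    for t in range(1, n_pages):
        cand = score[:, None] + log_trans        # (prev, cur) scores
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emissions[t]
    path = [int(score.argmax())]
    for t in range(n_pages - 1, 0, -1):          # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```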

16.
Long Yaqin, Gu Leye, Liu An. 计算机应用 (Journal of Computer Applications), 2007, 27(4): 1020-1022
Scanned medical-record images are preprocessed using their characteristic features to speed up recognition and archiving. The image is first binarized with Otsu's method, skew is then detected with the Radon transform, and finally a morphological opening reduces interference while a projection method locates the title region to be recognized.
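A rough reconstruction of this pipeline with scikit-image is sketched below; the angle grid, structuring element, and title-band threshold are assumptions, and the sign convention of the deskew rotation depends on the Radon orientation.

```python
# Sketch of the described pipeline with scikit-image (parameters assumed).
import numpy as np
from skimage.filters import threshold_otsu
from skimage.transform import radon, rotate
from skimage.morphology import opening

def preprocess(gray):
    """gray: 2-D float image in [0, 1]. Returns the cleaned binary image and
    the (top, bottom) rows of the candidate title band."""
    binary = gray < threshold_otsu(gray)                     # Otsu binarization
    angles = np.arange(85.0, 95.0, 0.25)                     # Radon skew estimate
    sinogram = radon(binary.astype(float), theta=angles, circle=False)
    skew = angles[np.argmax(np.var(sinogram, axis=0))] - 90.0
    deskewed = rotate(binary.astype(float), skew, resize=False) > 0.5
    cleaned = opening(deskewed, np.ones((2, 2), dtype=bool))  # remove specks
    rows = cleaned.sum(axis=1)                                # horizontal projection
    band = np.flatnonzero(rows > 0.5 * rows.max())
    title = (int(band.min()), int(band.max())) if band.size else None
    return cleaned, title
```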

17.

Paper documents are ideal sources of useful information and have a profound impact on every aspect of human lives. These documents may be printed or handwritten and contain information as combinations of text, figures, tables, charts, etc. This paper proposes a method to segment text lines from both flatbed-scanned and camera-captured heavily warped printed and handwritten documents. This work uses the concept of semantic segmentation with the help of a multi-scale convolutional neural network. The results of line segmentation using the proposed method outperform a number of similar proposals already reported in the literature. The performance and efficacy of the proposed method have been corroborated by test results on a variety of publicly available datasets, including ICDAR, Alireza, IUPR, cBAD, Tobacco-800, IAM, and our dataset.

18.
With the great advantages of digitization, more and more documents are being transformed into digital representations. Most content digitization of documents is performed by scanners or digital cameras. However, the transformation might degrade image quality owing to lighting variations, i.e. uneven illumination distribution. In this paper we describe a new approach that compensates for uneven illumination in text images while maintaining a high degree of text recognizability. The proposed scheme first enhances the contrast of the scanned document and then generates an edge map from the contrast-enhanced image to locate the text area. Using the text-location information, a light distribution image (background) is created to assist in producing the final light-balanced image. Simulation results demonstrate that our approach is superior to the previous works of Hsia et al. (2005, 2006).
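The snippet below is not the authors' contrast-enhancement/edge-map pipeline; it is a common background-division scheme for the same uneven-illumination problem, included only as a minimal sketch. The filter size is an assumption tied to text stroke width.

```python
# Background-division illumination compensation (common alternative scheme).
import numpy as np
from scipy.ndimage import median_filter

def balance_illumination(gray, bg_size=51):
    """gray: 2-D float array in [0, 1]; returns a light-balanced image."""
    background = median_filter(gray, size=bg_size)   # text vanishes, light stays
    balanced = gray / np.clip(background, 1e-3, None)
    return np.clip(balanced, 0.0, 1.0)
```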

19.
Correction of Scanned Images of Thick Bound Documents
When scanning a thick bound document, the pages cannot lie flat against the scanning plane, which causes two problems: (1) a black shadow region appears along the side of the scanned image near the binding; (2) the text inside the shadow region is distorted. Based on image information and geometric deformation information, an algorithm is proposed to remove the shadow and rectify the text. First, a block-wise automatic thresholding algorithm removes the shadow; then text-line centerlines are extracted using vertical projection profiles, effective bounding boxes, and marker points, and the centerlines are used to estimate global geometric parameters; finally, the distorted text is rectified using the estimated parameters and a piecewise quadrilateral mapping. Experimental results show that the algorithm gives good correction results.
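Only the first step (shadow removal by block-wise automatic thresholding) lends itself to a short sketch; the text-line extraction and piecewise quadrilateral dewarping are omitted. The tile size and the flat-block guard are assumptions.

```python
# Block-wise Otsu binarization: each tile gets its own threshold, so the dark
# shadow near the binding does not drag down a single global threshold.
import numpy as np
from skimage.filters import threshold_otsu

def blockwise_binarize(gray, tile=64):
    """gray: 2-D float image in [0, 1]. Returns a binary image (True = ink)."""
    out = np.zeros_like(gray, dtype=bool)
    for y in range(0, gray.shape[0], tile):
        for x in range(0, gray.shape[1], tile):
            block = gray[y:y + tile, x:x + tile]
            if block.std() < 1e-3:        # nearly flat block: treat as background
                continue
            out[y:y + tile, x:x + tile] = block < threshold_otsu(block)
    return out
```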

20.
This paper presents a syntactic method for sophisticated logical structure analysis that transforms document images with multiple pages and hierarchical structure into an electronic document based on SGML/XML. To produce a logical structure more accurately and quickly than previous works whose basic units are text lines, the proposed parsing method takes text regions with hierarchical structure as input. Furthermore, we define a document model that can efficiently describe the geometric characteristics and logical structure information of documents, and we present a method for creating it automatically. Experimental results with 372 images scanned from the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) show that the method performs logical structure analysis successfully and generates a document model automatically. In particular, the method produces SGML/XML documents as the result of structural analysis, which enhances document reusability and platform independence.
