期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Extraction of type style-based meta-information from imaged documents

B.B. Chaudhuri U. Garain 《International Journal on Document Analysis and Recognition》2001,3(3):138-149

Extraction of some meta-information from printed documents without carrying out optical character recognition (OCR) is considered. It can be statistically verified that important terms in technical articles are mainly printed in italic, bold, and all-capital style. A quick approach to detecting them is proposed here. This approach is based on the global shape heuristics of these styles of any font. Important words in a document are sometimes printed in larger size as well. A smart approach for the determination of font size is also presented. Detection of type styles helps in improving OCR performance, especially for reading italicized text. Another advantage to identifying word type styles and font size has been discussed in the context of extracting: (i) different logical labels; and (ii) important terms from the document. Experimental results on the performance of the approach on a large number of good quality, as well as degraded, document images are presented. Received July 12, 2000 / Revised October 1, 2000 相似文献

2.

Word matching using single closed contours for indexing handwritten historical documents

Tomasz Adamek Noel E. O’Connor Alan F. Smeaton 《International Journal on Document Analysis and Recognition》2007,9(2-4):153-165

相似文献

3.

Large scale document image retrieval by automatic word annotation

K. Pramod Sankar R. Manmatha C. V. Jawahar 《International Journal on Document Analysis and Recognition》2014,17(1):1-17

相似文献

4.

Word searching in unconstrained layout using character pair coding

Partha Pratim Roy Umapada Pal Josep Lladós 《International Journal on Document Analysis and Recognition》2014,17(4):343-358

Word searching in non-structural layout such as graphical documents is a difficult task due to arbitrary orientations of text words and the presence of graphical symbols. This paper presents an efficient approach for word searching in documents of non-structural layout using an efficient indexing and retrieval approach. The proposed indexing scheme stores spatial information of text characters of a document using a character spatial feature table (CSFT). The spatial feature of text component is derived from the neighbor component information. The character labeling of a multi-scaled and multi-oriented component is performed using support vector machines. For searching purpose, the positional information of characters is obtained from the query string by splitting it into possible combinations of character pairs. Each of these character pairs searches the position of corresponding text in document with the help of CSFT. Next, the searched text components are joined and formed into sequence by spatial information matching. String matching algorithm is performed to match the query word with the character pair sequence in documents. The experimental results are presented on two different datasets of graphical documents: maps dataset and seal/logo image dataset. The results show that the method is efficient to search query word from unconstrained document layouts of arbitrary orientation. 相似文献

5.

Word spotting in historical printed documents using shape and sequence comparisons

Khurram Khurshid Claudie Faure Nicole Vincent 《Pattern recognition》2012,45(7):2598-2609

Information spotting in scanned historical document images is a very challenging task. The joint use of the mechanical press and of human controlled inking introduced great variability in ink level within a book or even within a page. Consequently characters are often broken or merged together and thus become difficult to segment and recognize. The limitations of commercial OCR engines for information retrieval in historical document images have inspired alternative means of identification of given words in such documents. We present a word spotting method for scanned documents in order to find the word images that are similar to a query word, without assuming a correct segmentation of the words into characters. The connected components are first processed to transform a word pattern into a sequence of sub-patterns. Each sub-pattern is represented by a sequence of feature vectors. A modified Edit distance is proposed to perform a segmentation-driven string matching and to compute the Segmentation Driven Edit (SDE) distance between the words to be compared. The set of SDE operations is defined to obtain the word segmentations that are the most appropriate to evaluate their similarity. These operations are efficient to cope with broken and touching characters in words. The distortion of character shapes is handled by coupling the string matching process with local shape comparisons that are achieved by Dynamic Time Warping (DTW). The costs of the SDE operations are provided by the DTW distances. A sub-optimal version of the SDE string matching is also proposed to reduce the computation time, nevertheless it did not lead to a great decrease in performance. It is possible to enter a query by example or a textual query entered with the keyboard. Textual queries can be used to directly spot the word without the need to synthesize its image, as far as character prototype images are available. Results are presented for different documents and compared with other methods, showing the efficiency of our method. 相似文献

6.

A Document Image Retrieval System

Konstantinos Zagoris Kavallieratou Ergina Nikos Papamarkos 《Engineering Applications of Artificial Intelligence》2010,23(6):872-879

相似文献

7.

Hybrid contextural text recognition with string matching 总被引：1，自引：0，他引：1

Sinha R.M.K. Prasada B. Houle G.F. Sabourin M. 《IEEE transactions on pattern analysis and machine intelligence》1993,15(9):915-925

The hybrid contextural algorithm for reading real-life documents printed in varying fonts of any size is presented. Text is recognized progressively in three passes. The first pass is used to generate character hypothesis, the second to generate word hypothesis, and the third to verify the word hypothesis. During the first pass, isolated characters are recognized using a dynamic contour warping classifier. Transient statistical information is collected to accelerate the recognition process and to verify hypotheses in later processing. A transient dictionary consisting of high confidence nondictionary words is constructed in this pass. During the second pass, word-level hypotheses are generated using hybrid contextual text processing. Nondictionary words are recognized using a modified Viterbi algorithm, a string matching algorithm utilizing n grams, special handlers for touching characters, and pragmatic handlers for numerals, punctuation, hyphens, apostrophes, and a prefix/suffix handler. This processing usually generates several word hypothesis. During the third pass, word-level verification occurs 相似文献

8.

Mining the Web's link structure

Chakrabarti S. Dom B.E. Kumar S.R. Raghavan P. Rajagopalan S. Tomkins A. Gibson D. Kleinberg J. 《Computer》1999,32(8):60-67

The Web is a hypertext body of approximately 300 million pages that continues to grow at roughly a million pages per day. Page variation is more prodigious than the data's raw scale: taken as a whole, the set of Web pages lacks a unifying structure and shows far more authoring style and content variation than that seen in traditional text document collections. This level of complexity makes an “off-the-shelf” database management and information retrieval solution impossible. To date, index based search engines for the Web have been the primary tool by which users search for information. Such engines can build giant indices that let you quickly retrieve the set of all Web pages containing a given word or string. Experienced users can make effective use of such engines for tasks that can be solved by searching for tightly constrained key words and phrases. These search engines are, however, unsuited for a wide range of equally important tasks. In particular, a topic of any breadth will typically contain several thousand or million relevant Web pages. How then, from this sea of pages, should a search engine select the correct ones-those of most value to the user? Clever is a search engine that analyzes hyperlinks to uncover two types of pages: authorities, which provide the best source of information on a given topic; and hubs, which provide collections of links to authorities. We outline the thinking that went into Clever's design, report briefly on a study that compared Clever's performance to that of Yahoo and AltaVista, and examine how our system is being extended and updated 相似文献

9.

Integrating knowledge sources in Devanagari text recognition system

Bansal V. Sinha R.M.K. 《IEEE transactions on systems, man, and cybernetics. Part A, Systems and humans : a publication of the IEEE Systems, Man, and Cybernetics Society》2000,30(4):500-505

The reading process has been widely studied and there is a general agreement among researchers that knowledge in different forms and at different levels plays a vital role. This is the underlying philosophy of the Devanagari document recognition system described in this work. The knowledge sources we use are mostly statistical in nature or in the form of a word dictionary tailored specifically for optical character recognition (OCR). We do not perform any reasoning on these. However, we explore their relative importance and role in the hierarchy. Some of the knowledge sources are acquired a priori by an automated training process while others are extracted from the text as it is processed. A complete Devanagari OCR system has been designed and tested with real-life printed documents of varying size and font. Most of the documents used were photocopies of the original. A performance of approximately 90% correct recognition is achieved 相似文献

10.

Text retrieval from early printed books

Simone Marinai 《International Journal on Document Analysis and Recognition》2011,14(2):117-129

相似文献

11.

An automatic closed-loop methodology for generating charactergroundtruth for scanned documents

Kanungo T. Haralick R.M. 《IEEE transactions on pattern analysis and machine intelligence》1999,21(2):179-183

Character groundtruth for real, scanned document images is crucial for evaluating the performance of OCR systems, training OCR algorithms, and validating document degradation models. Unfortunately, manual collection of accurate groundtruth for characters in a real (scanned) document image is not practical because (i) accuracy in delineating groundtruth character bounding boxes is not high enough, (ii) it is extremely laborious and time consuming, and (iii) the manual labor required for this task is prohibitively expensive. Ee describe a closed-loop methodology for collecting very accurate groundtruth for scanned documents. We first create ideal documents using a typesetting language. Next we create the groundtruth for the ideal document. The ideal document is then printed, photocopied and then scanned. A registration algorithm estimates the global geometric transformation and then performs a robust local bitmap match to register the ideal document image to the scanned document image. Finally, groundtruth associated with the ideal document image is transformed using the estimated geometric transformation to create the groundtruth for the scanned document image. This methodology is very general and can be used for creating groundtruth for documents in typeset in any language, layout, font, and style. We have demonstrated the method by generating groundtruth for English, Hindi, and FAX document images. The cost of creating groundtruth using our methodology is minimal. If character, word or zone groundtruth is available for any real document, the registration algorithm can be used to generate the corresponding groundtruth for a rescanned version of the document 相似文献

12.

Texture for script identification 总被引：2，自引：0，他引：2

Busch A Boles WW Sridharan S 《IEEE transactions on pattern analysis and machine intelligence》2005,27(11):1720-1732

相似文献

13.

An experimental evaluation of OCR text representations for learning document classifiers

Markus Junker Rainer Hoch 《International Journal on Document Analysis and Recognition》1998,1(2):116-122

In the literature, many feature types are proposed for document classification. However, an extensive and systematic evaluation of the various approaches has not yet been done. In particular, evaluations on OCR documents are very rare. In this paper we investigate seven text representations based on n-grams and single words. We compare their effectiveness in classifying OCR texts and the corresponding correct ASCII texts in two domains: business letters and abstracts of technical reports. Our results indicate that the use of n-grams is an attractive technique which can even compare to techniques relying on a morphological analysis. This holds for OCR texts as well as for correct ASCII texts. Received February 17, 1998 / Revised April 8, 1998 相似文献

14.

Identification of different script lines from multi-script documents 总被引：1，自引：0，他引：1

U. Pal B. B. Chaudhuri 《Image and vision computing》2002,20(13-14)

相似文献

15.

Collaborative filtering with maximum entropy

Pavlov D. Manavoglu E. Giles C.L. Pennock D.M. 《Intelligent Systems, IEEE》2004,19(6):40-47

As users navigate through online document collections on high-volume Web servers, they depend on good recommendations. We present a novel maximum-entropy algorithm for generating accurate recommendations and a data-clustering approach for speeding up model training. Recommender systems attempt to automate the process of "word of mouth" recommendations within a community. Typical application environments such as online shops and search engines have many dynamic aspects. 相似文献

16.

Learning on the fly: a font-free approach toward multilingual OCR

Andrew Kae David A. Smith Erik Learned-Miller 《International Journal on Document Analysis and Recognition》2011,14(3):289-301

Despite ubiquitous claims that optical character recognition (OCR) is a “solved problem,” many categories of documents continue to break modern OCR software such as documents with moderate degradation or unusual fonts. Many approaches rely on pre-computed or stored character models, but these are vulnerable to cases when the font of a particular document was not part of the training set or when there is so much noise in a document that the font model becomes weak. To address these difficult cases, we present a form of iterative contextual modeling that learns character models directly from the document it is trying to recognize. We use these learned models both to segment the characters and to recognize them in an incremental, iterative process. We present results comparable with those of a commercial OCR system on a subset of characters from a difficult test document in both English and Greek. 相似文献

17.

Finding words in alphabet soup: Inference on freeform character recognition for historical scripts

Nicholas R. Howe Shaolei Feng R. ManmathaAuthor vitae 《Pattern recognition》2009,42(12):3338-3347

This paper develops word recognition methods for historical handwritten cursive and printed documents. It employs a powerful segmentation-free letter detection method based upon joint boosting with histograms of gradients as features. Efficient inference on an ensemble of hidden Markov models can select the most probable sequence of candidate character detections to recognize complete words in ambiguous handwritten text, drawing on character n-gram and physical separation models. Experiments with two corpora of handwritten historic documents show that this approach recognizes known words more accurately than previous efforts, and can also recognize out-of-vocabulary words. 相似文献

18.

基于Rough集潜在语义索引的Web文档分类 总被引：5，自引：0，他引：5

何明冯博琴傅向华《计算机工程》2004,30(13):3-5

Rough集(粗糙集)埋论是一种处理不确定或模糊知识的数学工具。提出了一种基于Rough集理论的潜在语义索引的Web文档分类方法。首先应用向量空间模型表示Web文档信息，然后通过矩阵的奇异值分解来进行信息过滤和潜在语义索引；运用属性约简算法生成分类规则，最后利用多知识库进行文档分类。通过试验比较，该方法具有较好的分类效果。相似文献

19.

聚焦爬虫常见算法分析

Chen Lijun 《数字社区&智能家居》2008,(Z1)

聚焦爬虫搜集与特定主题相关的页面,为搜索引擎构建页面集。传统的聚焦爬虫采用向量空间模型和局部搜索算法,精确率和召回率都比较低。文章分析了聚焦爬虫存在的问题及其相应的解决方法。最后对未来的研究方向进行了展望。相似文献

20.

Retrieving poorly degraded OCR documents

Y. Fataicha M. Cheriet J. Y. Nie C. Y. Suen 《International Journal on Document Analysis and Recognition》2006,8(1):15-26

A significant portion of currently available documents exist in the form of images, for instance, as scanned documents. Electronic documents produced by scanning and OCR software contain recognition errors. This paper uses an automatic approach to examine the selection and the effectiveness of searching techniques for possible erroneous terms for query expansion. The proposed method consists of two basic steps. In the first step, confused characters in erroneous words are located and editing operations are applied to create a collection of erroneous error-grams in the basic unit of the model. The second step uses query terms and error-grams to generate additional query terms, identify appropriate matching terms, and determine the degree of relevance of retrieved document images to the user's query, based on a vector space IR model. The proposed approach has been trained on 979 document images to construct about 2,822 error-grams and tested on 100 scanned Web pages, 200 advertisements and manuals, and 700 degraded images. The performance of our method is evaluated experimentally by determining retrieval effectiveness with respect to recall and precision. The results obtained show its effectiveness and indicate an improvement over standard methods such as vectorial systems without expanded query and 3-gram overlapping. 相似文献