首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
2.
The convenience of search, both on the personal computer hard disk as well as on the web, is still limited mainly to machine printed text documents and images because of the poor accuracy of handwriting recognizers. The focus of research in this paper is the segmentation of handwritten text and machine printed text from annotated documents sometimes referred to as the task of “ink separation” to advance the state-of-art in realizing search of hand-annotated documents. We propose a method which contains two main steps—patch level separation and pixel level separation. In the patch level separation step, the entire document is modeled as a Markov Random Field (MRF). Three different classes (machine printed text, handwritten text and overlapped text) are initially identified using G-means based classification followed by a MRF based relabeling procedure. A MRF based classification approach is then used to separate overlapped text into machine printed text and handwritten text using pixel level features forming the second step of the method. Experimental results on a set of machine-printed documents which have been annotated by multiple writers in an office/collaborative environment show that our method is robust and provides good text separation performance.  相似文献   

3.
In this paper, we address the problem of the identification of text in noisy document images. We are especially focused on segmenting and identifying between handwriting and machine printed text because: 1) Handwriting in a document often indicates corrections, additions, or other supplemental information that should be treated differently from the main content and 2) the segmentation and recognition techniques requested for machine printed and handwritten text are significantly different. A novel aspect of our approach is that we treat noise as a separate class and model noise based on selected features. Trained Fisher classifiers are used to identify machine printed text and handwriting from noise and we further exploit context to refine the classification. A Markov Random Field-based (MRF) approach is used to model the geometrical structure of the printed text, handwriting, and noise to rectify misclassifications. Experimental results show that our approach is robust and can significantly improve page segmentation in noisy document collections.  相似文献   

4.
Today, there is an increasing demand of efficient archival and retrieval methods for online handwritten data. For such tasks, text categorization is of particular interest. The textual data available in online documents can be extracted through online handwriting recognition; however, this process produces errors in the resulting text. This work reports experiments on the categorization of online handwritten documents based on their textual contents. We analyze the effect of word recognition errors on the categorization performances, by comparing the performances of a categorization system with the texts obtained through online handwriting recognition and the same texts available as ground truth. Two well-known categorization algorithms (kNN and SVM) are compared in this work. A subset of the Reuters-21578 corpus consisting of more than 2,000 handwritten documents has been collected for this study. Results show that classification rate loss is not significant, and precision loss is only significant for recall values of 60–80% depending on the noise levels.  相似文献   

5.
6.
Search and retrieval is gaining importance in the ink domain due to the increase in the availability of online handwritten data. However, the problem is challenging due to variations in handwriting between various writers, digitizers and writing conditions. In this paper, we propose a retrieval mechanism for online handwriting, which can handle different writing styles, specifically for Indian languages. The proposed approach provides a keyboard-based search interface that enables to search handwritten data from any platform, in addition to pen-based and example-based queries. One of the major advantages of this framework is that information retrieval techniques such as ranking relevance, detecting stopwords and controlling word forms can be extended to work with search and retrieval in the ink domain. The framework also allows cross-lingual document retrieval across Indian languages.  相似文献   

7.
In this paper, we present a segmentation methodology of handwritten documents in their distinct entities, namely, text lines and words. Text line segmentation is achieved by applying Hough transform on a subset of the document image connected components. A post-processing step includes the correction of possible false alarms, the detection of text lines that Hough transform failed to create and finally the efficient separation of vertically connected characters using a novel method based on skeletonization. Word segmentation is addressed as a two class problem. The distances between adjacent overlapped components in a text line are calculated using the combination of two distance metrics and each of them is categorized either as an inter- or an intra-word distance in a Gaussian mixture modeling framework. The performance of the proposed methodology is based on a consistent and concrete evaluation methodology that uses suitable performance measures in order to compare the text line segmentation and word segmentation results against the corresponding ground truth annotation. The efficiency of the proposed methodology is demonstrated by experimentation conducted on two different datasets: (a) on the test set of the ICDAR2007 handwriting segmentation competition and (b) on a set of historical handwritten documents.  相似文献   

8.
The retrieval of information from scanned handwritten documents is becoming vital with the rapid increase of digitized documents, and word spotting systems have been developed to search for words within documents. These systems can be either template matching algorithms or learning based. This paper presents a coherent learning based Arabic handwritten word spotting system which can adapt to the nature of Arabic handwriting, which can have no clear boundaries between words. Consequently, the system recognizes Pieces of Arabic Words (PAWs), then re-constructs and spots words using language models. The proposed system produced promising result for Arabic handwritten word spotting when tested on the CENPARMI Arabic documents database.  相似文献   

9.
A new paradigm, which models the relationships between handwriting and topic categories, in the context of medical forms, is presented. The ultimate goals are: (1) a robust method which categorizes medical forms into specified categories, and (2) the use of such information for practical applications such as an improved recognition of medical handwriting or retrieval of medical forms as in a search engine. Medical forms have diverse, complex and large lexicons consisting of English, Medical and Pharmacology corpus. Our technique shows that a few recognized characters, returned by handwriting recognition, can be used to construct a linguistic model capable of representing a medical topic category. This allows (1) a reduced lexicon to be constructed, thereby improving handwriting recognition performance, and (2) PCR (Pre-Hospital Care Report) forms to be tagged with a topic category and subsequently searched by information retrieval systems. We present an improvement of over 7% in raw recognition rate and a mean average precision of 0.28 over a set of 1,175 queries on a data set of unconstrained handwritten medical forms filled in emergency environments. This work was supported by the National Science Foundation.  相似文献   

10.
ABSTRACT

Topic detection and tracking refers to automatic techniques for locating topically related cohesive paragraphs in a stream of text. Most documents are about more than one subject, but many Natural Language Processing (NLP) and Information Retrieval (IR) techniques implicitly assume documents have just one topic. Even in the presence of a single topic within a document, the document may address multiple subtopics and various aspects of the primary topic. Hence, dividing documents into topically coherent units and discovering their topic might have many uses. We describe new clues that account for the topic of grouping of contiguous portions of the text. Those clues are based on general lexical resources, which make them applicable to unrestricted texts, and can have many uses such as helping users find answers to general questions in an information search task, or in question/answering systems, or in text summarization. We devise an algorithm for identifying these clues, and we report on the performance of these clues, as well as the improvements suggested by our experiments.  相似文献   

11.
12.
This paper describes a handwritten Chinese text editing and recognition system that can edit handwritten text and recognize it with a client-server mode. First, the client end samples and redisplays the handwritten text by using digital ink technics, segments handwritten characters, edits them and saves original handwritten information into a self-defined document. The self-defined document saves coordinates of all sampled points of handwriting characters. Second, the server recognizes handwritten document based on the proposed Gabor feature extraction and affinity propagation clustering (GFAP) method, and returns the recognition results to client end. Moreover, the server can also collect the labeled handwritten characters and fine tune the recognizer automatically. Experimental results on HIT-OR3C database show that our handwriting recognition method improves the recognition performance remarkably.  相似文献   

13.
14.
15.
《Information Systems》1999,24(4):303-326
The emergence of the pen as the main interface device for personal digital assistants and pen-computers has made handwritten text, and more generally ink, a first-class object. As for any other type of data, the need of retrieval is a prevailing one. Retrieval of handwritten text is more difficult than that of conventional data since it is necessary to identify a handwritten word given slightly different variations in its shape. The current way of addressing this is by using handwriting recognition, which is prone to errors and limits the expressiveness of ink. Alternatively, one can retrieve from the database handwritten words that are similar to a query handwritten word using techniques borrowed from pattern and speech recognition. In this paper, an indexing technique based on Hidden Markov Models is proposed. Its implementation and its performance is reported in this paper.  相似文献   

16.
17.
18.
Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers perform adequately on constrained handwritten documents which typically use a restricted vocabulary (lexicon). But in the case of unconstrained handwritten documents, state-of-the-art word recognition accuracy is still below the acceptable limits. The objective of this research is to improve word recognition accuracy on unconstrained handwritten documents by applying a post-processing or OCR correction technique to the word recognition output. In this paper, we present two different methods for this purpose. First, we describe a lexicon reduction-based method by topic categorization of handwritten documents which is used to generate smaller topic-specific lexicons for improving the recognition accuracy. Second, we describe a method which uses topic-specific language models and a maximum-entropy based topic categorization model to refine the recognition output. We present the relative merits of each of these methods and report results on the publicly available IAM database.  相似文献   

19.
Voting techniques for expert search   总被引:4,自引:2,他引:2  
In an expert search task, the users’ need is to identify people who have relevant expertise to a topic of interest. An expert search system predicts and ranks the expertise of a set of candidate persons with respect to the users’ query. In this paper, we propose a novel approach for predicting and ranking candidate expertise with respect to a query, called the Voting Model for Expert Search. In the Voting Model, we see the problem of ranking experts as a voting problem. We model the voting problem using 12 various voting techniques, which are inspired from the data fusion field. We investigate the effectiveness of the Voting Model and the associated voting techniques across a range of document weighting models, in the context of the TREC 2005 and TREC 2006 Enterprise tracks. The evaluation results show that the voting paradigm is very effective, without using any query or collection-specific heuristics. Moreover, we show that improving the quality of the underlying document representation can significantly improve the retrieval performance of the voting techniques on an expert search task. In particular, we demonstrate that applying field-based weighting models improves the ranking of candidates. Finally, we demonstrate that the relative performance of the voting techniques for the proposed approach is stable on a given task regardless of the used weighting models, suggesting that some of the proposed voting techniques will always perform better than other voting techniques. Extended version of ‘Voting for candidates: adapting data fusion techniques for an expert search task’. C. Macdonald and I. Ounis. In Proceedings of ACM CIKM 2006, Arlington, VA. 2006. doi: 10.1145/1183614.1183671.  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号