20 similar documents found; search time: 0 ms
1.
Hidden tree Markov models for document image classification    Cited by: 3 (0 self-citations, 3 by others)
Diligenti M. Frasconi P. Gori M. 《IEEE transactions on pattern analysis and machine intelligence》2003,25(4):519-523
Classification is an important problem in image document processing and is often a preliminary step toward recognition, understanding, and information extraction. In this paper, the problem is formulated in the framework of concept learning and each category corresponds to the set of image documents with similar physical structure. We propose a solution based on two algorithmic ideas. First, we obtain a structured representation of images based on labeled XY-trees (this representation informs the learner about important relationships between image subconstituents). Second, we propose a probabilistic architecture that extends hidden Markov models for learning probability distributions defined on spaces of labeled trees. Finally, a successful application of this method to the categorization of commercial invoices is presented.
2.
3.
4.
Bakkali Souhail Ming Zuheng Coustaty Mickaël Rusiñol Marçal 《International Journal on Document Analysis and Recognition》2021,24(3):251-268
In the recent past, complex deep neural networks have received huge interest in various document understanding tasks such as...
5.
Eduardo Vellasques Robert Sabourin Eric Granger 《Expert systems with applications》2013,40(13):5240-5259
Intelligent watermarking (IW) techniques employ population-based evolutionary computing in order to optimize embedding parameters that trade off between watermark robustness and image quality for digital watermarking systems. Recent advances indicate that it is possible to decrease the computational burden of IW techniques in scenarios involving long heterogeneous streams of bi-tonal document images by recalling embedding parameters (solutions) from a memory based on a Gaussian Mixture Model (GMM) representation of optimization problems. This representation can provide ready-to-use solutions for similar optimization problem instances, avoiding the need for a costly re-optimization process. In this paper, a dual surrogate dynamic Particle Swarm Optimization (DS-DPSO) approach is proposed which employs a memory of GMMs in regression mode in order to decrease the cost of re-optimization for heterogeneous bi-tonal image streams. This approach is applied within a four-level search for near-optimal solutions, with increasing computational burden and precision. Following previous research, the first two levels use GMM re-sampling to recall solutions for recurring problems, making it possible to manage streams of heterogeneous images. Then, if the embedding parameters of an image require significant adaptation, the third level is activated. This optimization level relies on an off-line surrogate, using Gaussian Mixture Regression (GMR), to replace costly fitness evaluations during optimization. The final level also performs optimization, but GMR is employed as a costlier on-line surrogate in a worst-case scenario, providing a safeguard for the IW system. Experimental validation was performed on the OULU image data set, featuring heterogeneous image streams with varying levels of attack. In this scenario, the DS-DPSO approach has been shown to provide a comparable level of watermarking performance with a 93% reduction in computational cost compared to full re-optimization. Indeed, when significant parameter adaptation is required, fitness evaluations may be replaced with GMR.
6.
Morteza Valizadeh Ehsanollah Kabir 《International Journal on Document Analysis and Recognition》2012,15(1):57-69
In this paper, we propose a new algorithm for the binarization of degraded document images. We map the image into a 2D feature space in which the text and background pixels are separable, and then we partition this feature space into small regions. These regions are labeled as text or background using the result of a basic binarization algorithm applied on the original image. Finally, each pixel of the image is classified as either text or background based on the label of its corresponding region in the feature space. Our algorithm splits the feature space into text and background regions without using any training dataset. In addition, this algorithm does not need any parameter setting by the user and is appropriate for various types of degraded document images. The proposed algorithm demonstrated superior performance against six well-known algorithms on three datasets.
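The feature-space idea in this abstract can be sketched very roughly as follows. This is an illustrative assumption, not the paper's actual algorithm: the two features (raw intensity and a 3x3 local mean), the grid quantization, and the mean-intensity seed threshold are all stand-ins for the paper's richer choices.

```python
import numpy as np

def feature_binarize(img, bins=8):
    """Sketch: map pixels into a 2D (intensity, local-mean) feature space,
    label feature-space cells by majority vote of a crude global threshold,
    then classify every pixel by the label of its cell."""
    img = img.astype(float)
    h, w = img.shape
    # Feature 2: 3x3 local mean (edge-padded).
    p = np.pad(img, 1, mode='edge')
    local = sum(p[di:di + h, dj:dj + w]
                for di in range(3) for dj in range(3)) / 9.0
    seed = img < img.mean()                    # crude seed labels: True = text

    def cell(f):                               # quantize a feature into bins
        lo, hi = f.min(), f.max() + 1e-9
        return np.clip(((f - lo) / (hi - lo) * bins).astype(int), 0, bins - 1)

    c1, c2 = cell(img), cell(local)
    votes = np.zeros((bins, bins))
    total = np.zeros((bins, bins))
    np.add.at(votes, (c1, c2), seed)           # text votes per cell
    np.add.at(total, (c1, c2), 1)              # pixels per cell
    text_cell = votes > total / 2.0            # majority label per cell
    return text_cell[c1, c2]                   # classify pixels by their cell
```

On a synthetic image with a dark patch on a light background, the patch comes back as text and the rest as background.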
7.
Daewook Lee Joonho Kwon Weidong Yang Hyoseop Shin Jae-min Kwak Sukho Lee 《Journal of Intelligent Manufacturing》2009,20(3):273-282
XML stream filtering has gained widespread attention from the research community in recent years, and there have been many efforts to improve the performance of XML filtering systems by exploiting XML schema information. In this paper, we design and implement an XML stream filtering system, SFilter, which uses DTD or XML Schema information to improve performance. We propose a simplification step and two kinds of optimization, one static and the other dynamic. The simplification and static optimizations transform the XPath queries before building the automaton used as the index structure for filtering; the dynamic optimizations are applied at runtime, during filtering. We developed five kinds of static optimization and two kinds of dynamic optimization, and we present a novel filtering algorithm that handles the transformed XPath queries and performs the runtime optimization. Experimental results show that our system filters XML streams efficiently.
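The automaton-as-index idea behind this kind of filtering can be illustrated with a toy sketch, assuming a drastically simplified query language (linear paths of element names with `/` and `//` only, no predicates) and SAX-style start/end events; SFilter's actual query transformation and optimizations are far richer.

```python
def compile_xpath(path):
    # '/a//b/c' -> [('a', False), ('b', True), ('c', False)],
    # where True marks a step reached via the descendant axis '//'.
    steps, desc = [], False
    for part in path.split('/')[1:]:
        if part == '':
            desc = True
        else:
            steps.append((part, desc))
            desc = False
    return steps

def filter_stream(steps, events):
    """Run the step automaton over ('start', tag) / ('end', tag) events,
    as a SAX parser would emit them; True as soon as the path matches."""
    stack, active = [], {0}           # active = automaton states at this depth
    for kind, tag in events:
        if kind == 'start':
            stack.append(active)
            nxt = set()
            for s in active:
                t, desc = steps[s]
                if desc:
                    nxt.add(s)        # '//' lets the state keep waiting deeper
                if t == tag:
                    if s + 1 == len(steps):
                        return True   # full path matched
                    nxt.add(s + 1)
            active = nxt
        else:
            active = stack.pop()      # leaving an element restores the states
    return False
```

For example, `/a//c` matches the event stream for `<a><b><c>`, while `/a/c` does not, since `c` is not a direct child there.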
8.
The document spectrum for page layout analysis    Cited by: 17 (0 self-citations, 17 by others)
Page layout analysis is a document processing technique used to determine the format of a page. This paper describes the document spectrum (or docstrum), which is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components. The method yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks. It is advantageous over many other methods in three main ways: independence from skew angle, independence from different text spacings, and the ability to process local regions of different text orientations within the same image. Results of the method shown for several different page formats and for randomly oriented subpages on the same image illustrate the versatility of the method. We also discuss the differences, advantages, and disadvantages of the docstrum with respect to other layout methods.
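One small piece of the docstrum, skew estimation from nearest-neighbor angles, can be sketched as below. This is a toy version: it takes component centroids as given (the real method starts from connected components) and uses brute-force nearest-neighbor search with a simple modal-angle estimate rather than the paper's full spectrum analysis.

```python
import math

def docstrum_skew(centroids, bins=36):
    """Estimate dominant text orientation: histogram the angle from each
    component centroid to its nearest neighbour, folded into [0, 180)
    degrees, and return the modal bin's angle (5-degree bins by default)."""
    hist = [0] * bins
    for i, (x1, y1) in enumerate(centroids):
        best_d2, angle = float('inf'), 0.0
        for j, (x2, y2) in enumerate(centroids):
            if i == j:
                continue
            d2 = (x2 - x1) ** 2 + (y2 - y1) ** 2
            if d2 < best_d2:                  # brute-force nearest neighbour
                best_d2 = d2
                angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
        hist[int(angle / 180.0 * bins) % bins] += 1
    peak = max(range(bins), key=lambda b: hist[b])
    return peak * 180.0 / bins                # modal angle ~ dominant skew
```

Centroids laid out on a horizontal line yield 0 degrees; a line at 45 degrees yields 45.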
9.
Information retrieval in document image databases    Cited by: 2 (0 self-citations, 2 by others)
Yue Lu Chew Lim Tan 《IEEE Transactions on Knowledge and Data Engineering》2004,16(11):1398-1410
With the rising popularity and importance of document images as an information source, information retrieval in document image databases has become a growing and challenging problem. In this paper, we propose an approach with the capability of matching partial word images to address two issues in document image retrieval: word spotting and similarity measurement between documents. First, each word image is represented by a primitive string. Then, an inexact string matching technique is utilized to measure the similarity between the two primitive strings generated from two word images. Based on the similarity, we can estimate how a word image is relevant to the other and, thereby, decide whether one is a portion of the other. To deal with various character fonts, we use a primitive string which is tolerant to serif and font differences to represent a word image. Using this technique of inexact string matching, our method is able to successfully handle the problem of heavily touching characters. Experimental results on a variety of document image databases confirm the feasibility, validity, and efficiency of our proposed approach in document image retrieval.
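The inexact string matching at the core of this approach can be sketched with a standard weighted edit distance. The paper's primitives and cost functions are richer (serif- and font-tolerant); the uniform substitution/indel costs and the normalized similarity below are illustrative assumptions showing only the DP skeleton.

```python
def inexact_match(a, b, sub_cost=1.0, indel=1.0):
    """Weighted edit distance between two primitive strings: the minimum
    total cost of substitutions, insertions, and deletions turning a into b."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel                    # delete all of a's prefix
    for j in range(1, n + 1):
        d[0][j] = j * indel                    # insert all of b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + indel,         # deletion
                          d[i][j - 1] + indel,         # insertion
                          d[i - 1][j - 1] + cost)      # match / substitution
    return d[m][n]

def similarity(a, b):
    # Normalize the distance into [0, 1]; 1.0 means identical strings.
    return 1.0 - inexact_match(a, b) / max(len(a), len(b), 1)
```

Two identical primitive strings score 1.0; one substitution in a three-symbol string drops the score to 2/3.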
10.
B. Gatos I. Pratikakis 《Pattern recognition》2006,39(3):317-327
This paper presents a new adaptive approach for the binarization and enhancement of degraded documents. The proposed method does not require any parameter tuning by the user and can deal with degradations which occur due to shadows, non-uniform illumination, low contrast, large signal-dependent noise, smear and strain. We follow several distinct steps: a pre-processing procedure using a low-pass Wiener filter, a rough estimation of foreground regions, a background surface calculation by interpolating neighboring background intensities, a thresholding by combining the calculated background surface with the original image while incorporating image up-sampling and finally a post-processing step in order to improve the quality of text regions and preserve stroke connectivity. After extensive experiments, our method demonstrated superior performance against four well-known techniques on numerous degraded document images.
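The central background-surface step can be sketched roughly as follows, under heavy simplification: the Wiener pre-filtering, up-sampling, and post-processing stages are omitted, the rough foreground estimate is a plain mean threshold, and the interpolation is a local windowed average of background pixels; the fraction `q` is an assumed parameter, not the paper's thresholding rule.

```python
import numpy as np

def background_surface_binarize(img, q=0.8, win=5):
    """Sketch: estimate a background intensity surface from pixels a crude
    threshold calls background, then mark as text any pixel clearly darker
    than a fraction q of that surface."""
    img = img.astype(float)
    rough_fg = img < img.mean()              # rough foreground estimation
    pad = win // 2
    pimg = np.pad(img, pad, mode='edge')
    pbg = np.pad((~rough_fg).astype(float), pad, mode='edge')
    h, w = img.shape
    surface = np.empty_like(img)
    for i in range(h):
        for j in range(w):
            wi = pimg[i:i + win, j:j + win]
            wb = pbg[i:i + win, j:j + win]
            # Interpolate from neighbouring background intensities; fall
            # back to the window mean if the window is all foreground.
            surface[i, j] = (wi * wb).sum() / wb.sum() if wb.sum() else wi.mean()
    return img < q * surface                 # text = darker than background
```

On a light page with a dark patch, the patch is marked as text even though the background surface varies only locally.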
11.
Nawei Chen Dorothea Blostein 《International Journal on Document Analysis and Recognition》2007,10(1):1-16
Document image classification is an important step in Office Automation, Digital Libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in the use of training data to construct class models, and in the choice of document features and classification algorithms. We survey this diverse literature using three components: the problem statement, the classifier architecture, and performance evaluation. This brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism. We emphasize techniques that classify single-page typeset document images without using OCR results. Developing a general, adaptable, high-performance classifier is challenging due to the great variety of documents, the diverse criteria used to define document classes, and the ambiguity that arises due to ill-defined or fuzzy document classes.
12.
Xiang-guo Zhao Guoren Wang Xin Bi Peizhen Gong Yuhai Zhao 《Neurocomputing》2011,74(16):2444-2451
In this paper, we describe an XML document classification framework based on extreme learning machine (ELM). On the basis of Structured Link Vector Model (SLVM), an optimized Reduced Structured Vector Space Model (RS-VSM) is proposed to incorporate structural information into feature vectors more efficiently and optimize the computation of document similarity. We apply ELM in the XML document classification to achieve good performance at extremely high speed compared with conventional learning machines (e.g., support vector machine). A voting-ELM algorithm is then proposed to improve the accuracy of ELM classifier. Revoting of Equal Votes (REV) method and Revoting of Confusing Classes (RCC) method are also proposed to postprocess the voting result of v-ELM and further improve the performance. The experiments conducted on real world classification problems demonstrate that the voting-ELM classifiers presented in this paper can achieve better performance than ELM algorithms with respect to precision, recall and F-measure.
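The ELM and voting-ELM ingredients can be sketched compactly. This is a generic illustration, not the paper's RS-VSM pipeline: random sigmoid hidden layer, closed-form least-squares output weights, and a plain majority vote over independently seeded machines (the REV/RCC revoting refinements are omitted).

```python
import numpy as np

def train_elm(X, y, hidden=20, seed=0):
    """Basic ELM: random input weights (never trained), sigmoid hidden
    layer, output weights solved in closed form by least squares."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], hidden))
    b = rng.normal(size=hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))          # hidden activations
    T = np.eye(int(y.max()) + 1)[y]                 # one-hot targets
    beta, *_ = np.linalg.lstsq(H, T, rcond=None)    # closed-form solve
    return W, b, beta

def predict_elm(model, X):
    W, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return (H @ beta).argmax(axis=1)

def vote_elm(X, y, X_test, machines=5):
    # Majority vote over independently initialised ELMs (the v-ELM idea).
    votes = np.stack([predict_elm(train_elm(X, y, seed=s), X_test)
                      for s in range(machines)])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

Training is a single matrix solve per machine, which is where the "extremely high speed" relative to iteratively trained classifiers comes from.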
13.
XML's increasing diffusion makes efficient XML query processing and indexing all the more critical. Given the semistructured nature of XML documents, however, general query processing techniques won't work. Researchers have proposed several specialized indexing methods that offer query processors efficient access to XML documents, although none are yet fully implemented in commercial products. In this article the classification of XML indexing techniques identifies current practices and trends, offering insight into how developers can improve query processing and select the best solution for particular contexts.
14.
Tayo Obafemi-Ajayi Gady Agam Ophir Frieder 《International Journal on Document Analysis and Recognition》2010,13(1):1-17
The fast evolution of scanning and computing technologies in recent years has led to the creation of large collections of scanned historical documents. It is almost always the case that these scanned documents suffer from some form of degradation. Large degradations make documents hard to read and substantially deteriorate the performance of automated document processing systems. Enhancement of degraded document images is normally performed assuming global degradation models. When the degradation is large, global degradation models do not perform well. In contrast, we propose to learn local degradation models and use them in enhancing degraded document images. Using a semi-automated enhancement system, we have labeled a subset of the Frieder diaries collection (the diaries of Rabbi Dr. Avraham Abba Frieder). This labeled subset was then used to train classifiers based on lookup tables in conjunction with the approximate nearest neighbor algorithm. The resulting algorithm is highly efficient and effective. Experimental evaluation results are provided using the Frieder diaries collection.
15.
Adaptive document block segmentation and classification    Cited by: 3 (0 self-citations, 3 by others)
Shih F.Y. Shy-Shyan Chen 《IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics》1996,26(5):797-802
This paper presents an adaptive block segmentation and classification technique for daily-received office documents having complex layout structures such as multiple columns and mixed-mode contents of text, graphics, and pictures. First, an improved two-step block segmentation algorithm is performed based on run-length smoothing for decomposing any document into single-mode blocks. Then, a rule-based block classification is used for classifying each block into the text, horizontal/vertical line, graphics, or picture type. The document features and rules used are independent of character font and size and the scanning resolution. Experimental results show that our algorithms are capable of correctly segmenting and classifying different types of mixed-mode printed documents.
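The run-length smoothing at the heart of this kind of block segmentation is easy to sketch in one dimension. In the classic scheme this is applied along rows and columns with different thresholds and the two results are ANDed; the threshold value here is arbitrary, and the full two-step algorithm in the paper adds more than this.

```python
def rlsa_1d(row, threshold):
    """Run-length smoothing on one scanline: fill runs of 0s (white) no
    longer than `threshold` that are flanked by 1s (black), so nearby
    black pixels merge into solid blocks."""
    out = list(row)
    n = len(out)
    i = 0
    while i < n:
        if out[i] == 0:
            j = i
            while j < n and out[j] == 0:
                j += 1                       # j = end of this white run
            # Fill only interior gaps (flanked by black) that are short.
            if 0 < i and j < n and (j - i) <= threshold:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out
```

For example, with a threshold of 2, a two-pixel gap between black pixels is filled while a four-pixel gap is left open.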
16.
The bag-of-words approach to text document representation typically results in vectors of the order of 5000–20,000 components as the representation of documents. To make effective use of various statistical classifiers, it may be necessary to reduce the dimensionality of this representation. We point out deficiencies in class discrimination of two popular such methods: Latent Semantic Indexing (LSI), and sequential feature selection according to some relevant criterion. As a remedy, we suggest feature transforms based on Linear Discriminant Analysis (LDA). Since LDA requires operating both with large and dense matrices, we propose an efficient intermediate dimension reduction step using either a random transform or LSI. We report good classification results with the combined feature transform on a subset of the Reuters-21578 database. Drastic reduction of the feature vector dimensionality from 5000 to 12 actually improves the classification performance.
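The two-stage transform (random projection, then LDA) can be sketched as below for the two-class case. The Gaussian random matrix, the ridge term, and the Fisher-direction formulation are illustrative assumptions; the paper works with multi-class LDA and much larger vocabularies.

```python
import numpy as np

def rp_then_lda(X, y, inter_dim=5, seed=0):
    """Sketch: a random projection shrinks bag-of-words vectors to
    `inter_dim` components, then a two-class Fisher LDA direction is
    fitted in the reduced space (ridge term for numerical stability)."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], inter_dim)) / np.sqrt(inter_dim)
    Z = X @ R                                       # intermediate reduction
    m0, m1 = Z[y == 0].mean(0), Z[y == 1].mean(0)
    Sw = np.cov(Z[y == 0].T) + np.cov(Z[y == 1].T)  # within-class scatter
    w = np.linalg.solve(Sw + 1e-6 * np.eye(inter_dim), m1 - m0)
    return R, w

def project(X, R, w):
    # One discriminant score per document; class 1 should score higher.
    return (X @ R) @ w
```

The random projection makes the scatter matrices small and dense enough for LDA to handle, which is exactly the role the abstract assigns to the intermediate step.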
17.
This paper presents a document classifier based on text content features and its application to email classification. We test the validity of a classifier which uses Principal Component Analysis Document Reconstruction (PCADR), where the idea is that principal component analysis (PCA) can optimally compress only the kind of documents (in our experiments, email classes) that were used to compute the principal components (PCs), and that for other kinds of documents the compression will not perform well using only a few components. Thus, the classifier computes the PCA separately for each document class; when a new instance arrives to be classified, it is projected onto each class's set of computed PCs and then reconstructed using those same PCs. The reconstruction error is computed, and the classifier assigns the instance to the class with the smallest error or divergence from the class representation. We test this approach in email filtering by distinguishing between two message classes (e.g., spam from ham, or phishing from ham). The experiments show that PCADR obtains very good results on the different validation datasets employed, reaching better performance than the popular Support Vector Machine classifier.
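The PCADR scheme as described (per-class PCA, classify by smallest reconstruction error) can be sketched directly; the SVD-based PCA and the Euclidean error norm below are standard choices assumed for illustration, and the class names and API are hypothetical.

```python
import numpy as np

class PCADR:
    """Sketch of PCA Document Reconstruction: fit a separate PCA per class,
    reconstruct a new document with each class's components, and assign it
    to the class with the smallest reconstruction error."""
    def __init__(self, n_components=2):
        self.k = n_components
        self.models = {}

    def fit(self, X, y):
        for c in set(y):
            Xc = X[y == c]
            mu = Xc.mean(0)
            # Top-k principal components via SVD of the centred class data.
            _, _, Vt = np.linalg.svd(Xc - mu, full_matrices=False)
            self.models[c] = (mu, Vt[:self.k])
        return self

    def predict(self, X):
        # Reconstruction error of each row under each class's PCA model.
        errs = {c: np.linalg.norm(X - mu - (X - mu) @ V.T @ V, axis=1)
                for c, (mu, V) in self.models.items()}
        classes = sorted(errs)
        E = np.stack([errs[c] for c in classes], axis=1)
        return np.array(classes)[E.argmin(1)]
```

A point lying in class 0's subspace reconstructs with near-zero error under class 0's PCs and a large error under class 1's, so the argmin recovers the class.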
18.
We present a simple and yet effective approach for document classification to incorporate rationales elicited from annotators into the training of any off-the-shelf classifier. We empirically show on several document classification datasets that our classifier-agnostic approach, which makes no assumptions about the underlying classifier, can effectively incorporate rationales into the training of multinomial naïve Bayes, logistic regression, and support vector machines. In addition to being classifier-agnostic, we show that our method has comparable performance to previous classifier-specific approaches developed for incorporating rationales and feature annotations. Additionally, we propose and evaluate an active learning method tailored specifically for the learning with rationales framework.
19.
Soumyadeep Dey Jayanta Mukherjee Shamik Sural 《International Journal on Document Analysis and Recognition》2016,19(4):351-368
Segmentation of a document image plays an important role in automatic document processing. In this paper, we propose a consensus-based clustering approach for document image segmentation. In this method, the foreground regions of a document image are grouped into a set of primitive blocks, and a set of features is extracted from them. Similarities among the blocks are computed on each feature using a hypothesis test-based similarity measure. Based on the consensus of these similarities, clustering is performed on the primitive blocks. This clustering approach is used iteratively with a classifier to label each primitive block. Experimental results show the effectiveness of the proposed method. It is further shown in the experimental results that the dependency of classification performance on the training data is significantly reduced.
20.
E. Appiani F. Cesarini A.M. Colla M. Diligenti M. Gori S. Marinai G. Soda 《International Journal on Document Analysis and Recognition》2001,4(2):69-83
In this paper a system for the analysis and automatic indexing of imaged documents for high-volume applications is described. This system, named STRETCH (STorage and RETrieval by Content of imaged documents), is based on an Archiving and Retrieval Engine which overcomes the bottleneck of document profiling, bypassing some limitations of existing pre-defined indexing schemes. The engine exploits a structured document representation and can activate appropriate methods to characterise and automatically index heterogeneous documents with variable layout. The originality of STRETCH lies principally in the possibility for unskilled users to define the indexes relevant to the document domains of their interest by simply presenting visual examples, and in applying reliable automatic information extraction methods (document classification, flexible reading strategies) to index the documents automatically, thus creating archives as desired. STRETCH offers ease of use and application programming and the ability to dynamically adapt to new types of documents. The system has been tested in two applications in particular, one concerning passive invoices and the other bank documents. In these applications, several classes of documents are involved. The indexing strategy first automatically classifies the document, thus avoiding pre-sorting, then locates and reads the information pertaining to the specific document class. Experimental results are encouraging overall; in particular, document classification results fulfill the requirements of high-volume applications. Integration into production lines is under way.
Received March 30, 2000 / Revised June 26, 2001