Similar Documents
20 similar documents found (search time: 46 ms)
1.
Segmentation and classification of mixed text/graphics/image documents (Total citations: 2, self-citations: 0, cited by others: 2)
In this paper, a feature-based document analysis system is presented which utilizes domain knowledge to segment and classify mixed text/graphics/image documents. In our approach, we first perform a run-length smearing operation followed by a stripe-merging procedure to segment the blocks embedded in a document. The classification task is then performed based on the domain knowledge induced from the primitives associated with each type of medium. Proper use of domain knowledge proves effective in accelerating segmentation and reducing classification error. The experimental study demonstrates the feasibility of the new technique in segmenting and classifying mixed text/graphics/image documents.
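The run-length smearing step admits a compact illustration. Below is a minimal NumPy sketch of the classic RLSA idea, assuming a 0/1 binary image; the threshold values and function names are my own, and the paper's stripe-merging procedure is not reproduced:

```python
import numpy as np

def smear_1d(row, threshold):
    """Fill background (0) runs shorter than `threshold` when bounded by ink (1)."""
    out = row.copy()
    run_start = None
    for i, v in enumerate(row):
        if v == 0 and run_start is None:
            run_start = i
        elif v == 1 and run_start is not None:
            if run_start > 0 and i - run_start <= threshold:
                out[run_start:i] = 1     # bridge the short gap
            run_start = None
    return out

def rlsa(binary, h_thresh=30, v_thresh=15):
    """Horizontal and vertical smearing; their AND links characters into blocks."""
    horiz = np.apply_along_axis(smear_1d, 1, binary, h_thresh)
    vert = np.apply_along_axis(smear_1d, 0, binary, v_thresh)
    return horiz & vert
```

Connected components of the smeared image then serve as candidate blocks for classification.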

2.
Document segmentation is a process that aims to filter documents while identifying certain regions of interest. Generally, the regions of interest include text, graphics (image-occupied regions) and the background. This paper presents a novel top-down approach to document segmentation using texture features extracted from the specified/selected documents. A mask of suitable size is used to summarize textural features, and statistical parameters are captured as blocks in document images. Four textural features extracted from masks using the gray-level co-occurrence matrix (GLCM) are entropy, contrast, energy and homogeneity. Two further statistical parameters extracted from the corresponding masks are the modal and median pixel values. The extracted attributes allow the classification of each mask or block as text, graphics, or background. A feedforward network trained on the six extracted attributes, using documents obtained from a public database, achieves an error rate of 15.77%. Furthermore, this novel approach produces promising segmentation performance and is expected to be efficient for content-based information retrieval systems. Detection of duplicate documents within large databases is another potential application.
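As a rough illustration of the six attributes described above, here is a sketch using scikit-image's GLCM utilities (skimage >= 0.19 assumed; entropy is computed by hand since graycoprops does not expose it):

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def block_features(block):
    """Six features per block: GLCM entropy, contrast, energy, homogeneity,
    plus the modal and median pixel values. `block` is a 2-D uint8 array."""
    glcm = graycomatrix(block, distances=[1], angles=[0],
                        levels=256, symmetric=True, normed=True)
    p = glcm[:, :, 0, 0]
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    texture = [graycoprops(glcm, f)[0, 0]
               for f in ('contrast', 'energy', 'homogeneity')]
    mode = np.bincount(block.ravel(), minlength=256).argmax()
    return np.array([entropy, *texture, mode, np.median(block)])
```

Each block's feature vector would then be fed to a feedforward classifier (for instance a small MLP) with text/graphics/background labels.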

3.
4.
Text segmentation using Gabor filters for automatic document processing (Total citations: 24, self-citations: 0, cited by others: 24)
There is considerable interest in designing automatic systems that will scan a given paper document and store it on electronic media for easier storage, manipulation, and access. Most documents contain graphics and images in addition to text. Thus, the document image has to be segmented to identify the text regions, so that OCR techniques may be applied only to those regions. In this paper, we present a simple method for document image segmentation in which text regions in a given document image are automatically identified. The proposed segmentation method for document images is based on a multichannel filtering approach to texture segmentation. The text in the document is considered a textured region. Non-text contents in the document, such as blank spaces, graphics, and pictures, are considered regions with different textures. Thus, the problem of segmenting document images into text and non-text regions can be posed as a texture segmentation problem. Two-dimensional Gabor filters are used to extract texture features for each of these regions. These filters have been used extensively for a variety of texture segmentation tasks; here we apply the same filters to the document image segmentation problem. Our segmentation method does not assume any a priori knowledge about the content or font styles of the document, and is shown to work even for skewed images and handwritten text. Results of the proposed segmentation method are presented for several test images and demonstrate the robustness of this technique. This work was supported by the National Science Foundation under NSF grant CDA-88-06599 and by a grant from E. I. Du Pont De Nemours & Company.
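A minimal sketch of the multichannel Gabor filtering idea using scikit-image; the filter-bank frequencies, orientations, and smoothing below are illustrative choices, not the paper's settings:

```python
import numpy as np
from skimage.filters import gabor
from scipy.ndimage import gaussian_filter

def gabor_feature_stack(gray, frequencies=(0.1, 0.2, 0.3),
                        thetas=(0, np.pi/4, np.pi/2, 3*np.pi/4)):
    """Per-pixel texture features: smoothed magnitude responses of a
    Gabor filter bank (one channel per frequency/orientation pair)."""
    channels = []
    for f in frequencies:
        for t in thetas:
            real, imag = gabor(gray, frequency=f, theta=t)
            magnitude = np.hypot(real, imag)
            channels.append(gaussian_filter(magnitude, sigma=0.5 / f))
    return np.stack(channels, axis=-1)   # shape: H x W x n_channels
```

Clustering the per-pixel feature vectors (e.g., k-means into two clusters) then yields a text/non-text partition.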

5.
Applied Soft Computing, 2008, 8(1): 118-126
In this work, we propose a new document page segmentation method, capable of differentiating between text, graphics and background, using a neuro-fuzzy methodology. Our approach is based first on the analysis of a set of features extracted from the image, available at different resolution levels. An initial segmentation is obtained by classifying the pixels into coherent regions, which are successively refined by an analysis of their shape. The core of our approach relies on a neuro-fuzzy methodology for performing the classification processes. The proposed strategy describes the physical structure of a page accurately and proved robust against noise and page skew. Additionally, the knowledge-based neuro-fuzzy methodology allows us to understand the classification mechanisms better, in contrast to other, knowledge-free methods.

6.
Research on document image segmentation is of great significance for data-processing tasks such as printing and faxing. A new algorithm for document image segmentation is proposed. The algorithm is based on the differing histogram characteristics of the various image types found in a document image. A key feature of the algorithm is the use of wavelet images to enhance the features of the original image, which improves segmentation accuracy.
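The wavelet-enhancement idea can be sketched with PyWavelets; the choice of wavelet and histogram binning below are assumptions, not the paper's settings:

```python
import numpy as np
import pywt

def detail_energy_and_histogram(gray, bins=64):
    """One-level 2-D DWT; the combined detail energy emphasizes stroke-dense
    (text) areas, and its histogram differs by region type."""
    cA, (cH, cV, cD) = pywt.dwt2(gray.astype(float), 'haar')
    detail = np.sqrt(cH**2 + cV**2 + cD**2)
    hist, _ = np.histogram(detail, bins=bins)
    return detail, hist / hist.sum()
```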

7.
Research on optimization algorithms for rapid prototyping slice data (Total citations: 4, self-citations: 0, cited by others: 4)
To enable smooth downstream processing of the slice contour data of STL models, an algorithm for optimizing the slice data is proposed. For the large amount of redundant data in the contour information produced when a defective STL model is sliced, a redundant-data filtering algorithm is proposed; for unclosed slice contours, an effective correction algorithm is given; and an algorithm for automatically identifying the inner and outer boundaries of slice contours is also presented. The approach is simple and efficient, improves the efficiency of subsequent data processing and the quality of the fabricated parts, and improves the manufacturability of formed parts.
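Two of the described steps, redundant-data filtering and inner/outer boundary identification, can be sketched as follows (the cross-product collinearity test and tolerance are my own choices, not the paper's):

```python
import numpy as np

def drop_collinear(points, tol=1e-9):
    """Filter redundant slice-contour vertices: drop any point that lies on
    the straight line through its kept predecessor and its successor."""
    out = [points[0]]
    for i in range(1, len(points) - 1):
        (ax, ay), (bx, by), (cx, cy) = out[-1], points[i], points[i + 1]
        cross = (bx - ax) * (cy - ay) - (by - ay) * (cx - ax)
        if abs(cross) > tol:            # keep only genuine corner points
            out.append(points[i])
    out.append(points[-1])
    return out

def signed_area(poly):
    """Shoelace formula; the sign distinguishes outer boundaries (CCW, > 0)
    from inner hole boundaries (CW, < 0)."""
    x, y = np.asarray(poly, dtype=float).T
    return 0.5 * np.sum(x * np.roll(y, -1) - y * np.roll(x, -1))
```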

8.
Transforming paper documents into XML format with WISDOM++ (Total citations: 1, self-citations: 1, cited by others: 0)
The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents in the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported.
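The final conversion step might look roughly like the following; the element and attribute names are invented for illustration and do not reproduce WISDOM++'s actual schema:

```python
import xml.etree.ElementTree as ET

def layout_to_xml(blocks, out_path="page.xml"):
    """Serialize classified layout blocks (dicts with x, y, w, h, label and
    OCR text) into a simple XML page description."""
    page = ET.Element("page")
    for b in blocks:
        el = ET.SubElement(page, "block", type=b["label"],
                           x=str(b["x"]), y=str(b["y"]),
                           w=str(b["w"]), h=str(b["h"]))
        el.text = b.get("text", "")
    ET.ElementTree(page).write(out_path, encoding="utf-8", xml_declaration=True)
```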

9.
10.
A Chinese page-layout segmentation method based on minimum spanning tree clustering (Total citations: 1, self-citations: 1, cited by others: 0)
To handle the mixed horizontal and vertical text arrangement typical of Chinese page layouts, a layout segmentation method based on minimum spanning tree clustering is proposed. The original image is smoothed with horizontal and vertical run-length smearing, and the resulting connected components are pre-classified into horizontally and vertically set text. A minimum spanning tree clustering algorithm is then applied to each pre-classified text class. In experiments the method reaches an accuracy of 97%, showing that it segments Chinese documents well.
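A minimal sketch of MST clustering over connected-component centroids, using SciPy; the long-edge cutting heuristic is an assumption, since the paper's exact criterion is not given here:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_cluster(centroids, cut_factor=2.0):
    """Build an MST over component centroids, cut edges much longer than the
    mean MST edge, and return the resulting cluster labels."""
    dist = squareform(pdist(np.asarray(centroids, dtype=float)))
    mst = minimum_spanning_tree(dist).toarray()
    edges = mst[mst > 0]
    mst[mst > cut_factor * edges.mean()] = 0    # sever inter-block links
    _, labels = connected_components(mst, directed=False)
    return labels
```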

11.
Text extraction in mixed-type documents is a necessary pre-processing stage for many document applications. In mixed-type color documents, text, drawings and graphics appear with millions of different colors, and in many cases text regions are overlaid onto drawings or graphics. In this paper, a new method to automatically detect and extract text in mixed-type color documents is presented. The proposed method is based on a combination of an adaptive color reduction (ACR) technique and a page layout analysis (PLA) approach. The ACR technique is used to obtain the optimal number of colors and to convert the document to those principal colors. Then, using the principal colors, the document image is split into separable color planes. Thus, binary images are obtained, each one corresponding to a principal color. The PLA technique is applied independently to each of the color planes and identifies the text regions. A merging procedure is applied in the final stage to merge the text regions derived from the color planes and to produce the final document. Several experimental and comparative results, exhibiting the performance of the proposed technique, are also presented.
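The color reduction and plane splitting can be approximated with plain k-means quantization; this is a stand-in for the paper's adaptive color reduction, which also selects the number of colors automatically:

```python
import numpy as np
from sklearn.cluster import KMeans

def color_planes(rgb, n_colors=8):
    """Quantize an H x W x 3 image to `n_colors` principal colors and split
    it into one binary plane per color."""
    h, w, _ = rgb.shape
    labels = KMeans(n_clusters=n_colors, n_init=4, random_state=0) \
        .fit_predict(rgb.reshape(-1, 3).astype(float))
    return [(labels == k).reshape(h, w) for k in range(n_colors)]
```

Each plane would then be fed independently to the page layout analysis stage.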

12.
13.
A new algorithm for segmenting documents into regions containing musical scores and text is proposed. Such segmentation is a required step prior to applying optical character recognition and optical music recognition to scanned pages that contain both music notation and text. Our segmentation technique is based on the bag-of-visual-words representation followed by random block voting (RBV) in order to detect the bounding boxes containing the musical score and text within a document image. The RBV procedure consists of extracting a fixed number of blocks whose position and size are sampled from a discrete uniform distribution that “over”-covers the input image. Each block is automatically classified as either musical score or text and votes with a particular posterior probability of classification in its spatial domain. An initial coarse segmentation is obtained by summarizing all the votes in a single image. Subsequently, the final segmentation is obtained by subdividing the image into microblocks and classifying them with an N-nearest-neighbor classifier trained on the coarse segmentation. We demonstrate the potential of the proposed method by experiments on two different datasets. One is a challenging dataset of images collected and artificially combined and manipulated for this project. The other is a music dataset obtained by scanning two music books. The results are reported using precision/recall metrics of the overlapping area with respect to the ground truth. The proposed system achieves an overall averaged F-measure of 85%. The complete source code package and associated data are available at https://github.com/fpeder/mscr under the FreeBSD license to support reproducibility.
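A sketch of the random block voting procedure follows; the block-size range, block count, and `classify` callback are placeholders, and the repository linked above is the authoritative implementation:

```python
import numpy as np

def random_block_voting(img, classify, n_blocks=2000,
                        min_size=32, max_size=128, seed=0):
    """Each randomly placed block votes, over its own area, with the posterior
    probability `classify(patch)` that the patch is musical score."""
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    votes = np.zeros((h, w))
    hits = np.zeros((h, w))
    for _ in range(n_blocks):
        bh = int(rng.integers(min_size, max_size + 1))
        bw = int(rng.integers(min_size, max_size + 1))
        y = int(rng.integers(0, max(1, h - bh)))
        x = int(rng.integers(0, max(1, w - bw)))
        votes[y:y + bh, x:x + bw] += classify(img[y:y + bh, x:x + bw])
        hits[y:y + bh, x:x + bw] += 1
    return votes / np.maximum(hits, 1)   # coarse per-pixel score map
```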

14.
Kang Houliang, Yang Yuting. Journal of Graphics (图学学报), 2022, 43(5): 865-874
Deep learning techniques, represented by convolutional neural networks (CNNs), have shown excellent performance in image classification and recognition. However, no standard, public dataset of Dongba pictographs exists, so existing deep learning algorithms cannot be applied or adapted directly. To quickly build an authoritative and effective Dongba character library, the immediate tasks are to analyze the layout structure of published Dongba documents and to extract text lines and Dongba characters from them. Accordingly, drawing on the structural characteristics of Dongba pictograph document images, an automatic text-line segmentation algorithm for Dongba document images is presented. First, a density- and distance-based k-means clustering algorithm determines the number of text-line classes and the classification criteria; then, a second pass over the character blocks corrects erroneous segmentation results and improves the algorithm's accuracy. The approach makes full use of the structural features of Dongba documents while retaining the objectivity of a machine learning model, free from subjective human experience. Experiments show that the algorithm can segment text lines in Dongba document images, offline handwritten Chinese text, and Dongba scriptures, as well as segment the Dongba and Chinese characters within a text line. It is simple to implement, highly accurate, and adaptable, laying a foundation for building a Dongba character library.
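The line-grouping idea can be illustrated by clustering the vertical centers of character blocks; in this sketch the number of lines is passed in, whereas the paper derives it (and the cluster seeds) with a density- and distance-based k-means:

```python
import numpy as np
from sklearn.cluster import KMeans

def group_text_lines(boxes, n_lines):
    """Group (x, y, w, h) character boxes into text lines by k-means on
    their vertical centers, then order each line left to right."""
    y_centers = np.array([[y + h / 2.0] for (x, y, w, h) in boxes])
    labels = KMeans(n_clusters=n_lines, n_init=10,
                    random_state=0).fit_predict(y_centers)
    lines = {}
    for box, lab in zip(boxes, labels):
        lines.setdefault(lab, []).append(box)
    return [sorted(line, key=lambda b: b[0]) for line in lines.values()]
```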

15.
The convenience of search, both on a personal computer's hard disk and on the web, is still limited mainly to machine-printed text documents and images because of the poor accuracy of handwriting recognizers. The focus of this paper is the segmentation of handwritten text and machine-printed text from annotated documents, sometimes referred to as the task of “ink separation”, to advance the state of the art in searching hand-annotated documents. We propose a method with two main steps: patch-level separation and pixel-level separation. In the patch-level separation step, the entire document is modeled as a Markov Random Field (MRF). Three different classes (machine-printed text, handwritten text and overlapped text) are initially identified using G-means based classification followed by an MRF-based relabeling procedure. An MRF-based classification approach then separates overlapped text into machine-printed and handwritten text using pixel-level features, forming the second step of the method. Experimental results on a set of machine-printed documents annotated by multiple writers in an office/collaborative environment show that our method is robust and provides good text separation performance.
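The MRF relabeling can be illustrated with generic iterated conditional modes over a grid of patch labels using a Potts smoothness term; this is a textbook MRF smoother under my own assumptions, not the paper's exact energy or feature set:

```python
import numpy as np

def icm_relabel(unary, beta=0.8, n_iters=5):
    """`unary` is an H x W x K array of per-patch class costs. Each sweep
    assigns every patch the class minimizing its unary cost plus a Potts
    penalty for disagreeing with its 4-neighbors."""
    labels = unary.argmin(axis=-1)
    H, W, K = unary.shape
    for _ in range(n_iters):
        for i in range(H):
            for j in range(W):
                costs = unary[i, j].copy()
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        costs += beta * (np.arange(K) != labels[ni, nj])
                labels[i, j] = costs.argmin()
    return labels
```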

16.
Data charts can effectively compress large amounts of complex information and convey it efficiently and succinctly. It is now easy to create data charts with a variety of automated software systems, and these charts are routinely inserted in text documents and widely disseminated over many different media. This study addresses the problem of assessing the goodness of data charts in mixed-mode documents. The quality of a graphic can be used to assist the document development process as well as to serve as an additional criterion for search engines such as Google and Yahoo. The quality measures are motivated by principles of visual learning, are based on research in educational psychology and cognitive theories, and use attributes of both the graphic and its textual context. We have implemented the approach and evaluated its effectiveness on a set of documents compiled from the Web. Results of a human study show that the proposed quality measures correlate highly with users' quality ratings for each of the five classes of data charts studied in this research.

17.
Most of the research on text categorization has focused on classifying text documents into a set of categories with no structural relationships among them (flat classification). However, in many information repositories documents are organized in a hierarchy of categories to support a thematic search by browsing topics of interest. The consideration of the hierarchical relationship among categories opens several additional issues in the development of methods for automated document classification. Questions concern the representation of documents, the learning process, the classification process and the evaluation criteria of experimental results. They are systematically investigated in this paper, whose main contribution is a general hierarchical text categorization framework where the hierarchy of categories is involved in all phases of automated document classification, namely feature selection, learning and classification of a new document. An automated threshold determination method for classification scores is embedded in the proposed framework. It can be applied to any classifier that returns a degree of membership of a document to a category. In this work three learning methods are considered for the construction of document classifiers, namely centroid-based, naïve Bayes and SVM. The proposed framework has been implemented in the system WebClassIII and has been tested on three datasets (Yahoo, DMOZ, RCV1) which present a variety of situations in terms of hierarchical structure. Experimental results are reported and several conclusions are drawn on the comparison of the flat vs. the hierarchical approach as well as on the comparison of different hierarchical classifiers. The paper concludes with a review of related work and a discussion of previous findings vs. our findings.
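A skeletal version of top-down classification with per-node thresholds; the Node structure and scorers below are invented for illustration and do not reproduce WebClassIII's design:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    name: str
    score: Callable[[str], float] = lambda doc: 0.0  # learned membership scorer
    threshold: float = 0.5                           # auto-determined cut-off
    children: List["Node"] = field(default_factory=list)

def classify(doc: str, root: Node) -> List[str]:
    """Descend the category tree, keeping only children whose membership
    score clears their threshold; a document may land in several branches."""
    labels = []
    for child in root.children:
        if child.score(doc) >= child.threshold:
            labels.append(child.name)
            labels += classify(doc, child)
    return labels
```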

18.

Datasets of documents in Arabic are urgently needed to promote computer vision and natural language processing research that addresses the specifics of the language. Unfortunately, publicly available Arabic datasets are limited in size and restricted to certain document domains. This paper presents the release of BE-Arabic-9K, a dataset of more than 9000 high-quality scanned images from over 700 Arabic books. Among these, 1500 images have been manually segmented into regions and labeled by their functionality. BE-Arabic-9K includes book pages with a wide variety of complex layouts and page contents, making it suitable for various document layout analysis and text recognition research tasks. The paper also presents a page layout segmentation and text extraction baseline model based on a fine-tuned Faster R-CNN structure (FFRA). This baseline model yields cross-validation results with an average accuracy of 99.4% and F1 score of 99.1% for text versus non-text block classification on the 1500 annotated images of BE-Arabic-9K. These results are remarkably better than those of the state-of-the-art Arabic book-page segmentation system ECDP. FFRA also outperforms three other prior systems when tested on a competition benchmark dataset, making it a strong baseline model for future work to challenge.
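A fine-tuning setup in the spirit of the FFRA baseline can be written with torchvision's detection API (torchvision >= 0.13 assumed; the class count here is illustrative):

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_page_detector(num_classes=3):   # background + text + non-text
    """Load a COCO-pretrained Faster R-CNN and replace its box predictor so
    the detection head can be fine-tuned on page-region annotations."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model
```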


19.
In any image segmentation problem there exist uncertainties, arising from gray-level and spatial ambiguities in an image. As a result, accurately segmenting text regions from non-text regions (graphics/images) in mixed and complex documents is a fairly difficult problem. In this paper, we propose a novel text region segmentation method based on the digital shearlet transform (DST). The method is capable of handling the uncertainties arising in the segmentation process. To capture the anisotropic features of the text regions, the proposed method uses the DST coefficients as input features to a segmentation process block. This block is designed using the neutrosophic set (NS) to manage the uncertainty in the process. The proposed method is verified extensively in experiments and its performance is compared with that of several state-of-the-art techniques, both quantitatively and qualitatively, on a benchmark dataset.

20.
A bottom-up approach to segmenting a scanned document into background, text, and image regions is considered. In a first step, the image is partitioned into blocks and a set of texture features is computed for each block; the block type is determined on the basis of these features. Different variants of block arrangement and size, 26 texture variables, and four block-type classification algorithms are considered. In a second step, block types are corrected on the basis of adjacent-region analysis. The error matrix and the ICDAR 2007 criterion are used to evaluate the results.
