Similar literature
 20 similar documents found; search took 31 ms
1.
2.
This article addresses writer identification of handwritten Arabic text. Several types of structural and statistical features were extracted from Arabic handwritten text. A novel approach was used to extract structural features that build on some of the main characteristics of the Arabic language. Connected component features for Arabic handwritten text as well as gradient distribution features, windowed gradient distribution features, contour chain code distribution features, and windowed contour chain code distribution features were extracted. A nearest neighbor (NN) classifier was used with the Euclidean distance measure. Data reduction algorithms (viz. principal component analysis [PCA], linear discriminant analysis [LDA], multiple discriminant analysis [MDA], multidimensional scaling [MDS], and forward/backward feature selection algorithm) were used. A database of 500 paragraphs handwritten in Arabic by 250 writers was used. The paragraphs were randomly generated from a large corpus. NN provided the best accuracy in text-independent writer identification, with a top-1 result of 88.0%, a top-5 result of 96.0%, and a top-10 result of 98.5% for the first 100 writers. When the work was extended to all 250 writers with the backward feature selection algorithm (using 54 out of 83 features), the system attained a top-1 result of 75.0%, a top-5 result of 91.8%, and a top-10 result of 95.4%.
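
The nearest neighbor ranking and the top-1/top-5/top-10 scores reported in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`rank_writers`, `top_k_accuracy`), the toy two-dimensional feature vectors, and the choice to rank each writer by their closest reference sample are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_writers(query, references):
    """Rank writer ids by the distance of their closest reference sample.

    references: list of (writer_id, feature_vector) pairs (hypothetical layout).
    """
    best = {}
    for writer, vector in references:
        d = euclidean(query, vector)
        if writer not in best or d < best[writer]:
            best[writer] = d
    # Sort writer ids by their smallest distance to the query.
    return sorted(best, key=best.get)


def top_k_accuracy(queries, references, k):
    """Fraction of queries whose true writer appears among the k nearest writers."""
    hits = sum(
        1
        for true_writer, vector in queries
        if true_writer in rank_writers(vector, references)[:k]
    )
    return hits / len(queries)
```

With such a ranking, a top-5 score counts a query as correct whenever the true writer is among the five closest writers, which is why top-5 and top-10 figures are never lower than top-1.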

3.
Although much research has been conducted on unconstrained handwriting recognition, an effective solution remains a serious challenge. In this article, we address two issues related to Arabic handwriting recognition. First, we present IESK-arDB, a new multi-purpose off-line Arabic handwritten database. It is publicly available and contains more than 4,000 word images, each accompanied by a binary version, a thinned version, and ground truth information stored in a separate XML file. Additionally, it contains around 6,000 character images segmented from the database. A letter frequency analysis showed that the database exhibits letter frequencies similar to those of large corpora of digital text, which supports the usefulness of the database. Second, we propose a multi-phase segmentation approach that starts by detecting and resolving sub-word overlaps, then hypothesizes a large number of segmentation points that are later reduced by a set of heuristic rules. The proposed approach has been successfully tested on IESK-arDB. The results were very promising, indicating the efficiency of the suggested approach.

4.
Most research on the automatic assessment of free-text answers written by students addresses the English language. This paper handles the assessment task in Arabic. The research focuses on applying multiple similarity measures separately and in combination. Many aspects are introduced that depend on translation to overcome the lack of text processing resources in Arabic, such as extracting model answers automatically from an already built database and applying K-means clustering to scale the obtained similarity values. Additionally, this research presents the first benchmark Arabic data set, which contains 610 students' short answers together with their English translations.

5.
Traditionally, a corpus is a large structured set of text, electronically stored and processed. Corpora have become very important in the study of languages. They have opened new areas of linguistic research, which were unknown until recently. Corpora are also key to the development of optical character recognition (OCR) applications. Access to a corpus of both language and images is essential during OCR development, particularly while training and testing a recognition application. Excellent corpora have been developed for Latin-based languages, but few relate to the Arabic language. This limits the penetration of both corpus linguistics and OCR in Arabic-speaking countries. This paper describes the construction of a multi-modal Arabic corpus (MMAC) that is suitable for use in both OCR development and linguistics, and provides a comprehensive study and analysis of it. MMAC currently contains six million Arabic words and, unlike previous corpora, also includes connected segments or pieces of Arabic words (PAWs) as well as naked pieces of Arabic words (NPAWs) and naked words (NWords), that is, PAWs and Words without diacritical marks. Multi-modal data is generated from both text, gathered from a wide variety of sources, and images of existing documents. Text-based data is complemented by a set of artificially generated images showing each of the Words, NWords, PAWs and NPAWs involved. Applications are provided to apply natural-looking degradation to the generated images. A ground truth annotation is offered for each such image, while natural images showing small paragraphs and full pages are augmented with representations of the text they depict. A statistical analysis and verification of the dataset has been carried out and is presented. MMAC was also tested using commercial OCR software and is publicly and freely available.

6.
This paper presents a new high-accuracy technique for recognizing both typewritten and handwritten English and Arabic texts without thinning. After segmenting the text into lines (horizontal segmentation) and the lines into words, it separates each word into its letters. Separating a text line (row) into words and a word into letters is performed using the region growing technique (implicit segmentation) on the basis of three essential lines in a text row. This saves time, as there is no need to skeletonize or to physically isolate letters from the tested word, and the input data involves only the basic information: the scanned text. The baseline is detected, the word contour is defined, and the word is implicitly segmented into its letters according to a novel algorithm described in the paper. The extracted letter with its dots is used as one unit in the recognition system. It is resized into a 9 × 9 matrix by bilinear interpolation after a lowpass filter is applied to reduce aliasing. Then the elements are scaled to the interval [0,1]. The resulting array is the input to the designed neural network. For typewritten texts, three Arabic fonts are used: Arial, Arabic Transparent and Simplified Arabic. The results showed an average recognition success rate of 93% for Arabic typewriting. This segmentation approach has also been applied to handwritten text, where words are classified with a relatively high recognition rate for both Arabic and English. The experiments were performed in MATLAB and have shown promising results that can be a good base for further analysis of Arabic and other cursive-language text recognition as well as English handwritten texts. For English handwritten classification, a success rate of about 80% on average was achieved, while for Arabic handwritten text the algorithm was successful in about 90% of cases. The recent results have shown increasing success for both Arabic and English texts.
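
The letter normalization step described in this abstract (resize to a 9 × 9 matrix by bilinear interpolation, then scale to [0,1]) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's MATLAB code: the function names are hypothetical, the lowpass filter is omitted, corner-aligned sampling is assumed, and min-max scaling is one plausible reading of "scaled to the interval [0,1]".

```python
def resize_bilinear(img, out_h=9, out_w=9):
    """Resize a 2D list of pixel values with bilinear interpolation.

    Assumes corner-aligned sampling: output corners map to input corners.
    """
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        # Fractional source row for this output row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            # Fractional source column for this output column.
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def scale_unit(img):
    """Min-max scale all values to [0, 1]; a constant image maps to zeros."""
    lo = min(min(r) for r in img)
    hi = max(max(r) for r in img)
    if hi == lo:
        return [[0.0 for _ in r] for r in img]
    return [[(v - lo) / (hi - lo) for v in r] for r in img]
```

Calling `scale_unit(resize_bilinear(letter))` on a grayscale letter image yields a fixed-size 81-element array, which is what allows letters of different sizes to share a single network input layer.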

7.
8.
The retrieval of information from scanned handwritten documents is becoming vital with the rapid increase of digitized documents, and word spotting systems have been developed to search for words within documents. These systems can be either template matching algorithms or learning based. This paper presents a coherent learning-based Arabic handwritten word spotting system that can adapt to the nature of Arabic handwriting, which can have no clear boundaries between words. Consequently, the system recognizes Pieces of Arabic Words (PAWs), then reconstructs and spots words using language models. The proposed system produced promising results for Arabic handwritten word spotting when tested on the CENPARMI Arabic documents database.

9.
Algorithms to process off-line Arabic handwriting before recognition are presented. First, an algorithm that converts smoothed and thinned images into straight line approximations is described. Second, an algorithm is developed to obtain a 1D representation of off-line Arabic handwriting. This is achieved by first finding the start-end pair of vertices of writing. Then a stroke is traversed from the start to the end vertex by solving the Chinese postman's problem for its graph. Special rules are applied to enforce temporal information on the stroke to obtain the most likely traversal that is consistent with Arabic handwriting. Finally, an algorithm is suggested to reduce straight line approximations to other approximations in which loops are represented by vertices with features. In testing, 2256 unconstrained handwritten strokes, written by six writers, were used. In 96.5% of the samples, the algorithm restored the actual temporal information.

10.
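The unstructured nature of handwritten documents makes them difficult to edit. The text line is a salient structure in handwritten documents, and its reliable extraction is very important both for higher-level document structuring (separating graphics from text, extracting paragraph structure, extracting characters) and for document editing. Current approaches to structuring handwritten documents fall into two categories: online and offline. This work uses an online algorithm to extract text lines and then discusses how text line extraction affects gesture design.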

11.
12.
13.
Font recognition is useful for improving the accuracy and speed of optical text recognition systems and for restoring documents' original formats. This paper addresses a need for Arabic font recognition research by introducing an Arabic font recognition database consisting of 40 fonts, 10 sizes (ranging from 8 to 24 points) and 4 styles (viz. normal, bold, italic, and bold-italic). The database is split into three sets (viz. training, validation, and testing). The database is freely available to researchers. Moreover, we introduce a baseline font recognition system for benchmarking purposes, and report identification rates on our KAFD database and the Arabic Printed Text Image (APTI) database with 20 and 10 fonts, respectively. The best recognition rates are achieved using log-Gabor filters.

14.
In this paper, we present a methodology for segmenting handwritten documents into their distinct entities, namely, text lines and words. Text line segmentation is achieved by applying the Hough transform to a subset of the document image's connected components. A post-processing step includes the correction of possible false alarms, the detection of text lines that the Hough transform failed to create and, finally, the efficient separation of vertically connected characters using a novel method based on skeletonization. Word segmentation is addressed as a two-class problem. The distances between adjacent overlapped components in a text line are calculated using a combination of two distance metrics, and each distance is categorized as either an inter-word or an intra-word distance in a Gaussian mixture modeling framework. The performance of the proposed methodology is assessed with a consistent and concrete evaluation methodology that uses suitable performance measures to compare the text line and word segmentation results against the corresponding ground truth annotation. The efficiency of the proposed methodology is demonstrated by experiments conducted on two different datasets: (a) the test set of the ICDAR2007 handwriting segmentation competition and (b) a set of historical handwritten documents.

15.
In this paper, a structural method of recognising Arabic handwritten characters is proposed. The major problem in cursive text recognition is the segmentation into characters or into representative strokes. When we segment the cursive portions of words, we take into account the contextual properties of the Arabic grammar and the junction segments connecting the characters to each other along the writing line. The problem of overlapping characters is resolved with a contour-following algorithm associated with the labelling of the detected contours. In the recognition phase, the characters are gathered into ten families of candidate characters with similar shapes. Then a heterarchical analysis follows that checks the pattern via goal-directed feedback control.

16.
This paper presents a new Bayesian-based method of unconstrained handwritten offline Chinese text line recognition. In this method, a sample of a real character or non-character in realistic handwritten text lines is jointly recognized by a traditional isolated character recognizer and a character verifier, which requires just a moderate number of handwritten text lines for training. To improve its ability to distinguish between real characters and non-characters, the isolated character recognizer is negatively trained using a linear discriminant analysis (LDA)-based strategy, which employs the outputs of a traditional MQDF classifier and the LDA transform to re-compute the posterior probability of isolated character recognition. In tests with 383 text lines from the HIT-MW database, the proposed method achieved character-level recognition rates of 71.37% without any language model and 80.15% with a bi-gram language model. These promising results show the effectiveness of the proposed method for unconstrained handwritten offline Chinese text line recognition.

17.
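Handwritten Arabic words are written cursively, and many of them look alike. To address these characteristics, this paper proposes a new offline handwritten text recognition algorithm. The algorithm decomposes Arabic words into fixed components and builds a weighted Bayesian inference model from component features to word classes. It combines word component segmentation, multi-level hybrid component recognition, and estimation of component weighting coefficients to compute the posterior probability of each word class and obtain the word recognition result. Experiments on the IFN/ENIT database yielded a word recognition rate of 90.03%, confirming that component decomposition is robust to connected strokes and that component recognition improves discrimination between similar words. In addition, the algorithm requires few training classes and extends readily to large-vocabulary recognition.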

18.
Separating text lines in unconstrained handwritten documents remains a challenge because the handwritten text lines are often non-uniformly skewed and curved, and the space between lines is not obvious. In this paper, we propose a novel text line segmentation algorithm based on minimal spanning tree (MST) clustering with distance metric learning. Given a distance metric, the connected components (CCs) of the document image are grouped into a tree structure, from which text lines are extracted by dynamically cutting the edges using a new hypervolume reduction criterion and a straightness measure. By learning the distance metric in a supervised manner on a dataset of pairs of CCs, the proposed algorithm is made robust enough to handle various documents with multi-skewed and curved text lines. In experiments on a database of 803 unconstrained handwritten Chinese document images containing a total of 8,169 lines, the proposed algorithm achieved a line detection correct rate of 98.02% and compared favorably with other competitive algorithms.
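
The general idea of MST clustering over connected components can be sketched as below. This is a simplified stand-in, not the algorithm from the abstract: the learned distance metric is replaced by a hand-set weighted Euclidean distance (`weighted_distance`, with an assumed vertical weight `wy`), and the hypervolume reduction criterion and straightness measure are replaced by a plain edge-length threshold (`max_edge`). All names and the toy centroids are hypothetical.

```python
import math
from itertools import combinations


def weighted_distance(p, q, wy=3.0):
    """Distance between CC centroids; vertical offsets are penalised by wy."""
    return math.hypot(p[0] - q[0], wy * (p[1] - q[1]))


def _find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def minimum_spanning_tree(points, dist=weighted_distance):
    """Kruskal's algorithm on the complete graph; returns (weight, i, j) edges."""
    parent = list(range(len(points)))
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    tree = []
    for w, i, j in edges:
        ri, rj = _find(parent, i), _find(parent, j)
        if ri != rj:
            parent[ri] = rj
            tree.append((w, i, j))
    return tree


def cut_into_lines(points, max_edge, dist=weighted_distance):
    """Drop MST edges longer than max_edge; each remaining subtree is one line."""
    parent = list(range(len(points)))
    for w, i, j in minimum_spanning_tree(points, dist):
        if w <= max_edge:
            parent[_find(parent, i)] = _find(parent, j)
    groups = {}
    for idx in range(len(points)):
        groups.setdefault(_find(parent, idx), []).append(idx)
    return sorted(groups.values())
```

A fixed threshold like this fails on documents where line spacing varies, which is the kind of case a learned metric and an adaptive cutting criterion are meant to handle.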

19.
20.