Similar literature
 20 similar documents found.
1.
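To address the difficulty that Tibetan printed matter and Tibetan image content cannot be monitored automatically in Tibetan public-opinion analysis, this article analyzes in depth the features of printed Tibetan characters and the characteristics of Tibetan text, and on that basis proposes an implementation method for a content monitoring system that supports multi-font printed Tibetan. It focuses on feature extraction for Tibetan characters, the classification algorithm, and the method for monitoring Tibetan text content.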

2.
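Character sorting generally has to follow dictionary order, and each character taking part in the sort must be assigned a specific collation code. Tibetan characters have two encoding schemes: dynamic composition and static (precomposed) composition. For Tibetan combined characters encoded by dynamic composition, sorting solely by the constituent letters gives results that differ considerably from dictionary order. The paper analyzes the dictionary order of Tibetan characters, summarizes its rules, and proposes using the UNICODE codes in Tibetan Character Set Extension A as collation codes for sorting Tibetan combined characters, so that their order conforms to Tibetan dictionary order.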

3.
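In Tibetan information processing, sorting Tibetan characters in dictionary order is an important problem, and the key to it is determining Tibetan syllables accurately. The key to determining syllables is identifying combined characters, and the bottleneck in identifying combined characters is deciding whether a Tibetan character code is spacing or non-spacing. Once an application makes this decision reliably, the combined characters can be found. With the combined characters identified, the constraints of Tibetan orthography make it possible to determine syllables and split them into their components, which lays the foundation for dictionary-order sorting of Tibetan characters.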
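The spacing versus non-spacing decision described in this abstract can be illustrated with a minimal Python sketch. It assumes dynamically composed Unicode text and uses the Unicode general category (`Mn`, `Me`) as a stand-in for the paper's own test, which the abstract does not specify; the function names `split_syllables` and `split_stacks` are hypothetical.

```python
import unicodedata

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark, used here as the syllable delimiter


def is_non_spacing(ch):
    """Assumption: treat Unicode non-spacing/enclosing marks as 'non-spacing'."""
    return unicodedata.category(ch) in ("Mn", "Me")


def split_syllables(text):
    """Split text into syllables at the tsheg, dropping empty pieces."""
    return [s for s in text.split(TSHEG) if s]


def split_stacks(syllable):
    """Attach each non-spacing character to the preceding spacing character."""
    stacks = []
    for ch in syllable:
        if stacks and is_non_spacing(ch):
            stacks[-1] += ch   # subjoined letter or vowel sign joins the stack
        else:
            stacks.append(ch)  # a spacing character starts a new stack
    return stacks
```

Each returned stack is one combined character, so a later step could apply orthographic rules to the list of stacks rather than to raw code points.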

4.
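A mathematical model and algorithm for sorting written Tibetan   Total citations: 11, self-citations: 0, citations by others: 11
Jiang Di (江荻), Kang Caijun (康才畯). Chinese Journal of Computers (《计算机学报》), 2004, 27(4): 524-529
Based on the Chinese national standard and the ISO Tibetan coded character set, the paper proposes that sorting written Tibetan words involves the concepts of Tibetan character structure order, construction level, and character order, and that it is a distinctive kind of ordering unlike that of Chinese or English. The article analyzes in detail the glyph shapes, structural forms, and traditional character order of Tibetan characters, as well as features such as character length and stack height, and builds a mathematical model of Tibetan sorting. Following the model, each class of Tibetan symbol is assigned a numeric value; an algorithm determines character positions step by step and identifies the characters; finally, words are sorted by the combination of the values of the extracted characters. The model has been implemented on the Windows platform.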

5.
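Distinguishing similar character stacks in Tibetan recognition   Total citations: 7, self-citations: 0, citations by others: 7
The large number of similar character stacks (字丁) is a major difficulty in Tibetan recognition. By studying the types of similar stacks and statistically analyzing the recognition results for printed Tibetan, this paper concludes that the analysis of graphical structure agrees with the recognition results. This shows that, to distinguish similar stacks in Tibetan recognition, character normalization and feature selection must be handled specially according to the structural characteristics of Tibetan stacks.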

6.
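To obtain Tibetan character feature vectors with low dimensionality, small storage, fast computation, and a strong ability to distinguish similar characters, a feature extraction method for offline handwritten Tibetan characters based on polar-coordinate projection transformation is proposed, building on the image projection method. The offline handwritten Tibetan character image is preprocessed into a binary image of uniform size and position, and the pole of the binary image is located. The polar coordinates of all points with value 1 are computed and then projected to obtain a projection vector, which serves as the feature vector of the character. In an experiment with a KNN classifier on 30,000 offline handwritten Tibetan characters, using 80% of the samples for training and 20% for testing, the recognition rate reached 96.32%. The results indicate that the method is effective, computationally simple, and achieves good recognition performance.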
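A polar-coordinate projection feature of the kind this abstract describes might look like the sketch below. The abstract does not say how the pole is located or how the projection is binned, so this version assumes the pole is the centroid of the foreground pixels and projects onto a fixed number of angle and radius bins; `polar_projection`, `n_angle_bins`, and `n_radius_bins` are hypothetical names, and the KNN stage is omitted.

```python
import math


def polar_projection(image, n_angle_bins=8, n_radius_bins=4):
    """Return normalized angle and radius histograms of the 1-valued pixels.

    `image` is a list of rows of 0/1 values. The pole is ASSUMED to be the
    centroid of the foreground pixels; the paper may locate it differently.
    """
    points = [(r, c) for r, row in enumerate(image)
              for c, v in enumerate(row) if v == 1]
    if not points:
        return [0.0] * (n_angle_bins + n_radius_bins)

    n = len(points)
    cy = sum(r for r, _ in points) / n  # pole row
    cx = sum(c for _, c in points) / n  # pole column
    max_radius = max(math.hypot(r - cy, c - cx) for r, c in points) or 1.0

    angle_hist = [0] * n_angle_bins
    radius_hist = [0] * n_radius_bins
    for r, c in points:
        dy, dx = r - cy, c - cx
        theta = math.atan2(dy, dx) % (2 * math.pi)  # angle mapped to [0, 2*pi]
        rho = math.hypot(dy, dx)                    # distance from the pole
        a = min(int(theta / (2 * math.pi) * n_angle_bins), n_angle_bins - 1)
        b = min(int(rho / max_radius * n_radius_bins), n_radius_bins - 1)
        angle_hist[a] += 1
        radius_hist[b] += 1

    # Normalize so the feature does not depend on stroke thickness
    return [count / n for count in angle_hist + radius_hist]
```

With the default bin counts the feature has 12 dimensions regardless of image size, which matches the abstract's goal of a short vector that is cheap to store and compare.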

7.
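Sorting Tibetan characters by introducing collation codes   Total citations: 1, self-citations: 0, citations by others: 1
Character sorting generally has to follow dictionary order, and each character taking part in the sort must be assigned a specific collation code. Tibetan characters have two encoding schemes: dynamic composition and static (precomposed) composition. For Tibetan combined characters encoded by dynamic composition, sorting solely by the constituent letters gives results that differ considerably from dictionary order. The paper analyzes the dictionary order of Tibetan characters, summarizes its rules, and proposes using the UNICODE codes in Tibetan Character Set Extension A as collation codes for sorting Tibetan combined characters, so that their order conforms to Tibetan dictionary order.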
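The difference between sorting by constituent letters and sorting by a per-stack collation code can be shown with a toy Python sketch. The codes in `SORT_CODE` are hypothetical small integers chosen only for illustration; they are not the Extension A code points the paper proposes, which the abstract does not list.

```python
# Hypothetical collation codes, one per combined character (stack).
# These are NOT real code points from any Tibetan character set standard.
SORT_CODE = {
    "\u0f40": 1,        # KA
    "\u0f62\u0f90": 2,  # RA with subjoined KA; its root letter is KA
    "\u0f41": 3,        # KHA
}


def sort_key(stacks):
    """Map a word, given as a list of stacks, to its sequence of codes."""
    return [SORT_CODE[s] for s in stacks]


words = [["\u0f41"], ["\u0f62\u0f90"], ["\u0f40"]]

# Sorting by raw code points puts the RA-headed stack after KHA,
# because U+0F62 is greater than U+0F41.
by_code_point = sorted(words, key=lambda w: "".join(w))

# Sorting by collation code keeps the stack next to its root letter KA.
by_sort_code = sorted(words, key=sort_key)
```

The design point is that the key is assigned to the whole stack rather than derived from its first letter, so stacks sharing a root letter stay together.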

8.
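Rules and implementation for mutual conversion between Tibetan Tongyuan (同元) encoding and the basic set   Total citations: 1, self-citations: 1, citations by others: 0
In computer information processing today, the same character being encoded differently on different text-processing platforms, that is, incompatibility in text processing, is an important problem that urgently needs solving. Encoding conversion is likewise a research focus in Tibetan information processing. Most Tibetan texts and websites use the Tongyuan encoding, whereas Microsoft's Vista operating system uses the basic-set encoding, so conversion between the two is very important in this field. The paper presents techniques for converting between Tibetan Tongyuan encoding and the basic set. It splits Tibetan text according to its Latin transliteration, uses the number of layers as a bridge between the character structure of the Tongyuan encoding and that of the basic-set encoding, and applies a series of rules to convert in both directions.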

9.
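Preprocessing for Tibetan recognition   Total citations: 9, self-citations: 2, citations by others: 7
Preprocessing is an important part of any character recognition system, and its quality directly affects the performance of the whole system. Based on the glyph shapes and writing conventions of Tibetan, a preprocessing technique suited to Tibetan recognition is implemented. The whole process comprises binarization, layout analysis, skew correction, character segmentation, and normalization. Some basic features of the character stacks (字丁) are also extracted during preprocessing. These features reflect the characteristics of Tibetan well, are stable, and can be used for coarse classification and post-processing in the recognition system.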

10.
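On the ordering properties of Tibetan and sorting methods   Total citations: 17, self-citations: 10, citations by others: 7
To solve the Tibetan sorting problem, this paper proposes the concepts of construction order and character order for Tibetan, and on that basis proposes a computer solution for Tibetan dictionary order. The article analyzes the various Tibetan constructions and characters, assigns values to them, and gives a technical flowchart for computer sorting of Tibetan.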

11.
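Rejection of mis-segmented characters in document recognition   Total citations: 4, self-citations: 1, citations by others: 4
In automatic document recognition, a character segmentation algorithm that relies only on metric information such as size and position easily produces mis-segmented image blocks; accurate segmentation needs feedback from the character classifier. A new rejection algorithm is therefore proposed, with the goal of rejecting invalid characters as accurately as possible. The paper analyzes the confidence and generalized confidence of distance-based classifiers, improves the commonly used generalized-confidence mapping function on that basis, and designs a rejection rule learned from samples, which makes the rejection algorithm more adaptable. Experiments on Chinese, Japanese, and Korean document samples show that the algorithm clearly improves system performance and is broadly applicable to recognizing lower-quality printed text.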
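Rejection with a distance-based classifier can be sketched in Python as follows. The abstract does not give its confidence mapping function or its learned rule, so this example substitutes a common simple measure, one minus the ratio of the nearest to the second-nearest prototype distance, with a fixed threshold; `classify_with_rejection`, `prototypes`, and `threshold` are hypothetical names.

```python
import math


def classify_with_rejection(sample, prototypes, threshold=0.2):
    """Nearest-prototype classification that rejects ambiguous samples.

    `prototypes` maps a label to a feature vector and must hold at least two
    entries. Returns (label or None, confidence). The confidence formula is an
    ASSUMED stand-in, not the mapping function from the paper.
    """
    ranked = sorted((math.dist(sample, vec), label)
                    for label, vec in prototypes.items())
    d1, best_label = ranked[0]  # nearest prototype
    d2 = ranked[1][0]           # second-nearest distance
    confidence = 1.0 - d1 / d2 if d2 > 0 else 0.0
    if confidence < threshold:
        return None, confidence  # reject: likely a mis-segmented block
    return best_label, confidence


prototypes = {"A": (0.0, 0.0), "B": (10.0, 0.0)}
accepted = classify_with_rejection((1.0, 0.0), prototypes)
rejected = classify_with_rejection((5.0, 0.0), prototypes)
```

A block that lies about equally far from its two best candidates gets a confidence near zero, which is the signal a segmenter could use to try a different cut.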

12.
Optical character recognition (OCR) refers to a process whereby printed documents are transformed into ASCII files for the purpose of compact storage, editing, fast retrieval, and other file manipulations through the use of a computer. The recognition stage of an OCR process is made difficult by added noise, image distortion, and the various character typefaces, sizes, and fonts that a document may have. In this study a neural network approach is introduced to perform high accuracy recognition on multi-size and multi-font characters; a novel centroid-dithering training process with a low noise-sensitivity normalization procedure is used to achieve high accuracy results. The study consists of two parts. The first part focuses on single size and single font characters, and a two-layered neural network is trained to recognize the full set of 94 ASCII character images in 12-pt Courier font. The second part trades accuracy for additional font and size capability, and a larger two-layered neural network is trained to recognize the full set of 94 ASCII character images for all point sizes from 8 to 32 and for 12 commonly used fonts. The performance of these two networks is evaluated based on a database of more than one million character images from the testing data set.

13.
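Extracting characters from complex color text images   Total citations: 4, self-citations: 1, citations by others: 4
Extracting and recognizing characters in complex color text images has become a problem that is both difficult and interesting. This paper presents a region-growing algorithm for color image segmentation: the color run-length adjacency graph algorithm (CRAG). Applied to a color text image, the algorithm first obtains the color connected components of the image, then clusters the average colors of these components to obtain several cluster centers, then divides the image into color layers according to the color centers, and finally identifies the required text layer through connected-component analysis. The algorithm modifies and extends the traditional BAG algorithm and applies it to color printed text images, making full use of both color and position information. Experimental results show that the new method extracts many common kinds of decorative lettering from color printed images well and quickly, while preserving the original colors of the text and the background for later image restoration.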

14.
This paper presents a new method for detecting and recognizing text in complex images and video frames. Text detection is performed in a two-step approach that combines the speed of a text localization step, enabling text size normalization, with the strength of a machine learning text verification step applied on background independent features. Text recognition, applied on the detected text lines, is addressed by a text segmentation step followed by a traditional OCR algorithm within a multi-hypotheses framework relying on multiple segments, language modeling and OCR statistics. Experiments conducted on large databases of real broadcast documents demonstrate the validity of our approach.

15.

Optical character recognition (OCR) systems help to digitize paper-based historical archives. However, poor quality of scanned documents and limitations of text recognition techniques result in different kinds of errors in OCR outputs. Post-processing is an essential step in improving the output quality of OCR systems by detecting and cleaning the errors. In this paper, we present an automatic model consisting of both error detection and error correction phases for OCR post-processing. We propose a novel approach of OCR post-processing error correction using correction pattern edits and an evolutionary algorithm, a class of method used mainly for solving optimization problems. Our model adopts a variant of the self-organizing migrating algorithm along with a fitness function based on modifications of important linguistic features. We illustrate how to construct the table of correction pattern edits, which involves all types of edit operations and is learned directly from the training dataset. Through efficient settings of the algorithm parameters, our model achieves high-quality candidate generation and error correction. The experimental results show that our proposed approach outperforms various baseline approaches as evaluated on the benchmark dataset of the ICDAR 2017 Post-OCR text correction competition.

16.
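Character preprocessing and matching for handwriting identification   Total citations: 1, self-citations: 0, citations by others: 1
Handwriting identification mostly uses matching methods to compare the writing style of characters, and preprocessing and normalization of the character images are very important for matching. This paper presents character image preprocessing for handwriting identification and a shape matching method. The preprocessing part covers noise removal and normalization of binary images. Noise is removed by smoothing, contour tracing, and filling. To preserve the writing characteristics of the characters, the normalization of the bitmap is linear, but determining the position and scale of the character is very important. The paper gives three normalization methods: four-side bounding, centroid alignment, and single-side bounding, and on that basis carries out writer identification experiments using image matching. The matching is implemented quickly through a distance transform. The experimental results show that centroid-alignment normalization is the most suitable for handwriting identification, and that the recognition rate obtained by distance-transform matching is also fairly satisfactory.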
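Centroid alignment, one of the three normalization methods named in this abstract, can be sketched in Python as a pure translation. This is a simplified illustration under stated assumptions: it fixes only the position, leaves the scale step the abstract mentions aside, and rounds the shift to whole pixels; `align_to_centroid` is a hypothetical name.

```python
def align_to_centroid(image):
    """Shift a binary image so its foreground centroid sits at the image center.

    `image` is a list of rows of 0/1 values. Pixels shifted outside the frame
    are dropped. Scale normalization is NOT handled in this sketch.
    """
    height, width = len(image), len(image[0])
    points = [(r, c) for r, row in enumerate(image)
              for c, v in enumerate(row) if v]
    if not points:
        return [row[:] for row in image]  # nothing to align

    n = len(points)
    cy = sum(r for r, _ in points) / n
    cx = sum(c for _, c in points) / n
    # Whole-pixel shift that moves the centroid to the geometric center
    dr = round((height - 1) / 2 - cy)
    dc = round((width - 1) / 2 - cx)

    out = [[0] * width for _ in range(height)]
    for r, c in points:
        nr, nc = r + dr, c + dc
        if 0 <= nr < height and 0 <= nc < width:
            out[nr][nc] = 1
    return out
```

Because the shift depends on the whole ink distribution rather than on the outermost pixels, a stray stroke at the edge moves the result less than it would with a bounding-box method.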

17.
Current Optical Character Recognition (OCR) systems cannot detect and recognize detached words in an image, especially if the text is not horizontal. Such text blocks are typical of charts and graphs. This paper proposes an algorithm for detecting small text blocks with arbitrary orientation, color, style, and font size, which can be used for text localization before applying any character recognition system. According to the experimental results, using the proposed algorithm to determine the location and orientation of text blocks in charts and graphs, and passing this information to a text recognition system, increases recall by a factor of 20 and text recognition precision by a factor of 15. The experiments were carried out on a test collection of 1000 charts containing about 14 000 text blocks, which was created by means of the XML/SWF Chart tool.

18.
We propose an approach for the word-level indexing of modern printed documents which are difficult to recognize using current OCR engines. By means of word-level indexing, it is possible to retrieve the position of words in a document, enabling queries involving proximity of terms. Web search engines implement this kind of indexing, allowing users to retrieve Web pages on the basis of their textual content. Nowadays, digital libraries hold collections of digitized documents that can be retrieved either by browsing the document images or relying on appropriate metadata assembled by domain experts. Word indexing tools would therefore increase the access to these collections. The proposed system is designed to index homogeneous document collections by automatically adapting to different languages and font styles without relying on OCR engines for character recognition. The approach is based on three main ideas: the use of self organizing maps (SOM) to perform unsupervised character clustering, the definition of one suitable vector-based word representation whose size depends on the word aspect-ratio, and the run-time alignment of the query word with indexed words to deal with broken and touching characters. The most appropriate applications are for processing modern printed documents (17th to 19th centuries) where current OCR engines are less accurate. Our experimental analysis addresses six data sets containing documents ranging from books of the 17th century to contemporary journals.

19.
Common OCR (Optical Character Recognition) systems fail to detect and recognize small text strings of few characters, in particular when a text line is not horizontal. Such text regions are typical for chart images. In this paper we present an algorithm that is able to detect small text regions regardless of string orientation and font size or style. We propose to use this algorithm as a preprocessing step for text recognition with a common OCR engine. According to our experimental results, one can get up to 20 times better text recognition rate, and 15 times higher text recognition precision when the proposed algorithm is used to detect text location, size and orientation, before using an OCR system. Experiments have been performed on a benchmark set of 1000 chart images created with the XML/SWF Chart tool, which contain about 14000 text regions in total.

20.
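An extended internal-code system for Tibetan   Total citations: 6, self-citations: 0, citations by others: 6
The basic set and the supplementary set of the Tibetan coded character set are placed on different planes and use different encoding systems. To address the resulting problems, this paper proposes an extended internal-code system for Tibetan and gives rules and methods for composing, generating, and decomposing Tibetan characters. In-set Tibetan characters are composed through an internal-code conversion table, which enables information exchange between the basic set and the supplementary set. Out-of-set Tibetan characters that are normalized and standard are generated from a component set, which meets the need for an open Tibetan coded character set. Upward, at the repertoire level, the system is compatible with UCS; downward, it is compatible with the de facto internal-code standard of GB2312, making it a complete Tibetan encoding system. The authors suggest establishing the internal-code extension system in the alphabetic scripts area of the UCS Basic Multilingual Plane.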
