首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Word spotting has become a field of strong research interest in document image analysis over the last years. Recently, AttributeSVMs were proposed which predict a binary attribute representation (Almazán et al. in IEEE Trans Pattern Anal Mach Intell 36(12):2552–2566, 2014). At their time, this influential method defined the state of the art in segmentation-based word spotting. In this work, we present an approach for learning attribute representations with convolutional neural networks(CNNs). By taking a probabilistic perspective on training CNNs, we derive two different loss functions for binary and real-valued word string embeddings. In addition, we propose two different CNN architectures, specifically designed for word spotting. These architectures are able to be trained in an end-to-end fashion. In a number of experiments, we investigate the influence of different word string embeddings and optimization strategies. We show our attribute CNNs to achieve state-of-the-art results for segmentation-based word spotting on a large variety of data sets.  相似文献   

2.
In this paper, we propose a word spotting framework for accessing the content of historical machine-printed documents without the use of an optical character recognition engine. A preprocessing step is performed in order to improve the quality of the document images, while word segmentation is accomplished with the use of two complementary segmentation methodologies. In the proposed methodology, synthetic word images are created from keywords, and these images are compared to all the words in the digitized documents. A user feedback process is used in order to refine the search procedure. The methodology has been evaluated in early Modern Greek documents printed during the seventeenth and eighteenth century. In order to improve the efficiency of accessing and search, natural language processing techniques have been addressed that comprise a morphological generator that enables searching in documents using only a base word-form for locating all the corresponding inflected word-forms and a synonym dictionary that further facilitates access to the semantic context of documents.  相似文献   

3.
The retrieval of information from scanned handwritten documents is becoming vital with the rapid increase of digitized documents, and word spotting systems have been developed to search for words within documents. These systems can be either template matching algorithms or learning based. This paper presents a coherent learning based Arabic handwritten word spotting system which can adapt to the nature of Arabic handwriting, which can have no clear boundaries between words. Consequently, the system recognizes Pieces of Arabic Words (PAWs), then re-constructs and spots words using language models. The proposed system produced promising result for Arabic handwritten word spotting when tested on the CENPARMI Arabic documents database.  相似文献   

4.
5.
International Journal on Document Analysis and Recognition (IJDAR) - In this article, we propose a new approach to segmentation-free word spotting that is based on the combination of three...  相似文献   

6.
In this paper, a two-step segmentation-free word spotting method for historical printed documents is presented. The first step involves a minimum distance matching between a query keyword image and a document page image using keypoint correspondences. In the second step of the method, the matched keypoints on the document image serve as indicators for creating candidate image areas. The query keyword image is matched against the candidate image areas in order to properly estimate the bounding boxes of the detected word instances. The method is evaluated using two datasets of different languages and is compared against segmentation-free state-of-the-art methods. The experimental results show that the proposed method outperforms significantly the competitive approaches.  相似文献   

7.
8.
In this paper, we propose a novel technique for word spotting in historical printed documents combining synthetic data and user feedback. Our aim is to search for keywords typed by the user in a large collection of digitized printed historical documents. The proposed method consists of the following stages: (1) creation of synthetic image words; (2) word segmentation using dynamic parameters; (3) efficient feature extraction for each word image and (4) a retrieval procedure that is optimized by user feedback. Experimental results prove the efficiency of the proposed approach.  相似文献   

9.
关键词识别是近年来语音识别研究的一个热点。提出了一种新的基于分层查询表的关键词识别模型,该模型具有简单、实用、快速的特点。利用该模型实现了路况信息查询系统,取得了较高的识别率,具有一定的实用性。  相似文献   

10.
对XML文档查询的常用方法有两种:一种是使用查询语言;另一种是使用关键字,而使用关键字查询XML文档比使用查询语言更为简单方便。给出了一种使用关键字查询XML文档的索引查找算法。该算法只需要扫描一次关键字对应的编码列,就可以找到需要的编码,提高了查询效率。实验表明该算法是可行的和有效的。  相似文献   

11.
Searching and indexing historical handwritten collections are a very challenging problem. We describe an approach called word spotting which involves grouping word images into clusters of similar words by using image matching to find similarity. By annotating “interesting” clusters, an index that links words to the locations where they occur can be built automatically. Image similarities computed using a number of different techniques including dynamic time warping are compared. The word similarities are then used for clustering using both K-means and agglomerative clustering techniques. It is shown in a subset of the George Washington collection that such a word spotting technique can outperform a Hidden Markov Model word-based recognition technique in terms of word error rates. An erratum to this article can be found at  相似文献   

12.
13.
14.
中文文本布局复杂,汉字种类多,书写随意性大,因而手写汉字检测是一个很有挑战的问题。本文提出了一种无分割的手写中文文档字符检测的方法。该方法用SIFT定位文本中候选关键点,然后基于关键点位置和待查询汉字大小来确定候选字符的位置,最后用两个方向动态时间规整(Dynamic Time Warping, DTW)算法来筛选候选字符。实验结果表明,该方法能够在无需将文本分割为字符的情况下准确找到待查询的汉字,并且优于传统的基于DTW字符检测方法。  相似文献   

15.
专利检索与普通的文本检索有着极大的不同,专利文本包括权利声明、摘要、全文等不同部分,自然不能简单地将普通文本的检索方法应用到专利检索当中来。专利检索通常面临着召回率低下的问题,首先,由于专利文本具有极强的专业性,有着复杂的术语表达方式,用户输入的关键词通常无法明确捕捉到检索意图,导致检索结果不理想。其次,专利撰写时有意识地制造与众不同的词汇,导致相关专利无法被检索到。目前有很多的研究方法都旨在提高专利检索的召回率,但是仍然有许多问题有待解决,检索效果有待改善。提出了一个基于词向量的专利自动扩展查询方法,在词向量的基础上,构建一个关键词查询网络,通过稠密子图发现算法来寻找扩展词集合,提高扩展词的有效性。在CLEF-IP 2012数据集的基础上进行了充分的实验,实验结果表明,本文提出的算法能够保证扩展词集获取的灵活性和有效性,同时能进一步提高专利检索的召回率。  相似文献   

16.
17.
In keyword spotting from handwritten documents by text query, the word similarity is usually computed by combining character similarities, which are desired to approximate the logarithm of the character probabilities. In this paper, we propose to directly estimate the posterior probability (also called confidence) of candidate characters based on the N-best paths from the candidate segmentation-recognition lattice. On evaluating the candidate segmentation-recognition paths by combining multiple contexts, the scores of the N-best paths are transformed to posterior probabilities using soft-max. The parameter of soft-max (confidence parameter) is estimated from the character confusion network, which is constructed by aligning different paths using a string matching algorithm. The posterior probability of a candidate character is the summation of the probabilities of the paths that pass through the candidate character. We compare the proposed posterior probability estimation method with some reference methods including the word confidence measure and the text line recognition method. Experimental results of keyword spotting on a large database CASIA-OLHWDB of unconstrained online Chinese handwriting demonstrate the effectiveness of the proposed method.  相似文献   

18.
One of the useful tools offered by existing web search engines is query suggestion (QS), which assists users in formulating keyword queries by suggesting keywords that are unfamiliar to users, offering alternative queries that deviate from the original ones, and even correcting spelling errors. The design goal of QS is to enrich the web search experience of users and avoid the frustrating process of choosing controlled keywords to specify their special information needs, which releases their burden on creating web queries. Unfortunately, the algorithms or design methodologies of the QS module developed by Google, the most popular web search engine these days, is not made publicly available, which means that they cannot be duplicated by software developers to build the tool for specifically-design software systems for enterprise search, desktop search, or vertical search, to name a few. Keyword suggested by Yahoo! and Bing, another two well-known web search engines, however, are mostly popular currently-searched words, which might not meet the specific information needs of the users. These problems can be solved by WebQS, our proposed web QS approach, which provides the same mechanism offered by Google, Yahoo!, and Bing to support users in formulating keyword queries that improve the precision and recall of search results. WebQS relies on frequency of occurrence, keyword similarity measures, and modification patterns of queries in user query logs, which capture information on millions of searches conducted by millions of users, to suggest useful queries/query keywords during the user query construction process and achieve the design goal of QS. Experimental results show that WebQS performs as well as Yahoo! and Bing in terms of effectiveness and efficiency and is comparable to Google in terms of query suggestion time.  相似文献   

19.
用VC++语言创建WORD文档   总被引:1,自引:1,他引:0  
简单文本格式无法生成复杂的表格和图形,因而采用WORD(或EXCEL)作为专用软件的数据报表文件是很有意义的,它可以按照用户要求将数据以图表的形式形象地表示出来。为了在VC 环境中创建和生成WORD文档,必须利用COM技术,把WORD程序看做COM服务器,把用户程序看做客户端,利用COM技术创建WORD对象,利用WORD对象调用VBScript函数(宏函数)来控制WORD文档的生成。通过研究证实该技术是可行的,并且对OFFICE软件都适用。  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号