Similar Documents (20 results)
1.
Computer recognition of machine-printed letters of the Tamil alphabet is described. Each character is represented as a binary matrix and encoded into a string using two different methods. The encoded strings form a dictionary. A given text is presented symbol by symbol and information from each symbol is extracted in the form of a string and compared with the strings in the dictionary. When there is agreement the letters are recognized and printed out in Roman letters following a special method of transliteration. The lengthening of vowels and hardening of consonants are indicated by numerals printed above each letter.
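A minimal sketch of the dictionary-lookup scheme this abstract describes, in Python; the encoding function, the matrices, and the transliterations (`ka`, `ta2`) are invented placeholders, not the paper's actual codes:

```python
def encode(matrix):
    """Encode a binary character matrix as a bit string (one simple scheme)."""
    return "".join("1" if px else "0" for row in matrix for px in row)

# Hypothetical dictionary mapping encoded strings to Roman transliterations;
# a trailing numeral marks a hardened consonant, as in the paper's convention.
dictionary = {
    encode([[0, 1], [1, 1]]): "ka",
    encode([[1, 0], [1, 1]]): "ta2",
}

def recognize(symbols):
    """Look up each symbol's encoding; unrecognized symbols become '?'."""
    return [dictionary.get(encode(m), "?") for m in symbols]
```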

2.
For automatic generation of string test data, this paper discusses distances between strings and expresses the string predicate of an unsatisfied path condition as a real-valued objective function. A steepest-descent search algorithm is used to minimize the objective function, yielding a search-based, path-oriented method for automatically generating string test data. The paper also examines how test-data generation efficiency depends on the initial input and the order in which paths are processed, and compares the method with genetic algorithms and several other algorithms. Experimental results show that the method generates test data more economically and effectively.
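The idea of turning an unsatisfied string predicate into a real-valued objective and descending on it can be sketched as follows; the character-wise distance and the greedy one-character-at-a-time descent are simplifying assumptions for illustration, not the paper's exact algorithm:

```python
def str_distance(a, b):
    """Character-wise distance between equal-length strings (one simple choice)."""
    return sum(abs(ord(x) - ord(y)) for x, y in zip(a, b))

def generate(target, length, max_iters=10000):
    """Minimize the objective f(s) = distance(s, target) by greedy descent:
    nudge one character at a time whenever that reduces f."""
    s = list("a" * length)
    for _ in range(max_iters):
        f = str_distance("".join(s), target)
        if f == 0:                      # predicate satisfied
            return "".join(s)
        improved = False
        for i in range(length):
            for delta in (-1, 1):
                cand = s[:]
                cand[i] = chr(ord(cand[i]) + delta)
                if str_distance("".join(cand), target) < f:
                    s = cand
                    improved = True
                    break
            if improved:
                break
        if not improved:                # local minimum
            break
    return "".join(s)
```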

3.
SegChar: A Tool for Automatic Text/Graphics Segmentation in Engineering Drawing Images
江早, 刘积仁, 刘晋军. 《软件学报》 (Journal of Software), 1999, 10(6):589-594
This paper analyzes the technical characteristics, key steps, and basic framework of text/graphics segmentation for engineering drawing images, focusing on the techniques used in the automatic segmentation tool SegChar, including: (1) automatic character-size threshold filtering, which makes the segmentation process automatic and intelligent; and (2) detection of character strings of arbitrary orientation and length, which improves segmentation speed and reduces space complexity through strategies such as precise HOUGH-space sizing, relaxed collinearity, and string-based HOUGH-domain updating, allowing complex Chinese and Western character strings to be extracted intact. The paper concludes with a performance evaluation.

4.
In recent years, several authors have presented algorithms that locate instances of a given string, or set of strings, within a text. Less consideration has been given to the complementary problem of processing a text to find out what strings appear in it, without any preconceived notion of what strings might be present. A system called PATRICIA, developed two decades ago, implements a solution to this problem, but its design is very tightly bound to the assumptions that individual string elements are bits and that the user can provide complete lists of starting and stopping places for strings. This paper presents an approach that drops these assumptions. Our method allows different definitions of indivisible string elements for different applications, and the only information the user provides for determining the beginnings and ends of strings is a maximum length for output strings. The paper also describes a portable C implementation of the method, called PORTREP. The primary data structure of PORTREP is a trie represented as a ternary tree. PORTREP eliminates redundancy from the output and can operate with a bounded number of nodes by employing a heuristic that reuses seldom-visited nodes. Theoretical analysis and empirical studies, reported here, give confidence in the efficiency of the algorithms. PORTREP can form the basis for a variety of text-analysis applications, and the paper considers one such application, automatic document indexing.
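A sketch of PORTREP's primary data structure, a trie represented as a ternary tree, here with simple occurrence counting; the names and the counting detail are illustrative, not taken from PORTREP itself:

```python
class Node:
    __slots__ = ("ch", "lo", "eq", "hi", "count")
    def __init__(self, ch):
        self.ch, self.lo, self.eq, self.hi, self.count = ch, None, None, None, 0

def insert(root, s, i=0):
    """Insert string s into a ternary search trie, counting occurrences."""
    ch = s[i]
    if root is None:
        root = Node(ch)
    if ch < root.ch:
        root.lo = insert(root.lo, s, i)
    elif ch > root.ch:
        root.hi = insert(root.hi, s, i)
    elif i + 1 < len(s):
        root.eq = insert(root.eq, s, i + 1)
    else:
        root.count += 1
    return root

def lookup(root, s, i=0):
    """Return how many times s was inserted (0 if absent)."""
    if root is None:
        return 0
    ch = s[i]
    if ch < root.ch:
        return lookup(root.lo, s, i)
    if ch > root.ch:
        return lookup(root.hi, s, i)
    if i + 1 < len(s):
        return lookup(root.eq, s, i + 1)
    return root.count
```

Each node branches three ways (less-than, equal, greater-than), so the structure stays compact even for large element alphabets.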

5.
Pei‐Chi Wu. Software, 2000, 30(7):765-774
Character sets are one of the basic issues for information interchange. Most current national standard character sets extend 7‐bit ASCII. These extensions conflict with each other and make the design of multilingual information systems complicated. Unicode or the Universal Character Set (UCS) is a character set that covers symbols in the major written languages. Text files and strings usually have no header to indicate which character set is in use, and they currently use one of the national standards by default. The transition from national standards to Unicode may take a longer time than expected. This paper presents the following methods to help the transition. (1) A text file format of fixed‐width characters: if the first character in a text file is a nonzero control code, the file is in UCS; otherwise, it is in the default national standard. The control code indicates which UCS subset or byte order is in use. (2) A tagged string storage: each string has a tag representing which character set or coding format is in use, e.g., the default national standard, 8‐bit subset of UCS‐2, UCS‐2, or UCS‐4. (3) A method for assigning the format of string literals: all string literals use the same syntax notation, and their storage format is the same as that of their source files. These methods can improve multilingual support without introducing much complexity. Copyright © 2000 John Wiley & Sons, Ltd.
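Method (1) might look like the following; the specific control-code values are hypothetical, since the paper defines its own assignments for UCS subsets and byte orders:

```python
# Hypothetical control-code assignments (the paper defines its own):
UCS2_BE, UCS2_LE, UCS4 = 0x01, 0x02, 0x03

def detect(first_byte):
    """Classify a text file by its first character (one possible scheme).

    A nonzero control code signals UCS and names the subset/byte order;
    anything else means the file is in the default national standard.
    """
    if first_byte == UCS2_BE:
        return "UCS-2 big-endian"
    if first_byte == UCS2_LE:
        return "UCS-2 little-endian"
    if first_byte == UCS4:
        return "UCS-4"
    return "default national standard"
```

This works much like a byte-order mark: the leading code is metadata, not text.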

6.
We present a new approach based on anagram hashing to handle globally the lexical variation in large and noisy text collections. Lexical variation addressed by spelling correction systems is primarily typographical variation. This is typically handled in a local fashion: given one particular text string some system of retrieving near-neighbors is applied, where near-neighbors are other text strings that differ from the particular string by a given number of characters. The difference in characters between the original string and one of its retrieved near-neighbors constitutes a particular character confusion. We present a global way of performing this action: for all possible particular character confusions given a particular edit distance, we sequentially identify all the pairs of text strings in the text collection that display a particular confusion. We work on large digitized corpora, which contain lexical variation due to both the OCR process and typographical or typesetting error and show that all these types of variation can be handled equally well in the framework we present. The character confusion-based prototype of Text-Induced Corpus Clean-up (ticcl) is compared to its focus word-based counterpart and evaluated on 6 years’ worth of digitized Dutch Parliamentary documents. The character confusion approach is shown to gain an order of magnitude in speed on its word-based counterpart on large corpora. Insights gained about the useful contribution of global corpus variation statistics are shown to also benefit the more traditional word-based approach to spelling correction. Final tests on a held-out set comprising the 1918 edition of the Dutch daily newspaper ‘Het Volk’ show that the system is not sensitive to domain variation.
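The global character-confusion idea can be sketched with an order-independent anagram hash; the fifth-power hash is commonly associated with ticcl, but the bucketing code below is a simplification (in practice candidate pairs must still be verified, since anagram keys can collide):

```python
def anagram_key(s):
    """Order-independent hash: two strings with the same multiset of
    characters get the same key."""
    return sum(ord(c) ** 5 for c in s)

def confusion_pairs(words, out_chars, in_chars):
    """All word pairs (w, v) such that replacing out_chars by in_chars in w
    could yield v, found globally via key arithmetic rather than pairwise
    comparison. Candidates are unverified: key collisions are possible."""
    delta = anagram_key(out_chars) - anagram_key(in_chars)
    by_key = {}
    for w in words:
        by_key.setdefault(anagram_key(w), []).append(w)
    pairs = []
    for w in words:
        for v in by_key.get(anagram_key(w) - delta, []):
            pairs.append((w, v))
    return pairs
```

The point of the global formulation: one subtraction per confusion replaces a near-neighbor search around every individual word.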

7.
Let x be a nonempty string and x′ another string with the same characters as x, but possibly in a different order. A string of the form xx′ is called a permutation. A permutation-containing string is of the form wxx′y. The Interchange Lemma for context-free languages [11] is used to show that the set of permutation-containing strings over a 16 character alphabet is not context-free. The application of the lemma is important since other techniques, such as the pumping lemma and Ogden's lemma, cannot show that the set is not context-free. Finally, a collection of open problems is given.

8.
Alden H. Wright. Software, 1994, 24(4):337-362
Given a text string, a pattern string, and an integer k, the problem of approximate string matching with k differences is to find all substrings of the text string whose edit distance from the pattern string is less than k. The edit distance between two strings is defined as the minimum number of differences, where a difference can be a substitution, insertion, or deletion of a single character. An implementation of the dynamic programming algorithm for this problem is given that packs several characters and mod-4 integers into a computer word. Thus, it is a parallelization of the conventional implementation that runs on ordinary processors. Since a small alphabet means that characters have short binary codes, the degree of parallelism is greatest for small alphabets and for processors with long words. For an alphabet of size 8 or smaller and a 64 bit processor, a 21-fold parallelism over the conventional algorithm can be obtained. Empirical comparisons to the basic dynamic programming algorithm, to a version of Ukkonen's algorithm, to the algorithm of Galil and Park, and to a limited implementation of the Wu-Manber algorithm are given.
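The underlying (non-bit-parallel) dynamic program is the classic column-wise computation; this sketch reports end positions of substrings at edit distance less than k, as in the abstract, without the word-packing parallelization:

```python
def approx_match(text, pattern, k):
    """Column-wise DP for approximate matching with differences.

    Returns the 1-based end positions j of substrings of `text` whose
    edit distance to `pattern` is less than k.
    """
    m = len(pattern)
    prev = list(range(m + 1))          # column for the empty text prefix
    hits = []
    for j, tc in enumerate(text, 1):
        curr = [0] * (m + 1)           # row 0 is 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == tc else 1
            curr[i] = min(prev[i - 1] + cost,  # substitution / match
                          prev[i] + 1,         # one kind of indel
                          curr[i - 1] + 1)     # the other kind of indel
        if curr[m] < k:
            hits.append(j)
        prev = curr
    return hits
```

The bit-parallel version of the paper packs several of these small DP cells into one machine word, so each word operation advances many cells at once.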

9.
The development and implementation of an algorithm for automated text string separation that is relatively independent of changes in text font style and size and of string orientation are described. It is intended for use in an automated system for document analysis. The principal parts of the algorithm are the generation of connected components and the application of the Hough transform in order to group components into logical character strings that can then be separated from the graphics. The algorithm outputs two images, one containing text strings and the other graphics. These images can then be processed by suitable character recognition and graphics recognition systems. The performance of the algorithm, both in terms of its effectiveness and computational efficiency, was evaluated using several test images and showed superior performance compared to other techniques.

10.
11.
Traditional string-processing algorithms usually consider the frequency and the length of a string separately. In practical applications, however, it is meaningful to consider them together. Based on this observation, we propose the concept of the frequency-length product, defined as the product of a string's frequency and its length. Based on the generalized suffix tree and Ukkonen's algorithm, we present a search algorithm with O(N) time complexity. Efficiency experiments confirm that the algorithm is fast. Semantic experiments show that the maximal frequency-length-product strings found by the algorithm have clearer real-world semantics than maximal-frequency or maximal-length strings. Such strings are valuable in text compression, gene-sequence analysis, and other semantics-oriented applications.
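The frequency-length product itself is easy to state; this brute-force sketch only illustrates the measure, whereas the paper computes the maximum in O(N) using a generalized suffix tree and Ukkonen's algorithm:

```python
from collections import Counter

def max_freq_len_product(text):
    """Substring maximizing frequency * length, by exhaustive enumeration.

    Counts every (possibly overlapping) substring occurrence; quadratic in
    the number of substrings, so only suitable for short inputs.
    """
    counts = Counter(text[i:j]
                     for i in range(len(text))
                     for j in range(i + 1, len(text) + 1))
    return max(counts, key=lambda s: counts[s] * len(s))
```

For "ababab", the winner is "abab" (2 occurrences x length 4 = 8), beating both the most frequent single characters and the full string.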

12.
This paper studies the use of text signatures in string searching. Text signatures are a coded representation of a unit of text formed by hashing substrings into bit positions which are, in turn, set to one. Then instead of searching an entire line of text exhaustively, the text signature may be examined first to determine if complete processing is warranted. A hashing function which minimizes the number of collisions in a signature is described. Experimental results for two signature lengths with both a text file and a program file are given. Analyses of the results and the utility and application of the method conclude the discussion.
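A signature scheme of this kind can be sketched as follows; hashing q-grams into a 64-bit word with CRC-32 is an assumption for illustration, not the paper's hashing function:

```python
import zlib

def signature(line, bits=64, q=3):
    """Hash each q-gram of a line into a bit position of a fixed-width word."""
    sig = 0
    for i in range(len(line) - q + 1):
        sig |= 1 << (zlib.crc32(line[i:i + q].encode()) % bits)
    return sig

def may_contain(line_sig, pattern, bits=64, q=3):
    """Cheap pre-filter: if any pattern q-gram bit is absent from the line's
    signature, the line cannot contain the pattern. A True result is only a
    'maybe' (bit collisions cause false positives), so it must be confirmed
    by a full search."""
    return signature(pattern, bits, q) & ~line_sig == 0
```

Only lines passing the filter are searched exhaustively, which is where the speedup comes from.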

13.
Automatic Generation of String Test Data Based on Predicate Slicing
String predicates are very common, yet automatic generation of string test data remains an open problem. For string predicates, this paper discusses a dynamic algorithm for generating the predicate slice of a given predicate on a path Path, and a method for automatically generating string test data based on predicate slices; a definition of the distance between strings is also given. Using the program's DUC (Definition-Use-Control) expressions, the predicate slice of a predicate is constructed. For an arbitrary input, executing the predicate slice yields the current values of the variables in the predicate; branch-function minimization is then applied to each character of each variable, dynamically generating ON-OFF test points on the boundary of the given string predicate. Experiments show that the method is effective.

14.
As more information sources become available in multimedia systems, the development of abstract semantic models for video, audio, text, and image data is becoming very important. An abstract semantic model has two requirements: it should be rich enough to provide a friendly interface of multimedia presentation synchronization schedules to the users and it should be a good programming data structure for implementation in order to control multimedia playback. An abstract semantic model based on an augmented transition network (ATN) is presented. The inputs for ATNs are modeled by multimedia input strings. Multimedia input strings provide an efficient means for iconic indexing of the temporal/spatial relations of media streams and semantic objects. An ATN and its subnetworks are used to represent the appearing sequence of media streams and semantic objects. The arc label is a substring of a multimedia input string. In this design, a presentation is driven by a multimedia input string. Each subnetwork has its own multimedia input string. Database queries relative to text, image, and video can be answered via substring matching at subnetworks. Multimedia browsing allows users the flexibility to select any part of the presentation they prefer to see. This means that the ATN and its subnetworks can be included in multimedia database systems which are controlled by a database management system (DBMS). User interactions and loops are also provided in an ATN. Therefore, ATNs provide three major capabilities: multimedia presentations, temporal/spatial multimedia database searching, and multimedia browsing.

15.
Segmentation and recognition of unconstrained handwritten Chinese character strings is a difficult problem in character recognition. Targeting the characteristics of handwritten dates, this paper proposes a combined recognition method that integrates holistic word recognition with segmentation-based recognition of fixed-length character strings. Holistic recognition treats the string as a whole, avoiding a complex segmentation step. In fixed-length segmentation, the length of the string is first predicted via recognition, candidate segmentation lines are then determined by projection and contour analysis, and finally the optimal segmentation path is selected by recognition. The two methods are combined by rules, greatly improving system performance. Experiments on real bill images demonstrate the effectiveness of the method, achieving a segmentation-recognition accuracy of 93.3%.

16.
An approach to designing very fast algorithms for approximate string matching in a dictionary is proposed. Multiple spelling errors corresponding to insert, delete, change, and transpose operations on character strings are considered in the fault model. The design of very fast approximate string matching algorithms through a four-step reduction procedure is described. The final and most effective step uses hashing techniques to avoid comparing the given word with words at large distances. The technique has been applied to a library book catalog textbase. The experiments show that performing approximate string matching for a large dictionary in real-time on an ordinary sequential computer under our multiple fault model is feasible.
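One common hashing reduction for approximate dictionary lookup indexes the single-deletion variants of every word, so that words within one insert, delete, or substitute of a query are retrieved without comparing against the whole dictionary; this sketch is in the same spirit as the paper's final hashing step, not its exact four-step procedure:

```python
def variants(word):
    """The word itself plus every single-deletion variant of it."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}

def build_index(dictionary):
    """Map each variant back to the dictionary words that produce it."""
    index = {}
    for w in dictionary:
        for v in variants(w):
            index.setdefault(v, set()).add(w)
    return index

def lookup(index, query):
    """Candidate dictionary words near the query; candidates found this way
    still need a final distance check under the full fault model."""
    hits = set()
    for v in variants(query):
        hits |= index.get(v, set())
    return hits
```

A substitution shows up because both words delete the differing position; an insertion or deletion shows up because one side's deletion variant equals the other word.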

17.
A software tool for inputting and outputting patient data (I/O routine) has been developed. Since this I/O routine is programmed exclusively in FORTRAN 77, it provides a powerful tool for constructing a portable database system. Basically, the routine manipulates an ASCII-coded text string that consists of lines demarcated by the CR code (13) and is terminated by the null code (0). The editing commands are preceded by one of the following ASCII characters: @, !, ], [, *, and _, and all strings with an initial character other than these are interpreted as data to be inserted into the text. Since the routine uses two FORTRAN tools already reported, i.e. the subroutines to manipulate key files and the subroutines to manage variable-length records, character strings can be stored without any restrictions in format or size, and can be retrieved either sequentially or in an indexed manner.

18.
A Chinese Automatic Word Segmentation System Combining String Frequency Statistics and Word Form Matching
This paper presents a Chinese automatic word segmentation system that scans the text three times. In the first pass, segmentation marks are used to split the text into a sequence of short Chinese-character strings. In the second pass, a weight is computed for each substring of each short string according to its frequency in context, and substrings with large weights are taken as candidate words. In the third pass, the candidate word set and a dictionary of common words are used to segment the short strings. Experiments show that the segmentation error rate of the system is about 1.5%, that it can identify most unknown words, and that it is particularly suitable for applications such as document retrieval.

19.
International Journal of Computer Mathematics, 2012, 89(3-4):133-145
The notion of splicing system has been used to abstract the process of DNA digestion by restriction enzymes and subsequent religation. A splicing system language is the formal language of all DNA strings producible by such a process. The membership problem is to devise an algorithm (if possible) to answer the question of whether or not a given DNA string belongs to a splicing system language given by initial strings and enzymes.

In this paper the concept of a sequential splicing system is introduced. A sequential splicing system differs from a splicing system in that the latter allows arbitrarily many copies of any string in the initial set whereas the sequential splicing system may restrict the initial number of copies of some strings. The main result is that there exist sequential splicing systems with recursively unsolvable membership problem. The technique of the proof is to embed Turing machine computations in the languages.

20.
The generalised median string is defined as a string that has the smallest sum of distances to the elements of a given set of strings. It is a valuable tool in representing a whole set of objects by a single prototype, and has interesting applications in pattern recognition. All algorithms for computing generalised median strings known from the literature are of static nature. That is, they require all elements of the underlying set of strings to be given when the algorithm is started. In this paper, we present a novel approach that is able to operate in a dynamic environment, where there is a steady arrival of new strings belonging to the considered set. Rather than computing the median from scratch upon arrival of each new string, the proposed algorithm needs only the median of the set computed before together with the new string to compute an updated median string of the new set. Our approach is experimentally compared to a greedy algorithm and the set median using both synthetic and real data.
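For contrast with the generalised median (which ranges over all possible strings and is expensive to compute), the set median used as a baseline here restricts the search to members of the set itself; a direct sketch under edit distance:

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j - 1] + cost,  # substitution / match
                            prev[j] + 1,         # deletion
                            curr[-1] + 1))       # insertion
        prev = curr
    return prev[-1]

def set_median(strings):
    """The member of the set minimizing the sum of distances to all members."""
    return min(strings, key=lambda s: sum(edit_distance(s, t) for t in strings))
```

The generalised median can only be as good as or better than the set median, but the set median needs no search over the (infinite) space of candidate strings.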

Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.). 京ICP备09084417号