首页 | 本学科首页   官方微博 | 高级检索  
     


Portrep: A portable repeated string finder
Authors:Leslie P. Jones  Edward W. Gassie  Sridhar Radhakrishnan
Abstract:In recent years, several authors have presented algorithms that locate instances of a given string, or set of strings, within a text. Recently, authors have given less consideration to the complementary problem of processing a text to find out what strings appear in the text, without any preconceived notion of what strings might be present. A system called PATRICIA, which was developed two decades ago, is an implementation of a solution to this problem. The design of PATRICIA is very tightly bound to the assumptions that individual string elements are bits and that the user of the system can provide complete lists of starting and stopping places for strings. This paper presents an approach that drops these assumptions. Our method allows different definitions of indivisible string elements for different applications, and the only information the user provides for the determination of the beginning and ends of strings is a specification of a maximum length for output strings. This paper also describes a portable C implementation of the method, called PORTREP. The primary data structure of PORTREP is a trie represented as a ternary tree. PORTREP has a method for eliminating redundancy from the output, and it can function with a bounded number of nodes by employing a heuristic process that reuses seldom-visited nodes. Theoretical analysis and empirical studies, reported here, give confidence in the efficiency of the algorithms. PORTREP has the ability to form the basis for a variety of text-analysis applications, and this paper considers one such application, automatic document indexing.
Keywords:String searching  Trie  Portability  Algorithmic complexity  Automatic document indexing
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号