Similar literature
 Found 20 similar documents; search time: 0 ms
1.
There is significant commercial and research interest in location-based web search engines. Given a number of search keywords and one or more locations (geographical points) that a user is interested in, a location-based web search retrieves and ranks the most textually and spatially relevant web pages. In this type of search, both the spatial and textual information should be indexed. Currently, no efficient index structure exists that can handle both the spatial and textual aspects of data simultaneously and accurately. Existing approaches either index space and text separately or use inefficient hybrid index structures with poor performance and inaccurate results. Moreover, most of these approaches cannot accurately rank web pages based on a combination of space and text and are not easy to integrate into existing search engines. In this paper, we propose a new index structure called Spatial-Keyword Inverted File for Points to handle point-based indexing of web documents in an integrated and efficient manner. To seamlessly find and rank relevant documents, we develop a new distance measure called spatial tf-idf. We propose four variants of spatial-keyword relevance scores and two algorithms to perform top-k searches. As verified by experiments, our proposed techniques outperform existing index structures in terms of search performance and accuracy.
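As a rough illustration of ranking by combined textual and spatial relevance, the sketch below blends a plain tf-idf score with a distance-decay term and returns the top-k documents. The abstract does not define spatial tf-idf or its four score variants, so the linear blend, the `alpha` weight, the decay function, and every function name here are assumptions made for illustration, not the paper's measure or index.

```python
import math
from collections import Counter


def text_score(query_terms, doc_terms, doc_freq, n_docs):
    """Plain tf-idf of the query terms in one document."""
    counts = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if counts[term] == 0:
            continue
        tf = counts[term] / len(doc_terms)
        # Smoothed idf so a term present in every document still counts.
        idf = math.log(1.0 + n_docs / doc_freq[term])
        score += tf * idf
    return score


def spatial_score(query_point, doc_point, scale=1.0):
    """Hypothetical decay: 1.0 at distance zero, falling towards 0."""
    return 1.0 / (1.0 + math.dist(query_point, doc_point) / scale)


def top_k(query_terms, query_point, docs, k, alpha=0.5):
    """docs maps doc_id -> (list of terms, (x, y) point)."""
    doc_freq = Counter()
    for terms, _ in docs.values():
        doc_freq.update(set(terms))
    scored = []
    for doc_id, (terms, point) in docs.items():
        text = text_score(query_terms, terms, doc_freq, len(docs))
        if text == 0.0:
            continue  # no query keyword appears in this document
        combined = alpha * text + (1.0 - alpha) * spatial_score(query_point, point)
        scored.append((combined, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]
```

This version scans every document; an index such as the one the abstract proposes exists precisely to avoid that scan.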

2.
We present a multimodal document alignment framework that highlights existing alignment relationships between documents that are discussed and recorded during multimedia events such as meetings. These relationships, which should help in indexing the archives of these events, are detected using various techniques from natural language processing and information retrieval. The main alignment strategies studied are based on thematic, quotation and reference relationships. At the analysis level, the alignment framework was applied at several levels of document granularity, requiring specific document segmentation techniques. Our framework, which is language independent, was evaluated on corpora in French and English, including meetings and scientific presentations. The satisfactory evaluation results obtained at several stages show the importance of our approach in bridging the gap between meeting documents, independently of language and domain. They also highlight the utility of multimodal alignment in advanced applications, e.g. multimedia document browsing and content-based or temporal-based searching.

3.
Keyword search in XML documents has recently gained a lot of research attention. Given a keyword query, existing approaches first compute the lowest common ancestors (LCAs), or their variants, of the XML elements that contain the input keywords, and then identify the subtrees rooted at the LCAs as the answer. In this paper we study how to use the rich structural relationships embedded in XML documents to facilitate the processing of keyword queries. We develop a novel method, called SAIL, to index such structural relationships for efficient XML keyword search. We propose the concept of minimal-cost trees to answer keyword queries and devise structure-aware indices to maintain the structural relationships for efficiently identifying the minimal-cost trees. For effectively and progressively identifying the top-k answers, we develop techniques using link-based relevance ranking and keyword-pair-based ranking. To reduce the index size, we incorporate a numbering scheme, namely schema-aware Dewey code, into our structure-aware indices. Experimental results on real data sets show that our method significantly outperforms state-of-the-art approaches in both answer quality and search efficiency.
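The link between Dewey-style numbering and LCA computation can be sketched briefly. Under plain Dewey labelling, where each element's label is its parent's label plus its own child position, the LCA of a set of elements is the longest common prefix of their labels. The code below assumes plain Dewey labels stored as tuples; the schema-aware variant and the minimal-cost trees named in the abstract are not reproduced, and the function name is hypothetical.

```python
def lowest_common_ancestor(labels):
    """Return the Dewey label of the LCA of the given non-empty list of labels.

    Each label is a tuple of child positions from the root, e.g. (1, 2, 3).
    """
    first = labels[0]
    prefix_len = len(first)
    for label in labels[1:]:
        i = 0
        # Advance while this label agrees with the prefix found so far.
        while i < min(prefix_len, len(label)) and first[i] == label[i]:
            i += 1
        prefix_len = i
    return first[:prefix_len]
```

For example, elements labelled `(1, 2, 1)` and `(1, 2, 3, 1)` share the ancestor `(1, 2)`, which is found without touching the document tree itself.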

4.
Google and other products have revolutionized the way we search for information. There are, however, still a number of research challenges. One challenge that arises specifically in desktop search is to exploit the structure and semantics of documents, as defined by the application program that generated the data (e.g., Word, Excel, or Outlook). The current generation of search products does not understand these structures and therefore often returns wrong results. This paper shows how today’s search technology can be extended to take the specific semantics of certain structures into account. The key idea is to extend inverted file index structures with predicates that encode the circumstances under which certain keywords of a document become visible to a user. This paper provides a framework for expressing the semantics of structures in documents, together with algorithms for constructing enhanced, predicate-based indexes. Furthermore, this paper shows how keyword and phrase queries can be processed efficiently on such enhanced indexes. It is shown that the proposed approach has superior retrieval performance with regard to both recall and precision, with tolerable overheads in space and query running time.
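The idea of attaching predicates to postings can be pictured with a small sketch. Each posting below carries a callable that decides whether the keyword is visible under a given viewing context, and a conjunctive query only matches documents in which every keyword is visible. The class name, the `view` dictionary, and the `show_comments` flag are hypothetical; the abstract does not describe the paper's predicate language or index layout.

```python
from collections import defaultdict


def always_visible(view):
    """Default predicate: the keyword is visible in every view."""
    return True


class PredicateIndex:
    """Inverted index whose postings are (doc_id, visibility predicate)."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, doc_id, term, predicate=always_visible):
        self.postings[term.lower()].append((doc_id, predicate))

    def search(self, terms, view):
        """Return sorted doc ids in which all terms are visible under `view`."""
        result = None
        for term in terms:
            visible = {
                doc_id
                for doc_id, predicate in self.postings.get(term.lower(), [])
                if predicate(view)
            }
            result = visible if result is None else result & visible
        return sorted(result or [])
```

A keyword that appears only inside a comment could be added with `lambda view: view.get("show_comments", False)`, so the document matches that keyword only when comments are displayed.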

5.
The increasing performance and wider use of automated semantic annotation and entity linking platforms have made it possible to use semantic information in information retrieval. While keyword-based information retrieval techniques have shown impressive performance, the addition of semantic information can increase retrieval performance by allowing for more accurate sense disambiguation, intent determination, and instance identification, to name a few. Researchers have already explored the integration of semantic information into practical search engines using techniques such as graph databases, hybrid indices and adapted inverted indices, among others. One of the challenges in efficiently designing a search engine capable of considering semantic information is that it would need to index information beyond what is traditionally stored in inverted indices, including entity mentions and type relationships. The objective of our work in this paper is to investigate various ways in which different data structure types can be adopted to integrate three types of information: keywords, entities and types. We systematically compare the performance of the different data structures for scenarios where (i) the same data structure type is adopted for all three types of information, and (ii) different data structure types are integrated for storing and retrieving the three different information types. We report our findings in terms of the performance of various query processing tasks, such as Boolean and ranked intersection, for the different indices and discuss which index type would be appropriate under different conditions for semantic search.

6.
With the ubiquitous collection of data and creation of large distributed repositories, enabling search over this data while respecting access control is critical. A related problem is that of ensuring privacy of the content owners while still maintaining an efficient index of distributed content. We address the problem of providing privacy-preserving search over distributed access-controlled content. Indexed documents can be easily reconstructed from conventional (inverted) indexes used in search. Currently, the need to avoid breaches of access control through the index requires the index hosting site to be fully secured and trusted by all participating content providers. This level of trust is impractical in the increasingly common case where multiple competing organizations or individuals wish to selectively share content. We propose a solution that eliminates the need for such a trusted authority. The solution builds a centralized privacy-preserving index in conjunction with a distributed access-control enforcing search protocol. Two alternative methods to build the centralized index are proposed, allowing trade-offs between efficiency and security. The new index provides strong and quantifiable privacy guarantees that hold even if the entire index is made public. Experiments on a real-life dataset validate the performance of the scheme. The appeal of our solution is twofold: (a) content providers maintain complete control in defining access groups and ensuring compliance with them, and (b) system implementors retain tunable knobs to balance privacy and efficiency concerns for their particular domains. Dr. Vaidya’s work was supported by the National Science Foundation under grant CNS-0746943 and by a research resources grant from Rutgers Business School, Newark and New Brunswick.

7.
Although conventional index structures provide various nearest-neighbor search algorithms for high-dimensional data, there are additional requirements to improve search performance and to support index scalability for large-scale datasets. To meet these requirements, we propose a distributed high-dimensional index structure based on cluster systems, called a Distributed Vector Approximation-tree (DVA-tree), which is a two-level structure consisting of a hybrid spill-tree and Vector Approximation files (VA-files). We also describe the algorithms used for constructing the DVA-tree over multiple machines and performing distributed k-nearest neighbor (k-NN) searches. To evaluate the performance of the DVA-tree, we conduct an experimental study using both real and synthetic datasets. The results show that our proposed method has significant performance advantages over existing index structures on different kinds of datasets.

8.
This paper proposes a new, efficient algorithm for extracting similar sections between two time sequence data sets. The algorithm, called Relay Continuous Dynamic Programming (Relay CDP), realizes fast matching between arbitrary sections in the reference pattern and the input pattern and enables the extraction of similar sections in a frame synchronous manner. In addition, Relay CDP is extended to two types of applications that handle spoken documents. The first application is the extraction of repeated utterances in a presentation or a news speech because repeated utterances are assumed to be important parts of the speech. These repeated utterances can be regarded as labels for information retrieval. The second application is flexible spoken document retrieval. A phonetic model is introduced to cope with the speech of different speakers. The new algorithm allows a user to query by natural utterance and searches spoken documents for any partial matches to the query utterance. We present herein a detailed explanation of Relay CDP and the experimental results for the extraction of similar sections and report results for two applications using Relay CDP.
Yoshiaki Itoh has been an associate professor in the Faculty of Software and Information Science at Iwate Prefectural University, Iwate, Japan, since 2001. He received the B.E. degree, M.E. degree, and Dr. Eng. from Tokyo University, Tokyo, in 1987, 1989, and 1999, respectively. From 1989 to 2001 he was a researcher and a staff member of Kawasaki Steel Corporation, Tokyo and Okayama. From 1992 to 1994 he transferred as a researcher to Real World Computing Partnership, Tsukuba, Japan. Dr. Itoh's research interests include spoken document processing without recognition, audio and video retrieval, and real-time human communication systems. He is a member of ISCA, Acoustical Society of Japan, Institute of Electronics, Information and Communication Engineers, Information Processing Society of Japan, and Japan Society of Artificial Intelligence.
Kazuyo Tanaka has been a professor at the University of Tsukuba, Tsukuba, Japan, since 2002. He received the B.E. degree from Yokohama National University, Yokohama, Japan, in 1970, and the Dr. Eng. degree from Tohoku University, Sendai, Japan, in 1984. From 1971 to 2002 he was research officer of Electrotechnical Laboratory (ETL), Tsukuba, Japan, and the National Institute of Advanced Science and Technology (AIST), Tsukuba, Japan, where he was working on speech analysis, synthesis, recognition, and understanding, and also served as chief of the speech processing section. His current interests include digital signal processing, spoken document processing, and human information processing. He is a member of IEEE, ISCA, Acoustical Society of Japan, Institute of Electronics, Information and Communication Engineers, and Japan Society of Artificial Intelligence.
Shi-Wook Lee received the B.E. degree and M.E. degree from Yeungnam University, Korea and Ph.D. degree from the University of Tokyo in 1995, 1997, and 2001, respectively. Since 2001 he has been working in the Research Group of Speech and Auditory Signal Processing, the National Institute of Advanced Science and Technology (AIST), Tsukuba, Japan, as a postdoctoral fellow. His research interests include spoken document processing, speech recognition, and understanding.

9.
We propose an approach for the word-level indexing of modern printed documents which are difficult to recognize using current OCR engines. By means of word-level indexing, it is possible to retrieve the position of words in a document, enabling queries involving proximity of terms. Web search engines implement this kind of indexing, allowing users to retrieve Web pages on the basis of their textual content. Nowadays, digital libraries hold collections of digitized documents that can be retrieved either by browsing the document images or by relying on appropriate metadata assembled by domain experts. Word indexing tools would therefore increase access to these collections. The proposed system is designed to index homogeneous document collections by automatically adapting to different languages and font styles without relying on OCR engines for character recognition. The approach is based on three main ideas: the use of self-organizing maps (SOM) to perform unsupervised character clustering, the definition of a suitable vector-based word representation whose size depends on the word aspect ratio, and the run-time alignment of the query word with indexed words to deal with broken and touching characters. The most appropriate applications are for processing modern printed documents (17th to 19th centuries) where current OCR engines are less accurate. Our experimental analysis addresses six data sets containing documents ranging from books of the 17th century to contemporary journals.

10.
Information Systems, 2001, 26(2): 75-92
As information becomes available on the World Wide Web in larger quantities and in more disparate formats and media, adequate search engines and portal services providing search and filtering modalities tailored to the needs of the various communities of users become essential components of the global information infrastructure. Global-Atlas is a geographical search engine leveraging the cartographic paradigm for the indexing and searching of World Wide Web documents. Documents are indexed according to their geographical footprint, i.e. the bounding box of the geographical region to which they are related. Documents are searched by interactively drawing bounding boxes on maps. The effectiveness of Global-Atlas depends on the capability to index documents and maps quickly and accurately. Maps, however, come in a variety of unspecified coordinate systems and projections. In this paper, we evaluate various surface-fitting techniques for the calibration of maps and devise an original hybrid calibration method from the empirical results obtained.
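Searching by geographical footprint reduces, at its simplest, to a rectangle-intersection test between the query box and each document's bounding box. The sketch below shows that test with a linear scan. The `BBox` type, the field names, and the sample coordinates are assumptions; the abstract does not describe how Global-Atlas stores or matches footprints, and the map calibration step it evaluates is not covered here.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Axis-aligned bounding box in degrees (hypothetical layout)."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def intersects(a, b):
    """True when the two boxes share at least one point."""
    return (
        a.min_lon <= b.max_lon
        and b.min_lon <= a.max_lon
        and a.min_lat <= b.max_lat
        and b.min_lat <= a.max_lat
    )


def search(index, query):
    """Return sorted ids of documents whose footprint overlaps the query box."""
    return sorted(doc_id for doc_id, box in index.items() if intersects(box, query))
```

The test ignores boxes that cross the antimeridian, and a production system would typically replace the scan with a spatial index.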

11.
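Owing to the wide adoption of graph models, fast containment search over graph data is widely applied in many different fields. Given a set of model graphs D and a query graph q, traditional graph search aims to retrieve all graphs g that contain q (q ⊆ g). Containment search, in contrast, has its own indexing characteristics. This work studies these characteristics systematically and proposes a contrast-subgraph-based indexing model (csgIndex). Using a redundancy-aware feature selection process, csgIndex selects a distinctive, discriminative set of contrast subgraphs and maximizes its indexing power. Experimental results on real test data show that csgIndex achieves near-optimal pruning power on different containment search workloads and has a clear indexing performance advantage over traditional graph search methods.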

12.
Multimedia Tools and Applications - Product quantization is a widely used lossy compression technique that can generate high quantization levels by a compact codebook set. It has been conducted in...

13.
14.
The diffusion of the World Wide Web (WWW) and the consequent increase in the production and exchange of textual information demand the development of effective information retrieval systems. The HyperText Markup Language (HTML) constitutes a common basis for generating documents over the Internet and intranets. By means of HTML the author is allowed to organize the text into subparts delimited by special tags; these subparts are then visualized by the HTML browser in distinct ways, i.e. with distinct typographical formats. In this paper a model for indexing HTML documents is proposed which exploits the role of tags in encoding the importance of their delimited text. Central to our model is a method to compute the significance degree of a term in a document by weighting the term instances according to the tags in which they occur. The indexing model proposed is based on a contextual weighted representation of the document under consideration, by means of which a set of (normalized) numerical weights is assigned to the various tags containing the text. The weighted representation is contextual in the sense that the set of numerical weights assigned to the various tags and the respective text depends (other than on the tags themselves) on the particular document considered. By means of the contextual weighted representation our indexing model reflects not only the general syntactic structure of the HTML language but also the information conveyed by the particular way in which the author instantiates that general structure in the document under consideration. We discuss two different forms of contextual weighting: the first is based on a linear weighted representation and is closer to the standard model of universal (i.e. non-contextual) weighting; the second is based on a more complex nonlinear weighted representation and has a number of novel and interesting features.
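A minimal sketch of tag-based term weighting follows, using Python's standard `html.parser`. Each term occurrence is credited with the weight of its innermost weighted tag, and the weights are normalized over the tags that actually occur in the document, which is one simple way to make them depend on the document. The base weights, the normalization rule, and all names are assumptions; the abstract does not give the paper's linear or nonlinear weighting formulas.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical base importance of each tag; not taken from the paper.
BASE_WEIGHTS = {"h1": 4.0, "h2": 3.0, "b": 2.0, "p": 1.0}


class TagTermCollector(HTMLParser):
    """Records (term, innermost weighted tag) for every word in the text."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.occurrences = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        tag = next((t for t in reversed(self.stack) if t in BASE_WEIGHTS), None)
        if tag is None:
            return
        for term in data.lower().split():
            self.occurrences.append((term, tag))


def term_significance(html):
    """Sum, per term, the document-normalized weight of each occurrence."""
    collector = TagTermCollector()
    collector.feed(html)
    collector.close()
    used_tags = {tag for _, tag in collector.occurrences}
    total = sum(BASE_WEIGHTS[tag] for tag in used_tags)
    scores = defaultdict(float)
    for term, tag in collector.occurrences:
        scores[term] += BASE_WEIGHTS[tag] / total
    return dict(scores)
```

Because the weights are renormalized per document, the same heading tag contributes more in a document that uses few distinct tags than in one that uses many.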

15.
One of the major challenges in Peer-to-Peer (P2P) file sharing systems is to support content-based search. Although there have been some proposals to address this challenge, they share the same weakness of using either servers or super-peers to keep global knowledge, which is required to identify importance of terms to avoid popular terms in query processing. As a result, they are not scalable and are prone to the bottleneck problem, which is caused by the high visiting load at the global knowledge maintainers. To that end, in this paper, we propose a novel adaptive indexing approach for content-based search in P2P systems, which can identify importance of terms without keeping global knowledge. Our method is based on an adaptive indexing structure that combines a Chord ring and a balanced tree. The tree is used to aggregate and classify terms adaptively, while the Chord ring is used to index terms of nodes in the tree. Specifically, at each node of the tree, the system classifies terms as either important or unimportant. Important terms, which can distinguish the node from its neighbor nodes, are indexed in the Chord ring. On the other hand, unimportant terms, which are either popular or rare terms, are aggregated to higher level nodes. Such classification enables the system to process queries on the fly without the need for global knowledge. Besides, compared to the methods that index terms separately, term aggregation reduces the indexing cost significantly. Taking advantage of the tree structure, we also develop an efficient search algorithm to tackle the bottleneck problem near the root. Finally, our extensive experiments on both benchmark and Wikipedia datasets validated the effectiveness and efficiency of the proposed method.

16.
Spoken term detection is an extension of text-based searching that allows users to type keywords and search audio files containing recordings of spoken language. Performance is dependent on many external factors such as the acoustic channel, language, pronunciation variations and acoustic confusability of the search term. Unlike text-based searches, the likelihoods of false alarms and misses for specific search terms, which we refer to as reliability, play a significant role in the overall perception of the usability of the system. In this paper, we present a system that predicts the reliability of a search term based on its inherent confusability. Our approach integrates predictors of the reliability that are based on both acoustic and phonetic features. These predictors are trained using an analysis of recognition errors produced from a state-of-the-art spoken term detection system operating on the Fisher Corpus. This work represents the first large-scale attempt to predict the success of a keyword search term from only its spelling. We explore the complex relationship between phonetic and acoustic properties of search terms. We show that a 76% correlation between the predicted error rate and the actual measured error rate can be achieved, and that the remaining confusability is due to other acoustic modeling issues that cannot be derived from a search term’s spelling.

17.
Authors use images to present a wide variety of important information in documents. For example, two-dimensional (2-D) plots display important data in scientific publications. Often, end-users seek to extract this data and convert it into a machine-processable form so that the data can be analyzed automatically or compared with other existing data. Existing document data extraction tools are semi-automatic and require users to provide metadata and interactively extract the data. In this paper, we describe a system that extracts data from documents fully automatically, completely eliminating the need for human intervention. The system uses a supervised learning-based algorithm to classify figures in digital documents into five classes: photographs, 2-D plots, 3-D plots, diagrams, and others. Then, an integrated algorithm is used to extract numerical data from data points and lines in the 2-D plot images along with the axes and their labels, the data symbols in the figure’s legend and their associated labels. We demonstrate that the proposed system and its component algorithms are effective via an empirical evaluation. Our data extraction system has the potential to be a vital component in high volume digital libraries.

18.
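Based on a study and analysis of methods for building general speech corpora, and taking into account the requirements of spontaneous speech recognition research and the basic characteristics of spontaneous spoken Tibetan, this work designs a construction scheme and corresponding annotation conventions for a spoken-language speech corpus suited to Tibetan speech recognition. Following this scheme, a Lhasa Tibetan spoken speech corpus was built with a duration of 50 hours and five layers of annotation: phonemes, semi-syllables, syllables, Tibetan characters, and sentences. Statistics show that the corpus preserves the natural properties of spoken speech while providing balanced coverage of common speech modeling units such as phonemes and semi-syllables, offering reliable data support for research on speech recognition based on spoken Tibetan data.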

19.
This paper describes MetaIndex, an automatic indexing program that creates symbolic representations of documents for the purpose of document retrieval. MetaIndex uses a simple transition network parser to recognize a language that is derived from the set of main concepts in the Unified Medical Language System Metathesaurus (Meta-1). MetaIndex uses a hierarchy of medical concepts, also derived from Meta-1, to represent the content of documents. The goal of this approach is to improve document retrieval performance by better representation of documents. An evaluation method is described, and the performance of MetaIndex on the task of indexing the Slice of Life medical image collection is reported.

20.
For some multimedia applications, it has been found that domain objects cannot be represented as feature vectors in a multidimensional space. Instead, pair-wise distances between data objects are the only input. To support content-based retrieval, one approach maps each object to a k-dimensional (k-d) point and tries to preserve the distances among the points. Then, existing spatial access index methods such as R-trees and KD-trees can support fast searching on the resulting k-d points. However, information loss is inevitable with such an approach since the distances between data objects can only be preserved to a certain extent. Here we investigate the use of a distance-based indexing method. In particular, we apply the vantage point tree (vp-tree) method. There are two important problems for the vp-tree method that warrant further investigation: the n-nearest neighbors search and the updating mechanisms. We study an n-nearest neighbors search algorithm for the vp-tree, which is shown by experiments to scale up well with the size of the dataset and the desired number of nearest neighbors, n. Experiments also show that searching in the vp-tree is more efficient than in the R*-tree and the M-tree. Next, we propose solutions for the update problem for the vp-tree, and show by experiments that the algorithms are efficient and effective. Finally, we investigate the problem of selecting vantage points, propose a few alternative methods, and study their impact on the number of distance computations. Received June 9, 1998 / Accepted January 31, 2000
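To make the distance-based indexing idea concrete, here is a minimal vp-tree sketch that needs only a pair-wise distance function. Construction splits the points around a randomly chosen vantage point at the median distance, and the search is a standard branch-and-bound that prunes subtrees with the triangle inequality. This is a textbook-style version under those assumptions, not the paper's n-nearest neighbors algorithm, its update procedures, or its vantage-point selection methods; all names are hypothetical.

```python
import heapq
import math
import random


class Node:
    """One vp-tree node: a vantage point, a radius, and two subtrees."""

    def __init__(self, point, radius, inside, outside):
        self.point = point
        self.radius = radius
        self.inside = inside      # points with distance <= radius
        self.outside = outside    # points with distance > radius


def build(points, dist, rng):
    """Recursively build a vp-tree; `dist` must be a metric."""
    if not points:
        return None
    points = list(points)
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return Node(vantage, 0.0, None, None)
    dists = [dist(vantage, p) for p in points]
    radius = sorted(dists)[len(dists) // 2]  # median distance
    inside = [p for p, d in zip(points, dists) if d <= radius]
    outside = [p for p, d in zip(points, dists) if d > radius]
    return Node(vantage, radius, build(inside, dist, rng), build(outside, dist, rng))


def nearest(root, query, n, dist):
    """Return the n points closest to `query`, nearest first."""
    if n <= 0:
        return []
    heap = []      # max-heap of (-distance, visit order, point)
    visited = 0

    def bound():
        # Distance of the current n-th best candidate, or infinity.
        return -heap[0][0] if len(heap) == n else math.inf

    def visit(node):
        nonlocal visited
        if node is None:
            return
        d = dist(query, node.point)
        visited += 1
        if len(heap) < n:
            heapq.heappush(heap, (-d, visited, node.point))
        elif d < bound():
            heapq.heapreplace(heap, (-d, visited, node.point))
        if d <= node.radius:
            visit(node.inside)
            if d + bound() > node.radius:   # outside may still hold a closer point
                visit(node.outside)
        else:
            visit(node.outside)
            if d - bound() <= node.radius:  # inside may still hold a closer point
                visit(node.inside)

    visit(root)
    return [point for _, _, point in sorted(heap, reverse=True)]
```

A tree over 2-D points can be built with `build(points, math.dist, random.Random(0))`, but any metric works, which is the point of indexing by distance alone.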

