Similar literature
 20 similar documents found (search time: 31 ms)
1.
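SEEKER: Keyword-Based Information Retrieval over Relational Databases   Total citations: 20, self-citations: 3, citations by others: 20
Wen Jijun, Wang Shan. Journal of Software (《软件学报》), 2005, 16(7): 1270-1281
Traditionally, SQL has been the main interface for accessing data in relational databases. For inexperienced users, however, learning complex SQL syntax is difficult. Keyword-based information retrieval over relational databases lets users obtain relevant data in the manner of a search engine, without any knowledge of SQL or of the underlying database schema. This paper describes the design and implementation of SEEKER, a keyword-based information retrieval system for relational databases. Existing keyword query systems for relational databases can search only text attributes, whereas SEEKER can also search database metadata and numeric attributes. In addition, SEEKER adopts a more reasonable ranking formula and supports Top-k queries. Experimental results show that SEEKER has good query performance.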

2.
This paper proposes a novel method for distributed data organization and parallel data retrieval from huge-volume point clouds generated by airborne Light Detection and Ranging (LiDAR) technology in a cluster computing environment, in order to allow fast analysis, processing, and visualization of the point clouds within a given area. The proposed method is suitable for both grid and quadtree data structures. As for the distribution strategy, cross distribution of the dataset is more efficient than serial distribution for non-redundant datasets, since the dataset is more uniformly distributed in the former arrangement. However, redundant datasets are necessary to meet the frequent input and output demands of multi-client scenarios: the first copy is distributed by a cross distribution strategy, while the second (and later) copies are distributed by an iterated exchanging distribution strategy. Such a strategy distributes datasets more uniformly across the data servers. In data retrieval, when the data block to be retrieved is stored on multiple data servers, a greedy algorithm allocates the query task to the data server whose computing load is lightest. Experiments show that the proposed method can satisfy the demands of frequent and fast data queries.
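
The greedy allocation step described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `assign_queries`, the `replicas` mapping, and the use of a simple request count as the "computing load" are all hypothetical.

```python
def assign_queries(block_requests, replicas, num_servers):
    """Greedily send each block request to the least-loaded server holding that block.

    block_requests: block ids in arrival order (hypothetical input format).
    replicas: dict mapping block id -> list of server ids that store a copy.
    Load is modelled as a plain request count, which is an assumption.
    """
    load = [0] * num_servers
    assignment = []
    for block in block_requests:
        candidates = replicas[block]
        # Choose the lightest-loaded replica holder; ties go to the lowest server id.
        target = min(candidates, key=lambda s: (load[s], s))
        load[target] += 1
        assignment.append((block, target))
    return assignment, load


if __name__ == "__main__":
    replicas = {"A": [0, 1], "B": [1, 2], "C": [0, 2]}
    print(assign_queries(["A", "A", "B", "C", "A"], replicas, 3))
```

A real system would presumably measure load from running tasks or I/O queues rather than from a counter, and would decrement it when a request completes.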

3.
This paper studies how to enable effective ranked retrieval over data with categorical attributes, in particular by supporting personalized ranked retrieval of highly relevant data. While ranked retrieval has been actively studied lately, existing efforts have focused only on supporting ranking over numerical or text data. However, many real-life data contain a large number of categorical attributes, in combination with numerical and text attributes, which cannot be efficiently supported: unlike numerical attributes, where a natural ordering is inherent, categorical attributes have no such ordering, which complicates both the formulation and the processing of ranking. This paper studies the efficient and effective support of ranking over categorical data, as well as uniform support with other types of attributes.

4.
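To address the problem of automatic information extraction from web pages, this paper proposes a method that partitions a web page into blocks by tag and identifies the news body text among them using naive Bayes theory. The method uses each block's tag information, text similarity, and text length as feature attributes for machine learning. To strengthen the descriptive power of the tag attributes and reduce interference between correlated tags, the algorithm applies the χ² test to examine the correlations among tag attributes and between tag attributes and the class, and performs attribute reduction accordingly. During extraction, the posterior probabilities of both body and non-body blocks are taken into account in order to improve extraction accuracy. Experimental results show that, with suitable parameter values, the accuracy of news body extraction reaches 85%.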
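
As a rough illustration of the χ² test that this abstract uses for attribute reduction, the sketch below computes the standard χ² statistic for a 2×2 contingency table between one binary tag attribute and the body/non-body class. The function name and the example counts are hypothetical; the abstract does not give the paper's exact procedure or thresholds.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table.

    a: blocks that have the tag and are body text
    b: blocks that have the tag and are not body text
    c: blocks that lack the tag and are body text
    d: blocks that lack the tag and are not body text
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so the attribute carries no information.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


if __name__ == "__main__":
    # Made-up counts: the tag is much more frequent in body blocks.
    print(chi_square_2x2(40, 10, 10, 40))
```

An attribute whose statistic against the class is close to zero is a candidate for removal, since it is nearly independent of the class; the cut-off value is a design choice not stated in the abstract.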

5.
Legal text retrieval traditionally relies upon external knowledge sources such as thesauri and classification schemes, and accurate indexing of the documents is often done manually. As a result, not all legal documents can be effectively retrieved. However, a number of current artificial intelligence techniques are promising for legal text retrieval. They support the acquisition of knowledge and the knowledge-rich processing of the content of document texts and information needs, and of their matching. Currently, techniques for learning information needs, learning concept attributes of texts, information extraction, text classification and clustering, and text summarization need to be studied in legal text retrieval because of their potential for improving retrieval and decreasing the cost of manual indexing. The resulting query and text representations are semantically much richer than a set of key terms. Their use allows for more refined retrieval models in which some reasoning can be applied. This paper gives an overview of the state of the art of these innovative techniques and their potential for legal text retrieval.

6.
Tag clouds have proliferated on the web over the last decade. They provide a visual summary of a collection of texts by depicting tag frequency through font size. In use, tag clouds can evolve as the associated data source changes over time. Interesting discussions around tag clouds often include a series of tag clouds and consider how they evolve over time. However, since tag clouds do not explicitly represent trends or support comparisons, the cognitive demands placed on a person perceiving trends across multiple tag clouds are high. In this paper, we introduce SparkClouds, which integrate sparklines into a tag cloud to convey trends between multiple tag clouds. We present results from a controlled study that compares SparkClouds with two traditional trend visualizations (multiple line graphs and stacked bar charts) as well as Parallel Tag Clouds. Results show that SparkClouds' ability to show trends compares favourably to the alternative visualizations.

7.
Since documents on the Web are naturally partitioned into many text databases, efficient document retrieval requires identifying the text databases that are most likely to provide documents relevant to the query and then searching the identified text databases. In this paper, we propose a neural net based approach to such efficient document retrieval. First, we present a neural net agent that learns about underlying text databases from the user's relevance feedback. For a given query, the neural net agent, once sufficiently trained using the BPN learning mechanism, discovers the text databases associated with the relevant documents and retrieves those documents effectively. To scale our approach to a large number of text databases, we also propose a hierarchical organization of neural net agents, which reduces the total training cost to an acceptable level. Finally, we evaluate the performance of our approach by comparing it with conventional well-known approaches. Received 5 March 1999 / Revised 7 March 2000 / Accepted in revised form 2 November 2000

8.
Microprogramming commonly executed operations can improve the computational speed of data processing systems. This paper describes how microprogramming may be used to execute directly the intermediate text generated by a high-level language compiler after syntactic and semantic analysis of the input source program. Direct microprogrammed execution of common forms of intermediate text (i.e., quadruples, triples, and duos) has been simulated. A comparison is made, in terms of storage requirements and execution time, of this direct microprogrammed system with the present methods, which result in machine language representation and execution of the intermediate text. Direct generation of a microprogram from the high-level language statements is also examined. Timing assumptions for comparative purposes have been based on the IBM 360 MOD 50 system. Simulation and timing estimates for the microprograms have been carried out on a microprogram directed simulator which closely represents the architectural organization of the MOD 50.

9.
Automatic text segmentation and text recognition for video indexing   Total citations: 13, self-citations: 0, citations by others: 13
Efficient indexing and retrieval of digital video is an important function of video databases. One powerful index for retrieval is the text appearing in the videos. It enables content-based browsing. We present our new methods for automatic segmentation of text in digital videos. The algorithms we propose make use of typical characteristics of text in videos in order to enable and enhance segmentation performance. The unique features of our approach are the tracking of characters and words over their complete duration of occurrence in a video and the integration of the multiple bitmaps of a character over time into a single bitmap. The output of the text segmentation step is then directly passed to a standard OCR software package in order to translate the segmented text into ASCII. Also, a straightforward indexing and retrieval scheme is introduced. It is used in the experiments to demonstrate that the proposed text segmentation algorithms together with existing text recognition algorithms are suitable for indexing and retrieval of relevant video sequences in and from a video database. Our experimental results are very encouraging and suggest that these algorithms can be used in video retrieval applications as well as to recognize higher level semantics in videos.

10.
This paper considers the problem of qualitative searches in abstract and biographic databases. It analyzes two classes of search instruments: those that apply retrieval requests based on Boolean combinations of terms, and those that use free sentences in a natural language. It is noted that the first class of systems gives a clearer insight into the result but demands high qualifications from a user; it is very difficult to achieve high indicators of search completeness with them. The second class of systems is simpler in its workings, permits the processing of more verbose queries, and is oriented to an unqualified user. However, the output in such systems requires a longer browsing routine to search for relevant records. The experience of using an untraditional search engine with automatic creation of terminological combinations from a query text is described. The terminological combinations from a query text that are contained in the documents found during a search are issued to the user as an intermediate result. There is a convenient mechanism for browsing through the terminological combinations that are of interest to a user and through the found records themselves, as well as a mechanism for searching by subject heading indices with the output to them through a query text. Illustrative examples of using the system for searching in the abstract Medicine VINITI DB are given.

11.
In recent years, as the amount of data has grown, personal information management has become both essential and challenging in everyday life. Tagging, an alternative or complement to classifying into tree-structured directories, allows users to classify a single information item in multiple categories. Due to their flexibility, tagging systems have become popular and a number of studies have been conducted. Most of the previous research investigated the quality of tags with various tools such as questionnaires. However, the actual usage behavior of tag-based browsing and retrieval of stored information has rarely been studied. In this study, we examined the effects of tag attributes on user behavior in browsing self-tagged documents under personal information management settings.

Three attributes, tag commonness, tag frequency and tag position, were identified. A controlled experiment with tagging and retrieval tasks designed to trace users' behavior revealed that tags with higher tag commonness, higher tag frequency, and lower tag position were more likely to be used. Tags with lower tag commonness and lower tag frequency helped users recognize a desired document among a list of candidates. Among the three attributes, tag position was found to be the most influential. The findings of this study are expected to enhance the understanding of tag quality and help information designers build an effective tagging environment.


12.
13.
We present a framework for measuring the complexity of image databases, which characterizes the databases for image retrieval. Motivated by the concept of text corpus perplexity, the complexity of image databases is formulated based on image database statistics and information theory. We propose a quantitative metric which can be used to measure the degree of difficulty of retrieving images based on the contents of the database. This metric is independent of queries and hence objective. Experiments on both synthetic and real-world images demonstrate that the proposed measurement is highly effective in quantitatively measuring the contents of image databases for content-based retrieval.

14.
The degree to which information sources are pre-processed by Web-based information systems varies greatly. In search engines like Altavista, little pre-processing is done, while in knowledge integration systems, complex site-specific wrappers are used to integrate different information sources into a common database representation. In this paper we describe an intermediate point between these two models. In our system, information sources are converted into a highly structured collection of small fragments of text. Database-like queries to this structured collection of text fragments are approximated using a novel logic called WHIRL, which combines inference in the style of deductive databases with ranked retrieval methods from information retrieval (IR). WHIRL allows queries that integrate information from multiple Web sites, without requiring the extraction and normalization of object identifiers that can be used as keys; instead, operations that in conventional databases require equality tests on keys are approximated using IR similarity metrics for text. This leads to a reduction in the amount of human engineering required to field a knowledge integration system. Experimental evidence is given showing that many information sources can be easily modeled with WHIRL, and that inferences in the logic are both accurate and efficient.
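
The idea of replacing equality tests on keys with an IR similarity metric can be illustrated by a small "similarity join" over name strings. The sketch below is not WHIRL itself (it has no logic or inference component); it only ranks pairs of text fragments by TF-IDF cosine similarity. The function names, the particular TF-IDF weighting, and the sample company names are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one unit-length TF-IDF vector (dict: term -> weight) per text."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency
    vectors = []
    for d in docs:
        vec = {t: (1 + math.log(tf)) * math.log(1 + n / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def similarity_join(left, right, k=3):
    """Return the k most similar (score, left_text, right_text) pairs."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = []
    for i, a in enumerate(lv):
        for j, b in enumerate(rv):
            # Cosine similarity of unit vectors is their dot product.
            score = sum(w * b[t] for t, w in a.items() if t in b)
            if score > 0:
                pairs.append((score, left[i], right[j]))
    pairs.sort(key=lambda p: -p[0])
    return pairs[:k]


if __name__ == "__main__":
    site_a = ["Acme Corp", "Globex Corporation"]
    site_b = ["ACME Corp Inc", "Initech", "Globex Corporation Ltd"]
    for score, a, b in similarity_join(site_a, site_b):
        print(f"{score:.3f}  {a!r} ~ {b!r}")
```

Because rare terms receive higher weights, two names that share a distinctive token score higher than two that share only a common one, which is what lets a ranked join stand in for an exact key match.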

15.
Automatic image tagging assigns semantic keywords, called tags, to an image, which significantly facilitates image search and organization. Most present image tagging approaches are constrained by the model learned from the training dataset, and moreover they do not exploit other types of web resources (e.g., web text documents). In this paper, we propose a search-based image tagging algorithm (CTSTag), in which the result tags are derived from web search results. Specifically, it assigns the query image a more comprehensive tag set derived from both web images and web text documents. First, a content-based image search technology is used to retrieve a set of visually similar images, which are ranked by their semantic consistency values. Then, a set of relevant tags is derived from these top-ranked images as the initial tag set. Second, a text-based search is used to retrieve other relevant web resources by using the initial tag set as the query. After a denoising process, the initial tag set is expanded with other tags mined from the text-based search result. Then, a probability flow measure method is proposed to estimate the probabilities of the expanded tags. Finally, all the tags are refined using the Random Walk with Restart (RWR) method and the top ones are assigned to the query image. Experiments on the NUS-WIDE dataset show not only the performance of the proposed algorithm but also the advantage of image retrieval and organization based on the result tags.
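
The final refinement step named in this abstract, Random Walk with Restart, can be sketched generically as below. The tag similarity matrix, the restart distribution (standing in for the estimated tag probabilities), and the restart weight `alpha` are all assumed values; the abstract does not specify how CTSTag builds its graph or sets its parameters.

```python
def random_walk_with_restart(weights, restart, alpha=0.15, max_iter=200, tol=1e-12):
    """Iterate r = alpha * restart + (1 - alpha) * P^T r until convergence.

    weights: square list of non-negative tag-to-tag similarities (assumed input).
    restart: initial tag probabilities, expected to sum to 1.
    """
    n = len(weights)
    # Row-normalize similarities into transition probabilities.
    trans = []
    for row in weights:
        total = sum(row)
        trans.append([w / total for w in row] if total > 0 else [1.0 / n] * n)
    scores = list(restart)
    for _ in range(max_iter):
        new = [
            alpha * restart[j] + (1 - alpha) * sum(scores[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new, scores))
        scores = new
        if delta < tol:
            break
    return scores


if __name__ == "__main__":
    tags = ["beach", "sea", "sand", "car"]
    weights = [
        [0.0, 3.0, 2.0, 0.1],
        [3.0, 0.0, 2.0, 0.1],
        [2.0, 2.0, 0.0, 0.1],
        [0.1, 0.1, 0.1, 0.0],
    ]
    scores = random_walk_with_restart(weights, [0.5, 0.3, 0.1, 0.1])
    for tag, s in sorted(zip(tags, scores), key=lambda p: -p[1]):
        print(f"{tag}: {s:.3f}")
```

In this toy graph the weakly connected tag ends up with the lowest score, which is the intended effect of the refinement: tags that are poorly supported by the other candidates are pushed down the ranking.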

16.
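Design and Implementation of SEARCH2000, a Web-Based Chinese Retrieval System   Total citations: 3, self-citations: 0, citations by others: 3
This paper describes in detail the design ideas and implementation of the Search 2000 Chinese retrieval system. Compared with traditional full-text retrieval systems, Web-based information retrieval systems have many entirely new characteristics: pages are semi-structured documents, pages are linked to one another through hyperlinks, and page content spans different application domains and contains a large number of proper nouns and abbreviations. These characteristics are the main factors affecting query precision. Designed around these characteristics of the Web, the Search2000 full-text retrieval system uses intelligent page relevance analysis and scoring techniques, together with efficient data access and compression algorithms and knowledge base support, which make it easy to use and give it short query times and high query precision.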

17.
This paper addresses the problems that lawyers experience retrieving information from legal-text databases. Traditional access mechanisms of text databases require users to know how information is stored. We propose a method for index organisation which shields lawyers from the internal storage structures and which allows them to address the legal databases in their own legal terms. The proposed index is based on a model of legal tasks as opposed to traditional database indexes which represent the contents of the database. We will lay out the architecture of an information system in which this task model is used to determine the information need, to retrieve relevant documents and to give methodical guidance for the legal task itself. To account for the design of a task-based legal information retrieval system, a substantial part of this paper is devoted to analysis and representation of legal tasks.

18.
Document segmentation is a process that aims to filter documents while identifying certain regions of interest. Generally, the regions of interest include text, graphics (image-occupied regions) and the background. This paper presents a novel top-bottom approach to document segmentation using texture features extracted from the selected documents. A mask of suitable size is used to summarize textural features, and statistical parameters are captured as blocks in document images. Four textural features are extracted from the masks using the gray level co-occurrence matrix (GLCM): entropy, contrast, energy and homogeneity. In addition, two statistical parameters are extracted from the corresponding masks: the modal and median pixel values. The extracted attributes allow each mask or block to be classified as text, graphics, or background. A feedforward network is trained on the 6 extracted attributes, using documents obtained from a public database; an error rate of 15.77% is achieved. Furthermore, it is shown that this approach produces promising performance in segmenting documents and is expected to be efficient for content-based information retrieval systems. Detection of duplicate documents within large databases is another potential area of application.
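
The four GLCM texture features listed in this abstract can be computed for a single block roughly as follows. This is a minimal sketch using one common set of definitions (definitions of energy and homogeneity vary between libraries); the function name, the single horizontal pixel offset, and the tiny quantized blocks are assumptions, and the block is assumed to contain at least one valid pixel pair.

```python
import math


def glcm_features(block, levels, dx=1, dy=0):
    """Entropy, contrast, energy and homogeneity of one block's normalized GLCM.

    block: 2D list of integer gray levels in range(levels).
    (dx, dy): pixel offset defining which neighbour pairs are counted.
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(block), len(block[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                counts[block[y][x]][block[ny][nx]] += 1
    total = sum(map(sum, counts))
    feats = {"entropy": 0.0, "contrast": 0.0, "energy": 0.0, "homogeneity": 0.0}
    for i in range(levels):
        for j in range(levels):
            p = counts[i][j] / total  # joint probability of the gray-level pair
            if p > 0:
                feats["entropy"] -= p * math.log2(p)
            feats["contrast"] += p * (i - j) ** 2
            feats["energy"] += p * p
            feats["homogeneity"] += p / (1 + abs(i - j))
    return feats


if __name__ == "__main__":
    flat = [[1, 1, 1], [1, 1, 1]]       # uniform block, background-like
    checker = [[0, 1, 0], [1, 0, 1]]    # rapidly alternating block
    print(glcm_features(flat, 2))
    print(glcm_features(checker, 2))
```

A uniform block yields zero entropy and contrast with maximal energy, while an alternating pattern raises entropy and contrast; differences of this kind are what would let a classifier separate background from text or graphics blocks.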

19.
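Gao Ming, Huang Zhexue. Journal of Integration Technology (《集成技术》), 2012, 1(3): 47-54
With the rapid growth in the number and scale of Deep Web sources, issuing query requests to them in order to obtain the relevant information stored in their back-end databases is increasingly becoming the main way users acquire information. To help users make effective use of the information in the Deep Web, more and more researchers are working in this area, and one focus is data integration across Deep Web back-end databases. Because these back-end databases mainly store textual information, research on querying and retrieving Deep Web content from a text-processing perspective has very broad application prospects. This paper analyzes the current state of Deep Web research in some detail and discusses future research directions.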

20.
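A Web Page Noise Removal Method Based on Page Layout and Rules   Total citations: 4, self-citations: 0, citations by others: 4
This paper proposes a new method for removing noise from web pages based on page layout and rules. The method divides a web page into several parts according to its HTML tags, compares the length-to-width ratio attribute of each table, and discards the parts whose ratio is very large. It then analyzes the content of the remaining tables, distinguishing topical content from noise content according to whether tags related to paragraph text are present inside them, and removes the noise content on that basis. Tests on 132,559 web pages from the CWT200G corpus show that the method effectively removes web page noise, reduces the index file by about 75%, greatly increases retrieval speed, and also improves accuracy to some extent.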

