Similar Literature
20 similar documents found (search time: 15 ms)
1.
Legal information certification and secured storage, combined with electronic document signatures, are of great interest where the security and long-term conservation of digital documents are concerned. These new and evolving technologies offer powerful capabilities, such as identification, authentication and certification, which contribute to increasing the overall security of legal digital archives, for both conservation and access. However, the cryptographic and hash-coding concepts currently in use cannot intrinsically carry cognitive information about either the signer or the signed content. An evolution of these technologies may therefore be necessary to enable full-text search across hundreds or thousands of electronically signed documents. This article describes a possible model, along with the associated processes, to create and make use of a new kind of electronic signature called the “meaningful electronic signature”, as opposed to traditional electronic signatures based on bit-per-bit computation.

2.
With the ubiquitous collection of data and the creation of large distributed repositories, enabling search over this data while respecting access control is critical. A related problem is ensuring the privacy of content owners while still maintaining an efficient index of distributed content. We address the problem of providing privacy-preserving search over distributed access-controlled content. Indexed documents can be easily reconstructed from the conventional (inverted) indexes used in search. Currently, avoiding breaches of access control through the index requires the index hosting site to be fully secured and trusted by all participating content providers. This level of trust is impractical in the increasingly common case where multiple competing organizations or individuals wish to selectively share content. We propose a solution that eliminates the need for such a trusted authority. The solution builds a centralized privacy-preserving index in conjunction with a distributed access-control-enforcing search protocol. Two alternative methods to build the centralized index are proposed, allowing trade-offs between efficiency and security. The new index provides strong and quantifiable privacy guarantees that hold even if the entire index is made public. Experiments on a real-life dataset validate the performance of the scheme. The appeal of our solution is twofold: (a) content providers maintain complete control in defining access groups and ensuring compliance with them, and (b) system implementors retain tunable knobs to balance privacy and efficiency concerns for their particular domains. Dr. Vaidya’s work was supported by the National Science Foundation under grant CNS-0746943 and by a research resources grant from Rutgers Business School, Newark and New Brunswick.
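A minimal sketch of the kind of centralized index the abstract describes: the index stores only hashed terms mapped to provider identifiers, so even a public index reveals neither the vocabulary nor any document contents; a querier then contacts the matching providers, which enforce access control locally. All names here (`CentralIndex`, `term_key`, the salt) are hypothetical illustrations, not the paper's actual protocol.

```python
import hashlib

def term_key(term: str, salt: str = "shared-salt") -> str:
    # Hash terms so the central index never stores plaintext vocabulary.
    return hashlib.sha256((salt + term).encode()).hexdigest()[:16]

class CentralIndex:
    """Maps hashed terms to provider IDs only -- documents stay at the providers."""
    def __init__(self):
        self.index = {}

    def publish(self, provider_id: str, terms):
        for t in terms:
            self.index.setdefault(term_key(t), set()).add(provider_id)

    def candidates(self, term: str):
        # Returns providers that *may* hold the term; actual access control
        # is enforced by each provider when it is contacted directly.
        return self.index.get(term_key(term), set())

central = CentralIndex()
central.publish("providerA", ["merger", "audit"])
central.publish("providerB", ["audit"])
print(sorted(central.candidates("audit")))   # ['providerA', 'providerB']
print(sorted(central.candidates("salary")))  # []
```

Publishing only provider-level membership is what makes the index safe to centralize: reconstructing a document from it is impossible by design.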

3.
One promise of current information retrieval systems is the capability to identify risk groups for certain diseases and pathologies through the automatic analysis of vast Electronic Medical Record repositories. However, the complexity and degree of specialization of the language used by experts in this context make this task challenging. In this work, we introduce a novel experimental study to evaluate the performance of two semantic similarity metrics (Path and Intrinsic IC-Path, both widely accepted in the literature) in a real-life information retrieval setting. To achieve this goal, and given the lack of methodologies for this context in the literature, we propose a straightforward information retrieval system for the biomedical field based on the UMLS Metathesaurus and on semantic similarity metrics. In contrast with previous studies, which focus on testbeds with limited and controlled sets of concepts, we use a large amount of information (101,712 medical documents extracted from the TREC Medical Records Track 2011). Our results show that in real-life cases both metrics display similar performance: Path (F-Measure = 0.430) and Intrinsic IC-Path (F-Measure = 0.427). We therefore suggest that the use of Intrinsic IC-Path is not justified in real scenarios.
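The Path metric mentioned above is commonly defined from the length of the shortest path between two concepts in a taxonomy. The following sketch illustrates it on a toy is-a hierarchy standing in for the UMLS Metathesaurus; the graph and concept names are invented:

```python
from collections import deque

# Toy is-a taxonomy (made undirected for path computation).
edges = {
    "disease": ["heart_disease", "diabetes"],
    "heart_disease": ["arrhythmia"],
}
graph = {}
for parent, children in edges.items():
    for c in children:
        graph.setdefault(parent, set()).add(c)
        graph.setdefault(c, set()).add(parent)

def path_similarity(a: str, b: str) -> float:
    # Path metric: 1 / (1 + length of shortest path between the concepts).
    if a == b:
        return 1.0
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, d = frontier.popleft()
        for nxt in graph.get(node, ()):
            if nxt == b:
                return 1.0 / (2 + d)  # path length is d + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return 0.0

print(path_similarity("arrhythmia", "diabetes"))  # 1/(1+3) = 0.25
```

Intrinsic IC-Path additionally weights edges by corpus-independent information content; the abstract's finding is that this extra machinery does not pay off at scale.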

4.
There is significant commercial and research interest in location-based web search engines. Given a number of search keywords and one or more locations (geographical points) that a user is interested in, a location-based web search retrieves and ranks the most textually and spatially relevant web pages. In this type of search, both the spatial and the textual information must be indexed. Currently, no efficient index structure exists that can handle both aspects of the data simultaneously and accurately. Existing approaches either index space and text separately or use inefficient hybrid index structures with poor performance and inaccurate results. Moreover, most of these approaches cannot accurately rank web pages based on a combination of space and text, and are not easy to integrate into existing search engines. In this paper, we propose a new index structure called the Spatial-Keyword Inverted File for Points to handle point-based indexing of web documents in an integrated and efficient manner. To seamlessly find and rank relevant documents, we develop a new distance measure called spatial tf-idf. We propose four variants of spatial-keyword relevance scores and two algorithms to perform top-k searches. As verified by experiments, our proposed techniques outperform existing index structures in terms of both search performance and accuracy.
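One simple way to blend textual and spatial relevance, standing in for the paper's richer spatial tf-idf measure, is a linear combination of a text score and a distance-based score. All weights, thresholds and data below are illustrative assumptions, not the paper's actual scoring:

```python
import math, heapq

def combined_score(text_score, dist, alpha=0.5, max_dist=100.0):
    # Hypothetical linear blend: spatial relevance decays linearly with
    # distance up to max_dist, then the document scores zero spatially.
    spatial = max(0.0, 1.0 - dist / max_dist)
    return alpha * text_score + (1 - alpha) * spatial

docs = [  # (doc_id, textual relevance in [0,1], (x, y) location)
    ("d1", 0.9, (3.0, 4.0)),
    ("d2", 0.4, (0.0, 1.0)),
    ("d3", 0.8, (60.0, 80.0)),
]

def top_k(query_point, k=2):
    qx, qy = query_point
    scored = []
    for doc_id, ts, (x, y) in docs:
        d = math.hypot(x - qx, y - qy)
        scored.append((combined_score(ts, d), doc_id))
    return heapq.nlargest(k, scored)

print([doc for _, doc in top_k((0.0, 0.0))])  # ['d1', 'd2']
```

The point of an integrated index is to compute such combined scores without scanning every document, which a separate spatial index plus text index cannot do efficiently.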

5.
6.
7.
We propose an approach for the word-level indexing of printed documents which are difficult to recognize using current OCR engines. By means of word-level indexing, it is possible to retrieve the position of words in a document, enabling queries involving proximity of terms. Web search engines implement this kind of indexing, allowing users to retrieve Web pages on the basis of their textual content. Nowadays, digital libraries hold collections of digitized documents that can be retrieved either by browsing the document images or by relying on appropriate metadata assembled by domain experts. Word-indexing tools would therefore increase access to these collections. The proposed system is designed to index homogeneous document collections by automatically adapting to different languages and font styles, without relying on OCR engines for character recognition. The approach is based on three main ideas: the use of self-organizing maps (SOM) to perform unsupervised character clustering, the definition of a suitable vector-based word representation whose size depends on the word's aspect ratio, and the run-time alignment of the query word with indexed words to deal with broken and touching characters. The most appropriate applications are in processing older printed documents (17th to 19th centuries), where current OCR engines are less accurate. Our experimental analysis addresses six data sets containing documents ranging from books of the 17th century to contemporary journals.

8.
This article addresses the task of mining concepts from the biomedical literature to index and search a document base. The research takes place within the Telemakus project, whose goal is to support and facilitate the knowledge discovery process by providing retrieval, visualization, and interaction tools to mine and map research findings from the research literature in the field of aging. A concept mining component that automates the extraction of research findings, such as the one presented here, would allow Telemakus to be efficiently applied to other domains. The main strategy followed in this project is to mine research findings from the legends (captions) of the documents, as relationships between concepts from the medical literature. The concept mining proceeds through stages of syntactic analysis, semantic analysis, relationship building, and ranking. Evaluation results, presented at the end, show that the system learns concepts and the relationships between them with good recall, and that these concepts can be used for indexing the documents. Future improvements of the system are also presented.

9.
XML (eXtensible Markup Language) is fast becoming the de facto standard for information exchange over the Internet. As more and more sensitive information is stored in the form of XML, proper access control to XML documents becomes increasingly important. However, traditional access control methodologies adapted for XML documents do not address the performance of access control. This paper proposes a bitmap-indexing scheme that speeds up access control decisions. Authorization policies of the form (subject, object, action) are encoded as bitmaps in the same manner as XML document indexes are constructed. The two are then efficiently pipelined and manipulated for fast access control and secure retrieval of XML documents.
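The bitmap idea can be illustrated in a few lines: if both the authorization policies and the XML node index are bitmaps over the same node ordering, an access decision reduces to a single bitwise AND. The policies and node layout below are invented examples, not the paper's encoding:

```python
# Each bit position corresponds to one XML node in document order.
doc_index = {               # element name -> bitmap of nodes with that element
    "patient": 0b1110,
    "diagnosis": 0b0110,
}
policy = {                  # (subject, action) -> bitmap of accessible nodes
    ("nurse", "read"): 0b0100,
    ("doctor", "read"): 0b1111,
}

def accessible_nodes(subject, action, element):
    # A single AND combines the authorization bitmap with the index bitmap:
    # only nodes that match the element *and* are permitted survive.
    return doc_index.get(element, 0) & policy.get((subject, action), 0)

print(bin(accessible_nodes("nurse", "read", "diagnosis")))   # 0b100
print(bin(accessible_nodes("doctor", "read", "diagnosis")))  # 0b110
```

Because the AND is a constant-time word operation per machine word, the access check adds essentially no overhead to index lookup, which is the performance point the abstract makes.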

10.
11.
The diffusion of the World Wide Web (WWW) and the consequent increase in the production and exchange of textual information demand the development of effective information retrieval systems. The HyperText Markup Language (HTML) constitutes a common basis for generating documents over the Internet and intranets. HTML allows the author to organize the text into subparts delimited by special tags; these subparts are then visualized by the HTML browser in distinct ways, i.e. with distinct typographical formats. In this paper a model for indexing HTML documents is proposed which exploits the role of tags in encoding the importance of the text they delimit. Central to our model is a method to compute the significance degree of a term in a document by weighting the term's instances according to the tags in which they occur. The proposed indexing model is based on a contextual weighted representation of the document under consideration, by means of which a set of (normalized) numerical weights is assigned to the various tags containing the text. The representation is contextual in the sense that the weights assigned to the various tags and their text depend not only on the tags themselves but also on the particular document considered. By means of this contextual weighted representation, our indexing model reflects not only the general syntactic structure of HTML but also the information conveyed by the particular way in which the author instantiates that general structure in the document under consideration. We discuss two different forms of contextual weighting: the first is based on a linear weighted representation and is closer to the standard model of universal (i.e. non-contextual) weighting; the second is based on a more complex non-linear weighted representation and has a number of novel and interesting features.
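A simplified sketch of tag-weighted term significance: each occurrence of a term is weighted by the tag in which it appears. In the paper the weights are contextual, derived per document; here they are fixed, hypothetical values, so this corresponds to the simpler universal-weighting baseline the authors compare against:

```python
# Hypothetical tag weights (the paper derives these per document).
tag_weights = {"title": 1.0, "h1": 0.8, "b": 0.5, "body": 0.2}

def term_significance(occurrences):
    # occurrences: list of (tag, count) pairs for one term in one document.
    # Significance is the tag-weighted sum of the term's instances, so a
    # single title occurrence can outweigh several body occurrences.
    return sum(tag_weights.get(tag, 0.2) * n for tag, n in occurrences)

print(term_significance([("title", 1), ("body", 4)]))  # 1*1.0 + 4*0.2 = 1.8
print(term_significance([("b", 2)]))                   # 2*0.5 = 1.0
```

In the contextual variants, `tag_weights` would itself be a function of how the particular document uses each tag.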

12.
This paper proposes a document image analysis system that extracts newspaper headlines from microfilm images, with a view to providing automatic indexing for news articles on microfilm. A major challenge is the poor image quality of microfilm, as most images are inadequately illuminated and considerably dirty. To overcome this problem we propose a new, effective method for separating characters from the noisy background, since conventional threshold-selection techniques are inadequate for this kind of image. A run-length smoothing algorithm is then applied for headline extraction. Experimental results confirm the validity of the approach.
Received: 15 November 2002, Accepted: 19 May 2003, Published online: 30 January 2004. Correspondence to: Chew Lim Tan
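The run-length smoothing algorithm (RLSA) mentioned above works, after binarization, by filling short background runs between black pixels so that characters merge into solid text blocks such as headlines. A one-dimensional sketch (the threshold value is an arbitrary assumption; real RLSA applies this horizontally and vertically over the whole image):

```python
def rlsa_horizontal(row, threshold=3):
    # Flip interior runs of 0s (background) of length <= threshold to 1s,
    # merging nearby black pixels into contiguous blocks.
    out = row[:]
    i, n = 0, len(row)
    while i < n:
        if row[i] == 0:
            j = i
            while j < n and row[j] == 0:
                j += 1
            # Fill only gaps *between* black runs, not the margins.
            if 0 < i and j < n and (j - i) <= threshold:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out

row = [1, 0, 0, 1, 0, 0, 0, 0, 1, 1]
print(rlsa_horizontal(row))  # [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
```

The short two-pixel gap is bridged while the four-pixel gap survives, which is how RLSA separates inter-character spacing from inter-column spacing.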

13.
Information Systems, 2001, 26(2): 75–92
As information becomes available on the World Wide Web in larger quantities and in more disparate formats and media, adequate search engines and portal services providing search and filtering modalities tailored to the needs of the various communities of users become essential components of the global information infrastructure. Global-Atlas is a geographical search engine leveraging the cartographic paradigm for the indexing and searching of World Wide Web documents. Documents are indexed according to their geographical footprint, i.e. the bounding box of the geographical region to which they relate. Documents are searched by interactively drawing bounding boxes on maps. The effectiveness of Global-Atlas depends on the capability to quickly and accurately index documents and maps. Maps, however, come in a variety of unspecified coordinate systems and projections. In this paper, we evaluate various surface-fitting techniques for the calibration of maps and devise an original hybrid calibration method from the empirical results obtained.

14.
International Journal on Digital Libraries - Indexing documents with controlled vocabularies enables a wealth of semantic applications for digital libraries. Due to the rapid growth of scientific...

15.
An automatic keyphrase extraction system for scientific documents
Automatic keyphrase extraction techniques play an important role in many tasks, including indexing, categorizing, summarizing, and searching. In this paper, we develop and evaluate an automatic keyphrase extraction system for scientific documents. Compared with previous work, our system concentrates on two important issues: (1) more precise location of potential keyphrases: a new candidate phrase generation method is proposed based on a core-word expansion algorithm, which can reduce the size of the candidate set by about 75% without increasing the computational complexity; (2) overlap elimination for the output list: when a phrase and its sub-phrases coexist as candidates, an inverse document frequency feature is introduced for selecting the proper granularity. Additional new features are added for phrase weighting. Experiments based on real-world datasets were carried out to evaluate the proposed system. The results show the efficiency and effectiveness of the refined candidate set and demonstrate that the new features improve the accuracy of the system. The overall performance of our system compares favorably with other state-of-the-art keyphrase extraction systems.
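Overlap elimination via inverse document frequency, as in issue (2), can be sketched as follows: when a phrase and its sub-phrases coexist as candidates, keep the granularity that is most discriminative, i.e. has the highest IDF. The document frequencies and collection size below are invented for illustration:

```python
import math

# Hypothetical document frequencies over a 1000-document collection.
doc_freq = {"neural network": 120, "network": 700, "neural": 400}
N = 1000

def idf(phrase):
    # Rarer phrases get higher IDF; unseen phrases default to df = 1.
    return math.log(N / doc_freq.get(phrase, 1))

def eliminate_overlaps(candidates):
    # Greedily keep candidates in decreasing IDF order, skipping any
    # candidate that is a sub-phrase or super-phrase of one already kept.
    kept = []
    for c in sorted(candidates, key=idf, reverse=True):
        if not any(c in k or k in c for k in kept):
            kept.append(c)
    return kept

print(eliminate_overlaps(["neural network", "network", "neural"]))
# ['neural network']
```

Here the full phrase wins because it is rarer (more specific) than either of its sub-phrases, which matches the intuition behind using IDF to pick granularity.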

16.
A statistically based segmentation method was applied to the recognition of organs in abdominal CT scans. To incorporate prior knowledge of anatomical structure, a stochastic model was used that represented abdominal geometry in three dimensions. Properties of each tissue class with respect to X-ray imaging were modeled by mean grey-value distributions. Twelve different tissues were labeled simultaneously. Deterministic maximization of the a posteriori distribution as well as stochastic optimization by simulated annealing were both applied. Mean segmentation results were determined for a set of 18 scan sequences, using a set of reference contours designated by a radiologist as ground truth. Results appeared to be somewhat better using simulated annealing. The proposed segmentation method, which is fast and fully automatic, seems sufficiently accurate for many clinical applications, such as determination of relative organ volumes.

17.
Character groundtruth for real, scanned document images is crucial for evaluating the performance of OCR systems, training OCR algorithms, and validating document degradation models. Unfortunately, manual collection of accurate groundtruth for characters in a real (scanned) document image is not practical because (i) the accuracy of manually delineated groundtruth character bounding boxes is not high enough, (ii) it is extremely laborious and time consuming, and (iii) the manual labor required for this task is prohibitively expensive. We describe a closed-loop methodology for collecting very accurate groundtruth for scanned documents. We first create ideal documents using a typesetting language, then create the groundtruth for each ideal document. The ideal document is printed, photocopied and then scanned. A registration algorithm estimates the global geometric transformation and then performs a robust local bitmap match to register the ideal document image to the scanned document image. Finally, the groundtruth associated with the ideal document image is transformed using the estimated geometric transformation to create the groundtruth for the scanned document image. This methodology is very general and can be used to create groundtruth for documents typeset in any language, layout, font, and style. We have demonstrated the method by generating groundtruth for English, Hindi, and FAX document images. The cost of creating groundtruth using our methodology is minimal. If character, word or zone groundtruth is available for any real document, the registration algorithm can be used to generate the corresponding groundtruth for a rescanned version of the document.
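The final step, mapping ideal-document groundtruth into scan coordinates, amounts to applying the estimated geometric transform to each character bounding box. A sketch using a simple scale-plus-translation model; the actual registration estimates a more general transform, and the parameter values here are invented:

```python
def transform_box(box, scale, tx, ty):
    # Apply an estimated global transform (here: uniform scale followed by
    # translation) to an ideal-document bounding box (x0, y0, x1, y1),
    # producing the corresponding box in scanned-image coordinates.
    x0, y0, x1, y1 = box
    return (scale * x0 + tx, scale * y0 + ty,
            scale * x1 + tx, scale * y1 + ty)

# A groundtruth box from the ideal document, mapped into scan coordinates.
ideal_box = (10, 20, 50, 40)
print(transform_box(ideal_box, scale=2.0, tx=5.0, ty=-3.0))
# (25.0, 37.0, 105.0, 77.0)
```

Because the transform is estimated once globally and refined by local bitmap matching, every box inherits scan-level accuracy without any manual delineation.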

18.
This paper describes the DocMIR system, which automatically captures, analyzes and indexes meetings, conferences, lectures, etc. by taking advantage of the documents projected during the events (e.g. slideshows, budget tables, figures). For instance, the system can automatically apply the above-mentioned procedures to a lecture and index the event according to the presented slides and their contents. For indexing, the system requires neither specific software installed on the presenter’s computer nor any conscious intervention by the speaker during the presentation. The only material required by the system is the speaker’s electronic presentation file; even if it is not provided, the system will temporally segment the presentation and offer a simple storyboard-like browsing interface. The system runs on several capture boxes connected to cameras and microphones that record events synchronously. Once the recording is over, indexing is performed automatically by analyzing the content of the captured video of the projected documents: the system detects scene changes, identifies the documents, computes their durations and extracts their textual content. Each captured image is identified from a repository containing all original electronic documents, captured audio–visual data and metadata created during post-production. The identification is based on document signatures, which hierarchically structure features from both the layout structure and the color distributions of the document images. Video segments are finally enriched with the textual content of the identified original documents, which further facilitates query and retrieval without using OCR. The signature-based indexing method proposed in this article is robust, works with low-resolution images, and can be applied to several other applications, including real-time document recognition, multimedia IR and augmented reality systems.
Correspondence to: Rolf Ingold

19.
Targeting the characteristics of microblog text, this paper proposes a method for automatically identifying index terms in microblog posts. A graph adjacency matrix is constructed from the semantic similarity between the nouns and verbs in the microblog text; on top of this adjacency matrix, the idea of the PageRank algorithm is used to compute the importance of each word, and the words with the highest importance are selected as index terms. Experimental results show that, compared with traditional automatic indexing methods, the proposed method is simple, practical, and more accurate.
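The PageRank-style ranking over a word-similarity graph described above can be sketched with a plain power iteration over a weighted adjacency matrix. The similarity values and the three-term vocabulary below are invented toy data:

```python
def pagerank(adj, damping=0.85, iters=50):
    # adj: symmetric weight matrix (semantic similarity between words).
    # Each node distributes its rank to neighbors in proportion to edge weight.
    n = len(adj)
    rank = [1.0 / n] * n
    out_sum = [sum(row) for row in adj]
    for _ in range(iters):
        new = []
        for i in range(n):
            s = sum(adj[j][i] / out_sum[j] * rank[j]
                    for j in range(n) if out_sum[j] > 0)
            new.append((1 - damping) / n + damping * s)
        rank = new
    return rank

# Toy similarity matrix for three candidate index terms.
sim = [
    [0.0, 0.9, 0.7],
    [0.9, 0.0, 0.1],
    [0.7, 0.1, 0.0],
]
ranks = pagerank(sim)
best = max(range(3), key=lambda i: ranks[i])
print(best)  # 0 -- the term most strongly connected to the others
```

Terms whose similarity edges connect them to many other important terms accumulate rank, and the top-ranked words become the index terms.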

20.
This paper proposes a PKI-based security management method for electronic official documents (SMMED). The method provides strong security services for electronic documents, including confidentiality, integrity and non-repudiation. The theoretical model of SMMED is elaborated in detail, the architecture of an online bidding system that applies SMMED is described, and finally the advantages of the method are summarized.

