Similar Documents
1.
This paper describes a minimally immersive three-dimensional volumetric interactive information visualization system for management and analysis of document corpora. The system, SFA, uses glyph-based volume rendering, enabling more complex data relationships and information attributes to be visualized than traditional 2D and surface-based visualization systems. Two-handed interaction using three-space magnetic trackers and stereoscopic viewing are combined to produce a minimally immersive interactive system that enhances the user’s three-dimensional perception of the information space. This new system capitalizes on the human visual system’s pre-attentive learning capabilities to quickly analyze the displayed information. SFA is integrated with a document management and information retrieval engine named Telltale. Together, these systems integrate visualization and document analysis technologies to solve the problem of analyzing large document corpora. We describe the usefulness of this system for the analysis and visualization of document similarity within a corpus of textual documents, and present an example exploring authorship of ancient Biblical texts. Received: 15 December 1997 / Revised: June 1999

2.
3.
Text Retrieval from Document Images Based on Word Shape Analysis
In this paper, we propose a method of text retrieval from document images using a similarity measure based on word shape analysis. We directly extract image features instead of using optical character recognition. Document images are segmented into word units, and then features called vertical bar patterns are extracted from these word units through local extrema point detection. All vertical bar patterns are used to build document vectors. Lastly, we obtain the pair-wise similarity of document images by means of the scalar product of the document vectors. Four corpora of news articles were used to test the validity of our method. During the test, the similarity of document images computed with this method was compared with the results obtained from the ASCII versions of those documents using an N-gram algorithm for text documents.

4.
Extraction and normalization of temporal expressions from documents are important steps towards deep text understanding and a prerequisite for many NLP tasks such as information extraction, question answering, and document summarization. There are different ways to express (the same) temporal information in documents. However, after identifying temporal expressions, they can be normalized according to some standard format. This allows the usage of temporal information in a term- and language-independent way. In this paper, we describe the challenges of temporal tagging in different domains, give an overview of existing annotated corpora, and survey existing approaches for temporal tagging. Finally, we present our publicly available temporal tagger HeidelTime, which is easily extensible to further languages due to its strict separation of source code and language resources like patterns and rules. We present a broad evaluation on multiple languages and domains on existing corpora as well as on a newly created corpus for a language/domain combination for which no annotated corpus has been available so far.
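The identify-then-normalize idea the abstract describes can be sketched with a toy normalizer. This is not the HeidelTime API; it is a hypothetical, pattern-rule illustration that maps two English surface date forms to a TIMEX3-style value (YYYY-MM-DD or YYYY-MM).

```python
import re

# Hypothetical minimal normalizer (illustration only): a rule file in a real
# tagger would hold many such patterns per language.
MONTHS = {"january": 1, "february": 2, "march": 3, "april": 4, "may": 5,
          "june": 6, "july": 7, "august": 8, "september": 9,
          "october": 10, "november": 11, "december": 12}

def normalize(expr):
    """Return a TIMEX3-style value for a recognized date expression, else None."""
    m = re.fullmatch(r"(\w+) (\d{1,2}), (\d{4})", expr)   # e.g. "June 3, 1999"
    if m and m.group(1).lower() in MONTHS:
        return f"{m.group(3)}-{MONTHS[m.group(1).lower()]:02d}-{int(m.group(2)):02d}"
    m = re.fullmatch(r"(\w+) (\d{4})", expr)              # e.g. "June 1999"
    if m and m.group(1).lower() in MONTHS:
        return f"{m.group(2)}-{MONTHS[m.group(1).lower()]:02d}"
    return None
```

Keeping such patterns in data rather than code is exactly the separation the abstract credits for HeidelTime's extensibility to new languages.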

5.
Discovering unexpected documents in corpora
Text mining is widely used to discover frequent patterns in large corpora of documents. Hence, many classical data mining techniques, which have proven fruitful in the context of data stored in relational databases, are now successfully used in the context of textual data. Nevertheless, there are many situations where it is more valuable to discover unexpected information rather than frequent patterns. In the context of technology watch, for example, we may want to discover new trends in specific markets, or discover what competitors are planning in the near future, etc. This paper is related to that context of research. We have proposed several unexpectedness measures and implemented them in a prototype, called UnexpectedMiner, that can be used by watchers in order to discover unexpected documents in large corpora of documents (patents, datasheets, advertisements, scientific papers, etc.). UnexpectedMiner is able to take into account the structure of documents during the discovery of unexpected information. Many experiments have been performed in order to validate our measures and show the interest of our system.
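To make the notion of an unexpectedness measure concrete, here is one illustrative (and deliberately simple) score, not the paper's actual measures: a document is more unexpected the rarer its terms are relative to the background corpus.

```python
def unexpectedness(doc_terms, background_counts, total_background):
    """Average rarity of a document's terms w.r.t. background term counts.
    An illustrative measure only; the paper proposes several richer ones,
    including structure-aware variants."""
    if not doc_terms:
        return 0.0
    score = 0.0
    for t in doc_terms:
        p = background_counts.get(t, 0) / total_background
        score += 1.0 - p   # terms unseen in the background contribute close to 1
    return score / len(doc_terms)
```

A watcher would then rank documents by this score and inspect the top of the list for emerging trends.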

6.
In today's Internet era of explosive growth in information, how to analyze the information in massive volumes of text and extract the valuable parts it contains is a problem worth attention. However, forum corpora, as web corpora, are more complex in structure and content than general corpora, and their texts are shorter. The method proposed in this paper uses the LDA model to build a model of the corpus and extract its topics; for each topic in the generated topic space, it finds the documents supporting that topic and computes the document support rate as the topic strength. Plotting topic strength along the time axis yields the topic's strength trend, and by re-modeling the corpus over different time periods and combining the results with global topics, the topic's content evolution path is obtained. Experimental results show that this method is reasonable and effective.
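The support-rate computation described above can be sketched as follows. This is a hedged illustration, not the paper's code: the per-document topic probabilities are assumed to come from an already-trained LDA model, and the support threshold is an assumed parameter.

```python
from collections import defaultdict

def topic_strength_by_period(doc_topics, doc_periods, threshold=0.2):
    """doc_topics: list of {topic_id: probability} dicts, one per document
    (e.g. from an LDA model); doc_periods: parallel list of period labels.
    A document supports a topic when its probability exceeds the threshold;
    strength = supporting documents / documents in that period."""
    support = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for topics, period in zip(doc_topics, doc_periods):
        totals[period] += 1
        for topic, p in topics.items():
            if p > threshold:
                support[period][topic] += 1
    return {period: {t: c / totals[period] for t, c in tc.items()}
            for period, tc in support.items()}
```

Reading the per-period strengths for one topic in time order gives the strength trend the abstract refers to.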

7.
Imaged document text retrieval without OCR
We propose a method for text retrieval from document images without the use of OCR. Documents are segmented into character objects. Image features, namely the vertical traverse density (VTD) and horizontal traverse density (HTD), are extracted. An n-gram-based document vector is constructed for each document based on these features. Text similarity between documents is then measured by calculating the dot product of the document vectors. Testing with seven corpora of imaged textual documents in English and Chinese as well as images from the UW1 (University of Washington 1) database confirms the validity of the proposed method.
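A traverse-density feature of the kind named above can be illustrated on a binary image. This is a simplified reading of the feature, assumed for illustration: VTD here counts background-to-ink transitions per column when scanning top to bottom (HTD would do the same per row).

```python
def vertical_traverse_density(image):
    """image: 2-D list of 0/1 pixels (1 = ink).  Returns, per column, the number
    of background-to-ink transitions scanning top to bottom.  A simplified
    illustration of the VTD feature, not the paper's exact definition."""
    if not image:
        return []
    cols = len(image[0])
    vtd = []
    for x in range(cols):
        transitions, prev = 0, 0
        for row in image:
            if row[x] == 1 and prev == 0:
                transitions += 1
            prev = row[x]
        vtd.append(transitions)
    return vtd
```

Such per-object feature vectors can then be combined into n-gram document vectors and compared with a dot product, as the abstract outlines.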

8.
One of the largest areas of growth in the computer industry is that of relational data base management systems (RDBMSs). Despite the plethora of such systems currently available, only a distinct few can store and manipulate unstructured textual data, thereby enabling users to manage correspondence and other documents more effectively. Full text retrieval software, which enables users to create a data base of documents that can then be queried in various ways, is the most efficient means of avoiding the heavy costs of manually storing and retrieving information. This article discusses full text retrieval software and its potential for any organization that needs a more effective document management system.

10.
Exploratory analysis is an area of increasing interest in the computational linguistics arena. Pragmatically speaking, exploratory analysis may be paraphrased as natural language processing by means of analyzing large corpora of text. Appropriate means for such analysis are statistics on the one hand and artificial neural networks on the other. As a challenging application area for exploratory analysis of text corpora we may certainly identify text databases, be they information retrieval or information filtering systems. With this paper we present recent findings of exploratory analysis based on both statistical and neural models applied to legal text corpora. Concerning the artificial neural networks, we rely on a model adhering to the unsupervised learning paradigm. This choice appears natural when taking into account the specific properties of large text corpora, where one is faced with the fact that input-output mappings as required by supervised learning models cannot be provided beforehand to a satisfying extent, owing to the highly changing contents of text archives. In a nutshell, artificial neural networks are notable for their highly robust behavior with respect to the parameters of model optimization. In particular, we found statistical classification techniques much more susceptible to minor parameter variations than unsupervised artificial neural networks. In this paper we describe two different lines of research in exploratory analysis. First, we use the classification methods for concept analysis. The general goal is to uncover different meanings of one and the same natural language concept, a task that is of specific importance during the creation of thesauri. As a convenient environment to present the results we selected the legal term of neutrality, which is a perfect representative of a concept having a number of highly divergent meanings.
Second, we describe the classification methods in the setting of document classification. The ultimate goal in such an application is to uncover semantic similarities of various text documents in order to increase the efficiency of an information retrieval system. In this sense, document classification has had its fixed position in information retrieval research from the very beginning. Nowadays, renewed massive interest in document classification may be witnessed due to the appearance of large-scale digital libraries.

11.
Field Association (FA) Terms, words or phrases that serve to identify document fields, are effective in document classification, similar file retrieval, and passage retrieval. But a problem lies in the lack of an effective method to extract and select relevant FA Terms to build a comprehensive dictionary of FA Terms. This paper presents a new method to extract, select, and rank FA Terms from domain-specific corpora using part-of-speech (POS) pattern rules, corpora comparison, and modified tf-idf weighting. Experimental evaluation on 21 fields using 306 MB of domain-specific corpora obtained from English Wikipedia dumps selected up to 2,517 FA Terms (single and compound) per field at precision of 74–97% and recall of 65–98%. This is better than the traditional methods. The FA Terms dictionary constructed using this method achieved an average accuracy of 97.6% in identifying the fields of 10,077 test documents collected from Wikipedia, the Reuters RCV1 corpus, and the 20 Newsgroups data set.
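The corpora-comparison and tf-idf-style ranking step can be sketched as follows. This is an illustrative weighting, not the paper's exact modified tf-idf: terms frequent in one field's documents but rare across the rest of the collection score high as FA Term candidates.

```python
import math
from collections import Counter

def rank_fa_terms(field_docs, other_docs, top_k=10):
    """field_docs / other_docs: lists of token lists.  Scores each term seen in
    the field by term frequency there times an idf over the whole collection.
    An assumed simplification of the paper's modified tf-idf weighting."""
    field_tf = Counter(t for doc in field_docs for t in doc)
    df = Counter()
    all_docs = field_docs + other_docs
    for doc in all_docs:
        df.update(set(doc))          # document frequency across both corpora
    n = len(all_docs)
    scores = {t: tf * math.log(n / df[t]) for t, tf in field_tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Candidates ranked this way would still be filtered by the POS pattern rules the abstract mentions before entering the dictionary.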

12.
Digitalization has changed the way information is processed, and new techniques of legal data processing are evolving. Text mining helps to analyze and search court cases available in the form of digital text documents in order to extract case reasoning and related data. This sort of case processing helps professionals and researchers refer to previous cases with more accuracy in less time. The rapid development of judicial ontologies promises interesting solutions for legal knowledge formalization. Mining context information from corpora through ontologies is a challenging and interesting field. This paper presents a three-tier contextual text mining framework through ontologies for judicial corpora. The framework comprises the judicial corpus, text mining processing resources, and ontologies for mining contextual text from corpora, making text and data mining more reliable and fast. A top-down ontology construction approach has been adopted in this paper. The judicial corpus has been selected with a sufficient dataset to process and evaluate the results. The experimental results and evaluations show significant improvements over the available techniques.

13.
We explore how to organize large text databases hierarchically by topic to aid better searching, browsing and filtering. Many corpora, such as internet directories, digital libraries, and patent databases, are manually organized into topic hierarchies, also called taxonomies. Similar to indices for relational data, taxonomies make search and access more efficient. However, the exponential growth in the volume of on-line textual information makes it nearly impossible to maintain such taxonomic organization for large, fast-changing corpora by hand. We describe an automatic system that starts with a small sample of the corpus in which topics have been assigned by hand, and then updates the database with new documents as the corpus grows, assigning topics to these new documents with high speed and accuracy. To do this, we use techniques from statistical pattern recognition to efficiently separate the feature words, or discriminants, from the noise words at each node of the taxonomy. Using these, we build a multilevel classifier. At each node, this classifier can ignore the large number of “noise” words in a document. Thus, the classifier has a small model size and is very fast. Owing to the use of context-sensitive features, the classifier is very accurate. As a by-product, we can compute for each document a set of terms that occur significantly more often in it than in the classes to which it belongs. We describe the design and implementation of our system, stressing how to exploit standard, efficient relational operations like sorts and joins. We report on experiences with the Reuters newswire benchmark, the US patent database, and web document samples from Yahoo!. We discuss applications where our system can improve searching and filtering capabilities. Received January 25, 1998 / Accepted May 27, 1998
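Separating discriminants from noise words at a taxonomy node can be illustrated with a simple statistic. This is a hedged sketch, not the paper's method: a term's score here is the spread of its relative frequency across the node's child classes, so terms that occur about equally everywhere (noise words like "the") score near zero.

```python
from collections import Counter

def select_discriminants(class_docs, top_k=5):
    """class_docs: {class_name: list of token lists}.  Scores each term by the
    spread of its relative frequency across classes; near-uniform terms are
    treated as noise words.  An illustrative statistic only."""
    per_class = {c: Counter(t for d in docs for t in d)
                 for c, docs in class_docs.items()}
    totals = {c: sum(cnt.values()) or 1 for c, cnt in per_class.items()}
    vocab = set().union(*per_class.values())
    scores = {}
    for t in vocab:
        rates = [per_class[c][t] / totals[c] for c in per_class]
        scores[t] = max(rates) - min(rates)   # high spread = discriminative
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Keeping only high-spread terms per node is what lets a multilevel classifier stay small and fast, as the abstract argues.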

14.
Automatic text classification based on the vector space model (VSM), artificial neural networks (ANN), K-nearest neighbor (KNN), Naive Bayes (NB), and support vector machines (SVM) has been applied to English-language documents and has gained popularity among text mining and information retrieval (IR) researchers. This paper proposes the application of VSM and ANN to the classification of Tamil-language documents. Tamil is a morphologically rich Dravidian classical language. The development of the internet led to an exponential increase in the amount of electronic documents, not only in English but also in other regional languages. The automatic classification of Tamil documents has not been explored in detail so far. In this paper, a corpus is used to construct and test the VSM and ANN models. Methods of document representation and of assigning weights that reflect the importance of each term are discussed. In a traditional word-matching based categorization system, the most popular document representation is the VSM. This method needs a high-dimensional space to represent the documents; the ANN classifier requires a smaller number of features. The experimental results show that the ANN model achieves 93.33% accuracy, better than the 90.33% of the VSM, on Tamil document classification.

15.
Recent advances in information and networking technologies have contributed significantly to global connectivity and greatly facilitated and fostered information creation, distribution, and access. The resultant ever-increasing volume of online textual documents creates an urgent need for new text mining techniques that can intelligently and automatically extract implicit and potentially useful knowledge from these documents for decision support. This research focuses on identifying and discovering event episodes together with their temporal relationships that occur frequently (referred to as evolution patterns (EPs) in this paper) in sequences of documents. The discovery of such EPs can be applied in domains such as knowledge management and used to facilitate existing document management and retrieval techniques [e.g., event tracking (ET)]. Specifically, we propose and design an EP discovery technique for mining EPs from sequences of documents. We experimentally evaluate our proposed EP technique in the context of facilitating ET. Measured by miss and false alarm rates, the EP-supported ET (EPET) technique exhibits better tracking effectiveness than a traditional ET technique. The encouraging performance of the EPET technique demonstrates the potential usefulness of EPs in supporting ET and suggests that the proposed EP technique could effectively discover event episodes and EPs in sequences of documents.

16.
FacetAtlas: multifaceted visualization for rich text corpora
Documents in rich text corpora usually contain multiple facets of information. For example, an article about a specific disease often consists of different facets such as symptom, treatment, cause, diagnosis, prognosis, and prevention. Thus, documents may have different relations based on different facets. Powerful search tools have been developed to help users locate lists of individual documents that are most related to specific keywords. However, there is a lack of effective analysis tools that reveal the multifaceted relations of documents within or across document clusters. In this paper, we present FacetAtlas, a multifaceted visualization technique for visually analyzing rich text corpora. FacetAtlas combines search technology with advanced visual analytical tools to convey both global and local patterns simultaneously. We describe several unique aspects of FacetAtlas, including (1) node cliques and multifaceted edges, (2) an optimized density map, (3) automated opacity pattern enhancement for highlighting visual patterns, and (4) interactive context switching between facets. In addition, we demonstrate the power of FacetAtlas through a case study that targets patient education in the health care domain. Our evaluation shows the benefits of this work, especially in support of complex multifaceted data analysis.

17.
We compare different strategies to apply statistical machine translation techniques in order to retrieve documents that are a plausible translation of a given source document. Finding the translated version of a document is a relevant task; for example, when building a corpus of parallel texts that can help to create and evaluate new machine translation systems.

In contrast to the traditional settings in cross-language information retrieval tasks, in this case both the source and the target text are long and, thus, the procedure used to select which words or phrases will be included in the query has a key effect on the retrieval performance. In the statistical approach explored here, both the probability of the translation and the relevance of the terms are taken into account in order to build an effective query.
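The query-building step described above, combining translation probability with term relevance, can be sketched as follows. The combination rule and names here are assumptions for illustration, not the paper's exact model: each candidate target term is scored by its translation probability times an idf-style relevance weight, and the top terms form the query.

```python
import math

def build_query(source_terms, translation_probs, df, n_docs, max_terms=5):
    """source_terms: terms from the source document; translation_probs:
    {source_term: {target_term: p}} from a statistical translation model;
    df: document frequencies in the target collection of n_docs documents.
    Keeps the best-scoring target terms as the retrieval query."""
    scored = {}
    for s in source_terms:
        for t, p in translation_probs.get(s, {}).items():
            idf = math.log((n_docs + 1) / (df.get(t, 0) + 1))
            scored[t] = max(scored.get(t, 0.0), p * idf)
    return sorted(scored, key=scored.get, reverse=True)[:max_terms]
```

With long documents on both sides, this kind of selective query keeps only terms that are both likely translations and discriminative in the target collection.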

18.
Most written materials consist of multimedia (MM) information because, besides text, they usually contain images. Present information retrieval and filtering systems use only the text parts of documents or, at best, images represented by keywords or image captions. Why not use both the text and image features of documents, and in the retrieval or filtering process exploit the documents' information content more completely? Can such an approach increase the effectiveness of retrieval and filtering? At an abstract level, there is very little difference between retrieval and filtering. In this paper, we discuss some possible similarities and differences between them at the application level, drawing on experiments in retrieval and filtering of multimedia mineral information.

19.
Small displays on mobile handheld devices, such as personal digital assistants (PDAs) and cellular phones, are the bottlenecks for usability of most content browsing applications. Generally, conventional content such as documents and Web pages needs to be modified for effective presentation on mobile devices. This paper proposes a novel visualization for documents, called multimedia thumbnails, which consists of text and image content converted into playable multimedia clips. A multimedia thumbnail utilizes the visual and audio channels of small portable devices as well as both spatial and time dimensions to communicate the text and image information of a single document. The proposed algorithm for generating multimedia thumbnails includes 1) a semantic document analysis step, where salient content from a source document is extracted; 2) an optimization step, where a subset of this extracted content is selected based on time, display, and application constraints; and 3) a composition step, where the selected visual and audible document content is combined into a multimedia thumbnail. The scalability of MMNails, which allows generation of multimedia clips of various lengths, is also described. A user study is presented that evaluates the effectiveness of the proposed multimedia thumbnail visualization.

20.
Taylor  S.M. 《IT Professional》2004,6(6):28-34
Most readily available tools - basic search engines, possibly a news or information service, and perhaps agents and Web crawlers - are inadequate for many information retrieval tasks and downright dangerous for others. These tools either return too much useless material or miss important material. Even when such tools find useful information, the data is still in a text form that makes it difficult to build displays or diagrams. Employing the data in data mining or standard database operations, such as sorting and counting, can also be difficult. An emerging technology called information extraction (IE) is beginning to change all that, and you might already be using some very basic IE tools without even knowing it. Companies are increasingly applying IE behind the scenes to improve information and knowledge management applications such as text search, text categorization, data mining, and visualization (Rao, 2003). IE has also begun playing a key role in fields such as national security, law enforcement, insurance, and biomedical research, which have highly critical information and knowledge needs. In these fields, IE's powerful capabilities are necessary to save lives or substantial investments of time and money. IE views language up close, considering grammar and vocabulary, and tries to determine the details of "who did what to whom" from a piece of text. In its most in-depth applications, IE is domain focused; it does not try to define all the events or relationships present in a piece of text, but focuses only on items of particular interest to the user organization.
