首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
2.
Given a specific information need, documents of the wrong genre can be considered as noise. From this perspective, genre classification helps to separate relevant documents from noise. Orthographic errors represent a second, finer notion of noise. Since specific genres often include documents with many errors, an interesting question is whether this “micro-noise” can help to classify genre. In this paper we consider both problems. After introducing a comprehensive hierarchy of genres, we present an intuitive method to build specialized and distinctive classifiers that also work for very small training corpora. Special emphasis is given to the selection of intelligent high-level features. We then investigate the correlation between genre and micro noise. Using special error dictionaries, we estimate the typical error rates for each genre. Finally, we test if the error rate of a document represents a useful feature for genre classification.  相似文献   

3.
TEXPROS (TEXt PROcessing System) is an automatic document processing system which supports text-based information representation and manipulation, conveying meanings from stored information within office document texts. A dual modeling approach is employed to describe office documents and support document search and retrieval. The frame templates for representing document classes are organized to form a document type hierarchy. Based on its document type, the synopsis of a document is extracted to form its corresponding frame instance. According to the user predefined criteria, these frame instances are stored in different folders, which are organized as a folder organization (i.e., repository of frame instances associated with their documents). The concept of linking folders establishes filing paths for automatically filing documents in the folder organization. By integrating document type hierarchy and folder organization, the dual modeling approach provides efficient frame instance access by limiting the searches to those frame instances of a document type within those folders which appear to be the most similar to the corresponding queries.This paper presents an agent-based document filing system using folder organization. A storage architecture is presented to incorporate the document type hierarchy, folder organization and original document storage into a three-level storage system. This folder organization supports effective filing strategy and allows rapid frame instance searches by confining the search to the actual predicate-driven retrieval method. A predicate specification is proposed for specifying criteria on filing paths in terms of user predefined predicates for governing the document filing. A method for evaluating whether a given frame instance satisfies the criteria of a filing path is presented. The basic operations for constructing and reorganizing a folder organization are proposed.  相似文献   

4.
中文文本体裁的自动分类机制   总被引:1,自引:0,他引:1  
文本按体裁自动分类属于按文本的形式分类的范畴,所以它与按内容自动分类问题有许多的不同之处,本文提出了一种关于中文文本体裁自动分类的新机制。在体裁分类过程中首要的问题是分类特征的选取,体裁分类特征项分为两种方式加以描述,一是集合形式,如基于分类词典和语料统计的政论性词汇和情感词汇等,二是规则形式,如公文标识信息和条文句等。基于根据特征之间的关联性和差异性,采用样本分布决策的方法抽取相应的特征项。最后利用支撑向量机算法进行自动分类。该机制已经在五类体裁的语料上得到实现,并获得了较好的效果。  相似文献   

5.
6.
With the rapid growth of web, automatic tagging that detects informative terms from a document becomes an important problem for information aggregation and sharing services. In particular, automatic tagging for short documents becomes more interesting as many users are increasingly publishing information through social media services which encourage users to create the documents of short length. In this paper, we propose a novel automatic tagging model for short text documents from social media services, following the framework of supervised learning. We redefine traditional frequency-based term features so that they can address the properties of the documents created under length limitation and consider sequential dependencies between successive terms in a document based on a structural support vector machine. In addition, our proposed approach incorporates composition patterns by which users put informative terms into their documents. Extensive experiments have been conducted to validate the presented approach, and it was found that the proposed term features were effective for extracting tags, and the tag extractor trained by considering the sequential dependencies and composition patterns achieved superior performance results over the existing alternative methods.  相似文献   

7.
Automatic keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Keyword extraction enables us to represent text documents in a condensed way. The compact representation of documents can be helpful in several applications, such as automatic indexing, automatic summarization, automatic classification, clustering and filtering. For instance, text classification is a domain with high dimensional feature space challenge. Hence, extracting the most important/relevant words about the content of the document and using these keywords as the features can be extremely useful. In this regard, this study examines the predictive performance of five statistical keyword extraction methods (most frequent measure based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction and TextRank algorithm) on classification algorithms and ensemble methods for scientific text document classification (categorization). In the study, a comprehensive study of comparing base learning algorithms (Naïve Bayes, support vector machines, logistic regression and Random Forest) with five widely utilized ensemble methods (AdaBoost, Bagging, Dagging, Random Subspace and Majority Voting) is conducted. To the best of our knowledge, this is the first empirical analysis, which evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. The classification schemes are compared in terms of classification accuracy, F-measure and area under curve values. To validate the empirical analysis, two-way ANOVA test is employed. The experimental analysis indicates that Bagging ensemble of Random Forest with the most-frequent based keyword extraction method yields promising results for text classification. For ACM document collection, the highest average predictive performance (93.80%) is obtained with the utilization of the most frequent based keyword extraction method with Bagging ensemble of Random Forest algorithm. In general, Bagging and Random Subspace ensembles of Random Forest yield promising results. The empirical analysis indicates that the utilization of keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.  相似文献   

8.
9.
10.
The problem addressed in this paper is the automatic extraction of names from a document image. Our approach relies on the combination of two complementary analyses. First, the image-based analysis exploits visual clues to select the regions of interest in the document. Second, the textual-based analysis searches for name patterns and low-level word textual features. Both analyses are then combined at the word level through a neural network fusion scheme. Reported results on degraded documents such as facsimile and photocopied technical journals demonstrate the interest of the combined approach.  相似文献   

11.
体裁是信息检索中重要的上下文因素之一.文章阐述了体裁的含义,重点说明了数字体裁的含义、识别与描述,介绍了体裁在信息检索中的应用现状,并分析了体裁在应用中所面临的识别、标注等问题;同时指出在未来发展中,体裁作为检索目标与文档目标的表现方式之一,应独立于内容与用户当前任务进行单独匹配.为将体裁作为独立维度应用于检索系统以提高返回结果相关度,引入DCG作为评价指标.实验结果表明,该方法能有效提高检索效果.  相似文献   

12.
This paper presents a knowledge-based approach to managing and retrieving personal documents. The dual document models consist of a document type hierarchy and a folder organization. The document type hierarchy is used to capture the layout, logical and conceptual structures of documents. The folder organization mimics the user's real-world document filing system for organizing and storing documents in an office environment. Predicate-based representation of documents is formalized for specifying knowledge about documents. Document filing and retrieval are predicate-driven. The filing criteria for the folders, which are specified in terms of predicates, govern the grouping of frame instances, regardless of their document types. We incorporated the notions of document type hierarchy and folder organization into the multilevel architecture of document storage. This architecture supports various text-based information retrieval techniques and content-based multimedia information retrieval techniques. The paper also proposes a knowledge-based query-preprocessing algorithm, which reduces the search space. For automating the document filing and retrieval, a predicate evaluation engine with a knowledge base is proposed. The learning agent is responsible for acquiring the knowledge needed by the evaluation engine.  相似文献   

13.
14.
通过分析五类典型视频在视觉上的特性,提取了七种最能揭示几类视频差异的特征,并设计了一种基于一对一支持向量机(1-1 SVM)的视频内容自动分类算法,用于解决在对网络视频媒体的管理、点播、检索中对视频内容进行初步筛选的问题。基于大量实际视频片段的仿真实验结果证明了本算法在区分能力和准确率方面的性能优势。  相似文献   

15.
The ability to reliably merge independent updates of a document is a crucial prerequisite to efficient collaboration in office work. However, merge support for common office document standards like OpenDocument or OfficeOpenXML is still in its infancy. In this paper, we present a consistent versioning model for XML documents in general including merge support. This is achieved by using context-aware fingerprints that identify edit operations and allow for a conflict detection. We show how to extract tracked changes from office documents and map them on our delta model. Experimental results indicate that our fingerprinting technique is efficient and reliable.  相似文献   

16.
We present a multimodal document alignment framework, which highlights existing alignment relationships between documents that are discussed and recorded during multimedia events such as meetings. These relationships that should help indexing the archives of these events are detected using various techniques from natural language processing and information retrieval. The main alignment strategies studied are based on thematic, quotation and reference relationships. At the analysis level, the alignment framework was applied at several levels of granularity of documents, requiring specific document segmentation techniques. Our framework that is language independent was evaluated on corpora in French and English, including meetings and scientific presentations. The satisfactory evaluation results obtained at several stages show the importance of our approach in bridging the gap between meeting documents, independently from the language and domain. They highlight also the utility of the multimodal alignment in advanced applications, e.g. multimedia document browsing, content-based / temporal-based searching, etc.  相似文献   

17.
With the development of the World Wide Web, document clustering is receiving more and more attention as an important and fundamental technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. A good document clustering approach can assist computers in organizing the document corpus automatically into a meaningful cluster hierarchy for efficient browsing and navigation, which is very valuable for complementing the deficiencies of traditional information retrieval technologies. In this paper, we study the performance of different density-based criterion functions, which can be classified as internal, external or hybrid, in the context of partitional clustering of document datasets. In our study, a weight was assigned to each document, which defined its relative position in the entire collection. To show the efficiency of the proposed approach, the weighted methods were compared to their unweighted variants. To verify the robustness of the proposed approach, experiments were conducted on datasets with a wide variety of numbers of clusters, documents and terms. To evaluate the criterion functions, we used the WebKb, Reuters-21578, 20Newsgroups-18828, WebACE and TREC-5 datasets, as they are currently the most widely used benchmarks in document clustering research. To evaluate the quality of a clustering solution, a wide spectrum of indices, three internal validity indices and seven external validity indices, were used. The internal validity indices were used for evaluating the within-cluster scatter and between cluster separations. The external validity indices were used for comparing the clustering solutions produced by the proposed criterion functions with the “ground truth” results. Experiments showed that our approach significantly improves clustering quality. In this paper, we developed a modified differential evolution (DE) algorithm to optimize the criterion functions. This modification accelerates the convergence of DE and, unlike the basic DE algorithm, guarantees that the received solution will be feasible.  相似文献   

18.
As the World Wide Web develops at an unprecedented pace, identifying web page genre has recently attracted increasing attention because of its importance in web search. A common approach for identifying genre is to use textual features that can be extracted directly from a web page, that is, On-Page features. The extracted features are subsequently inputted into a machine learning algorithm that will perform classification. However, these approaches may be ineffective when the web page contains limited textual information (e.g., the page is full of images). In this study, we address genre identification of web pages under the aforementioned situation. We propose a framework that uses On-Page features while simultaneously considering information in neighboring pages, that is, the pages that are connected to the original page by backward and forward links. We first introduce a graph-based model called GenreSim, which selects an appropriate set of neighboring pages. We then construct a multiple classifier combination module that utilizes information from the selected neighboring pages and On-Page features to improve performance in genre identification. Experiments are conducted on well-known corpora, and favorable results indicate that our proposed framework is effective, particularly in identifying web pages with limited textual information.  相似文献   

19.
An important initial step of mathematical formula recognition is to correctly identify the location of formulae within documents. Previous work in this area has traditionally focused on image-based documents; however, given the prevalence and popularity of the PDF format for dissemination, alternatives to image-based approaches are increasingly being explored. In this paper, we investigate the use of both machine learning techniques and heuristic rules to locate the boundaries of both isolated and embedded formulae within documents, based upon data extracted directly from PDF files. We propose four new features along with preprocessing and post-processing techniques for isolated formula identification. Furthermore, we compare, analyse and extensively tune nine state-of-the-art learning algorithms for a comprehensive evaluation of our proposed methods. The evaluation is carried out over a ground-truth dataset, which we have made publicly available, together with an application adaptable fine-grained evaluation metric. Our experimental results demonstrate that the overall accuracies of isolated and embedded formula identification are increased by 11.52 and 10.65 %, compared with our previously proposed formula identification approach.  相似文献   

20.
为解决办公人员在进行文档写作时存在各种文本格式和内容错误的问题,设计基于深度学习的文本自动纠错系统,用于辅助办公人员的写作和校对工作;分析办公人员的文本纠错需求,并进行文本格式与内容纠错方法研究;设计系统由写作模板生成、文本格式纠错和文本内容纠错三个功能组成;首先,设计文本要素识别与检查算法并基于VBA技术实现文本格式校对;然后基于Seq2Seq深度学习模型训练字词、语法和标点符号查错模型完成公文内容纠错,并根据办公人员工作需求建立纠错辅助字库提升系统纠错准确率;最终,通过系统测试实验结果表明,设计系统能够极大地提升办公人员写作效率并减轻文本校对工作负担。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号