Similar Documents
20 similar documents retrieved.
1.
This paper presents a unified approach to analyzing and structuring the content of videotaped lectures for distance learning applications. By structuring lecture videos, we can support topic indexing and semantic querying of multimedia documents captured in traditional classrooms. Our goal in this paper is to automatically construct cross-references between lecture videos and textual documents so as to facilitate the synchronized browsing and presentation of multimedia information. The major issues involved in our approach are topical event detection, video text analysis and the matching of slide shots with external documents. In topical event detection, a novel transition detector is proposed to rapidly locate slide shot boundaries by computing the changes of text and background regions in videos. For each detected topical event, multiple keyframes are extracted for video text detection, super-resolution reconstruction, binarization and recognition. A new approach to the reconstruction of high-resolution textboxes based on linear interpolation and multi-frame integration is also proposed for effective binarization and recognition. The recognized characters are used to match video slide shots with external documents based on our proposed title and content similarity measures.
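A minimal sketch of the textbox reconstruction step, in Python with OpenCV: several keyframe crops of the same textbox are integrated by averaging, upscaled by linear interpolation, and binarized for recognition. The function name, the scale factor and the use of Otsu thresholding are illustrative assumptions, not the authors' exact pipeline.

    import cv2
    import numpy as np

    def enhance_textbox(crops, scale=4):
        """crops: grayscale uint8 arrays of the same textbox from several keyframes."""
        stack = np.stack([c.astype(np.float32) for c in crops], axis=0)
        integrated = stack.mean(axis=0)                 # multi-frame integration
        h, w = integrated.shape
        upscaled = cv2.resize(integrated, (w * scale, h * scale),
                              interpolation=cv2.INTER_LINEAR)  # linear interpolation
        upscaled = cv2.normalize(upscaled, None, 0, 255, cv2.NORM_MINMAX)
        _, binary = cv2.threshold(upscaled.astype(np.uint8), 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        return binary                                   # ready for an OCR engine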

2.
This paper presents a new method for detecting and recognizing text in complex images and video frames. Text detection is performed in a two-step approach that combines the speed of a text localization step, enabling text size normalization, with the strength of a machine learning text verification step applied to background-independent features. Text recognition, applied to the detected text lines, is addressed by a text segmentation step followed by a traditional OCR algorithm within a multi-hypotheses framework relying on multiple segments, language modeling and OCR statistics. Experiments conducted on large databases of real broadcast documents demonstrate the validity of our approach.

3.
Marginal noise is a common phenomenon in document analysis that results from the scanning of thick or skewed documents. It usually appears as a large dark region around the margin of document images. Marginal noise may cover meaningful document objects, such as text, graphics and forms; this overlap makes the segmentation and recognition of document objects difficult. This paper proposes a novel approach to removing marginal noise, consisting of two steps: marginal noise detection and marginal noise deletion. Marginal noise detection reduces the original document image to a smaller image, and then finds marginal noise regions according to the shape, length and location of the split blocks. After the noise regions are detected, different removal methods are applied: a local thresholding method is proposed for gray-scale document images, whereas a region growing method is devised for binary document images. Experiments with a wide variety of test samples confirm the feasibility and effectiveness of the proposed approach in removing marginal noise.
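A minimal sketch of the detect-then-delete idea for binary images: shrink the page, flag dark connected regions that touch the margins, and clear them at full size. The shrink factor, margin width and size test below are illustrative assumptions, not the authors' exact rules.

    import cv2
    import numpy as np

    def remove_marginal_noise(binary_doc, shrink=8, margin_frac=0.05):
        """binary_doc: uint8 image with ink = 0 (black) and background = 255."""
        small = cv2.resize(binary_doc, None, fx=1 / shrink, fy=1 / shrink,
                           interpolation=cv2.INTER_AREA)
        dark = (small < 128).astype(np.uint8)
        n, labels, stats, _ = cv2.connectedComponentsWithStats(dark)
        h, w = dark.shape
        m = int(margin_frac * min(h, w)) + 1
        cleaned = binary_doc.copy()
        for i in range(1, n):
            x, y, bw, bh, area = stats[i]
            touches_margin = x <= m or y <= m or x + bw >= w - m or y + bh >= h - m
            if touches_margin and area > 0.01 * h * w:   # large dark margin block
                mask = (labels == i).astype(np.uint8) * 255
                mask = cv2.resize(mask, (binary_doc.shape[1], binary_doc.shape[0]),
                                  interpolation=cv2.INTER_NEAREST)
                cleaned[mask > 0] = 255                  # delete the noise region
        return cleaned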

4.
This paper introduces a window-based feature extraction method for document copy detection and analyzes its performance theoretically. The document is segmented into overlapping text blocks, a rolling hash function maps each block to a hash value, and text features are then selected from a defined window of hash values. Experiments verify the properties of the method and compare it against representative document copy detection systems; the results show that the method guarantees detection of copied passages longer than the guarantee threshold and effectively improves the accuracy of the detection results.
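The description closely resembles classic winnowing-style fingerprinting. A minimal sketch: hash every overlapping k-gram with a rolling hash, then keep the minimum hash from each window of w consecutive hashes; matching fingerprints between two documents flag copies at least t = w + k - 1 characters long. The hash base, modulus and window sizes below are illustrative choices.

    def fingerprints(text, k=5, w=4, base=257, mod=(1 << 31) - 1):
        """Rolling-hash every k-gram, keep the minimum of each window of w hashes."""
        if len(text) < k:
            return set()
        h = 0
        for ch in text[:k]:                      # hash of the first k-gram
            h = (h * base + ord(ch)) % mod
        hashes = [h]
        top = pow(base, k - 1, mod)
        for i in range(1, len(text) - k + 1):    # roll the hash across the text
            h = ((h - ord(text[i - 1]) * top) * base + ord(text[i + k - 1])) % mod
            hashes.append(h)
        selected = set()
        for i in range(max(1, len(hashes) - w + 1)):
            selected.add(min(hashes[i:i + w]))   # window minimum as fingerprint
        return selected

    # Two documents share copied text once their fingerprint sets intersect.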

5.
6.
In this paper we describe a system for automatically analyzing old documents and creating hyperlinks between different epochs, thus opening ancient documents to younger audiences and making them available on the web alongside current content. We propose a supervised learning approach to segmenting the text and illustrations of digitized old documents, using a texture feature based on local correlation that detects the repeating patterns of text regions and differentiates them from pictorial elements. We also present a solution that helps the user find contemporary content connected to what is automatically extracted from the ancient documents.

7.
This paper investigates a particular data mining problem: identifying an unknown number of targets from homogeneous observations collected via multiple independent sources. This clustering problem corresponds to the significant problem of multi-target detection in the multi-sensor/multi-scan context. No prior information is given about either the level of clutter (i.e., noisy data) or the number of targets/clusters; both have to be learned online from the data. In addition, data points from the same source cannot be grouped into the same cluster (the cannot-link, CL, constraint), and the sizes of the generated clusters are bounded by the number of data sources. In the proposed approach, a density-based clustering mechanism first identifies dense regions as clusters and removes clutter at the coarser level; the CL constraint is then applied for finer data mining and to distinguish overlapping clusters. Illustrative datasets demonstrate the validity of the present clustering approach for multi-target detection and estimation in cluttered environments affected by both misdetection and clutter.
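A minimal sketch of the two-stage idea using scikit-learn's DBSCAN for the coarse density pass, followed by a greedy split that enforces the cannot-link constraint (no two points from the same source in one cluster). The greedy bucket assignment is an illustrative assumption, not the paper's exact refinement.

    import numpy as np
    from sklearn.cluster import DBSCAN

    def cl_constrained_clusters(points, sources, eps=1.0, min_samples=3):
        """points: (n, d) array; sources: length-n sequence of sensor/scan ids."""
        coarse = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
        final = []
        for c in set(coarse) - {-1}:             # label -1 = clutter, discarded
            buckets = []                         # each bucket holds distinct sources
            for i in np.where(coarse == c)[0]:
                for b in buckets:
                    if sources[i] not in {sources[j] for j in b}:
                        b.append(i)
                        break
                else:
                    buckets.append([i])          # CL violation -> start a new cluster
            final.extend(buckets)
        return final                             # lists of point indices per target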

8.
This paper proposes an automatic folder allocation system for text documents based on a hybrid classification method that combines the Bayesian (Bayes) approach and Support Vector Machines (SVMs). Folder allocation for text documents is typically performed manually: every time the user creates a text document in an editor or downloads one from the internet and wishes to store it, the user must determine the appropriate folder in which to place it. Repeating this allocation for every document becomes tedious, especially when the number and depth of folders is large and the structure is complex and continuously growing. This problem can be overcome by applying machine learning methods to classify new text documents and automatically allocate the most appropriate folder for them. In this paper we propose a Bayes-SVM hybrid classification framework to perform this task.
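A minimal sketch of a Bayes + SVM hybrid folder classifier with scikit-learn. Here the two models are fused by soft voting over calibrated probabilities; the paper's exact fusion rule may differ, so treat this as an illustrative assumption.

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.ensemble import VotingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def build_folder_classifier():
        bayes = MultinomialNB()
        svm = CalibratedClassifierCV(LinearSVC())    # adds predict_proba to the SVM
        hybrid = VotingClassifier([('nb', bayes), ('svm', svm)], voting='soft')
        return make_pipeline(TfidfVectorizer(), hybrid)

    # Usage sketch:
    #   clf = build_folder_classifier()
    #   clf.fit(train_texts, train_folder_labels)
    #   folder = clf.predict(["quarterly sales report draft"])[0]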

9.
With the great advantages of digitization, more and more documents are being transformed into digital representations, mostly by scanners or digital cameras. However, the transformation may degrade image quality through lighting variations, i.e., an uneven illumination distribution. In this paper we describe a new approach for text images that compensates for uneven illumination while preserving a high degree of text recognizability. The proposed scheme first enhances the contrast of the scanned document, then generates an edge map from the contrast-enhanced image to locate the text areas. Using the text locations, a light distribution (background) image is created to assist in producing the final light-balanced image. Simulation results demonstrate that our approach is superior to the previous work of Hsia et al. (2005, 2006).
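A minimal sketch of light-distribution compensation: estimate the background (illumination) image and divide it out. Morphological closing with a large kernel is a common stand-in for the paper's text-aware background estimation and is an assumption here.

    import cv2

    def compensate_illumination(gray):
        """gray: uint8 grayscale text image with uneven lighting."""
        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (51, 51))
        background = cv2.morphologyEx(gray, cv2.MORPH_CLOSE, kernel)  # light map
        balanced = cv2.divide(gray, background, scale=255)            # flatten lighting
        return cv2.normalize(balanced, None, 0, 255, cv2.NORM_MINMAX)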

10.
Analysing online handwritten notes is a challenging problem because of the heterogeneity of their content and the lack of prior knowledge: users are free to compose documents that mix text, drawings, tables or diagrams. Separating text from non-text strokes is crucial for the automated interpretation and indexing of these documents, but solving this problem requires careful modelling of contextual information, such as the spatial and temporal relationships between strokes. In this work, we present a comprehensive study of contextual information modelling for text/non-text stroke classification in online handwritten documents. Formulating the problem as a conditional random field makes it possible to integrate and combine multiple sources of context, such as several types of spatial and temporal interactions. Experimental results on a publicly available database of freely hand-drawn documents demonstrate the superiority of our approach and the benefit of combining contextual information for text/non-text classification.
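One simple way to exploit temporal context is a linear-chain conditional random field over the stroke sequence, sketched below with sklearn-crfsuite. The per-stroke features (length, curvature, gap to the previous stroke) are hypothetical; the paper combines richer spatial and temporal interactions than a linear chain captures.

    import sklearn_crfsuite

    def stroke_features(strokes):
        """strokes: dicts with precomputed 'length', 'curvature' and 'gap' values."""
        return [{'length': s['length'],
                 'curvature': s['curvature'],
                 'gap_to_prev': s['gap']} for s in strokes]

    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=100)
    # X: one feature sequence per document; y: 'text'/'non-text' label per stroke
    #   crf.fit([stroke_features(doc) for doc in train_docs], train_labels)
    #   predicted = crf.predict([stroke_features(new_doc)])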

11.
12.
This paper proposes a retrieval method for web document resources that contain free text together with rich semantic markup. After analyzing three existing prototype semantic retrieval systems, an improved implementation framework is proposed in which both document resources and queries are described in the Web Ontology Language (OWL). These descriptions provide structured or semi-structured information about documents and their content. Once the documents have been indexed, semantic inference can be performed over this information during query execution and result processing, which greatly improves retrieval effectiveness.
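A minimal sketch of semantic querying over OWL document descriptions using rdflib. The file name, namespace and properties (ex:hasTopic, ex:broader) are hypothetical placeholders, not part of the original framework.

    from rdflib import Graph

    g = Graph()
    g.parse("documents.owl", format="xml")   # hypothetical OWL document descriptions
    results = g.query("""
        PREFIX ex: <http://example.org/doc#>
        SELECT ?doc WHERE {
            ?doc ex:hasTopic ?topic .
            ?topic ex:broader ex:MachineLearning .
        }
    """)
    for row in results:
        print(row.doc)   # documents whose topic falls under MachineLearning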

13.
Web 2.0 provides user-friendly tools that allow people to create and publish content online. User-generated content often takes the form of short texts (e.g., blog posts, news feeds, snippets). This has motivated increasing interest in the analysis of short texts and, specifically, in their categorisation. Text categorisation is the task of classifying documents into a number of predefined categories. Traditional text classification techniques are mainly based on word frequency statistics and have proved inadequate for the classification of short texts, where word occurrences are too few. Moreover, the classic approach to text categorisation is based on a learning process that requires a large number of labeled training texts to achieve accurate performance; such labeled documents might not be available, whereas unlabeled documents can easily be collected. This paper presents an approach to text categorisation that does not need a pre-classified set of training documents; the proposed method only requires the category names as user input. Each category is defined by means of an ontology of terms modelled by a set of what we call proximity equations. Hence, our method is not based on category occurrence frequency, but depends strongly on the definition of the category and on how well the text fits that definition. It is therefore an appropriate method for short text classification, where the frequency of occurrence of a category is very small or even zero. Another feature of our method is that the classification process is based on the ability of an extension of the standard Prolog language, named Bousi~Prolog, to perform flexible matching and knowledge representation. This declarative approach yields a text classifier that is quick and easy to build, and a classification process that is easy for the user to understand. Experimental results show that the proposed method achieves reasonably good performance.

14.
A text mining approach for automatic construction of hypertexts
Research on automatic hypertext construction has grown rapidly in the last decade because of an urgent need to convert the enormous volume of legacy documents into web pages. Unlike traditional 'flat' texts, a hypertext contains navigational hyperlinks that point to related hypertexts or to locations within the same hypertext. Traditionally, these hyperlinks were created by the authors of the web pages, with or without the help of authoring tools; however, the sheer number of documents produced each day makes such manual construction impractical. An automatic hypertext construction method is therefore necessary for content providers to efficiently produce adequate information for web surfers. Although most web pages contain non-textual data such as images, sounds, and video clips, text still contributes the major part of the information on a page, so it is not surprising that most automatic hypertext construction methods derive from traditional information retrieval research. In this work, we propose a new automatic hypertext construction method based on a text mining approach. Our method applies the self-organizing map (SOM) algorithm to cluster the flat text documents in a training corpus and generates two maps. We then use these maps to identify the sources and destinations of important hyperlinks within the training documents. The constructed hyperlinks are inserted into the training documents to translate them into hypertext form, and the translated documents form the new corpus. Incoming documents can be translated into hypertext form and added to the corpus through the same approach. Our method has been tested on a set of flat text documents collected from a newswire site. Although we used only Chinese text documents, our approach can be applied to any documents that can be transformed into a set of index terms.
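A minimal sketch of the SOM clustering step using the MiniSom library: TF-IDF document vectors are mapped onto a grid, and documents that land on the same or neighbouring units become candidate hyperlink endpoints. MiniSom, the map size and the training length are illustrative choices, not the authors' implementation.

    from minisom import MiniSom
    from sklearn.feature_extraction.text import TfidfVectorizer

    def som_document_map(texts, map_w=10, map_h=10, epochs=1000):
        X = TfidfVectorizer(max_features=300).fit_transform(texts).toarray()
        som = MiniSom(map_w, map_h, X.shape[1], sigma=1.0, learning_rate=0.5)
        som.random_weights_init(X)
        som.train_random(X, epochs)
        # documents sharing (or neighbouring) a winner unit are linking candidates
        return {i: som.winner(x) for i, x in enumerate(X)}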

15.
Text in videos and images plays an important role in content-based video database retrieval, web video search, image segmentation and image inpainting. To improve the efficiency of text detection, this paper presents a video text detection method based on adaptive thresholds computed from multiple features. Building on Michael's algorithm, it computes adaptive local thresholds from three features of text edges: strength, density, and the ratio of horizontal to vertical edges. These thresholds effectively remove non-text regions, extract text edges, and detect and localize text, reducing the adverse effect of the single-feature threshold in Michael's algorithm. A merging mechanism is introduced in the text localization stage to reduce the occurrence of incomplete regions. Experimental results show high precision and recall, making the method applicable to video search, image segmentation and image inpainting.
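A minimal sketch of edge-feature text detection: compute edge strength, edge density and a horizontal/vertical edge ratio, then keep regions where all three look text-like. The paper's adaptive threshold formula is not reproduced here, so the combination below is an illustrative assumption.

    import cv2
    import numpy as np

    def detect_text_mask(gray, win=16):
        gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
        strength = cv2.magnitude(gx, gy)                      # edge strength
        density = cv2.boxFilter((strength > strength.mean()).astype(np.float32),
                                -1, (win, win))               # local edge density
        hv_ratio = (np.abs(gx) + 1) / (np.abs(gy) + 1)        # horiz./vert. edge ratio
        mask = (density > density.mean() + density.std()) & \
               (hv_ratio > 0.3) & (hv_ratio < 3.0)            # text has mixed edges
        return mask.astype(np.uint8) * 255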

16.
Research on a statistics-based method for extracting main text content from web pages
To apply natural language processing techniques effectively to web documents, this paper proposes a method that relies on statistical information to extract the main text content from Chinese news web pages. The method first represents a page as a tree according to its HTML tags, then selects the node containing the main text based on the number of Chinese characters each node contains. It overcomes the drawback of traditional web content extraction methods, which require a different wrapper for each data source, and is simple and accurate: experiments show an extraction accuracy above 95%. A text extraction tool implemented with this method currently supplies corpus data to a question answering system for the tourism domain and satisfies its requirements well.
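A minimal sketch of the statistics-based idea: parse the page into a tag tree and select the deepest node that still contains most of the page's Chinese characters. BeautifulSoup and the 80% retention threshold are implementation choices, not the paper's original tooling.

    import re
    from bs4 import BeautifulSoup

    CJK = re.compile(u'[\u4e00-\u9fff]')

    def extract_main_text(html, keep=0.8):
        soup = BeautifulSoup(html, 'html.parser')
        for tag in soup(['script', 'style']):
            tag.decompose()                      # drop non-content nodes
        total = len(CJK.findall(soup.get_text()))
        if total == 0:
            return ''
        best = soup.body or soup
        for node in soup.find_all(True):
            text = node.get_text()
            if (len(CJK.findall(text)) >= keep * total
                    and len(text) < len(best.get_text())):
                best = node                      # deeper node, still holds the body text
        return best.get_text(separator='\n', strip=True)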

17.
Traditional text classification methods require a large number of labeled samples to train a good classifier, yet in practice large labeled sample sets are often hard to obtain. How to achieve good classification with a small number of labeled samples and a large number of unlabeled samples has therefore become a popular research topic. This paper proposes a new method for expanding the labeled sample set: it first extracts representative features for each category from the labeled samples, then uses these features to find similar samples in the unlabeled set and adds them to the labeled set. Experiments show that the method effectively improves classification performance.
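A minimal sketch of the expansion step: build a TF-IDF centroid per known class as its representative feature, then pseudo-label unlabeled documents whose cosine similarity to a centroid exceeds a threshold. The threshold value is an illustrative assumption.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def expand_labeled_set(labeled_texts, labels, unlabeled_texts, tau=0.4):
        vec = TfidfVectorizer()
        X = vec.fit_transform(labeled_texts + unlabeled_texts)
        XL, XU = X[:len(labeled_texts)], X[len(labeled_texts):]
        classes = sorted(set(labels))
        centroids = np.vstack([                  # representative feature per class
            np.asarray(XL[[i for i, y in enumerate(labels) if y == c]].mean(axis=0))
            for c in classes])
        sims = cosine_similarity(XU, centroids)
        return [(unlabeled_texts[i], classes[j])  # pseudo-labeled additions
                for i, j in enumerate(sims.argmax(axis=1)) if sims[i, j] >= tau]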

18.
In this paper, we present a new text line detection method for handwritten documents. The proposed technique is based on a strategy that consists of three distinct steps. The first step includes image binarization and enhancement, connected component extraction, partitioning of the connected component domain into three spatial sub-domains and average character height estimation. In the second step, a block-based Hough transform is used for the detection of potential text lines, while a third step is used to correct possible splitting, to detect text lines that the previous step did not reveal and, finally, to separate vertically connected characters and assign them to text lines. The performance evaluation of the proposed approach is based on a consistent and concrete evaluation methodology.
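A minimal sketch of the Hough step: every connected component votes with its centre of gravity, and strong near-horizontal peaks in Hough space become text-line candidates. The vote threshold and skew tolerance are illustrative; the paper additionally partitions components into blocks and post-processes splits and merges.

    import cv2
    import numpy as np

    def detect_text_lines(binary, max_skew_deg=15):
        """binary: uint8 image with ink = 255."""
        n, _, stats, centroids = cv2.connectedComponentsWithStats(binary)
        votes = np.zeros(binary.shape, np.uint8)
        for cx, cy in centroids[1:]:             # skip background component 0
            votes[int(cy), int(cx)] = 255
        lines = cv2.HoughLines(votes, rho=1, theta=np.pi / 180, threshold=5)
        kept = []
        if lines is not None:
            for rho, theta in lines[:, 0]:
                if abs(np.degrees(theta) - 90) <= max_skew_deg:  # near-horizontal
                    kept.append((rho, theta))
        return kept                              # (rho, theta) per candidate line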

19.
In the present article we introduce and validate an approach to single-label multi-class document categorization based on text content features. The approach uses a statistical property of Principal Component Analysis: it minimizes the reconstruction error of the training documents used to compute a low-rank category transformation matrix. This matrix transforms the original set of training documents from a given category into a new low-rank space and then optimally reconstructs them in the original space with minimum reconstruction error. The proposed method, called the Minimizer of the Reconstruction Error (mRE) classifier, extends this property and applies it to new, unseen test documents. Experiments on four multi-class text categorization datasets demonstrate the stable and generally superior performance of the proposed approach in comparison with other popular classification methods.
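A minimal sketch of the mRE idea: fit a low-rank PCA per category, and assign a test document to the category whose subspace reconstructs its feature vector with the smallest error. The rank is an illustrative choice.

    import numpy as np
    from sklearn.decomposition import PCA

    class MREClassifier:
        def __init__(self, rank=20):
            self.rank, self.models = rank, {}

        def fit(self, X, y):
            y = np.asarray(y)
            for c in set(y):
                Xc = X[y == c]
                k = min(self.rank, len(Xc), X.shape[1])   # keep PCA rank feasible
                self.models[c] = PCA(n_components=k).fit(Xc)
            return self

        def predict(self, X):
            cats = list(self.models)
            errors = np.vstack([                 # reconstruction error per category
                np.linalg.norm(X - m.inverse_transform(m.transform(X)), axis=1)
                for m in self.models.values()])
            return [cats[i] for i in errors.argmin(axis=0)]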

20.
Huge numbers of documents are being generated on the Web, especially as news articles and social media posts. Effectively organizing these evolving documents so that readers can easily browse or search them is a challenging task. Existing methods include classification, clustering, and chronological or geographical ordering, which provide only a partial view of the relations among news articles. To better utilize cross-document relations in organizing news articles, we propose a novel approach that organizes news archives by exploiting their near-duplicate relations. First, we use a sentence-level, statistics-based approach to near-duplicate copy detection, which is language independent, simple, and effective; since content-based approaches are usually time consuming and not robust to term substitutions, near-duplicate detection is a practical alternative. Second, by extracting the cross-document relations in a block-sharing graph, we derive a near-duplicate clustering in which users can easily browse and find unnecessary repetition among documents. Experimental results show the high efficiency and good accuracy of the proposed approach in detecting and clustering near-duplicate documents in news archives.
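A minimal sketch of sentence-level near-duplicate clustering: hash each sentence, link two documents when they share enough sentence hashes, and take connected components of the sharing graph as near-duplicate clusters. The sentence splitter and the sharing threshold are illustrative assumptions.

    import hashlib
    import itertools
    import re
    from collections import defaultdict

    def sentence_hashes(doc):
        sents = [s.strip() for s in re.split(u'[.!?。！？]', doc) if s.strip()]
        return {hashlib.md5(s.encode('utf-8')).hexdigest() for s in sents}

    def near_duplicate_clusters(docs, min_shared=3):
        sigs = [sentence_hashes(d) for d in docs]
        parent = list(range(len(docs)))          # union-find over documents

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for i, j in itertools.combinations(range(len(docs)), 2):
            if len(sigs[i] & sigs[j]) >= min_shared:
                parent[find(i)] = find(j)        # same near-duplicate cluster
        clusters = defaultdict(list)
        for i in range(len(docs)):
            clusters[find(i)].append(i)
        return list(clusters.values())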
