Similar Documents
20 similar documents found (search time: 15 ms)
1.
The processing and management of XML data are popular research issues. However, operations based on the structure of XML data have not received strong attention. These operations involve, among others, the grouping of structurally similar XML documents. Such grouping results from the application of clustering methods with distances that estimate the similarity between tree structures. This paper presents a framework for clustering XML documents by structure. Modeling the XML documents as rooted ordered labeled trees, we study the usage of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. We suggest the usage of structural summaries for trees to improve the performance of the distance calculation and at the same time to maintain or even improve its quality. Our approach is tested using a prototype testbed.
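The idea of a structural distance between XML trees can be pictured with a much simpler stand-in than the paper's summary-based metrics. Below is a minimal sketch (not the authors' algorithm) that summarizes each document by its set of root-to-node label paths and compares the sets with a Jaccard distance; all names are hypothetical.

```python
from xml.etree import ElementTree as ET

def label_paths(xml_text):
    """Collect the set of root-to-node label paths -- a crude structural summary."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths

def structural_distance(a, b):
    """Jaccard distance between two documents' path sets (0 = identical structure)."""
    pa, pb = label_paths(a), label_paths(b)
    return 1 - len(pa & pb) / len(pa | pb)

doc1 = "<bib><book><title/><author/></book></bib>"
doc2 = "<bib><book><title/><year/></book></bib>"
doc3 = "<movies><film><cast/></film></movies>"
```

Documents sharing most label paths (doc1, doc2) come out close, while structurally unrelated ones (doc3) come out far apart; a hierarchical clustering algorithm can then consume the resulting pairwise distance matrix.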

2.
Character groundtruth for real, scanned document images is crucial for evaluating the performance of OCR systems, training OCR algorithms, and validating document degradation models. Unfortunately, manual collection of accurate groundtruth for characters in a real (scanned) document image is not practical because (i) accuracy in delineating groundtruth character bounding boxes is not high enough, (ii) it is extremely laborious and time consuming, and (iii) the manual labor required for this task is prohibitively expensive. We describe a closed-loop methodology for collecting very accurate groundtruth for scanned documents. We first create ideal documents using a typesetting language. Next, we create the groundtruth for the ideal document. The ideal document is then printed, photocopied, and scanned. A registration algorithm estimates the global geometric transformation and then performs a robust local bitmap match to register the ideal document image to the scanned document image. Finally, the groundtruth associated with the ideal document image is transformed using the estimated geometric transformation to create the groundtruth for the scanned document image. This methodology is very general and can be used for creating groundtruth for documents typeset in any language, layout, font, and style. We have demonstrated the method by generating groundtruth for English, Hindi, and FAX document images. The cost of creating groundtruth using our methodology is minimal. If character, word, or zone groundtruth is available for any real document, the registration algorithm can be used to generate the corresponding groundtruth for a rescanned version of the document.
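The final transfer step can be pictured with a toy version of the geometry: once registration has estimated a global similarity transform (rotation, scale, translation — the parameters below are placeholders, not values from the paper), each ideal-image character box is mapped into scanned-image coordinates. A minimal sketch:

```python
import math

def transform_box(box, scale=1.0, theta=0.0, tx=0.0, ty=0.0):
    """Map an ideal-image bounding box (x0, y0, x1, y1) through an
    estimated similarity transform into scanned-image coordinates."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    def map_pt(x, y):
        return (scale * (cos_t * x - sin_t * y) + tx,
                scale * (sin_t * x + cos_t * y) + ty)

    (x0, y0) = map_pt(box[0], box[1])
    (x1, y1) = map_pt(box[2], box[3])
    # Re-normalize so the box stays axis-aligned after rotation.
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))
```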

3.
An automatic keyphrase extraction system for scientific documents (cited 1 time: 0 self, 1 other)
Automatic keyphrase extraction techniques play an important role for many tasks including indexing, categorizing, summarizing, and searching. In this paper, we develop and evaluate an automatic keyphrase extraction system for scientific documents. Compared with previous work, our system concentrates on two important issues: (1) more precise location for potential keyphrases: a new candidate phrase generation method is proposed based on the core word expansion algorithm, which can reduce the size of the candidate set by about 75% without increasing the computational complexity; (2) overlap elimination for the output list: when a phrase and its sub-phrases coexist as candidates, an inverse document frequency feature is introduced for selecting the proper granularity. Additional new features are added for phrase weighting. Experiments based on real-world datasets were carried out to evaluate the proposed system. The results show the efficiency and effectiveness of the refined candidate set and demonstrate that the new features improve the accuracy of the system. The overall performance of our system compares favorably with other state-of-the-art keyphrase extraction systems.
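The core-word-expansion idea behind point (1) can be sketched roughly: starting from a core content word, the candidate generator grows the phrase over adjacent non-stopword tokens, so only phrases around promising cores enter the candidate set. This is an illustrative simplification, not the paper's exact rules; the stopword list and function names are made up.

```python
STOPWORDS = {"the", "of", "a", "in", "for", "and", "is"}

def expand_candidates(tokens, core_index):
    """Expand left/right from a core word over non-stopword neighbors,
    emitting every expansion containing the core as a candidate phrase."""
    left = right = core_index
    while left > 0 and tokens[left - 1] not in STOPWORDS:
        left -= 1
    while right < len(tokens) - 1 and tokens[right + 1] not in STOPWORDS:
        right += 1
    return [" ".join(tokens[i:j + 1])
            for i in range(left, core_index + 1)
            for j in range(core_index, right + 1)]

tokens = "automatic keyphrase extraction for scientific documents".split()
candidates = expand_candidates(tokens, tokens.index("keyphrase"))
```

Because expansion stops at stopwords, a sentence yields only a handful of candidates per core word instead of every n-gram, which is the kind of candidate-set reduction the abstract describes.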

4.
An optimization methodology for intermodal terminal management (cited 13 times: 0 self, 13 other)
A solution to the problems of resource allocation and scheduling of loading and unloading operations in a container terminal is presented. The two problems are formulated and solved hierarchically. First, the solution of the resource allocation problem returns, over a number of work shifts, a set of quay cranes used to load and unload containers from the moored ships and the set of yard cranes to store those containers on the yard. Then, a scheduling problem is formulated to compute the loading and unloading lists of containers for each allocated crane. The feasibility of the solution is verified against a detailed, discrete-event based, simulation model of the terminal. The simulation results show that the optimized resource allocation, which reduces the costs by one third, can be effectively adopted in combination with the optimized loading and unloading list. Moreover, the simulation shows that the optimized lists reduce the number of crane conflicts on the yard and the average length of the truck queues in the terminal.

5.
Based on an in-depth analysis of the evaluation criteria for binary-text data hiding, and drawing on the structural characteristics of Chinese character strokes, an authentication watermarking scheme with optimal visual quality is proposed. First, standard 16-blocks are selected by combining a pixel-flippability scoring criterion with Chinese character structure features; then, within these blocks, the best flippable 8-blocks are chosen according to the DRDM criterion, and the watermark is embedded by flipping those 8-blocks in the image. Experiments show that the algorithm has excellent visual imperceptibility and supports tamper localization.

6.
This paper proposes a new, efficient algorithm for extracting similar sections between two time sequence data sets. The algorithm, called Relay Continuous Dynamic Programming (Relay CDP), realizes fast matching between arbitrary sections in the reference pattern and the input pattern and enables the extraction of similar sections in a frame synchronous manner. In addition, Relay CDP is extended to two types of applications that handle spoken documents. The first application is the extraction of repeated utterances in a presentation or a news speech because repeated utterances are assumed to be important parts of the speech. These repeated utterances can be regarded as labels for information retrieval. The second application is flexible spoken document retrieval. A phonetic model is introduced to cope with the speech of different speakers. The new algorithm allows a user to query by natural utterance and searches spoken documents for any partial matches to the query utterance. We present herein a detailed explanation of Relay CDP and the experimental results for the extraction of similar sections and report results for two applications using Relay CDP. Yoshiaki Itoh has been an associate professor in the Faculty of Software and Information Science at Iwate Prefectural University, Iwate, Japan, since 2001. He received the B.E. degree, M.E. degree, and Dr. Eng. from Tokyo University, Tokyo, in 1987, 1989, and 1999, respectively. From 1989 to 2001 he was a researcher and a staff member of Kawasaki Steel Corporation, Tokyo and Okayama. From 1992 to 1994 he transferred as a researcher to Real World Computing Partnership, Tsukuba, Japan. Dr. Itoh's research interests include spoken document processing without recognition, audio and video retrieval, and real-time human communication systems. 
He is a member of ISCA, Acoustical Society of Japan, Institute of Electronics, Information and Communication Engineers, Information Processing Society of Japan, and Japan Society of Artificial Intelligence. Kazuyo Tanaka has been a professor at the University of Tsukuba, Tsukuba, Japan, since 2002. He received the B.E. degree from Yokohama National University, Yokohama, Japan, in 1970, and the Dr. Eng. degree from Tohoku University, Sendai, Japan, in 1984. From 1971 to 2002 he was research officer of Electrotechnical Laboratory (ETL), Tsukuba, Japan, and the National Institute of Advanced Science and Technology (AIST), Tsukuba, Japan, where he was working on speech analysis, synthesis, recognition, and understanding, and also served as chief of the speech processing section. His current interests include digital signal processing, spoken document processing, and human information processing. He is a member of IEEE, ISCA, Acoustical Society of Japan, Institute of Electronics, Information and Communication Engineers, and Japan Society of Artificial Intelligence. Shi-Wook Lee received the B.E. degree and M.E. degree from Yeungnam University, Korea and Ph.D. degree from the University of Tokyo in 1995, 1997, and 2001, respectively. Since 2001 he has been working in the Research Group of Speech and Auditory Signal Processing, the National Institute of Advanced Science and Technology (AIST), Tsukuba, Japan, as a postdoctoral fellow. His research interests include spoken document processing, speech recognition, and understanding.

7.
In this paper an efficient approach for segmentation of individual characters from scanned documents typed on old typewriters is proposed. The approach is primarily intended for processing machine-typed documents, but can be used for machine-printed documents as well. The proposed character segmentation approach uses a modified projection-profiles technique, which relies on a sliding window to obtain information about the document image structure. This is followed by histogram processing to determine the spaces between lines, words, and characters in the document image. The decision-making logic used in the character segmentation process is described and represents an integral aspect of the proposed technique. Besides the character segmentation approach, an ultra-fast architecture for geometrical image transformations, used for image rotation during skew correction, is presented, and its fast implementation using pointer arithmetic and a highly optimized low-level machine routine is provided. The proposed character segmentation approach is semi-automatic and uses threshold values to control the segmentation process. The reported segmentation accuracy shows that the proposed approach outperforms state-of-the-art approaches in most cases. Results on time complexity also show that the new technique runs faster than state-of-the-art approaches and can process even very large document images in less than one second, which makes this approach suitable for real-time tasks. Finally, the performance of the proposed approach is demonstrated visually using original documents authored by Nikola Tesla.
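The projection-profile step can be illustrated in a few lines: summing ink pixels per row gives a histogram whose zero-runs are the gaps between text lines. The sketch below is a bare-bones version (binary image as a list of 0/1 rows) that omits the paper's sliding window and thresholds; all names are hypothetical.

```python
def line_bands(image, ink=1):
    """Return (top, bottom) row spans of text lines found from the
    horizontal projection profile of a binary image."""
    profile = [sum(1 for px in row if px == ink) for row in image]
    bands, start = [], None
    for y, count in enumerate(profile):
        if count and start is None:            # entering an inked band
            start = y
        elif not count and start is not None:  # leaving a band
            bands.append((start, y - 1))
            start = None
    if start is not None:                      # band touching the bottom edge
        bands.append((start, len(profile) - 1))
    return bands
```

The same idea applied to vertical (column-wise) profiles within a band separates words and characters, which is where the spacing histograms described in the abstract come in.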

8.
9.
We present our work on the paleographic analysis and recognition system intended for processing of historical Hebrew calligraphy documents. The main goal is to analyze documents of different writing styles in order to identify the locations, dates, and writers of test documents. Using interactive software tools, a database of extracted characters has been established. It now contains about 20,000 characters of 34 different writers, and will be considerably expanded in the near future. Preliminary results of automatic extraction of pre-specified letters using the erosion operator are presented. We further propose and test topological features for handwriting style classification based on a selected subset of the Hebrew alphabet. A writer identification experiment using 34 writers yielded 100% correct classification.

10.
Multimedia Tools and Applications - The performance of document text recognition depends on text line segmentation algorithms, which heavily rely on the type of language, author's writing...

11.
In this article, we are interested in the restoration of character shapes in antique document images. This class of documents generally contains a great deal of unintentional historical information that has to be taken into account to build high-quality digital libraries. Many document processing methods have already been proposed to cope with degraded character images, but those techniques often consist in replacing the degraded shapes with a corresponding prototype, which many specialists find unsatisfactory. We therefore developed our own method for accurate character restoration, basing our study on generic image processing tools (namely, Gabor filtering and the active contours model) complemented with specific, automatically extracted structural information. The principle of our method is to make an active contour recover the lost information using an external energy term based on an automatically built and selected reference character image. Results are presented for real examples taken from printed and handwritten documents.

12.
Classifier-based acronym extraction for business documents (cited 1 time: 1 self, 0 other)
Acronym extraction for business documents has been neglected in favor of acronym extraction for biomedical documents. Although there are overlapping challenges, the semi-structured and non-predictive nature of business documents hinders the effectiveness of the extraction methods used on biomedical documents, which fail to deliver the expected performance. A classifier-based extraction subsystem is presented as part of the wider project, Binocle, for the analysis of French business corpora. Explicit and implicit acronym presentation cases are identified using textual and syntactical hints. Among the 7 features extracted from each candidate instance, we introduce “similarity” features, which compare a candidate’s characteristics with average length-related values calculated from a generic acronym repository. Commonly used rules for evaluating the candidate (matching first letters, ordered instances, etc.) are scored and aggregated in a single composite feature that permits a supple classification. One hundred and thirty-eight French business documents from 14 public organizations were used for the training and evaluation corpora, yielding a recall of 90.9% at a precision level of 89.1% for a search space size of 3 sentences.

13.
The use of optimization in a simulation-based design environment has become a common trend in industry today. Computer simulation tools are commonplace in many engineering disciplines, providing designers with tools to evaluate a design's performance without building a physical prototype. This has triggered the development of optimization techniques suitable for dealing with such simulations. One of these approaches is known as sequential approximate optimization, in which a sequence of optimizations is performed over local response surface approximations of the system. This paper details the development of an interior-point approach for trust-region-managed sequential approximate optimization. The interior-point approach ensures that approximate feasibility is maintained throughout the optimization process, which facilitates the delivery of a usable design at each iteration when subject to reduced design-cycle-time constraints. To deal with infeasible starting points, homotopy methods are used to relax constraints and push designs toward feasibility. Results of application studies are presented, illustrating the applicability of the proposed algorithm.

14.
Authors use images to present a wide variety of important information in documents. For example, two-dimensional (2-D) plots display important data in scientific publications. Often, end-users seek to extract this data and convert it into a machine-processible form so that the data can be analyzed automatically or compared with other existing data. Existing document data extraction tools are semi-automatic and require users to provide metadata and interactively extract the data. In this paper, we describe a system that extracts data from documents fully automatically, completely eliminating the need for human intervention. The system uses a supervised learning-based algorithm to classify figures in digital documents into five classes: photographs, 2-D plots, 3-D plots, diagrams, and others. Then, an integrated algorithm is used to extract numerical data from data points and lines in the 2-D plot images along with the axes and their labels, the data symbols in the figure’s legend and their associated labels. We demonstrate that the proposed system and its component algorithms are effective via an empirical evaluation. Our data extraction system has the potential to be a vital component in high volume digital libraries.

15.
Semantics-based keyword extraction is an effective way to improve the accuracy of automatic extraction. Taking Chinese documents as the processing target, this work computes semantic distances between words using the TongYiCi CiLin (a Chinese thesaurus of synonyms), applies density-based clustering to the words to obtain topic-related clusters, and selects central words from the topic-related clusters as...

16.
Training recognizers for handwritten characters is still a very time-consuming task involving tremendous amounts of manual annotation by experts. In this paper we present semi-supervised labeling strategies that are able to considerably reduce the human effort. We propose two different methods to label and later recognize characters in collections of historical archive documents. The first one is based on clustering of different feature representations and the second one incorporates a simultaneous retrieval on different representations. Hence, both approaches are based on multi-view learning and later apply a voting procedure for reliably propagating annotations to unlabeled data. We evaluate our methods on the MNIST database of handwritten digits and introduce a realistic application in the form of a database of handwritten historical weather reports. The experiments show that our method is able to significantly reduce the human effort required to build a character recognizer for the data collection considered while still achieving recognition rates that are close to a supervised classification experiment.
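The propagation step shared by both methods can be caricatured as: cluster first, annotate a few members per cluster, then spread those labels to the rest of the cluster by majority vote among the annotated seeds. A minimal sketch under that assumption (cluster ids, sample ids, and function names are all hypothetical, and the multi-view aspect is omitted):

```python
from collections import Counter

def propagate_labels(clusters, seeds):
    """Spread each cluster's seed labels to all of its members.

    clusters: {cluster_id: [sample_id, ...]}
    seeds:    {sample_id: label} for the few manually annotated samples.
    Returns a full {sample_id: label} map; clusters without any seed
    remain unlabeled.
    """
    labels = {}
    for cid, members in clusters.items():
        votes = Counter(seeds[m] for m in members if m in seeds)
        if votes:
            winner, _ = votes.most_common(1)[0]
            for m in members:
                # Annotated samples keep their own label; the rest
                # inherit the cluster's majority label.
                labels[m] = seeds.get(m, winner)
    return labels
```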

17.
Efficient extraction of schemas for XML documents (cited 3 times: 0 self, 3 other)
In this paper, we present a technique for efficient extraction of concise and accurate schemas for XML documents. By restricting the schema form and applying some heuristic rules, we achieve both efficiency and conciseness. The result of an experiment with real-life DTDs shows that our approach attains high accuracy and is 20 to 200 times faster than existing approaches.

18.
An efficient and scalable algorithm for clustering XML documents by structure (cited 11 times: 0 self, 11 other)
With the standardization of XML as an information exchange language over the Internet, a huge amount of information is formatted in XML documents. In order to analyze this information efficiently, decomposing the XML documents and storing them in relational tables is a popular practice. However, query processing becomes expensive since, in many cases, an excessive number of joins is required to recover information from the fragmented data. If a collection consists of documents with different structures (for example, they come from different DTDs), mining clusters in the documents could alleviate the fragmentation problem. We propose a hierarchical algorithm (S-GRACE) for clustering XML documents based on structural information in the data. The notion of structure graph (s-graph) is proposed, supporting a computationally efficient distance metric defined between documents and sets of documents. This simple metric yields our new clustering algorithm which is efficient and effective, compared to other approaches based on tree-edit distance. Experiments on real data show that our algorithm can discover clusters not easily identified by manual inspection.

19.
This article summarizes ‘generalized response surface methodology’ (GRSM), extending Box and Wilson’s ‘response surface methodology’ (RSM). GRSM allows multiple random responses, selecting one response as goal and the other responses as constrained variables. Both GRSM and RSM estimate local gradients to search for the optimum. These gradients are based on local first-order polynomial approximations. GRSM combines these gradients with Mathematical Programming findings to estimate a better search direction than the steepest ascent direction used by RSM. Moreover, these gradients are used in a bootstrap procedure for testing whether the estimated solution is indeed optimal. The focus of this paper is the optimization of simulated (not real) systems.

20.
A file system tailored to the general needs of the office environment is proposed. This system supports large numbers of a wide variety of documents and inexact fuzzy queries on the documents. The file system is based on a multilevel file structure that combines and extends multikey extendible hashing and signature files to create a document-retrieval system that is more time efficient than other previously proposed systems and is also space efficient.
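The signature-file half of such a structure rests on superimposed coding: each word sets a few pseudo-random bits, a document's signature is the bitwise OR of its words' signatures, and a query can match only if all of its bits are present, so the test admits false positives but never false negatives. A generic sketch of the idea (parameters and names are illustrative, not from the paper):

```python
def word_signature(word, width=32, bits=3):
    """Superimposed coding: hash each word onto a few set bits."""
    sig = 0
    for i in range(bits):
        sig |= 1 << (hash((word, i)) % width)
    return sig

def doc_signature(words):
    """A document's signature is the OR of its words' signatures."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def maybe_contains(doc_sig, query_words):
    """True whenever the document holds all query words; may also be
    True spuriously (a false positive), but never falsely False."""
    q = doc_signature(query_words)
    return doc_sig & q == q
```

In a real system the cheap signature test filters the collection first, and only the surviving candidates are checked exactly, which is what makes the structure time efficient while staying compact.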


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号