Similar Documents
20 similar documents found.
1.
A method is proposed for detecting statistical tables that takes metafiles as input; this allows it to be applied to documents of different formats. The method treats table detection as a bottom-up segmentation of a document page, i.e., segmentation proceeding from simple page elements to more complicated ones. Experimental evaluation shows that the method is effective for a wide class of statistical tables.
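The bottom-up step can be illustrated with a toy sketch (my own simplification, not the paper's algorithm): boxes of simple elements that are vertically aligned and horizontally close are merged into larger boxes; repeating the rule with larger thresholds grows lines into blocks.

```python
# Toy bottom-up grouping: merge (x0, y0, x1, y1) boxes that are vertically
# aligned and horizontally adjacent. Thresholds are illustrative guesses.
def merge_boxes(boxes, x_gap=15, y_tol=5):
    merged = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        for i, m in enumerate(merged):
            same_row = abs(box[1] - m[1]) <= y_tol
            adjacent = box[0] - m[2] <= x_gap
            if same_row and adjacent:
                merged[i] = (min(m[0], box[0]), min(m[1], box[1]),
                             max(m[2], box[2]), max(m[3], box[3]))
                break
        else:
            merged.append(box)
    return merged

words = [(10, 10, 40, 22), (45, 11, 90, 22), (10, 40, 70, 52)]
print(merge_boxes(words))   # the first two word boxes merge into one line box
```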

2.
Application of OLE-based Word automation on the VC platform
In engineering projects, the final experimental results and data must be delivered to the user. Plain text cannot express complex tables and figures, and delivering results only through the program's own interface makes it inconvenient for the user to further summarize and revise them. Using Word documents instead exploits Word's powerful editing features to produce complex mixed text, graphics, and reports that the user can further edit and print. This paper describes how, on the VC development platform, OLE technology is used to automate Word and feed experimental results back to the user as richly formatted Word documents. The concrete implementation steps are illustrated with the Word report generation process of a real engineering project.
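For illustration, a minimal sketch of the same OLE automation idea driven from Python with pywin32 instead of VC++ (requires Windows with Word installed; the output path and table contents are hypothetical):

```python
import win32com.client

# Start (or connect to) Word through COM automation.
word = win32com.client.Dispatch("Word.Application")
word.Visible = False

doc = word.Documents.Add()
rng = doc.Range(0, 0)
rng.Text = "Experiment results\n"

# Insert a small 3x2 results table after the heading text.
table = doc.Tables.Add(doc.Range(rng.End, rng.End), 3, 2)
table.Cell(1, 1).Range.Text = "Metric"
table.Cell(1, 2).Range.Text = "Value"
table.Cell(2, 1).Range.Text = "Accuracy"
table.Cell(2, 2).Range.Text = "0.97"

doc.SaveAs(r"C:\temp\report.docx")   # hypothetical path
doc.Close()
word.Quit()
```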

3.
While techniques for evaluating the performance of lower-level document analysis tasks such as optical character recognition have gained acceptance in the literature, attempts to formalize the problem for higher-level algorithms, while receiving a fair amount of attention in terms of theory, have generally been less successful in practice, perhaps owing to their complexity. In this paper, we introduce intuitive, easy-to-implement evaluation schemes for the related problems of table detection and table structure recognition. We also present the results of several small experiments, demonstrating how well the methodologies work and the useful sorts of feedback they provide. We first consider the table detection problem. Here algorithms can yield various classes of errors, including non-table regions improperly labeled as tables (insertion errors), tables missed completely (deletion errors), larger tables broken into a number of smaller ones (splitting errors), and groups of smaller tables combined to form larger ones (merging errors). This leads naturally to the use of an edit distance approach for assessing the results of table detection. Next we address the problem of evaluating table structure recognition. Our model is based on a directed acyclic attribute graph, or table DAG. We describe a new paradigm, “graph probing,” for comparing the results returned by the recognition system and the representation created during ground-truthing. Probing is in fact a general concept that could be applied to other document recognition tasks as well. Received July 18, 2000 / Accepted October 4, 2001
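As a concrete illustration, a simplified sketch (not the paper's exact metric) that sorts detections into these error classes by bounding-box overlap:

```python
# Boxes are (x0, y0, x1, y1). Overlap test is strict interior intersection.
def overlaps(a, b):
    return (min(a[2], b[2]) > max(a[0], b[0]) and
            min(a[3], b[3]) > max(a[1], b[1]))

def classify(gt_tables, detected):
    report = {"correct": 0, "split": 0, "merge": 0,
              "deletion": 0, "insertion": 0}
    for g in gt_tables:
        hits = [d for d in detected if overlaps(g, d)]
        if not hits:
            report["deletion"] += 1    # table missed completely
        elif len(hits) > 1:
            report["split"] += 1       # one table broken into pieces
    for d in detected:
        hits = [g for g in gt_tables if overlaps(g, d)]
        if not hits:
            report["insertion"] += 1   # non-table region labeled as a table
        elif len(hits) > 1:
            report["merge"] += 1       # several tables fused into one
        else:
            report["correct"] += 1     # 1-1 overlap (rough: split pieces land here too)
    return report
```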

4.
5.
An algorithm for page layout analysis (segmentation) is suggested in the paper. It detects whitespace between text blocks on a document page. The algorithm can be used in document analysis and recognition problems; in particular, for column recognition in multicolumn text and tables. The suggested algorithm is simple to implement.
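A minimal sketch of one common way to find such whitespace (a vertical projection profile; the paper's own algorithm may differ):

```python
import numpy as np

def column_gaps(binary_page, min_gap=20):
    """Find whitespace columns: runs of x-positions containing no ink.
    binary_page: 2-D array with 1 = ink, 0 = background."""
    ink_per_col = binary_page.sum(axis=0)
    empty = ink_per_col == 0
    gaps, start = [], None
    for x, is_empty in enumerate(empty):
        if is_empty and start is None:
            start = x
        elif not is_empty and start is not None:
            if x - start >= min_gap:
                gaps.append((start, x))   # (left, right) edges of the gap
            start = None
    if start is not None and len(empty) - start >= min_gap:
        gaps.append((start, len(empty)))
    return gaps
```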

6.

As financial document automation becomes more widespread, table detection is receiving more and more attention as an important part of document automation. Disclosure documents contain both bordered and borderless tables of varying lengths, and no existing model performs well on these documents. To solve this problem, we propose a table detection model, YOLO-table. We introduce involution into the backbone of the network to improve its ability to learn the spatial layout features of tables, and we design a simple Feature Pyramid Network to improve model effectiveness. In addition, this paper proposes a table-based augmentation method. Experiments on a disclosure document dataset show that the F1-measure of YOLO-table reaches 97.3%; compared with YOLOv3, our method improves accuracy by 2.8% and runs 1.25 times faster. Evaluations on the ICDAR 2013 and ICDAR 2019 Table Competition datasets achieve state-of-the-art performance.
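For reference, a compact PyTorch sketch of an involution layer following the published involution design (per-pixel kernels generated from the input and shared across channel groups); this is an illustrative re-implementation, not the authors' code, and the hyperparameters are placeholders:

```python
import torch
import torch.nn as nn

class Involution2d(nn.Module):
    """Involution: each pixel gets its own k*k kernel, generated from the
    feature at that pixel and shared across the channels of each group."""
    def __init__(self, channels, kernel_size=7, groups=1, reduction=4):
        super().__init__()
        self.k, self.g = kernel_size, groups
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.span = nn.Conv2d(channels // reduction, kernel_size ** 2 * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # One k*k kernel per pixel and per group, generated from x itself.
        kernel = self.span(self.reduce(x)).view(b, self.g, 1, self.k ** 2, h, w)
        # The k*k neighborhood of every pixel, grouped by channel group.
        patches = self.unfold(x).view(b, self.g, c // self.g, self.k ** 2, h, w)
        # Weighted sum over the neighborhood positions.
        return (kernel * patches).sum(dim=3).view(b, c, h, w)

y = Involution2d(64)(torch.randn(1, 64, 32, 32))   # -> torch.Size([1, 64, 32, 32])
```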

7.
Fiebig, Thorsten; Moerkotte, Guido. World Wide Web, 2001, 4(3): 167-187
While using an algebra that acts on sets of variable bindings to evaluate XML queries, the problem arises of constructing XML from these bindings. One approach is to define a powerful operator that performs a complex construction of a representation of the XML result document. The drawback is that such an operator in its generality is hard to implement and disables algebraic optimization, since it has to be executed last in the plan. We therefore suggest constructing XML documents by special query execution plans, called construction plans, built from simple, easy-to-implement, and efficient operators. The paper proposes four simple algebraic operators needed for XML document construction. Further, we introduce an optimizing algorithm that translates construction clauses into algebraic expressions, and we briefly point out algebraic optimizations enabled by our approach.
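A minimal sketch of the basic construction step, collapsing the operator pipeline into one loop (element and variable names are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Variable bindings as produced by query evaluation (invented example data).
bindings = [{"title": "A", "year": "2001"},
            {"title": "B", "year": "2002"}]

root = ET.Element("results")
for binding in bindings:
    item = ET.SubElement(root, "item")
    for var, value in binding.items():
        ET.SubElement(item, var).text = value   # one element per bound variable

print(ET.tostring(root, encoding="unicode"))
# <results><item><title>A</title><year>2001</year></item>...</results>
```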

8.
An algorithm for correspondence analysis is described and implemented in SAS/IML (SAS Institute, 1985a). The technique is shown, through the analysis of several biological examples, to supplement the log-linear models approach to the analysis of contingency tables, both in the model identification and model interpretation stages of analysis. A simple two-way contingency table of tumor data is analyzed using correspondence analysis. This example emphasises the relationships between the parameters of the log-linear model for the table and the graphical correspondence analysis results. The technique is also applied to a three-way table of survey data concerning ulcer patients to demonstrate applications of simple correspondence analysis to higher dimensional tables with fixed margins. Finally, the diets and foraging behaviors of birds of the Hubbard Brook Forest are each analyzed and then a simultaneous display of the two separate but related tables is constructed to highlight relationships between the tables.
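For reference, a minimal numpy sketch of simple correspondence analysis of a two-way table via the SVD of the standardized residuals (the counts below are made up for illustration):

```python
import numpy as np

def correspondence_analysis(N):
    """Principal row/column coordinates of a two-way contingency table N."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)                  # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    U, d, Vt = np.linalg.svd(S, full_matrices=False)
    rows = U * d / np.sqrt(r)[:, None]                   # principal row coordinates
    cols = Vt.T * d / np.sqrt(c)[:, None]                # principal column coordinates
    return rows, cols, d ** 2                            # d**2 are the principal inertias

N = np.array([[20.0, 10.0, 5.0],
              [ 5.0, 15.0, 20.0]])
rows, cols, inertias = correspondence_analysis(N)
```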

9.
Existing methods for mapping XML documents to relational databases generally produce many tables and suffer from low query efficiency. Based on model mapping, this paper proposes an XML-to-relational mapping method that incorporates Dewey encoding. The logical structure model of the mapping, a detailed design, the mapping algorithm, and experimental results are presented. The experiments show that the method produces tables with a simple structure and offers performance advantages over traditional algorithms in both document parsing and query processing.
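A minimal sketch of the Dewey-encoding idea (the flattened schema is my own, not the paper's design): each node's code extends its parent's, so ancestry and document order are recoverable from the code alone, and all nodes fit in a single relation:

```python
import xml.etree.ElementTree as ET

def dewey_rows(elem, code="1"):
    """Flatten an XML tree into (dewey_code, tag, text) rows."""
    rows = [(code, elem.tag, (elem.text or "").strip())]
    for i, child in enumerate(elem, start=1):
        rows.extend(dewey_rows(child, f"{code}.{i}"))
    return rows

doc = ET.fromstring("<book><title>XML</title><author>Li</author></book>")
for row in dewey_rows(doc):
    print(row)   # ('1', 'book', ''), ('1.1', 'title', 'XML'), ('1.2', 'author', 'Li')
```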

10.
Document representation and its application to page decomposition
Transforming a paper document to its electronic version in a form suitable for efficient storage, retrieval, and interpretation continues to be a challenging problem. An efficient representation scheme for document images is necessary to solve this problem. Document representation involves techniques of thresholding, skew detection, geometric layout analysis, and logical layout analysis. The derived representation can then be used in document storage and retrieval. Page segmentation is an important stage in representing document images obtained by scanning journal pages. The performance of a document understanding system greatly depends on the correctness of page segmentation and the labeling of different regions such as text, tables, images, drawings, and rulers. We use the traditional bottom-up approach based on connected component extraction to efficiently implement page segmentation and region identification. A new document model that preserves top-down generation information is proposed; based on this model, a document is logically represented for interactive editing, storage, retrieval, transfer, and logical analysis. Our algorithm has high accuracy and takes approximately 1.4 seconds on an SGI Indy workstation for model creation, including orientation estimation, segmentation, and labeling (text, table, image, drawing, and ruler), for a 2550×3300 image of a typical journal page scanned at 300 dpi. This method is applicable to documents from various technical journals and can accommodate moderate amounts of skew and noise.
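A minimal sketch of the connected-component stage with SciPy (the size/aspect rules below are illustrative guesses, not the paper's classifier):

```python
from scipy import ndimage

def regions(binary):
    """Label connected components and guess a region type per component.
    binary: 2-D array with 1 = ink, 0 = background."""
    labels, count = ndimage.label(binary)
    for sl in ndimage.find_objects(labels):   # one bounding slice per component
        h = sl[0].stop - sl[0].start
        w = sl[1].stop - sl[1].start
        if w > 20 * h:                        # long and thin: ruling line
            kind = "ruler"
        elif h < 50:                          # small blob: likely text
            kind = "text"
        else:
            kind = "image"
        yield (sl[0].start, sl[1].start, h, w, kind)
```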

11.
It has been shown that simple substitution ciphers can be solved using statistical methods such as probabilistic relaxation. However, the utility of such solutions has been limited by their inability to cope with noise encountered in practical applications. We propose a new solution to substitution deciphering based on hidden Markov models. We show that our algorithm is more accurate than relaxation and much more robust in the presence of noise, making it useful for applications in compressed document processing. Recovering character interpretations from the sequence of cluster identifiers in a symbolically compressed document can be treated as a cipher problem. Although a significant amount of noise is present in the cluster sequence, enough information can be recovered with a robust deciphering algorithm to accomplish certain document analysis tasks. The feasibility of this approach is demonstrated in a multilingual document duplicate detection system.
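For illustration, a toy Viterbi decoder of the kind an HMM-based decipherer builds on (states would be plaintext characters, observations the cluster identifiers; training the model parameters is a separate step this skeleton omits):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden state path. pi: initial probs (S,), A: transition
    probs (S, S) with A[i, j] = P(j | i), B: emission probs (S, V).
    Assumes strictly positive probabilities (add pseudo-counts in practice)."""
    T, S = len(obs), len(pi)
    delta = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)    # scores[i, j]: best path i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```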

12.
Huge numbers of documents are being generated on the Web, especially news articles and social media posts. How to effectively organize these evolving documents so that readers can easily browse or search them is a challenging task. Existing methods include classification, clustering, and chronological or geographical ordering, which provide only a partial view of the relations among news articles. To better utilize cross-document relations in organizing news articles, in this paper we propose a novel approach that organizes news archives by exploiting their near-duplicate relations. First, we use a sentence-level, statistics-based approach to near-duplicate copy detection, which is language independent, simple, but effective. Since content-based approaches are usually time consuming and not robust to term substitutions, a near-duplicate detection approach can be used instead. Second, by extracting the cross-document relations in a block-sharing graph, we can derive a near-duplicate clustering by cross-document relations in which users can easily browse and find unnecessary repetitions among documents. From the experimental results, we observed high efficiency and good accuracy of the proposed approach in detecting and clustering near-duplicate documents in news archives.
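A minimal sketch of sentence-level near-duplicate detection by hashed-sentence overlap (the threshold and the simple sentence splitter are illustrative, not the paper's exact statistics):

```python
import hashlib
import re

def sentence_hashes(text):
    """Hash each sentence so overlap can be computed without storing text."""
    sents = [s.strip() for s in re.split(r"[。.!?]", text) if s.strip()]
    return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in sents}

def near_duplicate(a, b, threshold=0.6):
    ha, hb = sentence_hashes(a), sentence_hashes(b)
    if not ha or not hb:
        return False
    return len(ha & hb) / min(len(ha), len(hb)) >= threshold
```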

13.
Document digitization frequently produces images with small rotation angles. Skew in document images degrades the performance of optical character recognition (OCR) tools; therefore, skew detection plays an important role in automatic document analysis systems. In this paper, we propose a Rectangular Active Contour Model (RAC Model) for content-region detection and skew-angle calculation by imposing a rectangular shape constraint on the zero-level set of the Chan–Vese Model (C-V Model), exploiting the rectangular shape of content regions in document images. Our algorithm differs from other skew detection methods in that it does not rely on local image features; instead, it uses global image features and a shape constraint to achieve strong robustness in detecting the skew angles of document images. We experimented on different types of document images. Compared with other skew detection algorithms, ours is more accurate on complex document images with different fonts, tables, illustrations, and layouts. The original image needs no pre-processing, even if it is noisy, and the rectangular content region of the document image is detected at the same time.

14.
Similar document detection plays important roles in many applications, such as file management, copyright protection, plagiarism prevention, and duplicate submission detection. State-of-the-art protocols assume that the contents of files stored on a server (or multiple servers) are directly accessible. However, this makes such protocols unsuitable for any environment where the documents themselves are sensitive and cannot be openly read, and it essentially rules out more practical applications, e.g., detecting plagiarized documents between two conferences whose submissions are confidential. We propose novel protocols to detect similar documents between two entities that cannot openly share their documents with each other. The similarity measure used can be a simple cosine similarity on entire documents or on document fragments, enabling detection of partial copying. We conduct extensive experiments to show the practical value of the proposed protocols. While the proposed base protocols are much more efficient than general secure multiparty computation based solutions, they are still slow for large document sets. We therefore investigate a clustering-based approach that significantly reduces the running time and achieves over 90% accuracy in our experiments. This makes secure similar document detection both practical and feasible.
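A plaintext sketch of the similarity measure alone (the secure two-party protocol itself is beyond a few lines): cosine similarity over term counts, applied per fragment so partial copying is also caught; the fragment size is an illustrative guess:

```python
import math
from collections import Counter

def cosine(ca, cb):
    dot = sum(v * cb.get(t, 0) for t, v in ca.items())
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def max_fragment_similarity(doc_a, doc_b, size=100):
    """Highest cosine between any pair of word-level fragments."""
    wa, wb = doc_a.split(), doc_b.split()
    fa = [Counter(wa[i:i + size]) for i in range(0, max(len(wa), 1), size)]
    fb = [Counter(wb[i:i + size]) for i in range(0, max(len(wb), 1), size)]
    return max((cosine(x, y) for x in fa for y in fb), default=0.0)
```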

15.
Cross-language plagiarism detection deals with the automatic identification and extraction of plagiarism in a multilingual setting. In this setting, a suspicious document is given, and the task is to retrieve all sections from the document that originate from a large, multilingual document collection. Our contributions in this field are as follows: (1) a comprehensive retrieval process for cross-language plagiarism detection is introduced, highlighting the differences to monolingual plagiarism detection, (2) state-of-the-art solutions for two important subtasks are reviewed, (3) retrieval models for the assessment of cross-language similarity are surveyed, and, (4) the three models CL-CNG, CL-ESA and CL-ASA are compared. Our evaluation is of realistic scale: it relies on 120,000 test documents which are selected from the corpora JRC-Acquis and Wikipedia, so that for each test document highly similar documents are available in all of the six languages English, German, Spanish, French, Dutch, and Polish. The models are employed in a series of ranking tasks, and more than 100 million similarities are computed with each model. The results of our evaluation indicate that CL-CNG, despite its simple approach, is the best choice to rank and compare texts across languages if they are syntactically related. CL-ESA almost matches the performance of CL-CNG, but on arbitrary pairs of languages. CL-ASA works best on “exact” translations but does not generalize well.
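A rough sketch in the spirit of CL-CNG, comparing texts across syntactically related languages by character n-gram overlap (CL-CNG proper uses cosine over n-gram vectors; set-based Jaccard is used here for brevity):

```python
def char_ngrams(text, n=3):
    t = "".join(ch.lower() for ch in text if ch.isalnum())
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def cl_cng_similarity(a, b, n=3):
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if (ga or gb) else 0.0

# Syntactically related languages share many character n-grams
# even without any translation step:
print(cl_cng_similarity("plagiarism detection", "Plagiarismus-Detektion"))
```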

16.
A window-based skew detection method for OCR page images
During scanning, page images are usually generated with some degree of skew; when the skew angle is too large, subsequent layout analysis and character recognition suffer. To detect the skew angle of page images quickly and accurately while keeping computation low, a skew detection method based on a windowed transform is proposed. The algorithm first blurs the fine details of the text and pictures within a window and then fits straight lines to their edges, enabling fast detection of the page skew angle. Experimental results show that the method quickly and accurately detects the skew angle of various kinds of page images and adapts well to different inputs.
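A rough OpenCV sketch of the same two-step idea (blur the details, then fit straight lines); this is a generic baseline, not the paper's windowed transform, and "page.png" plus all thresholds are hypothetical:

```python
import cv2
import numpy as np

gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input file
binary = cv2.threshold(gray, 0, 255,
                       cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
# "Blur" character detail by smearing ink horizontally into line-shaped blobs.
smeared = cv2.dilate(binary, np.ones((3, 25), np.uint8))
# Detect straight line segments in the smeared image; their median angle
# serves as the skew estimate.
lines = cv2.HoughLinesP(smeared, 1, np.pi / 720, threshold=100,
                        minLineLength=gray.shape[1] // 4, maxLineGap=20)
angles = [np.degrees(np.arctan2(y2 - y1, x2 - x1))
          for x1, y1, x2, y2 in lines[:, 0]]
print("estimated skew angle:", np.median(angles), "degrees")
```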

17.
Shading is an important feature for the comprehension of volume datasets, but it is difficult to implement accurately. Current techniques based on pre-integrated direct volume rendering approximate the volume rendering integral by ignoring non-linear gradient variations between front and back samples, which can result in accumulated shading errors when gradient variations are important and/or when the illumination function features high frequencies. In this paper, we explore a simple approach to pre-integrated volume rendering with non-linear gradient interpolation between front and back samples. We consider that the gradient varies smoothly along a quadratic curve, instead of a segment, between consecutive samples. This not only allows us to compute more accurate shaded pre-integrated look-up tables, but also to process shading-amplifying effects, based on gradient filtering, more efficiently. An interesting property is that the pre-integration tables we use remain two-dimensional, as in usual pre-integrated classification. We conduct experiments using a full hardware approach with the Blinn-Phong illumination model as well as with a non-photorealistic illumination model.
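One natural reading of the quadratic gradient variation (my assumption, not a formula taken from the paper) is a quadratic Bézier between the front and back gradients with an extra control gradient, compared with the linear interpolation of standard pre-integration:

```latex
% Gradient between front sample s_f and back sample s_b, t in [0,1];
% g_c is an assumed intermediate control gradient.
\mathbf{g}_{\mathrm{lin}}(t)  = (1-t)\,\mathbf{g}_f + t\,\mathbf{g}_b, \qquad
\mathbf{g}_{\mathrm{quad}}(t) = (1-t)^2\,\mathbf{g}_f + 2t(1-t)\,\mathbf{g}_c + t^2\,\mathbf{g}_b.
```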

18.
Workflow technology is a powerful software tool for building office automation systems. Its design and implementation must address a series of problems, such as workflow definition, workflow relations, and workflow routing. Drawing on practical experience, this paper defines the document flow in an official document routing system, analyzes the relations among document flow nodes in detail, describes document routing with a relational database model, and, based on this model, presents a routing strategy and its database implementation. The method and model adapt well to office automation systems and are general-purpose.
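A minimal sketch (table and column names are my own, not the paper's schema) of how workflow nodes, transitions, and document instances can be described relationally, as the abstract suggests:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flow_node (
    node_id  INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,   -- e.g. draft, review, approve, archive
    handler  TEXT             -- role responsible at this node
);
CREATE TABLE flow_edge (
    from_node INTEGER REFERENCES flow_node(node_id),
    to_node   INTEGER REFERENCES flow_node(node_id),
    route_rule TEXT           -- routing condition, e.g. 'approved'
);
CREATE TABLE doc_instance (
    doc_id   INTEGER PRIMARY KEY,
    title    TEXT,
    cur_node INTEGER REFERENCES flow_node(node_id)  -- current routing position
);
""")
```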

19.
Tables appearing in natural language documents provide a compact method for presenting relational information in an immediate and intuitive manner, while simultaneously organizing and indexing that information. Despite their ubiquity and obvious utility, tables have not received the same level of formal characterization enjoyed by sentential text. Rather, they are modeled in terms of geometry, simple hierarchies of strings, and database-like relational structures. Tables have been the focus of a large volume of research in the document image analysis field and, lately, have received particular attention from researchers interested in extracting information from non-trivial elements of web pages. This paper provides a framework for representing tables at both the semantic and structural levels. It presents a representation of the indexing structures present in tables and the relationship between these structures and the underlying categories. Matthew Hurst graduated from Edinburgh University in 1992 and completed an MPhil at Cambridge in Computer Speech and Language Processing. He then worked at The University of Edinburgh on a number of projects involving text and document analysis before enrolling in the PhD programme. While studying for his PhD, he completed a European Science and Technology Fellowship in Japan. After working for IBM Research, Tokyo, he moved to the United States of America to work for a number of companies with unique applications utilizing applied natural language processing and document analysis. He is currently the Director of Science and Innovation at Nielsen BuzzMetrics.

20.
This paper considers statistical analysis of recurrent event data when there exist observation gaps. By observation gaps, we mean that some study subjects are out of the study for a period of time for various reasons and then return to the study, possibly more than once. Most existing studies of recurrent events address situations where study subjects are under observation over continuous time periods. For recurrent event data with observation gaps, a naive analysis method is to treat them as usual recurrent events without gaps, either by censoring observations at the times when subjects first leave the study or by ignoring the gaps. As expected, and as shown below, this can yield biased and misleading results. In this paper, we present appropriate methods for the problem. In particular, we consider estimation of the underlying mean function and regression analysis of recurrent event data in the presence of observation gaps. The presented methods are evaluated and compared to the naive approach that ignores observation gaps using extensive simulation studies and an example.
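A standard form of such a gap-adjusted estimator of the mean function (the usual at-risk-set correction consistent with the abstract, not necessarily the authors' exact proposal) keeps a subject in the risk set only while under observation:

```latex
% N_i(t): cumulative number of events of subject i by time t;
% Y_i(s) = 1 iff subject i is under observation at time s (0 during gaps).
\hat{\mu}(t) \;=\; \int_0^t \frac{\sum_{i=1}^{n} \mathrm{d}N_i(s)}{\sum_{i=1}^{n} Y_i(s)}
```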
