Similar literature
 19 similar documents found (search time: 125 ms)
1.
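This paper reviews current technologies for electronic documents and discusses open problems in generating and managing them. It then proposes a method that, building on XML structure, automatically extracts electronic text from the relevant elements of web pages and manages the resulting HTML electronic documents.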
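
As a rough illustration of extracting text from selected page elements, the following minimal sketch uses Python's standard `html.parser`. The class name `ElementTextExtractor`, the function `extract_text`, and the choice of `h1` and `p` as the relevant elements are assumptions made for this example; the abstract does not specify the paper's actual extraction rules.

```python
from html.parser import HTMLParser


class ElementTextExtractor(HTMLParser):
    """Collects text that appears inside a chosen set of elements (hypothetical helper)."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.depth = 0      # how many wanted elements are currently open
        self.chunks = []    # extracted text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only while inside at least one wanted element.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, wanted=("h1", "p")):
    """Return the text fragments found inside the wanted elements."""
    parser = ElementTextExtractor(wanted)
    parser.feed(html)
    parser.close()
    return parser.chunks


page = "<html><body><h1>Title</h1><div>skip</div><p>Body <b>text</b></p></body></html>"
fragments = extract_text(page)
```

The sketch assumes well-formed markup in which every wanted element is explicitly closed; real pages with unclosed `<p>` tags would need more careful handling.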

2.
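To address the difficulty of extracting structured data from financial announcements quickly and efficiently, an information extraction method based on document structure and a Bi-LSTM-CRF network model is proposed. A custom algorithm generates a document structure tree, and rules are used to extract the required node information from that tree. Local sentence rules based on the trigger words of information sentences are constructed to extract the information sentences that contain structured field information. Structured extraction of the fields is treated as a sequence labeling problem: a domain knowledge dictionary is added during word segmentation, and a Bi-LSTM-CRF neural network model is built to recognize field information. Experimental results show that the method can handle structured information extraction for multiple types of announcements, with average F1 scores above 91% for both information sentence extraction and field information extraction, which verifies the feasibility and practicality of the method in product business.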

3.
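This paper proposes a schema and data extraction method, based on user topic information, for Web documents described in XML. It uses a learning algorithm to extract rules from sample documents and then a matching algorithm to extract data from target documents. An improved parsing method is used to parse the XML documents, and during schema extraction a sequential covering algorithm trains the schema from a set of sample XML documents. The data extraction algorithm searches the parsed XML document tree for the information the user needs and can find the required data efficiently and accurately.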

4.
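A document type definition (DTD) is a normalized description of the common features of the logical structure of a class of documents. A document's structure, which describes the hierarchical relationships of its content, is a concrete instance of the DTD and is constrained by it. By using a fast locating method to find document structure nodes within the DTD, this paper proposes a document structure generation algorithm based on DTD constraints. The algorithm can provide an efficient real-time constraint mechanism and a stricter validation mechanism for structure-based document processing.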

5.
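Extracting the logical structure is the most important part of document image understanding. Current research focuses mainly on page layout analysis; the few studies on document logical structure address only single-page documents or multi-page documents with simple relationships between pages. Construction tender documents are special in that their hierarchical logical structure carries no explicit index information. This paper proposes a method that obtains the logical structure of a document from the reference relationships between pages. The method represents the logical structure as a modified tree; building the logical tree is itself the process of obtaining the logical structure, and the tree also facilitates higher-level semantic processing and restored output. The method has been implemented in an automatic tender document processing system, where it ensures the system's flexibility and efficiency.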

6.
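To address similarity computation and plagiarism detection for documents written in Uyghur, a content-based Uyghur plagiarism detection (U-PD) method is proposed. First, a preprocessing stage performs word segmentation, stop-word removal, stemming, and synonym replacement on the Uyghur text; stemming is implemented with an N-gram statistical model. Then, the BKDRhash algorithm computes a hash value for each text block, and these values form the hash fingerprint of the whole document. Finally, using the hash fingerprints, the RKR-GST matching algorithm matches the document against the document library at the document, paragraph, and sentence levels to obtain document similarity, which is the basis for plagiarism detection. Experimental evaluation on Uyghur documents shows that the proposed method detects plagiarized documents accurately and is feasible and effective.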
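
The fingerprinting step can be sketched roughly as follows. The sketch uses the commonly cited form of the BKDR string hash (multiply by a seed such as 131, add the next byte). The block size of three tokens, the 32-bit mask, and the function names `bkdr_hash`, `fingerprint`, and `overlap` are assumptions for illustration. Note that the comparison here is a plain Jaccard set overlap, swapped in for brevity; it is not the RKR-GST matching the abstract describes.

```python
def bkdr_hash(text, seed=131, bits=32):
    """BKDR string hash: h = h * seed + byte, truncated to `bits` bits."""
    mask = (1 << bits) - 1
    h = 0
    for byte in text.encode("utf-8"):
        h = (h * seed + byte) & mask
    return h


def fingerprint(tokens, block_size=3):
    """Hash every run of `block_size` consecutive tokens (assumed block definition)."""
    if not tokens:
        return set()
    last_start = max(len(tokens) - block_size, 0)
    return {
        bkdr_hash(" ".join(tokens[i:i + block_size]))
        for i in range(last_start + 1)
    }


def overlap(fp_a, fp_b):
    """Jaccard similarity of two fingerprints (a stand-in, not RKR-GST)."""
    union = fp_a | fp_b
    if not union:
        return 0.0
    return len(fp_a & fp_b) / len(union)


# Tokens are assumed to be already segmented, stemmed, and stop-word filtered.
suspect = "the quick brown fox jumps".split()
source = "the quick brown fox sleeps".split()
similarity = overlap(fingerprint(suspect), fingerprint(source))
```

Hashing overlapping token blocks means a local edit changes only the few hashes that cover it, so shared passages still contribute matching fingerprint entries.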

7.
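A trust pattern guides judgments about whether something is trustworthy; with trust patterns, trust can be established quickly and efficiently. Information documents also have trust patterns. Different types of information documents have different specifications and requirements, from which the structural trust patterns of information documents can be extracted. The extracted patterns can be used to judge structural completeness, content coherence, and format conformance. Trust patterns are described in the ALCCTL logic, and the information document is verified by model checking. If the document model satisfies the logical formulas, the document satisfies the trust patterns; otherwise, the error location is identified and the violated trust patterns are output. This error information can be used for document correction, review, or trusted computing.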

8.
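Using indexing techniques, a dual-index structure is built for the input XML document to improve the YFilter algorithm and optimize XML document filtering performance. With the index structure, the algorithm looks ahead at the structural information of element nodes in the document and excludes in advance the element nodes that cannot lead to any matching result, avoiding a large amount of unnecessary query processing. Experimental results show that the algorithm filters well when the input XML document is large.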

9.
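Based on the semi-structured nature of electronic medical record documents, XML electronic medical record documents are constructed. To address the high computational cost of the original multi-signature scheme and the question of how to make the decomposed sub-documents meaningful, the multi-signature algorithm is improved on the basis of the algorithm of Wu et al.: the signature rules are optimized to solve the signature efficiency problem, and the result is evaluated and analyzed. Drawing on the characteristics of XML electronic medical record documents, a medical keyword index library is built so that the decomposed medical record sub-documents are meaningful. The multi-signature scheme is implemented on the basis of the XML Signature specification and has a degree of extensibility.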

10.
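A multi-level document clustering algorithm based on association rules   Total citations: 3, self-citations: 0, citations by others: 3
A new multi-level document clustering algorithm based on association rules is proposed. The algorithm uses a new document feature extraction method to construct topic and keyword feature vectors for documents. First, documents are initially clustered in the topic feature vector space using a fast frequent itemset algorithm; then, in a new feature vector space based on topic keywords, the initial document clusters are refined using inter-cluster distance and connectivity to obtain the final clusters. The two-level clustering approach greatly improves both the efficiency and the accuracy of the algorithm, and the new feature extraction method also solves the problem of excessively high-dimensional document feature vectors caused by too many document keywords.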

11.
Structure analysis of table form documents is an important issue because a printed document, and even an electronic document, does not provide logical structural information but merely geometrical layout and lexical information. To handle these documents automatically, logical structure information is necessary. In this paper, we first analyze the elements of the form documents from a communication point of view and retrieve the grammatical elements that appear in them. Then, we present a document structure grammar which governs the logical structure of the form documents. Finally, we propose a structure analysis system for table form documents based on the grammar. By using grammar notation, we can easily modify the grammar and keep it consistent, as the rules are relatively simple. Another advantage of using grammar notation is that it can be used for generating documents from logical structure alone. In our system, documents are assumed to be composed of a set of boxes, which are classified into seven box types. Then the box relations between the indication box and its associated entry box are analyzed based on the semantic and geometric knowledge defined in the document structure grammar. Experimental results have shown that the system successfully analyzed several kinds of table forms.

12.
The most noticeable characteristic of a construction tender document is that its hierarchical architecture is not obviously expressed but is implied in the citing information. Currently available methods cannot deal with such documents. In this paper, the intra-page and inter-page relationships are analyzed in detail. The creation of citing relationships is essential to extracting the logical structure of tender documents. The hierarchy of tender documents naturally leads to extracting and displaying the logical structure as a tree structure. This method is successfully implemented in VHTender, and is the key to the efficiency and flexibility of the whole system. Received February 28, 2000 / Revised October 20, 2000

13.
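To address the scarcity of corpora and the complexity of corpus annotation when machine learning methods are used to recognize the structure of flow documents, this paper studies the logical structure and editing-semantic features of documents, establishes an annotation scheme for the logical structure of flow documents, and proposes a three-stage semi-automatic method for annotating document logical structure: the first stage uses computer-assisted manual work to annotate document metadata separately, the second stage automatically rebuilds the logical structure, and the third stage automatically fills in the feature vectors. Experimental results show that the proposed annotation method saves labor costs and improves the precision and recall of machine learning algorithms for document structure recognition, with an F-value of 97.5%.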

14.
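XML file conversion based on a document tree   Total citations: 1, self-citations: 0, citations by others: 1
With the continuing development of the Internet and XML technology, the demand for converting between XML files and unstructured text files keeps growing. To address this, the article proposes an XML file conversion method based on a document tree. The method describes the structure and content of a text file as a document tree and traverses the tree under specific mapping rules to convert between XML files and text files, with RTF files as the representative case. Finally, the construction of the document tree and the related algorithms are described.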
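
The idea of traversing a document tree under mapping rules can be sketched as below. The `Node` class, the `TAG_MAP` rule table, and the element names are hypothetical choices for this example; the abstract does not give the paper's actual tree layout or mapping rules, and RTF parsing itself is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List
import xml.etree.ElementTree as ET


@dataclass
class Node:
    """One node of a document tree: a structural kind, optional text, children."""
    kind: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


# Hypothetical mapping rules: document-tree node kind -> XML element name.
TAG_MAP = {"document": "doc", "section": "sec", "paragraph": "p"}


def to_xml(node: Node, tag_map: Dict[str, str]) -> ET.Element:
    """Depth-first traversal that emits one XML element per tree node."""
    elem = ET.Element(tag_map[node.kind])
    if node.text:
        elem.text = node.text
    for child in node.children:
        elem.append(to_xml(child, tag_map))
    return elem


def from_xml(elem: ET.Element, tag_map: Dict[str, str]) -> Node:
    """Reverse direction: rebuild the document tree from XML elements."""
    reverse = {tag: kind for kind, tag in tag_map.items()}
    return Node(
        kind=reverse[elem.tag],
        text=elem.text or "",
        children=[from_xml(child, tag_map) for child in elem],
    )


tree = Node("document", children=[
    Node("section", children=[Node("paragraph", "Hello")]),
])
xml_text = ET.tostring(to_xml(tree, TAG_MAP), encoding="unicode")
round_trip = from_xml(ET.fromstring(xml_text), TAG_MAP)
```

Keeping the mapping rules in a plain table, separate from the traversal, means the same walk can serve both directions of the conversion.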

15.
This paper presents a syntactic method for sophisticated logical structure analysis that transforms document images with multiple pages and hierarchical structure into an electronic document based on SGML/XML. To produce a logical structure more accurately and quickly than previous work whose basic units are text lines, the proposed parsing method takes text regions with hierarchical structure as input. Furthermore, we define a document model that is able to describe geometric characteristics and logical structure information of documents efficiently and present its automated creation method. Experimental results with 372 images scanned from the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) show that the method has performed logical structure analysis successfully and generated a document model automatically. In particular, the method generates SGML/XML documents as the result of structural analysis, so that it enhances the reusability of documents and platform independence.

16.
As document sharing through the World Wide Web has been steadily increasing, the need for creating hyperdocuments to make them accessible and retrievable via the internet, in formats such as HTML and SGML/XML, has also been rapidly rising. Nevertheless, few studies have addressed the conversion of paper documents into hyperdocuments. Moreover, most of these studies have concentrated on the direct conversion of single-column document images that include only text and image objects. In this paper, we propose two methods for converting complex multi-column document images into HTML documents, and a method for generating a structured table of contents page based on the logical structure analysis of the document image. Experiments with various kinds of multi-column document images show that, by using the proposed methods, their corresponding HTML documents can be generated in the same visual layout as that of the document images, and their structured table of contents page can also be produced with the hierarchically ordered section titles hyperlinked to the contents.

17.
Successful applications of digital libraries require structured access to sources of information. This paper presents an approach to extract the logical structure of text documents. The extracted structure is made explicit by means of SGML (Standard Generalized Markup Language). Consequently, the extraction is achieved on the basis of grammars that extend SGML with recognition rules. From these grammars parsing automata are generated. These automata are used to partition a flat text document into its elements, to discard formatting information, and to insert SGML markup. Complex document structures and the fallback rules needed for error-tolerant parsing make such automata highly ambiguous. A novel parsing strategy has been developed that ranks and prunes ambiguous parsing paths.

18.
Yaron Wolfsthal 《Software》1991,21(6):625-638
A critical problem in the design of editors for structured documents is that of style control, i.e. mapping the logical elements of the documents to their physical appearance on pages. This paper presents a novel approach to style control, used in the Quill document editing system that has been prototyped at the IBM Almaden Research Center. In our approach, the style control mechanism is an integral part of the editing system and consistent with the overall system architecture, in both its inner structure and its user interface. Properties that specify the formatting process, together with action routines for specifying complex semantics, are the basic style control primitives in the proposed approach.

19.
This paper presents an efficient method for extracting a logical structure from a Web document. The proposed method consists of three phases: visual grouping, element identification, and logical grouping. To produce a logical structure more accurately, the proposed method defines a document model that is able to describe logical structure information of a specific document class. Since the proposed method is based on a visual structure from the visual grouping phase as well as a document model that describes logical structure information of a document type, it supports sophisticated structure analysis. Experimental results with HTML documents from the Web show that the method has performed logical structure analysis successfully, compared with previous work. Particularly, the method generates XML documents as the result of structure analysis, so that it enhances the reusability of documents.
