Similar documents
 20 similar documents found
1.
When searching or browsing documents, the genre of a document is an important consideration that complements topical characterization. We examine design considerations for automatic tagging of office document pages with genre membership. These include selecting features that characterize genre-related information in office documents, examining the utility of text-based features and image-based features, and proposing a simple ensemble method to improve the performance of genre identification. Experiments were conducted on the open-set identification of four coarse office document genres: technical paper, photo, slide, and table. Our experiments show that when combined with image-based features, text-based features do not significantly influence performance. These results provide support for a topic-independent approach to identification of coarse office document genres. Experiments also show that our simple ensemble method significantly improves performance relative to using a support vector machine (SVM) classifier alone. We demonstrate the utility of our approach by integrating our automatic genre tags in a faceted search and browsing application for office document collections.

2.
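In a large software system, using XML to represent documents of many different formats, along with other information, greatly simplifies the system's programming interfaces and speeds up information exchange and sharing. The XML document processing subsystem is an important component of this system and gives the other subsystems a unified interface for storing and processing all kinds of information efficiently. Starting from related research on XML document mapping schemes, the design of the subsystem proposes a simple and efficient scheme for mapping XML documents to a relational database, which effectively increases the speed at which the system processes XML documents.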

3.
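Using indexing techniques, a dual-index structure is built over the input XML document to improve the YFilter algorithm and optimize XML document filtering performance. With the help of the index structure, the algorithm looks ahead at the structural information of element nodes in the document and excludes in advance those element nodes that cannot yield any match, which avoids a large amount of unnecessary query processing. Experimental results show that the algorithm filters well when the input XML document is large.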

4.
Secure broadcasting of web documents is becoming a crucial requirement for many web-based applications. Under the broadcast document dissemination strategy, a web document source periodically broadcasts (portions of) its documents to a potentially large community of users, without the need for explicit requests. By secure broadcasting, we mean that the delivery of information to users must obey the access control policies of the document source. Traditional access control mechanisms that have been adapted for XML documents, however, do not address the performance issues inherent in access control. In this paper, a labeling scheme is proposed to support rapid reconstruction of XML documents in the context of a well-known method called XML pool encryption. The proposed labeling scheme supports the speedy inference of structure information in all portions of the document. The binary representation of the proposed labeling scheme is also investigated. Experimental results show that the proposed labeling scheme is efficient in searching for the location of decrypted information.

5.
Implementation techniques for relational database management systems (DBMSs) have proven their efficiency and robustness in many existing systems. However, many of these concepts and mechanisms cannot be used when implementing a native XML DBMS (XDBMS) because of substantial differences in the processing properties of natively stored XML documents as compared to relational tables. Therefore, we have to develop new and appropriate techniques with ACID transaction guarantees tailored to the processing characteristics of tree documents and the operations on them.

For this reason, we want to provide an efficient infrastructure for XDBMSs consisting of tree node addressing and indexing together with fine-grained locking of tree nodes. In this respect, our prime and novel contribution is to reveal the potential of our prefix-based node labeling, called DeweyIDs, for supporting record addressing, indexing, and locking protocols. In this paper, we first sketch our version of prefix-based node labeling and summarize a quantitative study of it. An overview of our layered XDBMS architecture indicates the concepts and functionalities to be reused from relational DBMS implementations. The core part of the paper describes the infrastructural services for XML document storage with compressed DeweyIDs, the principles and methods for navigational and declarative processing of queries, as well as the lock modes and protocols to enable efficient collaboration. Selected empirical experiments evaluate the XTC system performance and support our system assessment.
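
As a rough illustration of prefix-based node labeling in general, the sketch below assigns plain Dewey labels to the elements of a small XML tree. It is not the paper's compressed DeweyID encoding: the function names `dewey_labels` and `is_ancestor` and the sample document are assumptions made for this example. Each label is the parent's label plus the child's 1-based position, so an ancestor test becomes a prefix test and sorting the labels yields document order.

```python
import xml.etree.ElementTree as ET


def dewey_labels(root):
    """Map each Dewey label (a tuple of ints) to the tag of the element it names."""
    labels = {}

    def visit(elem, label):
        labels[label] = elem.tag
        # A child's label is its parent's label extended by its 1-based position.
        for pos, child in enumerate(elem, start=1):
            visit(child, label + (pos,))

    visit(root, (1,))
    return labels


def is_ancestor(a, b):
    """True if label a is a proper prefix of label b."""
    return len(a) < len(b) and b[:len(a)] == a


doc = ET.fromstring("<book><title/><chapter><section/><section/></chapter></book>")
labels = dewey_labels(doc)

# Tuples compare lexicographically, so sorting the labels gives document order.
for label in sorted(labels):
    print(".".join(map(str, label)), labels[label])
```

Because the ancestor relationship can be read off the label alone, a lock manager or index can reason about a node's ancestors without touching the stored document, which is the property the abstract relies on.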


6.
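To address the data-exchange security problems of existing office systems, this paper proposes a security solution for data exchange. XML Signature is applied to protect the integrity of electronic official documents in transit, and XML Encryption is applied to protect their confidentiality both in local storage and in transit. Together these effectively solve the security problems of office automation systems.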

7.
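An XML-based document processing model   Total citations: 1, self-citations: 0, citations by others: 1
During the development of a military software system, inconsistent formats and poor structure in the system documents made document management, database storage, and resource sharing inconvenient. To solve these problems, an XML-based document processing model is presented. It applies XML and Oracle XML DB to give documents a structured form and map them to a relational database, and the mapping preserves the semantic constraints of the document schema as well as document fidelity. The structure of the model and its implementation techniques are described in detail, and an application example is given.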

8.
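This paper analyzes the problems of traditional document generation and notes that some common document types now support XML formats. It proposes converting the data in a dataset into the specific XML string supported by such a document type and then quickly generating the document directly as a file stream. In other words, documents of common types are generated directly from XML.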

9.
TEXPROS (TEXt PROcessing System) is an automatic document processing system which supports text-based information representation and manipulation, conveying meanings from stored information within office document texts. A dual modeling approach is employed to describe office documents and support document search and retrieval. The frame templates for representing document classes are organized to form a document type hierarchy. Based on its document type, the synopsis of a document is extracted to form its corresponding frame instance. According to user-predefined criteria, these frame instances are stored in different folders, which are organized as a folder organization (i.e., a repository of frame instances associated with their documents). The concept of linking folders establishes filing paths for automatically filing documents in the folder organization. By integrating document type hierarchy and folder organization, the dual modeling approach provides efficient frame instance access by limiting the searches to those frame instances of a document type within those folders which appear to be the most similar to the corresponding queries.

This paper presents an agent-based document filing system using folder organization. A storage architecture is presented to incorporate the document type hierarchy, folder organization and original document storage into a three-level storage system. This folder organization supports an effective filing strategy and allows rapid frame instance searches by confining the search to the actual predicate-driven retrieval method. A predicate specification is proposed for specifying criteria on filing paths in terms of user-predefined predicates for governing the document filing. A method for evaluating whether a given frame instance satisfies the criteria of a filing path is presented. The basic operations for constructing and reorganizing a folder organization are proposed.

10.
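Reconstructing XML document fragments from split storage in a relational database   Total citations: 7, self-citations: 0, citations by others: 7
The paper first summarizes the various methods for storing XML documents split across a relational database and introduces X-RESTORE, covering its principle for split storage of XML documents in a relational database and its relational storage schema. It then analyzes three computation modes of XML queries. Finally, based on X-RESTORE, it gives an algorithm for reconstructing result document fragments and analyzes the execution cost of that algorithm. The analysis shows that X-RESTORE effectively supports not only the selection mode of XML queries but also the extraction mode and the reconstruction mode.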

11.
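Filtering and dissemination of XML data is one of the research hotspots in XML applications. To improve the transmission efficiency of XML documents, compression is necessary. This paper proposes a filtering and fragment dissemination method for compressed XML documents. Experiments show that the method keeps the transmission-efficiency advantage of compressed documents while filtering out data that no query targets, which improves the matching and dissemination efficiency of the dissemination center.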

12.
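Building a tree model for XML document parsing using relational tables   Total citations: 2, self-citations: 1, citations by others: 1
祝青 (Zhu Qing), 阳王东 (Yang Wangdong). 计算机应用 (Journal of Computer Applications), 2009, 29(6): 1719-1721
In studying data parsing and query operations on XML documents, the authors found that a tree reflects the hierarchical structure of an XML document well but is inefficient to query, whereas a relational table is a data structure suited to storing large amounts of data, with good query efficiency and rich operations. They present a data model for storing XML documents that combines a tree with relational tables. During parsing, the model uses a callback, event-driven, segmented parsing method to reduce parsing time and storage space. This preserves the structural characteristics of the XML document while improving query efficiency and ease of operation. Parsing and manipulation experiments on large XML documents show that the data model has a clear advantage when handling large XML documents.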
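
The general idea of feeding callback-style parse events into a relational table can be sketched as follows. This is only an illustration under stated assumptions, not the authors' model: the `node` table layout, the `NodeTableHandler` class, and the `load_xml` helper are hypothetical names chosen for the example. It uses Python's standard `xml.sax` and `sqlite3` modules, stores one row per element with a pointer to its parent, and never builds an in-memory tree.

```python
import sqlite3
import xml.sax


class NodeTableHandler(xml.sax.ContentHandler):
    """SAX callbacks that write one row per element: (id, parent_id, tag, text)."""

    def __init__(self, conn):
        super().__init__()
        self.conn = conn
        self.stack = []      # open elements as [node_id, text_chunks]
        self.next_id = 1

    def startElement(self, name, attrs):
        parent_id = self.stack[-1][0] if self.stack else None
        node_id = self.next_id
        self.next_id += 1
        self.conn.execute(
            "INSERT INTO node (id, parent_id, tag, text) VALUES (?, ?, ?, '')",
            (node_id, parent_id, name),
        )
        self.stack.append([node_id, []])

    def characters(self, content):
        # Text may arrive in several chunks; collect them for the open element.
        if self.stack:
            self.stack[-1][1].append(content)

    def endElement(self, name):
        node_id, chunks = self.stack.pop()
        self.conn.execute(
            "UPDATE node SET text = ? WHERE id = ?",
            ("".join(chunks).strip(), node_id),
        )


def load_xml(xml_text):
    """Parse xml_text event by event into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, tag TEXT, text TEXT)"
    )
    xml.sax.parseString(xml_text.encode("utf-8"), NodeTableHandler(conn))
    return conn


conn = load_xml(
    "<library><book><title>XML</title></book><book><title>SQL</title></book></library>"
)
# The parent_id column keeps the tree shape, so a self-join answers a path query.
titles = [
    row[0]
    for row in conn.execute(
        "SELECT c.text FROM node c JOIN node p ON c.parent_id = p.id "
        "WHERE p.tag = 'book' AND c.tag = 'title' ORDER BY c.id"
    )
]
print(titles)
```

The `parent_id` column preserves the hierarchy while the table form lets ordinary SQL do the querying, which mirrors the trade-off the abstract describes.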

13.
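For DTD-based storage of XML documents in a relational database, a node-model mapping method is used to represent the logical structure of the target XML document (that is, its XML schema or DTD) as a relational schema. The paper also describes how to add constraints to the resulting relational schema so that the constraint information implicit in the original XML document is preserved. In addition, because the elements of an XML document are often mutually recursive, it explains how to store recursive XML documents. Finally, an example shows that the method is sound and effective.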

14.
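A simulation study of XML document similarity   Total citations: 1, self-citations: 0, citations by others: 1
Computing the similarity of XML documents is a hard problem in XML document classification. This paper describes a structure-based method that uses sequential pattern mining to find the maximal similar paths between two documents. The similarity of the two documents is then obtained as the ratio of the number of nodes on the maximal similar paths to the number of nodes on all paths. The paper also proposes a new method for minimizing XML documents and takes both the semantic similarity and the structural similarity of document nodes into account, which further improves the accuracy of the similarity computation. Experiments show that the method has good application prospects.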
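
The node-count ratio described above can be illustrated with a deliberately simplified sketch. It does not use sequential pattern mining and ignores semantic similarity: it treats a path as "similar" only when the identical root-to-leaf tag sequence occurs in both documents. The function names `leaf_paths` and `path_similarity` and the sample documents are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def leaf_paths(elem, prefix=()):
    """Return the set of root-to-leaf tag sequences below elem."""
    path = prefix + (elem.tag,)
    children = list(elem)
    if not children:
        return {path}
    paths = set()
    for child in children:
        paths |= leaf_paths(child, path)
    return paths


def path_similarity(xml_a, xml_b):
    """Nodes on paths shared by both documents divided by nodes on all distinct paths."""
    a = leaf_paths(ET.fromstring(xml_a))
    b = leaf_paths(ET.fromstring(xml_b))
    shared = sum(len(p) for p in a & b)
    total = sum(len(p) for p in a | b)
    return shared / total


doc_a = "<a><b/><c><d/></c></a>"
doc_b = "<a><b/><c><e/></c></a>"
print(path_similarity(doc_a, doc_a))
print(path_similarity(doc_a, doc_b))
```

Here the two documents share only the path a/b (2 nodes) out of three distinct paths (2 + 3 + 3 = 8 nodes), so the score is 2/8. A mined maximal common subpath such as a/c would raise that score, which is the refinement the abstract's method adds.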

15.
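XML's self-describing and extensible nature makes it well suited to exchanging data across heterogeneous domains, and using XML as a data exchange format requires strong support from XML transformation technology. To transform XML documents across heterogeneous domains automatically, an XML Schema matching method is proposed that establishes mappings between schema elements. The resulting mapping file can be translated into an XSLT script that transforms XML documents automatically. Experimental results show that the method achieves high precision and recall.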

16.
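This paper studies the Biba strict integrity policy on fine-grained XML documents. By analyzing the structural constraints of XML documents, it establishes integrity constraint rules for XML documents and thereby extends the Biba strict integrity policy to them. It builds an XML document model that includes integrity attributes and analyzes the structural characteristics of XML documents. It proposes integrity label propagation rules to support XML documents whose integrity is only partially labeled. Finally, it discusses mechanisms for implementing the integrity policy.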

17.
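徐明 (Xu Ming), 庄毅 (Zhuang Yi). 计算机科学 (Computer Science), 2006, 33(2): 205-207
As a mainstream paradigm for building open and distributed application systems, multi-agent systems have broad research prospects and application value. With the support of the Unified Modeling Language (UML), research on agent-oriented software engineering has begun to mature. Some agent-oriented methodologies provide tools, application methods, or techniques for developing multi-agent systems. With the development of Web services technology, XML has become the standard for organizing and exchanging data on the Internet. The multi-agent systems proposed in existing research, however, provide little support for XML documents. To address this problem, an XML-based multi-agent system, XMAS, is designed. The system represents the XML document data model as a rooted connected directed graph and provides a corresponding document schema extraction algorithm, parsing of XML document data, and related support for Web services. Index optimization during data storage gives XMAS good data query performance.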

18.
XML has recently become very popular as a means of representing semistructured data and as a standard for data exchange over the Web, because of its varied applicability in numerous applications. Therefore, XML documents constitute an important data mining domain. In this paper, we propose a new method of XML document clustering by a global criterion function, considering the weight of common structures. Our approach initially extracts representative structures of frequent patterns from schemaless XML documents using a sequential pattern mining algorithm. Then, we perform clustering of an XML document by the weight of common structures, without a measure of pairwise similarity, assuming that an XML document is a transaction and frequent structures extracted from documents are items of the transaction. We conducted experiments to compare our method with previous methods. The experimental results show the effectiveness of our approach.

19.
XML is acknowledged as the most effective format for data encoding and exchange over domains ranging from the World Wide Web to desktop applications. However, large-scale adoption into actual system implementations is being slowed down by the inefficiency of its document-parsing methods. The recent development of lazy parsing techniques is a major step towards improving this situation, but lazy parsers still have a key drawback: they must load the entire XML document in order to extract the overall document structure before document parsing can be performed. We have developed a framework for efficient parsing based on the idea of placing internal physical pointers within the XML document that allow the navigation process to skip large portions of the document during parsing. We show how to generate such internal pointers in a way that optimizes parsing using constructs supported by the current W3C XML standard. A double-lazy parser (2LP) exploits these internal pointers to efficiently parse the document. The usage of supported W3C constructs to create internal pointers allows 2LP to be backward compatible, i.e., the pointer-augmented documents can be parsed by current XML parsers. We also implemented a mechanism to efficiently parse large documents with limited main memory, thereby overcoming a major limitation in current solutions. We study our pointer generation and parsing algorithms both theoretically and experimentally, and show that they perform considerably better than existing approaches.

20.
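The approximate join of XML documents finds similar XML documents across two collections of XML documents, and it is widely used in systems such as XML-based information integration and XML data cleaning. A notable problem with current approximate join operations, however, is that when the documents differ substantially, a large amount of redundant computation lowers processing efficiency. To address this problem, a clustering-based approximate join method for XML documents is proposed. The basic idea is to build an index for each XML document: documents from the two datasets whose indexes are similar are grouped into a cluster, and the approximate join is then performed within each cluster. Documents that fall in no cluster need no computation at all. Experimental results show that the proposed method is efficient while maintaining accuracy.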
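
The pruning effect of grouping documents before joining them can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the proposed method: the per-document "index" is taken to be the set of element tags, and two documents land in the same cluster only when those sets are equal, whereas the abstract allows merely similar indexes. The names `signature` and `candidate_pairs` and the sample documents are hypothetical.

```python
from collections import defaultdict
from itertools import product
import xml.etree.ElementTree as ET


def signature(xml_text):
    """Cheap per-document index: the set of element tags the document uses."""
    return frozenset(e.tag for e in ET.fromstring(xml_text).iter())


def candidate_pairs(left, right):
    """Cluster documents by signature and pair only documents in the same cluster."""
    clusters = defaultdict(lambda: ([], []))
    for name, text in left.items():
        clusters[signature(text)][0].append(name)
    for name, text in right.items():
        clusters[signature(text)][1].append(name)
    pairs = []
    for left_names, right_names in clusters.values():
        # A cluster with documents from only one side contributes no pairs.
        pairs.extend(product(left_names, right_names))
    return sorted(pairs)


left = {"l1": "<order><item/><price/></order>", "l2": "<memo><body/></memo>"}
right = {"r1": "<order><price/><item/></order>", "r2": "<invoice><total/></invoice>"}
pairs = candidate_pairs(left, right)
print(pairs)
```

Only the surviving pairs would be handed to the expensive similarity computation. In this example that is one pair instead of four, and `l2` and `r2` are never compared with anything.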
