Similar literature
 Found 20 similar documents; search took 15 ms
1.
In this paper, we describe an XML document classification framework based on the extreme learning machine (ELM). Building on the Structured Link Vector Model (SLVM), we propose an optimized Reduced Structured Vector Space Model (RS-VSM) that incorporates structural information into feature vectors more efficiently and optimizes the computation of document similarity. We apply ELM to XML document classification to achieve good performance at extremely high speed compared with conventional learning machines (e.g., the support vector machine). A voting-ELM (v-ELM) algorithm is then proposed to improve the accuracy of the ELM classifier. The Revoting of Equal Votes (REV) and Revoting of Confusing Classes (RCC) methods are also proposed to postprocess the voting result of v-ELM and further improve performance. Experiments conducted on real-world classification problems demonstrate that the voting-ELM classifiers presented in this paper achieve better performance than ELM algorithms with respect to precision, recall and F-measure.

2.
3.
In name and in practice, the World‐Wide Web (hereafter Web) is used around the world, beyond English‐speaking areas. This creates a tremendous need to internationalize the standard terminology used in the technologies that make the Web possible. Existing efforts on XML internationalization (i18n) and localization (l10n) have focused on the content of XML documents instead of the terms used in markup (annotations), such as elements and attributes. The SGML standard ISO 8879 supports the use of Unicode (ISO 10646) throughout a document, including markup. However, most elements and attributes of XML documents are still defined in English, thereby limiting their use among non‐English speakers. This paper presents an XSLT‐based method that can completely localize the markup of XML documents into different natural languages. We also describe how the proposed technique can be applied to translation problems in programming (e.g. C and Java) or documentation (e.g. LaTeX or other formatting languages) so that a program or a document can be converted to and from an XML format. Copyright © 2004 John Wiley & Sons, Ltd.

4.
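Fast querying in native XML databases can be implemented with structural join algorithms based on XML document encoding. Such algorithms require the XML document to be encoded so that ancestor-descendant relationships between nodes of the XML document tree can be determined quickly. After surveying existing encoding schemes, this paper proposes a new XML document encoding scheme, prefix-divisible encoding (PDIV). The scheme has a simple code form: a single positive integer suffices to represent a node's position in the XML document tree. It supports fast ancestor-descendant queries and XML document updates, and its codes are short, with a code length of about o(ln(n)).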

5.
XML plays an important role as the standard language for representing structured data for the traditional Web, and hence many Web-based knowledge management repositories store data and documents in XML. If semantics about the data are formally represented in an ontology, then it is possible to extract knowledge: This is done as ontology definitions and axioms are applied to XML data to automatically infer knowledge that is not explicitly represented in the repository. Ontologies also play a central role in realizing the burgeoning vision of the semantic Web, wherein data will be more sharable because their semantics will be represented in Web-accessible ontologies. In this paper, we demonstrate how an ontology can be used to extract knowledge from an exemplar XML repository of Shakespeare’s plays. We then implement an architecture for this ontology using de facto languages of the semantic Web including OWL and RuleML, thus preparing the ontology for use in data sharing. It has been predicted that the early adopters of the semantic Web will develop ontologies that leverage XML, provide intra-organizational value such as knowledge extraction capabilities that are irrespective of the semantic Web, and have the potential for inter-organizational data sharing over the semantic Web. The contribution of our proof-of-concept application, KROX, is that it serves as a blueprint for other ontology developers who believe that the growth of the semantic Web will unfold in this manner.
Henry M. Kim

6.
葛莹歆  夏克俭  曾德华 《计算机工程与设计》2005,26(10):2863-2864,F0003
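With the vigorous development of e-government, cross-platform exchange of electronic official documents has become a problem in urgent need of a solution. XML is a markup language designed to solve the problem of standardizing data exchange on the Internet. Using XML for cross-platform exchange of electronic official documents is the direction in which e-government technology is developing, and it is important for standardizing electronic official documents.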

7.
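Inserting into an XML document raises the problem of adjusting node codes, and many of the encoding schemes proposed so far cannot support both XPath queries and XML document updates well. Based on an analysis of existing encoding schemes, this paper proposes an encoding scheme based on complete trees. The scheme uses two forms of redundancy, redundant sequence numbers and virtual nodes; it supports XPath queries and effectively reduces the rate at which an XML document must be re-encoded because of node insertions. Experimental results show that the complete tree and its encoding effectively improve the efficiency of inserting nodes into XML documents.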

8.
An XML document may contain inconsistencies that violate predefined integrity constraints, which causes the data inconsistency problem. In this paper, we consider how to obtain consistent data from an inconsistent XML document. There are two basic concepts for this problem: a repair is data that is consistent with the integrity constraints and differs minimally from the original document, and consistent data is the data common to every possible repair. First, we give a general constraint model for XML that can express the commonly discussed integrity constraints, including functional dependencies, keys and multivalued dependencies. Next, we provide a repair framework for inconsistent XML documents with three basic update operations: node insertion, node deletion and node value modification. Following this approach, we introduce the concept of repair for inconsistent XML documents, discuss the chase method for generating repairs, and prove some important properties of the chase. Finally, we give a method to obtain the greatest lower bound of all possible repairs, which is sufficient for consistent data. We also implement prototypes of our method and evaluate our framework and algorithms experimentally.

9.
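Clustering XML documents based on frequent structures   Total citations: 1, self-citations: 1, citations by others: 0
This paper studies XML document clustering based on frequent structures, where frequent structures include frequent paths and frequent subtrees. It first introduces SSTMiner, an algorithm that mines all embedded frequent subtrees in XML documents, and then modifies SSTMiner to obtain the FrePathMiner and FreTreeMiner algorithms, which mine the maximal frequent paths and the maximal frequent subtrees of XML documents, respectively. On this basis, it proposes an agglomerative hierarchical clustering algorithm, XMLCluster, which clusters documents using either maximal frequent paths or maximal frequent subtrees as the document features. Experimental results show that FrePathMiner and FreTreeMiner both find more frequent structures than the traditional ASPMiner algorithm, which provides more structural features for document clustering and thus yields higher clustering accuracy.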

10.
冯景超  温浩宇 《计算机工程与设计》2007,28(6):1423-1424,1428
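As Extensible Markup Language (XML) is widely used on the Internet, the information security of XML is receiving increasing attention. Based on the characteristics of e-commerce and a study of block cipher modes and algorithms in symmetric cryptography (also called secret-key cryptography), the paper argues that data in XML can be encrypted with the Data Encryption Standard (DES), a relatively mature block cipher algorithm, and gives an implementation process for applying XML digital signatures in e-commerce.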

11.
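Retrieving XML documents with DOM class libraries   Total citations: 1, self-citations: 0, citations by others: 1
The Document Object Model (DOM) is a platform-independent, language-independent standard interface and the foundation for manipulating XML documents. The paper discusses the application prospects and current state of XML and proposes retrieving and parsing XML documents with the DOM class libraries encapsulated in high-level languages, using VB6.0 as an example to perform XML document retrieval and data extraction.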
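
The kind of DOM-based retrieval this abstract describes can be sketched as follows. This is only an illustration in Python with the standard-library `xml.dom.minidom`, not the VB6.0 class library the authors use; the sample document and the `find_texts` helper are hypothetical.

```python
from xml.dom import minidom

# Hypothetical sample document; not taken from the paper.
SAMPLE = """<library>
  <book id="b1"><title>XML Basics</title></book>
  <book id="b2"><title>DOM in Practice</title></book>
</library>"""


def find_texts(xml_text, tag):
    """Return the text content of every element named `tag`."""
    doc = minidom.parseString(xml_text)
    results = []
    for node in doc.getElementsByTagName(tag):
        # Concatenate the direct text-node children of the element.
        parts = [c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE]
        results.append("".join(parts))
    return results


titles = find_texts(SAMPLE, "title")
```

`getElementsByTagName` searches the whole tree in document order, so the caller does not need to know where in the hierarchy the elements sit.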

12.
Browsing the DOM tree of an XML document means following the links among the nodes of the DOM tree to find desired nodes without any prior knowledge to guide the search. When the structure of the XML document is not known to a user, browsing is the basic operation for consulting the contents of the XML document. If the XML document is very large, however, using a general-purpose XML parser to browse its DOM tree and access an arbitrary node may fail for lack of memory to construct the large DOM tree. To alleviate this problem, we suggest a method for browsing the DOM tree of a very large XML document by splitting the document into n small XML documents and sequentially generating the DOM tree of each of those n small documents. For later reference, information about some nodes accessed in DOM trees already generated is also kept, using the concept of virtual nodes. With our approach, the memory needed to browse the DOM tree of a very large XML document is reduced to a level that a personal computer can manage.

13.
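Extensible Markup Language (XML) is gradually becoming the common language of distributed computing. With the wide use of XML, the security of XML data has become a focus of attention. The paper analyzes the XML Encryption specification. On the .NET platform, it combines the C# language with the XML Encryption specification to implement an XML encryption system that can encrypt XML documents, XML elements, and arbitrary binary data. Using public key infrastructure (PKI) technology, it applies XML encryption to electronic official documents to ensure their security.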

14.
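Research and application of the XML Document Object Model   Total citations: 7, self-citations: 0, citations by others: 7
Starting from the basic structure of XML documents, the paper discusses in detail the structural characteristics of DOM trees and node trees and the basic DOM interfaces. Using a product order example, it implements dynamic creation and traversal of an XML document structure tree, and uses the XML DOM interfaces to operate on the document structure tree, among other core applications.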

15.
16.
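Research on converting XML documents to relational databases   Total citations: 1, self-citations: 0, citations by others: 1
As a standard technology for network data exchange, XML is widely used in computer software. Relational databases are currently the mainstream means of storing data, so XML documents and relational databases must be converted into each other. By analyzing the hierarchical structure of XML documents, the paper builds an XML document tree model and defines its nodes. Following the BNF rules of XML, it gives regular expressions for elements and attributes together with the corresponding state transition diagrams, and designs a lexical analyzer that recognizes elements and attributes in order to parse XML documents. It proposes an approach and an algorithm for converting an XML document tree into relational database storage, and presents the resulting relational tables for an example.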
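
As a rough illustration of storing an XML document tree in relational form, the sketch below flattens a tree into one generic node table. It is not the paper's algorithm: it relies on Python's standard `xml.etree.ElementTree` parser rather than a hand-written lexical analyzer, and the `node` table layout and the `xml_to_rows` and `store` helpers are assumptions made for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def xml_to_rows(xml_text):
    """Flatten an XML tree into (id, parent_id, tag, text) rows."""
    root = ET.fromstring(xml_text)
    rows = []
    stack = [(root, None)]
    next_id = 1
    while stack:
        elem, parent_id = stack.pop()
        node_id = next_id
        next_id += 1
        rows.append((node_id, parent_id, elem.tag, (elem.text or "").strip()))
        # Push children in reverse so IDs follow document order.
        for child in reversed(list(elem)):
            stack.append((child, node_id))
    return rows


def store(rows):
    """Load the rows into an in-memory SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, "
        "tag TEXT, text TEXT)"
    )
    conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", rows)
    return conn


rows = xml_to_rows("<order><item>pen</item><item>ink</item></order>")
conn = store(rows)
```

Keeping a `parent_id` column preserves the hierarchy, so the children of any node can be recovered with a query such as `SELECT tag, text FROM node WHERE parent_id = 1 ORDER BY id`.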

17.
Efficient memory representation of XML document trees   Total citations: 1, self-citations: 0, citations by others: 1
Implementations that load XML documents and give access to them via, e.g., the DOM, suffer from huge memory demands: the space needed to load an XML document is usually many times larger than the size of the document. A considerable amount of memory is needed to store the tree structure of the XML document. In this paper, a technique is presented that represents the tree structure of an XML document efficiently. The representation exploits the high regularity in XML documents by compressing their tree structure, that is, by detecting and removing repetitions of tree patterns. Formally, context-free tree grammars that generate only a single tree are used for tree compression. The functionality of basic tree operations, like traversal along edges, is preserved under this compressed representation. This allows queries (and in particular, bulk operations) to be executed directly, without prior decompression. The complexity of certain computational problems, like validation against XML types or testing equality, is investigated for compressed input trees.

18.
XML stream filtering has gained widespread attention from the research community in recent years. There have been many efforts to improve the performance of XML filtering systems by utilizing XML schema information. In this paper, we design and implement an XML stream filtering system, SFilter, which uses DTD or XML Schema information to improve performance. We propose a simplification step and two kinds of optimization, static and dynamic. Simplification and static optimization transform the XPath queries to build automata that serve as an index structure for filtering. Dynamic optimization is performed at runtime, during filtering. We developed five kinds of static optimization and two kinds of dynamic optimization. We present a novel filtering algorithm for the transformed XPath queries and for runtime optimization. The experimental results show that our system filters XML streams efficiently.

19.
Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel–Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower.
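
To make the idea of run-length compression over differential inverted lists concrete, here is a toy sketch. It is not the paper's encoder: the posting list is invented, and `gap_encode`, `run_length_encode` and `decode` are hypothetical helper names. The point is only that consecutive document IDs, which are common when many documents are near-copies, turn into long runs of gap 1 that collapse into a single pair.

```python
def gap_encode(postings):
    """Replace each sorted document ID with its difference from the previous one."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def run_length_encode(values):
    """Collapse runs of equal values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def decode(runs):
    """Rebuild the original posting list from run-length-encoded gaps."""
    postings, current = [], 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


# Invented posting list: a term occurring in two runs of adjacent versions.
postings = [3, 4, 5, 6, 7, 20, 21, 22]
gaps = gap_encode(postings)
runs = run_length_encode(gaps)
```

Here eight postings shrink to four pairs; a real index would additionally encode the pairs with a compact integer code, which this sketch leaves out.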

20.
Three factors are involved in text classification: the classification model, the similarity measure and the document representation model. In this paper, we focus on document representation and demonstrate that the choice of document representation has a profound impact on the quality of the classifier. In our experiments, we use the centroid-based text classifier, which is a simple and robust text classification scheme. We compare four different types of document representation: N-grams, single terms, phrases and RDR, which is a logic-based document representation. The N-gram representation is a string-based representation with no linguistic processing. The single-term approach is based on words with minimal linguistic processing. The phrase approach is based on linguistically formed phrases and single words. RDR is based on linguistic processing and represents documents as a set of logical predicates. We have experimented with many text collections and obtained similar results. Here, we base our arguments on experiments conducted on Reuters-21578. We show that RDR, the more complex representation, produces a more effective classifier on Reuters-21578, followed by the phrase approach.
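
A centroid-based classifier over a character N-gram representation can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' system: it uses raw trigram counts with cosine similarity, the four training snippets are invented (they are not from Reuters-21578), and all function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams; no linguistic processing."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def normalize(vec):
    """Scale a sparse vector to unit length (empty if it has no weight)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * b.get(k, 0.0) for k, w in a.items())


def train(labeled_docs, n=3):
    """Build one normalized centroid per class from unit document vectors."""
    sums = {}
    for text, label in labeled_docs:
        acc = sums.setdefault(label, Counter())
        for k, w in normalize(char_ngrams(text, n)).items():
            acc[k] += w
    return {label: normalize(acc) for label, acc in sums.items()}


def classify(centroids, text, n=3):
    """Assign the class whose centroid is most similar to the document."""
    vec = normalize(char_ngrams(text, n))
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Invented training snippets for illustration only.
TRAIN = [
    ("wheat harvest and grain exports", "grain"),
    ("grain shipments and wheat prices", "grain"),
    ("crude oil prices and oil output", "oil"),
    ("oil barrels and crude refinery", "oil"),
]
centroids = train(TRAIN)
label = classify(centroids, "wheat and grain prices rise")
```

Swapping `char_ngrams` for a word-level or phrase-level feature extractor changes only the representation while the classifier stays fixed, which mirrors the comparison the abstract describes.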
