Found 20 similar documents (search time: 0 ms)
1.
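An OEM-Based Schema Description Method for XML Semistructured Data  Total citations: 3, self-citations: 1, cited by others: 3
Types and schemas for semistructured data are key techniques for improving the efficiency of semistructured data processing. This paper first discusses the characteristics of semistructured data and of its schemas, then studies XML-based forms of schema description, and proposes a definition and formal description method for XML DTD schemas based on OEM.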
2.
Many modern applications (e-commerce, digital libraries, etc.) require integrated access to various information sources (from traditional RDBMSs to semistructured Web repositories). Extracting schema from semistructured data is a prerequisite for integrating heterogeneous information sources. The traditional method, which extracts a global schema, may require time (and space) that increases exponentially with the number of objects and edges in the source. This paper presents a new method that extracts local schema. The algorithm keeps the scale of the extracted schema within the "schema diameter" by examining the semantic distance of the target set and using the Hash class and its path distance operation. This method is very efficient at restraining the schema from expanding. A prototype validates the new approach.
3.
Extracting Web Data Using Instance-Based Learning  Total citations: 1, self-citations: 0, cited by others: 1
This paper studies structured data extraction from Web pages. Existing approaches to data extraction include wrapper induction
and automated methods. In this paper, we propose an instance-based learning method, which performs extraction by comparing
each new instance to be extracted with labeled instances. The key advantage of our method is that it does not require an initial
set of labeled pages to learn extraction rules as in wrapper induction. Instead, the algorithm is able to start extraction
from a single labeled instance. Only when a new instance cannot be extracted does it need labeling. This avoids unnecessary
page labeling, which solves a major problem with inductive learning (or wrapper induction), i.e., the set of labeled instances
may not be representative of all other instances. The instance-based approach is very natural because structured data on the
Web usually follow some fixed templates. Pages of the same template usually can be extracted based on a single page instance
of the template. A novel technique is proposed to match a new instance with a manually labeled instance and in the process
to extract the required data items from the new instance. The technique is also very efficient. Experimental results based
on 1,200 pages from 24 diverse Web sites demonstrate the effectiveness of the method. It also outperforms the state-of-the-art
existing systems significantly.
4.
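A Schema-Based XML Storage Model  Total citations: 4, self-citations: 0, cited by others: 4
XML is based on a semistructured data model, and semistructured data is difficult to store and manage uniformly. This paper proposes SBSM, a schema-based XML storage model, and defines how the relevant operations are carried out on it. The model overcomes the limitations of the object-relational mapping model and supports queries directly on the model.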
5.
6.
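Web usage mining automatically discovers how users access the Web by analyzing usage data recorded on Web servers. Its results can be used to improve website design, support business decisions, provide personalized services, and so on. Sequential pattern analysis is one form of pattern analysis used in data mining. This paper introduces the application of sequential pattern analysis under complex constraints to Web usage mining, along with its general steps.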
7.
Daiyue Weng Jun Hong David A. Bell 《International Journal of Software and Informatics》2012,6(3):453-472
A rapidly increasing number of Web databases have now become accessible via their HTML form-based query interfaces. Query result pages, which encode structured data and are displayed for human use, are dynamically generated in response to user queries. Query result pages usually contain other types of information in addition to query results, e.g., advertisements, navigation bars, etc. The problem of extracting structured data from query result pages is critical for web data integration applications, such as comparison shopping and meta-search engines, and has been intensively studied. A number of approaches have been proposed. As the structures of Web pages become more and more complex, the existing approaches start to fail, and most of them do not remove irrelevant contents which may affect the accuracy of data record extraction. We propose an automated approach for Web data extraction. First, it makes use of visual features and query terms to identify data sections and extracts data records in these sections. We also represent several content and visual features of visual blocks in a data section, and use them to filter out noisy blocks. Second, it measures similarity between data items in different data records based on their visual and content features, and aligns them into different groups so that the data in the same group have the same semantics. The results of our experiments with a large set of Web query result pages in different domains show that our proposed approach is highly effective.
8.
9.
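Research on the deep Web, in China and abroad, has so far focused almost entirely on English-language settings, and no work has addressed the Chinese deep Web. This paper proposes a method for schema matching and interface integration on the Chinese deep Web. The method first creates a dictionary that stores synonyms, hypernyms, and hyponyms, then uses a rule-based word segmentation algorithm to split the attributes extracted from the interfaces into words. For each attribute, it finds all corresponding synonyms, hypernyms, and hyponyms in the dictionary, generates a record, and stores it in a list. From each record, the attribute that occurs most often is then selected as an attribute of the unified interface.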
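The final selection step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `SYNONYMS` dictionary, the `unified_attributes` function, and the sample interfaces are all hypothetical, and the sketch skips word segmentation and hypernym/hyponym handling entirely.

```python
from collections import Counter

# Hypothetical synonym dictionary: each attribute label maps to a concept key.
SYNONYMS = {
    "书名": "title", "题名": "title", "标题": "title",
    "作者": "author", "著者": "author",
}

def unified_attributes(interfaces, synonyms):
    """Group attribute labels by concept, then keep the most frequent label per group."""
    groups = {}
    for attrs in interfaces:
        for label in attrs:
            # Labels missing from the dictionary form their own group.
            concept = synonyms.get(label, label)
            groups.setdefault(concept, Counter())[label] += 1
    return {concept: counts.most_common(1)[0][0]
            for concept, counts in groups.items()}

# Three hypothetical query interfaces, each given as a list of attribute labels.
interfaces = [["书名", "作者"], ["书名", "著者"], ["题名", "作者"]]
unified = unified_attributes(interfaces, SYNONYMS)
```

Here each `Counter` plays the role of one "record" of equivalent attributes, and the label with the highest count becomes the attribute shown on the unified interface.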
10.
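Design and Implementation of a Topic-Oriented Web Information Collection System  Total citations: 7, self-citations: 0, cited by others: 7
As information on the Internet continues to grow explosively, both the coverage and the retrieval precision of general-purpose search engines keep declining, and specialized retrieval tools for topic-specific information have become a trend. The topic-oriented Web information collection system proposed in this paper is the core component of such tools. The system computes document relevance with the document vector model and filters pages using the context of page links; it adapts the Shark heuristic search algorithm to find relevant pages; it can download in parallel on multiple machines to improve collection efficiency; and it updates dynamically according to the importance of each site. The system was implemented in an Internet search engine for computer teaching resources. The whole system runs on low-end desktop machines and achieves high precision and efficiency in collecting pages on the specified topic.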
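Relevance scoring with a document vector model is commonly done with cosine similarity over term-frequency vectors. The sketch below shows that general technique under simplifying assumptions; it is not the system described in the abstract. The function names, the whitespace tokenization, and the `threshold` value are all illustrative choices.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def is_relevant(page_text, topic_text, threshold=0.3):
    """Keep a page when it is close enough to the topic description (threshold is assumed)."""
    return cosine_similarity(page_text, topic_text) >= threshold
```

A crawler could call `is_relevant` on each fetched page and follow links only from pages that pass. Chinese text would need a word segmenter in place of `split()`.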
11.
This paper first studies the distribution characteristics of user behavior based on log data from a massive web search engine. The analysis shows that the stochastic distribution of user queries follows a power-law function and exhibits strong similarity, and that users' queries and clicked URLs show dramatic locality, which implies that a query cache and a 'hot click' cache can be employed to improve system performance. Three typical cache replacement policies are then compared: LRU, FIFO, and LFU with attenuation. In addition, the distribution characteristics of web information are analyzed, demonstrating that the link popularity and replica popularity of a URL have a positive influence on its importance. Finally, the variance between link popularity and user popularity, and between replica popularity and user popularity, is analyzed, which gives important insight for improving the ranking algorithms in a search engine.
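Of the replacement policies named in this abstract, LRU is the simplest to illustrate. The sketch below is a generic LRU query cache built on the standard library, not the implementation evaluated in the paper; the class name `LRUQueryCache` and its methods are hypothetical.

```python
from collections import OrderedDict

class LRUQueryCache:
    """Query-result cache that evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, newest last

    def get(self, query):
        if query not in self._entries:
            return None  # cache miss
        # A hit makes the entry the most recently used one.
        self._entries.move_to_end(query)
        return self._entries[query]

    def put(self, query, results):
        self._entries[query] = results
        self._entries.move_to_end(query)
        if len(self._entries) > self.capacity:
            # Drop the entry at the front, i.e. the least recently used.
            self._entries.popitem(last=False)
```

Because query logs show strong locality, a small cache of this kind can serve a large share of repeated queries. FIFO would differ only in skipping the `move_to_end` call inside `get`.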
12.
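Pay-As-You-Go Schema Integration for Web Dataspaces  Total citations: 1, self-citations: 0, cited by others: 1
Schema integration for Web dataspaces is implemented with a pay-as-you-go approach. A framework for Web dataspace schema integration is proposed, in which a mediated schema is created by a combination method and the top-k source schemas are provided to the user. Experiments show that user participation improves the precision and recall of queries, and that providing the top-k source schemas noticeably improves query effectiveness.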
13.
XML has become the standard for publishing and exchanging data on the Web. However, most business data is managed, and will
continue to be managed, by relational database management systems. As such, there is an increasing need to efficiently and accurately
publish relational data as XML documents for Internet-based applications. One way to publish relational data is to provide
virtual XML documents for relational data via an XML schema which is transformed from the underlying relational database schema
such that users can access the relational database through the XML schema. In this paper, we discuss issues in transforming
a relational database schema into the corresponding XML schema. We aim to preserve all integrity constraints defined in a
relational database schema, to achieve high level of nesting and to avoid introducing data redundancy in the transformed XML
schema. In the paper, we first propose a basic transformation algorithm which introduces no data redundancy, then we improve
the algorithm by exploring further nesting of the transformed XML schema.
14.
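Schema matching produces a mapping between elements of the input schemas that correspond semantically. To improve the efficiency of schema matching, a new method is proposed: the source-schema-splitting matching algorithm. It addresses problems that standard schema matching handles poorly: 1) matching one attribute of the source schema to multiple attributes of multiple target schemas; 2) matching where different tuples in one table correspond to different attribute values of the same tuple in another table. During matching, the method first searches for categorical attributes, then builds selection conditions from them, and finally splits the source schema into views and regenerates the candidate match set, thereby improving the quality of schema matching.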
15.
Semistructured data are specified without any fixed and rigid schema, even though some implicit structure typically appears in the data. The huge number of online applications makes it important and imperative to mine the schema of semistructured data, both for users (e.g., to gather useful information and facilitate querying) and for systems (e.g., to optimize access). The critical problem is to discover the hidden structure in the semistructured data. Current methods for extracting Web data structure are either general and independent of application background, or bound to some concrete environment such as HTML or XML. Both are expensive and have difficulty keeping up with the frequent and complicated changes in Web data. This paper discusses the problem of incrementally mining the schema of semistructured data after the raw data are updated. An algorithm for incrementally mining the schema of semistructured data is provided, along with some experimental results, which show that incremental mining of semistructured data is more efficient than non-incremental mining.
16.
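A Survey of Web Data Management Research  Total citations: 53, self-citations: 1, cited by others: 53
Meng Xiaofeng 《计算机研究与发展》(Journal of Computer Research and Development) 2001,38(4):385-395
This paper surveys Web data management techniques, gives a definition of Web data management research, and discusses several important problems in the area. On this basis, it proposes a framework for an XML-based Web data management system and identifies open research problems.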
17.
18.
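As an international standard for data exchange, XML is widely used in many fields, and accurately converting between XML and relational databases has become an important research topic. This paper studies the mapping from XML Schema to relational schemas. It analyzes the structure and syntax of XML Schema, extracts the semantic information of elements, identifies the elements that are truly of complex type, and converts the XML Schema into an E_Schema, a representation that is simple yet complete in information. The E_Schema is then processed using a query-cost-based approach to obtain an optimal schema, which is converted into a relational schema that preserves the original hierarchy, yielding a simpler and more complete mapping scheme.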
19.
20.
Shen Yidong 《计算机科学技术学报》(Journal of Computer Science and Technology) 1998,13(4):289-299
While the schema-less feature of the OEM (Object Exchange Model) gives flexibility in representing semi-structured data, it brings difficulty in formulating database queries. Extracting schema from an OEM database then becomes an important research topic. This paper presents a new approach to this topic with the following features. (1) In addition to representing the nested label structure of an OEM database, the proposed OEM schema keeps up-to-date information about instance objects of the database. The object-level information is useful in speeding up query evaluation. (2) The OEM schema is explicitly represented as a label-set, which is easy to construct and update. (3) The OEM schema of a database is statically built and dynamically updated. The time complexity of building the OEM schema is linear in the size of the OEM database. (4) The approach is applicable to a wide range of areas where the underlying schema is much smaller than the database itself (e.g., data warehouses that are made from a set of heterogeneous databases).