Similar literature
 20 similar documents found
1.
Discovering structural association of semistructured data   Total citations: 4, self-citations: 0, citations by others: 4
Many semistructured objects are similarly, though not identically, structured. We study the problem of discovering “typical” substructures of a collection of semistructured objects. The discovered structures can serve the following purposes: 1) the “table-of-contents” for gaining general information of a source, 2) a road map for browsing and querying information sources, 3) a basis for clustering documents, 4) partial schemas for providing standard database access methods, and 5) user/customer interests and browsing patterns. The discovery task is affected by structural features of semistructured data in a nontrivial way, and traditional data mining frameworks are inapplicable. We define this discovery problem and propose a solution.

2.
Web data-extraction systems in use today mainly focus on the generation of extraction rules, i.e., wrapper induction. Thus, they appear ad hoc and are difficult to integrate when a holistic view is taken. Each phase in the data-extraction process is disconnected and does not share a common foundation to make the building of a complete system straightforward. In this paper, we demonstrate a holistic approach to Web data extraction. The principal component of our proposal is the notion of a document schema. Document schemata are patterns of structures embedded in documents. Once the document schemata are obtained, the various phases (e.g. training set preparation, wrapper induction and document classification) can be easily integrated. The implication of this is improved efficiency and better control over the extraction procedure. Our experimental results confirmed this. More importantly, because a document can be represented as a vector of schemata, it can be easily incorporated into existing systems as the fabric for integration.

3.
Abstract. This paper presents structural recursion as the basis of the syntax and semantics of query languages for semistructured data and XML. We describe a simple and powerful query language based on pattern matching and show that it can be expressed using structural recursion, which is introduced as a top-down, recursive function, similar to the way XSL is defined on XML trees. On cyclic data, structural recursion can be defined in two equivalent ways: as a recursive function which evaluates the data top-down and remembers all its calls to avoid infinite loops, or as a bulk evaluation which processes the entire data in parallel using only traditional relational algebra operators. The latter makes it possible for optimization techniques in relational queries to be applied to structural recursion. We show that the composition of two structural recursion queries can be expressed as a single such query, and this is used as the basis of an optimization method for mediator systems. Several other formal properties are established: structural recursion can be expressed in first-order logic extended with transitive closure; its data complexity is PTIME; and over relational data it is a conservative extension of the relational calculus. The underlying data model is based on value equality, formally defined with bisimulation. Structural recursion is shown to be invariant with respect to value equality. Received: July 9, 1999 / Accepted: December 24, 1999

4.
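This paper analyzes project data backup in software project management systems and proposes SDB-Method, a project backup method based on semistructured data. By analyzing the system's data model, the method establishes a mapping between the relational data model and the semistructured data model OEM (Object Exchange Model), enabling conversion between relational and semistructured data in both directions and thereby solving the problem of project import and export. The method has been applied in the project management system SoftPM, where it supports multi-branch development, iterative development, and migration of software projects, and effectively solves the project backup problem of software project management systems.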

5.
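Clustering is one of the important techniques in data mining; it groups data according to similarity. Clustering categorical data, however, is an important yet difficult problem for learning algorithms. The traditional k-modes algorithm defines the dissimilarity between two attribute values by simple 0-1 matching and does not take the distribution of the whole data set into account, so the dissimilarity measure is not accurate enough. To address this problem, a k-modes algorithm based on structural similarity is proposed. The algorithm considers not only whether the attribute values themselves are the same or different, but also the structure they occupy under the other attributes. Simulation experiments on cluster identification and accuracy show that the structural-similarity-based k-modes algorithm is more effective in terms of scalability and accuracy.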
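
The simple 0-1 matching dissimilarity of traditional k-modes, which the abstract above takes as its baseline, can be sketched as follows. This is a minimal illustration only: the function names and the tuple representation of records are assumptions made here, and the structural-similarity variant proposed in the paper is not reproduced because the abstract does not define it.

```python
def simple_matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ (0-1 matching)."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    # Each attribute contributes 0 if the values match and 1 otherwise.
    return sum(1 for x, y in zip(a, b) if x != y)


def nearest_mode(record, modes):
    """Return the index of the mode with the smallest 0-1 dissimilarity to record."""
    if not modes:
        raise ValueError("at least one mode is required")
    return min(
        range(len(modes)),
        key=lambda i: simple_matching_dissimilarity(record, modes[i]),
    )
```

Because every mismatch counts as exactly 1 regardless of how the values are distributed in the data set, two different mismatches are indistinguishable under this measure, which is the weakness the abstract points to.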

6.
Many modern applications (e-commerce, digital libraries, etc.) require integrated access to various information sources (from traditional RDBMSs to semistructured Web repositories). Extracting schema from semistructured data is a prerequisite to integrating heterogeneous information sources. The traditional method, which extracts a global schema, may require time (and space) that increases exponentially with the number of objects and edges in the source. This paper presents a new method that extracts a local schema. In this method, the algorithm keeps the scale of the extracted schema within the “schema diameter” by examining the semantic distance of the target set and using the Hash class and its path distance operation. This method is very efficient at restraining the schema from expanding. The prototype validates the new approach.

7.
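To reduce the number of iterations and the running time of the traditional RANSAC (Random Sample Consensus) algorithm and to improve its speed and accuracy, an improved RANSAC algorithm based on structural similarity is proposed. The BRISK (Binary Robust Invariant Scalable Keypoints) algorithm is used to extract and describe binary feature points, and Hamming distance is used for feature matching to obtain an initial set of matched points. A structural similarity constraint is then used to eliminate mismatched points, giving a new set of matched points, which is used as the input to RANSAC to compute the transformation matrix. Because the algorithm purifies the matched points after the initial matching, it can obtain the transformation model quickly. Experiments show that its number of iterations and running time are markedly lower than those of traditional RANSAC, so the improved algorithm outperforms traditional RANSAC in speed and accuracy.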
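
The initial matching step mentioned above, Hamming distance between binary descriptors, can be sketched in plain Python. This is an illustrative sketch, not the paper's implementation: descriptors are assumed to be equal-length `bytes` objects, and `match_descriptors` and its `max_distance` threshold are hypothetical names introduced here. BRISK extraction and the structural similarity filter are not shown.

```python
def hamming_distance(d1: bytes, d2: bytes) -> int:
    """Number of differing bits between two equal-length binary descriptors."""
    if len(d1) != len(d2):
        raise ValueError("descriptors must have the same length")
    # XOR leaves a 1 bit wherever the descriptors disagree; count those bits.
    return sum(bin(x ^ y).count("1") for x, y in zip(d1, d2))


def match_descriptors(query, train, max_distance):
    """Brute-force nearest-neighbour matching under Hamming distance.

    Returns (query_index, train_index, distance) for every query descriptor
    whose nearest train descriptor lies within max_distance bits.
    """
    matches = []
    if not train:
        return matches
    for qi, q in enumerate(query):
        best_j, best_d = min(
            ((j, hamming_distance(q, t)) for j, t in enumerate(train)),
            key=lambda pair: pair[1],
        )
        if best_d <= max_distance:
            matches.append((qi, best_j, best_d))
    return matches
```

The resulting match list corresponds to the initial matched point set; in the pipeline described above it would still be filtered before being passed to RANSAC.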

8.
9.
Measuring the structural similarity between an XML document and a DTD has many relevant applications that range from document classification and approximate structural queries on XML documents to selective dissemination of XML documents and document protection. The problem is harder than measuring structural similarity among documents, because a DTD can be considered as a generator of documents. Thus, the problem is to evaluate the similarity between a document and a set of documents. An effective structural similarity measure should face different requirements that range from considering the presence and absence of required elements, as well as the structure and level of the missing and extra elements to vocabulary discrepancies due to the use of synonymous or syntactically similar tags. In the paper, starting from these requirements, we provide a definition of the measure and present an algorithm for matching a document against a DTD to obtain their structural similarity. Finally, experimental results to assess the effectiveness of the approach are presented.

10.
In this paper we propose a graph-based generic model able to uniformly represent semistructured data and their temporal aspects. In particular, we start from a generic and expressive model proposed in the database literature and consider in a formal and systematic way both valid time and transaction time, together with the set of temporal constraints needed to correctly manage the semantics of the represented time dimension. We then propose operations, which allow the incremental management of the proposed model satisfying the introduced temporal constraints. Moreover, we also deal with the possibility of managing together the two classical time dimensions of valid and transaction times, and formalize the set of constraints needed to correctly handle these temporal aspects together. Some examples taken from a medical scenario will be used to describe the introduced concepts.

11.
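Based on the principle that the human visual system is highly correlated with the structural information extracted from the visual field, a fast motion estimation algorithm based on structural similarity (FMEBS) is proposed. To address the shortcomings of the H.264 rate-distortion optimization algorithm, the algorithm introduces an image quality metric based on structural similarity, revises the representation of distortion, and adopts a fast mode selection algorithm and an effective search template. Experiments show that, at similar reconstructed image quality, FMEBS saves about 2.7% of the bit rate and 91.2% of the motion estimation time compared with the full search algorithm, and about 1.9% of the bit rate and 35.6% of the time compared with the UMHexagonS algorithm.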

12.
Neural Computing and Applications - Transfer learning focuses on building better predictive models by exploiting knowledge gained in previous related tasks, being able to soften the traditional...

13.
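Graph aggregation represents a large-scale graph with a concise small-scale graph while preserving the structure and attribute information of the original graph. Existing algorithms do not consider node attribute information and edge weight information at the same time, so the aggregated graph differs considerably from the original. Therefore, a graph aggregation algorithm that considers both node attribute information and edge weight information is proposed, so that the aggregated graph preserves both node attribute similarity and edge weight information. The algorithm first defines closed-neighborhood structural similarity and uses a pruning strategy to compute the structural similarity between nodes; it then uses the MinHash technique to compute the attribute similarity between nodes and adjusts the relative proportions of structural similarity and attribute similarity; finally, it aggregates the weighted graph according to the two similarities. Experiments show that the algorithm is feasible and effective.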
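
The MinHash step can be illustrated with a small sketch that estimates the Jaccard similarity between two nodes' attribute sets. The details are assumptions made for illustration, not taken from the paper: attributes are modelled as sets of strings, the hash family is built from seeded SHA-1 digests, and `num_hashes` is a hypothetical parameter.

```python
import hashlib


def _seeded_hash(seed, item):
    """Deterministic 64-bit hash of item under the hash function numbered seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=128):
    """MinHash signature: the minimum hash value of the set under each hash function."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_seeded_hash(seed, it) for it in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

For example, `estimated_jaccard(minhash_signature({"db", "xml"}), minhash_signature({"db", "web"}))` approximates the true Jaccard similarity of 1/3, with the error shrinking as `num_hashes` grows. How this attribute similarity is weighted against structural similarity is specific to the paper and is not shown.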

14.
Querying time series data based on similarity   Total citations: 3, self-citations: 0, citations by others: 3
We study similarity queries for time series data where similarity is defined, in a fairly general way, in terms of a distance function and a set of affine transformations on the Fourier series representation of a sequence. We identify a safe set of transformations supporting a wide variety of comparisons and show that this set is rich enough to formulate operations such as moving average and time scaling. We also show that queries expressed using safe transformations can efficiently be computed without prior knowledge of the transformations. We present a query processing algorithm that uses the underlying multidimensional index built over the data set to efficiently answer similarity queries. Our experiments show that the performance of this algorithm is competitive with that of processing ordinary (exact match) queries using the index, and much faster than sequential scanning. We propose a generalization of this algorithm for handling multiple transformations simultaneously, and give experimental results on the performance of the generalized algorithm.

15.
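A data completion algorithm for incomplete information systems based on probabilistic similarity   Total citations: 2, self-citations: 1, citations by others: 1
When the decision attributes are known and the distribution of condition attribute values is uncertain, filling in missing data using the principle of probabilistic similarity and partitioning the system by decision attribute allows the completion of an incomplete decision information system to achieve high credibility.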

16.
Collaborative filtering is one of the most popular recommendation techniques; it provides personalised recommendations based on users’ tastes. In spite of its huge success, it suffers from a range of problems, the most fundamental being that of data sparsity. Sparsity in ratings leads to inaccurate neighbourhood formation, thereby resulting in poor recommendations. To address this issue, in this article, we propose a novel collaborative filtering approach based on information-theoretic co-clustering. The proposed approach computes two types of similarities, cluster preference and rating, and combines them. Based on the combined similarity, the user-based and item-based approaches are adopted, respectively, to obtain individual predictions for an unknown target rating. Finally, the proposed approach fuses these resultant predictions. Experimental results show that the proposed approach is superior to existing alternatives.

17.
The large volume and nature of data available to casual users and programs motivate the increasing interest of the database community in studying flexible and efficient techniques for extracting and querying semistructured data. On the other hand, efficient methods have been discovered for solving the so-called model-checking problem for some modal logics. The aim of this paper is to show how some of these methods can be used for querying semistructured data. To do so, we show that semistructured data can be naturally seen as Kripke Transition Systems. To keep the presentation independent of a specific language, we introduce a graphical query language that includes some of the features of the query languages based on graphs and patterns. We show how to associate CTL formulas with queries of this language. This allows us to see the problem of solving a query as an instance of the model-checking problem for CTL, which can be solved in polynomial time. We have tested the method by using a model-checker, and have studied the applicability of the method to some existing languages for semistructured databases.

18.
19.
Until now, most reversible data hiding (RDH) techniques have been evaluated by peak signal-to-noise ratio (PSNR), which is based on mean squared error (MSE). Unfortunately, MSE turns out to be an extremely poor measure when the purpose is to predict perceived signal fidelity or quality. The structural similarity (SSIM) index has gained widespread popularity as an alternative motivating principle for the design of image quality measures. How to exploit the characteristics of SSIM in designing an RDH algorithm is a critical question. In this paper, we propose an optimal RDH algorithm under a structural similarity constraint. First, we derive the metric for the structural similarity constraint and prove that it does not hold the non-crossing-edges property. Second, we construct the rate-distortion function under the optimal structural similarity constraint, which is equivalent to minimizing the average distortion for a given embedding rate, and then obtain the optimal transition probability matrix under the structural similarity constraint. Compared with previous RDH methods, our method improves SSIM by about 1.89% on average. Experiments show that the proposed method outperforms the state of the art in SSIM.
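
For reference, the MSE-based PSNR that the abstract above contrasts with SSIM can be computed as in the following sketch. It assumes images are given as flat sequences of pixel intensities and that the peak value is 255 (8-bit images); both are assumptions made for illustration, and SSIM itself is not implemented here.

```python
import math


def mse(original, modified):
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(original) != len(modified) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(original, modified)) / len(original)


def psnr(original, modified, max_value=255):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    err = mse(original, modified)
    if err == 0:
        return math.inf  # identical signals: no distortion at all
    return 10 * math.log10(max_value ** 2 / err)
```

Because PSNR depends only on the average squared pixel difference, two distortions with the same MSE receive the same score even when they look very different, which is the limitation that motivates a structural measure.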

20.
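Sun Guibin, Zhou Yong. Journal of Computer Applications (《计算机应用》), 2015, 35(3): 633-637
Community structure is widespread in complex networks, and community detection has important theoretical significance and practical value. To improve the performance of community detection in complex networks, a community detection algorithm based on affinity propagation with structural similarity is proposed. First, structural similarity is chosen as the similarity measure between nodes, and an optimized method is used to compute the similarity matrix of the complex network. Second, the computed similarity matrix is used as input to the Fast Affinity Propagation (FAP) algorithm for clustering. Finally, the final community structure is obtained. Experimental results show that, on LFR (Lancichinetti-Fortunato-Radicchi) benchmark networks, the proposed algorithm achieves an average normalized mutual information (NMI) of 65.1%, higher than the 45.3% of the label propagation algorithm (LPA) and the 49.8% of the CNM (Clauset-Newman-Moore) algorithm; on real networks, it achieves an average modularity of 53.1%, higher than the 39.9% of LPA and the 47.8% of CNM. It therefore has better community detection ability and can discover higher-quality community structures.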
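
The abstract does not state which formula the authors use for structural similarity between nodes. One common definition, used in SCAN-style graph clustering, is the size of the overlap of the two nodes' closed neighbourhoods divided by the geometric mean of their sizes. The sketch below implements that definition as an assumption; the adjacency-dictionary representation and function names are likewise illustrative.

```python
import math


def closed_neighborhood(adj, v):
    """Neighbours of v together with v itself."""
    return set(adj[v]) | {v}


def structural_similarity(adj, u, v):
    """Shared closed neighbourhood, normalised by the geometric mean of the sizes."""
    nu = closed_neighborhood(adj, u)
    nv = closed_neighborhood(adj, v)
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))


def similarity_matrix(adj):
    """Pairwise structural similarity for all nodes, in sorted node order."""
    nodes = sorted(adj)
    return [[structural_similarity(adj, u, v) for v in nodes] for u in nodes]
```

A matrix like the one returned by `similarity_matrix` is the kind of input an affinity propagation step would consume; the optimized computation and the FAP clustering described in the abstract are not reproduced here.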
