Similar Documents
20 similar documents found.
1.
2.
Fast detection of XML structural similarity
Because of the widespread diffusion of semistructured data in XML format, much research effort is currently devoted to supporting the storage and retrieval of large collections of such documents. XML documents can be compared by structural similarity in order to group them into clusters, so that different storage, retrieval, and processing techniques can be exploited effectively. In this scenario, an efficient and effective similarity function is the key to a successful data management process. We present an approach for detecting structural similarity between XML documents that differs significantly from standard methods based on graph-matching algorithms and allows a substantial reduction of the required computation costs. Our proposal consists of linearizing the structure of each XML document by representing it as a numerical sequence and then comparing such sequences through the analysis of their frequencies. First, some basic strategies for encoding a document are proposed, which can focus on diverse structural facets. Then, the theory of the discrete Fourier transform is exploited to compare the encoded documents (i.e., signals) effectively and efficiently in the frequency domain. Experimental results demonstrate the effectiveness of the approach, also in comparison with standard methods.
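The paper's encoding schemes are only summarized above, so the sketch below is a minimal, assumption-laden illustration of the overall pipeline rather than the authors' method: each document is linearized by mapping tags to numbers (with an invented depth weighting), and two documents are compared through the magnitudes of their discrete Fourier transforms.

```python
# Sketch of frequency-domain structural comparison for XML documents.
# The actual encoding in the paper may differ; the tag numbering and the
# depth weighting below are illustrative assumptions.
import numpy as np
import xml.etree.ElementTree as ET

def encode(doc_xml, tag_ids):
    """Linearize an XML document into a numeric sequence (one value per element)."""
    root = ET.fromstring(doc_xml)
    seq = []
    def visit(node, depth):
        tag_ids.setdefault(node.tag, len(tag_ids) + 1)
        seq.append(tag_ids[node.tag] * (depth + 1))  # assumed depth weighting
        for child in node:
            visit(child, depth + 1)
    visit(root, 0)
    return np.array(seq, dtype=float)

def structural_distance(xml_a, xml_b):
    """Compare two encoded documents via the magnitudes of their DFTs."""
    tag_ids = {}
    a, b = encode(xml_a, tag_ids), encode(xml_b, tag_ids)
    n = max(len(a), len(b))                  # zero-pad to a common length
    fa = np.abs(np.fft.fft(a, n))
    fb = np.abs(np.fft.fft(b, n))
    return np.linalg.norm(fa - fb) / n

print(structural_distance("<a><b/><c/></a>", "<a><b/><b/><c/></a>"))
```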

3.
The gap between storing data in relational databases and transferring data in the form of XML has been closed, e.g., by SQL/XML queries that generate XML data from relational data sources. However, only a few relational database systems support the evaluation of SQL/XML queries, and even in those systems the evaluation of such queries is quite slow compared to the evaluation of plain SQL queries. In this paper, we present S2CX, an approach that efficiently evaluates SQL/XML queries on any relational database system, whether or not it supports SQL/XML. As the result of an SQL/XML query, S2CX supports different output formats ranging from plain XML to various compressed XML representations, including a succinct encoding of XML data, schema-aware compressed XML, and grammar-compressed XML. In many cases, S2CX produces compressed XML as the result of an SQL/XML query even faster than the evaluation of SQL/XML queries into non-compressed XML as provided by Oracle 11g and DB2. Furthermore, our approach to query evaluation scales better: the larger the dataset, the faster our approach is compared to SQL/XML query evaluation in Oracle 11g and DB2.
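As background, the hedged snippet below shows the kind of SQL/XML query S2CX is meant to evaluate: standard publishing functions (XMLELEMENT, XMLAGG, XMLFOREST) that build nested XML from relational rows. The table and column names are invented for illustration, and the exact dialect accepted by a given system may differ in details.

```python
# Hypothetical SQL/XML query of the kind S2CX evaluates: it publishes
# relational rows as nested XML using the standard XMLELEMENT/XMLAGG/XMLFOREST
# publishing functions. Table and column names are invented for illustration.
SQLXML_QUERY = """
SELECT XMLELEMENT(NAME "department",
         XMLELEMENT(NAME "name", d.dname),
         XMLAGG(XMLELEMENT(NAME "employee",
                  XMLFOREST(e.ename AS "name", e.sal AS "salary"))))
FROM dept d JOIN emp e ON e.deptno = d.deptno
GROUP BY d.dname
"""

# With a DB-API driver (the `db` connection below is hypothetical), the query
# would be evaluated like any other SQL statement:
# cur = db.cursor(); cur.execute(SQLXML_QUERY); xml_rows = cur.fetchall()
```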

4.
Traditional High-Performance Computing (HPC) based big-data applications are usually constrained by having to move large amounts of data to compute facilities for real-time processing. Modern HPC systems, represented by High-Throughput Computing (HTC) and Many-Task Computing (MTC) platforms, on the other hand, aim to achieve the long-held dream of moving compute to data instead. This kind of data-aware scheduling, typically represented by Hadoop MapReduce, has been successfully implemented in the Map phase, whereby each Map task is sent to the compute node where the corresponding input data chunk is located. However, Hadoop MapReduce limits itself to a one-map-to-one-reduce framework, which makes it difficult to handle complex logic such as pipelines or workflows. Meanwhile, it lacks built-in support and optimization when the input datasets are shared among multiple applications and/or jobs. Performance can be improved significantly when knowledge of the shared and frequently accessed data is taken into scheduling decisions. To enhance the capability of managing workflows on modern HPC systems, this paper presents CloudFlow, a Hadoop MapReduce based programming model for cloud workflow applications. CloudFlow is built on top of MapReduce and is not only data-aware but also shared-data-aware. It identifies the most frequently shared data, at both the task level and the job level, and replicates them to each compute node for data locality. It also supports user-defined multiple Map and Reduce functions, allowing users to orchestrate the required data-flow logic. We prove the correctness of the whole scheduling framework through theoretical analysis. Furthermore, experimental evaluation shows that execution achieves a speedup of more than 4X over a traditional MapReduce implementation, with a manageable time overhead.
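CloudFlow's actual API is not shown in the abstract; the toy Python sketch below only illustrates the underlying ideas of chaining several user-defined map/reduce functions into a data-flow pipeline and of caching a frequently shared dataset so it is loaded once. All names are illustrative.

```python
# Toy multi-stage map/reduce pipeline with a crude cache for shared data.
# This mimics the idea behind CloudFlow (shared-data-aware, multiple
# user-defined Map/Reduce functions); it is not CloudFlow's API.
from collections import defaultdict

SHARED_CACHE = {}  # stands in for shared data replicated to every compute node

def load_shared(name, loader):
    """Load a frequently shared dataset once and reuse it across stages/jobs."""
    if name not in SHARED_CACHE:
        SHARED_CACHE[name] = loader()
    return SHARED_CACHE[name]

def run_stage(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for rec in records:
        for key, value in map_fn(rec):                     # user-defined map
            groups[key].append(value)
    return [reduce_fn(k, vs) for k, vs in groups.items()]  # user-defined reduce

docs = ["a rose is a rose", "is a rose a rose"]
stop = load_shared("stopwords", lambda: {"is"})            # shared input, loaded once

# Stage 1: word count (minus shared stopwords); Stage 2: histogram of counts.
counts = run_stage(docs,
                   lambda doc: [(w, 1) for w in doc.split() if w not in stop],
                   lambda w, ones: (w, sum(ones)))
histogram = run_stage(counts,
                      lambda wc: [(wc[1], 1)],
                      lambda c, ones: (c, sum(ones)))
print(counts)      # [('a', 4), ('rose', 4)]
print(histogram)   # [(4, 2)]
```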

5.
An important challenge for the adoption of cloud computing in the scientific community remains the efficient allocation and execution of data-intensive scientific workflows to reduce execution time and the size of transferred data. The data transfer overhead is becoming significant for emerging scientific workflows whose input/output files and intermediate data products range in the hundreds of gigabytes. The allocation of scientific workflows on public clouds can be described through a variety of perspectives and parameters, and has been proven to be NP-complete. This paper proposes an evolutionary approach for task allocation on public clouds that considers data transfer and execution time. In our framework, a solution is represented by an allocation chromosome that encodes the allocation of tasks to nodes, and an ordering chromosome that defines the execution order according to the scientific workflow representation. We propose a multi-objective optimization that relies on a cloud cost model and employs tailored evolution operators. Starting from a population of possible solutions, we apply crossover and mutation operators on both chromosomes, aiming to optimize the data transferred between nodes as well as the total workflow runtime. The crossover operators combine parts of solutions to reduce data overhead, whereas the mutation operators swap parts of the same chromosome according to pre-defined rules. Our experimental study compares the proposed approach with current state-of-the-art approaches using synthetic and real-life workflows. Our algorithm performs similarly to existing heuristics for small workflows and shows up to 80% improvement for larger synthetic workflows. To further validate our approach, we compare the allocation and scheduling obtained by our approach with those obtained by popular scientific workflow managers when real workflows with hundreds of tasks are executed on a public cloud. The results show a 10% improvement in runtime over existing schedulers, caused by an 80% reduction in transferred data and by optimized allocation and ordering of tasks. This improved data locality has greater impact, as it can be employed to improve and study data provenance and to facilitate data persistence for scientific workflows.
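The paper's tailored operators and cost model are only summarized above; the miniature sketch below shows the two-chromosome representation with an invented cost function mixing runtime and cross-node transfer, and uses a mutation-only evolution loop. The paper additionally applies crossover on both chromosomes and respects workflow dependency constraints, both of which are omitted here for brevity.

```python
# Minimal two-chromosome GA for workflow task allocation. The cost model,
# parameters, and mutation-only loop are illustrative assumptions, not the
# paper's tailored evolution operators; dependency constraints are ignored.
import random

TASKS, NODES = 6, 3
runtime = [[random.uniform(1, 5) for _ in range(NODES)] for _ in range(TASKS)]
data_out = [random.uniform(0, 2) for _ in range(TASKS)]   # GB produced per task

def cost(alloc, order):
    # assumed objective: execution time plus a penalty for cross-node transfers
    time = sum(runtime[t][alloc[t]] for t in order)
    transfer = sum(data_out[a] for a, b in zip(order, order[1:]) if alloc[a] != alloc[b])
    return time + 2.0 * transfer

def individual():
    alloc = [random.randrange(NODES) for _ in range(TASKS)]   # allocation chromosome
    order = random.sample(range(TASKS), TASKS)                # ordering chromosome
    return alloc, order

def mutate(alloc, order):
    alloc, order = alloc[:], order[:]
    alloc[random.randrange(TASKS)] = random.randrange(NODES)  # reassign one task
    i, j = random.sample(range(TASKS), 2)
    order[i], order[j] = order[j], order[i]                   # swap two positions
    return alloc, order

population = [individual() for _ in range(30)]
for _ in range(100):
    population.sort(key=lambda ind: cost(*ind))
    elite = population[:10]
    population = elite + [mutate(*random.choice(elite)) for _ in range(20)]

best = min(population, key=lambda ind: cost(*ind))
print("best cost:", round(cost(*best), 2))
```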

6.
The distributed nature of the Web, as a decentralized system exchanging information between heterogeneous sources, has underlined the need to manage interoperability, i.e., the ability to automatically interpret information in Web documents exchanged between different sources, which is necessary for efficient information management and search applications. In this context, XML was introduced as a data representation standard that simplifies the tasks of interoperation and integration among heterogeneous data sources, allowing data to be represented in (semi-)structured documents consisting of hierarchically nested elements and atomic attributes. However, while XML has proven most effective for exchanging data, i.e., for syntactic interoperability, it is limited when it comes to handling semantics, i.e., semantic interoperability, since it only specifies the syntactic and structural properties of the data without any further semantic meaning. As a result, XML semantic-aware processing has become a motivating challenge in Web data management, requiring dedicated semantic analysis and disambiguation methods to assign well-defined meaning to XML elements and attributes. In this context, most existing approaches: (i) ignore the problem of identifying ambiguous XML elements/nodes, (ii) only partially consider their structural relationships/context, (iii) use syntactic information in processing XML data regardless of the semantics involved, and (iv) are static in adopting fixed disambiguation constraints, thus limiting user involvement. In this paper, we provide a new XML Semantic Disambiguation Framework, titled XSDF, designed to address each of the above limitations, taking an XML document as input and producing as output a semantically augmented XML tree made of unambiguous semantic concepts extracted from a reference machine-readable semantic network. XSDF consists of four main modules for: (i) linguistic pre-processing of simple/compound XML node labels and values, (ii) selecting ambiguous XML nodes as targets for disambiguation, (iii) representing target nodes as special sphere neighborhood vectors including all XML structural relationships within a (user-chosen) range, and (iv) running context vectors through a hybrid disambiguation process combining two approaches, concept-based and context-based disambiguation, allowing the user to tune disambiguation parameters to her needs. Conducted experiments demonstrate the effectiveness and efficiency of our approach in comparison with alternative methods. We also discuss some practical applications of our method, ranging over semantic-aware query rewriting, semantic document clustering and classification, mobile and Web services search and discovery, as well as blog analysis and event detection in social networks and tweets.
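To make the disambiguation idea concrete, the toy sketch below performs context-based disambiguation only: a context vector is built from the labels in a node's structural neighborhood and compared, via cosine similarity, against gloss vectors of candidate senses. The tiny sense inventory stands in for the reference machine-readable semantic network and is purely an assumption, not XSDF's actual resources.

```python
# Toy context-based disambiguation for an ambiguous XML node label, in the
# spirit of XSDF's context vectors. The miniature "sense inventory" below is
# an illustrative stand-in for a real machine-readable semantic network.
import math
from collections import Counter

SENSES = {  # ambiguous label -> candidate senses with gloss words
    "mouse": {
        "mouse#animal": "rodent small mammal tail cheese",
        "mouse#device": "computer pointing device click button usb",
    }
}

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def disambiguate(label, context_labels):
    """Pick the sense whose gloss is closest to the node's structural context."""
    ctx = Counter(w for lbl in context_labels for w in lbl.lower().split())
    best_sense, _ = max(SENSES[label].items(),
                        key=lambda kv: cosine(ctx, Counter(kv[1].split())))
    return best_sense

# Context gathered from ancestor/sibling/descendant labels within some range.
print(disambiguate("mouse", ["computer", "peripherals", "usb port", "click"]))
# -> 'mouse#device'
```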

7.
Cloud computing offers the massive scalability and elasticity required by many scientific and commercial applications. Combining the computational and data-handling capabilities of clouds with parallel processing also has the potential to tackle Big Data problems efficiently. Science gateway frameworks and workflow systems enable application developers to implement complex applications and make them available to end-users via simple graphical user interfaces. The integration of such frameworks with Big Data processing tools on the cloud opens new opportunities for application developers. This paper investigates how workflow systems and science gateways can be extended with Big Data processing capabilities. A generic approach based on infrastructure-aware workflows is suggested, and a proof of concept is implemented based on the WS-PGRADE/gUSE science gateway framework and its integration with the Hadoop parallel data processing solution, based on the MapReduce paradigm, in the cloud. The analysis demonstrates that the described methods for integrating Big Data processing with workflows and science gateways work well in different cloud infrastructures and application scenarios, and can be used to create massively parallel applications for scientific analysis of Big Data.

8.
End-to-end scientific application workflows that integrate high-end experiments and instruments with large-scale simulations and end-user displays are becoming increasingly important. These workflows require complex couplings and data sharing between distributed components, involve large data volumes, and present varying hard (in-time data delivery) and soft (in-transit processing) quality of service (QoS) requirements. As a result, supporting efficient data transport is critical for such workflows. In this paper, we leverage software-defined networking (SDN) to address issues of data transport service control and resource provisioning to meet varying QoS requirements from multiple coupled workflows sharing the same service medium. Specifically, we present a flexible control and disciplined resource scheduling approach for data transport services in science networks. Furthermore, we emulate an SDN testbed on top of the FutureGrid virtualized testbed and use it to evaluate our approach on a realistic scientific workflow. Our results show that SDN-based control and resource scheduling based on simple, intuitive models can meet the requirements of the targeted workflows with high resource utilization.

9.
Scientific workflow systems have been introduced in response to the demand of researchers from several domains of science who need to process and analyze increasingly large datasets. The design of these systems is largely based on the observation that data analysis applications can be composed as pipelines or networks of computations on data. In this work, we present a run-time support system designed to facilitate this type of computation in distributed computing environments. Our system is optimized for data-intensive workflows, in which efficient management and retrieval of data, coordination of data processing and data movement, and checkpointing of intermediate results are critical and challenging issues. Experimental evaluation of our system shows that linear speedups can be achieved for sophisticated applications, which are implemented as networks of multiple data processing components.

10.
Many scientific workflows are data intensive: large volumes of intermediate datasets are generated during their execution. Some valuable intermediate datasets need to be stored for sharing or reuse. Traditionally, they are selectively stored according to the system storage capacity, with the selection made manually. As doing science on the cloud has become popular, more intermediate datasets in scientific cloud workflows can be stored under different storage strategies based on a pay-as-you-go model. In this paper, we build an intermediate data dependency graph (IDG) from the data provenance in scientific workflows. With the IDG, deleted intermediate datasets can be regenerated, and on this basis we develop a novel algorithm that finds a minimum cost storage strategy for the intermediate datasets in scientific cloud workflow systems. The strategy achieves the best trade-off between computation cost and storage cost by automatically storing the most appropriate intermediate datasets in cloud storage. This strategy can be utilised on demand as a minimum cost benchmark for all other intermediate dataset storage strategies in the cloud. We utilise Amazon's cloud cost model and apply the algorithm to both general random workflows and a specific astrophysics pulsar searching workflow for evaluation. The results show that the benchmark effectively demonstrates cost-effectiveness relative to other representative storage strategies.
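The IDG-based algorithm itself is not reproduced here; the sketch below only illustrates the per-dataset trade-off it optimizes, comparing the monthly cost of storing an intermediate dataset against the amortized cost of regenerating it on demand. Prices, usage rates, and dataset names are invented placeholders, not Amazon's actual cost model.

```python
# Illustration of the store-vs-regenerate trade-off behind a minimum-cost
# storage strategy. Prices and usage figures are invented placeholders.
STORAGE_PRICE = 0.023     # $/GB-month (assumed)
COMPUTE_PRICE = 0.10      # $/CPU-hour (assumed)

def monthly_cost_if_stored(size_gb):
    return size_gb * STORAGE_PRICE

def monthly_cost_if_deleted(regen_cpu_hours, uses_per_month):
    # regeneration re-runs the computation from stored ancestors on each use
    return regen_cpu_hours * COMPUTE_PRICE * uses_per_month

def decide(name, size_gb, regen_cpu_hours, uses_per_month):
    store = monthly_cost_if_stored(size_gb)
    regen = monthly_cost_if_deleted(regen_cpu_hours, uses_per_month)
    choice = "store" if store <= regen else "delete and regenerate on demand"
    print(f"{name}: store=${store:.2f}/mo, regenerate=${regen:.2f}/mo -> {choice}")

decide("de-dispersed signal", size_gb=800, regen_cpu_hours=20, uses_per_month=4)
decide("candidate list",      size_gb=2,   regen_cpu_hours=50, uses_per_month=10)
```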

11.
High-performance computers have mainly been applied to traditional scientific computing, while in the cloud-computing era data-intensive applications have emerged as a major new class of applications and are becoming increasingly important. This work explores how to perform massive data processing efficiently on high-performance computers, so that they can support data-intensive applications well while carrying out scientific computing, thereby extending the application domains of high-performance computers. After analyzing the feasibility of implementing and deploying the MapReduce model on high-performance computers, experiments were conducted in a high-performance computing environment. The results show that the parallel I/O capability of the storage system cannot be fully exploited, which is the main bottleneck preventing the system from running efficiently. The cause of this performance bottleneck is the contention for and conflict over cluster file system resources brought about by high concurrency. Finally, several schemes for resolving cluster file system resource conflicts are proposed, which constitute directions for future research.

12.
The volume of XML data has become enormous and still grows very quickly, as much data is now encoded in XML by virtue of its simplicity and extensibility. While tree labeling algorithms play a crucial role in XML query processing, conventional algorithms are all sequential and therefore fail to label a large volume of XML data in a timely manner. To address this issue, we devise parallel tree labeling algorithms for massive XML data. Specifically, we focus on how to efficiently label a single large XML file in parallel. We first propose parallel versions of two prominent tree labeling schemes based on the MapReduce framework. We then present techniques for runtime workload balancing and data repartitioning to solve performance issues caused by data skew and MapReduce's inherent limitations. Through extensive experiments with synthetic and real-world datasets on 15 nodes, we show that our parallel labeling algorithms are up to 17 times faster than conventional algorithms and remain robust against data skew.
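The two labeling schemes parallelized in the paper are not named in the abstract; as an assumed example, the sketch below shows sequential Dewey-style prefix labeling, which is roughly the per-partition work a mapper would perform before a later phase fixes up label prefixes across partitions.

```python
# Sequential Dewey-style labeling of XML elements. In a MapReduce setting,
# each mapper would label its partition of the file and a later phase would
# adjust label prefixes across partitions; that coordination is omitted here.
import xml.etree.ElementTree as ET

def dewey_labels(xml_text):
    root = ET.fromstring(xml_text)
    labels = []
    def visit(node, label):
        labels.append((".".join(map(str, label)), node.tag))
        for i, child in enumerate(node, start=1):
            visit(child, label + [i])
    visit(root, [1])
    return labels

for label, tag in dewey_labels("<lib><book><title/></book><book/></lib>"):
    print(label, tag)
# 1 lib / 1.1 book / 1.1.1 title / 1.2 book
```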

13.
Optimizing different layers of the MapReduce stack has its own advantages and disadvantages. Addressing the optimization of MapReduce workloads, this paper first introduces the relevant concepts. It then describes cost-based optimization of MapReduce jobs and the techniques it relies on, contrasting it with rule-of-thumb (RoT) tuning, and evaluates cost-based optimization of MapReduce jobs. Based on the data-flow and resource dependencies within workflows, three workflow optimizers are proposed; cost-based workflow optimization is evaluated, and the workflow optimizers are evaluated end to end. Finally, experiments assess the optimization overhead of the workflow optimizers, and the strengths and weaknesses of the three optimizers are compared and analyzed.

14.
A Web service composition method supporting heterogeneous data integration
全立新  岳昆  刘惟一 《计算机应用》2007,27(6):1438-1441
Based on the "collaborator" data integration architecture, this work takes data queries in a network environment as the basic Web services and relational databases and XML documents as representative heterogeneous data sources. Building on existing query processing and XML data binding techniques, it presents a data integration model for the Web service environment. By defining the basic operations (services) of this model and describing the service composition process with a directed graph, it proposes a Web service composition method supporting heterogeneous data integration together with corresponding optimization strategies.

15.
16.
The problem of decentralized data sharing, which is relevant to a wide range of applications, is still a source of major theoretical and practical challenges, in spite of many years of sustained research. In this paper we focus on the challenge of efficient query evaluation in information integration systems that use the global-as-view approach, with the objective of developing query-processing strategies that are widely applicable and easy to implement in real-life applications. Our algorithms take into account important features of today's data sharing applications: XML as a likely interface or representation for data sources; the potential for information overlap across data sources; and the need for inter-source processing, as in joins of data across sources. The focus of this paper is on the performance-related characteristics of several alternative approaches that we propose for efficient query processing in information integration, including an approach that uses materialized restructured views. We use synthetic and real-life datasets in our implementation of an information integration system shell to provide experimental results demonstrating that our algorithms are efficient and competitive in the information integration setting. In addition, our experimental results allow us to make context-specific recommendations on selecting query-processing approaches from our proposed alternatives. As such, our approaches could form a basis for scalable query processing in information integration and interoperability in many practical settings.

17.
There has been a lot of research on MapReduce for big data analytics. This new class of systems sacrifices DBMS functionality such as query languages, schemas, or indexes in order to maximize scalability and parallelism. However, as high DBMS functionality is considered important for big data analytics as well, there have been many efforts to support DBMS functionality in MapReduce. HadoopDB is the only work that directly utilizes the DBMS for big data analytics in the MapReduce framework, taking advantage of both the DBMS and MapReduce. However, HadoopDB does not support sharability for the entire data, since it stores the data across multiple nodes in a shared-nothing manner, i.e., it partitions a job into multiple tasks where each task is assigned to a fragment of data. Due to this limitation, HadoopDB cannot effectively process queries that require inter-node communication. That is, HadoopDB needs to re-load the entire data to process some queries (e.g., 2-way joins) or cannot support some complex queries (e.g., 3-way joins). In this paper, we propose the new notion of a DFS-integrated DBMS, where a DBMS is tightly integrated with the distributed file system (DFS). By using the DFS-integrated DBMS, we obtain sharability of the entire data: any DBMS process in the system can access any data, since multiple DBMSs run on an integrated storage system in the DFS. To process big data analytics in parallel, our approach uses the MapReduce framework on top of a DFS-integrated DBMS. We call this framework PARADISE. In PARADISE, we employ a job splitting method that logically splits a job based on the predicate in the integrated storage system, in contrast to the physical splitting in HadoopDB. We also propose the notion of locality mapping for further optimization of logical splitting. We show that PARADISE effectively overcomes the drawbacks of HadoopDB, with the following strengths: (1) it has significantly faster (by up to 6.41 times) amortized query processing performance, since it obviates the need to re-load data as required in HadoopDB; (2) it supports query types more complex than the ones supported by HadoopDB.
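PARADISE's splitting and locality mapping are only described at a high level above; the toy sketch below merely illustrates the idea of logical splitting, where a job is divided into tasks by key-range predicates over the shared store rather than by physical data fragments. The query text, column names, and ranges are invented for illustration.

```python
# Sketch of "logical splitting": instead of physically partitioning the data,
# a job is split into tasks by key-range predicates that each DBMS process
# evaluates against the shared, DFS-resident store. Names/ranges are invented.
def split_job(base_query, key_column, key_min, key_max, num_tasks):
    step = (key_max - key_min + num_tasks) // num_tasks
    tasks = []
    for i in range(num_tasks):
        lo = key_min + i * step
        hi = min(lo + step - 1, key_max)
        tasks.append(f"{base_query} WHERE {key_column} BETWEEN {lo} AND {hi}")
        if hi == key_max:
            break
    return tasks

for task_sql in split_job("SELECT * FROM lineitem", "l_orderkey", 1, 6_000_000, 4):
    print(task_sql)
```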

18.
Hadoop MapReduce has evolved into an important industry standard for massively parallel data processing and has become widely adopted for a variety of use cases. Recent works have shown that indexes can improve the performance of selective MapReduce jobs dramatically. However, one major weakness of existing approaches is the high index creation cost. We present HAIL (Hadoop Aggressive Indexing Library), a novel indexing approach for HDFS and Hadoop MapReduce. HAIL creates different clustered indexes over terabytes of data with minimal, often invisible costs, and it dramatically improves the runtimes of several classes of MapReduce jobs. HAIL features two different indexing pipelines, static indexing and adaptive indexing. HAIL static indexing efficiently indexes datasets while uploading them to HDFS. To do so, HAIL leverages the default replication of Hadoop and enhances it with logical replication. This allows HAIL to create multiple clustered indexes for a dataset, e.g., one for each physical replica. Still, in terms of upload time, HAIL matches or even improves over the performance of standard HDFS. Additionally, HAIL adaptive indexing allows for automatic, incremental indexing at job runtime with minimal runtime overhead. For example, HAIL adaptive indexing can completely index a dataset as a byproduct of only four MapReduce jobs while incurring an overhead as low as 11% for the very first of those jobs only. In our experiments, we show that HAIL improves job runtimes by up to 68× over Hadoop. This article is an extended version of the VLDB 2012 paper (Dittrich et al. in PVLDB 5(11):1591–1602, 2012).
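As a rough intuition for what a clustered index buys here, the sketch below keeps the records of one data block sorted on a chosen attribute so that a selective job can binary-search instead of scanning. HAIL creates such clustered indexes per physical replica during upload; this in-memory toy does not attempt to reproduce the HDFS integration.

```python
# Toy per-block "clustered index": records in a block are kept sorted on one
# attribute so selections can binary-search instead of scanning the block.
# The attribute choice per replica and the HDFS machinery are not modeled.
import bisect

class IndexedBlock:
    def __init__(self, records, attr):
        self.attr = attr
        self.records = sorted(records, key=lambda r: r[attr])   # clustered on attr
        self.keys = [r[attr] for r in self.records]

    def select(self, lo, hi):
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_right(self.keys, hi)
        return self.records[i:j]

block = IndexedBlock([{"id": k, "price": (k * 37) % 100} for k in range(1000)], "price")
print(len(block.select(10, 20)), "records with price in [10, 20]")
```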

19.
XML (eXtensible Markup Language) has been developed and deployed by domain-specific standardization bodies and commercial companies, and studies have been conducted on a wide variety of issues encompassing XML. In the use of XML for wireless computing, the focus has been on investigating ways to efficiently represent XML data for transmission over a wireless environment. We propose a middleware, Xstream (XML Streaming), for efficiently streaming XML content over a wireless environment by leveraging the rich semantics and structural characteristics of XML documents and by flexibly packaging fragments of data into autonomous units, known as XDU (Xstream Data Unit) fragments. The concept of an XDU is fundamental to the operation of Xstream: it provides for the efficient transfer of documents across a wireless link and allows other issues and challenges pertaining to wireless transmission to be addressed. By fragmenting and organizing an XML document into XDU fragments, we are able to send fragments incrementally across a wireless link, while the receiver is able to perform look-ahead processing of the document without having to wait for the entire document to be downloaded. We propose a fragmenting strategy based on the value of the wireless link's Maximum Transfer Unit (MTU). In addition, we present and evaluate several packetizing strategies, i.e., strategies wherein a collection of XDUs are grouped into a packet to optimize packet delivery and processing. At the receiving end, a reassembly strategy incrementally reconstructs the XML document as XDU fragments are received, thereby facilitating client applications' implementation of look-ahead processing.
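The XDU format itself is not specified in the abstract; the sketch below only shows the framing idea, with byte-level fragments sized to the MTU and an incremental reassembly loop. Real XDUs follow XML structural boundaries and carry structural metadata, so this is an illustrative simplification.

```python
# Toy MTU-based fragmentation of a serialized XML document into numbered
# fragments, plus incremental reassembly at the receiver. The real XDU
# format is structure-aware; this sketch only shows the framing idea.
def fragment(xml_bytes, mtu, header_size=8):
    payload = mtu - header_size                 # room left after a fixed header
    chunks = [xml_bytes[i:i + payload] for i in range(0, len(xml_bytes), payload)]
    return [(seq, len(chunks), chunk) for seq, chunk in enumerate(chunks)]

def reassemble(fragments):
    buf, expected = {}, None
    for seq, total, chunk in fragments:         # fragments may arrive over time
        buf[seq], expected = chunk, total
        # look-ahead processing could start here on the prefix received so far
    if expected is not None and len(buf) == expected:
        return b"".join(buf[i] for i in range(expected))
    return None                                 # still incomplete

doc = b"<catalog>" + b"<item id='1'>widget</item>" * 40 + b"</catalog>"
frags = fragment(doc, mtu=576)
print(len(frags), "fragments;", "ok" if reassemble(frags) == doc else "mismatch")
```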

20.
This paper introduces FireWorks, workflow software for running high-throughput calculation workflows at supercomputing centers. FireWorks has been used to complete over 50 million CPU-hours' worth of computational chemistry and materials science calculations at the National Energy Research Supercomputing Center. It has been designed to serve the demanding high-throughput computing needs of these applications, with extensive support for (i) concurrent execution through job packing, (ii) failure detection and correction, (iii) provenance and reporting for long-running projects, (iv) automated duplicate detection, and (v) dynamic workflows (i.e., modifying the workflow graph during runtime). We have found that these features are highly relevant to enabling modern data-driven and high-throughput science applications, and we discuss our implementation strategy, which rests on Python and NoSQL databases (MongoDB). Finally, we present performance data and limitations of our approach along with planned future work. Copyright © 2015 John Wiley & Sons, Ltd.
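For readers new to FireWorks, the sketch below follows its tutorial-style Python API to define and run a tiny two-step workflow. It assumes a MongoDB instance reachable with default settings, and exact signatures may vary across FireWorks versions.

```python
# Minimal two-step FireWorks workflow, following the library's tutorial-style
# API. Assumes a local MongoDB with default settings; adjust LaunchPad() args
# for a real deployment.
from fireworks import Firework, FWorker, LaunchPad, ScriptTask, Workflow
from fireworks.core.rocket_launcher import rapidfire

launchpad = LaunchPad()                      # connection to the MongoDB "LaunchPad"

fw1 = Firework(ScriptTask.from_str('echo "stage 1: prepare input"'), name="prepare")
fw2 = Firework(ScriptTask.from_str('echo "stage 2: run calculation"'), name="run")
wf = Workflow([fw1, fw2], {fw1: [fw2]}, name="demo_workflow")  # fw1 -> fw2

launchpad.add_wf(wf)                         # register the workflow
rapidfire(launchpad, FWorker())              # pull and execute jobs locally
```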
