Similar literature
Found 20 similar documents; search took 406 ms
1.
A number of recent technological trends have made data-intensive applications such as continuous media (audio and video) servers a reality. These servers store and retrieve large volumes of data using magnetic disks. Servers consisting of multiple nodes and large arrays of heterogeneous disk drives have become a fact of life for several reasons. First, magnetic disks might fail. Failed disks are almost always replaced with newer disk models because the current technological trend for these devices is one of annual increase in both performance and storage capacity. Second, storage requirements are ever increasing, forcing servers to be scaled up progressively. In this study, we present a framework to enable parity-based data protection for heterogeneous storage systems and to compute their mean lifetime. We describe the tradeoffs associated with three alternative techniques: independent subservers, dependent subservers, and disk merging. The disk merging approach provides a solution for systems that require highly available secondary storage in environments that also necessitate maximum flexibility.
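
The parity-based protection mentioned in this abstract can be illustrated with a minimal sketch. It assumes the simplest case of equal-sized blocks and single XOR parity; it is not the paper's disk merging technique, and the names `xor_blocks`, `compute_parity`, and `reconstruct` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def compute_parity(data_blocks):
    """Parity block for one stripe: the XOR of all data blocks."""
    return xor_blocks(data_blocks)


def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# One stripe spread over three hypothetical disks (equal block size assumed).
stripe = [b"disk", b"fail", b"safe"]
parity = compute_parity(stripe)

# Simulate losing the second disk and rebuilding its block.
recovered = reconstruct([stripe[0], stripe[2]], parity)
```

Because XOR is its own inverse, XOR-ing the surviving blocks with the parity block cancels everything except the lost block. Heterogeneous disks complicate this picture because block counts and bandwidths differ across drives, which is the problem the abstract says the framework addresses.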

2.
Distribution of data and computation allows for solving larger problems and executing applications that are distributed in nature. The grid is a distributed computing infrastructure that enables coordinated resource sharing within dynamic organizations consisting of individuals, institutions, and resources. The grid extends the distributed and parallel computing paradigms, allowing for resource negotiation and dynamic allocation, heterogeneity, open protocols, and services. Grid environments can be used both for compute-intensive tasks and data-intensive applications by exploiting their resources, services, and data access mechanisms. Data mining algorithms and knowledge discovery processes are both compute- and data-intensive; therefore, the grid can offer a computing and data management infrastructure for supporting decentralized and parallel data analysis. This paper discusses how grid computing can be used to support distributed data mining. Research activities in grid-based data mining and some challenges in this area are presented along with some promising future directions for developing grid-based distributed data mining.

3.
A large class of intensive numerical applications shows an irregular structure, exhibiting unpredictable runtime behavior. Two kinds of irregularity can be distinguished in these applications. The first is irregular control structures, derived from the use of conditional statements on data only known at runtime. The second is irregular data structures, derived from computations involving sparse matrices, grids, trees, graphs, etc. Many of these applications exhibit a large amount of parallelism, but the above features usually make exploiting such parallelism very difficult. This paper discusses the effective parallelization of numerical irregular codes, focusing on the definition and use of data-parallel extensions to express the parallelism that they exhibit. We show that combining data distributions with storage structures makes it possible to obtain efficient parallel codes. Codes dealing with sparse matrices, finite element methods and molecular dynamics (MD) simulations are taken as working examples.
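
The idea of combining a data distribution with a storage structure can be sketched for a sparse matrix-vector product. The sketch assumes compressed sparse row (CSR) storage and a block distribution of rows; the abstract names neither, and the data-parallel language extensions it discusses are not shown. The helper names `to_csr` and `csr_matvec_rows` are hypothetical.

```python
def to_csr(dense):
    """Convert a dense row-major matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end offset of this row's nonzeros
    return values, col_idx, row_ptr


def csr_matvec_rows(csr, x, rows):
    """Compute y[i] = sum_j A[i][j] * x[j] for the rows owned by one worker."""
    values, col_idx, row_ptr = csr
    return [
        sum(values[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in rows
    ]


dense = [[4, 0, 0],
         [0, 0, 2],
         [1, 0, 3]]
x = [1, 2, 3]
csr = to_csr(dense)

# Block row distribution over two hypothetical workers: rows 0-1 and row 2.
# Each call touches only its own rows, so the two could run in parallel.
y = csr_matvec_rows(csr, x, range(0, 2)) + csr_matvec_rows(csr, x, range(2, 3))
```

The irregularity shows up in the indirect access `x[col_idx[k]]`: which entries of `x` a worker needs depends on data known only at runtime, which is why the distribution and the storage format have to be considered together.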

4.
In this paper, we describe a software infrastructure that unifies transactions and replication in three-tier architectures and provides data consistency and high availability for enterprise applications. The infrastructure uses transactions based on the CORBA Object Transaction Service to protect the application data in databases on stable storage, using a roll-backward recovery strategy, and replication based on the Fault Tolerant CORBA standard to protect the middle-tier servers, using a roll-forward recovery strategy. The infrastructure replicates the middle-tier servers to protect the application business logic processing. In addition, it replicates the transaction coordinator, which renders the two-phase commit protocol nonblocking and, thus, avoids potentially long service disruptions caused by failure of the coordinator. The infrastructure handles the interactions between the replicated middle-tier servers and the database servers through replicated gateways that prevent duplicate requests from reaching the database servers. It implements automatic client-side failover mechanisms, which guarantee that clients know the outcome of the requests that they have made, and it retries aborted transactions automatically on behalf of the clients.

5.
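Research on a service-oriented cloud data mining engine   Total citations: 1, self-citations: 0, citations by others: 1
The scalability of data mining algorithms is constrained when they process massive data. Across business and scientific research, knowledge discovery processes and requirements differ considerably, so an effective mechanism is needed to design and run various types of distributed data mining applications. This paper proposes CloudDM, a framework for a service-oriented cloud data mining engine. Unlike grid-based distributed data mining frameworks, CloudDM exploits the massive-data processing capability of the open-source cloud computing platform Hadoop and supports the design and execution of distributed data mining applications in a service-oriented manner. The key components and implementation techniques of the service-oriented cloud data mining engine system are described. Building on a service-oriented software architecture and a cloud-based data mining engine makes it possible to effectively address massive data storage, data processing, and the interoperability of data mining algorithms in massive data mining.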

6.
Information Systems, 2005, 30(1): 71-88
Many large organizations have multiple databases distributed in different branches, and therefore multi-database mining is an important task for data mining. To reduce the search cost in the data from all databases, we need to identify which databases are most likely relevant to a data mining application. This is referred to as database selection. For real-world applications, database selection has to be carried out multiple times to identify relevant databases that meet the needs of different applications. In particular, a mining task may be without reference to any specific application. In this paper, we present an efficient approach for classifying multiple databases based on their similarity to one another. Our approach is application-independent.

7.
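金光浩 (Jin Guanghao), 莫则尧 (Mo Zeyao). 《计算机学报》 (Chinese Journal of Computers), 2005, 28(12): 2045-2051
In some numerical simulations based on discrete meshes, the data dependencies between mesh cells can be abstracted as a directed graph. Partitioning such a directed graph into multiple subgraphs and mapping the simulation tasks of each subgraph to different processors is the basis for parallel computation of this class of simulations. A partitioning algorithm must jointly consider four objectives: connectivity, degree of parallelism, load balance, and communication overhead. Building on traditional directed graph partitioning algorithms, this paper proposes a multi-objective directed graph partitioning algorithm for domain decomposition that balances these four objectives. When applied to parallel computation of cylindrically symmetric neutron transport on two-dimensional unstructured meshes, the parallel flux sweep algorithm achieves a parallel efficiency more than 50% higher on this domain partitioning than on the original undirected graph domain partitioning.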

8.
This paper describes the design, implementation and evaluation of a parallel object database server. While a number of research groups and companies now provide object database servers designed to run on uniprocessors, there has been surprisingly little work on the exploitation of parallelism to provide scalable performance in Object Database Management Systems (ODBMS). The work described in this paper takes as its starting-point the Object Database Management Group (ODMG) standard for object databases, thereby allowing the project to focus on research into parallelism, rather than on the ODBMS interfaces. The system is designed to run on a distributed memory parallel machine, and the paper describes the key issues and design decisions including: parallel query optimisation and execution, flow control, support for user-defined operations in queries, object distribution, cache management and navigational client access. The work shows that the significant differences between the object and relational database paradigms lead to significant differences in the designs of parallel servers to support these two paradigms. The paper presents an extensive performance analysis of the prototype systems, which shows that good performance can be achieved on a cluster of Linux PCs.

9.
This paper addresses the problem of moving parallel dynamic security assessment applications from a static homogeneous cluster environment to a dynamic heterogeneous grid environment. Functional parallelism and data parallelism are supported by both the message passing interface model and the TCP/IP model. To account for the differences in heterogeneous computing resources and the complexity of large-scale power system communities, a kernel-based multilevel algorithm is proposed for network partitioning. Since the bottleneck in distributed computation is low-speed network communication, a bi-level latency exploitation technique is introduced for numerically solving system differential equations. The proposed grid-based implementation includes the core simulation engine, grid computing middleware, a Python interface and Python front-end utilities. Tests for a 39-bus network, a 4000-bus network and a 10,000-bus network are reported, and the results of these experiments demonstrate that the proposed scheme is able to execute the distributed simulations on computational grid infrastructure and provide efficient parallelism.

10.
As part of its commitment to conserve biodiversity, the Canadian government passed legislation in 2003 for the protection and recovery of wildlife and plant populations at risk of extinction in Canada. There is currently no single system to store, retrieve and interpret information on species and their critical habitats in support of this legislation (i.e. the Species at Risk Act). In order to meet the information requirements for the Species at Risk Act (SARA), it is highly desirable to develop network designs, infrastructures and applications that link distributed data sources into an integrated system that manages data and provides decision support. The system architecture described here will be built on the versatile WILDSPACE™ Decision Support System (hereafter referred to as ‘WILDSPACE DSS’) and will be web-based consisting of distributed servers (database servers, web servers and map servers) providing different kinds of information including species and habitat data, geo-spatial data, metadata, web services and decision support analyses. Its design takes into consideration the needs of different user groups (Intranet, Extranet and Internet) and data security. The complexity of Species at Risk data requires considerable “best practice” database design efforts that strike an optimum balance among storage, maintenance, and application performance. The WILDSPACE DSS provides an effective platform for the delivery of information and services to Species at Risk practitioners for better decision-making through its data mining and modeling functionality.

11.
To meet the huge demands of computation power and storage space, a future data center may have to include up to millions of servers. The conventional hierarchical tree-based data center network architecture faces several challenges in scaling a data center to that size. Previous research effort has shown that a server-centric architecture, where servers are not only computation and storage workstations but also intermediate nodes relaying traffic for other servers, performs well in scaling a data center to a huge number of servers. This paper presents a server-centric data center network called DPillar, whose topology is inspired by the classic butterfly network. DPillar provides several nice properties and achieves the balance between topological scalability, network performance, and cost efficiency, which make it suitable for building large scale future data centers. Using only commodity hardware, a DPillar network can easily accommodate millions of servers. The structure of a DPillar network is symmetric so that any network bottleneck is eliminated at the architectural level. With each server having only two ports, DPillar is able to provide the bandwidth to support communication intensive distributed applications. This paper studies the interconnection features of DPillar, how to compute routes in DPillar, and how to forward packets in DPillar. Extensive simulation experiments have been performed to evaluate the performance of DPillar. The results show that DPillar performs well even in the presence of a large number of server and switch failures.

12.
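A bisecting grid clustering method and its validity   Total citations: 6, self-citations: 1, citations by others: 6
This paper presents a new grid-based clustering algorithm. By successively bisecting each grid cell into two parts of equal volume, the algorithm uses a new criterion to measure the dissimilarity among all cells and thereby finds candidate cluster prototypes in the data set. It overcomes the sensitivity of clustering results to input parameters that affects current grid-based clustering algorithms, and, with linear computational time, it performs well on data sets containing clusters of arbitrary shape and non-uniform density. Two experiments verify the effectiveness of the proposed algorithm.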

13.
Heterogeneous multicore chipsets with many levels of parallelism are becoming increasingly common in high-performance computing systems. Effective use of parallelism in these new chipsets constitutes the challenge facing a new generation of large scale scientific computing applications. This study examines methods for improving the performance of two-dimensional and three-dimensional atmospheric constituent transport simulation on the Cell Broadband Engine Architecture (CBEA). A function offloading approach is used in a 2D transport module, and a vector stream processing approach is used in a 3D transport module. Two methods for transferring noncontiguous data between main memory and accelerator local storage are compared. By leveraging the heterogeneous parallelism of the CBEA, the 3D transport module achieves performance comparable to two nodes of an IBM BlueGene/P, or eight Intel Xeon cores, on a single PowerXCell 8i chip. Module performance on two CBEA systems, an IBM BlueGene/P, and an eight-core shared-memory Intel Xeon workstation is given.

14.
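Skyline computation plays a very important role in data mining, multi-criteria decision making, database visualization, and related fields, and has received wide attention in recent years. Most previous research on skyline queries has focused on centralized data sets, that is, centralized skyline queries, and has produced many results. In practice, however, the relevant data are almost always scattered across several different servers, so skyline query computation in a distributed environment needs to collect large amounts of data from each server. Existing methods for skyline queries in distributed environments have two main problems: first, skyline query processing is slow; second, much unnecessary overlapping data is transferred between servers over the network. This paper proposes a dichotomous multi-layer grid method (DMLG) that can effectively process skyline queries in a distributed environment. The method uses a grid and borrows from bisection to minimize unnecessary transfer of overlapping data. Experiments on different data sets show that this method outperforms existing methods.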
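
For readers unfamiliar with skyline queries, the underlying dominance test can be sketched as follows. This is the naive quadratic baseline, not the DMLG method of the abstract; it assumes smaller values are better in every dimension, and the function names and the sample data are hypothetical.

```python
def dominates(p, q):
    """True if p is no worse than q in every dimension and strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Return the points that no other point dominates (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (price, distance) pairs held on two different servers.
server_a = [(50, 8), (70, 5), (90, 9)]
server_b = [(60, 2), (80, 1)]

# Centralized answer: ship every point to one place.
global_skyline = skyline(server_a + server_b)

# Distributed baseline: each server sends only its local skyline,
# and the coordinator takes the skyline of the union.
merged_skyline = skyline(skyline(server_a) + skyline(server_b))
```

Merging local skylines gives the same answer as the centralized computation, because a point dominated on its own server is also dominated globally. In this example, `(70, 5)` survives locally on `server_a` and is still transferred, only to be eliminated by `(60, 2)` at the coordinator; this is the kind of unnecessary transfer the abstract says the grid-based method aims to reduce.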

15.
Big data processing systems are characterized by a relevant number of components that are used in parallel to run multiple instances of the same tasks in order to achieve the needed performance levels in applications characterized by huge amounts of data. The number of components depends on the size of the data involved, so that new resources (e.g., processing or storage servers) are usually added as the working database grows. A reliable performance evaluation of these systems is at the same time crucial, in order to enable administrators and developers to keep pace with data growth, and extremely difficult, due to the intrinsic complexity of these architectures. Nevertheless, the available literature does not yet offer sufficient experiences, nor significant methodologies, in such a direction. This paper presents a novel modeling approach for performance evaluation of big data systems, based on mean field analysis, a set of methods for approximate inference of probabilistic models derived from statistical physics. This approach, by containing the excessive state space growth characterizing more traditional modeling methodologies, also requires a significantly reduced effort with respect to simulation-based ones.

16.
Effective data distribution techniques can significantly reduce the total execution time of a program on grid computing environments, especially for data mining applications. In this paper, we describe a linear programming formulation for the data distribution problem on grids. Furthermore, a heuristic method, named Heuristic Data Distribution Scheme (HDDS), is proposed to solve this problem. We implement two types of data mining applications, Association Rule Mining and Decision Tree Construction, and conduct experiments on grid testbeds. Experimental results show that data mining programs using the proposed HDDS to distribute data could execute more efficiently than traditional schemes could.

17.
18.
Visualization techniques for mining large databases: a comparison   Total citations: 9, self-citations: 0, citations by others: 9
Visual data mining techniques have proven to be of high value in exploratory data analysis, and they also have a high potential for mining large databases. In this article, we describe and evaluate a new visualization-based approach to mining large databases. The basic idea of our visual data mining techniques is to represent as many data items as possible on the screen at the same time by mapping each data value to a pixel of the screen and arranging the pixels adequately. The major goal of this article is to evaluate our visual data mining techniques and to compare them to other well-known visualization techniques for multidimensional data: the parallel coordinate and stick-figure visualization techniques. For the evaluation of visual data mining techniques, the perception of data properties counts most, while the CPU time and the number of secondary storage accesses are only of secondary importance. In addition to testing the visualization techniques using real data, we developed a testing environment for database visualizations similar to the benchmark approach used for comparing the performance of database systems. The testing environment allows the generation of test data sets with predefined data characteristics which are important for comparing the perceptual abilities of visual data mining techniques.

19.
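Grid-based data analysis methods process data in units of grid cells, avoiding point-to-point computation over data objects and greatly improving the efficiency of data analysis. However, traditional grid-based methods process cells independently during analysis and ignore the coupling relationships between cells, which reduces the accuracy of the analysis. In applying grids to detect anomalies in data streams, this work no longer processes cells independently but takes the coupling between cells into account, and proposes GCStream-OD, a grid-coupling-based anomaly detection algorithm for data streams. The algorithm uses grid coupling to express precisely the correlation between data stream objects and uses a pruning strategy to improve efficiency. Experimental results on five real data sets show that GCStream-OD achieves high anomaly detection quality and efficiency.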

20.
A multitenant database cluster is a data-storage subsystem for applications with multiclient architecture. Essentially, it may be regarded as an additional level of abstraction beyond the set of regular servers in relational databases. This approach permits effective operation in multiclient applications. Various management strategies for client data in a multitenant database cluster are compared. Simple strategies may be based on the client-base structure of the applications, while more complex strategies are based on special metrics. The comparison of the management strategies relies on simulation of the cluster. On the basis of experimental results, conclusions are formulated regarding the best data-management strategy.
