Similar Documents
20 similar documents found (search time: 31 ms)
1.
The optimization of the execution time of a parallel algorithm can be achieved through the use of an analytical cost-model function representing the running time. Typically, the cost function includes a set of parameters that model the behavior of the system and the algorithm. In order to reach an optimal execution, some of these parameters must be fitted according to the input problem and to the target architecture. An optimization problem can be stated in which the modeled execution time of the algorithm is used to estimate the parameters. Due to the large number of variable parameters in the model, analytical minimization techniques are ruled out. Exhaustive search techniques can solve the optimization problem, but when the number of parameters or the size of the computational system increases, the method becomes impracticable due to time restrictions. The use of approximation methods to guide the search is an alternative; however, these techniques suffer from dependence on the particular algorithm modeled and from poor solution quality caused by the many local optima in the objective functions. The problem becomes particularly difficult in complex systems hosting a large number of heterogeneous processors solving non-trivial scientific applications. Metaheuristics allow the development of valid approaches to solving general problems with a large number of parameters; a well-known advantage of metaheuristic methods is their ability to obtain high-quality solutions in short running times while maintaining generality. We propose combining the parameterized analytical cost-model function with metaheuristic minimization methods, which constitutes a practical new alternative for minimizing the parallel execution time in complex systems. The success of the proposed approach is shown with two different algorithmic schemes on parallel heterogeneous systems. Furthermore, the development of a general framework allows us to easily implement and experiment with different metaheuristics and adjust them to particular problems.
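As a minimal illustration of the approach (not the authors' framework), the sketch below fits the free parameters of a hypothetical cost-model function with simulated annealing; the model `t_par`, its parameters `b` and `g`, and all constants are invented for the example.

```python
import math
import random

def t_par(params, n=10_000, p=16):
    """Hypothetical analytical cost model T(b, g) for problem size n on p
    processors: computation shrinks as block size b grows, while
    communication grows with it (illustrative only)."""
    b, g = params
    return (n / p) * (1 + 1 / b) + g * b * math.log2(p)

def simulated_annealing(cost, init, steps=5000, temp0=10.0):
    """Generic simulated-annealing minimizer over a real parameter vector."""
    x, fx = list(init), cost(init)
    best, fbest = list(x), fx
    for k in range(steps):
        temp = temp0 * (1 - k / steps) + 1e-9            # linear cooling
        cand = [max(1e-3, xi + random.gauss(0, 0.5)) for xi in x]
        fc = cost(cand)
        if fc < fx or random.random() < math.exp((fx - fc) / temp):
            x, fx = cand, fc                             # accept (possibly uphill) move
        if fx < fbest:
            best, fbest = list(x), fx
    return best, fbest

params, t = simulated_annealing(t_par, init=[4.0, 1.0])
print(f"fitted (b, g) = {params}, modeled time = {t:.2f}")
```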

2.
Statistical measures for quantifying task and machine heterogeneities
We study heterogeneous computing (HC) systems that consist of a set of different machines that have varying capabilities. These machines are used to execute a set of heterogeneous tasks that vary in their computational complexity. Finding the optimal mapping of tasks to machines in an HC system has been shown to be, in general, an NP-complete problem. Therefore, heuristics have been used to find near-optimal mappings. The performance of allocation heuristics can be affected significantly by factors such as task and machine heterogeneities. In this paper, we identify different statistical measures used to quantify the heterogeneity of HC systems, and show the correlation between the performance of the heuristics and these measures through simple mapping examples and synthetic data analysis. In addition, we illustrate how regression trees can be used to predict the most appropriate heuristic for an HC system based on its heterogeneity.
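One common family of such measures, sketched below under the assumption that heterogeneity is summarized by coefficients of variation of an expected-time-to-compute (ETC) matrix (the paper's exact definitions may differ):

```python
import numpy as np

def heterogeneity(etc):
    """Coefficient-of-variation heterogeneity measures over an ETC matrix
    (rows = tasks, columns = machines)."""
    # Task heterogeneity: variation of execution times across tasks,
    # averaged over machines.
    task_cov = (etc.std(axis=0) / etc.mean(axis=0)).mean()
    # Machine heterogeneity: variation across machines, averaged over tasks.
    machine_cov = (etc.std(axis=1) / etc.mean(axis=1)).mean()
    return task_cov, machine_cov

rng = np.random.default_rng(0)
etc = rng.gamma(shape=2.0, scale=10.0, size=(100, 8))  # synthetic ETC matrix
print(heterogeneity(etc))
```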

3.
New reconfigurable computing architectures are being introduced to overcome some of the limitations of conventional microprocessors and fine-grained reconfigurable devices (e.g., FPGAs). One promising new class of architectures is the Configurable System-on-Chip (CSoC). These solutions were designed to offer high computational performance for real-time signal processing and for a wide range of applications exhibiting high degrees of parallelism. Programming such systems is an inherently challenging problem due to the lack of a programming model. This paper describes a novel heterogeneous system architecture for signal processing and data-streaming applications. It offers high computational performance and a high degree of flexibility and adaptability by employing a micro Task Controller (mTC) unit in conjunction with programmable and configurable hardware. The hierarchically organized architecture provides a programming model, allows an efficient mapping of applications, and is shown to be easily scalable to future VLSI technologies. Several mappings of commonly used digital signal processing algorithms for future telecommunication and multimedia systems are presented, together with implementation results for a standard-cell ASIC realization in 0.18 micron six-layer UMC CMOS technology.

4.
Considering secure communication and privacy protection in heterogeneous cryptographic systems, this paper proposes a heterogeneous signcryption scheme from a certificateless cryptosystem to an identity-based cryptosystem. The scheme allows multiple senders to communicate securely with multiple receivers and provides anonymity for the senders; only the trusted authority can trace a sender's real identity. On the receiver side, the Lagrange interpolation formula is applied to prevent the information of legitimate receivers from being disclosed to unauthorized users. In the random oracle model, the scheme is proven to satisfy indistinguishability under adaptive chosen-ciphertext attacks and existential unforgeability under adaptive chosen-message attacks. Experimental results show that the scheme effectively improves the computational efficiency of the system.
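For readers unfamiliar with the primitive named above, the following is a minimal sketch of Lagrange interpolation over a prime field, the building block used to restrict reconstruction to authorized receivers; the prime and polynomial are illustrative, not taken from the scheme.

```python
P = 2**127 - 1  # illustrative Mersenne prime standing in for the group order

def lagrange_at_zero(shares):
    """Recover f(0) from points (x_i, y_i) on a polynomial over GF(P).
    Fewer points than deg(f)+1 reveal nothing about f(0)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P        # product of (0 - x_j)
                den = den * (xi - xj) % P    # product of (x_i - x_j)
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

# Three shares of the degree-2 polynomial f(x) = 42 + 7x + 5x^2 (mod P).
f = lambda x: (42 + 7 * x + 5 * x * x) % P
print(lagrange_at_zero([(1, f(1)), (2, f(2)), (3, f(3))]))  # -> 42
```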

5.
The increase in computational power and the networking abilities of home appliances is revolutionizing the way we interact with our homes. This trend is growing stronger and opening up a number of technological challenges. From the point of view of distributed systems, there is a need to design architectures that enhance the comfort and safety of the home while dealing with issues of heterogeneity, scalability and openness. By considering the evolution of domotic research and projects, we advocate a role for Web services in the domestic network. We ground our claim by proposing a concrete architecture for a home in which the health of an elderly person is monitored. The architecture is implemented on a heterogeneous set of devices, which allows us to evaluate it and draw conclusions on the feasibility of using service-oriented approaches in ubiquitous computing.

6.
Mapping the Internet generally consists in sampling the network from a limited set of sources using traceroute-like probes. This methodology, akin to merging different spanning trees rooted at a set of destinations, has been argued to introduce uncontrolled sampling biases that might produce statistical properties of the sampled graph which sharply differ from those of the original. In this paper, we explore these biases and provide a statistical analysis of their origin. We derive an analytical approximation for the probability of edge and vertex detection that exploits the role of the number of sources and targets, and allows us to relate the global topological properties of the underlying network with the statistical accuracy of the sampled graph. In particular, we find that the edge and vertex detection probability depends on the betweenness centrality of each element. This allows us to show that shortest-path routed sampling provides a better characterization of underlying graphs with broad distributions of connectivity. We complement the analytical discussion with a thorough numerical investigation of simulated mapping strategies in network models with different topologies. We show that sampled graphs provide a fair qualitative characterization of the statistical properties of the original networks across a range of different strategies and exploration parameters. Moreover, we characterize the level of redundancy and completeness of the exploration process as a function of the topological properties of the network. Finally, we study numerically how the fraction of vertices and edges discovered in the sampled graph depends on the particular deployment of probing sources. The results hint at the steps toward more efficient mapping strategies.
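A toy version of the sampling process analyzed here (assuming the networkx library; the graph model and parameters are illustrative) merges shortest paths from a few sources to many targets and compares the sampled graph with the original:

```python
import random
import itertools
import networkx as nx

def traceroute_sample(g, n_sources, n_targets, seed=0):
    """Merge shortest paths from every source to every target -- a toy
    stand-in for traceroute-like exploration."""
    rng = random.Random(seed)
    nodes = rng.sample(list(g), n_sources + n_targets)
    sources, targets = nodes[:n_sources], nodes[n_sources:]
    sampled = nx.Graph()
    for s, t in itertools.product(sources, targets):
        path = nx.shortest_path(g, s, t)       # one route per (source, target) pair
        sampled.add_edges_from(zip(path, path[1:]))
    return sampled

g = nx.barabasi_albert_graph(2000, 3, seed=1)  # broad connectivity distribution
view = traceroute_sample(g, n_sources=5, n_targets=200)
print(f"discovered {view.number_of_nodes()}/{g.number_of_nodes()} vertices "
      f"and {view.number_of_edges()}/{g.number_of_edges()} edges")
```

High-betweenness vertices lie on many source-target shortest paths and are therefore discovered early, which is the bias the paper quantifies analytically.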

7.
This work presents a new algorithm, called Heterogeneous Dynamic Pipeline Mapping, that dynamically improves the performance of pipeline applications running on heterogeneous systems. It aims to balance the application load by determining the best combination of replication (of slow stages) and gathering (of fast stages), taking into account the computation and communication capacities of the processors. In addition, the algorithm has been designed with the requirement of keeping complexity low to allow its use in a dynamic tuning tool. For this reason, it uses an analytical performance model of pipeline applications that addresses hardware heterogeneity and depends on parameters that can be known in advance or measured at run-time. Extensive experimentation is presented, including a comparison with the optimal brute-force algorithm, a general comparison with the Binary Search Closest algorithm, and an application example with the Ferret pipeline included in the PARSEC benchmark suite. Results, matching those of the best existing algorithms, show significant performance improvements with lower complexity (\(O(N^3)\), where \(N\) is the number of pipeline stages).
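The sketch below captures only the guiding idea of replicating slow stages, under the simplifying assumptions that throughput is limited by the slowest stage and communication costs are ignored; it is not the HDPM algorithm itself, and all function names and numbers are invented.

```python
def pipeline_throughput(stage_work, replicas, speeds):
    """Throughput is bounded by the slowest stage; replicating a stage over
    several processors divides its per-item service demand."""
    times = [w / sum(speeds[p] for p in procs)
             for w, procs in zip(stage_work, replicas)]
    return 1.0 / max(times)

def greedy_replicate(stage_work, speeds):
    """Give each stage one processor (fastest first), then repeatedly hand
    each remaining processor to the current bottleneck stage."""
    order = sorted(range(len(speeds)), key=lambda p: -speeds[p])
    replicas = [[order[i]] for i in range(len(stage_work))]
    for p in order[len(stage_work):]:
        bottleneck = max(range(len(stage_work)),
                         key=lambda s: stage_work[s]
                         / sum(speeds[q] for q in replicas[s]))
        replicas[bottleneck].append(p)
    return replicas

speeds = [4.0, 2.0, 2.0, 1.0, 1.0]   # heterogeneous processor speeds
work = [8.0, 2.0, 1.0]               # per-item work of a 3-stage pipeline
mapping = greedy_replicate(work, speeds)
print(mapping, pipeline_throughput(work, mapping, speeds))
```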

8.
The potential computational power of today's multicore processors has drastically improved compared to single-processor architectures. Since the trend of increasing the processor frequency is almost over, the competition for increased performance has moved to the number of cores. Consequently, the fundamental features of system designs and their associated design flows and tools need to change to support scalable parallelism and design portability. The same feature can be exploited to design reconfigurable hardware, such as FPGAs, which leads to rethinking the mapping of sequential algorithms to HDL. The sequential programming paradigm, widely used for programming single-processor systems, does not naturally provide explicit or implicit forms of scalable parallelism. Conversely, dataflow programming is an approach that naturally exposes parallelism and has the potential to unify SW and HDL designs on heterogeneous platforms. This study describes a dataflow-based design methodology aiming at unified co-design and co-synthesis of heterogeneous systems. Experimental results on the implementation of a JPEG codec and an MPEG-4 SP decoder on heterogeneous platforms demonstrate the flexibility and capabilities of this design approach.

9.
We propose an exact mathematical mapping that can be useful for building an analog quantum simulator in which ion-based systems realize the many-body electron–electron Coulomb interaction of an electron gas. This exact mapping allows us to deal with a system that is difficult to solve and control by using a potentially more experimentally feasible setup. We show that ions can efficiently simulate electronic Coulomb interactions through a unitary dilatation transform. The transformation does not need to be physically implemented if only the energy spectrum is desired, eliminating the complexity overhead. The proposal works in any number of dimensions and could be used to simulate different topological phases of electrons in graphene-like structures, using ions confined in honeycomb lattices.

10.
The priority scheduling principle plays a crucial role in the Differentiated Services (DiffServ) architecture for the provisioning of Quality-of-Service (QoS) for network-based applications. Analytical modelling and performance evaluation of priority queuing systems have received significant research attention. However, most existing work has focused on the analysis of priority queuing under either Short Range Dependent (SRD) or Long Range Dependent (LRD) traffic only. Recent studies have shown that realistic traffic in modern multi-service networks is heterogeneous in nature. With the aim of investigating the impact of heterogeneous traffic on the design and performance of network-based systems, this paper proposes a novel analytical model for priority queuing systems subject to heterogeneous LRD self-similar and SRD Poisson traffic. The key contribution of the paper is to extend the application of the generalized Schilder's theorem (originally a large-deviation principle for handling Gaussian processes only) to deal with heterogeneous traffic, and to develop analytical upper and lower bounds on the queue length distributions of individual traffic flows. The validity and accuracy of the model, demonstrated through extensive comparisons between the analytical bounds and simulation results, make it a practical and cost-effective evaluation tool for investigating the performance behaviour of priority queuing systems under heterogeneous traffic with various parameter settings.
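For context, large-deviation analyses of queues with Gaussian input (the setting where Schilder's theorem classically applies) produce tail bounds of the generic form

\[
\Pr\{Q > b\} \;\lesssim\; \exp\!\left(-\inf_{t>0} \frac{\big(b + (c - \mu)\,t\big)^{2}}{2\,v(t)}\right),
\]

where \(c\) is the service rate, \(\mu\) the mean aggregate input rate, and \(v(t)\) the variance of the aggregate input over an interval of length \(t\) (e.g., \(v(t) = \sigma^2 t^{2H}\) for LRD fractional-Brownian traffic with Hurst parameter \(H\)). This is only the standard Gaussian starting point, not the paper's result; the paper's bounds extend such machinery to the heterogeneous LRD-plus-Poisson aggregate under priority scheduling.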

11.
In this paper, we study the application of the max-product algorithm (MPA) to the generalized multiple-fault diagnosis (GMFD) problem, which consists of components (to be diagnosed) and alarms/connections that can be unreliable. The MPA and the improved sequential MPA (SMPA) that we develop in this paper are local message-passing algorithms that operate on the bipartite diagnosis graph (BDG) associated with the GMFD problem and converge to the maximum a posteriori probability (MAP) solution if this graph is acyclic (in addition, the MPA requires the MAP solution to be unique). Our simulations suggest that both the MPA and the SMPA perform well in more general systems that may exhibit cycles in the associated BDGs (the SMPA also appears to outperform the MPA in these more general systems). We provide analytical results for acyclic BDGs and also assess the performance of both algorithms under particular patterns of alarm observations in general graphs; this allows us to obtain analytical bounds on the probability of making an erroneous diagnosis with respect to the MAP solution. We also evaluate the performance of the MPA and SMPA algorithms via simulations, and provide comparisons with previously developed heuristics for this type of diagnosis problem. We conclude that the MPA and the SMPA perform well at reasonable computational complexity when the underlying diagnosis graph is sparse.
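For reference, the standard max-product updates on a factor graph (the general form that the MPA instantiates on the BDG, where factor nodes correspond to alarms and variable nodes to component states) are

\[
m_{f \to x}(x) = \max_{\mathbf{x}_{\partial f} \setminus x} \Big( f(\mathbf{x}_{\partial f}) \prod_{x' \in \partial f \setminus x} m_{x' \to f}(x') \Big),
\qquad
m_{x \to f}(x) = \prod_{f' \in \partial x \setminus f} m_{f' \to x}(x),
\]

and each component's MAP estimate maximizes its prior times the product of incoming factor-to-variable messages. On an acyclic BDG these iterations terminate with the exact MAP solution, consistent with the convergence statement above.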

12.
13.
One of the most significant causes of performance degradation for scientific and engineering applications on high-performance computing systems is the uneven distribution of the computational work over the resources of the system. This effect, known as load imbalance, is even more noticeable in the case of irregular applications and heterogeneous distributed systems. This has motivated the parallel and distributed computing research community to focus on methods that provide good load balancing for scientific and engineering applications running on (heterogeneous) distributed systems. Efficient load balancing and scheduling methods are employed for scientific applications from various fields, such as mechanics, materials, physics, chemistry, biology, applied mathematics, etc. Such applications typically employ a large number of computational methods in order to simulate complex phenomena, on very large scales of time and magnitude. These simulations consist of routines that perform repetitive computations (in the form of DO/FOR loops) over very large data sets which, if not properly implemented and executed, may suffer from poor performance. The number of repetitive computations in the simulation codes is not always constant. Moreover, the computational nature of these simulations may in fact be irregular, leading to cases where one computation takes (unpredictably) more time than others. For successful and timely results, large-scale simulations require the use of large-scale computing systems, which often are widely distributed and highly heterogeneous. Moreover, large-scale computing systems are usually shared among multiple users, which causes the quality and quantity of the available resources to be highly unpredictable. There are numerous load balancing methods in the literature for different parallel architectures. The most recent of these methods typically follow the master-worker paradigm, where a single coordinator (master) is responsible for making all the scheduling decisions based on information provided by the workers. Depending on the application requirements, the scheduling policy and the computational environment, the benefits of this paradigm may be limited as follows: (1) its efficiency may not scale as the number of processors increases, and (2) it is quite probable that scheduling decisions are made based on outdated information, especially on systems where the workload changes rapidly. In an effort to address these limitations, we propose a distributed (master-less) load balancing scheme, in which the scheduling decisions are made by the workers in a distributed fashion. We implemented this method, along with two other master-worker schemes (a previously existing one and a recently modified one), for three different scientific computational kernels. To validate the usefulness and efficiency of the proposed scheme, we conducted a series of comparative performance tests against the two master-worker schemes for each computational kernel. The target system is an SMP cluster, on which we simulated three different patterns of system-load fluctuation. The experiments strongly support the belief that the distributed approach offers greater performance and better scalability on such systems, showing an overall improvement ranging from 13% to 24% over the master-worker approaches.
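A minimal shared-memory stand-in for the master-less idea (illustrative only; in the paper's distributed setting the shared counter would be replaced by worker-to-worker communication): each worker claims its own shrinking chunk of loop iterations, so no central coordinator makes decisions on possibly stale information.

```python
import threading

class SelfScheduler:
    """Master-less chunk self-scheduling: workers claim their own chunks
    from shared state instead of waiting for a master's decision."""
    def __init__(self, n_iters, n_workers):
        self.n, self.p, self.next = n_iters, n_workers, 0
        self.lock = threading.Lock()

    def claim(self):
        with self.lock:                                   # the only shared state
            if self.next >= self.n:
                return None
            size = max(1, (self.n - self.next) // (2 * self.p))  # shrinking chunks
            lo, self.next = self.next, self.next + size
            return lo, self.next

def worker(sched, wid, results):
    total = 0
    while (chunk := sched.claim()) is not None:
        lo, hi = chunk
        total += sum(i * i for i in range(lo, hi))        # stand-in loop body
    results[wid] = total

sched, results = SelfScheduler(n_iters=1_000_000, n_workers=4), {}
threads = [threading.Thread(target=worker, args=(sched, w, results))
           for w in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sum(results.values()))
```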

14.
The increasing computational needs of parallel applications inevitably require portability across parallel architectures, which now include heterogeneous processing resources, such as CPUs and GPUs, and multiple SIMD/SIMT widths. However, the lack of a common parallel programming paradigm that provides predictable, near-optimal performance on each resource leads to the use of low-level frameworks with architecture-specific optimizations, which in turn causes the code base to diverge and makes porting difficult. Our experiences with parallel applications and frameworks lead us to the conclusion that achieving performance portability requires a common set of high-level directives and efficient mapping onto each architecture.

15.
Hybrid architectures are systems where a high-performance general-purpose computer is coupled to one or more special-purpose devices (SPDs). Such a system can be the optimal choice for several fields of computational science. Configuring the system and finding the optimal mapping of the application tasks onto the hybrid machine is often not straightforward. Performance modeling is a tool to tackle and solve these problems. We have developed a performance model to simulate the behavior of a hybrid architecture consisting of a parallel multiprocessor where some nodes host a GRAPE board. GRAPE is a very high performance SPD used in computational astrophysics. We validate our model on the architecture at our disposal, and show examples of predictions that our model can produce.

16.
Computational Grids are emerging as a new infrastructure for high-performance computing. Since the resources in a Grid can be heterogeneous and distributed, mesh-based applications require a mesh partitioner that considers both processor and network heterogeneity. We have developed a heterogeneous mesh partitioner, called PaGrid. PaGrid uses a multilevel graph partitioning approach, augmented by execution-time load balancing in the final uncoarsening phase. We show that minimizing total communication cost (e.g., as done by JOSTLE) can place significant load on processors connected by slow links, which results in higher application execution times. Therefore, PaGrid balances the estimated execution time of the application across processors. PaGrid's performance is compared with that of two existing mesh partitioners, METIS 4.0 and JOSTLE 3.0, for mapping several application meshes to two models of heterogeneous computational Grids. PaGrid is found to produce significantly better partitions than JOSTLE and slightly better partitions than METIS in most cases, in terms of estimated application execution time averaged over a large number of runs with different random-number seeds.
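As a rough illustration (not PaGrid's exact cost function), the per-processor quantity that execution-time balancing equalizes has the general form

\[
T_p \;=\; \frac{W_p}{s_p} \;+\; \sum_{q \neq p} \frac{C_{pq}}{b_{pq}},
\]

where \(W_p\) is the computational load assigned to processor \(p\), \(s_p\) its speed, \(C_{pq}\) the communication volume between the sub-meshes on \(p\) and \(q\), and \(b_{pq}\) the bandwidth of the connecting link. Minimizing \(\max_p T_p\) instead of total communication cost penalizes traffic over slow links exactly where it hurts, which is the effect described above.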

17.
Concurrence, as one of the entanglement measures, is a useful tool for characterizing quantum entanglement in various quantum systems. However, the computation of concurrence involves difficult optimizations, and an exact formula has been found only for the two-qubit case. We investigate the concurrence of four-qubit quantum states and derive an analytical lower bound on concurrence using the multiqubit monogamy inequality. This lower bound is shown to improve on existing bounds. The approach can be generalized to arbitrary qubit systems. We present an exact formula of concurrence for some mixed quantum states. For even-qubit states, we derive an improved lower bound on concurrence using a monogamy equality for qubit systems. We also show that a multipartite state is k-nonseparable if the multipartite concurrence is larger than a constant related to the value of k, the number of qudits and the dimension of the subsystems. Our results can be applied to detect multipartite k-nonseparable states.
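The two-qubit exact formula alluded to above is Wootters' concurrence:

\[
C(\rho) = \max\{0,\ \lambda_1 - \lambda_2 - \lambda_3 - \lambda_4\},
\qquad
\tilde\rho = (\sigma_y \otimes \sigma_y)\,\rho^{*}\,(\sigma_y \otimes \sigma_y),
\]

where the \(\lambda_i\) are the square roots of the eigenvalues of \(\rho\tilde\rho\) in decreasing order and \(\rho^{*}\) is the complex conjugate of \(\rho\) in the computational basis. It is the absence of such a closed form beyond two qubits that motivates the lower bounds derived here.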

18.
Image population analysis is the class of statistical methods that plays a central role in understanding the development, evolution, and disease of a population. However, these techniques often require excessive computational power and memory, requirements that are compounded by a large number of volumetric inputs. Restricted access to supercomputing power limits their influence in general research and practical applications. In this paper we introduce ISP, an Image-Set Processing streaming framework that harnesses the processing power of commodity heterogeneous CPU/GPU systems to attack this computational problem. In ISP, we introduce specially designed streaming algorithms and data structures that provide an optimal solution for out-of-core multi-image processing problems, both in terms of memory usage and computational efficiency. ISP makes use of the asynchronous execution mechanism supported by parallel heterogeneous systems to efficiently hide the inherent latency of the processing pipeline of out-of-core approaches. Consequently, for computationally intensive problems, the ISP out-of-core solution can achieve the same performance as an in-core solution. We demonstrate the efficiency of the ISP framework on synthetic and real datasets.

19.
Randomized algorithms are gaining ground in high-performance computing applications, as they have the potential to outperform deterministic methods while still providing accurate results. We propose a randomized solver for distributed multicore architectures that efficiently solves large dense symmetric indefinite linear systems of the kind encountered, for instance, in parameter estimation problems or electromagnetism simulations. The contribution of this paper is to propose efficient kernels for applying random butterfly transformations, together with a new distributed implementation combined with a runtime (PaRSEC) that automatically adjusts data structures, data mappings, and scheduling as systems scale up. Both the parallel distributed solver and the supporting runtime environment are innovative: to our knowledge, the randomization approach associated with this solver has never been used in public-domain software for symmetric indefinite systems. The underlying runtime framework allows seamless data mapping and task scheduling, matching its capabilities to the underlying hardware features of heterogeneous distributed architectures. The performance of our software is similar to that obtained for symmetric positive definite systems, while requiring only half the execution time and half the data storage of a general dense solver.
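The structure of a random butterfly transformation is simple enough to sketch; the following follows the usual definition (a depth-2 recursive butterfly with random diagonal blocks) but is illustrative only, assumes numpy and scipy, and is unrelated to the paper's PaRSEC kernels.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(0)

def butterfly(n):
    """One n x n random butterfly: B = (1/sqrt(2)) [[R, S], [R, -S]]
    with R, S random diagonal matrices (entries close to 1)."""
    r = np.diag(np.exp(rng.uniform(-0.05, 0.05, n // 2)))
    s = np.diag(np.exp(rng.uniform(-0.05, 0.05, n // 2)))
    return np.block([[r, s], [r, -s]]) / np.sqrt(2)

def recursive_butterfly(n, depth=2):
    """Depth-d recursive butterfly: a product of block-diagonal butterfly
    matrices; depth 2 is commonly reported to suffice in practice."""
    u = np.eye(n)
    for k in range(depth, 0, -1):
        size = n // 2 ** (k - 1)
        u = u @ block_diag(*[butterfly(size) for _ in range(2 ** (k - 1))])
    return u

n = 8
a = rng.standard_normal((n, n)); a = a + a.T   # symmetric indefinite test matrix
u = recursive_butterfly(n)
a_r = u.T @ a @ u                               # randomized; symmetry is preserved
print(np.allclose(a_r, a_r.T))                  # True
```

Because \(U^{T} A U\) is a congruence transform, the randomized matrix stays symmetric, which is why the solver can keep the cost and storage of a symmetric method.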

20.
This work presents an efficient mapping scheme for the multilayer perceptron (MLP) network trained using the back-propagation (BP) algorithm on a network of workstations (NOWs). A hybrid partitioning (HP) scheme is used to partition the network, and each partition is mapped onto processors in the NOW. We derive the processing time and memory space required to implement the parallel BP algorithm on NOWs. Performance parameters such as speed-up and space-reduction factor are evaluated for the HP scheme, and it is compared with earlier work based on a vertical partitioning (VP) scheme for mapping the MLP onto NOWs. The performance of the HP scheme is evaluated by solving an optical character recognition (OCR) problem on a network of ALPHA machines. The analytical and experimental performance shows that the proposed parallel algorithm has better speed-up, less communication time, and a better space-reduction factor than the earlier algorithm. This work also presents a simple and efficient static mapping scheme for heterogeneous systems. Using divisible load scheduling theory, a closed-form expression for the number of neurons assigned to each processor in the NOW is obtained. Analytical and experimental results for the static mapping problem on NOWs are also presented.
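To illustrate the kind of closed form divisible load scheduling theory yields (the simplest case, ignoring communication delays, which the paper's expression additionally accounts for): equalizing the finish times of \(p\) heterogeneous workers gives

\[
\alpha_i \;=\; \frac{1/w_i}{\sum_{j=1}^{p} 1/w_j}, \qquad \alpha_i w_i = \text{const},
\]

where \(w_i\) is the time processor \(i\) needs per unit of load (here, per neuron) and \(\alpha_i\) is the fraction of neurons assigned to it, so that all processors finish simultaneously and no processor idles.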
