Similar Documents
 Found 20 similar documents (search time: 31 ms)
1.
Hardware Support and Evaluation of the Cilk Language on Many-Core Architectures   Cited by: 4 (self-citations: 0, others: 4)
How to program many-core architectures is a pressing open problem. The goal of studying scalable hardware mechanisms that support the Cilk programming model is to strike a balance between good programmability and a scalable hardware implementation. Cilk is a lightweight extension of C: writing a Cilk program is close to serial programming, and the programmer need not worry about low-level system concerns such as scheduling, load balancing, and locality. Building on the scope consistency memory model, this work makes two main contributions: first, to address the poor programmability of scope consistency, it proposes a data-centric method for maintaining cache coherence; second, it proposes a cache-coherence protocol that implements DAG consistency and supports the Cilk programming model on top of it. Experimental results show that all benchmarks achieve good speedups when the number of cores is small (<16), and identify two root causes that make ideal speedups hard to obtain with more cores (>16): unbalanced utilization of on-chip network bandwidth caused by static routing, and limited memory bandwidth.
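The programmability claim above is easiest to see in the canonical Cilk example below: serial C plus two keywords. This is a minimal sketch assuming a Cilk Plus-style toolchain (OpenCilk uses the same cilk_spawn/cilk_sync keywords); it illustrates the programming model the paper's hardware supports, not code from the paper itself.

```c
/* Canonical Cilk example: serial C plus two keywords. Scheduling, load
   balancing, and locality are left to the runtime -- exactly the
   programmability the paper's hardware support targets. */
#include <cilk/cilk.h>
#include <stdio.h>

long fib(long n)
{
    if (n < 2)
        return n;
    long x = cilk_spawn fib(n - 1);  /* may run on another core    */
    long y = fib(n - 2);             /* continues on this core     */
    cilk_sync;                       /* wait for the spawned child */
    return x + y;
}

int main(void)
{
    printf("fib(30) = %ld\n", fib(30));
    return 0;
}
```

Removing the two Cilk keywords yields a valid serial C program with the same semantics, which is what lets programmers ignore the scheduling concerns the abstract lists.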

2.
High-end embedded systems, like their general-purpose counterparts, are turning to many-core cluster-based shared-memory architectures that provide a shared memory abstraction subject to non-uniform memory access costs. In order to keep the cores and memory hierarchy simple, many-core embedded systems tend to employ simple, scratchpad-like memories, rather than hardware-managed caches that require some form of cache coherence management. These “coherence-free” systems still require some means to synchronize memory accesses and guarantee memory consistency. Conventional lock-based approaches may be employed to accomplish the synchronization, but may lead to both usability and performance issues. Instead, speculative synchronization, such as hardware transactional memory, may be a more attractive approach. However, hardware speculative techniques traditionally rely on the underlying cache-coherence protocol to synchronize memory accesses among the cores. The lack of a cache-coherence protocol adds new challenges to the design of hardware speculative support. In this article, we present a new scheme for hardware transactional memory (HTM) support within a cluster-based, many-core embedded system that lacks an underlying cache-coherence protocol. We propose two alternative data versioning implementations for the HTM support, Full-Mirroring and Distributed Logging, and conduct a performance comparison between them. To the best of our knowledge, these are the first designs for speculative synchronization for this type of architecture. Through a set of benchmark experiments using our simulation platform, we show that our designs can achieve significant performance improvements over traditional lock-based schemes.
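As an illustration of the logging flavor of data versioning, the following C sketch shows undo-log-based transactional writes: each write saves the old value before updating in place, and an abort replays the log in reverse. The tx_* names and structure are assumptions for exposition, not the paper's actual HTM interface.

```c
/* Hypothetical sketch of log-based data versioning for a transaction. */
#include <assert.h>
#include <stdint.h>

#define LOG_ENTRIES 256

typedef struct {
    uintptr_t addr[LOG_ENTRIES];   /* logged addresses            */
    uint32_t  old [LOG_ENTRIES];   /* values before the tx wrote  */
    int       top;                 /* number of valid log entries */
} undo_log_t;

static void tx_write(undo_log_t *log, uint32_t *addr, uint32_t val)
{
    assert(log->top < LOG_ENTRIES);         /* overflow would abort the tx */
    log->addr[log->top] = (uintptr_t)addr;  /* version the old data        */
    log->old [log->top] = *addr;
    log->top++;
    *addr = val;                            /* then write in place         */
}

static void tx_abort(undo_log_t *log)
{
    while (log->top > 0) {                  /* roll back, newest first */
        log->top--;
        *(uint32_t *)log->addr[log->top] = log->old[log->top];
    }
}
```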

3.
In many-core architectures, different distributed applications execute in parallel. To meet their constraints, applications may need hard guarantees on communication latency and throughput. Networks on Chip (NoC) are the most promising approach to handling these requirements in architectures with a large number of cores. Dynamic reservation of communication resources in virtual-channel NoCs is used to provide quality of service for concurrent communication. This paper presents a router design supporting both best-effort and connection-oriented guaranteed-service communication. The communication resources are shared dynamically between the two communication schemes. The key contribution is a concept for virtual-channel reservation that supports different bandwidth and latency guarantees for simultaneous guaranteed-service communication flows. Unlike state-of-the-art approaches, the scheduling approach used here gives hard guarantees on throughput and latency. The concept makes it possible to adjust the bandwidth and latency requirements of connections at run-time to cope with dynamically changing application requirements. Owing to its distributed reservation process and resource allocation, it scales well to many-core architectures. The implementation of a router and the required extension of a network interface to support the proposed concept are presented, and the software perspective is discussed. An algorithm is presented that establishes guaranteed-service connections according to the applications' bandwidth requirements. Simulation results are compared with state-of-the-art arbitration schemes and show significant improvements in latency and throughput, e.g. for an MPEG4 application. Synthesis results expose the low area overhead and small impact on energy consumption, which makes the concept highly attractive for QoS-constrained many-core architectures.
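To make the reservation idea concrete, here is a minimal C model of admission control for a guaranteed-service connection: a connection is established only if every link on its route still has a free virtual channel and enough residual bandwidth. The link-state fields and routing interface are illustrative assumptions, not the router's actual RTL.

```c
/* Illustrative admission check for a guaranteed-service connection. */
#include <stdbool.h>

typedef struct {
    int free_vcs;        /* unreserved virtual channels on this link */
    int free_bw;         /* residual bandwidth, in flits per frame   */
} link_state_t;

static bool try_reserve(link_state_t *route[], int hops, int bw_needed)
{
    for (int i = 0; i < hops; i++)           /* check the whole path    */
        if (route[i]->free_vcs < 1 || route[i]->free_bw < bw_needed)
            return false;                    /* reject: no hard guarantee */
    for (int i = 0; i < hops; i++) {         /* commit the reservation  */
        route[i]->free_vcs -= 1;
        route[i]->free_bw  -= bw_needed;
    }
    return true;
}
```

In the paper's distributed scheme this decision is made hop by hop in the routers themselves rather than by a central arbiter, which is what gives the approach its scalability.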

4.
The increasing transistor count on a single chip provides an unprecedented amount of resources for chip designers. Unfortunately, the power consumed by each transistor does not shrink at the same rate, which limits the number of transistors that can be powered on simultaneously. This utilization wall leaves a growing percentage of transistors dark, or powered off, as the chip cannot (a) provide the necessary current or (b) maintain a low operating temperature. To account for dark silicon, the computer architecture community has begun taking advantage of the wealth of available transistors to design efficient, time-sharing systems, often through specialized architectures. Meanwhile, security is quickly becoming a first-tier design constraint, increasing the need for hardware security mechanisms that maintain high levels of availability and detect and protect against intrusion. As we move into the many-core environment, many of these security mechanisms will need to be integrated on-chip. In a chip-multiprocessor environment, security will be necessary because multiple programs or users share resources, which facilitates attacks. In both single-user and multi-user environments, designers can build specialized hardware to support security functions such as authenticity, cryptography, and intrusion detection. In this paper, we survey current hardware security trends and provide insight into how future chip designs can leverage dark silicon for more secure designs. We provide preliminary designs and discuss future challenges and opportunities in dark silicon security. The merging of hardware security and dark silicon will facilitate efficient, fast, and secure designs.

5.
Super-scalar, out-of-order processors that can have tens of read and write requests in the execution window place significant demands on Memory Level Parallelism (MLP). Multi- and many-cores with shared parallel caches further increase MLP demand. Current cache hierarchies, however, have been unable to keep up with this trend, with modern designs allowing only 4–16 concurrent cache misses. This disconnect is exacerbated by recent highly parallel architectures (e.g. GPUs) where per-core power and area budgets favor numerous lighter cores with fewer resources, further reducing per-core support for MLP. Support for hardware and software prefetching increases MLP pressure, since these techniques overlap multiple memory requests with existing computation. In this paper, we propose and evaluate a novel Resource-Aware Prefetching (RAP) compiler algorithm that is aware of, and optimized for, the number of simultaneous prefetches supported. We implemented our algorithm in a GCC-derived compiler and evaluated its performance on an emerging fine-grained many-core architecture. Our results show that the RAP algorithm outperforms a well-known loop prefetching algorithm by up to 40.15% in run-time on average across benchmarks, and the state-of-the-art GCC implementation by up to 34.79%, depending on the hardware configuration. Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism and show run-time improvements of up to 24.61%. To demonstrate the robustness of our approach, we conduct a design-space exploration (DSE) of the target architecture by varying (i) the amount of chip resources designated for per-core prefetch storage and (ii) off-chip bandwidth. We show that the RAP algorithm is robust in that it improves performance across all design points considered. We also identify the Pareto-optimal hardware-software configuration, which delivers a 53.66% run-time improvement on average while using only 5.47% more chip area than the bare-bones design.
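The C sketch below illustrates the general idea of resource-aware software prefetching: the prefetch distance is capped so that no more than an assumed number of prefetches are in flight at once. RAP itself performs this reasoning inside the compiler; the constants and loop here are illustrative only.

```c
/* Sketch of resource-aware loop prefetching: issue one software
   prefetch per cache line, far enough ahead to hide latency but capped
   by the number of simultaneous prefetches the core supports. */
#define MAX_INFLIGHT 4     /* assumed per-core limit on outstanding prefetches */
#define LINE_DOUBLES 8     /* doubles per 64-byte cache line                   */
#define DIST (MAX_INFLIGHT * LINE_DOUBLES)

double sum_array(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if ((i % LINE_DOUBLES) == 0 && i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0, 0);  /* read, low locality */
        s += a[i];
    }
    return s;
}
```

Issuing prefetches beyond the hardware limit only evicts earlier ones or stalls, which is why tuning the distance to the actual per-core prefetch storage matters in the paper's DSE.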

6.
As the demand for computing power steadily increases, it has become popular to equip computer systems with many-core accelerators to run concurrent applications. Efficient management of compute resources in such a system is challenging because factors such as workload variation, QoS-requirement changes, and hardware failures can cause dynamic changes in system status. Recently, a variety of resource management techniques for many-core accelerators have been proposed, but they are usually tailored to a specific target architecture. In this paper, we present SoPHy+, which supports various types of many-core accelerators based on a hybrid resource management technique. SoPHy+ provides a seamless design flow from the programming front-end, which automatically generates dataflow-style function code from the task specification, to the run-time environment, which adaptively manages compute resources for concurrent applications in response to system status changes. SoPHy+ has been implemented on two different many-core architectures: the Intel Xeon Phi coprocessor and an Epiphany-like NoC virtual prototype. Experimental results show that SoPHy+ adapts effectively to run-time workload variation with affordable run-time resource management overhead.

7.
The short-range pair interaction consumes most of the CPU time in molecular dynamics (MD) simulations. The inherent computation sparsity makes it challenging to achieve a high-performance kernel on emerging many-core architectures. In this paper, we present a highly efficient short-range force kernel on the Sunway, a novel many-core architecture with many unique features. The parallel efficiency of this algorithm on the Sunway many-core processor is strongly limited by poor data locality and write conflicts. To enhance data locality, we adopt a super-cluster-based neighbor list with a granularity that fits in the local memory of the computing cores. In the absence of a low-overhead locking mechanism, using data-privatized force arrays is a more feasible way to avoid write conflicts, but it incurs a large data-reduction overhead. We adopt a dual-slice partitioning scheme for both hardware resources and computing tasks, which utilizes on-chip data communication to reduce the data-reduction overhead and provide load balancing. Moreover, we exploit single instruction multiple data (SIMD) parallelism and perform instruction reordering of the force kernel on this many-core processor. The experimental results show that the optimized force kernel obtains a speedup of 226x over the reference implementation and achieves 20% of the peak flop rate on the Sunway many-core processor.
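The data-privatization idea can be sketched in portable C, with OpenMP threads standing in for the Sunway compute cores; the structure is illustrative, not the paper's kernel. Each worker accumulates into its own force array, trading write conflicts for an explicit reduction step.

```c
/* Sketch of data-privatized force accumulation: conflict-free writes
   into per-worker arrays, followed by the reduction whose cost the
   paper's dual-slice partitioning attacks. */
#include <omp.h>
#include <stdlib.h>

void accumulate_forces(const int (*pairs)[2], const double *fpair,
                       int npairs, double *force, int natoms, int nthreads)
{
    /* one private force array per worker */
    double *priv = calloc((size_t)nthreads * natoms, sizeof *priv);
    if (!priv) return;

    #pragma omp parallel num_threads(nthreads)
    {
        double *f = priv + (size_t)omp_get_thread_num() * natoms;
        #pragma omp for
        for (int p = 0; p < npairs; p++) {   /* no locks, no conflicts */
            f[pairs[p][0]] += fpair[p];
            f[pairs[p][1]] -= fpair[p];      /* Newton's third law     */
        }
    }
    /* data-reduction step: the overhead the paper minimizes */
    for (int t = 0; t < nthreads; t++)
        for (int i = 0; i < natoms; i++)
            force[i] += priv[(size_t)t * natoms + i];
    free(priv);
}
```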

8.
Constrained by power consumption, general-purpose microprocessors stopped pursuing higher clock frequencies more than a decade ago and instead turned to integrating more processor cores. Meanwhile, as transistor density keeps increasing according to Moore's law, the number of cores that can be integrated on a single chip has doubled repeatedly, and on-chip multi-core and many-core processors have become the mainstream of high-performance microprocessor development. Supporting a shared-memory programming model in future kilocore-scale general-purpose many-core processors is an inevitable trend, but the traditional directory structure for cache coherence faces problems such as high lookup latency, frequent directory-entry replacement, and limited scalability in hardware cost and power. The sparse directory strikes a trade-off between the hardware overhead of the traditional directory structure and the efficiency of coherence maintenance, and is considered an energy-efficient, scalable structure for maintaining cache coherence in many-core processors. This paper surveys recent research and methods for improving sparse directory performance, analyzes them in terms of area, access latency, power, and implementation complexity, and summarizes the strengths and weaknesses of each method, providing a useful reference for the innovative design of shared-memory architectures in future high-performance many-core processors.
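A minimal C model of a sparse directory, under assumed sizes, shows both the storage savings and the entry-replacement problem the surveyed techniques try to mitigate: entries exist only for blocks that are actually cached, so a full set forces an eviction and sharer invalidations.

```c
/* Minimal model of a sparse directory: a small set-associative cache of
   coherence entries instead of one entry per memory block. Sizes are
   illustrative assumptions. */
#include <stdint.h>
#include <stdbool.h>

#define DIR_SETS  1024
#define DIR_WAYS  4

typedef struct {
    uint64_t tag;        /* block address tag             */
    uint64_t sharers;    /* bit-vector: one bit per core  */
    bool     valid;
    bool     dirty;      /* some core holds it modified   */
} dir_entry_t;

static dir_entry_t dir[DIR_SETS][DIR_WAYS];

/* A miss means no core caches the block. When the set is full, a victim
   must be evicted and its sharers invalidated -- the frequent-replacement
   problem discussed above. */
static dir_entry_t *dir_lookup(uint64_t block_addr)
{
    uint64_t set = block_addr % DIR_SETS;
    for (int w = 0; w < DIR_WAYS; w++)
        if (dir[set][w].valid && dir[set][w].tag == block_addr)
            return &dir[set][w];
    return 0;  /* not tracked: uncached, or allocate a new entry */
}
```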

9.
We present specialized implementations of the preconditioned iterative linear system solver in ILUPACK for Non-Uniform Memory Access (NUMA) platforms and for many-core hardware co-processors based on the Intel Xeon Phi and graphics accelerators. For conventional x86 architectures, our approach exploits task parallelism via the OmpSs runtime as well as a message-passing implementation based on MPI, respectively yielding dynamic and static schedules of the work to the cores, with numeric semantics that differ from those of the sequential ILUPACK. For the graphics processor we exploit data parallelism by off-loading the computationally expensive kernels to the accelerator while keeping the numeric semantics of the sequential case.

10.
In the last 15 years we have seen, as a response to the power and thermal limits of current chip technologies, an explosion in the use of multiple and even many computer cores on a single chip. Now, to further improve performance and energy efficiency when there are potentially hundreds of computing cores on a chip, we see a need for specialization of individual cores and the development of heterogeneous manycore computer architectures. However, developing such heterogeneous architectures is a significant challenge. We therefore propose a design method to generate domain-specific manycore architectures based on the RISC-V instruction set architecture, and we automate the main steps of this method with software tools. The design method allows generation of manycore architectures with different configurations, including core augmentation through instruction extensions and custom accelerators. The method starts from developing applications in a high-level dataflow language and ends with generating synthesizable Verilog code and a cycle-accurate emulator for the generated architecture. We evaluate the design method and the software tools by generating several architectures specialized for two different applications and measuring their performance and hardware resource usage. Our results show that the design method can be used to generate specialized manycore architectures targeting applications from different domains. The specialized architectures show at least 3 to 4 times better performance than their general-purpose counterparts. In certain cases, replacing general-purpose components with specialized components saves hardware resources. Automating the method speeds up architecture development and facilitates design-space exploration of manycore architectures.

11.
The growing number of processing cores in a single CPU demands more parallelism from sequential programs. In the past decades, however, little work has succeeded in automatically exploiting enough parallelism, which casts a shadow over the many-core architecture and automatic parallelization research. In fact, little work has tried to understand the nature, or amount, of the parallelism potentially available in programs. In this paper we analyze, at runtime, the dynamic data dependencies among superblocks of sequential programs. We designed a meta re-arrange buffer to measure and exploit the available parallelism, with which superblocks are dynamically analyzed, reordered, and dispatched to run in parallel on an ideal many-core processor while data dependencies and program correctness are maintained. In our experiments, we observed that with superblock reordering the potential speedup ranged from 1.08 to 89.60. The results show that the potential parallelism of ordinary programs is still far from fully exploited by existing technologies. This observation makes automatic parallelization a promising research direction for many-core architectures.

12.
13.
Spatial locality of task execution is becoming important in future hardware platforms as the number of cores steadily increases. The large number of cores requires an intelligent power manager, and the high chip and core density requires increased thermal awareness to avoid thermal hotspots on the chip. This paper presents a lightweight task migration mechanism designed explicitly for distributed operating systems running on many-core platforms. As the distributed OS runs one scheduler on each core, tasks are migrated between OS kernels within the same shared-memory platform. The benefits of task migration, such as performance and energy efficiency, are achieved by relocating running tasks onto the most appropriate cores while keeping the overhead of executing such a migration sufficiently low. We investigate the overhead of migrating tasks on a distributed OS running both on a bus-based platform and on a many-core NoC; with these measurements, we can predict the task migration overhead and pinpoint the emerging bottlenecks. With the presented task migration mechanism, we intend to improve the dynamism of power and performance characteristics in distributed many-core operating systems.
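A minimal C sketch of the mechanism under discussion: on a shared-memory platform with one scheduler per core, migration amounts to moving a task control block between two per-core run queues. The structures and locking below are illustrative assumptions, not the actual OS code.

```c
/* Migrate a task between two per-core run queues in shared memory:
   only queue membership moves; the task's pages stay where they are.
   Cost is two lock acquisitions and a few pointer writes -- the
   "sufficiently low overhead" the paper measures. */
#include <pthread.h>

typedef struct task {
    struct task *next;
    int          id;
} task_t;

typedef struct {
    pthread_mutex_t lock;   /* protects this core's run queue */
    task_t         *head;
} runqueue_t;

void migrate(runqueue_t *src, runqueue_t *dst, task_t *t)
{
    pthread_mutex_lock(&src->lock);
    task_t **pp = &src->head;
    while (*pp && *pp != t) pp = &(*pp)->next;  /* find the task */
    if (*pp) *pp = t->next;                     /* unlink        */
    pthread_mutex_unlock(&src->lock);

    pthread_mutex_lock(&dst->lock);
    t->next = dst->head;                        /* push onto dst */
    dst->head = t;
    pthread_mutex_unlock(&dst->lock);
}
```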

14.
Increasing the number of cores in a multi-core processor can only be achieved by reducing the resources available in each core, and hence sacrificing per-core performance. Furthermore, having a large number of homogeneous cores may not be effective for all applications. For instance, threads with high instruction-level parallelism will under-perform considerably on resource-constrained cores. In this paper, we propose a core architecture that can be adapted to improve a single thread’s performance or to execute multiple threads. In particular, we integrate a Reconfigurable Hardware Unit (RHU) into the resource-constrained cores of a many-core processor. The RHU can be reconfigured to execute the frequently encountered instructions from a thread in order to increase the core’s overall execution bandwidth, thus improving its performance. On the other hand, if the core’s resources are sufficient for a thread, the RHU can be configured to execute instructions from a different thread to increase thread-level parallelism. The RHU has low area overhead, and hence minimal impact on the scalability of the number of cores. To further limit the area overhead of this mechanism, generation of the reconfiguration bits for the RHUs of multiple cores is delegated to a single core. In this paper, we present results for using the RHU to improve a single thread’s performance. Our experiments show that the proposed architecture improves per-core performance by an average of about 23% across a wide range of applications.

15.
Recent technological developments have made various many-core hardware platforms widely accessible. These massively parallel architectures have been used to significantly accelerate many computationally demanding tasks. In this paper, we show how algorithms for LTL model checking can be redesigned to accelerate LTL model checking on many-core GPU platforms. Our detailed experimental evaluation demonstrates that using the NVIDIA CUDA technology yields a significant speedup of the verification process. Together with state-space generation based on a shared hash table and DFS exploration, our CUDA-accelerated model checker is the fastest among state-of-the-art shared-memory model checking tools.

16.
As semiconductor manufacturing technology continues to improve, it is possible to integrate more and more transistors onto a single processor. Many-core processor design has resulted in part from the search to utilize this enormous transistor real estate. The Single-Chip Cloud Computer (SCC) is an experimental many-core processor created by Intel Labs. In this paper we present a study in which we analyze this innovative many-core system by running several workloads with distinctive parallelism characteristics. We investigate the effect on system performance by monitoring specific hardware performance counters. We then experiment with varying hardware configuration parameters such as the number of cores, clock frequency, and voltage levels. We execute the chosen workloads and collect timing, power consumption, and energy consumption information on this many-core research platform, allowing us to comprehensively analyze the behavior and scalability of the Intel SCC system with the introduced workloads in terms of performance and energy consumption. Our results show that the profiled parallel workload execution has a communication bottleneck on the Intel SCC system. Moreover, our results indicate that the number of cores used to execute a given workload should be chosen carefully in order to balance execution performance against energy efficiency for different applications.

17.
Previous work in scalable hardware distributed shared memory (DSM) multiprocessors has established the critical and dominant role that protocol processing bandwidth (or its inverse, occupancy) plays in determining overall performance in architectures with standalone memory/coherence controllers. However, with recent architectural trends toward integrated (on-chip) memory controllers and the well-known fact that processor frequency is increasing more rapidly than memory systems, we must ask whether parallel coherence processing engines (either multiple integrated protocol processors/cores or multiple protocol threads) are needed in DSM machines constructed from modern processor architectures and, if so, when. We construct a useful analytical model to give the designer insight into when parallel coherence streams will improve performance and verify our model via detailed simulation on 64-threaded microbenchmarks and parallel applications and on single-node multiprogrammed workloads. Surprisingly, and contrary to related work, we find that, in these architectures, adding a second coherence engine has almost no impact on performance. Further, for less-tuned applications that suffer from hot spots (contentious requests to the same memory line), additional engines offer no benefit whatsoever. Even with double the memory bandwidth (or channels), an additional coherence processing stream yields only slight performance improvement. Only for a special class of DSM machines employing directoryless broadcast protocols over unordered interconnects does parallel "snoop" processing offer reasonable performance improvement for communication-intensive applications. Overall, given the architectural trends, this is good news for DSM designers who want to minimize the resources necessary (protocol threads or integrated protocol processor cores for maintaining internode coherence, respectively) to create SMTp-based or multi-CMP-based scalable DSM machines using directory protocols.

18.
How to implement cache coherence efficiently in shared-memory systems is a key and difficult problem in architecture design. Existing directory-based protocols suffer from implementation difficulty, complex verification, and large storage overhead. Targeting on-chip many-core processors, this paper proposes a synchronization-based cache-coherence protocol supported by hardware structures. Instead of using a directory, the scheme represents coherence information with a Bloom filter and maintains cache coherence at the synchronization points of parallel programs. Compared with existing directory-based cache-coherence protocols, the scheme reduces the implementation and verification complexity of directory protocols. Evaluation with the SPLASH-2 benchmark suite shows that the synchronization-based protocol achieves performance comparable to directory-based protocols.
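A minimal C sketch of the Bloom-filter idea follows: each core summarizes the line addresses it has cached in a small bit array, and at a synchronization point the filter reports (with no false negatives) which lines may need flushing or invalidation. The hash functions and sizes are illustrative assumptions.

```c
/* Bloom-filter summary of cached lines, checked at sync points.
   A false positive costs an unnecessary flush but never breaks
   coherence; a negative answer is exact. */
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define FILTER_BITS 1024

typedef struct { uint8_t bits[FILTER_BITS / 8]; } bloom_t;

static void bloom_add(bloom_t *b, uint64_t line_addr)
{
    uint32_t h1 = (uint32_t)(line_addr * 0x9E3779B97F4A7C15ull) % FILTER_BITS;
    uint32_t h2 = (uint32_t)(line_addr >> 6)                    % FILTER_BITS;
    b->bits[h1 / 8] |= 1u << (h1 % 8);
    b->bits[h2 / 8] |= 1u << (h2 % 8);
}

static bool bloom_maybe_cached(const bloom_t *b, uint64_t line_addr)
{
    uint32_t h1 = (uint32_t)(line_addr * 0x9E3779B97F4A7C15ull) % FILTER_BITS;
    uint32_t h2 = (uint32_t)(line_addr >> 6)                    % FILTER_BITS;
    return (b->bits[h1 / 8] >> (h1 % 8) & 1) &&
           (b->bits[h2 / 8] >> (h2 % 8) & 1);
}

static void bloom_clear(bloom_t *b) { memset(b, 0, sizeof *b); } /* at sync */
```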

19.
Modern hardware is abundantly parallel and increasingly heterogeneous. The numerous processing cores have non-uniform access latencies to main memory and processor caches, which causes variability in communication costs. Unfortunately, database systems mostly assume that all processing cores are the same and that microarchitectural differences are not significant enough to appear in critical database execution paths. As we demonstrate in this paper, however, non-uniform core topology does appear in the critical path, and conventional database architectures achieve suboptimal and, even worse, unpredictable performance. We perform a detailed performance analysis of OLTP deployments on servers with multiple cores per CPU (multicore) and multiple CPUs per server (multisocket). We compare different database deployment strategies, varying the number and size of independent database instances running on a single server, from a single shared-everything instance to fine-grained shared-nothing configurations. We quantify the impact of non-uniform hardware on various deployments by (a) examining how efficiently each deployment uses the available hardware resources and (b) measuring the impact of distributed transactions and skewed requests on different workloads. We show that no strategy is optimal for all cases and that the best choice depends on the combination of hardware topology and workload characteristics. Finally, we argue that transaction processing systems must be aware of the hardware topology in order to achieve predictably high performance.

20.
Advances at an unprecedented rate in computer hardware and networking technologies have made many-core computing affordable and readily available in a matter of a few years. Nonetheless, it poses challenges for programmers to build scalable parallel software. Optimizing parallel programs for a many-core platform is a multifaceted problem in which system and architectural factors must be taken into account. In this paper, we tackle this problem by implementing parallel programs with the different available programming paradigms and evaluating application behavior on the TILE64 many-core platform. Specifically, we investigate a hybrid producer-write plus consumer-read shared-memory programming paradigm for implementing a master–worker video decoder and encoder on this many-core platform. Experimental results show that the proposed implementation achieves competitive performance speedup, scaling well with the number of available cores and delivering up to four times the performance of other implementations when decoding a sample 1080p video.
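A minimal C11 sketch of the producer-write plus consumer-read paradigm: the master writes a work unit into a shared slot and raises a flag, and the worker spins on the flag, reads, and acknowledges. Standard atomics stand in here for TILE64-specific primitives; the slot layout is an assumption for illustration.

```c
/* One shared slot between a master (producer) and a worker (consumer).
   Release/acquire ordering makes the written data visible before the
   flag flips. */
#include <stdatomic.h>
#include <stdint.h>

#define SLOT_BYTES 4096

typedef struct {
    _Atomic int ready;            /* 0 = empty, 1 = full         */
    uint8_t     data[SLOT_BYTES]; /* one unit of video bitstream */
    int         len;
} slot_t;

/* producer (master): write payload, then publish */
void produce(slot_t *s, const uint8_t *buf, int len)
{
    while (atomic_load_explicit(&s->ready, memory_order_acquire))
        ;                                    /* wait until consumed */
    for (int i = 0; i < len; i++) s->data[i] = buf[i];
    s->len = len;
    atomic_store_explicit(&s->ready, 1, memory_order_release);
}

/* consumer (worker): acquire on the flag guarantees data visibility */
int consume(slot_t *s, uint8_t *out)
{
    while (!atomic_load_explicit(&s->ready, memory_order_acquire))
        ;                                    /* spin until produced */
    int len = s->len;
    for (int i = 0; i < len; i++) out[i] = s->data[i];
    atomic_store_explicit(&s->ready, 0, memory_order_release);
    return len;
}
```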
