Similar literature
Found 20 similar documents; search took 10 ms
1.
In recent years, image processing has been a key application area for mobile and embedded computing platforms. In this context, many-core accelerators are a viable solution to efficiently execute highly parallel kernels. However, architectural constraints impose hard limits on the main memory bandwidth, and push for software techniques which optimize the memory usage of complex multi-kernel applications. In this work, we propose a set of techniques, mainly based on graph analysis and image tiling, targeted to accelerate the execution of image processing applications expressed as standard OpenVX graphs on cluster-based many-core accelerators. We have developed a run-time framework which implements these techniques using a front-end compliant with the OpenVX standard, and based on an OpenCL extension that enables more explicit control and efficient reuse of on-chip memory and greatly reduces the recourse to off-chip memory for storing intermediate results. Experiments performed on the STHORM many-core accelerator demonstrate that our approach leads to a massive reduction of time and bandwidth, even when the main memory bandwidth for the accelerator is severely constrained.

2.
Multiprocessor embedded systems integrate diverse dedicated processing units to handle high-performance applications such as multimedia and network processing. However, lock-based synchronization limits the efficiency of such heterogeneous concurrent systems. Hardware Transactional Memory (HTM) is a promising approach to creating an abstraction layer for multi-threaded programming. However, HTM performance is application-specific and determined by version and conflict management configurations. Most previous HTM implementations for embedded systems in the literature were built on fixed version management, which results in significant performance loss when transaction behaviour changes. In this paper, we propose an HTM targeted at embedded applications which is able to adapt its version management based on application behaviour at runtime. It is prototyped and analysed on an Altera Cyclone IV platform. Random requests at different contention levels and different transaction sizes are used to verify the performance of the proposed HTM. Based on our experiments, lazy version management is able to obtain up to 12.82% speed-up compared to eager version management at high contention levels. Meanwhile, eager version management obtains up to 37.84% speed-up compared to lazy version management at low contention. The adaptive mechanism is able to switch configuration at runtime based on application behaviour for maximum performance.

3.
H.264/AVC video encoders have been widely used for their high coding efficiency. Since the computational demand, which is proportional to the frame resolution, is constantly increasing, it has been of great interest to accelerate H.264/AVC by parallel processing. Recently, graphics processing units (GPUs) have emerged as a viable target for accelerating general purpose applications by exploiting fine-grain data parallelism. Despite extensive research efforts to use GPUs to accelerate the H.264/AVC algorithm, these efforts have not achieved any speed-up over the x264 algorithm, which is known as the fastest CPU implementation, mainly due to significant communication overhead between the host CPU and the GPU and intra-frame dependency in the algorithm. In this paper, we propose a novel motion-estimation (ME) algorithm tailored for NVIDIA GPU implementation. It is accompanied by a novel pipelining technique, called sub-frame ME processing, to effectively hide the communication overhead between the host CPU and the GPU. Further, we incorporate a frame-level parallelization technique to improve the overall throughput. Experimental results show that our proposed H.264 encoder has higher performance than the x264 encoder.

4.
5.
Processor speeds are increasing so much faster than memory speeds that within a decade processors may spend most of their time waiting for data. Most modern DRAM components support modes that make it possible to perform some access sequences more quickly than others. The authors describe how reordering streams can result in better memory performance.

6.
In this paper, a finite memory filter is proposed to estimate the available bandwidth by tracking, in real time, the unknown parameters of a sloping straight line while removing undesired system and measurement noises. The finite memory filter is developed under a weighted least-squares criterion using only the most recent finite probe-packet measurements on the window. The proposed finite memory filtering based available bandwidth estimate is shown to have several inherent properties such as unbiasedness, deadbeat, and robustness. A guideline for choosing an appropriate window length is described, as it can significantly affect the estimation performance. Finally, computer simulations show that the proposed finite memory filtering based approach is comparable to the Kalman filtering based approach with infinite memory structure for constantly or slowly changing available bandwidth and outperforms it for dynamically changing available bandwidth.
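The windowed least-squares idea in this abstract can be illustrated with a small sketch. It is not the authors' filter: the class name `FiniteMemoryLineFit`, the default uniform weights, and the plain line model y = slope * x + intercept are assumptions made here for illustration. It only shows how an estimate built from the most recent N samples forgets older data completely.

```python
from collections import deque


class FiniteMemoryLineFit:
    """Weighted least-squares fit of y = slope * x + intercept over the
    most recent `window` samples (hypothetical helper, not from the paper)."""

    def __init__(self, window, weights=None):
        if window < 2:
            raise ValueError("window must hold at least two samples")
        if weights is not None and len(weights) != window:
            raise ValueError("need exactly one weight per window slot")
        self.samples = deque(maxlen=window)  # old samples drop out automatically
        self.weights = list(weights) if weights is not None else [1.0] * window

    def update(self, x, y):
        """Add one measurement and return the current (slope, intercept)."""
        self.samples.append((x, y))
        return self.estimate()

    def estimate(self):
        n = len(self.samples)
        if n < 2:
            return None  # a line needs at least two points
        w = self.weights[-n:]  # align weights with the newest samples
        sw = sum(w)
        sx = sum(wi * x for wi, (x, _) in zip(w, self.samples))
        sy = sum(wi * y for wi, (_, y) in zip(w, self.samples))
        sxx = sum(wi * x * x for wi, (x, _) in zip(w, self.samples))
        sxy = sum(wi * x * y for wi, (x, y) in zip(w, self.samples))
        det = sw * sxx - sx * sx
        if det == 0:
            return None  # all x values identical: slope is undefined
        slope = (sw * sxy - sx * sy) / det
        intercept = (sy - slope * sx) / sw
        return slope, intercept
```

With noise-free input, the estimate becomes exact as soon as the window contains only samples from the current line, which is the intuition behind the finite-memory behaviour described above; the window length trades noise suppression against tracking speed.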

7.
8.
In deeply embedded heterogeneous multicores the allocation of data to memories is crucial for application performance. For applications with stringent throughput constraints, the allocation is often done manually by carefully assigning static memory locations to the logical buffers of the application. Today, designers are confronted with applications with thousands of buffers and architectures with hundreds of memories, rendering manual approaches impractical. In this paper we present an automatic approach for statically allocating logical buffers to physical memories, assuming a fixed task-to-processor mapping and respecting multiple throughput constraints. In our approach, we model the application in a data-centric way, by explicitly defining buffers and associating computational tasks that access the buffers within well-specified time intervals. In addition, we use an architecture model that makes the allocation aware of the topology of the multicore and the physical bandwidth constraints of the interconnect. We present a layered approach to describe and solve the buffer-allocation problem as well as related subproblems, using mixed-integer linear programming. We show that the buffer-allocation problem is NP-complete, and present a more scalable formulation as a semi-definite programming problem. We evaluate the proposed LP methods by allocating around 1000 buffers corresponding to processing one frame in the Long-Term Evolution (LTE) standard, onto a multicore with 80 processing elements. We introduce a solution approach that found an optimal allocation in around 2 hours, which is at least two orders of magnitude faster than a straightforward formulation.

9.
Digital Cinema (DC) consists of the integration of new advanced digital technologies in the context of the cinema system. As regards the transport of DC content towards theatres, Distributors may select the method that is both economically and technically sound. In this work, which is carried out within the framework of the IST Integrated Project Enhanced Digital CINEma (EDCINE), we deal with the network distribution service provided by a Network Service Provider, which becomes a new actor in the DC business. One of the main criticalities of the system is the very large size of the contents to be transferred towards theatres. From the operator's perspective, this criticality translates into the objective of optimising the usage of network resources while complying with quality of service (QoS) constraints. The goal of this paper is to present the system which is able to support the network delivery of DC contents, with a special focus on live event delivery. This service can consume a large amount of network bandwidth, not only because of the volume of transmitted data, but also due to the number of receivers, and thus multicast transmission proves to be very useful. Consequently, a key issue of the overall distribution system is the request-routing algorithm, the goal of which is to optimise the QoS-guaranteed delivery of a number of live streams in the backbone, each one of which is sent towards a set of theatres (QoS multicast routing). We consider the MultiProtocol Label Switching mechanism, which has emerged as an elegant solution to meet traffic engineering and resource reservation requirements in backbone networks, and focus especially on the overall request-routing procedure, the mathematical modelling of the problem, and relevant solving algorithms. Finally, we present the comparative performance evaluation of these algorithms by means of an extensive simulation campaign performed with the OMNeT++ simulation platform.

10.
Advanced e-services require efficient, flexible, and easy-to-use workflow technology that integrates well with mainstream Internet technologies such as XML and Web servers. This paper discusses an XML-enabled architecture for distributed workflow management that is implemented in the latest version of our Mentor-lite prototype system. The key asset of this architecture is an XML mediator that handles the exchange of business and flow control data between workflow and business-object servers on the one hand and client activities on the other via XML messages over HTTP. Our implementation of the mediator has made use of Oracle's XSQL servlet. The major benefit of the advocated architecture is that it provides seamless integration of client applications into e-service workflows with scalable efficiency and very little explicit coding, in contrast to an earlier, Java-based, version of our Mentor-lite prototype that required much more code and exhibited potential performance problems. Received: 30 October 2000 / Accepted: 19 December 2000 / Published online: 27 April 2001

11.
Open Computing Language (OpenCL) is a new industry standard for task-parallel and data-parallel heterogeneous computing on a variety of modern CPUs, GPUs, DSPs, and other microprocessor designs. OpenCL is vendor independent and hence not specialized for any particular compute device. To develop efficient OpenCL applications for a particular platform, we still need a more profound understanding of architecture features on the OpenCL model and computing devices. For this purpose, we design and implement an OpenCL micro-benchmark suite for GPUs and CPUs. In this paper, we introduce the implementations of our OpenCL micro-benchmarks, and present the measurement results for hardware and software features such as performance of mathematical operations, bus bandwidths, memory architectures, branch synchronizations and scalability, on two multi-core CPUs, i.e. AMD Athlon II X2 250 and Intel Pentium Dual-Core E5400, and two different GPUs, i.e. NVIDIA GeForce GTX 460se and AMD Radeon HD 6850. We also compared the measurement results with existing benchmarks to demonstrate the reasonableness and correctness of our benchmark suite.

12.
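In DVB-RCS satellite communication systems, existing bandwidth request algorithms mostly focus on improving metrics such as queueing delay and bandwidth utilization, and do not consider storage optimization at the satellite terminal. To address this problem, a storage-optimized bandwidth request algorithm for DVB-RCS satellite communication systems is proposed. The algorithm controls the total amount of data in the transmit queue by predicting the volume of the arriving data flow, while balancing transmission delay against transmission efficiency. Simulations show that the algorithm keeps the system's storage requirement under reasonable control, and that it matches the high performance of existing algorithms in optimizing bandwidth utilization and controlling delay jitter.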

13.
14.
Nowadays everyone can review everything. Online customer-opinion platforms often help potential buyers make a decision. Sometimes, however, the multitude of contradictory opinions may confuse customers. Submitting a review requires time and effort, yet it only benefits others. Because an average reviewer has no motivation to submit reviews, jokes and shill reviews account for a considerable share of all reviews. The Reviews with Revenue in Reputation (RRR) method is designed to encourage reviewers through gamification. RRR allows customers to easily spot credible reviews and restricts the number of reviews an unreliable reviewer can submit.

15.
16.
Web services technology is being adopted as a viable deployment approach for future distributed software systems that enable business-to-business and business-to-consumer interactions across the open and dynamic internet environment. Recent research is focused on developing support technologies for web service discovery, on-demand service composition, and robust execution to facilitate web services based deployment of business processes. Developing techniques to cope with the volatile and open nature of the web during execution of composite services at the service platform is essential for delivering reliable and acceptable performance in this new process delivery framework. In this paper, we propose a simulation-based framework to guide scheduling of composite service execution. Online simulation of the dynamics of the open environment is used for scheduling service requests at the service platform. Comparison of the look-ahead simulation for different scheduling policies with the current execution state provides guidelines for service execution in order to cope with system volatility. We have implemented a prototype of the proposed framework and illustrate the feasibility of our approach with experimental studies.

17.
Memory management is a critical issue in stream processing involving stateful operators such as join. Traditionally, the memory requirement for a stream join is query-driven: a query has to explicitly define a window for each (potentially unbounded) input. The window essentially bounds the size of the buffer allocated for that stream. However, output produced this way may not be desirable (if the window size is not part of the intended query semantic) due to the volatile input characteristics. We discover that when streams are ordered or partially ordered, it is possible to use a data-driven memory management scheme to improve the performance. In this work, we present a novel data-driven memory management scheme, called Window-Oblivious Join (WO-Join), which adaptively adjusts the state buffer size according to the input characteristics. Our performance study shows that, compared to traditional Window-Join (W-Join), WO-Join is more robust with respect to the dynamic input and therefore produces higher quality results with lower memory costs.
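The observation that ordered inputs allow data-driven state management can be sketched as follows. This is not the WO-Join algorithm itself, whose details the abstract does not give. It is a minimal equi-join that assumes both streams arrive in non-decreasing key order; the class name `OrderedStreamJoin` and its methods are hypothetical. Because of the ordering, a buffered tuple can be dropped as soon as the opposite stream has moved past its key, so no window has to be declared.

```python
from collections import deque


class OrderedStreamJoin:
    """Equi-join of two streams sorted by key; state is purged from the
    data itself rather than from a declared window (illustrative only)."""

    def __init__(self):
        self.state = {"L": deque(), "R": deque()}   # buffered (key, value)
        self.last_key = {"L": None, "R": None}      # newest key seen per side

    def push(self, side, key, value):
        """Feed one tuple; return the join results it produces as
        (key, left_value, right_value) tuples."""
        if self.last_key[side] is not None and key < self.last_key[side]:
            raise ValueError("stream %s is not ordered by key" % side)
        self.last_key[side] = key
        other = "R" if side == "L" else "L"
        other_buf = self.state[other]

        # Future tuples on this side have keys >= key, so opposite-side
        # tuples with smaller keys can never match again: drop them.
        while other_buf and other_buf[0][0] < key:
            other_buf.popleft()

        # The opposite buffer is sorted, so matches sit at its front.
        results = []
        for k, v in other_buf:
            if k != key:
                break
            results.append((key, value, v) if side == "L" else (key, v, value))

        # Keep the new tuple only if the opposite stream can still reach it.
        if self.last_key[other] is None or self.last_key[other] <= key:
            self.state[side].append((key, value))
        return results

    def buffered(self):
        """Number of tuples currently held in join state."""
        return len(self.state["L"]) + len(self.state["R"])
```

In this sketch the buffer size follows the skew between the two inputs: state grows when one stream runs ahead of the other and shrinks again when the slower one catches up. Partially ordered inputs, which the abstract also mentions, would need a bound on the disorder and are not handled here.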

18.
The Real-time Specification for Java (RTSJ) introduced a range of language features for explicit memory management. While the RTSJ gives programmers fine control over memory use and allows linear allocation and constant-time deallocation, the RTSJ relies upon dynamic runtime checks for safety, making it unsuitable for safety critical applications. We introduce ScopeJ, a statically-typed, multi-threaded, object calculus in which scopes are first class constructs. Scopes reify allocation contexts and provide a safe alternative to automatic memory management. Safety follows from the use of an ownership type system that enforces a topology on run-time patterns of references. ScopeJ’s type system is novel in that ownership annotations are implicit. This substantially reduces the burden for developers and increases the likelihood of adoption. The notion of implicit ownership is particularly appealing when combined with pluggable type systems, as one can apply different type constraints to different components of an application depending on the requirements without changing the source language. In related work we have demonstrated the usefulness of our approach in the context of highly-responsive systems and stream processing.

19.
Ford, R. IEEE Software, 1988, 5(5): 10-23
The conflict between the performance demands of real-time systems and the shared-resource needs of high-level languages (Ada in particular) is examined. Shared memory requires carefully designed concurrency control, but the traditional approach, which is to embed the entire allocate-release implementation code in critical sections, is unsuitable for real-time applications because it results in excessively high response time. The design and performance of three memory-management systems for real-time applications are evaluated, and it is shown that one system, an optimized optimistic version, does deliver performance that is acceptable for real-time applications.

20.
As demand for higher computing power is steadily increasing, it has become popular to equip a computer system with a many-core accelerator to run concurrent applications. Efficient management of compute resources in such a system is challenging because various factors such as workload variation, QoS requirement change, and hardware failure may cause dynamic change in system status. Recently, a variety of resource management techniques for many-core accelerators have been proposed. They are usually tailored for a specific target architecture. In this paper, we present SoPHy+, which supports various types of many-core accelerators, based on a hybrid resource management technique. SoPHy+ provides a seamless design flow from the programming front-end, which generates dataflow-style function codes automatically from the task specification, to the run-time environment, which adaptively manages compute resources for concurrent applications in response to system status change. SoPHy+ has been implemented on two different many-core architectures: the Intel Xeon Phi coprocessor and an Epiphany-like NoC virtual prototype. Experimental results show that SoPHy+ is capable of adapting to the run-time workload variation effectively with affordable overhead of run-time resource management.
