Similar Documents
1.
We describe an integrated compile-time and run-time system for efficient shared-memory parallel computing on distributed-memory machines. The combined system presents the user with a shared-memory programming model. The run-time system implements a consistent shared-memory abstraction using memory access detection and automatic data caching. The compiler improves the efficiency of the shared-memory implementation by directing the run-time system to exploit the message-passing capabilities of the underlying hardware. To do so, the compiler analyzes shared-memory accesses and transforms the code to insert calls to the run-time system that provide it with the access information computed by the compiler. The run-time system is augmented with the appropriate entry points to use this information to implement bulk data transfer and to reduce the overhead of run-time consistency maintenance. In those cases where the compiler analysis succeeds for the entire program, we demonstrate that the combined system achieves performance comparable to that produced by compilers that directly target message passing. If the compiler analysis is successful only for parts of the program, for instance because of irregular accesses to some of the arrays, the resulting optimizations can be applied to those parts for which the analysis succeeds. If the compiler analysis fails entirely, we rely on the run-time maintenance of shared memory and thereby avoid the complexity and the limitations of compilers that directly target message passing. The result is a single system that combines efficient support for both regular and irregular memory access patterns.
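A minimal C sketch of the kind of compiler-inserted run-time calls described above follows; the entry points dsm_validate and dsm_push are hypothetical names invented for this illustration, not the system's actual interface.

    /* Minimal sketch, assuming a hypothetical DSM run-time interface; the names
     * dsm_validate and dsm_push are invented here and are not the paper's API. */
    #include <stddef.h>

    static void dsm_validate(const void *start, size_t bytes)
    { (void)start; (void)bytes; /* would fetch the remote data in one bulk transfer */ }

    static void dsm_push(void *start, size_t bytes)
    { (void)start; (void)bytes; /* would flush local writes and update consistency state once */ }

    /* Compiler-transformed loop: instead of detecting each shared access at run time
     * (e.g., via page faults), the compiler summarizes the accessed regions and
     * inserts two bulk calls around the loop. */
    void scale(double *a, const double *b, double k, size_t n)
    {
        dsm_validate(b, n * sizeof(double));
        for (size_t i = 0; i < n; i++)
            a[i] = k * b[i];
        dsm_push(a, n * sizeof(double));
    }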

2.
Simplifying programming models is paramount to the success of reconfigurable computing with field-programmable gate arrays (FPGAs). This paper presents a methodology for combining true object-oriented design of the compiler/CAD tool with an object-oriented hardware design methodology in C++. The resulting system provides all the benefits of object-oriented design to the compiler/CAD-tool designer and to the hardware designer/programmer. The two domain-specific compilers presented are BSAT and StReAm. Each is targeted at a very specific application domain: BSAT at applications that accelerate Boolean satisfiability problems, and StReAm at applications that lend themselves to implementation as a stream architecture. The key benefit of the presented domain-specific compilers is a reduction of design time by orders of magnitude while retaining the performance of hand-designed circuits.

3.
This paper presents a domain-specific automated image analysis framework for the detection of pre-cancerous and cancerous lesions of the uterine cervix. Our proposed framework departs from previous methods in that we include domain-specific diagnostic features in a probabilistic manner using conditional random fields. We also provide a novel window-based performance assessment scheme for 2D image analysis which addresses the intrinsic problem of image misalignment. Image regions corresponding to different tissue types are identified for the extraction of domain-specific anatomical features. The unique optical properties of each tissue type and the diagnostic relationships between neighboring regions are incorporated in the proposed conditional random field model. The validity of our method is examined using clinical data from 48 patients, and its diagnostic potential is demonstrated by a performance comparison with expert colposcopy annotations, using histopathology as the ground truth. The proposed automated diagnostic approach can support or potentially replace conventional colposcopy, allow tissue specimen sampling to be performed in a more objective manner, and lower the number of cervical cancer cases in developing countries by providing a cost-effective screening solution in low-resource settings.
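In generic terms, a pairwise conditional random field of the kind described above factorizes the label distribution over tissue regions as below; this is the standard formulation, not necessarily the exact potentials used in the paper:

    P(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}\,
        \exp\!\Big( \sum_{i} \psi_i(y_i, \mathbf{x})
        \;+\; \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(y_i, y_j, \mathbf{x}) \Big),

where y_i is the diagnostic label of region i, the unary terms \psi_i capture the optical and anatomical features of each region, the pairwise terms \psi_{ij} encode the diagnostic relationships between neighboring regions, and Z(\mathbf{x}) normalizes over all labelings.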

4.
Chinese Journal of Electronics, 2021, 30(2): 258-267
Frequent subgraph mining (FSM) is a subset of the graph mining domain that is extensively used for graph classification and clustering. Over the past decade, many efficient FSM algorithms have been developed, with improvements generally focused on reducing time complexity by changing the algorithm structure or using parallel programming techniques. FSM algorithms also have high memory consumption, which is another problem that needs to be addressed. In this paper, we propose a new approach called Predictive Dynamic Sized Structure Packing (PDSSP) to minimize the memory needs of FSM algorithms. Our approach redesigns the internal data structures of FSM algorithms without making algorithmic modifications. PDSSP offers two contributions. The first is the Dynamic Sized Integer Type, a newly designed unsigned integer data type; the second is a data-structure packing technique that changes the behavior of the compiler. We examined the effectiveness and efficiency of the PDSSP approach by embedding it into two state-of-the-art algorithms, gSpan and Gaston, and comparing our implementations to the performance of the originals. Nearly all results show that our implementation consumes less memory at each support level, suggesting that PDSSP extensions can save memory, with peak memory usage decreasing by up to 38% depending on the dataset.
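As a rough illustration of what data-structure packing can buy, the C sketch below shrinks an edge record by reordering fields and packing the struct; the layouts, field names, and sizes are invented for this sketch (not taken from gSpan or Gaston), and __attribute__((packed)) is a GCC/Clang extension.

    #include <stdio.h>
    #include <stdint.h>

    struct edge_loose {            /* naive layout: padding inflates every edge record */
        uint8_t  label;
        uint32_t from;
        uint8_t  direction;
        uint32_t to;
    };                             /* commonly 16 bytes with default alignment */

    struct __attribute__((packed)) edge_tight {
        uint32_t from;             /* widest fields first, then the narrow ones,        */
        uint32_t to;               /* and the struct packed to drop padding entirely    */
        uint8_t  label;
        uint8_t  direction;
    };                             /* 10 bytes */

    int main(void)
    {
        printf("loose: %zu bytes, tight: %zu bytes\n",
               sizeof(struct edge_loose), sizeof(struct edge_tight));
        return 0;
    }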

5.
To support the quality-of-service (QoS) requirements of embedded multimedia applications, off-the-shelf middleware like the Common Object Request Broker Architecture (CORBA) must be flexible, efficient, and predictable. Moreover, the stringent memory constraints imposed by embedded system hardware necessitate a minimal footprint for middleware that supports multimedia applications. This paper provides three contributions toward developing efficient object request broker (ORB) middleware to support embedded multimedia applications. First, we describe the optimization principle patterns used to develop a time- and space-efficient CORBA Inter-ORB Protocol (IIOP) interpreter for the ACE ORB (TAO), our high-performance, real-time ORB built on the ADAPTIVE Communication Environment (ACE). Second, we describe the optimizations applied to TAO's interface definition language (IDL) compiler to generate the efficient and small stubs/skeletons used in TAO's IIOP protocol engine. Third, we empirically compare the performance and memory footprint of interpretive (de)marshaling versus compiled (de)marshaling for a wide range of IDL data types. Applying our optimization principle patterns to TAO's IIOP protocol engine improved its interpretive (de)marshaling performance to the point where it is now comparable to the performance of compiled (de)marshaling. Moreover, our IDL compiler optimizations generate interpreted stubs/skeletons whose footprint is substantially smaller than that of compiled stubs/skeletons. Our results illustrate that careful application of optimization principle patterns can yield standards-based middleware that is both time- and space-efficient.
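The contrast between the two (de)marshaling styles can be sketched as follows; this C fragment is illustrative only (it ignores CDR alignment and byte order) and bears no relation to TAO's actual code.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct sample { int32_t id; double value; };

    /* Interpretive style: one generic routine shared by all types walks a small
     * per-type descriptor table, so the code footprint stays small. */
    enum field_kind { F_I32, F_F64 };
    struct field_desc { enum field_kind kind; size_t offset; };

    static const struct field_desc sample_desc[] = {
        { F_I32, offsetof(struct sample, id)    },
        { F_F64, offsetof(struct sample, value) },
    };

    static size_t marshal_interpretive(const void *obj, const struct field_desc *desc,
                                       size_t nfields, uint8_t *buf)
    {
        size_t pos = 0;
        for (size_t i = 0; i < nfields; i++) {          /* per-field dispatch at run time */
            size_t sz = (desc[i].kind == F_I32) ? 4 : 8;
            memcpy(buf + pos, (const uint8_t *)obj + desc[i].offset, sz);
            pos += sz;
        }
        return pos;
    }

    /* Compiled style: a dedicated stub is generated for each type, avoiding the
     * run-time dispatch but multiplying the amount of generated code. */
    static size_t marshal_compiled_sample(const struct sample *s, uint8_t *buf)
    {
        memcpy(buf,     &s->id,    sizeof s->id);
        memcpy(buf + 4, &s->value, sizeof s->value);
        return 12;
    }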

6.
The ParaScope parallel programming environment
The ParaScope parallel programming environment, developed to support scientific programming of shared-memory multiprocessors, is described. It includes a collection of tools that use global program analysis to help users develop and debug parallel programs. The focus is on ParaScope's compilation system. The compilation system extends the traditional single-procedure compiler by providing a mechanism for managing the compilation of complete programs. The ParaScope editor brings both compiler analysis and user expertise to bear on program parallelization. The debugging system detects and reports timing-dependent errors, called data races, in the execution of parallel programs. A project aimed at extending ParaScope to support programming in FORTRAN D, a machine-independent parallel programming language for use with both distributed-memory and shared-memory parallel computers, is also described.

7.
We address the problem of code generation for embedded DSP systems. In such systems, it is typical for one or more digital signal processors (DSPs), program memory, and custom circuitry to be integrated onto a single IC. Consequently, the amount of silicon area dedicated to program memory is limited, so the embedded software must be sufficiently dense. Additionally, this software must be written so as to meet various high-performance constraints, which may include hard real-time constraints. Unfortunately, existing compiler technology is unable to generate dense, high-performance code for DSPs since it does not provide adequate support for their specialized architectural features. These features not only allow for the fast execution of common DSP operations, but also allow for the generation of dense assembly code that specifies these operations. Thus, system designers often hand-program the embedded software in assembly, which is a very time-consuming task. In this paper, we focus on providing compiler support for one particular specialized architectural feature, namely the paged absolute addressing mode. This feature is found in two commercial fixed-point DSPs, Texas Instruments' TMS320C25 and TMS320C50, but it may also be featured in application-specific processors (ASIPs). We present machine-dependent code optimizations that improve code density by exploiting this architectural feature. Experimental results demonstrate that, for a set of typical DSP benchmarks, some of our optimizations reduce overall code size and data memory consumption by an average of 5.0% and 16.0%, respectively. Our experimental vehicle throughout this research is the TMS320C25.
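A conceptual C sketch of why data placement matters for this addressing mode follows; the variable names and grouping are invented, and this is not the paper's optimization algorithm. On the TMS320C25, a direct-mode operand carries only a 7-bit offset, so all variables reachable without reloading the data-page pointer must lie within one 128-word page.

    /* Conceptual sketch: co-locating hot variables so they share one data page. */
    struct control_state {
        int coeff;      /* placing these three in a single structure keeps them   */
        int gain;       /* within one 128-word page, so a run of accesses needs   */
        int bias;       /* the page pointer to be loaded only once                */
    };

    struct control_state cs;

    int filter_step(int x)
    {
        /* if coeff, gain and bias were scattered across pages, the compiler would
         * have to reload the page pointer between the accesses below */
        return cs.coeff * x + cs.gain * x + cs.bias;
    }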

8.
We address the problem of code generation for DSP systems on a chip. In such systems, the amount of silicon devoted to program ROM is limited, so in addition to meeting various high-performance constraints, the application software must be sufficiently dense. Unfortunately, existing compiler technology is unable to generate high-quality code for DSPs since it does not provide adequate support for their specialized architectural features. Thus, designers often resort to programming the application software in assembly, which is a very tedious and time-consuming task. In this paper, we focus on providing compiler support for a group of specialized architectural features that exist in many DSPs, namely indirect addressing modes with auto-increment/decrement arithmetic. In these DSPs, an indexed addressing mode is generally not available, so automatic variables must be accessed by allocating address registers and performing address arithmetic. Subsuming address arithmetic into auto-increment/decrement arithmetic improves both the performance and the size of the generated code. Our objective is to provide a method for comprehensively analyzing the performance benefits and hardware cost of an auto-increment/decrement feature that varies from -l to +l, while allowing access to k address registers in an address generator. We provide this method via a parameterizable optimization algorithm that operates on a procedure-wise basis. Thus, the optimization techniques in a compiler can be used not only to generate efficient or compact code, but also to help the designer of a custom DSP architecture make decisions about address-arithmetic features.
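A before/after view of the transformation class described above, written as plain C; this is a generic illustration rather than output of the paper's algorithm, with the comments relating it to the paper's parameters l and k.

    /* Before: without an indexed addressing mode, each a[i] and b[i] access
     * requires explicit address arithmetic (base register + i) in the loop. */
    int dot_indexed(const int *a, const int *b, int n)
    {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }

    /* After: the address arithmetic is subsumed into post-increment accesses,
     * which an address generator with auto-increment of step +1 (l = 1) performs
     * for free, using two address registers (k = 2) to hold a and b. */
    int dot_autoinc(const int *a, const int *b, int n)
    {
        int s = 0;
        while (n-- > 0)
            s += *a++ * *b++;
        return s;
    }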

9.
Recently, ISO/IEC standardized a dataflow programming framework called Reconfigurable Video Coding (RVC) for the specification of video codecs. The RVC framework aims at providing the specification of a system at a high abstraction level so that the functionality (or behavior) of the system becomes independent of implementation details. The idea is to specify a system so that only intrinsic features of the algorithms are explicitly expressed, whereas implementation choices can be made only once specific target platforms have been chosen. With this system design approach, one abstract design can be used to automatically create implementations for multiple target platforms. In this paper, we report our investigations on applying the methodology standardized by the MPEG RVC framework to develop secure computing in the domains of cryptography and multimedia security, leading to the conclusion that the RVC framework can successfully be applied as a general-purpose framework to fields beyond multimedia coding. This paper also highlights the challenges we faced in conducting our study, and how our study helped the RVC and secure computing communities benefit from each other. Our investigations started with the development of a Crypto Tools Library (CTL) based on RVC, which covers a number of widely used ciphers and cryptographic hash functions such as AES, Triple DES, ARC4, and SHA-2. Performance benchmarking of the RVC-based AES and SHA-2 implementations in both C and Java revealed that the automatically generated implementations can achieve performance comparable to some manually written reference implementations. We also demonstrated that the RVC framework can easily produce implementations with multi-core support without any change to the RVC code. A security protocol for mutual authentication was also implemented to demonstrate how easily heterogeneous systems can be built with RVC. By combining CTL with the Video Tool Library (a standard library defined by the RVC standard), a non-standard RVC-based H.264/AVC encoder, and a non-standard RVC-based JPEG codec, we further demonstrated the benefits of using RVC to develop different kinds of multimedia security applications, including joint multimedia encryption-compression schemes, digital watermarking, and image steganography in the JPEG compressed domain. Our study shows that RVC can be used as a general-purpose, implementation-independent development framework for diverse data-driven applications of differing complexity.

10.
Reconfigurable hardware is ideal for use in systems-on-a-chip (SoCs), as it provides both hardware-level performance and post-fabrication flexibility. However, any one architecture is rarely equally optimized for all applications. SoCs targeting a specific set of applications can greatly benefit from incorporating customized reconfigurable logic instead of generic field-programmable gate-array (FPGA) logic. Unfortunately, manually designing a domain-specific architecture for every SoC would require significant design time. Instead, this paper discusses our initial efforts towards creating a reconfigurable hardware generator capable of automatically creating flexible, yet domain-specific, designs. Our tests indicate that our generated architectures are more than five times smaller than equivalent FPGA implementations and nearly as area-efficient as standard-cell designs. We also use a novel technique employing synthetic circuit generation to demonstrate the flexibility of our architecture generation techniques.

11.
In this paper, we present Chameleon, an application-level power management approach for reducing energy consumption in mobile processors. By using application domain knowledge, as opposed to knowledge inferred at the OS or hardware level, Chameleon can substantially reduce CPU energy consumption. By exporting energy management to user space, designers can build more flexible and easily portable algorithms and systems, and use multiple energy management policies simultaneously. Specifically, we propose a minimal operating system interface that applications use to obtain global knowledge from the kernel in order to make local decisions. We consider three classes of applications (soft real-time, interactive, and batch) and design user-level power management strategies for representative applications such as a movie player, a word processor, a web browser, and a batch compiler. Our experiments show that, compared to traditional system-wide CPU voltage scaling approaches, Chameleon can achieve up to 32-50% energy savings while delivering comparable or better performance to applications. Similarly, Chameleon saves 9-41% more energy than Grace OS, which uses some application knowledge but operates within the kernel. Further, Chameleon imposes minimal overhead and is effective at scheduling concurrent applications with diverse energy needs.
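As a loose illustration of the idea, not of Chameleon's actual interface, the hypothetical C sketch below shows a soft real-time application choosing its own CPU speed from domain knowledge; every name in it (pm_request_speed, the decoder routine, the numbers) is invented.

    #include <stdio.h>

    /* Stub standing in for a kernel-exported call; in a real system this would
     * request a CPU frequency/voltage setting on the application's behalf. */
    static int pm_request_speed(double fraction)
    {
        printf("requesting %.0f%% of full CPU speed\n", fraction * 100.0);
        return 0;
    }

    /* A soft real-time decoder uses its own knowledge (predicted decode cycles and
     * the frame deadline) to pick the lowest speed that still meets the deadline,
     * instead of relying on OS- or hardware-level inference. */
    static void decode_frame_with_dvs(double predicted_cycles, double deadline_s, double cpu_hz)
    {
        double needed = predicted_cycles / (deadline_s * cpu_hz);  /* fraction of full speed */
        pm_request_speed(needed > 1.0 ? 1.0 : needed);
        /* ... decode the frame at the granted speed ... */
    }

    int main(void)
    {
        decode_frame_with_dvs(8.0e6, 1.0 / 30.0, 600.0e6);  /* 8 Mcycles, 30 fps, 600 MHz */
        return 0;
    }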

12.
In most parallel loops of embedded applications, every iteration executes exactly the same sequence of instructions while manipulating different data. This fact motivates a new compiler-hardware orchestrated execution framework in which all parallel threads share one fetch unit and one decode unit but have their own execution, memory, and write-back units. This resource sharing enables parallel threads to execute in lockstep with minimal hardware extension and compiler support. Our proposed architecture, called the multithreaded lockstep execution processor (MLEP), is a compromise between single-instruction multiple-data (SIMD) and symmetric multithreading/chip multiprocessor (SMT/CMP) solutions. The proposed approach is more favorable than typical SIMD execution in terms of degree of parallelism, range of applicability, and code generation, and it can save more power and chip area than the SMT/CMP approach without significant performance degradation. For architecture verification, we extend a commercial 32-bit embedded core, the AE32000C, and synthesize it on a Xilinx FPGA. Compared to the original architecture, our approach is 13.5% faster with a 2-way MLEP and 33.7% faster with a 4-way MLEP on EEMBC benchmarks that are automatically parallelized by the Intel compiler.

13.
14.
The compiler is generally regarded as the most important software component for making a processor design successful. This paper describes our application of the Open Research Compiler infrastructure to a novel VLIW DSP (the PAC DSP core) and the specific design of code generation for its register file architecture. The PAC DSP uses port-restricted, distributed, and partitioned register file structures in addition to a heterogeneous clustered data-path architecture to attain low power consumption and a small die. As part of an effort to overcome the new code generation challenges posed by the PAC DSP, we have developed a new register allocation scheme and other retargeting optimization phases that allow the effective generation of high-quality code. Our preliminary experimental results indicate that our compiler can efficiently exploit the features of the specific register file architectures in the PAC DSP. Our experiences in designing compiler support for the PAC VLIW DSP with irregular resource constraints may also be of interest to those developing compilers for similar architectures.

15.
SoC system designers commonly employ SystemC-based transaction-level modeling (TLM) for early software development and its analysis capabilities. TLM helps in realizing an SoC through virtual prototyping by integrating SoC components at different abstraction levels. The TLM 2 standard introduces interoperability rules for models that may have been developed independently. However, neither the SystemC compiler nor the TLM library supports checking of such rules, and manually debugging interoperability errors in such models can be a major problem. This motivates the development of automatic compliance checking techniques that can detect and report such errors. As the models are refined to incorporate detailed intercommunication protocols among the system components, the need for compliance checking extends to these protocols as well. In this paper, we present an efficient UML-based compliance checking technique for TLM 2 models which supports static, dynamic, and protocol-specific rule checking.

16.
17.
We address the problem of code generation for embedded DSP systems. Such systems devote a limited amount of silicon to program memory, so the embedded software must be sufficiently dense. Additionally, this software must be written so as to meet various high-performance constraints. Unfortunately, current compiler technology is unable to generate dense, high-performance code for DSPs because it does not provide adequate support for the specialized architectural features of DSPs via machine-dependent code optimizations. Thus, designers often program the embedded software in assembly, a very time-consuming task. In order to increase productivity, compilers must be developed that are capable of generating high-quality code for DSPs. The compilation process must also be made retargetable, so that a variety of DSPs may be efficiently evaluated for potential use in an embedded system. We present a retargetable compilation methodology that enables high-quality code to be generated for a wide range of DSPs. Previous work in retargetable DSP compilation has focused on complete automation, and this desire for automation has limited the number of machine-dependent optimizations that can be supported. In our efforts, we have given code quality higher priority than complete automation. We demonstrate how, by using a library of machine-dependent optimization routines accessible via a programming interface, it is possible to support a wide range of machine-dependent optimizations, albeit at some cost to automation. Experimental results demonstrate the effectiveness of our methodology, which has been used to build good-quality compilers for three fixed-point DSPs.
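One way to picture "a library of machine-dependent optimization routines behind a programming interface" is sketched below in C; the types, pass names, and registration scheme are invented for illustration and are not the paper's actual library.

    #include <stdio.h>
    #include <stddef.h>

    struct ir_unit;                                   /* opaque per-procedure IR            */
    typedef void (*md_opt_fn)(struct ir_unit *proc);  /* one machine-dependent optimization */

    struct md_opt { const char *name; md_opt_fn run; };

    /* Stub passes; a real retargeting effort would implement these per target. */
    static void fold_autoincrement(struct ir_unit *p) { (void)p; }
    static void place_paged_data(struct ir_unit *p)   { (void)p; }

    /* The compiler writer selects, per DSP, which routines from the library to run. */
    static const struct md_opt example_dsp_passes[] = {
        { "auto-increment folding", fold_autoincrement },
        { "paged data placement",   place_paged_data   },
    };

    static void run_backend(struct ir_unit *proc, const struct md_opt *passes, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            printf("running pass: %s\n", passes[i].name);
            passes[i].run(proc);
        }
    }

    int main(void)
    {
        run_backend(NULL, example_dsp_passes,
                    sizeof example_dsp_passes / sizeof example_dsp_passes[0]);
        return 0;
    }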

18.
Domain analysis is the process of analyzing related software systems in a domain to find their common and variable parts. Device drivers are highly suitable for domain analysis because drivers in the same domain are implemented similarly for each device and each system they support. Based on this characteristic, this paper introduces a new approach to the domain analysis of device drivers. Our method uses a code clone detection technique to extract similarity among device drivers of the same domain. To examine the applicability of our method, we investigated all the device drivers in a Linux source tree. The results show that many similar, reusable code fragments can be identified by the code clone detection method. We also investigated whether our method is applicable to other parts of the kernel source; the results show that code clone detection is not useful for the domain analysis of arbitrary kernel sources. That is, the applicability of code clone detection to domain analysis is a peculiarity of device drivers.

19.
Coarse-grained reconfigurable architectures (CGRAs) have recently drawn increasing attention due to their efficiency and flexibility. While many CGRAs have demonstrated impressive performance improvements, the effectiveness of CGRA platforms ultimately hinges on the compiler. Existing CGRA compilers do not model the details of the CGRA, and thus they i) are unable to map some applications even though a mapping exists, and ii) use too many processing elements (PEs) to map an application. In this paper, we model several CGRA details, e.g., irregular CGRA topologies, shared resources, and routing PEs, in our compiler and develop a graph-drawing-based approach, split-push kernel mapping (SPKM), for mapping applications onto CGRAs. On randomly generated graphs our technique can map on average 4.5 times more applications than the previous approach, while generating mappings of better quality in terms of utilized CGRA resources. Utilizing fewer resources translates directly into increased opportunities for novel power and performance optimization techniques. Our technique shows lower power consumption in 71 cases and shorter execution cycles in 66 cases out of 100 synthetic applications, with minimal mapping-time overhead. We observe similar results on a suite of benchmarks collected from the Livermore loops, Mediabench, Multimedia, Wavelet, and DSPStone benchmarks. SPKM is not customized to a specific CGRA template; we demonstrate this by exploring various PE interconnection topologies and shared-resource configurations with SPKM.

20.
Software testing is an expensive and time-consuming activity, and it is also error-prone due to human factors. Nevertheless, it is still the most common effort used in the software industry to achieve an acceptable level of quality for its products. An alternative is to use formal verification approaches, although they are not yet widespread in industry. This paper proposes an automatic verification approach to aid system testing based on refinement checking, where the underlying formalisms are hidden from the developers. Our approach consists of describing requirements in a controlled natural language (a subset of English), which is automatically translated into the formal specification language CSP, and extracting a model directly from a mobile phone using tool support we developed; these artifacts are normalized to the same abstraction level and compared using the refinement checker FDR. This approach is being used at Motorola, the source of our case study.
