Similar documents
20 similar documents found (search time: 62 ms)
1.

Development in photonic integrated circuits (PICs) provides a promising solution for on-chip optical computation and communication. PICs offer a compelling alternative to traditional networks-on-chip (NoC), which face serious challenges in bandwidth, latency, and power consumption. Integrated optics has demonstrated the ability to accomplish low-power communication and low-power data processing at ultra-high speeds. In this work, we propose a new NoC architecture that improves overall on-chip network performance by reducing power consumption, providing large channel capacity for communication, decreasing latency among nodes, and reducing hop count. A key feature of the proposed architecture is that it reduces the waveguide network needed for communication among nodes, and it can be used as a building block to construct other architectures. The architecture uses micro-ring resonators (MRRs) to provide high-bandwidth connections among nodes with fewer waveguides. Results show that this PIC architecture delivers better performance in terms of communication latency, power consumption, and bandwidth, and that the micro-ring resonators used in the design achieve acceptable values of free spectral range (FSR), FWHM, finesse, and Q-factor.
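As a rough companion to the figures of merit mentioned above, the standard all-pass micro-ring relations can be sketched in Python. The ring radius, group index, coupling, and loss values below are illustrative assumptions, not parameters from this work.

```python
import math

def mrr_figures_of_merit(radius_um, n_g, wavelength_um, self_coupling, loss_factor):
    """Estimate FSR, FWHM, finesse, and Q for an all-pass micro-ring resonator.

    Textbook relations:
      FSR     = lambda^2 / (n_g * L)        (L = ring circumference)
      finesse = pi * sqrt(r*a) / (1 - r*a)  (r = self-coupling, a = loss factor)
      FWHM    = FSR / finesse
      Q       = lambda / FWHM
    """
    L = 2 * math.pi * radius_um                  # circumference (um)
    fsr = wavelength_um ** 2 / (n_g * L)         # free spectral range (um)
    ra = self_coupling * loss_factor
    finesse = math.pi * math.sqrt(ra) / (1 - ra)
    fwhm = fsr / finesse                         # resonance linewidth (um)
    q = wavelength_um / fwhm                     # quality factor
    return fsr, fwhm, finesse, q

# Example: a hypothetical 5 um silicon ring, group index 4.3, at 1.55 um
fsr, fwhm, finesse, q = mrr_figures_of_merit(5.0, 4.3, 1.55, 0.98, 0.99)
```

Such a helper makes it easy to check whether a candidate ring geometry yields an acceptable FSR/Q trade-off before committing to a layout.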


2.
General-purpose computation on graphics processing units (GPUs) is rapidly entering various scientific and engineering fields. Many applications are being ported to GPUs for better performance, and various optimizations, frameworks, and tools are being developed for effective GPU programming. As part of communication and computation optimizations for GPUs, this paper proposes and implements an optimization method called kernel coalesce that further enhances GPU performance and also optimizes CPU-to-GPU communication time. With the kernel coalesce methods proposed in this paper, kernel launch overheads are reduced by coalescing concurrent kernels, and data transfers are reduced in case of intermediate data generated and used among kernels. Computation on the device (GPU) is optimized by tuning the number of blocks and threads launched to the architecture. The block-level kernel coalesce method yields a prominent performance improvement on devices without support for concurrent kernels. The thread-level kernel coalesce method is better than the block-level method when the design of the grid structure (i.e., the number of blocks and threads) is not optimal for the device architecture and leads to underutilization of device resources. Both methods perform similarly when the number of threads per block is approximately the same in different kernels and the total number of threads across blocks fills the streaming multiprocessor (SM) capacity of the device. The thread multi-clock-cycle coalesce method can be chosen if the programmer wants to coalesce more than two concurrent kernels that together or individually exceed the thread capacity of the device. If the kernels have lightweight thread computations, the multi-clock-cycle kernel coalesce method gives better performance than the thread- and block-level kernel coalesce methods.
If the kernels to be coalesced are a combination of compute-intensive and memory-intensive kernels, warp interleaving gives higher device occupancy and improves performance. For micro-benchmark1 considered in this paper, the multi-clock-cycle kernel coalesce method yielded 10–40% and 80–92% improvement compared with separate kernel launches, without and with shared input and intermediate data among the kernels, respectively, on a Fermi architecture device (GTX 470). A nearest-neighbor (NN) kernel from the Rodinia benchmark coalesced with itself using the thread-level kernel coalesce method and warp interleaving gives 131.9% and 152.3% improvement compared with separate kernel launches, and 39.5% and 36.8% improvement compared with the block-level kernel coalesce method, respectively. Copyright © 2013 John Wiley & Sons, Ltd.
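The launch-overhead and intermediate-data argument can be illustrated with a deliberately simple NumPy analogy; this stands in for the paper's CUDA implementation and does not reproduce it:

```python
import numpy as np

# Illustrative sketch (not the paper's CUDA code): fusing two element-wise
# "kernels" so the intermediate result never materializes, analogous to
# coalescing concurrent kernels to cut launch overhead and intermediate
# data transfers between them.

def kernel_a(x):
    return x * 2.0              # first kernel: scale

def kernel_b(y):
    return y + 1.0              # second kernel: offset

def launched_separately(x):
    tmp = kernel_a(x)           # intermediate buffer is materialized
    return kernel_b(tmp)        # second "launch" reads it back

def coalesced(x):
    return x * 2.0 + 1.0        # one fused pass: no intermediate, one launch

x = np.arange(4, dtype=np.float64)
```

The two paths compute identical results; the coalesced form simply removes one pass over memory and one dispatch, which is the effect the paper measures at kernel-launch granularity.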

3.
4.
A distributed virtual environment (DVE) is a shared virtual environment where multiple users at their workstations interact with each other over a network. Some of these systems may support a large number of users, for example, multiplayer online games. An important issue is how well the system scales as the number of users increases. In terms of scalability, a promising system architecture is a two-level hierarchical architecture. At the lower level, multiple servers are deployed; each server interacts with its assigned users. At the higher level, the servers ensure that their copies of the virtual environment are as consistent as possible. Although the two-level architecture is believed to have good properties with respect to scalability, not much is known about its performance characteristics. In this paper, we develop a performance model for the two-level architecture and obtain analytic results on the workload experienced by each server. Our results provide valuable insights into the scalability of the architecture. We also investigate the issue of consistency and develop a novel technique to achieve weak consistency among copies of the virtual environment at the various servers. Simulation results on the consistency/scalability trade-off are presented.
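A hedged sketch of the two-level idea described above: lower-level servers each serve their assigned users, and the higher level periodically reconciles the servers' copies of the shared state. The class names and the last-writer-wins merge rule are illustrative assumptions, not the paper's model.

```python
# Two-level DVE sketch: per-server state copies plus periodic reconciliation
# (weak consistency), using entity versions to decide which update is newest.

class Server:
    def __init__(self, name):
        self.name = name
        self.state = {}            # entity -> (version, value)

    def local_update(self, entity, version, value):
        """An update produced by one of this server's own users."""
        self.state[entity] = (version, value)

def reconcile(servers):
    """Higher level: merge all copies, keeping the newest version per entity."""
    merged = {}
    for s in servers:
        for entity, (ver, val) in s.state.items():
            if entity not in merged or ver > merged[entity][0]:
                merged[entity] = (ver, val)
    for s in servers:
        s.state = dict(merged)     # push the reconciled copy back down

s1, s2 = Server("s1"), Server("s2")
s1.local_update("door", 1, "open")
s2.local_update("door", 2, "closed")   # the later update should win
reconcile([s1, s2])
```

Between reconciliation rounds the copies may diverge, which is exactly the weak-consistency window the abstract's consistency/scalability trade-off is about.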

5.
Abstract: A prototype system was developed for automated defect classification and characterisation of automotive and other components, involving two separate inspection sensors, vision and electromagnetic. This paper concentrates on the development work and issues related to the electromagnetic sensor, in particular knowledge acquisition and knowledge representation. For instance, one problem that arose during development was that the reasoning carried out unconsciously by a human turned out to be more complex than had been realised and not easily encapsulated as high-level knowledge. A blackboard architecture was used to integrate the different areas of expertise required for each sensor to interpret the inspection results. The main issue here was the effective use of the blackboard architecture for intelligent data fusion at all levels to improve interpretation.
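A blackboard integration of this kind can be sketched minimally in Python; the fact names and the fusion rule are invented illustrations, not the paper's system.

```python
# Minimal blackboard sketch: knowledge sources watch a shared blackboard and
# post conclusions when their trigger facts appear; a simple control loop
# runs them until no source can contribute anything new.

class Blackboard:
    def __init__(self):
        self.facts = {}

class KnowledgeSource:
    def __init__(self, needs, produces, rule):
        self.needs, self.produces, self.rule = needs, produces, rule

    def try_contribute(self, bb):
        if self.produces in bb.facts:          # already concluded
            return False
        if all(k in bb.facts for k in self.needs):
            bb.facts[self.produces] = self.rule(bb.facts)
            return True
        return False

def run(bb, sources):
    progress = True
    while progress:
        progress = any(ks.try_contribute(bb) for ks in sources)

# Hypothetical fusion: vision + electromagnetic readings -> defect class
bb = Blackboard()
bb.facts["vision"] = "surface_crack"
bb.facts["em"] = "subsurface_echo"
sources = [KnowledgeSource(
    ["vision", "em"], "defect",
    lambda f: "crack" if f["vision"] == "surface_crack" else "unknown")]
run(bb, sources)
```

The appeal of the pattern, as in the paper, is that each sensor's expertise lives in its own knowledge source while fusion happens opportunistically on the shared board.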

6.
This article describes the design and implementation of the reasoning engine developed for the interpretation of the FLORIAN rule language. A key feature of the language is that it allows the specification of control knowledge using generalized meta-rules. The user can define how to solve conflicts at the object level, at the meta-level, or at any higher level using meta-i-rules. Object-level rules and generalized meta-i-rules share the same rule format. Several examples of meta-rules and higher-level rules are presented using the rule syntax. The architecture and operation of the rule interpreter are analyzed, describing the main algorithms and abstract data types implementing the reasoning engine.
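The conflict-resolution role of meta-rules can be sketched as follows; the rule format and the priority-based meta-rule are invented illustrations, not FLORIAN syntax.

```python
# Sketch: object-level rules share one format; when several are applicable,
# a meta-level rule decides which one in the conflict set actually fires.

class Rule:
    def __init__(self, name, condition, action, priority=0):
        self.name, self.condition = name, condition
        self.action, self.priority = action, priority

def applicable(rules, facts):
    """Build the conflict set: all rules whose conditions hold."""
    return [r for r in rules if r.condition(facts)]

def meta_resolve(conflict_set):
    """Meta-rule: among conflicting rules, prefer the highest priority."""
    return max(conflict_set, key=lambda r: r.priority)

facts = {"temp": 90}
rules = [
    Rule("warn", lambda f: f["temp"] > 80, lambda f: "warn", priority=1),
    Rule("shutdown", lambda f: f["temp"] > 85, lambda f: "shutdown", priority=2),
]
conflict = applicable(rules, facts)
chosen = meta_resolve(conflict)
result = chosen.action(facts)
```

Because `meta_resolve` is itself just a rule over rules, the same mechanism can in principle be stacked to arbitrate at the meta-level and above, which is the layering the abstract describes.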

7.
8.
This paper proposes a novel concept of interactive self-reflection (ISR), and claims its importance and necessity for implementing autonomy. Concretely, ISR includes three elements: (a) precisely recognize the current situation by determining the boundary between self and others; (b) appropriately detect or produce the required information according to a sense of values; and (c) autonomously generate both the goal and the evaluation criteria. To address these issues in ISR, we implement an ISR architecture based on cellular automata and investigate its adaptation ability as one aspect of autonomy. Through intensive simulations, we reveal the following implications: (1) the ISR architecture provides a high level of adaptability that cannot be obtained by either adaptation to the environment or adaptation from the environment alone; (2) the adaptability of the architecture is supported by appropriate interaction control between adaptation to the environment and adaptation from the environment. This work was presented in part at the Fifth International Symposium on Artificial Life and Robotics, Oita, Japan, January 26–28, 2000.

9.
10.
This paper presents a formal model and a systematic approach to the validation of communication architectures at a high level of abstraction. The model is described mathematically by a function named GeNoC. The correctness of GeNoC is expressed as a theorem, which states that messages emitted on the architecture reach their expected destination without any modification of their content. The model identifies the key constituents common to all on-chip communication architectures, and their essential properties from which the correctness theorem is deduced. Each constituent is represented by a function that has no explicit definition but is constrained to satisfy the essential properties. Thus, the validation of a particular architecture is reduced to the proof that its concrete definition satisfies those properties. In practice, the model has been defined in the logic of the ACL2 theorem-proving system. We illustrate our approach on several architectures that constitute concrete instances of the generic GeNoC model. Some of these applications come from industrial designs, such as the AMBA AHB bus or the Octagon network from ST Microelectronics.
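The shape of the GeNoC-style correctness statement can be restated in Python as a runtime check rather than an ACL2 theorem; the routing function below is an invented toy, not one of the paper's architecture instances.

```python
# Correctness property, informally: for any concrete routing function that is
# plugged in, every message must reach its intended destination with its
# payload unmodified.

def ideal_route(msg):
    """A trivially correct 'architecture instance': deliver payload at dst."""
    src, dst, payload = msg
    return dst, payload

def genoc_correct(route, messages):
    """The correctness theorem recast as a check over a finite message set."""
    for src, dst, payload in messages:
        arrived_at, delivered = route((src, dst, payload))
        if arrived_at != dst or delivered != payload:
            return False
    return True

msgs = [((0, 0), (1, 2), "hello"), ((3, 1), (0, 0), "ack")]
ok = genoc_correct(ideal_route, msgs)
```

The point of the paper's approach is stronger than this check, of course: the property is proved once for the generic model, so each concrete architecture only has to discharge the constraints on its constituent functions.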

11.
12.
Heterogeneous performance prediction models are valuable tools to accurately predict application runtime, allowing for efficient design-space exploration and application mapping. Existing performance models require intricate system architecture knowledge, making the modeling task difficult. In this research, we propose a regression-based performance prediction framework for general-purpose graphics processing unit (GPGPU) clusters that statistically abstracts the system architecture characteristics, enabling performance prediction without detailed system architecture knowledge. The regression-based framework targets deterministic synchronous iterative algorithms using our synchronous iterative GPGPU execution model and is broken into two components: a computation component that models the GPGPU device and host computations, and a communication component that models the network-level communications. The computation-component regression models use algorithm characteristics such as the number of floating-point operations and total bytes as predictor variables and are trained using several small, instrumented executions of synchronous iterative algorithms spanning a range of floating-point-operations-to-byte requirements. The regression models for network-level communications are developed using micro-benchmarks and employ data transfer size and processor count as predictor variables. Our performance prediction framework achieves prediction accuracy over 90% compared with the actual implementations for several tested GPGPU cluster configurations. The end goal of this research is to offer the scientific computing community an accurate and easy-to-use performance prediction framework that empowers users to optimally utilize heterogeneous resources. Copyright © 2013 John Wiley & Sons, Ltd.
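A hedged sketch of the regression idea, with synthetic numbers rather than the paper's trained models: fit runtime as a linear function of per-run characteristics (floating-point operations and bytes moved) from a few small instrumented runs, then predict a larger configuration.

```python
import numpy as np

# Training runs: columns = [flops, bytes]; targets = measured runtimes (s).
# The data here is synthetic and exactly linear by construction.
X = np.array([[1e9, 4e8], [2e9, 4e8], [2e9, 8e8], [4e9, 8e8]])
t = np.array([0.7, 1.2, 1.3, 2.3])

A = np.column_stack([X, np.ones(len(X))])    # add an intercept column
coef, *_ = np.linalg.lstsq(A, t, rcond=None) # least-squares fit

def predict(flops, nbytes):
    return coef[0] * flops + coef[1] * nbytes + coef[2]

pred = predict(8e9, 1.6e9)   # extrapolate to a larger problem size
```

Real frameworks of this kind add communication terms (transfer size, processor count) fitted from micro-benchmarks, as the abstract describes; the sketch shows only the computation-side regression.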

13.
Unger, Oren; Cidon, Israel. World Wide Web (2004) 7(3): 315–336
The architecture of overlay networks should support high performance and high scalability at low cost. This becomes more crucial as communication and storage costs, as well as service latencies, grow with the exploding amounts of data exchanged and with the size and span of the overlay network. To that end, multicast methodologies can be used to deliver content from regional servers to end users, as well as for the timely and economical synchronization of content among the distributed servers. Another important architectural problem is the efficient allocation of objects to servers to minimize storage, delivery, and update costs. In this work, we suggest a multicast-based architecture and address the optimal allocation and replication of dynamic objects that are both consumed and updated. Our model network includes consumers, which are served using multicast or unicast transmissions, and media sources (which may also be consumers) that update the objects using multicast communication. General costs are associated with distribution (download) and update traffic as well as with the storage of objects in the servers. Optimal object allocation algorithms for tree networks are presented with complexities of O(N) and O(N²) for multicast and unicast distribution, respectively. To our knowledge, the model of multicast distribution combined with multicast updates has not been analytically treated before, despite its popularity in the industry.
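As a much-simplified companion to the allocation problem above (single object, unicast delivery only, unit hop costs, root always storing the object; this is NOT the paper's O(N) multicast algorithm), a toy dynamic program over a tree might look like:

```python
from functools import lru_cache

# Each non-root node may store a replica at cost STORE; a node's requests are
# served from the nearest replica on its path to the root, at cost
# demand * distance. We minimize total storage + delivery cost bottom-up.

STORE = 5.0
children = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
demand = {0: 0.0, 1: 1.0, 2: 4.0, 3: 6.0, 4: 1.0}

@lru_cache(maxsize=None)
def best(node, dist_above):
    """Min cost of node's subtree, nearest replica dist_above hops up."""
    # Option 1: place a replica here (own reads become free).
    place = STORE + sum(best(c, 1) for c in children[node])
    # Option 2: serve this node from the replica above.
    skip = demand[node] * dist_above + \
        sum(best(c, dist_above + 1) for c in children[node])
    return min(place, skip)

total = sum(best(c, 1) for c in children[0])   # root stores the object
```

With the numbers above, node 1 is cheaper served from the root while a replica at neither leaf pays off individually; the DP weighs exactly the storage-versus-delivery trade-off the abstract formalizes (with update traffic added on top in the full model).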

14.
Agent integration architectures enable a heterogeneous, distributed set of agents to work together to address problems of greater complexity than those addressed by the individual agents themselves. Unfortunately, integrating software agents and humans to perform real-world tasks in a large-scale system remains difficult, due especially to three main challenges: ensuring robust execution in the face of a dynamic environment, providing abstract task specifications without all the low-level coordination details, and finding appropriate agents for inclusion in the overall system. To address these challenges, our Teamcore project provides the integration architecture with general-purpose teamwork coordination capabilities. We make each agent team-ready by providing it with a proxy capable of general teamwork reasoning. Thus, a key novelty and strength of our framework is that powerful teamwork capabilities are built into its foundations by providing the proxies themselves with a teamwork model. Given this teamwork model, the Teamcore proxies address the first agent integration challenge, robust execution, by automatically generating the required coordination actions for the agents they represent. We can also exploit the proxies' reusable general teamwork knowledge to address the second challenge: through team-oriented programming, a developer specifies a hierarchical organization and its goals and plans, abstracting away from coordination details. Finally, KARMA, our Knowledgeable Agent Resources Manager Assistant, aids the developer in conquering the third challenge by locating agents that match the specified organization's requirements. Our integration architecture enables teamwork among agents with no coordination capabilities, and it establishes and automates consistent teamwork among agents with some coordination capabilities. Thus, team-oriented programming provides a level of abstraction that can be used on top of previous approaches to agent-oriented programming. We illustrate how the Teamcore architecture successfully addressed the challenges of agent integration in two application domains: simulated rehearsal of a military evacuation mission and facilitation of human collaboration.

15.
By their structure and operation, biomolecules have resolved fundamental problems as a distributed computational system that we are just beginning to unveil. One advantageous approach to gain a good understanding of the processes and algorithms involved is simulation on conventional computers. Simulations allow better understanding of the capabilities of molecules because they can occur at the level of reliability, efficiency, and programmability that are standard in conventional computation and are desirable for experiments in vitro. Here, we describe in some detail the architecture of a general-purpose simulation environment in silico, EdnaCo, establish its soundness and reliability, and benchmark its performance. The system can be described as an emulation of the events in a real test tube. We describe the major pieces of its architecture, namely, a distributed memory (file) system, a kinetic engine, and input/output mechanisms. Finally, the ability of this environment in preserving major features of the wet counterpart in vitro is evaluated via an implementation on a cluster of PCs. The results of several simulations are summarized that establish the soundness, utility, applicability, and cost efficiency of the software to facilitate experimentation in vitro.

16.
Simulation is indispensable in computer architecture research. Researchers increasingly resort to detailed architecture simulators to identify performance bottlenecks, analyze interactions among different hardware and software components, and measure the impact of new design ideas on system performance. However, the slow speed of conventional execution-driven architecture simulators is a serious impediment to obtaining desirable research productivity. This paper describes a novel fast multicore processor architecture simulation framework called Two-Phase Trace-driven Simulation (TPTS), which splits detailed timing simulation into a trace generation phase and a trace simulation phase. Much of the simulation overhead caused by uninteresting architectural events is incurred only once, during the cycle-accurate simulation-based trace generation phase, and can be omitted in the repeated trace-driven simulations. We report our experiences with tsim, an event-driven multicore processor architecture simulator that models detailed memory hierarchy, interconnect, and coherence protocol based on the TPTS framework. By applying aggressive event filtering, tsim achieves an impressive simulation speed of 146 million simulated instructions per second when running 16-thread parallel applications. Copyright © 2010 John Wiley & Sons, Ltd.
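The two-phase split can be caricatured in a few lines of Python; the event names, the miss-only filter, and the timing model are all invented for illustration and are not tsim's internals.

```python
# Phase 1 runs a detailed pass once and records only the "interesting"
# events (here, cache misses) into a trace; phase 2 replays the trace
# cheaply under different what-if timing parameters.

def detailed_simulation(instructions):
    """Phase 1: slow, cycle-level pass; emits a filtered event trace."""
    trace = []
    for pc, op in enumerate(instructions):
        if op == "load_miss":             # event filtering: keep only misses
            trace.append(("miss", pc))
    return trace

def replay(trace, n_instr, miss_penalty):
    """Phase 2: fast timing model driven by the recorded trace."""
    cycles = n_instr                      # 1 cycle per instruction baseline
    cycles += sum(miss_penalty for ev, _ in trace if ev == "miss")
    return cycles

program = ["add", "load_miss", "mul", "load_miss", "add"]
trace = detailed_simulation(program)                    # run once
fast = replay(trace, len(program), miss_penalty=10)     # reuse many times
slower_mem = replay(trace, len(program), miss_penalty=50)
```

The payoff mirrors the abstract: the expensive detailed pass is amortized across many fast replays, each exploring a different timing configuration.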

17.
On-demand waypoints for live P2P video broadcasting
A peer-to-peer architecture has emerged as a promising approach to enabling the ubiquitous deployment of live video broadcasting on the Internet. However, the performance of these architectures is unpredictable and fundamentally constrained by the characteristics of the members participating in the broadcast. By characteristics, we refer to user dynamics, outgoing bandwidth connectivity, whether the member is behind a NAT/firewall, and the network conditions among participating members. While several researchers have looked at hybrid P2P/CDN approaches to address these issues, such approaches require provisioning centralized server resources prior to a broadcast, which complicates the goal of ubiquitous video broadcasting. In this paper, we explore an alternative architecture in which users are willing to donate their bandwidth resources to a broadcast event even though they are not participants in the event. Such users constitute what we term a waypoint community. Any given broadcast event constructs overlays based only on participants to the extent possible; however, waypoints may be dynamically invoked in an on-demand, performance-driven fashion to improve the performance of a broadcast. We present the design of a system built on this idea. Detailed results from trace-driven experiments over the PlanetLab distributed infrastructure and Emulab demonstrate the potential of the waypoint architecture to improve the performance of purely P2P-based overlays.

18.
Ambient displays provide us with information in the background of our awareness. However, as each user has individual wishes and needs for how, which, and when information is presented, the acceptance of ambient displays is low. In this paper we introduce an extensible architecture for personalized ambient information. We employ a notification system to extend the capability of a fixture to display more than one variable, and multiple variables can be updated by multiple information providers. Thereby, our architecture covers a broader spectrum of notifications, from alarms to ambient information. We evaluate our concept within a dual-task experiment in comparison to preset notifications. The results show a level of self-interruption that is significantly lower than with preset notifications. Our approach therefore outperforms preset notifications and moves ambient displays closer to secondary displays in human–computer interaction.

19.
Integer motion estimation (IME), a key component in video encoders, removes temporal redundancies by searching for the best integer motion vectors for the dynamic partition blocks in a macroblock (MB). Huge memory bandwidth requirements and heavy computational demands are two key bottlenecks in IME engine design, especially for large search windows (SWs). In this paper, a three-level pipelined VLSI architecture is proposed that efficiently integrates reference data sharing search (RDSS) into a multi-resolution motion estimation algorithm (MMEA). First, a hardware-friendly MMEA is mapped onto the three-level pipelined architecture with negligible coding-quality loss. Second, sub-sampled RDSS coupled with Level C+ data reuse is adopted to reduce on-chip memory and bandwidth at the coarsest and middle levels. Data sharing between IME and fractional motion estimation (FME) is achieved by loading only a local predictive SW at the finest level. Finally, the three levels are parallelized and pipelined to guarantee the gradual refinement of MMEA and high hardware utilization. Experimental results show that the proposed architecture reaches a good balance among complexity, on-chip memory, bandwidth, and data-flow regularity. Only 320 processing elements (PEs) within 550 cycles are required for an IME search with the SW set to 256 × 256. The architecture achieves 1080p@30 fps real-time processing at a working frequency of 134.6 MHz, with 135 K gates and 8.93 KB of on-chip memory.
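To make the underlying operation concrete, here is a toy integer motion estimation in Python (a plain full search over a small window, NOT the paper's three-level multi-resolution pipelined design): pick the integer motion vector minimizing the sum of absolute differences (SAD) between the current block and candidate reference blocks.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equal-sized blocks."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def ime_full_search(ref, cur, by, bx, bsize, search_range):
    """Best (dy, dx) for cur's block at (by, bx), within +/- search_range."""
    block = cur[by:by + bsize, bx:bx + bsize]
    best_mv, best_cost = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = by + dy, bx + dx
            if 0 <= y and 0 <= x and y + bsize <= ref.shape[0] \
                    and x + bsize <= ref.shape[1]:
                cost = sad(block, ref[y:y + bsize, x:x + bsize])
                if best_cost is None or cost < best_cost:
                    best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost

# Synthetic test: the current frame is the reference shifted right by 2
# pixels, so the block's best match sits 2 columns to the left: mv = (0, -2).
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(16, 16), dtype=np.uint8)
cur = np.zeros_like(ref)
cur[:, 2:] = ref[:, :-2]
mv, cost = ime_full_search(ref, cur, by=4, bx=4, bsize=4, search_range=3)
```

The quadratic blow-up of this exhaustive search with SW size is precisely why hardware designs resort to multi-resolution search and reference-data reuse schemes such as Level C+.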

20.
Interest in the Web services (WS) composition (WSC) paradigm is increasing tremendously. A real shift in distributed computing history is expected to occur when the dream of implementing Service-Oriented Architecture (SOA) is realized. However, there is a long way to go to achieve such an ambitious goal. In this paper, we support the idea that, when challenging the WSC issue, the earlier that the inevitability of failures is recognized and proper failure-handling mechanisms are defined, from the very early stage of the composite WS (CWS) specification, the greater are the chances of achieving a significant gain in dependability. To formalize this vision, we present the FENECIA (Failure Endurable Nested-transaction based Execution of Composite Web services with Incorporated state Analysis) framework. Our framework approaches the WSC issue from different points of view to guarantee a high level of dependability. In particular, it aims at being simultaneously a failure-handling-devoted CWS specification, execution, and quality of service (QoS) assessment approach. In the first section of our framework, we focus on answering the need for a specification model tailored for the WS architecture. To this end, we introduce WS-SAGAS, a new transaction model. WS-SAGAS introduces key concepts that are not part of the WS architecture pillars, namely, arbitrary nesting, state, vitality degree, and compensation, to specify failure-endurable CWS as a hierarchy of recursively nested transactions. In addition, to define the CWS execution semantics, without suffering from the hindrance of an XML-based notation, we describe a textual notation that describes a WSC in terms of definition rules, composability rules, and ordering rules, and we introduce graphical and formal notations. These rules provide the solid foundation needed to formulate the execution semantics of a CWS in terms of execution correctness verification dependencies. 
To ensure dependable execution of the CWS, we present in the second section of FENECIA our architecture THROWS, in which the execution control of the resulting CWS is distributed among engines, discovered dynamically, that communicate in a peer-to-peer fashion. A dependable execution is guaranteed in THROWS by keeping track of the execution progress of a CWS and by enforcing forward and backward recovery. In the third section of our approach, we concentrate on showing how considering failures is pivotal to acquiring more accurate CWS QoS estimations. We propose a model that assesses several QoS properties of CWS, which are specified as WS-SAGAS transactions and executed in THROWS. We validate our proposal and show its feasibility and broad applicability by describing an implemented prototype and a case study.
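The backward-recovery idea behind WS-SAGAS-style compensation can be sketched as a minimal saga runner; the step names and the failure point are invented, and this is not FENECIA's notation.

```python
# Each step pairs a forward action with a compensator; if a later step
# fails, the completed steps are compensated in reverse order (backward
# recovery), leaving the composition in a consistent state.

log = []

def step(name, fail=False):
    def forward():
        if fail:
            raise RuntimeError(f"{name} failed")
        log.append(f"do:{name}")
    def compensate():
        log.append(f"undo:{name}")
    return name, forward, compensate

def run_saga(steps):
    done = []
    try:
        for name, forward, compensate in steps:
            forward()
            done.append(compensate)
        return "committed"
    except RuntimeError:
        for compensate in reversed(done):   # backward recovery
            compensate()
        return "compensated"

outcome = run_saga([
    step("reserve_flight"),
    step("reserve_hotel"),
    step("charge_card", fail=True),   # third step fails
])
```

Hierarchical models like WS-SAGAS generalize this flat chain to arbitrarily nested transactions with per-node state and vitality degrees, but the forward/compensate pairing is the common core.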
