20 similar documents found; search time: 0 ms
1.
A study of the memory characteristics of the brain and the computer prompts the creation of a new neuron architecture for neural computation. We hypothesize that neural responses resemble hysteresis loops. The upper and lower halves of the hysteresis loop are described by two sigmoids. Generalizing the two sigmoids to two families of curves accommodates loops of various sizes. This model, which we call the 'hystery model', is capable of memorizing the entire history of its bipolar inputs in an adaptive fashion, with larger memory for longer sequences. We theorize and prove that the hystery model's response converges asymptotically to hysteresis-like loops. A simple application to temporal pattern discrimination demonstrates the nonlinear short-term memory characteristics of the hystery model. This model may have important applications for time-based computations such as control, signal processing and spatiotemporal pattern recognition, especially if it can take advantage of existing hysteresis phenomena in semiconductor materials.
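The two-sigmoid description of a hysteresis loop can be sketched as follows. This is an illustrative sketch, not the paper's model: the gain and offset parameters, and the idea of selecting a branch by sweep direction, are assumptions chosen to make the loop visible.

```python
import math

def sigmoid(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def hysteresis_response(x, direction, gain=4.0, offset=0.5):
    """One branch of a hysteresis loop modeled by two shifted sigmoids.
    direction=+1: input is increasing (branch shifted right, lower values);
    direction=-1: input is decreasing (branch shifted left, higher values).
    gain and offset are illustrative parameters, not from the paper."""
    shift = offset if direction > 0 else -offset
    return sigmoid(gain * (x - shift))

# Sweeping the input up and then back down traces two different curves,
# i.e., a loop: the response depends on the input's history.
up = [hysteresis_response(x / 10.0, +1) for x in range(-20, 21)]
down = [hysteresis_response(x / 10.0, -1) for x in range(20, -21, -1)]
```

At the same input value the two branches disagree (e.g., at x = 0 the increasing branch is below the decreasing branch), which is exactly the history dependence the abstract describes.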
2.
3.
A number of highly-threaded, many-core architectures hide memory-access latency by low-overhead context switching among a large number of threads. The speedup of a program on these machines depends on how well the latency is hidden. If the number of threads were infinite, theoretically, these machines could provide the performance predicted by the PRAM analysis of these programs. However, the number of threads per processor is not infinite, and is constrained by both hardware and algorithmic limits. In this paper, we introduce the Threaded Many-core Memory (TMM) model which is meant to capture the important characteristics of these highly-threaded, many-core machines. Since we model some important machine parameters of these machines, we expect analysis under this model to provide a more fine-grained and accurate performance prediction than the PRAM analysis. We analyze 4 algorithms for the classic all pairs shortest paths problem under this model. We find that even when two algorithms have the same PRAM performance, our model predicts different performance for some settings of machine parameters. For example, for dense graphs, the dynamic programming algorithm and Johnson’s algorithm have the same performance in the PRAM model. However, our model predicts different performance for large enough memory-access latency and validates the intuition that the dynamic programming algorithm performs better on these machines. We validate several predictions made by our model using empirical measurements on an instantiation of a highly-threaded, many-core machine, namely the NVIDIA GTX 480.
4.
《International Journal of Parallel, Emergent and Distributed Systems》2013,28(5):383-406
Recent graphics processing units (GPUs), which have many processing units, can be used for general-purpose parallel computation, and their powerful computing ability has made them widely used for general-purpose processing. Since GPUs have very high memory bandwidth, their performance greatly depends on memory access. The main contribution of this paper is a GPU implementation of computing the Euclidean distance map (EDM) with efficient memory access. Given a two-dimensional (2D) binary image, the EDM is a 2D array of the same size in which each element stores the Euclidean distance to the nearest black pixel. In the proposed GPU implementation, we consider many programming issues of the GPU system, such as coalesced access to global memory and shared-memory bank conflicts. Concretely, by transposing the 2D arrays of temporary data stored in global memory through shared memory, the main accesses to and from global memory can be performed as coalesced accesses. In practice, we have implemented our parallel algorithm on three modern GPU systems: the Tesla C1060, GTX 480 and GTX 580. The experimental results show that, for an input binary image of size 9216 × 9216, our implementation achieves a speedup factor of 54 over the sequential algorithm implementation.
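As a point of reference for the EDM definition above (not the paper's GPU algorithm), a brute-force EDM for a small binary image can be written as:

```python
import math

def euclidean_distance_map(image):
    """Brute-force EDM: image[y][x] == 1 marks a black pixel.
    Returns a same-size 2D array whose entry (y, x) is the Euclidean
    distance from pixel (y, x) to the nearest black pixel.
    O(n^2 * m) where m is the number of black pixels; for reference only."""
    h, w = len(image), len(image[0])
    blacks = [(y, x) for y in range(h) for x in range(w) if image[y][x]]
    edm = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            edm[y][x] = min(math.hypot(y - by, x - bx) for by, bx in blacks)
    return edm

# A single black pixel in the center of a 3x3 image.
img = [[0, 0, 0],
       [0, 1, 0],
       [0, 0, 0]]
edm = euclidean_distance_map(img)
```

The paper's contribution is making this computation fast on a GPU via coalesced memory access; the quadratic-time sketch above only pins down what is being computed.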
5.
The uniform memory hierarchy model of computation
The Uniform Memory Hierarchy (UMH) model introduced in this paper captures performance-relevant aspects of the hierarchical nature of computer memory. It is used to quantify architectural requirements of several algorithms and to ratify the faster speeds achieved by tuned implementations that use improved data-movement strategies. A sequential computer's memory is modeled as a sequence M0, M1, ... of increasingly large memory modules. Computation takes place in M0. Thus, M0 might model a computer's central processor, while M1 might be cache memory, M2 main memory, and so on. For each module Mu, a bus Bu connects it with the next larger module Mu+1. All buses may be active simultaneously. Data is transferred along a bus in fixed-size blocks. The size of these blocks, the time required to transfer a block, and the number of blocks that fit in a module are larger for modules farther from the processor. The UMH model is parametrized by the rate at which the blocksizes increase and by the ratio of the blockcount to the blocksize. A third parameter, the transfer-cost (inverse bandwidth) function, determines the time to transfer blocks at the different levels of the hierarchy. UMH analysis refines traditional methods of algorithm analysis by including the cost of data movement throughout the memory hierarchy. The communication efficiency of a program is a ratio measuring the portion of UMH running time during which M0 is active. An algorithm that can be implemented by a program whose communication efficiency is nonzero in the limit is said to be communication-efficient. The communication efficiency of a program depends on the parameters of the UMH model, most importantly on the transfer-cost function. A threshold function separates those transfer-cost functions for which an algorithm is communication-efficient from those that are too costly. Threshold functions for matrix transpose, standard matrix multiplication, and Fast Fourier Transform algorithms are established by exhibiting communication-efficient programs at the threshold and showing that more expensive transfer-cost functions are too costly. A parallel computer can be modeled as a tree of memory modules with computation occurring at the leaves. Threshold functions are established for multiplication of N×N matrices using up to N^2 processors in a tree with constant branching factor.
6.
Abuzer Yakaryılmaz Rūsiņš Freivalds A. C. Cem Say Ruben Agadzanyan 《Natural computing》2012,11(1):81-94
In classical computation, a “write-only memory” (WOM) is little more than an oxymoron, and the addition of WOM to a (deterministic or probabilistic) classical computer brings no advantage. We prove that quantum computers that are augmented with WOM can solve problems that neither a classical computer with WOM nor a quantum computer without WOM can solve, when all other resource bounds are equal. We focus on realtime quantum finite automata, and examine the increase in their power effected by the addition of WOMs with different access modes and capacities. Some problems that are unsolvable by two-way probabilistic Turing machines using sublogarithmic amounts of read/write memory are shown to be solvable by these enhanced automata.
7.
8.
9.
Jaroslaw Nieplocha Robert J. Harrison Richard J. Littlefield 《The Journal of supercomputing》1996,10(2):169-189
Portability, efficiency, and ease of coding are all important considerations in choosing the programming model for a scalable parallel application. The message-passing programming model is widely used because of its portability, yet some applications are too complex to code in it while also trying to maintain a balanced computation load and avoid redundant computations. The shared-memory programming model simplifies coding, but it is not portable and often provides little control over interprocessor data transfer costs. This paper describes an approach, called Global Arrays (GAs), that combines the better features of both other models, leading to both simple coding and efficient execution. The key concept of GAs is that they provide a portable interface through which each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes. We have implemented the GA library on a variety of computer systems, including the Intel Delta and Paragon, the IBM SP-1 and SP-2 (all message passers), the Kendall Square Research KSR-1/2 and the Convex SPP-1200 (nonuniform access shared-memory machines), the CRAY T3D (a globally addressable distributed-memory computer), and networks of UNIX workstations. We discuss the design and implementation of these libraries, report their performance, illustrate the use of GAs in the context of computational chemistry applications, and describe the use of a GA performance visualization tool.(An earlier version of this paper was presented at Supercomputing'94.)
10.
With the tremendous growth of cloud computing, verifiable computation was first formalized by Gennaro et al. and has since been studied widely to provide integrity guarantees for outsourced computation. However, existing verifiable computation protocols work either in the secret-key setting or in the public-key setting, that is, either for a single client or for all clients, which rules out some practical applications with access control policies. In this paper, we introduce and formalize the notion of verifiable computation with access control (AC-VC), in which only computationally weak clients holding the necessary access control permissions are allowed by a trusted source to outsource the computation of a function to a server. We present a formal security definition and a provably secure black-box construction for AC-VC. The construction is built from any verifiable computation scheme in the secret-key model together with ciphertext-policy attribute-based encryption (CP-ABE). The access control policies that our AC-VC can realize are those realized by the underlying CP-ABE.
11.
J. R. Cowie 《Software》1987,17(10):719-728
A technique is presented, based on the method of tries, which allows direct access to variable-length records in a sequential file. No reorganization of the original file is involved.
12.
13.
OpenMP is one of the main parallel programming models adopted by modern multi-core cluster systems. It achieves good speedup on a single multi-core CPU, but its poor scalability must be addressed when it is used across an entire cluster. We first design a parallel algorithm for solving non-equilibrium dynamics equations. On a distributed shared-memory multi-core cluster, we adopt an explicit data-distribution OpenMP parallel computing method: the data are partitioned in a distributed manner and assigned to individual OpenMP threads, and data exchange is realized through data sharing. The results show that the explicit OpenMP parallel program retains its readability while exhibiting good scalability. On a distributed shared-memory cluster built from quad-core Xeon processors, the parallel numerical computation of the non-equilibrium dynamics equations scales to 1024 CPU cores with a clear parallel speedup.
14.
We present a model of discrete quantum computing focused on a set of discrete quantum states. For this, we choose the set that is most outstanding in terms of the simplicity of its states: the set of Gaussian coordinate states, which includes all quantum states whose coordinates in the computational basis, except for a normalization factor \(\sqrt{2^{-k}}\), belong to the ring of Gaussian integers \(\mathbb {Z}[i]=\{a+bi\ |\ a,b\in \mathbb {Z}\}\). We also introduce a finite set of quantum gates that transforms discrete states into discrete states and generates all discrete quantum states, and we define the set of discrete quantum gates as the quantum gates that leave the set of discrete states invariant. We prove that the quantum gates of the model generate the expected discrete states and the discrete quantum gates of 2 qubits, and conjecture that they also generate the discrete quantum gates of n qubits.
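The membership condition for Gaussian coordinate states can be checked directly: the coordinates must be Gaussian integers, and, since the normalization factor is \(\sqrt{2^{-k}}\), their squared magnitudes must sum to 2^k. A small sketch (illustrative, not from the paper; the function name is an assumption):

```python
def is_discrete_state(coords, k):
    """Check whether coords (Gaussian integers given as Python complex
    numbers with integral real and imaginary parts), scaled by the
    normalization factor sqrt(2^-k), form a unit-norm quantum state.
    Unit norm requires sum of |a+bi|^2 over all coordinates == 2^k."""
    if any(c.real != int(c.real) or c.imag != int(c.imag) for c in coords):
        return False  # not a Gaussian integer
    norm_sq = sum(int(c.real) ** 2 + int(c.imag) ** 2 for c in coords)
    return norm_sq == 2 ** k

# Example: the state (1/sqrt(2))(|0> + |1>) has coordinates (1, 1), k = 1.
plus_ok = is_discrete_state([1 + 0j, 1 + 0j], 1)
```

For instance, coordinates (1, i) with k = 1 also pass (1 + 1 = 2^1), while (1, 2) with k = 1 fail (1 + 4 ≠ 2).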
15.
A model is proposed that can be used to classify algorithms as inherently sequential. The model captures the internal computations of algorithms, whereas previous work in complexity theory has focused on the solutions algorithms compute. Direct comparison of algorithms within the framework of the model is possible. The model is useful for identifying hard-to-parallelize constructs that should be avoided by parallel programmers. The model's utility is demonstrated via applications to graph searching. A stack-based breadth-first search (BFS) algorithm is analyzed and proved inherently sequential; the proof technique used in the reduction is new. The result for stack BFS sharply contrasts with a result showing that a queue-based BFS algorithm is in NC. An NC algorithm to compute greedy depth-first search numbers in a dag is presented, and a proof that a combined search strategy called breadth-depth search is inherently sequential is also given.
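For contrast with the stack variant analyzed above, the standard queue-based BFS (the variant shown to be in NC) can be sketched as follows. This is an ordinary sequential sketch of queue BFS, not the paper's parallel NC construction:

```python
from collections import deque

def bfs_levels(adj, source):
    """Queue-based BFS over an adjacency-list dict.
    Returns the BFS level (edge distance from source) of each reachable vertex."""
    level = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()          # FIFO order: vertices leave level by level
        for v in adj[u]:
            if v not in level:       # first visit fixes the (minimal) level
                level[v] = level[u] + 1
                queue.append(v)
    return level

graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
levels = bfs_levels(graph, 0)
```

Replacing the deque's `popleft` with a LIFO `pop` yields the stack-driven traversal order; the paper's point is that the specific vertex ordering such a stack discipline computes is what resists parallelization, even though the level structure itself is computable in NC.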
16.
17.
Qiuxia WU;Yue PENG;Wenwu XIAO;Wenxuan MA;Shuo ZHANG;Litao SUN;Chunfu ZHANG;Xiaohua MA;Yue HAO 《中国科学:信息科学(英文版)》2025,(1):396-397
Since the discovery of HfO2-based ferroelectric (FE) materials in 2011, considerable advancements have been made in material preparation, mechanism research, and device realization [1]. Consequently, these materials are now regarded as leading candidates for embedded non-volatile memory (eNVM). Beyond eNVM applications, recent studies have demonstrated that memory devices with antiferroelectric (AFE) films can achieve high response speeds, low latency, and reduced power consumption while maintaining excellent endurance characteristics [2,3]. Additionally, the typical double hysteresis loops and multiple non-overlapping polarization current peaks of AFE films offer a novel approach to designing multistate memories. However, in order to further promote the application of high-storage-density AFE, an AFRAM electronic design automation (EDA) model must be developed, and it has yet to be proposed.
18.
19.
Modern GPUs (Graphics Processing Units) offer very high computing power at relatively low cost. To take advantage of their computing resources and develop efficient implementations, it is essential to have certain knowledge of the architecture and the memory hierarchy. In this paper, we use the FFT (Fast Fourier Transform) as a benchmark tool to analyze different aspects of GPU architectures, such as the influence of the memory access pattern or the impact of register pressure. The FFT is a good tool for performance analysis because it is used in many digital signal processing applications and has a good balance between computational cost and memory bandwidth requirements.
20.
The PRAM model of parallel computation is examined with respect to wordsize, the number of bits which can be held in each global memory cell. First, adversary arguments are used to show the incomparability of certain machines which store the same amount of global information but which differ in wordsize. Next, for machines with infinitely many memory cells, a counting argument is used to show a large lower bound and to separate a hierarchy of machine classes based on wordsize. Finally, an efficient simulation by boolean circuits is used to give a simple new proof of the tight Ω((log n)/(log log n)) time bound for
on small-wordsize machines. Overall the results suggest that, in some circumstances, the memory wordsize is a more significant resource than the write resolution rule, number of memory cells, or number of processors.