Analysis of the information sources cited in the references of the Chinese-language Wikipedia article "并行编程模型" ("Parallel programming model").
Erlang is conceptually similar to the occam programming language, though it recasts the ideas of CSP in a functional framework and uses asynchronous message passing instead of the synchronous message passing in CSP.
Apache Giraph is an iterative graph processing framework, built on top of Apache Hadoop. The input to a Giraph computation is a graph composed of vertices and directed edges, …… Computation proceeds as a sequence of iterations, called supersteps in BSP. …… There is a barrier between consecutive supersteps.
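To make the superstep/barrier structure concrete, here is a minimal C++20 sketch (ours, not Giraph's API; the worker count and names are illustrative): each worker computes on its partition, and no worker enters superstep s+1 until every worker has finished superstep s.

// Minimal BSP-style superstep loop (illustrative; not Giraph's API).
#include <barrier>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int workers = 4, supersteps = 3;
    std::barrier sync(workers);              // barrier shared by all workers

    auto worker = [&](int id) {
        for (int s = 0; s < supersteps; ++s) {
            std::printf("worker %d computing in superstep %d\n", id, s);
            sync.arrive_and_wait();          // barrier between consecutive supersteps
        }
    };

    std::vector<std::jthread> pool;          // jthreads join on destruction
    for (int i = 0; i < workers; ++i) pool.emplace_back(worker, i);
}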
Apache Hama (TM) is a framework for Big Data analytics which uses the Bulk Synchronous Parallel (BSP) computing model, …… It provides not only a pure BSP programming model but also vertex- and neuron-centric programming models, inspired by Google's Pregel and DistBelief.
CUDA stands for Compute Unified Device Architecture and is a new hardware and software architecture for issuing and managing computations on the GPU as a data-parallel computing device without the need of mapping them to a graphics API.
OpenCL, a C-based open standard, allows developers to tap the incredible power of the graphics processing unit for tasks that go beyond graphics.
Note: OpenCL drivers have been included in all publicly available NVIDIA drivers since October 2009. The CUDA Toolkit now includes the Visual Profiler for OpenCL as well as the OpenCL Programming Guide, Best Practices Guide, and other developer documentation.
MPI is fully compatible with CUDA, CUDA Fortran, and OpenACC, all of which are designed for parallel computing on a single computer or node.
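As a hedged sketch of what that compatibility buys in practice: with a CUDA-aware MPI build, a device pointer can be handed to MPI directly (otherwise the data must first be staged through host memory). The buffer size and ranks below are illustrative, and error handling is omitted for brevity.

// Sketch of the CUDA + MPI pattern, assuming a CUDA-aware MPI build.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float* d_buf;                        // buffer lives in GPU memory
    cudaMalloc(&d_buf, n * sizeof(float));

    if (rank == 0)                       // device pointer handed to MPI directly
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
}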
NVSHMEM implements the OpenSHMEM standard for GPU memory, with extensions for improved performance on GPUs.
Like many other communications libraries, BSPlib adopts a Single Program Multiple Data (SPMD) programming model. The task of writing an SPMD program will typically involve mapping a problem that manipulates a data structure of size N into p instances of a program that each manipulate an N/p sized block of the original domain. The role of BSPlib is to provide the infrastructure required for the user to take care of the data distribution, and any implied communication necessary to manipulate parts of the data structure that are on a remote process.
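A minimal sketch of that SPMD pattern using the classic BSPlib primitives (bsp_begin, bsp_pid, bsp_nprocs, bsp_sync); the N/p block arithmetic is our illustration, and N is assumed divisible by p.

// Every process executes this same program (SPMD); its pid selects its block.
#include <bsp.h>

#define N 1024

int main() {
    bsp_begin(bsp_nprocs());        // start the SPMD part on all available processes
    int p   = bsp_nprocs();
    int pid = bsp_pid();
    int lo  = pid * (N / p);        // this process owns the block [lo, hi)
    int hi  = lo + (N / p);
    (void)lo; (void)hi;             // ... manipulate the local N/p-sized block here ...
    bsp_sync();                     // barrier ending the superstep
    bsp_end();
    return 0;
}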
An important advance in parallel computing was the introduction of the notion of virtual models. …… Virtual models can be taken a step further and used to define performance in more abstract measures than just running time on a particular machine. A pair of such measures are work and depth: Work is defined as the total number of operations executed by a computation, and depth is defined as the longest chain of sequential dependencies in the computation.
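A standard worked example (ours, not the quoted source's): summing n numbers by a balanced binary tree of pairwise additions.

% Work and depth for tree-based summation of n numbers:
W(n) = n - 1  \qquad \text{(one addition per internal node of the tree)}
D(n) = \lceil \log_2 n \rceil  \qquad \text{(longest chain of dependent additions)}
% Brent's theorem then bounds the running time on p processors by both measures:
T_p \le \frac{W(n)}{p} + D(n)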
The barebones PRAM model is low-level and cumbersome, and writing anything other than trivial algorithms in this model is a nightmare. We will therefore switch to an equivalent but higher-level abstraction called the Work-Time (WT) paradigm to be independent of these details. …… In our algorithmic notation, we will use the forall construct to denote such concurrent operations, and we drop explicit mention of the processor id and p, the number of processors. In fact the forall construct is the only construct that distinguishes a WT algorithm from a sequential algorithm.
For concreteness, we proceed to an example of a PRAM algorithm. However, before doing this we present the pardo "programming construct", which is heavily used in these notes to express operations that are performed in parallel:
for Pi, 1 ≤ i ≤ n pardo
A(i) := B(i)
This means that the following n operations are performed concurrently: processor P1 assigns B(1) into A(1), processor P2 assigns B(2) into A(2), and so on. …… An alternative model which is actually an alternative presentation mode, called Work-Depth, …… Strictly speaking, WD actually defines a slightly different model of computation. Consider an instruction of the form
for i, 1 ≤ i ≤ α pardo
A(i) := B(C(i))
where the time unit under consideration consists of a sequence of α concurrent instructions, for some positive integer α.
2) The single-instruction stream–multiple-data stream (SIMD), which includes most array processors, including Solomon [2] (Illiac IV).
3) Multiple-instruction stream–single-data stream type organizations (MISD), which include specialized streaming organizations using multiple-instruction streams on a single sequence of data and the derivatives thereof. The plug-board machines of a bygone era were a degenerate form of MISD wherein the instruction streams were single instructions, and a derived datum (SD) passed from program step i to program step i+1 (MI).
4) Multiple-instruction stream–multiple-data stream (MIMD), which includes organizations referred to as "multiprocessor." Univac [3], among other corporations, has proposed various MIMD structures.
The following overview lists more detailed questions and summarizes our findings:
Is MAP essentially the map combinator? NO.
Is MAP essentially an application of the map combinator? NO.
Does MAP essentially serve as the argument of map? YES.
Is REDUCE essentially the reduce combinator? NO.
Is REDUCE essentially an application of the reduce combinator? TYPICALLY.
Does REDUCE essentially serve as the argument of reduce? NO.
Does REDUCE essentially serve as the argument of map? YES.
Hence, there is no trivial correspondence between MapReduce's MAP & REDUCE and what is normally called map & reduce in functional programming.
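A self-contained word-count sketch in C++ (ours, not Google's code; identifiers are illustrative) makes the findings concrete: the user's MAP is the function handed to a map over records, and the user's REDUCE, itself typically an application of a reduce/fold, is the function handed to a map over key groups.

#include <map>
#include <numeric>
#include <sstream>
#include <string>
#include <vector>

using KV = std::pair<std::string, int>;

// User-supplied MAP: one record -> intermediate key/value pairs.
std::vector<KV> MAP(const std::string& record) {
    std::vector<KV> out;
    std::istringstream in(record);
    for (std::string w; in >> w;) out.push_back({w, 1});
    return out;
}

// User-supplied REDUCE: typically an application of the reduce combinator.
int REDUCE(const std::vector<int>& vals) {
    return std::accumulate(vals.begin(), vals.end(), 0);
}

std::map<std::string, int> run(const std::vector<std::string>& records) {
    std::map<std::string, std::vector<int>> groups;
    for (const auto& r : records)              // MAP serves as the argument of a
        for (const auto& [k, v] : MAP(r))      // map over all records
            groups[k].push_back(v);
    std::map<std::string, int> result;
    for (const auto& [k, vs] : groups)         // REDUCE serves as the argument of a
        result[k] = REDUCE(vs);                // map over all key groups
    return result;
}

int main() { auto counts = run({"a b a", "b c"}); return counts["a"] == 2 ? 0 : 1; }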
The serial array-exchange algorithm iterates the standard exchange algorithm n times. Its pseudo-code follows.
For i = 0 to n−1 do
    X := A(i); A(i) := B(i); B(i) := X
…… A parallel array exchange algorithm uses an auxiliary array X[0..n−1] of size n; the parallel algorithm concurrently applies the iterations of the above serial algorithm, each exchanging A(i) with B(i) for a different value of i. Note the new pardo command in the following pseudo-code.
For i = 0 to n−1 pardo
    X(i) := A(i); A(i) := B(i); B(i) := X(i)
…… Consider the following example of a small XMTC program for the parallel exchange algorithm of the previous section:
spawn(0, n−1) {
    var x
    x := A($); A($) := B($); B($) := x
}
The program simply spawns a concurrent thread for each of the depth-3 serial exchange iterations, using a local variable x. Note that the join command is implicit, implied by the closing brace at the end of the above program. …… Commitments to silicon of XMT include a 64-processor, 75MHz computer based on field-programmable gate array (FPGA) technology [29], and a 64-processor, 10mm × 10mm ASIC chip using IBM's 90nm technology, pictured in Figure 5. A basic yet stable compiler has also been developed. …… XMT is easy to build. A single graduate student, with no prior design experience, completed the XMT hardware description (in Verilog) in just over two years.
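For comparison only (this is standard C++17, not XMTC), the same parallel exchange can be written with a parallel for_each: one logical thread per index, with the local x hidden inside the swap.

#include <algorithm>
#include <execution>
#include <numeric>
#include <utility>
#include <vector>

int main() {
    std::vector<int> A(1000, 1), B(1000, 2);
    std::vector<int> idx(A.size());
    std::iota(idx.begin(), idx.end(), 0);                  // 0, 1, ..., n−1

    std::for_each(std::execution::par, idx.begin(), idx.end(),
                  [&](int i) { std::swap(A[i], B[i]); });  // one exchange per index
}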
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.
SINGLE PROGRAM: All tasks execute their copy of the same program simultaneously. This program can be threads, message passing, data parallel or hybrid.
MULTIPLE PROGRAM: Tasks may execute different programs simultaneously. The programs can be threads, message passing, data parallel or hybrid. …… MPMD applications are not as common as SPMD applications, but may be better suited for certain types of problems, particularly those that lend themselves better to functional decomposition than domain decomposition (discussed later under Partitioning).
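A minimal SPMD illustration with MPI (ours; the work partitioning is notional): every task launches the same executable and differs only in the rank it reads, whereas MPMD would launch different executables side by side.

// Same program on every task (SPMD); the rank selects this task's work.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::printf("task %d of %d working on its own partition\n", rank, size);

    MPI_Finalize();
}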
There is another related classification used especially by GPU vendors: Single Instruction, Multiple Threads (SIMT). This corresponds to a tiled SIMD architecture consisting of multiple SIMD processors, where each SIMD processor emulates multiple “threads” (fibers in our terminology) using masking. SIMT processors may appear to have thousands of threads, but in fact blocks of these share a control processor, and divergent control flow can significantly reduce efficiency within a block. On the other hand, synchronization between fibers is basically free, because when control flow is emulated with masking the fibers are always running synchronously.
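A scalar C++ sketch of the masking the passage above describes (ours; the width W and the branch are illustrative): both sides of a divergent branch are stepped through for the whole block, with a mask deciding which fibers commit results, which is exactly why divergence costs efficiency.

#include <array>

int main() {
    constexpr int W = 8;                        // fibers sharing one control unit
    std::array<int, W> x{}, y{};
    std::array<bool, W> mask;

    for (int i = 0; i < W; ++i) mask[i] = (i % 2 == 0);         // "if (cond)"

    for (int i = 0; i < W; ++i) if (mask[i])  y[i] = x[i] + 1;  // then-side pass
    for (int i = 0; i < W; ++i) if (!mask[i]) y[i] = x[i] - 1;  // else-side pass
    // Both passes execute no matter how many fibers take each side; fibers
    // stay in lockstep, so synchronization between them is essentially free.
}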
On architectures prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another.
The Programming Models and Languages team is focused on developing the OpenSHMEM programming model for extreme-scale systems. …… Currently, the team partners with NVIDIA; the University of Tennessee, Knoxville; Florida State University; and ParaTools. …… UCX provides communication interfaces and protocols for efficiently implementing parallel programming models such as MPI, OpenSHMEM, and task-based models.
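A minimal OpenSHMEM sketch (the calls are standard OpenSHMEM 1.x API; the ring exchange itself is our illustration): each PE writes one integer into its right neighbour's symmetric memory with a one-sided put.

#include <shmem.h>
#include <cstdio>

int main() {
    shmem_init();
    int me = shmem_my_pe();
    int np = shmem_n_pes();

    int* slot = static_cast<int*>(shmem_malloc(sizeof(int)));  // symmetric allocation
    int  msg  = me;

    shmem_int_put(slot, &msg, 1, (me + 1) % np);  // one-sided write, no matching recv
    shmem_barrier_all();                          // wait until all puts have landed

    std::printf("PE %d received %d\n", me, *slot);
    shmem_free(slot);
    shmem_finalize();
    return 0;
}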
Brook is an extension of standard ANSI C and is designed to incorporate the ideas of data parallel computing and arithmetic intensity into a familiar, efficient language. The general computational model, referred to as streaming, …… A stream is a new data type addition which represents a collection of data which can be operated on in parallel. …… Kernels are special functions that operate on streams. A kernel is a parallel function applied to every element of the input streams.
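The stream/kernel idea restated in plain C++ rather than Brook syntax (our sketch): the kernel is an ordinary per-element function, and applying it across whole streams is what exposes the parallelism.

#include <algorithm>
#include <execution>
#include <vector>

// The "kernel": pure, per-element, so all elements may run in parallel.
float saxpy_kernel(float a, float x, float y) { return a * x + y; }

int main() {
    std::vector<float> xs(1000, 1.0f), ys(1000, 2.0f), out(1000);  // the "streams"
    std::transform(std::execution::par, xs.begin(), xs.end(), ys.begin(),
                   out.begin(),
                   [](float x, float y) { return saxpy_kernel(2.0f, x, y); });
}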
Threading Building Blocks (TBB) supports fork–join with a work-stealing load balancer as its basic model. In contrast with Cilk Plus, TBB is a portable ISO C++ library, not a compiler extension. Because TBB is not integrated into the compiler, its fork–join implementation is not quite as efficient as Cilk Plus, and it cannot directly generate vectorized code. However, TBB also provides implementations of several patterns not available directly in Cilk Plus, such as pipelines.
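A small fork-join sketch against TBB's parallel_invoke API (recursive Fibonacci is the stock illustration; a serial cutoff that real code would add is omitted for brevity): each call forks two subtasks that the work-stealing scheduler balances, and returns only after both join.

#include <tbb/parallel_invoke.h>

long fib(long n) {
    if (n < 2) return n;                            // serial base case
    long a = 0, b = 0;
    tbb::parallel_invoke([&] { a = fib(n - 1); },   // fork two subtasks...
                         [&] { b = fib(n - 2); });  // ...join when both finish
    return a + b;
}

int main() { return fib(20) == 6765 ? 0 : 1; }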
This paper envisions an extension to a standard instruction set which efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP); that is, Explicit Multi-Threading (XMT), a fine-grained computational paradigm. …… We actually suggest two alternative bridging models: (1) Spawn-based multi-threading (Spawn-MT): This is based on (asynchronous, or synchronous) nesting-free Spawn-Join commands. This is the main model for the presentation in this paper. (2) Elastic multi-threading (EMT): This more involved model concept enables nesting of Spawn-Join commands (similar to NESL) and relies on the ability of the hardware to suspend, and later reactivate, threads.