并行编程模型 (Chinese Wikipedia)

Analysis of the information sources cited in the references of the Chinese-language Wikipedia article "并行编程模型" (parallel programming model).

Website rankings (Global rank / Chinese rank):

1st place / 1st place
2nd place / 23rd place
6,750th place / 4,362nd place
low place / low place
1,747th place / 1,773rd place
2,503rd place / 1,088th place
3,206th place / 2,374th place
6,358th place / 4,641st place
low place / low place
low place / low place
243rd place / 550th place
9,195th place / low place
580th place / 642nd place
2,481st place / 4,329th place
7,240th place / 8,700th place
2,446th place / 2,114th place
1,564th place / 1,640th place
1,067th place / 1,375th place
916th place / 1,065th place
low place / low place
383rd place / 336th place
5,534th place / 3,178th place
low place / low place
1,185th place / 809th place
67th place / 154th place
low place / low place
179th place / 275th place
9,657th place / 6,478th place
4th place / 5th place
18th place / 57th place
4,147th place / 2,263rd place
low place / low place
low place / low place

acm.org

dl.acm.org

anl.gov

mcs.anl.gov

apache.org

giraph.apache.org

  • Introduction to Apache Giraph. [2019-12-09]. (Archived from the original on 2019-12-19). Apache Giraph is an iterative graph processing framework, built on top of Apache Hadoop. The input to a Giraph computation is a graph composed of vertices and directed edges, … Computation proceeds as a sequence of iterations, called supersteps in BSP. … There is a barrier between consecutive supersteps.
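The following is a minimal, single-process C++ sketch of the vertex-centric BSP pattern described in this excerpt: in each superstep every vertex consumes the messages sent during the previous superstep, updates its value, and sends messages along its out-edges, with a barrier separating consecutive supersteps. The graph, message format, and compute logic are illustrative assumptions; this is not Giraph's actual (Java) API.

    #include <cstdio>
    #include <utility>
    #include <vector>

    struct Vertex {
        double value;
        std::vector<int> out_edges;   // indices of neighbour vertices
    };

    int main() {
        // Hypothetical 3-vertex ring: 0 -> 1 -> 2 -> 0.
        std::vector<Vertex> g = {{1.0, {1}}, {2.0, {2}}, {3.0, {0}}};
        std::vector<std::vector<double>> inbox(g.size()), outbox(g.size());

        for (int superstep = 0; superstep < 3; ++superstep) {
            for (std::size_t v = 0; v < g.size(); ++v) {
                // Consume messages delivered at the end of the previous superstep.
                for (double m : inbox[v]) g[v].value += m;
                inbox[v].clear();
                // Send the current value to every out-neighbour.
                for (int dst : g[v].out_edges) outbox[dst].push_back(g[v].value);
            }
            std::swap(inbox, outbox);   // barrier: messages become visible next superstep
        }
        for (std::size_t v = 0; v < g.size(); ++v)
            std::printf("vertex %zu: %f\n", v, g[v].value);
    }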

hama.apache.org

  • Apache Hama. [2019-12-09]. (Archived from the original on 2019-12-18). Apache Hama(TM) is a framework for Big Data analytics which uses the Bulk Synchronous Parallel (BSP) computing model, … It provides not only pure BSP programming model but also vertex and neuron centric programming models, inspired by Google's Pregel and DistBelief.

hadoop.apache.org

apple.com

  • CUDA Programming Guide 1.0 (PDF). 6/23/2007 [2019-12-09]. (Archived (PDF) from the original on 2018-04-17). CUDA stands for Compute Unified Device Architecture and is a new hardware and software architecture for issuing and managing computations on the GPU as a data-parallel computing device without the need of mapping them to a graphics API. 
    Apple to Ship Mac OS X Snow Leopard on August 28. August 24, 2009 [2019-12-09]. (Archived from the original on 2019-12-09). OpenCL, a C-based open standard, allows developers to tap the incredible power of the graphics processing unit for tasks that go beyond graphics. 
    September 2009 OpenCL Public Downloads. [2019-12-09]. (Archived from the original on 2019-12-09). Note: OpenCL drivers have been included in all publicly available NVIDIA drivers since October 2009. The CUDA Toolkit now includes the Visual Profiler for OpenCL as well as the OpenCL Programming Guide, Best Practices Guide, and other developer documentation. 
    MPI Solutions for GPUs. [2019-12-09]. (Archived from the original on 2019-12-09). MPI is fully compatible with CUDA, CUDA Fortran, and OpenACC, all of which are designed for parallel computing on a single computer or node. 
    OpenSHMEM Communication on GPUs. [2019-12-09]. (Archived from the original on 2019-12-09). NVSHMEM implements the OpenSHMEM standard for GPU memory, with extensions for improved performance on GPUs.
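As a concrete anchor for the MPI entry above, here is a minimal MPI point-to-point sketch in C++ using ordinary host buffers; a CUDA-aware MPI, as described in the excerpt, would additionally accept GPU device pointers in the same calls, which is omitted here.

    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        std::vector<double> buf(4, static_cast<double>(rank));   // each rank's local data
        if (rank == 0 && size > 1) {
            // Rank 0 sends its block to rank 1 (message-passing model: explicit send/recv).
            MPI_Send(buf.data(), static_cast<int>(buf.size()), MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf.data(), static_cast<int>(buf.size()), MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            std::printf("rank 1 received %f from rank 0\n", buf[0]);
        }
        MPI_Finalize();
        return 0;
    }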

berkeley.edu

people.eecs.berkeley.edu

brown.edu

cs.brown.edu

bsp-worldwide.org

  • Jonathan Hill, Bill McColl, Dan Stefanescu, Mark Goudreau, Kevin Lang, Satish Rao, Torsten Suel, Thanasis Tsantilas, and Rob Bisseling. BSP Programming Library (PDF). Parallel Computing. December 1998, 24 (14): 1947–1980 [2019-12-05]. (Archived (PDF) from the original on 2019-11-04). Like many other communications libraries, BSPlib adopts a Single Program Multiple Data (SPMD) programming model. The task of writing an SPMD program will typically involve mapping a problem that manipulates a data structure of size N into p instances of a program that each manipulate an N/p sized block of the original domain. The role of BSPlib is to provide the infrastructure required for the user to take care of the data distribution, and any implied communication necessary to manipulate parts of the data structure that are on a remote process. 
  • BSPlib (archived at the Wayback Machine)
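To make the SPMD pattern from the BSPlib excerpt concrete, here is a small sketch in the style of the BSPlib C interface, callable from C++. The function names (bsp_init, bsp_begin, bsp_pid, bsp_nprocs, bsp_sync, bsp_end) follow the paper cited above, but header names and exact signatures vary between implementations (Oxford BSP toolset, MulticoreBSP), so treat this as a sketch rather than a drop-in program.

    #include <bsp.h>
    #include <cstdio>

    static const int N = 1024;           // global problem size, split into N/p blocks

    void spmd_part() {
        bsp_begin(bsp_nprocs());
        int p   = bsp_nprocs();          // number of program instances
        int pid = bsp_pid();             // this instance's id
        int block = N / p;               // each instance owns an N/p block

        double local_sum = 0.0;
        for (int i = 0; i < block; ++i)
            local_sum += pid * block + i;    // stand-in for real work on the local block

        bsp_sync();                      // superstep barrier
        std::printf("process %d of %d: partial result %f\n", pid, p, local_sum);
        bsp_end();
    }

    int main(int argc, char** argv) {
        bsp_init(spmd_part, argc, argv); // register the SPMD entry point
        spmd_part();
        return 0;
    }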

cmu.edu

cs.cmu.edu

  • Blelloch, Guy. Programming Parallel Algorithms (PDF). Communications of the ACM. 1996, 39 (3): 85–97 [2019-12-12]. doi:10.1145/227234.227246. (Archived (PDF) from the original on 2016-03-04). An important advance in parallel computing was the introduction of the notion of virtual models. … Virtual models can be taken a step further and used to define performance in more abstract measures than just running time on a particular machine. A pair of such measures are work and depth: Work is defined as the total number of operations executed by a computation, and depth is defined as the longest chain of sequential dependencies in the computation. 
    Siddhartha Chatterjee, Jan Prins. Parallel and Distributed Computing PRAM Algorithms (PDF). Spring 2002 [2019-12-06]. (Archived (PDF) from the original on 2019-12-06). The barebones PRAM model is low-level and cumbersome, and writing anything other than trivial algorithms in this model is a nightmare. We will therefore switch to an equivalent but higher-level abstraction called the Work-Time (WT) paradigm to be independent of these details. … In our algorithmic notation, we will use the forall construct to denote such concurrent operations, and we drop explicit mention of the processor id and p, the number of processors. In fact the forall construct is the only construct that distinguishes a WT algorithm from a sequential algorithm. 
    Vishkin, Uzi, Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques, 104 pages (PDF), Class notes of courses on parallel algorithms taught since 1992 at the University of Maryland, College Park, Tel Aviv University and the Technion, 2009 [2019-12-07], (Archived (PDF) from the original on 2017-12-15), For concreteness, we proceed to an example of a PRAM algorithm. However, before doing this we present the pardo "programming construct", which is heavily used in these notes to express operations that are performed in parallel:
        for Pi, 1 ≤ i ≤ n pardo
            A(i) := B(i)
    This means that the following n operations are performed concurrently: processor P1 assigns B(1) into A(1), processor P2 assigns B(2) into A(2), and so on. … An alternative model, which is actually an alternative presentation mode, called Work-Depth, … Strictly speaking, WD actually defines a slightly different model of computation. Consider an instruction of the form
        for i, 1 ≤ i ≤ α pardo
            A(i) := B(C(i))
    where the time unit under consideration consists of a sequence of α concurrent instructions, for some positive integer α.
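The work/depth distinction drawn in these excerpts can be made concrete with a small example. A divide-and-conquer sum of n numbers performs Θ(n) total operations (work) but has only Θ(log n) sequential dependencies (depth), because the two halves are independent; the std::async call below is just one way to express in C++ that the halves may run concurrently, and the cutoff of 65536 is an arbitrary choice.

    #include <cstdio>
    #include <future>
    #include <numeric>
    #include <vector>

    // Work: Theta(n) additions in total. Depth: Theta(log n), the chain of
    // recursive splits plus the final combining additions.
    double parallel_sum(const double* a, std::size_t n) {
        if (n <= 65536)                          // small case: plain sequential sum
            return std::accumulate(a, a + n, 0.0);
        std::size_t half = n / 2;
        auto left = std::async(std::launch::async, parallel_sum, a, half);
        double right = parallel_sum(a + half, n - half);   // other half in this thread
        return left.get() + right;               // join: the sequential dependency
    }

    int main() {
        std::vector<double> v(1 << 20, 1.0);
        std::printf("sum = %f\n", parallel_sum(v.data(), v.size()));
    }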
     

doi.org

  • Gropp, William; Lusk, Ewing; Skjellum, Anthony. A High-Performance, Portable Implementation of the MPI Message Passing Interface. Parallel Computing. 1996, 22 (6): 789–828. doi:10.1016/0167-8191(96)00024-5. 
  • Sur, Sayantan; Koop, Matthew J.; Panda, Dhabaleswar K. MPI and communication---High-performance and scalable MPI over InfiniBand with reduced memory usage. High-Performance and Scalable MPI over InfiniBand with Reduced Memory Usage: An In-depth Performance Analysis. ACM. 4 August 2017: 105. ISBN 978-0769527000. doi:10.1145/1188455.1188565. 
  • Flynn, Michael J. Some Computer Organizations and Their Effectiveness (PDF). IEEE Transactions on Computers. September 1972, C–21 (9): 948–960 [2019-12-05]. doi:10.1109/TC.1972.5009071. (Archived (PDF) from the original on 2022-04-11).
    2) The single-instruction stream-multiple-data stream (SIMD), which includes most array processes, including Solomon [2] (Illiac IV).
    3) Multiple-instruction stream-single-data stream type organizations (MISD), which include specialized streaming organizations using multiple-instruction streams on a single sequence of data and the derivatives thereof. The plug-board machines of a bygone era were a degenerate form of MISD wherein the instruction streams were single instructions, and a derived datum (SD) passed from program step i to program step i+1 (MI).
    4) Multiple-instruction stream-multiple-data stream (MIMD), which include organizations referred to as "multiprocessor." Univac [3], among other corporations, has proposed various MIMD structures.
     
  • Blelloch, Guy. Programming Parallel Algorithms (PDF). Communications of the ACM. 1996, 39 (3): 85–97 [2019-12-12]. doi:10.1145/227234.227246. (Archived (PDF) from the original on 2016-03-04). An important advance in parallel computing was the introduction of the notion of virtual models. … Virtual models can be taken a step further and used to define performance in more abstract measures than just running time on a particular machine. A pair of such measures are work and depth: Work is defined as the total number of operations executed by a computation, and depth is defined as the longest chain of sequential dependencies in the computation. 
    Siddhartha Chatterjee, Jan Prins. Parallel and Distributed Computing PRAM Algorithms (PDF). Spring 2002 [2019-12-06]. (Archived (PDF) from the original on 2019-12-06). The barebones PRAM model is low-level and cumbersome, and writing anything other than trivial algorithms in this model is a nightmare. We will therefore switch to an equivalent but higher-level abstraction called the Work-Time (WT) paradigm to be independent of these details. … In our algorithmic notation, we will use the forall construct to denote such concurrent operations, and we drop explicit mention of the processor id and p, the number of processors. In fact the forall construct is the only construct that distinguishes a WT algorithm from a sequential algorithm. 
    Vishkin, Uzi, Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques, 104 pages (PDF), Class notes of courses on parallel algorithms taught since 1992 at the University of Maryland, College Park, Tel Aviv University and the Technion, 2009 [2019-12-07], (Archived (PDF) from the original on 2017-12-15), For concreteness, we proceed to an example of a PRAM algorithm. However, before doing this we present the pardo "programming construct", which is heavily used in these notes to express operations that are performed in parallel:
        for Pi, 1 ≤ i ≤ n pardo
            A(i) := B(i)
    This means that the following n operations are performed concurrently: processor P1 assigns B(1) into A(1), processor P2 assigns B(2) into A(2), and so on. … An alternative model, which is actually an alternative presentation mode, called Work-Depth, … Strictly speaking, WD actually defines a slightly different model of computation. Consider an instruction of the form
        for i, 1 ≤ i ≤ α pardo
            A(i) := B(C(i))
    where the time unit under consideration consists of a sequence of α concurrent instructions, for some positive integer α.
     
  • Armstrong, Joe. Erlang. Communications of the ACM. September 2010, 53 (9): 68–75 [2020-05-07]. doi:10.1145/1810891.1810910. (Archived from the original on 2020-06-09). Erlang is conceptually similar to the occam programming language, though it recasts the ideas of CSP in a functional framework and uses asynchronous message passing instead of the synchronous message passing in CSP. 
  • Stone, John E.; Gohara, David; Shi, Guochin. OpenCL: a parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering. 2010, 12 (3): 66–73. Bibcode:2010CSE....12c..66S. PMC 2964860 (free full text). PMID 21037981. doi:10.1109/MCSE.2010.69. 
  • Lämmel, R. Google's MapReduce programming model — Revisited (PDF). Science of Computer Programming. 2008, 70: 1–30 [2019-12-07]. doi:10.1016/j.scico.2007.07.001. (Archived (PDF) from the original on 2019-12-07). The following overview lists more detailed questions and summarizes our findings:
    Is MAP essentially the map combinator? NO.
    Is MAP essentially an application of the map combinator? NO.
    Does MAP essentially serve as the argument of map? YES.
    Is REDUCE essentially the reduce combinator? NO.
    Is REDUCE essentially an application of the reduce combinator? TYPICALLY.
    Does REDUCE essentially serve as the argument of reduce? NO.
    Does REDUCE essentially serve as the argument of map? YES.
    Hence, there is no trivial correspondence between MapReduce's MAP & REDUCE and what is normally called map & reduce in functional programming.
     
  • Vishkin, Uzi, Using simple abstraction to reinvent computing for parallelism (PDF), Communications of the ACM, 2011, 54: 75–85, doi:10.1145/1866739.1866757, The array exchange serial algorithm serially iterates the standard exchange algorithm n times. Its pseudo-code follows.
    For i = 0 to n−1 do
        X := A(i); A(i) := B(i); B(i) := X
    … A parallel array exchange algorithm uses an auxiliary array X[0..n-1] of size n; the parallel algorithm applies concurrently the iterations of the above serial algorithm, each exchanging A(i) with B(i) for a different value of i. Note the new pardo command in the following pseudo-code.
    For i = 0 to n−1 pardo
        X(i) := A(i); A(i) := B(i); B(i) := X(i)
    … Consider the following example of a small XMTC program for the parallel exchange algorithm of the previous section:
    spawn(0, n−1) {
        var x
        x := A($); A($) := B($); B($) := x
    }
    The program simply spawns a concurrent thread for each of the depth-3 serial exchange iterations, using a local variable x. Note that the join command is implicit, and implied by the right parenthesis at the end of the above program. … Commitments to silicon of XMT include a 64-processor, 75 MHz computer based on field-programmable gate array (FPGA) technology [29], and a 64-processor ASIC 10 mm × 10 mm chip using IBM's 90 nm technology, pictured in Figure 5. A basic yet stable compiler has also been developed. … XMT is easy to build. A single graduate student, with no prior design experience, completed the XMT hardware description (in Verilog) in just over 2 years.
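For comparison with the XMTC spawn above, here is a plain C++ analogue of the parallel array exchange: each iteration of the serial exchange is independent, so all of them can run concurrently. std::thread stands in for spawn/pardo purely for illustration; a realistic program would use a thread pool or a parallel loop rather than one thread per element.

    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        const int n = 8;
        std::vector<int> A(n), B(n);
        for (int i = 0; i < n; ++i) { A[i] = i; B[i] = 100 + i; }

        std::vector<std::thread> workers;
        for (int i = 0; i < n; ++i)
            workers.emplace_back([&A, &B, i] {
                int x = A[i];      // per-thread local, like XMTC's "var x"
                A[i] = B[i];
                B[i] = x;
            });
        for (auto& t : workers) t.join();   // explicit join; implicit in XMTC

        for (int i = 0; i < n; ++i)
            std::printf("A[%d]=%d B[%d]=%d\n", i, A[i], i, B[i]);
    }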
     

github.com

googleusercontent.com

static.googleusercontent.com

  • MapReduce: Simplified Data Processing on Large Clusters (PDF). googleusercontent.com. [2019-12-05]. (Archived (PDF) from the original on 2020-06-15). MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. 
  • MapReduce: Simplified Data Processing on Large Clusters (PDF). googleusercontent.com. [2019-12-05]. (Archived (PDF) from the original on 2020-06-15). Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.
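The model described in these excerpts can be illustrated with a single-process C++ sketch: a user-supplied map function emits intermediate key/value pairs from each input record, the framework groups values by key, and a user-supplied reduce function merges the values for each key. The word-count task and the function names are illustrative assumptions; this is not the Google or Hadoop API, and the distribution, partitioning, and fault-tolerance machinery is omitted.

    #include <cstdio>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    // User-supplied map: document -> list of (word, 1) intermediate pairs.
    std::vector<std::pair<std::string, int>> map_fn(const std::string& doc) {
        std::vector<std::pair<std::string, int>> out;
        std::istringstream in(doc);
        for (std::string w; in >> w; ) out.emplace_back(w, 1);
        return out;
    }

    // User-supplied reduce: merges all values that share one key.
    int reduce_fn(const std::string& /*key*/, const std::vector<int>& counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    int main() {
        std::vector<std::string> docs = {"to be or not to be", "to parallelize or not"};
        std::map<std::string, std::vector<int>> groups;      // the "shuffle": group by key
        for (const auto& d : docs)
            for (const auto& kv : map_fn(d)) groups[kv.first].push_back(kv.second);
        for (const auto& g : groups)
            std::printf("%s: %d\n", g.first.c_str(), reduce_fn(g.first, g.second));
    }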

harvard.edu

ui.adsabs.harvard.edu

khronos.org

liu.se

ida.liu.se

llnl.gov

computing.llnl.gov

multicorebsp.com

nih.gov

ncbi.nlm.nih.gov

nvidia.com

docs.nvidia.com

  • Michael McCool; James Reinders; Arch Robison. 2.4.3 Flynn's Characterization. Structured Parallel Programming: Patterns for Efficient Computation (PDF). Elsevier. 2013: 52 [2019-12-12]. (Archived (PDF) from the original on 2018-11-23). There is another related classification used especially by GPU vendors: Single Instruction, Multiple Threads (SIMT). This corresponds to a tiled SIMD architecture consisting of multiple SIMD processors, where each SIMD processor emulates multiple "threads" (fibers in our terminology) using masking. SIMT processors may appear to have thousands of threads, but in fact blocks of these share a control processor, and divergent control flow can significantly reduce efficiency within a block. On the other hand, synchronization between fibers is basically free, because when control flow is emulated with masking the fibers are always running synchronously. 
    Parallel Thread Execution ISA Version 6.5. https://docs.nvidia.com/. [2019-12-18]. (Archived from the original on 2019-12-01).
    On architectures prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
    Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another.
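The masking behaviour these excerpts describe can be sketched in plain C++: a SIMT block steps through both sides of a divergent branch, and a per-lane active mask decides which lanes commit results on each side. This is a scalar emulation for illustration only, not real GPU code; the warp width of 8 is an arbitrary assumption.

    #include <array>
    #include <cstdio>

    int main() {
        constexpr int kWarp = 8;                       // illustrative warp width
        std::array<int, kWarp> x = {1, -2, 3, -4, 5, -6, 7, -8};
        std::array<int, kWarp> y{};
        std::array<bool, kWarp> mask{};

        // Branch condition, evaluated per lane.
        for (int lane = 0; lane < kWarp; ++lane) mask[lane] = (x[lane] > 0);

        // "if" side: only lanes with the mask set commit results; the rest idle.
        for (int lane = 0; lane < kWarp; ++lane)
            if (mask[lane]) y[lane] = 2 * x[lane];

        // "else" side: the mask is inverted. Both sides are stepped through even
        // though each lane needs only one, which is the efficiency loss from
        // divergent control flow within a block.
        for (int lane = 0; lane < kWarp; ++lane)
            if (!mask[lane]) y[lane] = -x[lane];

        for (int lane = 0; lane < kWarp; ++lane) std::printf("%d ", y[lane]);
        std::printf("\n");
    }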
     

developer.download.nvidia.com

  • CUDA Programming Guide 1.0 (PDF). 6/23/2007 [2019-12-09]. (Archived (PDF) from the original on 2018-04-17). CUDA stands for Compute Unified Device Architecture and is a new hardware and software architecture for issuing and managing computations on the GPU as a data-parallel computing device without the need of mapping them to a graphics API. 
    Apple to Ship Mac OS X Snow Leopard on August 28. August 24, 2009 [2019-12-09]. (Archived from the original on 2019-12-09). OpenCL, a C-based open standard, allows developers to tap the incredible power of the graphics processing unit for tasks that go beyond graphics. 
    September 2009 OpenCL Public Downloads. [2019-12-09]. (Archived from the original on 2019-12-09). Note: OpenCL drivers have been included in all publicly available NVIDIA drivers since October 2009. The CUDA Toolkit now includes the Visual Profiler for OpenCL as well as the OpenCL Programming Guide, Best Practices Guide, and other developer documentation. 
    MPI Solutions for GPUs. [2019-12-09]. (Archived from the original on 2019-12-09). MPI is fully compatible with CUDA, CUDA Fortran, and OpenACC, all of which are designed for parallel computing on a single computer or node. 
    OpenSHMEM Communication on GPUs. [2019-12-09]. (Archived from the original on 2019-12-09). NVSHMEM implements the OpenSHMEM standard for GPU memory, with extensions for improved performance on GPUs.

developer.nvidia.com

  • CUDA Programming Guide 1.0 (PDF). 6/23/2007 [2019-12-09]. (Archived (PDF) from the original on 2018-04-17). CUDA stands for Compute Unified Device Architecture and is a new hardware and software architecture for issuing and managing computations on the GPU as a data-parallel computing device without the need of mapping them to a graphics API. 
    Apple to Ship Mac OS X Snow Leopard on August 28. August 24, 2009 [2019-12-09]. (Archived from the original on 2019-12-09). OpenCL, a C-based open standard, allows developers to tap the incredible power of the graphics processing unit for tasks that go beyond graphics. 
    September 2009 OpenCL Public Downloads. [2019-12-09]. (Archived from the original on 2019-12-09). Note: OpenCL drivers have been included in all publicly available NVIDIA drivers since October 2009. The CUDA Toolkit now includes the Visual Profiler for OpenCL as well as the OpenCL Programming Guide, Best Practices Guide, and other developer documentation. 
    MPI Solutions for GPUs. [2019-12-09]. (Archived from the original on 2019-12-09). MPI is fully compatible with CUDA, CUDA Fortran, and OpenACC, all of which are designed for parallel computing on a single computer or node. 
    OpenSHMEM Communication on GPUs. [2019-12-09]. (Archived from the original on 2019-12-09). NVSHMEM implements the OpenSHMEM standard for GPU memory, with extensions for improved performance on GPUs.

openucx.org

  • OpenSHMEM. [2019-12-09]. (Archived from the original on 2019-12-09). The Programming Models and Languages team is focused on developing the OpenSHMEM programming model for extreme scale systems. … Currently, the team partners with NVIDIA, University of Tennessee, Knoxville, Florida State University and Paratools. … UCX provides communication interfaces, and protocols for efficiently implementing parallel programming models such as MPI, OpenSHMEM, and task-based models.
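As a rough illustration of the partitioned global address space (PGAS) style behind OpenSHMEM, here is a minimal sketch using the OpenSHMEM 1.x C API from C++: every processing element (PE) runs the same program, allocates a symmetric variable, and PE 0 writes directly into PE 1's copy with a one-sided put. Call names follow the OpenSHMEM 1.x specification as best understood here; older implementations use start_pes/shmalloc instead, so check your library's headers.

    #include <shmem.h>
    #include <cstdio>

    int main() {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        // Symmetric allocation: the same remotely accessible object exists on every PE.
        int* slot = static_cast<int*>(shmem_malloc(sizeof(int)));
        *slot = -1;
        shmem_barrier_all();

        if (me == 0 && npes > 1) {
            int value = 42;
            shmem_int_put(slot, &value, 1, 1);   // one-sided write into PE 1's copy
        }
        shmem_barrier_all();                     // make the put globally visible

        std::printf("PE %d of %d sees %d\n", me, npes, *slot);
        shmem_free(slot);
        shmem_finalize();
        return 0;
    }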

ornl.gov

csm.ornl.gov

  • OpenSHMEM. [2019-12-09]. (Archived from the original on 2019-12-09). The Programming Models and Languages team is focused on developing the OpenSHMEM programming model for extreme scale systems. … Currently, the team partners with NVIDIA, University of Tennessee, Knoxville, Florida State University and Paratools. … UCX provides communication interfaces, and protocols for efficiently implementing parallel programming models such as MPI, OpenSHMEM, and task-based models.

portablecl.org

rpi.edu

wcl.cs.rpi.edu

secdatabase.com

edgar.secdatabase.com

stanford.edu

graphics.stanford.edu

  • Brook Language. [2019-12-09]. (Archived from the original on 2019-11-19). Brook is an extension of standard ANSI C and is designed to incorporate the ideas of data parallel computing and arithmetic intensity into a familiar, efficient language. The general computational model, referred to as streaming, … A stream is a new data type addition which represents a collection of data which can be operated on in parallel. … Kernels are special functions that operate on streams. A kernel is a parallel function applied to every element of the input streams.
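The stream/kernel idea in this excerpt maps loosely onto ordinary element-wise code. The sketch below is a plain C++ analogue, not Brook syntax: the "kernel" is just a function applied to corresponding elements of two input "streams" (vectors), which is the application a streaming runtime would parallelize.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Element-wise "kernel": combines one element from each input stream.
    float saxpy_kernel(float a, float x, float y) { return a * x + y; }

    int main() {
        const float a = 2.0f;
        std::vector<float> x = {1, 2, 3, 4};
        std::vector<float> y = {10, 20, 30, 40};
        std::vector<float> out(x.size());

        // Apply the kernel to every element of the input streams; a streaming
        // runtime (Brook, or a GPU backend) would run these applications in parallel.
        std::transform(x.begin(), x.end(), y.begin(), out.begin(),
                       [a](float xi, float yi) { return saxpy_kernel(a, xi, yi); });

        for (float v : out) std::printf("%f ", v);
        std::printf("\n");
    }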

stmik-banjarbaru.ac.id

digilib.stmik-banjarbaru.ac.id

  • Michael McCool; James Reinders; Arch Robison. Structured Parallel Programming: Patterns for Efficient Computation (PDF). Elsevier. 2013 [2019-12-12]. (Archived (PDF) from the original on 2018-11-23). Threading Building Blocks (TBB) supports fork–join with a work-stealing load balancer as its basic model. In contrast with Cilk Plus, TBB is a portable ISO C++ library, not a compiler extension. Because TBB is not integrated into the compiler, its fork–join implementation is not quite as efficient as Cilk Plus, and it cannot directly generate vectorized code. However, TBB also provides implementations of several patterns not available directly in Cilk Plus, such as pipelines. (A small fork–join sketch follows this list.)
  • Michael McCool; James Reinders; Arch Robison. 2.4.3 Flynn's Characterization. Structured Parallel Programming: Patterns for Efficient Computation (PDF). Elsevier. 2013: 52 [2019-12-12]. (Archived (PDF) from the original on 2018-11-23). There is another related classification used especially by GPU vendors: Single Instruction, Multiple Threads (SIMT). This corresponds to a tiled SIMD architecture consisting of multiple SIMD processors, where each SIMD processor emulates multiple "threads" (fibers in our terminology) using masking. SIMT processors may appear to have thousands of threads, but in fact blocks of these share a control processor, and divergent control flow can significantly reduce efficiency within a block. On the other hand, synchronization between fibers is basically free, because when control flow is emulated with masking the fibers are always running synchronously. 
    Parallel Thread Execution ISA Version 6.5. https://docs.nvidia.com/. [2019-12-18]. (Archived from the original on 2019-12-01).
    On architectures prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
    Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another.
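Following up on the fork–join note in the first entry of this list, here is a minimal sketch with TBB: tbb::task_group forks two subcomputations and joins them with wait(), while the work-stealing scheduler balances them across threads, and tbb::parallel_for expresses a data-parallel loop over the same data. Header paths follow common TBB/oneTBB layouts; older installations may expose everything through <tbb/tbb.h> only.

    #include <tbb/parallel_for.h>
    #include <tbb/task_group.h>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> data(1000, 1);
        long sum_a = 0, sum_b = 0;

        tbb::task_group tg;                 // fork: two independent tasks
        tg.run([&] { for (int i = 0;   i < 500;  ++i) sum_a += data[i]; });
        tg.run([&] { for (int i = 500; i < 1000; ++i) sum_b += data[i]; });
        tg.wait();                          // join

        // A data-parallel loop over the same data, scheduled by work stealing.
        tbb::parallel_for(0, 1000, [&](int i) { data[i] *= 2; });

        std::printf("sum = %ld, data[0] = %d\n", sum_a + sum_b, data[0]);
        return 0;
    }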
     

streamcomputing.eu

umd.edu

umiacs.umd.edu

  • Blelloch, Guy. Programming Parallel Algorithms (PDF). Communications of the ACM. 1996, 39 (3): 85–97 [2019-12-12]. doi:10.1145/227234.227246. (Archived (PDF) from the original on 2016-03-04). An important advance in parallel computing was the introduction of the notion of virtual models. … Virtual models can be taken a step further and used to define performance in more abstract measures than just running time on a particular machine. A pair of such measures are work and depth: Work is defined as the total number of operations executed by a computation, and depth is defined as the longest chain of sequential dependencies in the computation. 
    Siddhartha Chatterjee, Jan Prins. Parallel and Distributed Computing PRAM Algorithms (PDF). Spring 2002 [2019-12-06]. (Archived (PDF) from the original on 2019-12-06). The barebones PRAM model is low-level and cumbersome, and writing anything other than trivial algorithms in this model is a nightmare. We will therefore switch to an equivalent but higher-level abstraction called the Work-Time (WT) paradigm to be independent of these details. … In our algorithmic notation, we will use the forall construct to denote such concurrent operations, and we drop explicit mention of the processor id and p, the number of processors. In fact the forall construct is the only construct that distinguishes a WT algorithm from a sequential algorithm. 
    Vishkin, Uzi, Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques, 104 pages (PDF), Class notes of courses on parallel algorithms taught since 1992 at the University of Maryland, College Park, Tel Aviv University and the Technion, 2009 [2019-12-07], (Archived (PDF) from the original on 2017-12-15), For concreteness, we proceed to an example of a PRAM algorithm. However, before doing this we present the pardo "programming construct", which is heavily used in these notes to express operations that are performed in parallel:
        for Pi, 1 ≤ i ≤ n pardo
            A(i) := B(i)
    This means that the following n operations are performed concurrently: processor P1 assigns B(1) into A(1), processor P2 assigns B(2) into A(2), and so on. … An alternative model, which is actually an alternative presentation mode, called Work-Depth, … Strictly speaking, WD actually defines a slightly different model of computation. Consider an instruction of the form
        for i, 1 ≤ i ≤ α pardo
            A(i) := B(C(i))
    where the time unit under consideration consists of a sequence of α concurrent instructions, for some positive integer α.
     
  • Vishkin, Uzi; Dascal, Shlomit; Berkovich, Efraim; Nuzman, Joseph, Explicit Multi-Threading (XMT) bridging models for instruction parallelism, Proc. 1998 ACM Symposium on Parallel Algorithms and Architectures (SPAA): 140–151, 1998 [2019-12-12], (Archived from the original on 2017-09-21), This paper envisions an extension to a standard instruction set which efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP); that is, Explicit Multi-Threading (XMT), a fine-grained computational paradigm … . … We actually suggest two alternative bridging models: (1) Spawn-based multi-threading (Spawn-MT): This is based on (asynchronous, or synchronous) nesting-free Spawn-Join commands. This is the main model for the presentation in this paper. (2) Elastic multi-threading (EMT): This is a more involved model concept that enables nesting of Spawn-Join commands (similar to NESL) and relies on the ability of the hardware to suspend, and later reactivate, threads. 

users.umiacs.umd.edu

  • Vishkin, Uzi, Using simple abstraction to reinvent computing for parallelism (PDF), Communications of the ACM, 2011, 54: 75–85, doi:10.1145/1866739.1866757, The array exchange serial algorithm serially iterates the standard exchange algorithm n times. Its pseudo-code follows.
    For i = 0 to n−1 do
        X := A(i); A(i) := B(i); B(i) := X
    … A parallel array exchange algorithm uses an auxiliary array X[0..n-1] of size n; the parallel algorithm applies concurrently the iterations of the above serial algorithm, each exchanging A(i) with B(i) for a different value of i. Note the new pardo command in the following pseudo-code.
    For i = 0 to n−1 pardo
        X(i) := A(i); A(i) := B(i); B(i) := X(i)
    … Consider the following example of a small XMTC program for the parallel exchange algorithm of the previous section:
    spawn(0, n−1) {
        var x
        x := A($); A($) := B($); B($) := x
    }
    The program simply spawns a concurrent thread for each of the depth-3 serial exchange iterations, using a local variable x. Note that the join command is implicit, and implied by the right parenthesis at the end of the above program. … Commitments to silicon of XMT include a 64-processor, 75 MHz computer based on field-programmable gate array (FPGA) technology [29], and a 64-processor ASIC 10 mm × 10 mm chip using IBM's 90 nm technology, pictured in Figure 5. A basic yet stable compiler has also been developed. … XMT is easy to build. A single graduate student, with no prior design experience, completed the XMT hardware description (in Verilog) in just over 2 years.
     

uni-koblenz.de

userpages.uni-koblenz.de

  • Lämmel, R. Google's MapReduce programming model — Revisited (PDF). Science of Computer Programming. 2008, 70: 1–30 [2019-12-07]. doi:10.1016/j.scico.2007.07.001. (Archived (PDF) from the original on 2019-12-07). The following overview lists more detailed questions and summarizes our findings:
    Is MAP essentially the map combinator? NO.
    Is MAP essentially an application of the map combinator? NO.
    Does MAP essentially serve as the argument of map? YES.
    Is REDUCE essentially the reduce combinator? NO.
    Is REDUCE essentially an application of the reduce combinator? TYPICALLY.
    Does REDUCE essentially serve as the argument of reduce? NO.
    Does REDUCE essentially serve as the argument of map? YES.
    Hence, there is no trivial correspondence between MapReduce's MAP & REDUCE and what is normally called map & reduce in functional programming.
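The structural point of these findings can be sketched with explicit functional combinators in C++, where std::transform plays the role of map and std::accumulate the role of reduce/fold: the user's MAP is only the argument passed to a map over input records, and the user's REDUCE is applied per key (via another map over the grouped keys), typically with a fold inside. All names below are illustrative, not an actual MapReduce API.

    #include <algorithm>
    #include <cstdio>
    #include <map>
    #include <numeric>
    #include <string>
    #include <utility>
    #include <vector>

    using KVs = std::vector<std::pair<std::string, int>>;

    // User-supplied MAP: one record -> intermediate key/value pairs.
    KVs MAP(const std::string& record) { return {{record, 1}}; }

    // User-supplied REDUCE for one key: typically built from a fold over the values.
    int REDUCE(const std::vector<int>& values) {
        return std::accumulate(values.begin(), values.end(), 0);
    }

    int main() {
        std::vector<std::string> records = {"a", "b", "a", "c", "a"};

        // The map combinator with MAP as its argument: one KV list per record.
        std::vector<KVs> intermediate(records.size());
        std::transform(records.begin(), records.end(), intermediate.begin(), MAP);

        // Group intermediate values by key (the shuffle step).
        std::map<std::string, std::vector<int>> grouped;
        for (const auto& kvs : intermediate)
            for (const auto& kv : kvs) grouped[kv.first].push_back(kv.second);

        // The map combinator again, now with REDUCE applied to each key's group.
        for (const auto& g : grouped)
            std::printf("%s -> %d\n", g.first.c_str(), REDUCE(g.second));
    }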
     

utah.edu

cs.utah.edu

utexas.edu

cs.utexas.edu

washington.edu

homes.cs.washington.edu

  • Blelloch, Guy. Programming Parallel Algorithms (PDF). Communications of the ACM. 1996, 39 (3): 85–97 [2019-12-12]. doi:10.1145/227234.227246. (Archived (PDF) from the original on 2016-03-04). An important advance in parallel computing was the introduction of the notion of virtual models. … Virtual models can be taken a step further and used to define performance in more abstract measures than just running time on a particular machine. A pair of such measures are work and depth: Work is defined as the total number of operations executed by a computation, and depth is defined as the longest chain of sequential dependencies in the computation. 
    Siddhartha Chatterjee, Jan Prins. Parallel and Distributed Computing PRAM Algorithms (PDF). Spring 2002 [2019-12-06]. (Archived (PDF) from the original on 2019-12-06). The barebones PRAM model is low-level and cumbersome, and writing anything other than trivial algorithms in this model is a nightmare. We will therefore switch to an equivalent but higher-level abstraction called the Work-Time (WT) paradigm to be independent of these details. … In our algorithmic notation, we will use the forall construct to denote such concurrent operations, and we drop explicit mention of the processor id and p, the number of processors. In fact the forall construct is the only construct that distinguishes a WT algorithm from a sequential algorithm. 
    Vishkin, Uzi, Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques, 104 pages (PDF), Class notes of courses on parallel algorithms taught since 1992 at the University of Maryland, College Park, Tel Aviv University and the Technion, 2009 [2019-12-07], (Archived (PDF) from the original on 2017-12-15), For concreteness, we proceed to an example of a PRAM algorithm. However, before doing this we present the pardo "programming construct", which is heavily used in these notes to express operations that are performed in parallel:
        for Pi, 1 ≤ i ≤ n pardo
            A(i) := B(i)
    This means that the following n operations are performed concurrently: processor P1 assigns B(1) into A(1), processor P2 assigns B(2) into A(2), and so on. … An alternative model, which is actually an alternative presentation mode, called Work-Depth, … Strictly speaking, WD actually defines a slightly different model of computation. Consider an instruction of the form
        for i, 1 ≤ i ≤ α pardo
            A(i) := B(C(i))
    where the time unit under consideration consists of a sequence of α concurrent instructions, for some positive integer α.
     

web.archive.org

  • Skillicorn, David B., "Models for practical parallel computation", International Journal of Parallel Programming, 20.2 133–158 (1991), https://www.ida.liu.se/~chrke55/papers/modelsurvey.pdf (archived at the Wayback Machine)
  • Leslie G. Valiant. A bridging model for parallel computation (PDF). Communications of the ACM. August 1990, 33 (8): 103–111 [2019-12-05]. (Archived (PDF) from the original on 2019-08-11). 
  • John E. Savage, Models of Computation: Exploring the Power of Computing, 2008, Chapter 7 (Parallel Computation), http://cs.brown.edu/~jes/book/ (archived at the Wayback Machine)
  • Ian Foster, Designing and Building Parallel Programs, 1995, Section 1.3, "A Parallel Programming Model", http://www.mcs.anl.gov/~itf/dbpp/text/node9.html (archived at the Wayback Machine)
  • Blaise Barney, Introduction to Parallel Computing, "Models", 2015, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/#Models (archived at the Wayback Machine)
  • Michael McCool; James Reinders; Arch Robison. Structured Parallel Programming: Patterns for Efficient Computation (PDF). Elsevier. 2013 [2019-12-12]. (Archived (PDF) from the original on 2018-11-23). Threading Building Blocks (TBB) supports fork–join with a work-stealing load balancer as its basic model. In contrast with Cilk Plus, TBB is a portable ISO C++ library, not a compiler extension. Because TBB is not integrated into the compiler, its fork–join implementation is not quite as efficient as Cilk Plus, and it cannot directly generate vectorized code. However, TBB also provides implementations of several patterns not available directly in Cilk Plus, such as pipelines. 
  • SALSA (archived at the Wayback Machine)
  • Ian Foster, Designing and Building Parallel Programs, 1995, Section 2.2, "Partitioning", http://www.mcs.anl.gov/~itf/dbpp/text/node16.html (archived at the Wayback Machine)
  • Blaise Barney, Introduction to Parallel Computing, "Partitioning", 2015, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/#DesignPartitioning (archived at the Wayback Machine)
  • Blaise Barney. Introduction to Parallel Computing. [2019-12-02]. (Archived from the original on 2019-12-24). SINGLE PROGRAM: All tasks execute their copy of the same program simultaneously. This program can be threads, message passing, data parallel or hybrid. 
  • Flynn, Michael J. Some Computer Organizations and Their Effectiveness (PDF). IEEE Transactions on Computers. September 1972, C–21 (9): 948–960 [2019-12-05]. doi:10.1109/TC.1972.5009071. (Archived (PDF) from the original on 2022-04-11).
    2) The single-instruction stream-multiple-data stream (SIMD), which includes most array processes, including Solomon [2] (Illiac IV).
    3) Multiple-instruction stream-single-data stream type organizations (MISD), which include specialized streaming organizations using multiple-instruction streams on a single sequence of data and the derivatives thereof. The plug-board machines of a bygone era were a degenerate form of MISD wherein the instruction streams were single instructions, and a derived datum (SD) passed from program step i to program step i+1 (MI).
    4) Multiple-instruction stream-multiple-data stream (MIMD), which include organizations referred to as "multiprocessor." Univac [3], among other corporations, has proposed various MIMD structures.
  • Michael McCool; James Reinders; Arch Robison. 2.4.3 Flynn's Characterization. Structured Parallel Programming: Patterns for Efficient Computation (PDF). Elsevier. 2013: 52 [2019-12-12]. (Archived (PDF) from the original on 2018-11-23). There is another related classification used especially by GPU vendors: Single Instruction, Multiple Threads (SIMT). This corresponds to a tiled SIMD architecture consisting of multiple SIMD processors, where each SIMD processor emulates multiple "threads" (fibers in our terminology) using masking. SIMT processors may appear to have thousands of threads, but in fact blocks of these share a control processor, and divergent control flow can significantly reduce efficiency within a block. On the other hand, synchronization between fibers is basically free, because when control flow is emulated with masking the fibers are always running synchronously. 
    Parallel Thread Execution ISA Version 6.5. https://docs.nvidia.com/. [2019-12-18]. (Archived from the original on 2019-12-01).
    On architectures prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
    Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another.
  • Blaise Barney. Introduction to Parallel Computing. [2019-12-02]. (Archived from the original on 2019-12-24). MULTIPLE PROGRAM: Tasks may execute different programs simultaneously. The programs can be threads, message passing, data parallel or hybrid. … MPMD applications are not as common as SPMD applications, but may be better suited for certain types of problems, particularly those that lend themselves better to functional decomposition than domain decomposition (discussed later under Partitioning). 
  • Blelloch, Guy. Programming Parallel Algorithms (PDF). Communications of the ACM. 1996, 39 (3): 85–97 [2019-12-12]. doi:10.1145/227234.227246. (Archived (PDF) from the original on 2016-03-04). An important advance in parallel computing was the introduction of the notion of virtual models. … Virtual models can be taken a step further and used to define performance in more abstract measures than just running time on a particular machine. A pair of such measures are work and depth: Work is defined as the total number of operations executed by a computation, and depth is defined as the longest chain of sequential dependencies in the computation. 
    Siddhartha Chatterjee, Jan Prins. Parallel and Distributed Computing PRAM Algorithms (PDF). Spring 2002 [2019-12-06]. (Archived (PDF) from the original on 2019-12-06). The barebones PRAM model is low-level and cumbersome, and writing anything other than trivial algorithms in this model is a nightmare. We will therefore switch to an equivalent but higher-level abstraction called the Work-Time (WT) paradigm to be independent of these details. … In our algorithmic notation, we will use the forall construct to denote such concurrent operations, and we drop explicit mention of the processor id and p, the number of processors. In fact the forall construct is the only construct that distinguishes a WT algorithm from a sequential algorithm. 
    Vishkin, Uzi, Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques, 104 pages (PDF), Class notes of courses on parallel algorithms taught since 1992 at the University of Maryland, College Park, Tel Aviv University and the Technion, 2009 [2019-12-07], (Archived (PDF) from the original on 2017-12-15), For concreteness, we proceed to an example of a PRAM algorithm. However, before doing this we present the pardo "programming construct", which is heavily used in these notes to express operations that are performed in parallel:
        for Pi, 1 ≤ i ≤ n pardo
            A(i) := B(i)
    This means that the following n operations are performed concurrently: processor P1 assigns B(1) into A(1), processor P2 assigns B(2) into A(2), and so on. … An alternative model, which is actually an alternative presentation mode, called Work-Depth, … Strictly speaking, WD actually defines a slightly different model of computation. Consider an instruction of the form
        for i, 1 ≤ i ≤ α pardo
            A(i) := B(C(i))
    where the time unit under consideration consists of a sequence of α concurrent instructions, for some positive integer α.
  • Skillicorn, David B., and Domenico Talia, Models and languages for parallel computation, ACM Computing Surveys, 30.2 123–169 (1998), https://www.cs.utexas.edu/users/browne/CS392Cf2000/papers/ModelsOfParallelComputation-Skillicorn.pdf (archived at the Wayback Machine)
  • Jonathan Hill, Bill McColl, Dan Stefanescu, Mark Goudreau, Kevin Lang, Satish Rao, Torsten Suel, Thanasis Tsantilas, and Rob Bisseling. BSP Programming Library (PDF). Parallel Computing. December 1998, 24 (14): 1947–1980 [2019-12-05]. (Archived (PDF) from the original on 2019-11-04). Like many other communications libraries, BSPlib adopts a Single Program Multiple Data (SPMD) programming model. The task of writing an SPMD program will typically involve mapping a problem that manipulates a data structure of size N into p instances of a program that each manipulate an N/p sized block of the original domain. The role of BSPlib is to provide the infrastructure required for the user to take care of the data distribution, and any implied communication necessary to manipulate parts of the data structure that are on a remote process. 
  • BSPlib (archived at the Wayback Machine)
  • MulticoreBSP (archived at the Wayback Machine)
  • BSPonMPI (archived at the Wayback Machine)
  • Introduction to Apache Giraph. [2019-12-09]. (Archived from the original on 2019-12-19). Apache Giraph is an iterative graph processing framework, built on top of Apache Hadoop. The input to a Giraph computation is a graph composed of vertices and directed edges, … Computation proceeds as a sequence of iterations, called supersteps in BSP. … There is a barrier between consecutive supersteps. 
  • Apache Hama. [2019-12-09]. (Archived from the original on 2019-12-18). Apache Hama(TM) is a framework for Big Data analytics which uses the Bulk Synchronous Parallel (BSP) computing model, … It provides not only pure BSP programming model but also vertex and neuron centric programming models, inspired by Google's Pregel and DistBelief. 
  • OpenSHMEM. [2019-12-09]. (Archived from the original on 2019-12-09). The Programming Models and Languages team is focused on developing the OpenSHMEM programming model for extreme scale systems. … Currently, the team partners with NVIDIA, University of Tennessee, Knoxville, Florida State University and Paratools. … UCX provides communication interfaces, and protocols for efficiently implementing parallel programming models such as MPI, OpenSHMEM, and task-based models. 
  • Armstrong, Joe. Erlang. Communications of the ACM. September 2010, 53 (9): 68–75 [2020-05-07]. doi:10.1145/1810891.1810910. (Archived from the original on 2020-06-09). Erlang is conceptually similar to the occam programming language, though it recasts the ideas of CSP in a functional framework and uses asynchronous message passing instead of the synchronous message passing in CSP. 
  • CUDA Programming Guide 1.0 (PDF). 6/23/2007 [2019-12-09]. (Archived (PDF) from the original on 2018-04-17). CUDA stands for Compute Unified Device Architecture and is a new hardware and software architecture for issuing and managing computations on the GPU as a data-parallel computing device without the need of mapping them to a graphics API. 
    Apple to Ship Mac OS X Snow Leopard on August 28. August 24, 2009 [2019-12-09]. (Archived from the original on 2019-12-09). OpenCL, a C-based open standard, allows developers to tap the incredible power of the graphics processing unit for tasks that go beyond graphics. 
    September 2009 OpenCL Public Downloads. [2019-12-09]. (Archived from the original on 2019-12-09). Note: OpenCL drivers have been included in all publicly available NVIDIA drivers since October 2009. The CUDA Toolkit now includes the Visual Profiler for OpenCL as well as the OpenCL Programming Guide, Best Practices Guide, and other developer documentation. 
    MPI Solutions for GPUs. [2019-12-09]. (Archived from the original on 2019-12-09). MPI is fully compatible with CUDA, CUDA Fortran, and OpenACC, all of which are designed for parallel computing on a single computer or node. 
    OpenSHMEM Communication on GPUs. [2019-12-09]. (Archived from the original on 2019-12-09). NVSHMEM implements the OpenSHMEM standard for GPU memory, with extensions for improved performance on GPUs. 
  • Pocl (archived at the Wayback Machine)
  • Brook Language. [2019-12-09]. (Archived from the original on 2019-11-19). Brook is an extension of standard ANSI C and is designed to incorporate the ideas of data parallel computing and arithmetic intensity into a familiar, efficient language. The general computational model, referred to as streaming, … A stream is a new data type addition which represents a collection of data which can be operated on in parallel. … Kernels are special functions that operate on streams. A kernel is a parallel function applied to every element of the input streams. 
  • Howes, Lee. The OpenCL Specification Version: 2.1 Document Revision: 23 (PDF). Khronos OpenCL Working Group. November 11, 2015 [November 16, 2015]. (Archived (PDF) from the original on 2015-11-18). 
  • Xilinx Inc, Form 8-K, Current Report, Filing Date Apr 26, 2006. secdatabase.com. [May 6, 2018]. (Archived from the original on 2018-05-07). 
  • Why use OpenCL on FPGAs?. StreamComputing. 2014-09-16 [2019-12-06]. (Archived from the original on 2017-01-01). 
  • MapReduce: Simplified Data Processing on Large Clusters (PDF). googleusercontent.com. [2019-12-05]. (Archived (PDF) from the original on 2020-06-15). MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. 
  • MapReduce Tutorial. Apache Hadoop. [3 July 2019]. (Archived from the original on 2019-12-14). 
  • MapReduce: Simplified Data Processing on Large Clusters (PDF). googleusercontent.com. [2019-12-05]. (Archived (PDF) from the original on 2020-06-15). Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance. 
  • Lämmel, R. Google's MapReduce programming model — Revisited (PDF). Science of Computer Programming. 2008, 70: 1–30 [2019-12-07]. doi:10.1016/j.scico.2007.07.001. (Archived (PDF) from the original on 2019-12-07). The following overview lists more detailed questions and summarizes our findings:
    Is MAP essentially the map combinator? NO.
    Is MAP essentially an application of the map combinator? NO.
    Does MAP essentially serve as the argument of map? YES.
    Is REDUCE essentially the reduce combinator? NO.
    Is REDUCE essentially an application of the reduce combinator? TYPICALLY.
    Does REDUCE essentially serve as the argument of reduce? NO.
    Does REDUCE essentially serve as the argument of map? YES.
    Hence, there is no trivial correspondence between MapReduce's MAP & REDUCE and what is normally called map & reduce in functional programming.
  • Vishkin, Uzi; Dascal, Shlomit; Berkovich, Efraim; Nuzman, Joseph, Explicit Multi-Threading (XMT) bridging models for instruction parallelism, Proc. 1998 ACM Symposium on Parallel Algorithms and Architectures (SPAA): 140–151, 1998 [2019-12-12], (Archived from the original on 2017-09-21), This paper envisions an extension to a standard instruction set which efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP); that is, Explicit Multi-Threading (XMT), a fine-grained computational paradigm … . … We actually suggest two alternative bridging models: (1) Spawn-based multi-threading (Spawn-MT): This is based on (asynchronous, or synchronous) nesting-free Spawn-Join commands. This is the main model for the presentation in this paper. (2) Elastic multi-threading (EMT): This is a more involved model concept that enables nesting of Spawn-Join commands (similar to NESL) and relies on the ability of the hardware to suspend, and later reactivate, threads. 

wikipedia.org

en.wikipedia.org

  • Flynn, Michael J. Some Computer Organizations and Their Effectiveness (PDF). IEEE Transactions on Computers. September 1972, C–21 (9): 948–960 [2019-12-05]. doi:10.1109/TC.1972.5009071. (Archived (PDF) from the original on 2022-04-11).
    2) The single-instruction stream-multiple-data stream (SIMD), which includes most array processes, including Solomon [2] (Illiac IV).
    3) Multiple-instruction stream-single-data stream type organizations (MISD), which include specialized streaming organizations using multiple-instruction streams on a single sequence of data and the derivatives thereof. The plug-board machines of a bygone era were a degenerate form of MISD wherein the instruction streams were single instructions, and a derived datum (SD) passed from program step i to program step i+1 (MI).
    4) Multiple-instruction stream-multiple-data stream (MIMD), which include organizations referred to as "multiprocessor." Univac [3], among other corporations, has proposed various MIMD structures.
     
  • Michael McCool; James Reinders; Arch Robison. 2.4.3 Flynn's Characterization. Structured Parallel Programming: Patterns for Efficient Computation (PDF). Elsevier. 2013: 52 [2019-12-12]. (Archived (PDF) from the original on 2018-11-23). There is another related classification used especially by GPU vendors: Single Instruction, Multiple Threads (SIMT). This corresponds to a tiled SIMD architecture consisting of multiple SIMD processors, where each SIMD processor emulates multiple "threads" (fibers in our terminology) using masking. SIMT processors may appear to have thousands of threads, but in fact blocks of these share a control processor, and divergent control flow can significantly reduce efficiency within a block. On the other hand, synchronization between fibers is basically free, because when control flow is emulated with masking the fibers are always running synchronously. 
    Parallel Thread Execution ISA Version 6.5. https://docs.nvidia.com/. [2019-12-18]. (Archived from the original on 2019-12-01).
    On architectures prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
    Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another.
     
  • Vishkin, Uzi; Dascal, Shlomit; Berkovich, Efraim; Nuzman, Joseph, Explicit Multi-Threading (XMT) bridging models for instruction parallelism, Proc. 1998 ACM Symposium on Parallel Algorithms and Architectures (SPAA): 140–151, 1998 [2019-12-12], (Archived from the original on 2017-09-21), This paper envisions an extension to a standard instruction set which efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP); that is, Explicit Multi-Threading (XMT), a fine-grained computational paradigm … . … We actually suggest two alternative bridging models: (1) Spawn-based multi-threading (Spawn-MT): This is based on (asynchronous, or synchronous) nesting-free Spawn-Join commands. This is the main model for the presentation in this paper. (2) Elastic multi-threading (EMT): This is a more involved model concept that enables nesting of Spawn-Join commands (similar to NESL) and relies on the ability of the hardware to suspend, and later reactivate, threads.