Two Fundamental Hardware Techniques Used To Increase Performance
- Multiple copies of hardware unit used
- All copies can operate simultaneously
- Occurs at many levels of architecture
- Term parallel computer applied when parallelism dominates entire architecture
Characterizations Of Parallelism:
- Microscopic vs. macroscopic
- Symmetric vs. asymmetric
- Fine-grain vs. coarse-grain
- Explicit vs. implicit
Types Of Parallel Architectures
- SISD: Single Instruction, Single Data stream
- SIMD: Single Instruction, Multiple Data streams
- MIMD: Multiple Instructions, Multiple Data streams
A New Conceptual Model
The ancient Greeks found a similar problem in trying to describe the motions of the planets and stars. They had a concept of planets moving relative to stars and to the Earth, and stars moving relative to the Earth. A model involving crystal spheres was developed. In this model the heavenly bodies were painted onto various layers of crystal spheres which were centred on the Earth and could move relative to one another. Objects on close spheres move more compared with objects on distant spheres. Various complex models of the movements of heavenly bodies were formulated in terms of these spherical shells surrounding the Earth. We now know that much of the observed motion of the stars is due to the rotation of the Earth, but since the observed motion is relative, a successful model centred on the Earth is possible.
Distributed processing is processing in which more than one processor takes part in completing a transaction. In other words, the work is distributed across two or more machines, and the processes involved are most likely not running at the same time.
The word distributed in terms such as “distributed system”, “distributed programming”, and “distributed algorithm” originally referred to computer networks where individual computers were physically distributed within some geographical area. The terms are nowadays used in a much wider sense, even referring to autonomous processes that run on the same physical computer and interact with each other by message passing. While there is no single definition of a distributed system, the following defining properties are commonly used:
- There are several autonomous computational entities, each of which has its own local memory.
- The entities communicate with each other by message passing.
In this article, the computational entities are called computers or nodes.
A distributed system may have a common goal, such as solving a large computational problem. Alternatively, each computer may have its own user with individual needs, and the purpose of the distributed system is to coordinate the use of shared resources or provide communication services to the users.
Other typical properties of distributed systems include the following:
- The system has to tolerate failures in individual computers.
- The structure of the system (network topology, network latency, number of computers) is not known in advance; the system may consist of different kinds of computers and network links, and it may change during the execution of a distributed program.
- Each computer has only a limited, incomplete view of the system. Each computer may know only one part of the input.
Examples of Distributed Systems
- The World Wide Web – information, resource sharing
- Clusters, networks of workstations
- Distributed manufacturing system (e.g., automated assembly line)
- Network of branch office computers – Information system to handle automatic processing of orders
- Network of embedded systems
- New Cell processor (PlayStation 3)
Pitfalls when Developing Distributed Systems
False assumptions made by first-time developers:
- The network is reliable.
- The network is secure.
- The network is homogeneous.
- The topology does not change.
- Latency is zero.
- Bandwidth is infinite.
- Transport cost is zero.
- There is one administrator.
Types of Distributed Systems
- Distributed Computing Systems
- Distributed Information Systems
- Distributed Pervasive/Embedded Systems
What’s Parallel Architecture?
A parallel computer is a collection of processing elements that cooperate to solve large problems fast. Parallel architecture extends “computer architecture” to support communication and cooperation:
• OLD: Instruction Set Architecture
• NEW: Communication Architecture
- Critical abstractions, boundaries, and primitives (interfaces)
- Organizational structures that implement interfaces (hw or sw)
- Compilers, libraries and OS are important bridges
Communication Architecture:
- User/System Interface + Implementation
- Comm. primitives exposed to user level by hw and system-level sw
- Organizational structures that implement the primitives: hw or OS
- How optimized are they? How integrated into processing node?
- Structure of network
- Broad applicability
- Low Cost
An Introduction to Programming with Threads
Many experimental operating systems, and some commercial ones, have recently included support for concurrent programming. The most popular mechanism for this is some provision for allowing multiple lightweight “threads” within a single address space, used from within a single program. Programming with threads introduces new difficulties even for experienced programmers. Concurrent programming has techniques and pitfalls that do not occur in sequential programming. Many of the techniques are obvious, but some are obvious only with hindsight. Some of the pitfalls are comfortable (for example, deadlock is a pleasant sort of bug—your program stops with all the evidence intact), but some take the form of insidious performance penalties.
The purpose of this paper is to give you an introduction to the programming techniques that work well with threads, and to warn you about techniques or interactions that work out badly. It should provide the experienced sequential programmer with enough hints to be able to build a substantial multi-threaded program that works—correctly, efficiently, and with a minimum of surprises.
Having “multiple threads” in a program means that at any instant the program has multiple points of execution, one in each of its threads. The programmer can mostly view the threads as executing simultaneously, as if the computer were endowed with as many processors as there are threads. The programmer is required to decide when and where to create multiple threads, or to accept such decisions made for him by implementers of existing library packages or runtime systems. Additionally, the programmer must occasionally be aware that the computer might not in fact execute all his threads simultaneously.
Having the threads execute within a “single address space” means that the computer’s addressing hardware is configured so as to permit the threads to read and write the same memory locations. In a high-level language, this usually corresponds to the fact that the off-stack (global) variables are shared among all the threads of the program. Each thread executes on a separate call stack with its own separate local variables. The programmer is responsible for using the synchronization mechanisms of the thread facility to ensure that the shared memory is accessed in a manner that will give the correct answer.
Thread facilities are always advertised as being “lightweight”. This means that thread creation, existence, destruction and synchronization primitives are cheap enough that the programmer will use them for all his concurrency needs. Please be aware that I am presenting you with a selective, biased and idiosyncratic collection of techniques. Selective, because an exhaustive survey would be premature, and would be too exhausting to serve as an introduction—I will be discussing only the most important thread primitives, omitting features such as per-thread context information. Biased, because I present examples, problems and solutions in the context of one.
Introduction to Programming CUDA
What is CUDA?
* CUDA Architecture
— Expose general-purpose GPU computing as first-class capability
— Retain traditional DirectX/OpenGL graphics performance
* CUDA C
— Based on industry-standard C
— A handful of language extensions to allow heterogeneous programs
— Straightforward APIs to manage devices, memory, etc.
* This talk will introduce you to CUDA C
Introduction to CUDA C
* What will you learn today?
— Start from “Hello, World!”
— Write and launch CUDA C kernels
— Manage GPU memory
— Run parallel kernels in CUDA C
— Parallel communication and synchronization
— Race conditions and atomic operations
CUDA C Prerequisites
- You (probably) need experience with C or C++
- You do not need any GPU experience
- You do not need any graphics experience
- You do not need any parallel programming experience
CUDA C: The Basics
- Host – The CPU and its memory (host memory)
- Device – The GPU and its memory (device memory)
- Host and device memory are distinct entities
— Device pointers point to GPU memory
    - May be passed to and from host code
    - May not be dereferenced from host code
— Host pointers point to CPU memory
    - May be passed to and from device code
    - May not be dereferenced from device code
- Basic CUDA API for dealing with device memory
— cudaMalloc(), cudaFree(), cudaMemcpy()
— Similar to their C equivalents, malloc(), free(), memcpy()