---
event: CSC Summer School in High-Performance Computing 2024
lang: en
---

# Introduction to GPU parallel computing and programming model{.section}

# High-performance computing

<div class="column">

- High-performance computing is fueled by ever-increasing performance
- Increasing performance allows breakthroughs in many major challenges that
  humankind faces today
- Not only hardware performance: algorithmic improvements have also added orders of magnitude of real performance

</div>

<div class="column">
![](img/top500-perf-dev.png){.center width=70%}
</div>

# HPC through the ages

<div class="column" width=55%>
- Achieving performance has been based on various strategies throughout the years
    - Frequency, vectorization, multi-node, multi-core, ...
- Now performance is mostly limited by power consumption
- Accelerators provide compute resources based on a very high level of parallelism to reach
high performance at low relative power consumption
</div>

<div class="column" width=43%>
![](img/microprocessor-trend-data.png){.center width=100%}
</div>


# Accelerators

- Specialized parallel hardware for compute-intensive operations
    - Co-processors for traditional CPUs
    - Based on highly parallel architectures
- Graphics processing units (GPU) have been the most common
  accelerators during the last few years
- Promises
    - Very high performance per node
    - More FLOPS/Watt
- Usually major rewrites of programs required

# Why use them?

CPU vs Accelerator

![ <span style=" font-size:0.5em;">https://github.com/karlrupp/cpu-gpu-mic-comparison</span> ](img/comparison.png)


# Different design philosophies

<div class="column">

**CPU**
\
\

- General purpose
- Good for serial processing
- Great for task parallelism
- Low latency per thread
- Large area dedicated to cache and control


</div>

<div class="column">

**GPU**
\
\

- Highly specialized for parallelism
- Good for parallel processing
- Great for data parallelism
- High throughput
- Hundreds of floating-point execution units

</div>


# LUMI - Pre-exascale system in Finland

![](img/lumi.png){.center width=50%}


# Accelerator model today

<div class="column">
- GPU is connected to the CPU via PCIe
- Local memory in GPU
    - Smaller than main memory (32 GB in Puhti, 64 GB in LUMI)
    - Very high bandwidth (up to 3200 GB/s in LUMI)
    - Latency high compared to compute performance
- Data must be copied from CPU to GPU over the PCIe bus (see the sketch below)

</div>
<div class="column">
![](img/gpuConnect.png)
![](img/gpu-bws.png){width=100%}
</div>
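
A small sketch of the separate memories (illustrative only; uses standard OpenMP runtime routines, assumes a compiler with offload support; the buffer names are made up for the example):

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    int ndev = omp_get_num_devices();   /* GPUs visible to the runtime */
    printf("number of devices: %d\n", ndev);

    if (ndev > 0) {
        int gpu  = omp_get_default_device();
        int host = omp_get_initial_device();
        size_t bytes = 1024 * sizeof(double);
        double h_buf[1024] = {0.0};

        /* allocate in the GPU's local memory ... */
        double *d_buf = omp_target_alloc(bytes, gpu);
        /* ... and copy the data over the PCIe bus */
        omp_target_memcpy(d_buf, h_buf, bytes, 0, 0, gpu, host);
        omp_target_free(d_buf, gpu);
    }
    return 0;
}
```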

# Heterogeneous Programming Model

- GPUs are co-processors to the CPU
- CPU controls the workflow:
    - *offloads* computations to the GPU by launching *kernels*
    - allocates and deallocates memory on the GPU
    - handles data transfers between CPU and GPU
- CPU and GPU can work concurrently
    - kernel launches are normally asynchronous (see the sketch below)
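
A minimal offloading sketch in C with OpenMP (illustrative only; assumes a compiler with OpenMP target support; the `map` clauses express the allocation and transfer duties listed above):

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N], y[N];

    /* initialize on the CPU (host) */
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    /* the CPU offloads this loop to the GPU as a kernel; the runtime
       allocates device memory and handles the CPU<->GPU transfers */
    #pragma omp target teams distribute parallel for \
            map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; i++)
        y[i] += 2.0 * x[i];

    printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
    return 0;
}
```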

# GPU architecture

<div class="column">
- Designed for running tens of thousands of threads simultaneously on
thousands of cores
<small>AMD Instinct MI100 architecture (source: AMD)</small>
</div>


# Advanced features & performance considerations

- Memory accesses:
    - data resides in the GPU memory; maximum performance is achieved when reading/writing is done in contiguous blocks
    - very fast on-chip memory can be used as a user-programmable cache
- *Unified Virtual Addressing* provides a unified view of all memory
- Asynchronous calls can be used to overlap transfers and computations (see the sketch below).
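
A sketch combining these points in C with OpenMP (illustrative only; assumes OpenMP target support and a device-side `sqrt`): data stays resident in GPU memory between the data regions, the accesses are contiguous, and `nowait` makes the kernel launch asynchronous so host work can overlap with it.

```c
#include <math.h>
#include <stdio.h>

#define N (1 << 20)

int main(void) {
    static double a[N], b[N];

    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    /* keep the data resident in GPU memory instead of
       transferring it back and forth around every kernel */
    #pragma omp target enter data map(to: a[0:N]) map(alloc: b[0:N])

    /* 'nowait' makes the launch asynchronous: the host continues
       immediately and may do independent work meanwhile */
    #pragma omp target teams distribute parallel for nowait
    for (int i = 0; i < N; i++)
        b[i] = sqrt(a[i]);         /* unit-stride, contiguous accesses */

    /* ... independent CPU work could overlap with the kernel here ... */

    #pragma omp taskwait           /* wait for the device kernel */
    #pragma omp target exit data map(from: b[0:N]) map(delete: a[0:N])

    printf("b[N-1] = %f\n", b[N - 1]);
    return 0;
}
```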


# Challenges in using Accelerators

**Applicability**: Is your algorithm suitable for GPU?

**Scalability**: Can you scale the GPU software efficiently to several nodes?



# Using GPUs

<div class="column">



1. Use existing GPU applications
2. Use accelerated libraries
3. Directive-based methods
    - **OpenMP**, OpenACC
4. High-level GPU programming
    - **SYCL**, **Kokkos**, ...
5. Use direct GPU programming
    - CUDA, HIP, ... (see the sketch below)
</div>
<div class="column" width=40%>

Easier, but more limited
<div class="column">
**Easier, more limited**

![](img/arrow.png){.center width=20% }
![](img/arrow.png){width=16% }

More difficult, but more opportunities
**More difficult, more opportunities**

</div>
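
For contrast between the directive-based and direct approaches, a hypothetical HIP version of the earlier offloaded loop (a sketch assuming `hipcc`; error checking omitted): the kernel, device allocations, transfers, and launch configuration are all written out explicitly.

```c
#include <hip/hip_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* each GPU thread handles one element of the loop */
__global__ void axpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(double);

    double *x = (double *)malloc(bytes);
    double *y = (double *)malloc(bytes);
    for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    double *d_x, *d_y;
    hipMalloc(&d_x, bytes);                           /* explicit device allocation */
    hipMalloc(&d_y, bytes);
    hipMemcpy(d_x, x, bytes, hipMemcpyHostToDevice);  /* explicit transfers */
    hipMemcpy(d_y, y, bytes, hipMemcpyHostToDevice);

    axpy<<<(n + 255) / 256, 256>>>(n, 2.0, d_x, d_y); /* explicit kernel launch */

    hipMemcpy(y, d_y, bytes, hipMemcpyDeviceToHost);
    printf("y[0] = %f\n", y[0]);                      /* expect 4.0 */

    hipFree(d_x); hipFree(d_y);
    free(x); free(y);
    return 0;
}
```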




<!--
# Directive-based accelerator languages
- Annotating code to pinpoint accelerator-offloadable regions
- almost a one-on-one clone of CUDA from the user perspective
- ecosystem is new and developing fast
-->

# GPUs @ CSC

