Book contents
- Frontmatter
- Dedication
- Contents
- Figures
- Tables
- Examples
- Preface
- 1 Introduction to GPU Kernels and Hardware
- 2 Thinking and Coding in Parallel
- 3 Warps and Cooperative Groups
- 4 Parallel Stencils
- 5 Textures
- 6 Monte Carlo Applications
- 7 Concurrency Using CUDA Streams and Events
- 8 Application to PET Scanners
- 9 Scaling Up
- 10 Tools for Profiling and Debugging
- 11 Tensor Cores
- Appendix A A Brief History of CUDA
- Appendix B Atomic Operations
- Appendix C The NVCC Compiler
- Appendix D AVX and the Intel Compiler
- Appendix E Number Formats
- Appendix F CUDA Documentation and Libraries
- Appendix G The CX Header Files
- Appendix H AI and Python
- Appendix I Topics in C++
- Index
1 - Introduction to GPU Kernels and Hardware
Published online by Cambridge University Press: 04 May 2022
Summary
The key to parallel programming is sharing a task between many cooperating threads running in parallel. A chart shows how, since 2003, the Moore's law growth in computing performance has come to depend on parallel computing. The chapter includes a simple introductory CUDA example that performs numerical integration using 1,000,000,000 threads, giving a speed-up of about 1000 compared with a single CPU thread. Key CUDA concepts, including thread blocks, thread grids and warps, are introduced. The hardware differences between conventional CPU architectures and GPUs are then discussed. Optimisations in memory caching on GPUs are also explained, as memory access time is often a key performance constraint. Finally, the use of OpenMP to share a single task across all cores of a multicore CPU is discussed.
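As a flavour of the integration example the summary refers to, here is a minimal CUDA sketch (not the book's actual listing): it estimates the integral of sin(x) over [0, π] (exact value 2) with the midpoint rule, one sample point per thread. The choice of integrand, the launch configuration and the atomicAdd reduction are illustrative assumptions made for brevity.

```cuda
// Minimal sketch of GPU numerical integration: estimate the integral
// of sin(x) over [0, pi] with the midpoint rule, one sample per thread.
#include <cstdio>

__global__ void integrate(float *sum, int steps, float step_size)
{
    // Unique index of this thread across the whole grid
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < steps) {
        float x = (tid + 0.5f) * step_size;   // midpoint of sub-interval tid
        atomicAdd(sum, sinf(x) * step_size);  // accumulate this strip's area
    }
}

int main()
{
    const int   steps     = 1 << 24;             // ~16.8 million sample points
    const float step_size = 3.14159265f / steps;

    float *d_sum;
    cudaMalloc(&d_sum, sizeof(float));
    cudaMemset(d_sum, 0, sizeof(float));

    const int threads = 256;                             // threads per block
    const int blocks  = (steps + threads - 1) / threads; // blocks in the grid
    integrate<<<blocks, threads>>>(d_sum, steps, step_size);

    float h_sum = 0.0f;
    cudaMemcpy(&h_sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
    printf("integral ~ %f (exact 2)\n", h_sum);
    cudaFree(d_sum);
    return 0;
}
```

A tuned version would replace the single global atomicAdd with per-block reductions in shared memory; that kind of thread-block and warp-level cooperation is exactly what the chapter goes on to introduce.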
- Type: Chapter
- Information: Programming in Parallel with CUDA: A Practical Guide, pp. 1–21. Publisher: Cambridge University Press. Print publication year: 2022