CUDA programming

anneng

1. Basic concepts
Ordinary C code runs on the host (the CPU), while CUDA C code runs on the device (the GPU). A function that runs on the GPU is called a kernel.

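Compute capability: a card's compute capability is expressed as a version number (major.minor). It feels similar to an Android API level: the number determines which hardware features, and therefore which parts of the API, the device supports.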
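To make the version-number idea concrete, here is a minimal sketch that asks the CUDA runtime for each device's compute capability. The helper name computeCapability and its major * 10 + minor encoding are choices made for this note, not part of CUDA; cudaGetDeviceCount and cudaGetDeviceProperties are the runtime calls doing the work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper for this note: returns the compute capability of
// `device` encoded as major * 10 + minor (8.6 becomes 86), or -1 if the
// query fails (for example, when no such device exists).
int computeCapability(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return prop.major * 10 + prop.minor;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA device found\n");
        return 0;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) {
            continue;  // skip devices that cannot be queried
        }
        std::printf("Device %d: %s, compute capability %d.%d\n",
                    d, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

The same number is what the nvcc -arch option refers to (sm_86 for compute capability 8.6, for example).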

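2. Basic workflow
The CPU and the GPU cannot operate directly on each other's memory; data has to be passed through the corresponding API calls. A CUDA program therefore follows these steps:
Allocate host memory and initialize host data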
Allocate device memory
Transfer input data from host to device memory
Execute kernels
Transfer output from device memory to host
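The five steps above can be sketched as a small vector addition. This is only an illustration under assumptions: the names vectorAdd and addOnDevice, the block size of 256 and the array length are made up for this note. The runtime calls (cudaMalloc, cudaMemcpy, cudaGetLastError, cudaFree) are the standard ones.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vectorAdd(const float *a, const float *b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // the last block may have threads past the end of the array
        out[i] = a[i] + b[i];
    }
}

// Hypothetical helper: runs steps 2 to 5 for host arrays of length n.
cudaError_t addOnDevice(const float *a, const float *b, float *out, int n) {
    size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;

    // Step 2: allocate device memory.
    cudaError_t err = cudaMalloc((void **)&d_a, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_b, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_out, bytes);

    // Step 3: transfer input data from host to device.
    if (err == cudaSuccess) err = cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    // Step 4: execute the kernel with enough blocks to cover n elements.
    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        vectorAdd<<<blocks, threads>>>(d_a, d_b, d_out, n);
        err = cudaGetLastError();  // reports launch configuration errors
    }

    // Step 5: transfer the output back; this copy waits for the kernel.
    if (err == cudaSuccess) err = cudaMemcpy(out, d_out, bytes, cudaMemcpyDeviceToHost);

    // cudaFree on a null pointer is a no-op, so this is safe on every path.
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return err;
}

int main() {
    // Step 1: allocate host memory and initialize host data.
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), out(n, 0.0f);

    cudaError_t err = addOnDevice(a.data(), b.data(), out.data(), n);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("out[0] = %.1f, out[n-1] = %.1f\n", out[0], out[n - 1]);
    return 0;
}
```

Freeing the device buffers is not in the list above, but it is the natural sixth step once the results are back on the host.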

3. Core concepts

https://github.com/jiekebo/CUDA-By-Example
https://cuda-tutorial.readthedocs.io/en/latest/
https://developer.nvidia.com/blog/even-easier-introduction-cuda/
https://dadaiscrazy.github.io/usuba/2020/03/28/CUDA-basics.html

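Structure of a CUDA program:

// A kernel is essentially a thread entry function. It is executed on the GPU
// by many threads at once, so the GPU looks a bit like a huge hardware thread
// pool that runs these functions concurrently. Threads in the same block
// synchronize with the __syncthreads() barrier (rather than with semaphores or
// queues, as one might first guess); atomic operations cover shared updates.
__global__ void kernelFunctionName(arg1, arg2, ..., arg n) {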
//Find the thread's ID which is executing the kernel code
int i = threadIdx.x;

//Kernel code to execute
....
....

}

int main(int argc, char * argv[]) {

//host code containing also device memory allocation and transfer
....
....

//Launching the kernel code
dim3 Grid(10, 5, 1);        //Grid definition with x, y and z dimensions respectively 10, 5 and 1.
dim3 ThreadsPerBlock(50);   // Block definition with x, y and z respectively 50, 1 and 1 (y and z equal to 1 if not given).

//The launch configuration <<<Grid, ThreadsPerBlock>>> specifies how many blocks and threads execute this kernel
kernelFunctionName<<<Grid,ThreadsPerBlock>>>(arg1, arg2, ..., arg n);
}
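The template above only reads threadIdx.x, which is the thread's index inside its own block. With a 10 x 5 grid of 50-thread blocks there are 10 * 5 * 50 = 2500 threads in total, so each thread usually needs a global index as well. The following is a minimal sketch of one common way to compute it; the names writeGlobalIds and countCorrectIds are made up for this note.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread computes its global index from its block and thread
// coordinates and stores it at that position.
__global__ void writeGlobalIds(int *out, int n) {
    int blockId = blockIdx.y * gridDim.x + blockIdx.x;  // flatten the 2D grid
    int i = blockId * blockDim.x + threadIdx.x;         // 1D blocks
    if (i < n) {
        out[i] = i;
    }
}

// Hypothetical helper: launches the kernel with the same configuration as
// the template and returns how many slots hold their own index, or -1 on
// a CUDA error.
int countCorrectIds() {
    const dim3 grid(10, 5, 1);
    const dim3 threadsPerBlock(50);
    const int n = grid.x * grid.y * grid.z * threadsPerBlock.x;  // 2500
    const size_t bytes = n * sizeof(int);

    int *d_out = nullptr;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        return -1;
    }

    writeGlobalIds<<<grid, threadsPerBlock>>>(d_out, n);

    std::vector<int> host(n, -1);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaMemcpy(host.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_out);
    if (err != cudaSuccess) {
        return -1;
    }

    int correct = 0;
    for (int i = 0; i < n; ++i) {
        if (host[i] == i) {
            ++correct;
        }
    }
    return correct;
}

int main() {
    int correct = countCorrectIds();
    if (correct < 0) {
        std::printf("CUDA error\n");
        return 1;
    }
    std::printf("%d threads wrote the expected index\n", correct);
    return 0;
}
```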

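// GPU memory and CPU memory are separate address spaces, so data has to be
// copied back and forth. Seen this way, the GPU is simply a peripheral device
// from the host's point of view, and the OS does treat it that way: the GPU
// can be found in the output of lspci.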
cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice); // Synchronous function, no overlapping allowed.
kernelFunctionName<<<Grid,ThreadsPerBlock>>>(arg1, arg2, ..., arg n); // Asynchronous, we can take advantage of overlapping.
//Host code during device execution
...

// End of the overlapped host code. Waiting for the end of the kernel execution to transfer data between the host and the device. 
cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost); // Synchronous function, no overlapping allowed.
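
The overlap pattern in the snippet above can be sketched as a complete program. It is an illustration only: the kernel scaleByTwo, the helper doubleWhileSumming and the CPU-side summation standing in for "host code during device execution" are made up for this note. The kernel launch returns immediately, the CPU does its own work, and the final blocking cudaMemcpy on the default stream waits for the kernel before copying.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel: doubles every element in place.
__global__ void scaleByTwo(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 2.0f;
    }
}

// Hypothetical helper: doubles `data` on the GPU while the CPU sums `other`.
// `data` is assumed to be non-empty. The CPU result is stored in *hostSum.
cudaError_t doubleWhileSumming(std::vector<float> &data,
                               const std::vector<float> &other,
                               double *hostSum) {
    const int n = static_cast<int>(data.size());
    const size_t bytes = n * sizeof(float);
    float *d_data = nullptr;

    cudaError_t err = cudaMalloc((void **)&d_data, bytes);
    // Synchronous copy: returns only after the data is on the device.
    if (err == cudaSuccess) {
        err = cudaMemcpy(d_data, data.data(), bytes, cudaMemcpyHostToDevice);
    }
    // Asynchronous launch: control returns to the host right away.
    if (err == cudaSuccess) {
        scaleByTwo<<<(n + 255) / 256, 256>>>(d_data, n);
        err = cudaGetLastError();
    }

    // Host code that runs while the device is busy with the kernel.
    double sum = 0.0;
    for (float v : other) {
        sum += v;
    }
    *hostSum = sum;

    // Synchronous copy on the default stream: waits for the kernel to finish.
    if (err == cudaSuccess) {
        err = cudaMemcpy(data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_data);
    return err;
}

int main() {
    std::vector<float> data(1 << 20, 1.5f);
    std::vector<float> other(1000, 1.0f);
    double sum = 0.0;

    cudaError_t err = doubleWhileSumming(data, other, &sum);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("data[0] = %.1f, host sum = %.1f\n", data[0], sum);
    return 0;
}
```

If the host needs to wait for the kernel without copying anything, cudaDeviceSynchronize() is the explicit way to do so.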