CUDA Programming
-
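1. Basic concepts
C code runs on the host (the CPU), while CUDA C code runs on the device (the GPU). A function that runs on the GPU is called a kernel. The compute capability describes what a card can do and is expressed as a version number (major.minor). It feels similar to an Android API level: it determines which features are available on that card.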
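As a small illustration of the compute capability mentioned above, the sketch below queries it through the CUDA runtime. `cudaGetDeviceCount`, `cudaGetDeviceProperties`, and the `major`/`minor`/`name` fields of `cudaDeviceProp` are runtime API names; the helper `computeCapability` and its `major * 10 + minor` encoding are assumptions made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the compute capability of `device` encoded as
// major * 10 + minor (e.g. 75 for 7.5), or -1 if the query fails.
// The encoding is a convention of this sketch, not part of the CUDA API.
int computeCapability(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return prop.major * 10 + prop.minor;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA device found\n");
        return 0;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) == cudaSuccess) {
            std::printf("Device %d: %s, compute capability %d.%d\n",
                        d, prop.name, prop.major, prop.minor);
        }
    }
    return 0;
}
```

The same version number is what the `nvcc` architecture flags refer to when choosing which GPU generation to compile for.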
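2. Basic workflow
The CPU and the GPU cannot operate on each other's memory directly, so data has to be passed through dedicated API calls. The overall CUDA programming workflow is therefore:
Allocate host memory and initialize host data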
Allocate device memory
Transfer input data from host to device memory
Execute kernels
Transfer output from device memory to host
3. Core concepts
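The five steps of the basic workflow can be sketched end to end as follows. This is a minimal example under stated assumptions: the kernel `vectorAdd`, the helper `runVectorAdd`, the input values, and the block size of 256 are all made up for illustration, and error handling is reduced to checking the final copy.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread adds one pair of elements.
__global__ void vectorAdd(const float *a, const float *b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may have spare threads
        out[i] = a[i] + b[i];
    }
}

// Runs the five workflow steps; returns true if every result equals 3.0f.
bool runVectorAdd(int n) {
    size_t numBytes = n * sizeof(float);

    // 1. Allocate host memory and initialize host data
    float *a = (float *)malloc(numBytes);
    float *b = (float *)malloc(numBytes);
    float *out = (float *)malloc(numBytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; out[i] = 0.0f; }

    // 2. Allocate device memory
    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_a, numBytes);
    cudaMalloc((void **)&d_b, numBytes);
    cudaMalloc((void **)&d_out, numBytes);

    // 3. Transfer input data from host to device memory
    cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, numBytes, cudaMemcpyHostToDevice);

    // 4. Execute the kernel with enough blocks to cover n elements
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_out, n);

    // 5. Transfer output from device memory to host
    cudaError_t err = cudaMemcpy(out, d_out, numBytes, cudaMemcpyDeviceToHost);

    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) {
        ok = (out[i] == 3.0f);
    }

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    free(a); free(b); free(out);
    return ok;
}

int main() {
    std::printf("vectorAdd %s\n", runVectorAdd(1 << 20) ? "passed" : "failed");
    return 0;
}
```

Compiling with something like `nvcc vector_add.cu -o vector_add` should be enough; the file name is arbitrary.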

https://github.com/jiekebo/CUDA-By-Example
https://cuda-tutorial.readthedocs.io/en/latest/
https://developer.nvidia.com/blog/even-easier-introduction-cuda/
https://dadaiscrazy.github.io/usuba/2020/03/28/CUDA-basics.html
CUDA code structure:
// A kernel is essentially the entry function of a thread. The GPU executes it across many threads at once,
// so the GPU looks like a huge hardware thread pool that runs these functions concurrently.
// Threads within a block synchronize with barriers such as __syncthreads() and with atomic operations.
__global__ void kernelFunctionName(arg1, arg2, ..., arg n) {
    // Find the ID of the thread that is executing the kernel code
    int i = threadIdx.x;
    // Kernel code to execute
    .... ....
}
int main(int argc, char *argv[]) {
    // Host code, including device memory allocation and transfers
    .... ....
    // Launching the kernel code
    dim3 Grid(10, 5, 1);       // Grid definition with x, y and z dimensions of 10, 5 and 1 respectively.
    dim3 ThreadsPerBlock(50);  // Block definition with x, y and z of 50, 1 and 1 respectively (y and z default to 1 if not given).
    // Specify how many resources (blocks and threads per block) are used to execute this kernel
    kernelFunctionName<<<Grid, ThreadsPerBlock>>>(arg1, arg2, ..., arg n);
}
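With a launch configuration like `Grid(10, 5, 1)` and `ThreadsPerBlock(50)`, `threadIdx.x` alone only identifies a thread inside its block. The sketch below shows one common way to derive a unique global index across all 10 × 5 × 50 = 2500 threads; the kernel `writeGlobalIds` and the helper `countCorrectIds` are hypothetical names used only for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread stores its own global index.
__global__ void writeGlobalIds(int *ids) {
    // Flatten the 2D grid into a linear block index (0..49 for a 10 x 5 grid)
    int blockId = blockIdx.y * gridDim.x + blockIdx.x;
    // Combine with the position inside the 1D block (0..49 for 50 threads)
    int globalId = blockId * blockDim.x + threadIdx.x;
    ids[globalId] = globalId;
}

// Launches the kernel and returns how many slots hold their own index,
// or -1 if device memory could not be allocated.
int countCorrectIds() {
    const int total = 10 * 5 * 50;  // blocks in x * blocks in y * threads per block
    int *d_ids = nullptr;
    if (cudaMalloc((void **)&d_ids, total * sizeof(int)) != cudaSuccess) {
        return -1;
    }
    cudaMemset(d_ids, 0xFF, total * sizeof(int));  // pre-fill every int with -1

    dim3 Grid(10, 5, 1);
    dim3 ThreadsPerBlock(50);
    writeGlobalIds<<<Grid, ThreadsPerBlock>>>(d_ids);

    std::vector<int> ids(total, -1);
    cudaMemcpy(ids.data(), d_ids, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_ids);

    int correct = 0;
    for (int i = 0; i < total; ++i) {
        if (ids[i] == i) ++correct;
    }
    return correct;
}

int main() {
    std::printf("slots holding their own index: %d\n", countCorrectIds());
    return 0;
}
```

On a working device every one of the 2500 slots should hold its own index, since each thread computes a distinct `globalId`.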
// GPU memory and CPU memory are separate, so data has to be moved back and forth between them.
// Seen this way, the GPU is a peripheral from the host's point of view, and the OS does implement it that way: the GPU device can be found with lspci.
cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice); // Synchronous function, no overlapping allowed.
kernelFunctionName<<<Grid, ThreadsPerBlock>>>(arg1, arg2, ..., arg n); // Asynchronous, we can take advantage of overlapping.
// Host code during device execution
...
// End of the overlapped host code. Waiting for the end of the kernel execution to transfer data between the host and the device.
cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost); // Synchronous function, no overlapping allowed.
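The synchronous copy / asynchronous launch pattern above can be made concrete with a small sketch. The kernel `doubleElements`, the helper `overlapDemo`, and the dummy host loop are assumptions for illustration. The explicit `cudaDeviceSynchronize()` is not strictly required here, because a `cudaMemcpy` on the default stream already waits for the preceding kernel, but it makes the wait visible and gives a place to check for execution errors.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles every element in place.
__global__ void doubleElements(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 2.0f;
    }
}

// Returns true if all CUDA calls succeed and every element ends up as 2.0f.
bool overlapDemo(int n) {
    std::vector<float> a(n, 1.0f);
    size_t numBytes = n * sizeof(float);
    float *d_a = nullptr;
    if (cudaMalloc((void **)&d_a, numBytes) != cudaSuccess) {
        return false;
    }

    // Synchronous: the host waits here until the copy has finished.
    cudaMemcpy(d_a, a.data(), numBytes, cudaMemcpyHostToDevice);

    // Asynchronous: the launch returns to the host right away.
    doubleElements<<<(n + 255) / 256, 256>>>(d_a, n);
    cudaError_t launchErr = cudaGetLastError();  // reports invalid launch configurations

    // Host code during device execution (dummy work for this sketch).
    long hostWork = 0;
    for (int i = 0; i < n; ++i) {
        hostWork += i % 7;
    }

    // End of the overlapped host code: wait for the kernel, then copy back.
    cudaError_t syncErr = cudaDeviceSynchronize();
    cudaError_t copyErr = cudaMemcpy(a.data(), d_a, numBytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);

    bool ok = launchErr == cudaSuccess && syncErr == cudaSuccess && copyErr == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) {
        ok = (a[i] == 2.0f);
    }
    std::printf("host work result: %ld\n", hostWork);
    return ok;
}

int main() {
    std::printf("overlap demo %s\n", overlapDemo(1 << 20) ? "passed" : "failed");
    return 0;
}
```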