Sept 2025 – Present
GPT-2 CUDA Acceleration
Custom CUDA kernels for GPT-2 inference optimization
Technologies
CUDA, GPU Acceleration, Deep Learning, Nsight Systems
About This Project
Implement and optimize custom CUDA kernels for GPT-2 inference, applying shared-memory tiling, coalesced memory access, warp-level parallelism, and Tensor Core acceleration to achieve substantial speedups over CPU baselines.
Profile at the system and kernel level with NVIDIA Nsight Systems and Nsight Compute to identify performance bottlenecks, memory stalls, and occupancy limits, and use the findings to guide iterative kernel optimization.
Analyze the GPU memory hierarchy, occupancy, and execution divergence to understand the performance characteristics of transformer models on modern GPUs.
Links & Resources
Source code available for private review upon request.