Improving Oversubscribed GPU Memory Performance in the PyTorch Framework
Citations: Web of Science 2 · Scopus 3

Abstract

Popular deep learning frameworks like PyTorch rely heavily on GPUs for training and suffer from out-of-memory (OOM) failures if memory is not managed properly. CUDA Unified Memory (UM) allows tensor objects to oversubscribe GPU memory, but at a heavy performance penalty. In this paper, we build on our UM implementation and create a minimal-overhead CUPTI dynamic profiler that traces unified memory page fault and memory transfer statistics in PyTorch applications. Based on the dynamically profiled statistics, we also invoke the CUDA memory prefetch and advise APIs directly from the PyTorch application to improve oversubscription performance in various PyTorch models, including ResNet and BERT.
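The prefetch/advise mechanism the abstract refers to can be sketched with the standard CUDA Unified Memory calls `cudaMemAdvise` and `cudaMemPrefetchAsync`. This is a minimal illustrative sketch, not the paper's implementation: the kernel, allocation size, and advice choices here are assumptions for demonstration.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Trivial illustrative kernel (not from the paper).
__global__ void scale(float *x, size_t n, float a) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t n = 1 << 26;  // illustrative size; UM allows this to exceed GPU memory
    float *x;
    int dev = 0;
    cudaGetDevice(&dev);

    // Unified Memory allocation: pages migrate on demand between host and device.
    cudaMallocManaged(&x, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) x[i] = 1.0f;

    // Advise the driver: the GPU is the preferred home for this range,
    // and the range will be accessed by that device.
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetPreferredLocation, dev);
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetAccessedBy, dev);

    // Prefetch ahead of the kernel launch so it does not fault page-by-page.
    cudaMemPrefetchAsync(x, n * sizeof(float), dev);

    scale<<<(unsigned)((n + 255) / 256), 256>>>(x, n, 2.0f);

    // Prefetch back before CPU access to avoid a fault storm on the host side.
    cudaMemPrefetchAsync(x, n * sizeof(float), cudaCpuDeviceId);
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);
    cudaFree(x);
    return 0;
}
```

The paper's contribution is deciding *when* and *for which ranges* to issue such calls from PyTorch, driven by CUPTI-profiled fault statistics; the calls themselves are the stock runtime API shown above.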

Keywords

CUDA; Unified memory; PyTorch; Prefetch; Advise; CUPTI
Title
Improving Oversubscribed GPU Memory Performance in the PyTorch Framework
Authors
Choi, Jake; Yeom, Heon Young; Kim, Yoonhee
DOI
10.1007/s10586-022-03805-x
Publication date
2023-10
Type
Article; Early Access
Journal
Cluster Computing, 26(5)
Pages
2835–2850