Improving Oversubscribed GPU Memory Performance in the PyTorch Framework
Citations: Web of Science 2 · Scopus 3

Abstract

Popular deep learning frameworks like PyTorch rely heavily on GPUs for training and suffer from out-of-memory (OOM) failures if memory is not managed properly. CUDA Unified Memory (UM) allows tensor objects to oversubscribe GPU memory, but at a heavy performance penalty. In this paper, we build on our UM implementation and create a minimal-overhead CUPTI dynamic profiler that traces unified memory page fault and memory transfer statistics in PyTorch applications. Based on the dynamically profiled statistics, we also invoke the CUDA memory prefetch and advise APIs directly from the PyTorch application to improve oversubscription performance in various PyTorch models, including ResNet and BERT.
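The prefetch/advise mechanism the abstract refers to can be sketched with the standard CUDA Unified Memory calls `cudaMemAdvise` and `cudaMemPrefetchAsync`. This is a minimal illustrative sketch, not the paper's implementation: the kernel, allocation size, and advice choices here are assumptions for demonstration.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Trivial illustrative kernel (not from the paper).
__global__ void scale(float *x, size_t n, float a) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t n = 1 << 26;  // illustrative size; UM allows this to exceed GPU memory
    float *x;
    int dev = 0;
    cudaGetDevice(&dev);

    // Unified Memory allocation: pages migrate on demand between host and device.
    cudaMallocManaged(&x, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) x[i] = 1.0f;

    // Advise the driver: the GPU is the preferred home for this range,
    // and the range will be accessed by that device.
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetPreferredLocation, dev);
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetAccessedBy, dev);

    // Prefetch ahead of the kernel launch so it does not fault page-by-page.
    cudaMemPrefetchAsync(x, n * sizeof(float), dev);

    scale<<<(unsigned)((n + 255) / 256), 256>>>(x, n, 2.0f);

    // Prefetch back before CPU access to avoid a fault storm on the host side.
    cudaMemPrefetchAsync(x, n * sizeof(float), cudaCpuDeviceId);
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);
    cudaFree(x);
    return 0;
}
```

The paper's contribution is deciding *when* and *for which ranges* to issue such calls from PyTorch, driven by CUPTI-profiled fault statistics; the calls themselves are the stock runtime API shown above.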

Keywords

CUDA; Unified memory; PyTorch; Prefetch; Advise; CUPTI
Title
Improving Oversubscribed GPU Memory Performance in the PyTorch Framework
Authors
Choi, Jake; Yeom, Heon Young; Kim, Yoonhee
DOI
10.1007/s10586-022-03805-x
Publication date
2023-10
Type
Article; Early Access
Journal
Cluster Computing, 26(5)
Pages
2835–2850