Detailed Information


Design of an adaptive GPU sharing and scheduling scheme in container-based cluster

Authors
Chen, Qichen; Oh, Jisun; Kim, Seoyoung; Kim, Yoonhee
Issue Date
Sep-2020
Publisher
Springer
Keywords
GPU management; GPU resource sharing; GPU scheduling; GPU virtualization
Citation
Cluster Computing, v.23, no.3, pp. 2179-2191
Pages
13
Journal Title
Cluster Computing
Volume
23
Number
3
Start Page
2179
End Page
2191
URI
https://scholarworks.sookmyung.ac.kr/handle/2020.sw.sookmyung/1260
DOI
10.1007/s10586-019-02969-3
ISSN
1386-7857 (Print)
1573-7543 (Online)
Abstract
Container-based virtualization is an innovative technology that accelerates software development by providing portability and maintainability of applications. Recently, a growing number of workloads, such as high performance computing (HPC) and deep learning (DL), have been deployed in container-based environments. However, GPU resource management, especially GPU memory over-subscription in container-based clusters, still causes substantial performance loss and remains challenging. This paper proposes an adaptive fair-share method for sharing GPUs effectively in a container-based virtualization environment, as well as an execution rescheduling method that manages the execution order of containers to acquire maximum performance gain. We also propose a checkpoint-based mechanism, targeted at DL workloads running with TensorFlow, which can efficiently solve the GPU memory over-subscription problem. We demonstrate that our approach improves overall performance and resource utilization compared to the default and static fair-share methods under homogeneous and heterogeneous workloads. Compared to these two baselines, the proposed method reduces average execution time by 16.37% and 15.61% and boosts average GPU memory utilization by approximately 52.46% and 10.3%, respectively. We also evaluated the checkpoint-based mechanism by running multiple CNN workloads with TensorFlow simultaneously, and the results show that the proposed mechanism ensures each workload executes safely without out-of-memory (OOM) errors. © 2019, Springer Science+Business Media, LLC, part of Springer Nature.
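To illustrate the kind of checkpoint-based pause/resume the abstract describes, the sketch below uses TensorFlow's standard tf.train.Checkpoint and CheckpointManager APIs to persist a small CNN training job so its container can release GPU memory and later resume. This is a minimal, hedged illustration only: the checkpoint directory, the PAUSE flag file, and the pause_requested helper are hypothetical placeholders, not the paper's actual implementation or scheduler interface.

import os
import tensorflow as tf

CKPT_DIR = "/tmp/ckpt_demo"  # hypothetical checkpoint directory

# Small CNN and optimizer whose state will be checkpointed.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Checkpoint tracks model weights, optimizer state, and the step counter.
step = tf.Variable(0, dtype=tf.int64)
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer, step=step)
manager = tf.train.CheckpointManager(ckpt, CKPT_DIR, max_to_keep=1)

# Resume from the latest checkpoint if a previous run was preempted.
if manager.latest_checkpoint:
    ckpt.restore(manager.latest_checkpoint)

def pause_requested():
    # Placeholder for a scheduler signal (e.g., a flag file); illustrative only.
    return os.path.exists(os.path.join(CKPT_DIR, "PAUSE"))

# Dummy data standing in for a real CNN training set.
x = tf.random.normal((256, 28, 28, 1))
y = tf.random.uniform((256,), maxval=10, dtype=tf.int64)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)

for epoch in range(5):
    for batch_x, batch_y in dataset:
        with tf.GradientTape() as tape:
            loss = loss_fn(batch_y, model(batch_x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        step.assign_add(1)

        # If the scheduler asks this container to yield the GPU,
        # persist training state and exit so GPU memory is released.
        if pause_requested():
            manager.save(checkpoint_number=int(step))
            raise SystemExit("checkpointed; GPU released for rescheduling")

    manager.save(checkpoint_number=int(step))

Exiting after the save frees the container's GPU memory so another workload can be admitted; when the container is restarted, training resumes from manager.latest_checkpoint instead of losing progress, which is the behavior a checkpoint-based OOM-avoidance scheme relies on.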
Appears in Collections
College of Engineering > Division of Software > 1. Journal Articles



Related Researcher

Kim, Yoonhee
College of Engineering (Division of Software (Advanced))
