We observed free GPU memory decreasing gradually over time and traced it to the Cholesky decomposition. Because of our architecture, this calculation occasionally runs in a newly created thread. Each new thread repeats a thread-local allocation of about 40 MB that is never freed, which eventually exhausts GPU memory.
Description
Our software runs a number of services in different threads, and ArrayFire runs in one of them. Occasionally, the software destroys these threads and creates new ones. Each time it does so, GPU memory usage increases and never goes back down. After enough destroy/re-create cycles, GPU memory is exhausted.
We traced this allocation to af::choleskyInPlace, so it likely happens inside cuSolver.
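As a possible stopgap until the leak itself is fixed, the ArrayFire calls could be routed through a single long-lived thread, so that any per-thread state is allocated only once. The following is a minimal sketch under that assumption, not something we have verified against this leak. `GpuWorker` and `submit` are hypothetical names and are not part of ArrayFire; the sketch uses only the C++ standard library.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical helper (not part of ArrayFire): one long-lived thread that
// executes every submitted task, so thread-local state is created only once.
class GpuWorker {
public:
    GpuWorker() : worker_([this] { run(); }) {}

    ~GpuWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_one();
        worker_.join();  // remaining queued tasks are drained before exit
    }

    GpuWorker(const GpuWorker&) = delete;
    GpuWorker& operator=(const GpuWorker&) = delete;

    // Queue a task; the returned future becomes ready once the task has run.
    // An exception thrown by the task is rethrown from future::get().
    std::future<void> submit(std::function<void()> task) {
        std::packaged_task<void()> job(std::move(task));
        std::future<void> done = job.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
        return done;
    }

private:
    void run() {
        for (;;) {
            std::packaged_task<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stop requested and queue drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so submit() is never blocked by a task
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::packaged_task<void()>> jobs_;
    bool stopping_ = false;
    std::thread worker_;  // declared last: the members above exist before run() starts
};
```

A service thread that is about to be destroyed and re-created would then call something like `worker.submit([&] { /* af::choleskyInPlace(...) etc. */ }).get();` on a `GpuWorker` that outlives it, instead of calling ArrayFire directly.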
Reproducible Code and/or Steps
The code below reproduces the problem. Each new thread allocates roughly 40 MB of additional GPU memory, which is never released, even after the thread has been joined.
#include <arrayfire.h>
#include <cuda_runtime.h>
#include <iostream>
#include <thread>

// Free device memory in bytes, as reported by the CUDA runtime.
std::size_t available_mem()
{
    std::size_t free = 0;
    std::size_t total = 0;
    cudaMemGetInfo(&free, &total);
    return free;
}

int main()
try
{
    af::info();
    std::size_t init_mem = available_mem();

    // Run the same computation in ten short-lived threads, one after another.
    for (std::size_t i = 0; i < 10; ++i)
    {
        std::thread t(
            [&]()
            {
                // Build a small symmetric matrix l = x^T * x.
                af::array x = af::randu(40, 10);
                af::array l = af::matmulTN(x, x);
                x = af::array();
                af::eval(l);
                af::deviceGC();

                if (af::choleskyInPlace(l, false) != 0)
                {
                    std::cout << "bad" << std::endl;
                    return;
                }
                af::eval(l);

                // Release all arrays and ArrayFire's cached buffers before measuring.
                l = af::array();
                af::deviceGC();

                // Memory consumed since startup, in MB.
                std::cout << "consumed: " << (init_mem - available_mem()) / 1024.0 / 1024.0 << std::endl;
            });
        t.join();
    }
    return 0;
}
catch (const std::exception& e)
{
    std::cout << e.what() << std::endl;
    return 1;
}
Output:
consumed: 50
consumed: 90
consumed: 130
consumed: 172
consumed: 212
consumed: 252
consumed: 294
consumed: 334
consumed: 374
consumed: 416
System Information
- ArrayFire version: 3.8.3
- Devices installed on the system: Nvidia GTX 1070 8GB
- Output from af::info():
ArrayFire v3.8.3 (CUDA, 64-bit Windows, build 987d5675a)
Platform: CUDA Runtime 11.8, Driver: 13000
[0] NVIDIA GeForce GTX 1070, 8192 MB, CUDA Compute 6.1
- ...
Checklist
- Using the latest available ArrayFire release
- GPU drivers are up to date