|
What is the speedup range ? |
Sorry, something went wrong.
|
TEST(Index, blah) {
array a = randu(20000000, 10);
array idx = seq(20000000);
array b = a(idx);
b.eval();
af::sync();
}
Master: Time(%) Total Time (ns) Instances Average Minimum Maximum Name
------- --------------- --------- ------------ ---------- ---------- ----------------------------------------------------------------------------------------------------
49.7 10,107,916 1 10,107,916.0 10,107,916 10,107,916 void cuda::index<float>(cuda::Param<float>, cuda::CParam<float>, cuda::AssignKernelParam, int, int)
38.4 7,811,232 1 7,811,232.0 7,811,232 7,811,232 void cuda::kernel::uniformPhilox<float>(float*, unsigned int, unsigned int, unsigned int, unsigned …
8.1 1,645,554 1 1,645,554.0 1,645,554 1,645,554 KER9745534647381087054
3.9 791,546 1 791,546.0 791,546 791,546 void cuda::range<float>(cuda::Param<float>, int, int, int)
This PR: Time(%) Total Time (ns) Instances Average Minimum Maximum Name
------- --------------- --------- ----------- --------- --------- ----------------------------------------------------------------------------------------------------
60.6 7,809,308 1 7,809,308.0 7,809,308 7,809,308 void cuda::kernel::uniformPhilox<float>(float*, unsigned int, unsigned int, unsigned int, unsigned …
20.3 2,615,049 1 2,615,049.0 2,615,049 2,615,049 void cuda::index<float>(cuda::Param<float>, cuda::CParam<float>, cuda::AssignKernelParam, int, int)
12.8 1,646,066 1 1,646,066.0 1,646,066 1,646,066 KER9745534647381087054
6.3 812,761 1 812,761.0 812,761 812,761 void cuda::range<float>(cuda::Param<float>, int, int, int)
3.8x faster |
Sorry, something went wrong.
This optimization dynamically sets the block size based on the output array
dimension. Originally we had a block size of 32x8 threads per block. This
configuration was not ideal when indexing into a long array where you
had few columns and many rows. The current approach creates blocks of
256x1, 128x2, 64x4 and 32,8 to better accommodate smaller dimensions.
Description
This optimization dynamically sets the block size based on the output array
dimension. Originally we had a block size of 32x8 threads per block. This
configuration was not ideal when indexing into a long array where you
had few columns and many rows. The current approach creates blocks of
256x1, 128x2, 64x4 and 32,8 to better accommodate smaller dimensions.
Changes to Users
None
Checklist