Thanks for sending this in. I made a couple of suggestions which are required to get the OpenCL backend working again.
There is an invalid read access error with the sparse test in the CUDA backend. I am able to reproduce it using the following code:

    TEST(Cast, abs) {
        using namespace af;
        array a = randu(100, 100, f64);
        array b = a.as(f32);
        array c = max<double>(abs(a - b));
    }

There is something odd going on with the implicit casts in the subtraction operation. It looks like the buffer object's shape is not set during the conversion. I am not sure where this is happening and I am investigating it.
@umar456 any update on this? Can I help in any way?
@jacobkahn We can limit this optimization to casts between integer types or floating point types. This way it behaves like C++ types. Alternatively, we could keep the current behavior and allow destructive casts and expect the user to use functions like floor or ceil to get the same behavior.
@umar456, revisiting this after some time: I think for now, destructive casts that emulate floor/ceiling operations probably aren't good to implement implicitly. The casts that I think are more interesting to optimize away are casts between similar types with different precisions (f16 <> f32 <> f64 or u32 <> u64, etc.). While some of these casts are destructive, there isn't really an operation to emulate them. A user who casts f32 --> f16 --> f32 almost certainly isn't doing so to intentionally lose precision. Thoughts?
I modified the PR to remove the cast in limited scenarios. Currently I am removing the intermediate casts for casts between floating point values. Floating point to integer types are not removed and integer casts from larger to smaller types are not removed. Here are all the combinations of casts that will be removed. The x cell indicates that the cast will be removed if the left hand column type is the outer type and the top row is the inner type. For example: outer -> inner -> outer.

| inner-> | f32 | f64 | c32 | c64 | s32 | u32 | u8 | b8 | s64 | u64 | s16 | u16 | f16 |
|---------|-----|-----|-----|-----|-----|-----|----|----|-----|-----|-----|-----|-----|
| f32 | x | x | x | x | | | | | | | | | x |
| f64 | x | x | x | x | | | | | | | | | x |
| c32 | x | x | x | x | | | | | | | | | x |
| c64 | x | x | x | x | | | | | | | | | x |
| s32 | x | x | x | x | x | x | | | x | x | | | x |
| u32 | x | x | x | x | x | x | | | x | x | | | x |
| u8 | x | x | x | x | x | x | x | x | x | x | x | x | x |
| b8 | x | x | x | x | x | x | x | x | x | x | x | x | x |
| s64 | x | x | x | x | | | | | x | x | | | x |
| u64 | x | x | x | x | | | | | x | x | | | x |
| s16 | x | x | x | x | x | x | | | x | x | x | x | x |
| u16 | x | x | x | x | x | x | | | x | x | x | x | x |
| f16 | x | x | x | x | | | | | | | | | x |
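
Reading the table as a rule: every cell in a floating point or complex column is marked, and an integer inner type is marked only when the outer type is also an integer and the inner type is at least as wide. A minimal sketch of that reading follows. The `DType` enum, `isFloatLike`, `intBytes`, and `castPairRemovable` are hypothetical names made up for illustration, not ArrayFire's API, and this is a restatement of the table rather than the PR's actual implementation.

```cpp
#include <cstddef>

// Hypothetical stand-in for the dtype names used in the table above.
enum class DType { f32, f64, c32, c64, s32, u32, u8, b8, s64, u64, s16, u16, f16 };

// True for real and complex floating point types.
constexpr bool isFloatLike(DType t) {
    return t == DType::f32 || t == DType::f64 || t == DType::c32 ||
           t == DType::c64 || t == DType::f16;
}

// Size in bytes of one integer element; 0 for non-integer types.
constexpr std::size_t intBytes(DType t) {
    switch (t) {
        case DType::u8:
        case DType::b8: return 1;
        case DType::s16:
        case DType::u16: return 2;
        case DType::s32:
        case DType::u32: return 4;
        case DType::s64:
        case DType::u64: return 8;
        default: return 0;
    }
}

// Mirrors the table: may the casts in outer -> inner -> outer be dropped?
constexpr bool castPairRemovable(DType outer, DType inner) {
    if (isFloatLike(inner)) return true;        // all float/complex columns are marked
    if (isFloatLike(outer)) return false;       // float -> integer -> float is kept
    return intBytes(inner) >= intBytes(outer);  // integer inner must not be narrower
}
```

For example, `castPairRemovable(DType::s32, DType::u8)` is false because s32 -> u8 -> s32 truncates, while `castPairRemovable(DType::u8, DType::s32)` is true.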
Adds a JIT optimization which does a NOOP in the case of sequential casts that don't result in a differently-typed result.
Description
The following code is technically a noop:
No casting kernels should be generated for any of the above operations, especially for c, but they are. The solution is, when creating the CastOp/CastWrapper for c, to check whether the previous operation was a cast. If it was, and the output type of the operation before that one is the same as the output type of the current cast, create a __noop node between that earlier operation and the current one.
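
The check described here can be sketched on a toy node tree. This is a minimal model, assuming a simplified `Node` with string-valued `op` and `type` fields; `Node`, `makeNode`, and `makeCast` are hypothetical names and do not reflect ArrayFire's actual Node/NaryNode classes. The sketch also leaves out any restriction on which type pairs are safe to collapse.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified JIT node; not ArrayFire's actual Node class.
struct Node {
    std::string op;    // e.g. "buffer", "cast", "__noop"
    std::string type;  // output type name, e.g. "f32"
    std::vector<std::shared_ptr<Node>> children;
};
using NodePtr = std::shared_ptr<Node>;

NodePtr makeNode(std::string op, std::string type,
                 std::vector<NodePtr> children = {}) {
    return std::make_shared<Node>(
        Node{std::move(op), std::move(type), std::move(children)});
}

// Build a cast of `in` to `outType`. If `in` is itself a cast and its input
// already has type `outType`, return a __noop node that points at that
// earlier input instead of stacking a second cast.
NodePtr makeCast(const NodePtr& in, const std::string& outType) {
    assert(in && "cast needs an input node");
    if (in->op == "cast" && !in->children.empty() &&
        in->children[0]->type == outType) {
        return makeNode("__noop", outType, {in->children[0]});
    }
    return makeNode("cast", outType, {in});
}
```

With a buffer `a` of type f32, `b = makeCast(a, "f64")` stays a real cast node, and `c = makeCast(b, "f32")` becomes a `__noop` whose only child is `a`. Because `b` is left untouched, its result is still available if something else uses it.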
This also precludes tricky cases like:
where the result of b could be used, in which case the intermediate casting operation can't be discounted completely.
With the change, running AF_JIT_KERNEL_TRACE=stderr ./test/cast_cuda --gtest_filter="*Test_JIT_DuplicateCastNoop" produces the following kernels:
Before the change, the generated kernel used wasteful casts:
This PR also adds op, type, and children accessors to Node/NaryNode to facilitate inspecting the JIT tree for optimization.
Further optimization could be had by recursively checking previous operations until reaching an operation whose previous operation is not a cast. This would fix arbitrarily long chains of casts that are noops on a particular subtree of JIT operations.
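
One possible shape for that follow-up, written as a loop rather than recursion, is sketched below. It is not part of this PR. `ChainNode` and `findCastSource` are hypothetical names, and the sketch assumes every cast it walks past is one that is safe to drop, which a real implementation would have to verify for each step.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified JIT node used only for this illustration.
struct ChainNode {
    std::string op;    // e.g. "buffer", "cast"
    std::string type;  // output type name, e.g. "f32"
    std::vector<std::shared_ptr<ChainNode>> children;
};
using ChainPtr = std::shared_ptr<ChainNode>;

// Walk down through consecutive cast nodes starting at `in` and return the
// deepest input whose type equals `outType`, or nullptr if there is none.
// Assumption: every cast walked past here is safe to skip.
ChainPtr findCastSource(const ChainPtr& in, const std::string& outType) {
    assert(in && "need a starting node");
    ChainPtr found;
    ChainPtr cur = in;
    while (cur->op == "cast" && !cur->children.empty()) {
        cur = cur->children[0];
        if (cur->type == outType) found = cur;
    }
    return found;
}
```

A caster could call `findCastSource(in, outType)` and, when the result is non-null, attach a `__noop` to the returned node instead of emitting another cast.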
Changes to Users
No changes to user behavior.
Checklist