Fast Function Approximations lowering. #8566
base: main
Conversation
@@ -38,6 +38,7 @@ extern "C" WEAK void *halide_opencl_get_symbol(void *user_context, const char *n
         "opencl.dll",
 #else
         "libOpenCL.so",
+        "libOpenCL.so.1",
I wonder if it's the case that libOpenCL.so.1 should rather replace libOpenCL.so? The latter is a namelink that's only present when -dev packages are installed. It should always point to the former.
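For illustration, a fallback over several candidate sonames can be sketched as below. This is a minimal sketch assuming a POSIX dlopen(); the function name is hypothetical and this is not Halide's actual runtime code:

```cpp
#include <dlfcn.h>
#include <cstddef>

// Try each candidate soname in order; return the handle of the first one
// that dlopen() can load, or nullptr if none of them are available.
void *open_first_available(const char *const *names, size_t count) {
    for (size_t i = 0; i < count; i++) {
        if (void *lib = dlopen(names[i], RTLD_LAZY)) {
            return lib;
        }
    }
    return nullptr;
}
```

With the namelink dropped, the Linux candidate list would contain only "libOpenCL.so.1".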
I was thinking the same. I can fix it later, but I needed this on my local machine, so rather than being too destructive without consensus, I did this.
I removed it. We'll see what the build bots do.
Force-pushed bea8612 to 0de4dbc
Thanks so much for doing this, and sorry it took me so long to review it (I'm finally clawing out from under my deadlines). It generally looks good but I have some review comments.
What order should this be done in vs our change to strict_float? Is there any reason to delay this until after strict_float is changed?
@@ -33,6 +33,12 @@ bool should_extract(const Expr &e, bool lift_all) {
         return false;
     }

+    if (const Call *c = e.as<Call>()) {
The need for this makes me think the extra args would be better just as flat extra args to the math intrinsic, instead of being packed into a struct.
Well, if I don't pass these things as a make_struct(), CSE will still lift those arguments out of the call when you have multiple such calls. That makes it harder to figure out what the actual precision arguments to the Call node were in the lowering pass that wants to read them back: by then they might have become a Let or LetStmt, instead of a simple FloatImm.
CSE will never lift constants, so they should be fine.
strict_float is already broken today in a few scenarios. All these transcendentals are broken right now regarding strict_float(): every GPU backend has a different implementation of them, with different precisions. The good thing is that this PR actually paves the way towards fixing strict_float for transcendentals. Long term, if we have our own implementation for all of them, we can guarantee under strict_float that they will yield the same result. However, this would require us to have an FMA intrinsic as well, to be able to express the polynomials using Horner's method with fma ops. So yeah, I'm definitely in favor of accepting that strict_float is badly broken, and moving forward with this PR.
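To illustrate the Horner's-method-with-fma idea mentioned above, here is a minimal standalone sketch (not PR code; the function name and coefficients are just for illustration):

```cpp
#include <cmath>

// Evaluate c0 + c1*x + c2*x^2 + c3*x^3 with Horner's method, using one
// fused multiply-add per coefficient: r = fma(r, x, c_i). On hardware with
// an FMA instruction this is fast and rounds only once per step, which is
// exactly why an FMA intrinsic helps make polynomial approximations
// reproducible across backends.
float horner4(float x, float c0, float c1, float c2, float c3) {
    float r = c3;
    r = std::fmaf(r, x, c2);
    r = std::fmaf(r, x, c1);
    r = std::fmaf(r, x, c0);
    return r;
}
```

For example, horner4(x, 1.0f, 1.0f, 0.5f, 1.0f / 6.0f) evaluates the degree-3 Taylor polynomial of exp(x) around 0.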
No worries! Thanks a lot for looking into this! I'll address your feedback, questions, and improvements tomorrow, so that this is all still fresh in your head.
Update: Nevermind, I found another approach that works well for now.
…nge (-1, 1) to test (-4, 4). Cleanup code/comments. Test performance for all approximations.
…st not touching input: prevents constant folding.
…ynomial by default instead.
…r lost precision. Also test accuracy of non-forced polynomials, i.e., potentially intrinsics.
Force-pushed 3211d3a to f6f7fd0
…Selectively disable some tests that require strict_float on GPU backends.
I took care of all feedback, except for the make_struct wrapper for the precision arguments. That's for later. I updated the original post for the PR on top. More info on the latest concerns and notes can be found there!
Force-pushed f28a8b0 to 7000f21
The big transcendental lowering update! Replaces #8388.
TODO
I still have to do:
Overview:

- Fast transcendentals implemented for: sin, cos, tan, atan, atan2, exp, log, expm1, tanh, asin, acos.
- Simple API to specify precision requirements. Default-initialized precision (AUTO without constraints) means "don't care about precision, as long as it's reasonable and fast", which gives you the highest chance of selecting a high-performance implementation based on hardware instructions. The optimization objectives MULPE (max ULP error) and MAE (max absolute error) are available. Compared to the previous PR, I removed MULPE_MAE as I didn't see a good purpose for it.
- Tabular info on intrinsics and native functions (native_cos, native_exp, etc.; fast::cos, fast::exp, etc.) and their precision and speed, to select an appropriate implementation for lowering to something that is definitely not slower, while satisfying the precision requirements.
- Tabular info measuring the exact precision obtained by exhaustively iterating over all floats in the polynomial's native interval. Measured both MULPE and MAE for Float32. Precisions are not yet evaluated on f16 or f64, which is future work (which I have currently not planned). Precisions are measured by correctness/determine_fast_function_approximation_metrics.cpp.
- Performance tests validating that:
- Accuracy tests validating that:
- Drive-by fix for adding libOpenCL.so.1 to the list of tested sonames for the OpenCL runtime.

Review guide
- Approximation precision is carried as a Call::make_struct node with 4 parameters (see API below). This approximation-precision Call node survives until the lowering pass where the transcendentals are lowered; in that pass, the parameters are extracted again from this Call node's arguments. I conceptually like that this way they are bundled and clearly not at the same level as the actual mathematical arguments. Is this a good approach? In order for this to work, I had to stop CSE from extracting those precision arguments, and StrictifyFloat from recursing down into that struct and littering strict_float on those numbers. I have seen the Call::bundle intrinsic. Perhaps that one is better for this purpose? @abadams
- Float(16) and Float(64) are not yet implemented/tested. The polynomial approximations should work correctly (although untested) for these other data types.
- native_tan() compiles to the same three instructions as I implemented on CUDA: sin.approx.f32, cos.approx.f32, div.approx.f32. I haven't investigated AMD's documentation on available hardware instructions.

Concerns
- exp() (regular exp(), not fast_exp()) claims to be bit-exact, which is proven wrong by the pre-existing test correctness/vector_math.cpp: tan_f32(vec2<f32>) did not get converted to tan_f32(first element), tan_f32(second element).
.API