Add TFloat data type for neural network #3486
Conversation
Using `FAST_FLOAT` replaces the `double` data type in the neural network code by the new `TFloat` type, which is then `float`. Floating point calculations should be faster with `float` than with `double`. The new code passes most unit tests (a few still fail and need more work). Up to now I only implemented SIMD code for AVX/AVX2.
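For context, the core of the change boils down to a conditional type alias along these lines (a simplified sketch, not the exact Tesseract header):

```cpp
// Simplified sketch: TFloat is double by default and becomes float
// when the build defines FAST_FLOAT.
#ifdef FAST_FLOAT
using TFloat = float;
#else
using TFloat = double;
#endif
```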
What is the "T" in TFloat? Type?
Tesseract float? We don't have a clear convention whether data types should be lower or upper case; there are upper case examples in the existing code.
I'd like to rename internal types to either CamelCase or snake_case to keep tess style consistent.
I could use …
Primitive typedefs could be just snake_case?
I am really curious what the results are. Personally I do not have any experience with using single precision for this. There is also the issue of type mismatch in some places (for example simddetect.cpp:97): in some cases the compiler will still operate on double. Although, after checking on Compiler Explorer, it looks like some of those conversions get optimized away anyway.
src/arch/simddetect.cpp
```diff
   for (int k = 0; k < n; ++k) {
     total += u[k] * v[k];
   }
   return total;
 }

 // Compute dot product using std::inner_product.
-static double DotProductStdInnerProduct(const double *u, const double *v, int n) {
+static TFloat DotProductStdInnerProduct(const TFloat *u, const TFloat *v, int n) {
   return std::inner_product(u, u + n, v, 0.0);
```
Suggested change:

```diff
-  return std::inner_product(u, u + n, v, 0.0);
+  return std::inner_product(u, u + n, v, TFloat(0.0));
```
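The reason this matters (a minimal, stand-alone illustration, assuming a FAST_FLOAT build where TFloat is float): std::inner_product deduces its accumulator type from the initial-value argument, not from the iterators, so a plain 0.0 keeps the running sum in double.

```cpp
#include <cstdio>
#include <numeric>

int main() {
  float u[4] = {1.f, 2.f, 3.f, 4.f};
  float v[4] = {5.f, 6.f, 7.f, 8.f};
  // 0.0 is a double literal: the running sum is a double, so every
  // addition promotes to double even though the inputs are float.
  double promoted = std::inner_product(u, u + 4, v, 0.0);
  // An initial value of the element type keeps the accumulation in float,
  // which is what the suggested TFloat(0.0) achieves in a float build.
  float kept = std::inner_product(u, u + 4, v, 0.0f);
  std::printf("%f %f\n", promoted, kept);
  return 0;
}
```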
Right, that would greatly improve this function for the float case.
See also #3490 -- just a rough first run on medium hardware (laptop i7) suggests this shaves off an at-least-noticeable amount. Haven't got time to dive in more right now and run VTune or anything, but even 10% off the DotProduct would be great, as that's eating upwards of 90% of total run-time on my rigs. More floats multiplied per clock is a big win then IMO. [Edit: ^^^ DotProduct == AVX2 version, of course] (#3490 has preliminary code for AVX2, etc. fitted for TFloat. When I file a proper pull request and you like it, someone with more AVX/AVX2 savvy should peer-review that hacky edit. 😊 )
Yes, the DotProduct function (for AVX / AVX2 / SSE / NEON / native / ...) uses most of the processor time with best models, especially during training runs. Ideally using float should allow the SIMD code to process twice as many values per instruction.
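To illustrate the lane-count argument (a rough sketch with AVX intrinsics, not the actual code from this PR or #3490): a 256-bit register holds 8 floats but only 4 doubles, so the float path does twice the work per multiply.

```cpp
#include <immintrin.h>

// Rough sketch of an AVX float dot product: 8 float lanes per iteration,
// versus 4 lanes in the equivalent double version.
static float DotProductAVXFloatSketch(const float *u, const float *v, int n) {
  __m256 sum = _mm256_setzero_ps();            // 8 float accumulators
  int k = 0;
  for (; k + 8 <= n; k += 8) {
    __m256 a = _mm256_loadu_ps(u + k);         // unaligned load, 8 floats
    __m256 b = _mm256_loadu_ps(v + k);
    sum = _mm256_add_ps(sum, _mm256_mul_ps(a, b));
  }
  alignas(32) float lanes[8];
  _mm256_store_ps(lanes, sum);
  float total = lanes[0] + lanes[1] + lanes[2] + lanes[3] +
                lanes[4] + lanes[5] + lanes[6] + lanes[7];
  for (; k < n; ++k) {                         // scalar tail
    total += u[k] * v[k];
  }
  return total;
}
```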
Once we have this float 32 option ready, the next step could be OpenMP offloading, to calculate the dot product on the GPU with just a few lines of code.
Do you have code examples for that?
One issue with the current Tesseract code is the mix of the two implementations "fast" and "best" in the same classes. It results in lots of places which check which of the two modes is active. Ideally "fast" and "best" should use separate classes, and the "best" classes should be template classes which allow both float and double.
Regarding OpenMP offloading, here is one source:
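Not from that source, but a minimal sketch of what OpenMP target offloading of the dot product could look like (it assumes a compiler and runtime built with GPU offload support; without that, the loop simply runs on the host, and the function name here is made up):

```cpp
// Hedged sketch: offloading a dot product with OpenMP target directives.
double DotProductOffloadSketch(const double *u, const double *v, int n) {
  double total = 0.0;
  // map(to: ...) copies both input arrays to the device for this region;
  // the reduction combines the per-thread partial sums back into total.
  #pragma omp target teams distribute parallel for reduction(+ : total) \
      map(to : u[0:n], v[0:n]) map(tofrom : total)
  for (int k = 0; k < n; ++k) {
    total += u[k] * v[k];
  }
  return total;
}
```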
Re the OpenMP bit: might we be better off using BLAS (or BLIS?) primitives with an OpenCL-supporting library, and thus leave the whole micro-tuning of math operations on multiple platforms to someone else? I know it's not going to be as simple as that, but my current concern is that I at least am not running the usual gcc+Linux rigs for this (MSVC+Windows mainly, on not always the quickest hardware), nor are my users, so something like clBLAS (OpenCL+BLAS) looks more promising to me than OpenMP right now. OpenMP is at spec 5.something and MSVC2019 is at... 2.0 (the /openmp compiler flag)? Plus I don't see where MSVC has that alleged "GPU support" in its OpenMP implementation, so... 🤔 🤔 BLAS/BLIS comes with SDOT, DDOT, etc., plus several BLAS libs out there mention AMD+ARM (OpenCL) support, next to the ubiquitous NVIDIA -- at least the way I read their pages. Meanwhile OpenMP offloading doesn't look usable for me yet. Anyway, this is new ground for me, so disregard or correct if I'm saying dumb things here. 🍼
OK, some performance numbers for FAST_FLOAT. Is FAST_FLOAT useful? TL;DR: YES. Here are the results from an MSVC2019 (latest) + VTune (latest) set of profile runs:
What does this say? DotProduct of ANY KIND will be about 50% faster, so @stweil was on the money. 😄 This is using #3490 + fixes (to be posted later, running out of time again - RL), i.e. FAST_FLOAT, done pervasively as coded by bleeding-edge commits from @stweil, plus all AVX/SSE/FMA dot product calls ported as well (that's in #3490 and will be re-submitted in a cleaner pull request later). REL64 = Windows 10, 64-bit build, using MSVC with AVX2 and other Release Build optimizations enabled. For comparison, as an aside, here are my Debug Build numbers: 64-bit MSVC build, but with all debug flags ON: much slower, but the same relative speed-up:
What are the AVX_A and AVX_8 versions? And TF vs DBL? DBL is Native, done with all doubles. Aside: resulting values differ in the 6th-7th digit between the SSE/AVX/FMA and Native versions. There's also that same inaccuracy visible in double vs. float, so output changes ever so slightly. (Makes sense given IEEE 754, anyway.) AVX_A is DotProductAVX optimized for aligned loads vs. unaligned ones. AVX_8 is like ~FAST_FLOAT16, i.e. 8 floats per round; regular AVX is 16 floats per round (I took out the FLOAT16 conditional logic there). Do note that the ALIGNED code is exercised only a few times, relatively speaking, due to the way the test/benchmark code has been hacked together (intentionally! as I was testing the stability of the code changes). I don't have a good answer why the FMA code is so much faster than the rest. 🤔 -- At least the validation code in the test assures me that the output is the same value, so it's not due to bugs "helping" that one. Here's the test/benchmark code snippet used, so you can see what was done, globally speaking:
Oh, in case anyone wonders why DBL isn't the same speed everywhere: it's still TFloat vectors coming in, so only the arithmetic is done in all-double. Hence different numbers for the same function in the DOUBLE and FLOAT scenarios.
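For readers not steeped in intrinsics, the aligned-vs-unaligned distinction behind the AVX_A variant mentioned above is just the choice of load instruction (illustration only, not the benchmark code):

```cpp
#include <immintrin.h>

void LoadKinds(const float *any_ptr, const float *aligned32_ptr) {
  // Unaligned load: works for any pointer, possibly at a small penalty.
  __m256 a = _mm256_loadu_ps(any_ptr);
  // Aligned load: requires the address to be 32-byte aligned, otherwise it faults.
  __m256 b = _mm256_load_ps(aligned32_ptr);
  (void)a;
  (void)b;
}
```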
Latest results for @stweil's code on MSVC2019 latest, /openmp:experimental (https://devblogs.microsoft.com/cppblog/improved-openmp-support-for-cpp-in-visual-studio/), on a lightly loaded i7 4700 laptop (has AVX2, not AVX512). TL;DR: The CPUtime:Self column has these 0 numbers because the compiler clearly decided to inline a few of these functions, but not all of them. (This also affects your debugging experience in release builds, BTW: the inlined stuff is not steppable.) Sad discovery right now: VTune doesn't cope well with templatized functions: it somehow bundles them together once they are inside another template instance (the benchmark runner in this case), which is not exactly how the benchmark is coded:
Anyway, the difference between the double and float run is 2:1, and that matches the visual experience of the running benchmark quite well: the progress feedback via fprintf() in the double run feels about twice as slow as in the float run.
…: added that one as another enabling condition since benchmarks have shown MSVC2019's `/openmp:experimental` to deliver. :-) (See tesseract-ocr#3486 benchmark reports on @stweil's DotProductNative() implementation)
My tests now also show a performance gain. I have run them on several machines. Results:
@stweil: very nice set of hardware rigs you got there! 😄 And some interesting results! From far away, it looks like the ranking of the 'Native' variant is highly dependent on which compiler (and its settings, probably) is used. What I've seen on my dev rig so far is that MSVC2019 (v16.10.3) is quite smart in its optimizations:
Asides:
Here's a snapshot of how it looks with all OpenMP support dialed up to the max in MSVC2019: note the 4 cores at work when there are bits of LSTM to distribute (4 cores = 4 brown graphs!). And here's that same report showing the DotProductAVX score and where it is called from (top-down view again -- the cryptic func identifiers are OpenMP/MSVC multithreading at work). Here's a second bulktest run, now with the OpenMP multithread macros DISABLED in LSTM as mentioned above, so we get a single-core load and easier-to-read numbers; also visible in the core graphs is the lack of activity there: the red stuff is all VCOMP (Microsoft internals) spinning like mad waiting for OpenMP work that will never arrive. (The rest of the cost (61%!) is all VCOMP spinning / MS runtime internals: "SwitchToThread" in KERNELBASE.dll and children thereof.) The way I read this, OpenMP is definitely quite experimental over at Microsoft -- this is the latest available compiler you're looking at. Granted, on hardware that's a couple of years old, but that VCOMP/NtYield stuff feels a bit yucky IMO. On the other hand, it looks like the optimizations done by hand (AVX, SSE, etc. DotProduct and Matrix codes, float vs. double now) are pretty solid, at least on this hardware. That means my users will get faster OCR results -- still slow, but definitely faster! 👍 (🤔 What you also MAY note here is that the matrix calls don't feature anywhere in these benchmarks. Was that a happy optimization once that's now not important anymore, or is the matrix code more involved with training Tesseract? Just asking as I'm a Tesseract n00b. 😄 )
The pull request is now ready for merging. I squashed my own commits and added three commits from @GerHobbelt which fixed my dot product implementations and added another one. Still missing: documentation, continuous integration and more unittest code. I want to add a configure parameter to enable the new float mode -- how should it be named? For continuous integration it might be sufficient to run some builds with that option enabled. There is also more fine-tuning to be done by updating more code locations from double to TFloat.
Up to now Tesseract used double for training and recognition with "best" models. This commit replaces double by a new data type TFloat which is double by default, but float if FAST_FLOAT is defined. Ideally this should allow faster training. Signed-off-by: Stefan Weil <[email protected]>
8 float values fit in a single 256-bit vector (8×32 bits), versus 4 doubles (4×64 bits). [sw] Format commit message and use float instead of TFloat
There are lines like `__m256d scale01234567 = _mm256_loadu_ps(scales)`, i.e. loading float vectors into double vector types (see the sketch below). [sw] Formatted commit message
[sw] Formatted commit message
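Regarding the `_mm256_loadu_ps` line quoted in the commit message above, the mismatch and one way to resolve it look roughly like this (a sketch, not the actual fix in the PR):

```cpp
#include <immintrin.h>

void LoadScalesSketch(const float *scales) {
  // Problematic pattern from the commit message: _mm256_loadu_ps returns
  // __m256 (8 floats), so storing it in __m256d (4 doubles) mixes types.
  // __m256d scale01234567 = _mm256_loadu_ps(scales);

  // Consistent float handling: keep the __m256 type for a float load...
  __m256 scale_f = _mm256_loadu_ps(scales);
  // ...or, if doubles are really needed, widen four floats explicitly.
  __m256d scale_d = _mm256_cvtps_pd(_mm_loadu_ps(scales));
  (void)scale_f;
  (void)scale_d;
}
```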
I was thinking about a template parameter TFloat, not a compiler definition.
Yes, that should be addressed by future commits. But that also requires separating the classes for fast and best models.
Let me elaborate a bit more. IMO we should aim for real C++ with template metaprogramming. Yes, this will involve more header-only code and other related burdens.
Using more templates is fine if it helps maintaining the code and does not cost performance and memory resources. Here I have a very concrete need: I want to run more training, and LSTM training is currently very slow. So making it faster is important for me. Templates cannot be used for the low level SIMD code, and I think they cannot be used here as long as we don't have different objects for fast and best LSTM. The changes suggested here don't prevent using template code later to make executables which support both float and double.
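For comparison, the template-parameter approach being discussed could look roughly like this (a hypothetical sketch, not part of this pull request), which would let both precisions coexist in one binary at the cost of instantiating the higher-level classes twice:

```cpp
#include <numeric>

// Hypothetical: the precision becomes a template parameter instead of a
// global compile-time switch like FAST_FLOAT.
template <typename FloatT>
FloatT DotProductNativeT(const FloatT *u, const FloatT *v, int n) {
  return std::inner_product(u, u + n, v, FloatT(0));
}

// Usage: DotProductNativeT<float>(fu, fv, n);
//        DotProductNativeT<double>(du, dv, n);
```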
@stweil and @GerHobbelt, thank you!
Please add this option to the autotools build.
I also agree that adding the autotools option is a good idea.
With pull request #3510 it will be possible to enable that option with the autotools build.
I tried that in the meantime; it basically worked, but took much more time. The faster calculation of the dot product does not help, because the data for that calculation first has to be transferred from CPU RAM to GPU memory. So either the GPU code must begin at a higher level and transfer scales and weights only once initially from CPU to GPU memory, or one needs a GPU which shares memory with the CPU and avoids the memory transfer overhead.
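The "transfer scales and weights only once" idea could be expressed with an OpenMP target data region that keeps the weights resident in GPU memory across many kernel launches (again a hedged sketch with made-up names, assuming offload support):

```cpp
#include <cstddef>

// Sketch: weights are copied to the device once; each per-call kernel then
// only moves the (much smaller) input vector and the scalar result.
void ManyDotProductsOnGpuSketch(const double *weights, const double *inputs,
                                double *results, int n, int calls) {
  #pragma omp target data map(to : weights[0:n])
  for (int c = 0; c < calls; ++c) {
    const double *in = inputs + static_cast<std::ptrdiff_t>(c) * n;
    double total = 0.0;
    // weights[0:n] is already present from the enclosing data region,
    // so this map clause transfers nothing extra for it.
    #pragma omp target teams distribute parallel for reduction(+ : total) \
        map(to : weights[0:n], in[0:n]) map(tofrom : total)
    for (int k = 0; k < n; ++k) {
      total += weights[k] * in[k];
    }
    results[c] = total;
  }
}
```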
Intel has GPUs which share their memory with the CPU.