
Add TFloat data type for neural network #3486

Merged: 4 commits merged into tesseract-ocr:master on Jul 24, 2021

Conversation

@stweil (Member) commented Jul 5, 2021

Up to now, Tesseract has used double for training and recognition with "best" models.

This commit replaces double with a new data type, TFloat, which is double by default but float if FAST_FLOAT is defined.

Ideally this should allow faster training.

Signed-off-by: Stefan Weil [email protected]
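For readers skimming the thread, the core of the change is a compile-time type alias along these lines (a minimal sketch; the exact header and surrounding code in the PR are not reproduced here):

#if defined(FAST_FLOAT)
using TFloat = float;
#else
using TFloat = double;
#endif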

@stweil stweil marked this pull request as draft July 5, 2021 15:07
@stweil (Member Author) commented Jul 5, 2021

Using float instead of double requires a build with -DFAST_FLOAT.

While floating-point calculations should be faster with float, there is also some overhead because (de-)serialization requires additional conversion steps as long as the traineddata files store double values.
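As a rough illustration of that extra step (a sketch, not the PR's actual (de-)serialization code, and the function name is made up): the traineddata file keeps double, the in-memory type is TFloat, so loading converts element by element and narrows when TFloat is float.

#include <cstddef>
#include <vector>

template <typename TFloat>
std::vector<TFloat> LoadWeights(const double *stored, std::size_t n) {
  std::vector<TFloat> weights(n);
  for (std::size_t i = 0; i < n; ++i) {
    // Narrowing conversion when TFloat == float; a plain copy for double.
    weights[i] = static_cast<TFloat>(stored[i]);
  }
  return weights;
}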

The new code passes most unit tests (intsimdmatrix_test and lstm_test still fail). I am still checking whether it is faster or not. My first test with lstm_squashed_test did not meet expectations: the execution time even increased a little.

Up to now I have only implemented SIMD code for AVX/AVX2.

@egorpugin (Contributor):

What is t in tfloat? type?
Maybe using tfloat = ... (lowercase name)?

@stweil (Member Author) commented Jul 5, 2021

What is t in tfloat?

Tesseract float? We don't have a clear convention on whether data type names should be lower- or uppercase. Uppercase examples: PRIORITY, PROTO_ID, EDGE_INDEX and some more.

@egorpugin (Contributor):

I'd like to rename internal types to either CamelCase or snake_case to keep the Tesseract style consistent.

@stweil (Member Author) commented Jul 5, 2021

I could use TessFloat instead of TFloat ...

@egorpugin (Contributor) commented Jul 5, 2021

Primitive typedefs could just be snake_case?
TessFloat is OK, but as I understand it, this is a local type for the neural networks.
That should also be stated somehow.

@nocun (Contributor) commented Jul 5, 2021

I am really curious what the results are. Personally I do not have any experience with float vs. double speed comparisons; Alexandrescu mentions in his talk that the two should be similar (see slide 22).

There is also the issue of type mismatch in some places (for example simdetect.cpp:97). In some cases the compiler will still operate on double and then convert back to float.

Although, after checking on Compiler Explorer, it looks like some float-with-double comparisons (against constants like -1.0, 0.0, 1.0) do not involve a promotion conversion (at least on x86).
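A tiny example of the promotion issue mentioned above (illustration only, not code from this PR):

float Scale(float x) {
  // 0.1 is a double literal: x is promoted, the multiply runs in double,
  // and the result is narrowed back to float on return.
  return x * 0.1;
  // return x * 0.1f;  // the all-float form avoids the round trip
}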

   for (int k = 0; k < n; ++k) {
     total += u[k] * v[k];
   }
   return total;
 }

 // Compute dot product using std::inner_product.
-static double DotProductStdInnerProduct(const double *u, const double *v, int n) {
+static TFloat DotProductStdInnerProduct(const TFloat *u, const TFloat *v, int n) {
   return std::inner_product(u, u + n, v, 0.0);

Contributor:

Suggested change:
-  return std::inner_product(u, u + n, v, 0.0);
+  return std::inner_product(u, u + n, v, TFloat(0.0));

Member Author (@stweil) replied:

Right, that would greatly improve this function for the float case.
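For context on why the suggested change helps (a sketch, not the final PR code): std::inner_product deduces its accumulator type from the initial value, so a plain 0.0 forces double accumulation even when u and v point to float.

#include <numeric>

template <typename TFloat>
TFloat DotProductStdInnerProduct(const TFloat *u, const TFloat *v, int n) {
  // TFloat(0.0) keeps the accumulation in TFloat instead of double.
  return std::inner_product(u, u + n, v, TFloat(0.0));
}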

@GerHobbelt (Contributor) commented Jul 11, 2021

See also #3490 -- just a rough first run on medium hardware (laptop i7) suggests this shaves off an at-least-noticeable amount. I haven't got time to dive in more right now and run VTune or anything, but even 10% off on the DotProduct would be great, as that's eating upwards of 90% of total product run-time on my rigs. More floats multiplied per clock is a big win then, IMO.

[Edit: ^^^ DotProduct == AVX2 version, of course]

(#3490 has preliminary code for AVX2, etc. fitted for TFloat. When I file a proper pull request and you like it, someone with more AVX/AVX2 savvy should peer-review that hacky edit. 😊 )

@stweil (Member Author) commented Jul 11, 2021

More floats multiplied per clock is a big win then IMO.

Yes, the DotProduct function (for AVX / AVX2 / SSE / NEON / native / ...) uses most of the processor time with best models, especially during training runs.

Ideally, using float (4 bytes) instead of double (8 bytes) could double the speed if the calculation is limited by memory bandwidth and SIMD instructions can handle, for example, 8 float operations instead of 4 double operations in the same time. Getting 10 % off is already a nice improvement, but I still hope to get closer to 50 % (maybe 30 or 40 %) off.
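The arithmetic behind that hope, as a sketch (not the PR's actual AVX kernel; the function name is made up): a 256-bit AVX register holds 8 floats but only 4 doubles, so each vector operation covers twice as many elements.

#include <immintrin.h>

float DotProductAvxFloatSketch(const float *u, const float *v, int n) {
  __m256 sum = _mm256_setzero_ps();  // 8 float lanes per 256-bit register
  int k = 0;
  for (; k + 8 <= n; k += 8) {
    __m256 a = _mm256_loadu_ps(u + k);
    __m256 b = _mm256_loadu_ps(v + k);
    sum = _mm256_add_ps(sum, _mm256_mul_ps(a, b));
  }
  float lanes[8];
  _mm256_storeu_ps(lanes, sum);
  float total = lanes[0] + lanes[1] + lanes[2] + lanes[3] +
                lanes[4] + lanes[5] + lanes[6] + lanes[7];
  for (; k < n; ++k) {
    total += u[k] * v[k];  // scalar tail
  }
  return total;
}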

@amitdo (Collaborator) commented Jul 11, 2021

Once we have this float32 option ready, the next step could be OpenMP offloading, to calculate the dot product on the GPU with just a few lines of code.
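For reference, the kind of OpenMP offloading meant here can look like this (a sketch, untested against the Tesseract build; it needs a compiler with GPU offloading support, and the function name is made up for illustration):

float DotProductOffloadSketch(const float *u, const float *v, int n) {
  float total = 0.0f;
  // Map the input vectors to the device and reduce the partial sums there.
#pragma omp target teams distribute parallel for reduction(+:total) \
    map(to: u[0:n], v[0:n])
  for (int k = 0; k < n; ++k) {
    total += u[k] * v[k];
  }
  return total;
}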

@stweil (Member Author) commented Jul 11, 2021

Do you have code examples for that?

@stweil (Member Author) commented Jul 11, 2021

One issue with the current Tesseract code is the mix of the two implementations "fast" and "best" in the same classes. It results in lots of places which check for int_mode, and the classes waste memory because they contain data for both implementations although only one is used at a time.

Ideally "fast" and "best" should use separate classes, and the "best" classes should be template classes which allow float and double. Then the Tesseract code could support "fast" and two variants of "best" without recompilation.
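A rough sketch of that separation (not actual Tesseract code; class and member names are made up): the "fast" path keeps its integer data, while "best" becomes a template usable with float or double in the same binary.

#include <cstddef>
#include <utility>
#include <vector>

class FastNetwork {
  // int8-based weights and the integer SIMD path would live here.
};

template <typename TFloat>
class BestNetwork {
 public:
  explicit BestNetwork(std::vector<TFloat> weights) : weights_(std::move(weights)) {}

  TFloat Dot(const TFloat *input) const {
    TFloat total = 0;
    for (std::size_t i = 0; i < weights_.size(); ++i) {
      total += weights_[i] * input[i];
    }
    return total;
  }

 private:
  std::vector<TFloat> weights_;
};

// BestNetwork<float> and BestNetwork<double> could then coexist without recompilation.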

@amitdo (Collaborator) commented Jul 11, 2021

Regarding OpenMP offloading, here is one source:
https://developer.ibm.com/technologies/systems/articles/gpu-programming-with-openmp

@GerHobbelt (Contributor):

Re the OpenMP bit: might we be better off using BLAS (or BLIS?) primitives with an OpenCL-capable library, and thus leave the whole micro-tuning of math operations on multiple platforms to someone else?

I know it's not going to be as simple as that, but my current concern is that I, at least, am not running the usual gcc+Linux rigs for this (mainly MSVC+Windows, on not always the quickest hardware), nor are my users. So something like clBLAS (OpenCL+BLAS) looks more promising to me than OpenMP right now (OpenMP is at spec 5.something and MSVC2019 is at... 2.0 (the /openmp compiler flag)? Plus I don't see where MSVC has that alleged "GPU support" in its OpenMP implementation, so... 🤔 🤔

BLAS/BLIS comes with SDOT, DDOT, etc., plus several BLAS libs out there mention AMD+ARM (OpenCL) support, next to the ubiquitous NVIDIA. At least that's the way I read their pages.

Meanwhile OpenMP's #pragma omp promises fewer lines of code than the entire setup and cleanup I see in the usual OpenCL & CUDA samples. BLAS libs look like they take care of this under the hood, but I haven't looked deep enough to be sure yet.

Anyway, this is new ground for me, so disregard or correct me if I'm saying dumb things here. 🍼
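For what it's worth, the BLAS route boils down to a single call (a sketch; assumes a CBLAS implementation such as OpenBLAS or BLIS's CBLAS layer is linked in, and the wrapper name is made up):

#include <cblas.h>

float DotProductBlasSketch(const float *u, const float *v, int n) {
  return cblas_sdot(n, u, 1, v, 1);  // SDOT: single-precision dot product, stride 1
}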

@GerHobbelt (Contributor):

OK, some performance numbers for FAST_FLOAT:

Is FAST_FLOAT useful?

TL;DR: YES.

Here are the results from a set of profile runs with MSVC2019 (latest) + VTune (latest):


DOUBLE / REL 64

Function                        CPU Time        Module  Function (Full) Source File     Start Address
tesseract::DotProductSSE        4.661s  tesseract-unittests.exe tesseract::DotProductSSE(double const *,double const *,int)     dotproductsse.cpp       0x140101b40
DotProductNativeTF              2.224s  tesseract-unittests.exe DotProductNativeTF(double const *,double const *,int)   main.cpp        0x1400270f0
tesseract::DotProductFMA        2.027s  tesseract-unittests.exe tesseract::DotProductFMA(double const *,double const *,int)     dotproductfma.cpp       0x140101860
DotProductDBL                   1.952s  tesseract-unittests.exe DotProductDBL(double const *,double const *,int)        main.cpp        0x1400270a0
tesseract::DotProductNative     1.883s  tesseract-unittests.exe tesseract::DotProductNative(double const *,double const *,int)  dotproduct.cpp  0x140101ca0
tesseract::DotProductAVX        1.660s  tesseract-unittests.exe tesseract::DotProductAVX(double const *,double const *,int)     dotproductavx.cpp       0x1401019d0
_stdio_common_vfprintf          0.597s  ucrtbase.dll    _stdio_common_vfprintf  [Unknown]       0x180022930



FLOAT / REL 64

Function / Call Stack           CPU Time        Module  Function (Full) Source File     Start Address
tesseract::DotProductSSE        2.173s  tesseract-unittests.exe tesseract::DotProductSSE(float const *,float const *,int)       dotproductsse.cpp       0x140101b30
tesseract::DotProductNative     1.817s  tesseract-unittests.exe tesseract::DotProductNative(float const *,float const *,int)    dotproduct.cpp  0x140101c90
DotProductDBL                   1.687s  tesseract-unittests.exe DotProductDBL(float const *,float const *,int)  main.cpp        0x1400271f0
tesseract::DotProductAVX_8      1.244s  tesseract-unittests.exe tesseract::DotProductAVX_8(float const *,float const *,int)     main.cpp        0x1400270a0
DotProductNativeTF              0.991s  tesseract-unittests.exe DotProductNativeTF(float const *,float const *,int)     main.cpp        0x140027260
tesseract::DotProductAVX_A      0.881s  tesseract-unittests.exe tesseract::DotProductAVX_A(float const *,float const *,int)     main.cpp        0x140027120
tesseract::DotProductFMA        0.856s  tesseract-unittests.exe tesseract::DotProductFMA(float const *,float const *,int)       dotproductfma.cpp       0x1401017d0
tesseract::DotProductAVX        0.846s  tesseract-unittests.exe tesseract::DotProductAVX(float const *,float const *,int)       dotproductavx.cpp       0x140101980
_stdio_common_vfprintf          0.626s  ucrtbase.dll    _stdio_common_vfprintf  [Unknown]       0x180022930

What does this say? DotProduct of ANY KIND will be about 50% faster, so @stweil was on the money. 😄

This is using #3490 + fixes (to be posted later, running out of time again - RL), i.e. FAST_FLOAT applied pervasively as coded by the bleeding-edge commits from @stweil, plus all AVX/SSE/FMA dot product calls ported as well (that's in #3490 and will be re-submitted in a cleaner pull request later).

REL64 = Windows 10, 64-bit build, using MSVC with AVX2 and other Release Build optimizations enabled.

For comparison, as an aside, here are my Debug Build numbers: a 64-bit MSVC build, but with all debug flags ON. Much slower, but the same relative speed-up:

FLOAT 

Function / Call Stack           CPU Time        Module  Function (Full) Source File     Start Address
tesseract::DotProductNative     16.672s tesseract-unittests.exe tesseract::DotProductNative(float const *,float const *,int)    dotproduct.cpp  0x14019cc60
DotProductDBL                   15.519s tesseract-unittests.exe DotProductDBL(float const *,float const *,int)  main.cpp        0x1400186e0
DotProductNativeTF              15.406s tesseract-unittests.exe DotProductNativeTF(float const *,float const *,int)     main.cpp        0x140018760
tesseract::DotProductSSE        8.211s  tesseract-unittests.exe tesseract::DotProductSSE(float const *,float const *,int)       dotproductsse.cpp       0x14019c870
tesseract::DotProductAVX        5.571s  tesseract-unittests.exe tesseract::DotProductAVX(float const *,float const *,int)       dotproductavx.cpp       0x14019c4a0
tesseract::DotProductAVX_8      5.224s  tesseract-unittests.exe tesseract::DotProductAVX_8(float const *,float const *,int)     main.cpp        0x140017e70
tesseract::DotProductAVX_A      5.065s  tesseract-unittests.exe tesseract::DotProductAVX_A(float const *,float const *,int)     main.cpp        0x140018140
tesseract::DotProductFMA        4.022s  tesseract-unittests.exe tesseract::DotProductFMA(float const *,float const *,int)       dotproductfma.cpp       0x14019c110



DOUBLE:

Function / Call Stack           CPU Time        Module  Function (Full) Source File     Start Address
DotProductDBL                   30.756s tesseract-unittests.exe DotProductDBL(double const *,double const *,int)        main.cpp        0x140017e70
DotProductNativeTF              17.317s tesseract-unittests.exe DotProductNativeTF(double const *,double const *,int)   main.cpp        0x140017ef0
tesseract::DotProductSSE        15.845s tesseract-unittests.exe tesseract::DotProductSSE(double const *,double const *,int)     dotproductsse.cpp       0x14019bf50
tesseract::DotProductAVX        11.916s tesseract-unittests.exe tesseract::DotProductAVX(double const *,double const *,int)     dotproductavx.cpp       0x14019bbc0
tesseract::DotProductFMA        7.616s  tesseract-unittests.exe tesseract::DotProductFMA(double const *,double const *,int)     dotproductfma.cpp       0x14019b870

What are the AVX_A and AVX_8 versions? And TF vs DBL?

DBL is the Native version, done with all doubles.
TF is a correction to the commits from @stweil as I merged them yesterday: all-float arithmetic in Native (there was still a single double in there for the sum).

Aside: resulting values differ in the 6th-7th digit between the SSE/AVX/FMA and Native versions. The same inaccuracy is visible in double vs. float, so the output changes ever so slightly. (Makes sense given IEEE 754, anyway.)

AVX_A is DotProductAVX optimized for ALIGNED loads vs. unaligned (load vs. loadu, as is already in the original Tesseract SSE code).

AVX_8 is like ~FAST_FLOAT16, i.e. 8 floats per round; regular AVX is 16 floats per round (I took out the FLOAT16 conditional logic there).

Do note that the ALIGNED code is exercised only a few times, relatively speaking, due to the way the test/benchmark code has been hacked together (intentionally! as I was testing the stability of the code changes).

I don't have a good answer for why the FMA code is so much faster than the rest. 🤔 -- At least the validation code in the test assures me that the output is the same value, so it's not due to bugs "helping" that one.

Here's the test/benchmark code snippet used, so you can see what was done, globally speaking:

#include <cmath>    // log
#include <cstdio>   // fprintf
#include <cstdlib>  // exit

bool approx_eq(double a, double b)
{
	auto diff = a - b;
	if (diff == 0.0) return true;
	// take the log of both, as we know all incoming values will be positive,
	// so we can check the precision easily, i.e. at which significant digit
	// did the difference occur? ::
	a = log(a);
	b = log(b);
	diff = a - b;
	return (diff >= -1e-5 && diff <= 1e-5);
}

void testDP(void)
{
#define AMOUNT   655360
#define STEP	 1       // offset jump between tests; determines (indirectly) the number of memory-aligned tests vs. UNaligned tests executed. (STEP=4 would be 16-byte step @ float, i.e. each test aligned)
	static TFloat arr1[AMOUNT];
	static TFloat arr2[AMOUNT];

	for (int i = 0; i < AMOUNT; i++)
	{
		arr1[i] = 2.0 + i * 0.005;
		arr2[i] = 7.0 + i * 0.005;
	}

	TFloat total[10];
	const int size = 8192;     // size of vector(s) to feed the dot-product functions

	// start with aligned storage, then step through, including unaligned storage accesses
	int round = 0;
	for (int step = 0; step + size < AMOUNT; (step += STEP), round++)
	{
		TFloat* p1 = arr1 + step;
		TFloat* p2 = arr2 + step;

		total[0] = DotProductDBL(p1, p2, size);
		const double sollwert = total[0];
		//total[1] = DotProduct(p1, p2, size);
		total[1] = DotProductNative(p1, p2, size);
		total[2] = DotProductNativeTF(p1, p2, size);
		total[3] = DotProductAVX(p1, p2, size);
		total[4] = DotProductFMA(p1, p2, size);
		total[5] = DotProductSSE(p1, p2, size);
#if defined(FAST_FLOAT)
		total[6] = DotProductAVX_8(p1, p2, size);
		total[7] = DotProductAVX_A(p1, p2, size);
		total[8] = sollwert;             // regrettably, my i7 doesn't have AVX512 support.
		//total[8] = DotProductAVX_512(p1, p2, size);
#else
		total[6] = sollwert;
		total[7] = sollwert;
		total[8] = sollwert;
#endif
		// check calculations: this is always active to ensure we haven't got any obvious bugs in there (we got a few! -- fixed!)
		for (int i = 1; i <= 8; i++)
		{
			if (!approx_eq(total[i], sollwert))
				fprintf(stderr, "step %d: total %d = %lf %s\n", step, i, (double)total[i], approx_eq(total[i], sollwert) ? "(PASS)" : "*****FAIL!*****");
		}
		// only load the console output for information once in a while: we're running a benchmark here...
		if (round % 10000 == 0)
		{
			for (int i = 1; i <= 8; i++)
			{
				fprintf(stderr, "step %d: total %d = %lf %s\n", step, i, (double)total[i], approx_eq(total[i], sollwert) ? "(PASS)" : "*****FAIL!*****");
			}
		}
	}

	exit(7);
}

@GerHobbelt (Contributor):

Oh, in case anyone wonders why DBL isn't the same speed everywhere: it's still TFloat vectors coming in, so only the arithmetic is done all-double. Hence the different numbers for the same function in the DOUBLE and FLOAT scenarios.

@GerHobbelt (Contributor):

Results for the latest @stweil commits, built with the latest MSVC2019 and /openmp:experimental (https://devblogs.microsoft.com/cppblog/improved-openmp-support-for-cpp-in-visual-studio/), on a lightly loaded i7-4700 laptop (has AVX2, not AVX512):

[screenshot: VTune CPU time results]

TL;DR: #pragma omp (the DotProductNative() function) is on par with the hand-optimized AVX ones. The rest of the bunch are about the same.

The CPU Time:Self column shows zeros for some entries because the compiler clearly decided to inline a few of these functions, but not all of them. (This also affects your debugging experience in release builds, BTW: the inlined stuff is not steppable.)

Sad discovery right now: VTune doesn't cope well with templated functions: it somehow bundles them together once they are inside another template instance (the benchmark runner in this case):

[screenshot: VTune bundling the template instantiations]

which is not exactly how the benchmark is coded:

// The benchmark runner:
void run_tfloat_benchmark(void) {
  run_tfloat_benchmark<float>();
  run_tfloat_benchmark<double>();
}

Anyway, the difference between the double and float runs is 2:1, and that matches the visual experience of the running benchmark quite well: the progress feedback via fprintf() in the double run feels about twice as slow as in the float part.

GerHobbelt added a commit to GerHobbelt/tesseract that referenced this pull request Jul 13, 2021
…: added that one as another enabling condition since benchmarks have shown MSVC2019's `/openmp:experimental` to deliver. :-) (See tesseract-ocr#3486 benchmark reports on @stweil's DotProductNative() implementation)
@stweil (Member Author) commented Jul 13, 2021

Is FAST_FLOAT useful?

My tests now also show a performance gain. I have run lstm_squashed_test on a number of different platforms. Most runs used the default implementation with double, but selected runs were done with float.

Results:

  • double: fastest systems are AMD EPYC 7502 (32 s), AMD Ryzen 5 (26 s) and Mac mini (24 s)
  • float: fastest systems are AMD EPYC 7502 (28 s), AMD Ryzen 5 (22 s) and Mac mini (18 s)
// Comparison of execution time with different dot product implementations.
// time lstm_squashed_test
// time DOTPRODUCT=accelerate lstm_squashed_test
// time DOTPRODUCT=fma lstm_squashed_test
// time DOTPRODUCT=generic lstm_squashed_test
// time DOTPRODUCT=native lstm_squashed_test
// Results for Apple M1 (clang):
// DotProduct (default)    24 s
// DotProductAccelerate    33 s
// DotProductGeneric       64 s
// DotProductNative        29 s
// Results for Apple M1 (clang, float):
// DotProduct (default)    18 s
// DotProductAccelerate    23 s
// DotProductNative        22 s
// Results for Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (g++ 8.3.0)
// DotProduct (default)    53 s
// DotProductGeneric      105 s
// DotProductNative       139 s
// Results for Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (clang 7)
// DotProduct (default)    47 s
// DotProductGeneric       99 s
// DotProductNative        55 s
// Results for AMD EPYC 7502 (g++ 10.2.1, single threaded):
// DotProduct (default)    36 s
// DotProductGeneric       91 s
// DotProductNative        87 s
// Results for AMD EPYC 7502 (g++ 10.2.1, single threaded, float):
// DotProduct (default)    28 s
// Results for AMD EPYC 7502 (clang 11, single threaded):
// DotProduct (default)    32 s
// DotProductGeneric       76 s
// DotProductNative        37 s
// Results for AMD EPYC 7502 (clang 11, single threaded, float):
// DotProduct (default)    28 s
// DotProductNative        35 s
// Results for AMD Ryzen 5 3600 (clang 12, single threaded):
// DotProduct (default)    26 s
// DotProductFMA           29 s
// Results for AMD Ryzen 5 3600 (clang 12, single threaded, float):
// DotProduct (default)    22 s
// DotProductFMA           25 s
// Results for Macbook Intel Core i5 2.4 GHz (clang):
// DotProduct (default)    60 s
// DotProductAccelerate    78 s
// DotProductGeneric      108 s
// DotProductNative        65 s
// Results for Macbook Intel Core i5 2.4 GHz (clang, float):
// DotProduct (default)    49 s
// Results for Power8 3425 MHz (g++ 10.2.1, single threaded):
// DotProduct (default)   179 s
// DotProductGeneric      179 s
// DotProductNative       130 s
// Results for NVIDIA Jetson Xavier (g++ 9.3.0, single threaded):
// DotProduct (default)   113 s
// DotProductGeneric      180 s
// DotProductNative       123 s
// Results for NVIDIA Jetson Xavier (g++ 9.3.0, single threaded, float):
// DotProduct (default)   69 s
// DotProductNative       96 s
// Results for NVIDIA Jetson Xavier (clang 11, single threaded):
// DotProduct (default)   97 s
// DotProductGeneric      185 s
// DotProductNative       104 s
// Results for NVIDIA Jetson Xavier (clang 11, single threaded, float):
// DotProduct (default)   77 s
// DotProductNative       83 s

@GerHobbelt (Contributor):

@stweil: very nice set of hardware rigs you got there! 😄

And some interesting results!

From far away, it looks like the ranking of the 'Native' variant is highly dependent on which compiler (and its settings, probably) is used.

What I've seen on my dev rig so far is that MSVC2019 (v16.10.3) is quite smart in its optimizations:

  • When I turned on the AVX512 compiler setting, without any other change in the code, I got immediate crashes in the Native DotProduct loops: MSVC apparently had decided this code could do with some AVX512 opcodes (which my old i7-4700 doesn't support, hence the crashes). That makes me wonder about the actual worth of the runtime CPU feature detection code in Tesseract, as it only helps when such compiler optimizations are turned OFF -- otherwise MSVC will decide on its own where to apply AVX/AVX2/AVX512 opcodes if the compiler switches allowed it at build time.

    I don't know what to do about that, short of reverting to a separate "non-optimized" build for "old hardware". 🤔

  • Your change to float cuts time in about half. With AVX on top of that, you get another halving of the cost. That's for DotProduct at least, across the board. Wow!

Asides:

  • I've created and run a benchmark for the matrix code, but it turns out that matrix::Init, i.e. the 2D fill of GENERIC_2D_ARRAY, is eating up 50% or worse no matter what I did (last edit there: SHA-1: 648d5df). The matrix calculation work doesn't want to rise above ~10% cost in that mini-benchmark.
  • Ran a PDF bulktest benchmark tonight in my own mupdf fork, which, as part of the bulk PDF processing, kicks up Tesseract (embedded in the mupdf code) to do OCR work on the rendered pages. That serves as a reasonable check of how well it'll do in actual practice and at user sites. (Design: one PDF per thread; bulktest runs the PDFs sequentially, so usually only 1 core is loaded; that will of course change when running multiple PDFs in parallel, which is what will happen with the end-user app; bulktest is coded sequentially to keep logs & diagnostics as simple as possible.)
    • observed CPU cost share of DotProduct (float+AVX, so BEST): ~20% of total (including lots of PDF text/metadata activities, so it's not just 'tesseract' taking time in there)

      --> is the 50% gain useful in general practice/use? definitely YES!

      --> how much gain do we get? Bit hard to say without also measuring the old style, but given the new cost number, the gain at userland level would be estimated at 15-20% speed gain. (Doubling the time of dpAVX doesn't double its cost %)

    • Tesseract already had OpenMP multithread #pragma macros in the LSTM code, which make it pretty tough for VTune to produce any legible numbers. With /openmp:experimental turned on, the entire bulktest (thus LSTM + DotProduct as the costliest parts, next to mupdf draw) is noticeably faster and loads all 4 cores, though still suboptimally if you care about that stuff. Weird loads are reported in the VCOMP dll, but that looks like the OpenMP support code racking up a lot of NtYieldExecution calls, and that's outside my control as it's MSVC runtime lib internals.

    • When I forcibly turn the OpenMP multithread macros OFF in LSTM, the bulktest run slows down noticeably (so /openmp:experimental in MSVC does indeed distribute the LSTM work across the cores), but the reported numbers become easier to decode: this is where I come to the conclusion that DotProductAVX takes up ~20% of the CPU time. The rest is mostly consumed by LSTM::Forward (~25%) and, the weird one in the bunch now that I turned OFF OpenMP multithreading, the VCOMP NtYieldExecution stuff from MS, which is reported as 50% cost. 😨 Must be something wrong there, or VTune going 🤡.


Here's a snapshot of how it looks with all OpenMP support dialed up to the max in MSVC2019; note the 4 cores at work when there are bits of LSTM to distribute (4 cores = 4 brown graphs!):

[screenshot: VTune timeline showing all 4 cores loaded]

And here's that same report showing the DotProductAVX score and where it is called from (top-down view again -- the cryptic func identifiers are OpenMP/MSVC multithreading at work):

[screenshot: DotProductAVX score and callers, top-down view]


Here's a second bulktest run, now with the OpenMP multithread macros DISABLED in LSTM as mentioned above, so we get a single-core load and easier-to-read numbers. Also visible in the core graphs is the lack of activity there: the red stuff is all VCOMP (Microsoft internals) spinning like mad, waiting for OpenMP work that will never arrive:

[screenshot: single-core bulktest run]

(The rest of the cost (61%!) is all VCOMP spinning / MS runtime internals: "SwitchToThread" in KERNELBASE.dll and children thereof.)


The way I read this is that OpenMP is definitely quite experimental over at Microsoft -- and this is the latest available compiler you're looking at. Granted, on hardware that's a couple of years old, but that VCOMP/NtYield stuff feels a bit yucky IMO.

On the other hand, the optimizations done by hand (AVX, SSE, etc. DotProduct and matrix code, and now float vs. double) look pretty solid, at least on this hardware. That means my users will get faster OCR results -- still slow, but definitely faster! 👍

(🤔 What you also MAY note here is that the matrix calls don't feature anywhere in these benchmarks. Was that a happy optimization once that's now not important anymore, or is the matrix code more involved with training Tesseract? Just asking as I'm a Tesseract n00b. 😄)

@stweil stweil force-pushed the tfloat branch 7 times, most recently from 1703267 to 894e7c4 on July 24, 2021 12:56
@stweil stweil marked this pull request as ready for review July 24, 2021 12:57
@stweil (Member Author) commented Jul 24, 2021

The pull request is now ready for merging. I squashed my own commits and added three commits from @GerHobbelt which fixed my dot product implementations and added another one.

Still missing: documentation, continuous integration, and more unit test code.

I want to add a configure parameter to enable float for builds using autotools. Would --enable-float32 be a good choice? Or are there other suggestions?

How should float builds be made with CMake?

For continuous integration it might be sufficient to run some builds with float and others with double. I think that doubling the number of builds would be too much.

There is also more fine-tuning to be done by updating more code locations from double to float or TFloat, but I don't expect that to improve the performance much further.

stweil and others added 4 commits July 24, 2021 15:14
Up to now Tesseract used double for training and recognition
with "best" models.

This commit replaces double by a new data type TFloat which
is double by default, but float if FAST_FLOAT is defined.

Ideally this should allow faster training.

Signed-off-by: Stefan Weil <[email protected]>
8 float FPs fit in a single 256bit vector (8x32)
(contrasting 4 double FPs: 4*64)

[sw] Format commit message and use float instead of TFloat
There's lines like `__m256d scale01234567 = _mm256_loadu_ps(scales)`,
i.e. loading float vectors into double vector types.

[sw] Formatted commit message
[sw] Formatted commit message
@egorpugin (Contributor):

I was thinking of TFloat as a template parameter, not a compiler definition.

@stweil (Member Author) commented Jul 24, 2021

I was thinking of TFloat as a template parameter, not a compiler definition.

Yes, that should be addressed by future commits. But that also requires separating the classes for fast and best models.

@egorpugin (Contributor):

Let me elaborate a bit more.

IMO we should aim for real C++ with template metaprogramming: not compiler switches, but templated code.

Yes, this will involve more header-only code and other related burden, but I see it as the more powerful and preferable way.

@stweil (Member Author) commented Jul 24, 2021

Using more templates is fine if it helps maintain the code and does not cost performance or memory.

Here I have a very concrete need: I want to run more training, and LSTM training is currently very slow, so making it faster is important for me. Templates cannot be used for the low-level SIMD code, and I think they cannot be used here as long as we don't have different objects for fast and best LSTM.

The changes suggested here don't prevent using template code later to build executables which support both double and float, so users can select which variant they want.

@amitdo amitdo merged commit e538cd7 into tesseract-ocr:master Jul 24, 2021
@stweil stweil deleted the tfloat branch July 25, 2021 05:50
@amitdo (Collaborator) commented Jul 25, 2021

@stweil and @GerHobbelt, thank you!

@amitdo (Collaborator) commented Jul 26, 2021

I want to add a configure parameter to enable float for builds using autotools. Would --enable-float32 be a good choice? Or are there other suggestions?

Please add this option to the autotools build.

@nagadomi (Contributor):

I also agree that adding the autotools option is a good idea.
I used the following command to build and check it, but I'm not sure if this is the right way:

./configure CPPFLAGS=-DFAST_FLOAT --prefix ~/local

@GerHobbelt (Contributor) commented Jul 27, 2021 via email

@stweil (Member Author) commented Jul 29, 2021

Please add this option to the autotools build.

With pull request #3510 it will be possible to use ./configure --enable-float32.

@stweil (Member Author) commented Jul 30, 2021

Once we have this float32 option ready, the next step could be OpenMP offloading, to calculate the dot product on the GPU with just a few lines of code.

Meanwhile I have tried that; it basically worked but took much more time. The faster calculation of the dot product does not help because the data for that calculation first has to be transferred from CPU RAM to GPU memory.

So either the GPU code must start at a higher level and transfer scales and weights only once initially from CPU to GPU memory, or one needs a GPU which shares memory with the CPU and avoids the memory transfer overhead.

@amitdo (Collaborator) commented Aug 2, 2021

Intel has 'DL Boost' AVX512_BF16 in some Xeon CPUs released since 2019. It is also available in desktop CPUs released a few months ago. It supports float32 += bfloat16 * bfloat16 for vectors.
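Illustration only (assumes a CPU and compiler with AVX512_BF16 support; the wrapper name is made up): a single intrinsic multiplies bfloat16 pairs and accumulates the products into float32 lanes, which is the float32 += bfloat16 * bfloat16 pattern mentioned above.

#include <immintrin.h>

__m512 Bf16DotStep(__m512 acc, __m512bh a, __m512bh b) {
  // Each call handles 32 bfloat16 products, accumulated into 16 float32 lanes.
  return _mm512_dpbf16_ps(acc, a, b);
}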
