Optimize the serialization of large primitive floating-point arrays #61
Hi, thanks for the hint, I wasn't aware of this. I can't remember benchmarks where the JDK was faster (probably new in 1.8?). The current implementation writes each value individually:

```java
@Override
public void writeFDouble(double value) throws IOException {
    writePlainLong(Double.doubleToLongBits(value));
}
```

Apparently ObjectOutputStream.doublesToBytes is native and private :(. Another alternative would be to use Unsafe. I will have to benchmark the differences .. I am offline for one week, so stay tuned ;-) .. if you could provide a small benchmark reproducer meanwhile .. even better :)
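Since ObjectOutputStream.doublesToBytes is private and native, a portable way to get a similar bulk conversion is a DoubleBuffer view over a ByteBuffer. This is only a sketch of the idea, not FST's code; the helper name mirrors the JDK method but is otherwise hypothetical, and Unsafe.copyMemory would be another route (at the cost of native byte order).

```java
import java.nio.ByteBuffer;

public class BulkDoubles {
    // Hypothetical helper mirroring the private native
    // ObjectOutputStream.doublesToBytes: one bulk put through a
    // DoubleBuffer view instead of a Double.doubleToLongBits call
    // plus byte shifting per element. Big-endian by default,
    // matching JDK serialization.
    static byte[] doublesToBytes(double[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Double.BYTES);
        buf.asDoubleBuffer().put(values); // bulk copy of all 64-bit patterns
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] out = doublesToBytes(new double[] {1.0, 2.0, 3.0});
        System.out.println(out.length); // 24
    }
}
```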
Thanks for the instant feedback! Here is a small benchmark that I ran on my laptop. The larger the primitive array, the bigger the difference. With an extreme size of 100M doubles, for instance (which requires a few gigabytes of heap to run), I get the following result:
Code below:
Thanx! First analysis: it's not the float/double implementation but object size.

```java
public static void main(String[] args) throws Exception {
    while (true) {
        long start, elapsed;
        double[] obj = new double[10_000];
        // doubles are 8 bytes each, so size the buffer accordingly
        ByteArrayOutputStream bout = new ByteArrayOutputStream(8 * obj.length);

        start = System.nanoTime();
        for (int n = 0; n < 10000; n++) {
            try (ObjectOutputStream oos = new ObjectOutputStream(bout)) {
                oos.writeObject(obj);
            }
            bout.reset();
        }
        elapsed = System.nanoTime() - start;
        System.out.println("STD :" + elapsed / 1000000L + "ms");

        FSTConfiguration conf = FSTConfiguration.createDefaultConfiguration();
        start = System.nanoTime();
        for (int n = 0; n < 10000; n++) {
            bout.write(conf.asByteArray(obj));
            bout.reset();
        }
        elapsed = System.nanoTime() - start;
        System.out.println("FST :" + elapsed / 1000000L + "ms");
    }
}
```

yields:

Note also that writing plain number arrays is a rather unrealistic benchmark (near-zero optimization potential). It's real-world object graphs where FST outperforms other serializers.
I understand that serializing very large arrays is far from the common usage of the library, and that it probably deserves custom code. One example though: my project is a distributed database, and it frequently serializes data that is a mix of standard objects together with arrays of numbers, for instance when nodes exchange partial results as part of a distributed query, or when the transaction log is written to disk or pushed to replicas. Maybe the usage of buffers in FST is not the only cause of the slightly lower performance: in my original test case, if you replace

with

the difference is much smaller than with doubles:
hm .. probably a mixture of both. The internal buffer being written needs to grow and get copied several times. This could be avoided, as I could signal the expected size when detecting a large "blob" array, so the buffer is allocated with the correct final size. I already have an idea on how to add a special path for large primitive arrays ...
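The grow-and-copy overhead described above can be sketched with a toy byte sink. This is not FST's actual buffer code; the names GrowableBuffer, ensureCapacity, and copies are hypothetical, and the point is only that one up-front size hint replaces a whole series of doublings.

```java
import java.util.Arrays;

public class PreallocDemo {
    // Minimal growable byte sink mimicking the repeated
    // grow-and-copy behavior described above (hypothetical names).
    static class GrowableBuffer {
        byte[] data;
        int pos;
        int copies; // counts how many grow-and-copy operations happened

        GrowableBuffer(int initialCapacity) { data = new byte[initialCapacity]; }

        // Size hint: grow once to the expected final size
        void ensureCapacity(int expected) {
            if (expected > data.length) {
                data = Arrays.copyOf(data, expected);
                copies++;
            }
        }

        void write(byte b) {
            if (pos == data.length) {          // overflow: double and copy
                data = Arrays.copyOf(data, data.length * 2);
                copies++;
            }
            data[pos++] = b;
        }
    }

    public static void main(String[] args) {
        int payload = 1 << 20; // 1 MiB, e.g. a large double[] blob

        GrowableBuffer naive = new GrowableBuffer(64);
        for (int i = 0; i < payload; i++) naive.write((byte) i);

        GrowableBuffer hinted = new GrowableBuffer(64);
        hinted.ensureCapacity(payload);        // signal expected size up front
        for (int i = 0; i < payload; i++) hinted.write((byte) i);

        System.out.println("naive copies:  " + naive.copies);  // 14
        System.out.println("hinted copies: " + hinted.copies); // 1
    }
}
```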
argh, just had to do some last-minute investigation .. this check-in (363d820) fixes the issue for double and float arrays. FST is now ~10-15% ahead for your test case. I'll release when back from vacation. It implicitly also optimizes buffer preallocation ("ensureReadahead()"). The fix is not complete, as all arrays should get optimized ..
Well that was fast! I checked out your master branch and confirm that my use case of serializing a large array of doubles now runs a bit faster with FST than with the JDK. Obviously I was wrong to stare at how doubles are split into bytes, but luckily you found the real optimization ;) Looking forward to the upcoming release(s). Enjoy your trip, this is probably one of the best weeks of the year to spend time in Italy...
thx, had a great time in Italy (though it's tough for me to stay away from computers for a whopping full week ;) .. I even tried to program using AIDE on a crappy tablet ..). Thanks for your analysis, report, and fix validation :-).
Awesome library that works out of the box, and then works even better when you tweak it.
In the scope of my personal usage I have found only one weakness: the serialization of large primitive arrays is slower than standard Java serialization, and much slower for double and float.
That's because FST writes the primitive elements one by one, converting them to bytes one at a time, while Java serialization uses esoteric intrinsic native methods such as java.io.ObjectOutputStream.doublesToBytes and java.io.ObjectOutputStream.floatsToBytes.
Could we consider using those intrinsics in FST too?
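To make the one-by-one versus bulk distinction concrete, here is a minimal sketch of both conversion paths. This is not FST's or the JDK's actual code: perElement imitates the per-value doubleToLongBits-plus-shifting approach, and the ByteBuffer bulk path stands in for the native intrinsic; both produce the same big-endian bytes.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ConversionPaths {
    // Per-element path, as the issue describes FST doing:
    // one doubleToLongBits call plus byte-by-byte shifting per value.
    static byte[] perElement(double[] values) {
        byte[] out = new byte[values.length * Double.BYTES];
        int p = 0;
        for (double v : values) {
            long bits = Double.doubleToLongBits(v);
            for (int shift = 56; shift >= 0; shift -= 8) {
                out[p++] = (byte) (bits >>> shift);   // big-endian byte order
            }
        }
        return out;
    }

    // Bulk path: one ByteBuffer bulk put, closer in spirit to the
    // native ObjectOutputStream.doublesToBytes intrinsic.
    static byte[] bulk(double[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Double.BYTES);
        buf.asDoubleBuffer().put(values);
        return buf.array();
    }

    public static void main(String[] args) {
        double[] data = {1.5, -2.25, Double.NaN};
        System.out.println(Arrays.equals(perElement(data), bulk(data))); // true
    }
}
```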