
Swivel: fastprep: use mmap()-ed IO for vocabulary parsing #1108

Closed · wants to merge 1 commit

Conversation

vmarkovtsev
Contributor

@vmarkovtsev vmarkovtsev commented Mar 3, 2017

This speeds up vocabulary parsing 2x for me.

I mmap() 8× PAGE_SIZE at a time. Some care was required at buffer boundaries and at the end of the file.
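The windowed approach can be sketched as follows. This is a standalone illustration, not the PR's actual fastprep code: the whitespace-splitting loop is a stand-in for the real tokenizer, and the window size mirrors the 8× PAGE_SIZE mentioned above. A token straddling a window boundary is carried over in `tail`, which is the "care on buffer boundaries" the PR refers to.

```cpp
#include <algorithm>
#include <cassert>
#include <fcntl.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

// Sketch: read a file through a sliding mmap() window of 8 pages
// instead of mapping it whole. Offsets stay page-aligned because the
// window size is a multiple of the page size.
std::vector<std::string> MmapWords(const char* path) {
  std::vector<std::string> words;
  int fd = open(path, O_RDONLY);
  assert(fd >= 0);
  struct stat st;
  assert(fstat(fd, &st) == 0);
  const size_t page = sysconf(_SC_PAGE_SIZE);
  const size_t kWindow = 8 * page;  // the PR maps 8x PAGE_SIZE at a time
  std::string tail;  // partial word carried over from the previous window
  for (off_t off = 0; off < st.st_size; off += kWindow) {
    size_t len = std::min<off_t>(kWindow, st.st_size - off);
    char* p = static_cast<char*>(
        mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, off));
    assert(p != MAP_FAILED);
    for (size_t i = 0; i < len; ++i) {
      if (p[i] == ' ' || p[i] == '\n') {
        if (!tail.empty()) { words.push_back(tail); tail.clear(); }
      } else {
        tail += p[i];
      }
    }
    munmap(p, len);  // unmap each window before mapping the next
  }
  if (!tail.empty()) words.push_back(tail);  // flush the final token
  close(fd);
  return words;
}
```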

Contributor

@waterson waterson left a comment


I like this approach, especially if it yields a 2x speedup!

But... could you not just mmap the whole file and avoid the complexity of advancing the page size?

Also, if we're going to mmap to compute the vocab, why not also mmap to count co-occurrences?

@vmarkovtsev
Contributor Author

I don't remember exactly how mmap() works; does it require the whole file to fit into memory?

Co-occurrences should be mmap-ed too, of course, but I plan to implement it in a separate PR - one feature at a time.

@vmarkovtsev
Contributor Author

vmarkovtsev commented Mar 3, 2017

It appears that mmap()-ing large files is an anti-pattern for sequential reading:

http://stackoverflow.com/questions/35655915/using-mmap-to-search-large-file-1tb
http://stackoverflow.com/questions/13127855/what-is-the-size-limit-for-mmap
https://www.systutorials.com/qa/1969/maximum-number-of-mmap-ed-ranges-and-how-to-set-it-on-linux
http://stackoverflow.com/questions/7222164/mmap-an-entire-large-file

At the very least, this consumes a lot of virtual memory (resident memory may be OK, but that's another story). I suggest keeping the virtual memory footprint low.

If we had to do random reads, it would make sense.
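If one did map a huge file whole, the sequential access pattern could at least be declared to the kernel via madvise(). A minimal sketch, not part of this PR, showing the idea (the byte-summing loop is just a placeholder workload):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Sketch only: MADV_SEQUENTIAL tells the kernel the mapping is read
// front to back, so it can read ahead aggressively and drop pages
// behind the cursor, keeping the resident set small.
long SumBytesSequential(const char* path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return -1;
  struct stat st;
  if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 0; }
  unsigned char* p = static_cast<unsigned char*>(
      mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
  if (p == MAP_FAILED) { close(fd); return -1; }
  madvise(p, st.st_size, MADV_SEQUENTIAL);  // declare the access pattern
  long sum = 0;
  for (off_t i = 0; i < st.st_size; ++i) sum += p[i];  // placeholder work
  munmap(p, st.st_size);
  close(fd);
  return sum;
}
```

Note this still reserves the full file size of virtual address space, which is the objection raised above; it only helps the resident-memory side.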

@waterson
Contributor

waterson commented Mar 4, 2017

Thanks for the links, Vadim. I read through those but didn't draw the same conclusion: was there a specific discussion about sequentially reading a large file that I might have missed?

Anyhow, I do like the fact that you gained such a large speedup; however, I'm concerned about adding complexity if the problem could be solved more simply. Perhaps the ifstream's default buffer size is just too small? It would be interesting to see if something simple like this (reference) yielded a similar speedup:

std::ifstream fin(input_filename, std::ifstream::ate);
constexpr size_t kBufSize = 8 * 1024 * 1024;
std::unique_ptr<char[]> buf(new char[kBufSize]());
fin.rdbuf()->pubsetbuf(buf.get(), kBufSize);

If so, it would be easy to apply a similar change to the co-occurrence scanning. WDYT?

@vmarkovtsev
Contributor Author

vmarkovtsev commented Mar 4, 2017

Chris, this was the first thing I tried to speed up the parsing. I tested several buffer sizes, from 1M to 10M, and unfortunately did not observe any improvement. Besides, the read buffer reportedly has to be set before the file is opened:

std::ifstream fin;
constexpr size_t kBufSize = 8 * 1024 * 1024;
std::unique_ptr<char[]> buf(new char[kBufSize]());
fin.rdbuf()->pubsetbuf(buf.get(), kBufSize);
fin.open(input_filename, std::ifstream::ate);

Anyway, I tried it both ways.
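The ordering point above can be demonstrated with a small standalone function (an illustration, not fastprep code): on libstdc++, pubsetbuf() only takes effect if it is called before the stream is opened, so the buffer is associated first and open() comes last.

```cpp
#include <fstream>
#include <iterator>
#include <memory>
#include <string>

// Sketch: attach an 8 MB buffer to the stream *before* open(); calling
// pubsetbuf() after the file is open is a no-op on common libraries.
std::string ReadWithBigBuffer(const char* path) {
  constexpr size_t kBufSize = 8 * 1024 * 1024;
  std::unique_ptr<char[]> buf(new char[kBufSize]);
  std::ifstream fin;
  fin.rdbuf()->pubsetbuf(buf.get(), kBufSize);  // must precede open()
  fin.open(path);
  // Drain the whole stream into a string through the enlarged buffer.
  return std::string(std::istreambuf_iterator<char>(fin),
                     std::istreambuf_iterator<char>());
}
```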

Regarding the discussion: in http://stackoverflow.com/questions/35655915/using-mmap-to-search-large-file-1tb, the last answer (though voted -1) says:

To mmap() a large file into memory is totally wrong approach in your case. You just need to process your file step-by-step by chunks with fixed size (something about 1MB). You can use mmap() or just read() it into your internal buffer - that doesn't matter. But putting a whole file into memory is totally overkill if you just want to process it sequentially.

http://stackoverflow.com/questions/13127855/what-is-the-size-limit-for-mmap has the following answer:

There is no restriction on mmap size but would depend on the existing address space used by the given process. But it is highly suggested that you dont mmap to a large contiguous virtual address space.

http://stackoverflow.com/questions/7222164/mmap-an-entire-large-file states that MAP_PRIVATE indeed has a maximum size limit equal to swap + free memory, and shows that it is kind of tricky.

Finally, my dataset is 100GB+, so this is rather critical for me. mmap() with MAP_SHARED may work on my system, but it may fail on some crazy 32-bit OS or in some crazy macOS environment (I had a really bad experience with syscalls on Darwin). I really encourage you to accept this block mmap-ing.

@waterson
Contributor

waterson commented Mar 4, 2017

Gotcha, glad to see that you tried the simple thing first.

Were you able to profile to see what the problem was in that case? For example, was something like tellg or eof bottoming out in a system call? In the interest of keeping things simple (and consistent between reading vocab and counting co-occurrences), it would be great to understand why ifstream is performing badly.

If blockwise mmap is really the best fix, implementing the tokenization consistently (i.e., using the same "GetWord" implementation) between the two phases would be my preference.

@vmarkovtsev vmarkovtsev closed this Mar 6, 2017
@vmarkovtsev vmarkovtsev deleted the patch-5 branch March 6, 2017 17:51
@vmarkovtsev vmarkovtsev restored the patch-5 branch March 6, 2017 17:51
@vmarkovtsev vmarkovtsev reopened this Mar 6, 2017
@vmarkovtsev
Contributor Author

vmarkovtsev commented Mar 6, 2017

This is still WIP, but here are the results.

Initial vocabulary parsing time for a 64 GB dataset: 26:00
After mmap() optimization: 14:30
After hash table reservation: 13:30
After flat_hash_map: 10:50
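The "hash table reservation" step can be sketched like this. The thread does not say which flat_hash_map implementation was used, so std::unordered_map stands in here; the point shown is the same: reserve() sizes the bucket array up front so inserting the expected number of vocabulary entries never triggers a rehash.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch: count word frequencies with the bucket array pre-sized.
// `expected` is a caller-supplied estimate of the vocabulary size.
std::unordered_map<std::string, long> BuildVocab(
    const std::vector<std::string>& words, size_t expected) {
  std::unordered_map<std::string, long> vocab;
  vocab.reserve(expected);  // one allocation instead of repeated rehashing
  for (const auto& w : words) ++vocab[w];
  return vocab;
}
```

A flat (open-addressing) hash map improves on this further by storing entries contiguously, which is friendlier to the cache than the standard container's node-based layout.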

[profiling screenshots attached]

@vmarkovtsev
Contributor Author

Profile before mmap()
fs.gprof.txt

Profile after mmap()
mmap.gprof.txt

Profile after mmap() and reservation
mmap_reserve.gprof.txt

@vmarkovtsev
Contributor Author

This is feature-complete now, aka "works for me". Please test it.

@waterson
Contributor

waterson commented Mar 8, 2017

Very cool. I will pull this and verify... thanks!

@nealwu nealwu added the stat:awaiting review Waiting on review label Mar 14, 2017
@nealwu
Contributor

nealwu commented Mar 14, 2017

Thanks @vmarkovtsev! Looks like this is waiting on your review @waterson.

@waterson
Contributor

Okay, so... sorry it's taken me so long to get back to this.

@vmarkovtsev, it turns out that I can't actually build fastprep.cc anymore: it looks like the build on Ubuntu 14.04 (which I am running) was broken by #1081, so this is a bit hard to verify ATM 😃. I'm going to have to get that figured out first...

@vmarkovtsev
Contributor Author

vmarkovtsev commented Apr 19, 2017

@waterson This is the script I used to build fastprep on vanilla dockerized Ubuntu 14.04:

apt install git g++ curl automake libtool unzip make
git clone --depth 1 https://github.com/google/protobuf.git
cd protobuf
./autogen.sh
./configure --prefix=/usr
make -j4
make install
cd ..
git clone --depth 1 https://github.com/tensorflow/models
cd models/swivel
make -f fastprep.mk

It seems to work.

@bzz

bzz commented May 3, 2017

@waterson I have tried it locally using a container based on @vmarkovtsev's instructions:

FROM ubuntu:14.04

RUN apt-get update
RUN apt-get install -y git g++ curl automake libtool unzip make

RUN git clone --depth 1 https://github.com/google/protobuf.git ;\
   cd protobuf ;\
  ./autogen.sh && ./configure --prefix=/usr ;\
  make -j4 && make install

RUN git clone --depth 1 https://github.com/tensorflow/models ;\
  cd models/swivel ;\
  make -f fastprep.mk

ENTRYPOINT ["models/swivel/fastprep"]
CMD []

docker build -t fastprep .

And the build passes on both master and this branch.

Once built, it can be run with

docker run -it -v <path_to_data>:/data fastprep --output_dir /data/out-tmp --input /data/wiki.txt  

@vmarkovtsev
Contributor Author

Any feedback on this @waterson ?

@waterson
Contributor

waterson commented Jul 5, 2017

I think this is really cool, but I also am not super interested in maintaining the additional complexity here. As I mentioned elsewhere, the primary purpose of this repo is to make it easy for folks to understand and extend the ideas we wrote about...
