
Swivel: fastprep: use mmap()-ed IO for vocabulary parsing #1108

Closed · wants to merge 1 commit

Conversation

vmarkovtsev
Contributor

@vmarkovtsev vmarkovtsev commented Mar 3, 2017

This speeds up vocabulary parsing 2x for me.

I mmap() 8× PAGE_SIZE at a time. Some care was required at buffer boundaries and at the end of the file.
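The windowed approach can be sketched as follows. This is a standalone illustration, not the PR's actual fastprep code: the whitespace-splitting loop is a stand-in for the real tokenizer, and the window size mirrors the 8× PAGE_SIZE mentioned above. A token straddling a window boundary is carried over in `tail`, which is the "care on buffer boundaries" the PR refers to.

```cpp
#include <algorithm>
#include <cassert>
#include <fcntl.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

// Sketch: read a file through a sliding mmap() window of 8 pages
// instead of mapping it whole. Offsets stay page-aligned because the
// window size is a multiple of the page size.
std::vector<std::string> MmapWords(const char* path) {
  std::vector<std::string> words;
  int fd = open(path, O_RDONLY);
  assert(fd >= 0);
  struct stat st;
  assert(fstat(fd, &st) == 0);
  const size_t page = sysconf(_SC_PAGE_SIZE);
  const size_t kWindow = 8 * page;  // the PR maps 8x PAGE_SIZE at a time
  std::string tail;  // partial word carried over from the previous window
  for (off_t off = 0; off < st.st_size; off += kWindow) {
    size_t len = std::min<off_t>(kWindow, st.st_size - off);
    char* p = static_cast<char*>(
        mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, off));
    assert(p != MAP_FAILED);
    for (size_t i = 0; i < len; ++i) {
      if (p[i] == ' ' || p[i] == '\n') {
        if (!tail.empty()) { words.push_back(tail); tail.clear(); }
      } else {
        tail += p[i];
      }
    }
    munmap(p, len);  // unmap each window before mapping the next
  }
  if (!tail.empty()) words.push_back(tail);  // flush the final token
  close(fd);
  return words;
}
```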

Contributor

@waterson waterson left a comment


I like this approach, especially if it yields a 2x speedup!

But... could you not just mmap the whole file and avoid the complexity of advancing the page size?

Also, if we're going to mmap to compute the vocab, why not also mmap to count co-occurrences?

@vmarkovtsev
Contributor Author

I don't remember exactly how mmap() works; does it require the whole file to fit into memory?

Co-occurrences should be mmap-ed too, of course, but I plan to implement it in a separate PR - one feature at a time.

@vmarkovtsev
Contributor Author

vmarkovtsev commented Mar 3, 2017

It appears that mmap()-ing large files is an anti-pattern for sequential reading:

http://stackoverflow.com/questions/35655915/using-mmap-to-search-large-file-1tb
http://stackoverflow.com/questions/13127855/what-is-the-size-limit-for-mmap
https://www.systutorials.com/qa/1969/maximum-number-of-mmap-ed-ranges-and-how-to-set-it-on-linux
http://stackoverflow.com/questions/7222164/mmap-an-entire-large-file

At the very least, this consumes a lot of virtual memory (resident memory may be OK, but that's another story). I suggest keeping the virtual memory footprint low.

If we had to do random reads, it would make sense.
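If one did map a huge file whole, the sequential access pattern could at least be declared to the kernel via madvise(). A minimal sketch, not part of this PR, showing the idea (the byte-summing loop is just a placeholder workload):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Sketch only: MADV_SEQUENTIAL tells the kernel the mapping is read
// front to back, so it can read ahead aggressively and drop pages
// behind the cursor, keeping the resident set small.
long SumBytesSequential(const char* path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return -1;
  struct stat st;
  if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 0; }
  unsigned char* p = static_cast<unsigned char*>(
      mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
  if (p == MAP_FAILED) { close(fd); return -1; }
  madvise(p, st.st_size, MADV_SEQUENTIAL);  // declare the access pattern
  long sum = 0;
  for (off_t i = 0; i < st.st_size; ++i) sum += p[i];  // placeholder work
  munmap(p, st.st_size);
  close(fd);
  return sum;
}
```

Note this still reserves the full file size of virtual address space, which is the objection raised above; it only helps the resident-memory side.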

@waterson
Contributor

waterson commented Mar 4, 2017

Thanks for the links, Vadim. I read through those but didn't draw the same conclusion: was there a specific discussion about sequentially reading a large file that I might have missed?

Anyhow, I do like the fact that you gained such a large speedup; however, I'm concerned about adding complexity if the problem could be solved more simply. Perhaps the ifstream's default buffer size is just too small? It would be interesting to see if something simple like this (reference) yielded a similar speedup:

std::ifstream fin(input_filename, std::ifstream::ate);
constexpr size_t kBufSize = 8 * 1024 * 1024;
std::unique_ptr<char[]> buf(new char[kBufSize]());
fin.rdbuf()->pubsetbuf(buf.get(), kBufSize);

If so, it would be easy to apply a similar change to the co-occurrence scanning. WDYT?

@vmarkovtsev
Contributor Author

vmarkovtsev commented Mar 4, 2017

Chris, this was the first thing I tried to speed up the parsing. I tested several buffer sizes, from 1M to 10M, and unfortunately did not observe any improvement. Besides, the read buffer reportedly has to be set before the file is opened:

std::ifstream fin;
constexpr size_t kBufSize = 8 * 1024 * 1024;
std::unique_ptr<char[]> buf(new char[kBufSize]());
fin.rdbuf()->pubsetbuf(buf.get(), kBufSize);
fin.open(input_filename, std::ifstream::ate);

Anyway, I tried it both ways.
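The ordering point above can be demonstrated with a small standalone function (an illustration, not fastprep code): on libstdc++, pubsetbuf() only takes effect if it is called before the stream is opened, so the buffer is associated first and open() comes last.

```cpp
#include <fstream>
#include <iterator>
#include <memory>
#include <string>

// Sketch: attach an 8 MB buffer to the stream *before* open(); calling
// pubsetbuf() after the file is open is a no-op on common libraries.
std::string ReadWithBigBuffer(const char* path) {
  constexpr size_t kBufSize = 8 * 1024 * 1024;
  std::unique_ptr<char[]> buf(new char[kBufSize]);
  std::ifstream fin;
  fin.rdbuf()->pubsetbuf(buf.get(), kBufSize);  // must precede open()
  fin.open(path);
  // Drain the whole stream into a string through the enlarged buffer.
  return std::string(std::istreambuf_iterator<char>(fin),
                     std::istreambuf_iterator<char>());
}
```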

Regarding the discussion: in http://stackoverflow.com/questions/35655915/using-mmap-to-search-large-file-1tb, the last answer (though voted -1) says:

To mmap() a large file into memory is totally wrong approach in your case. You just need to process your file step-by-step by chunks with fixed size (something about 1MB). You can use mmap() or just read() it into your internal buffer - that doesn't matter. But putting a whole file into memory is totally overkill if you just want to process it sequentially.

http://stackoverflow.com/questions/13127855/what-is-the-size-limit-for-mmap has the following answer:

There is no restriction on mmap size but would depend on the existing address space used by the given process. But it is highly suggested that you dont mmap to a large contiguous virtual address space.

http://stackoverflow.com/questions/7222164/mmap-an-entire-large-file states that MAP_PRIVATE indeed has a maximum size limit equal to swap + free memory, and shows that it is kind of tricky.

Finally, my dataset is 100GB+, so this is rather critical for me. mmap() with MAP_SHARED may work on my system, but it may fail on some crazy 32-bit OS or in some crazy macOS environment (I had a really bad experience with syscalls on Darwin). I really encourage you to accept this block mmap-ing.

@waterson
Contributor

waterson commented Mar 4, 2017

Gotcha, glad to see that you tried the simple thing first.

Were you able to profile to see what the problem was in that case? For example, was something like tellg or eof bottoming out in a system call? In the interest of keeping things simple (and consistent between reading vocab and counting co-occurrences), it would be great to understand why ifstream is performing badly.

If blockwise mmap is really the best fix, implementing the tokenization consistently (i.e., using the same "GetWord" implementation) between the two phases would be my preference.

@vmarkovtsev vmarkovtsev closed this Mar 6, 2017
@vmarkovtsev vmarkovtsev deleted the patch-5 branch March 6, 2017 17:51
@vmarkovtsev vmarkovtsev restored the patch-5 branch March 6, 2017 17:51
@vmarkovtsev vmarkovtsev reopened this Mar 6, 2017
@vmarkovtsev
Contributor Author

vmarkovtsev commented Mar 6, 2017

This is still WIP, but here are the results.

Initial vocabulary parsing time for a 64 GB dataset: 26:00
After mmap() optimization: 14:30
After hash table reservation: 13:30
After flat_hash_map: 10:50
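The "hash table reservation" step can be sketched like this. The thread does not say which flat_hash_map implementation was used, so std::unordered_map stands in here; the point shown is the same: reserve() sizes the bucket array up front so inserting the expected number of vocabulary entries never triggers a rehash.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch: count word frequencies with the bucket array pre-sized.
// `expected` is a caller-supplied estimate of the vocabulary size.
std::unordered_map<std::string, long> BuildVocab(
    const std::vector<std::string>& words, size_t expected) {
  std::unordered_map<std::string, long> vocab;
  vocab.reserve(expected);  // one allocation instead of repeated rehashing
  for (const auto& w : words) ++vocab[w];
  return vocab;
}
```

A flat (open-addressing) hash map improves on this further by storing entries contiguously, which is friendlier to the cache than the standard container's node-based layout.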

[profiling screenshots attached]

@vmarkovtsev
Contributor Author

Profile before mmap()
fs.gprof.txt

Profile after mmap()
mmap.gprof.txt

Profile after mmap() and reservation
mmap_reserve.gprof.txt

@vmarkovtsev
Contributor Author

This is feature-complete now, aka "works for me". Please test it.

@waterson
Contributor

waterson commented Mar 8, 2017

Very cool. I will pull this and verify... thanks!

@nealwu nealwu added the stat:awaiting review Waiting on review label Mar 14, 2017
@nealwu
Contributor

nealwu commented Mar 14, 2017

Thanks @vmarkovtsev! Looks like this is waiting on your review @waterson.

@waterson
Contributor

Okay, so... sorry it's taken me so long to get back to this.

@vmarkovtsev, it turns out that I can't actually build fastprep.cc anymore: it looks like the build on Ubuntu 14.04 (which I am running) was broken by #1081, so this is a bit hard to verify ATM 😃. I'm going to have to get that figured out first...

@vmarkovtsev
Contributor Author

vmarkovtsev commented Apr 19, 2017

@waterson This is the script I used to build fastprep on vanilla dockerized Ubuntu 14.04:

apt install git g++ curl automake libtool unzip make
git clone --depth 1 https://github.com/google/protobuf.git
cd protobuf
./autogen.sh
./configure --prefix=/usr
make -j4
make install
cd ..
git clone --depth 1 https://github.com/tensorflow/models
cd models/swivel
make -f fastprep.mk

It seems to work.

@bzz

bzz commented May 3, 2017

@waterson I have tried it locally using a container based on @vmarkovtsev's instructions:

FROM ubuntu:14.04

RUN apt-get update
RUN apt-get install -y git g++ curl automake libtool unzip make

RUN git clone --depth 1 https://github.com/google/protobuf.git ;\
   cd protobuf ;\
  ./autogen.sh && ./configure --prefix=/usr ;\
  make -j4 && make install

RUN git clone --depth 1 https://github.com/tensorflow/models ;\
  cd models/swivel ;\
  make -f fastprep.mk

ENTRYPOINT ["models/swivel/fastprep"]
CMD []

docker build -t fastprep .

And the build passes on both master and this branch.

Once built, it can be run with

docker run -it -v <path_to_data>:/data fastprep --output_dir /data/out-tmp --input /data/wiki.txt  

@vmarkovtsev
Contributor Author

Any feedback on this @waterson ?

@waterson
Contributor

waterson commented Jul 5, 2017

I think this is really cool, but I also am not super interested in maintaining the additional complexity here. As I mentioned elsewhere, the primary purpose of this repo is to make it easy for folks to understand and extend the ideas we wrote about...
