Swivel: fastprep: use mmap()-ed IO for vocabulary parsing #1108
Conversation
I like this approach, especially if it yields a 2x speedup!
But... could you not just mmap the whole file and avoid the complexity of advancing the page size?
Also, if we're going to mmap to compute the vocab, why not also mmap to count co-occurrences?
I don't remember exactly how mmap() works: does it require the whole file to fit into memory? Co-occurrences should be mmap-ed too, of course, but I plan to implement that in a separate PR, one feature at a time.
It appears that mmap()-ing large files is an anti-pattern for sequential reads: http://stackoverflow.com/questions/35655915/using-mmap-to-search-large-file-1tb At the very least it consumes a lot of virtual memory (resident memory may be fine, but that's another story), and I suggest keeping the virtual memory footprint low. If we had to do random reads, mapping the whole file would make sense.
Thanks for the links, Vadim. I read through those but didn't draw the same conclusion: was there a specific discussion about sequentially reading a large file that I might have missed? Anyhow, I do like the fact that you gained such a large speedup; however, I'm concerned about adding complexity if the problem could be solved more simply. Perhaps the ifstream's default buffer size is just too small? It would be interesting to see if something simple like this (reference) yielded a similar speedup:
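A minimal sketch of that idea, assuming a plain `pubsetbuf` call (the file name, buffer size, and counting loop here are placeholders, not the referenced snippet):

```cpp
#include <fstream>
#include <string>
#include <vector>

int main() {
  // Illustrative 1 MiB read buffer; the right size would need measuring.
  std::vector<char> buf(1 << 20);
  std::ifstream fin;
  // Install the larger buffer before opening the file; many implementations
  // ignore pubsetbuf once I/O has started.
  fin.rdbuf()->pubsetbuf(buf.data(), buf.size());
  fin.open("corpus.txt");
  std::string word;
  long long count = 0;
  while (fin >> word) ++count;  // stand-in for the real vocabulary counting
  return count == 0;
}
```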
If so, it would be easy to apply a similar change to the co-occurrence scanning. WDYT?
Chris, this was the first thing I tried to speed up the parsing. I tested several buffer sizes, from 1M to 10M, and unfortunately did not observe any improvement. Besides, they say that the read buffer has to be set before the file is opened.
Anyway, I tried both ways. Regarding the discussion at http://stackoverflow.com/questions/35655915/using-mmap-to-search-large-file-1tb, see the last answer (even though it was voted -1).
http://stackoverflow.com/questions/13127855/what-is-the-size-limit-for-mmap also has a relevant answer on mmap size limits.
http://stackoverflow.com/questions/7222164/mmap-an-entire-large-file states that a MAP_PRIVATE mapping is indeed limited to roughly swap + free memory, and shows that mapping a whole large file is kind of tricky. Finally, my dataset is 100GB+, so this is rather critical for me. mmap() with MAP_SHARED may work on my system, but it may fail on some crazy 32-bit OS or in some crazy macOS environment (I have had really bad experience with syscalls on Darwin). I really encourage you to accept this block-wise mmap-ing.
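For reference, the whole-file alternative under discussion looks roughly like this (a sketch with a hypothetical helper name; error handling abbreviated). A 100 GB file means a 100 GB virtual mapping up front, even though physical pages are only faulted in as they are touched, and the off_t length will not fit in size_t on a 32-bit OS:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Map an entire file read-only in one call; returns nullptr on failure.
char* MapWholeFile(const char* path, size_t* size_out) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return nullptr;
  struct stat st;
  if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
  char* base = static_cast<char*>(
      mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0));
  close(fd);  // the mapping stays valid after the descriptor is closed
  if (base == MAP_FAILED) return nullptr;
  *size_out = st.st_size;
  return base;
}
```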
Gotcha, glad to see that you tried the simple thing first. Were you able to profile to see what the problem was in that case? For example, was something like tellg or eof bottoming out in a system call? In the interest of keeping things simple (and consistent between reading vocab and counting co-occurrences), it would be great to understand why ifstream is performing badly. If blockwise mmap is really the best fix, implementing the tokenization consistently (i.e., using the same "GetWord" implementation) between the two phases would be my preference.
Profile before mmap()
Profile after mmap()
Profile after mmap() and reservation
This is feature complete now, aka "works for me". Please test it.
Very cool. I will pull this and verify... thanks!
Thanks @vmarkovtsev! Looks like this is waiting on your review, @waterson.
Okay, so... sorry it's taken me so long to get back to this. @vmarkovtsev, it turns out that I can't actually build fastprep.cc anymore: it looks like the build on Ubuntu 14.04 got broken by #1081. Since I'm running Ubuntu 14.04, this is a bit hard to verify ATM 😃. I'm going to have to get that figured out first...
@waterson This is the script I used to build
It seems to work. |
@waterson I have tried it locally using a container based on @vmarkovtsev's instructions:

```sh
echo '
FROM ubuntu:14.04
RUN apt-get update
RUN apt-get install -y git g++ curl automake libtool unzip make
RUN git clone --depth 1 https://github.com/google/protobuf.git ;\
    cd protobuf ;\
    ./autogen.sh && ./configure --prefix=/usr ;\
    make -j4 && make install
RUN git clone --depth 1 https://github.com/tensorflow/models ;\
    cd models/swivel ;\
    make -f fastprep.mk
ENTRYPOINT ["models/swivel/fastprep"]
CMD []
' >> Dockerfile
docker build -t fastprep .
```

The build passes on both master and this branch. Once built, the image can be run via its `models/swivel/fastprep` entrypoint.
Any feedback on this, @waterson?
I think this is really cool, but I also am not super interested in maintaining the additional complexity here. As I mentioned elsewhere, the primary purpose of this repo is to make it easy for folks to understand and extend the ideas we wrote about...
This speeds up vocabulary parsing 2x for me.
I mmap() 8x PAGE_SIZE at a time. Some care was required at buffer boundaries and at the end of the file.
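Roughly, the approach looks like this (a minimal sketch of the idea, not the actual fastprep.cc change; names and the counting stub are placeholders, and error handling is abbreviated). A word that straddles a window boundary is carried over in `word`, and the trailing word is flushed at end of file:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <algorithm>
#include <cctype>
#include <string>

int main(int argc, char* argv[]) {
  if (argc != 2) return 1;
  int fd = open(argv[1], O_RDONLY);
  if (fd < 0) return 1;
  struct stat st;
  if (fstat(fd, &st) != 0) return 1;
  const size_t page = sysconf(_SC_PAGESIZE);
  const size_t window = 8 * page;  // map 8 pages at a time
  off_t offset = 0;                // stays page-aligned, as mmap requires
  std::string word;                // carries a word across window boundaries
  while (offset < st.st_size) {
    size_t len =
        std::min<size_t>(window, static_cast<size_t>(st.st_size - offset));
    char* p = static_cast<char*>(
        mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, offset));
    if (p == MAP_FAILED) return 1;
    for (size_t i = 0; i < len; ++i) {
      if (std::isspace(static_cast<unsigned char>(p[i]))) {
        if (!word.empty()) { /* count the word */ word.clear(); }
      } else {
        word += p[i];  // may continue into the next window
      }
    }
    munmap(p, len);
    offset += len;
  }
  if (!word.empty()) { /* count the final word at end of file */ }
  close(fd);
  return 0;
}
```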