Trouble building swivel/fastprep.cc on Mac? #127
To be clear, I have bazel and all the other dependencies installed. I ran the first command, and it returned without error. My bazel-bin/tensorflow/tools/pip_package/build_pip_package.runfiles/ directory exists; it just does not include bazel-bin/tensorflow/tools/pip_package/build_pip_package.runfiles/tensorflow. I'm working off of these instructions. Is it possible that something isn't synced?
Hey there! Apologies for being so slow to respond! :-/ I have not tried building TF from source on Mac; let me take a look at that today and I'll update the issue.
And, FWIW, yes, you can definitely do this! Basically you just need to rearrange GloVe's co-occurrence matrix into tf.Examples that follow the format that prep.py produces.
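As a rough illustration of that rearrangement, here is a minimal sketch. The sharding scheme below (token k goes to shard k % num_shards at local offset k // num_shards) is my reading of prep.py's layout, not a confirmed spec, and the final tf.Example feature names would need to match whatever prep.py actually emits:

```python
from collections import defaultdict

def shard_cooccurrences(triples, vocab_size, shard_size):
    """Bucket GloVe (row_id, col_id, count) co-occurrence triples into
    Swivel-style submatrix shards.

    Assumes vocab_size is a multiple of shard_size and that token ids are
    assigned by descending frequency, as prep.py's vocabulary is.
    """
    num_shards = vocab_size // shard_size
    shards = defaultdict(list)
    for row, col, count in triples:
        key = (row % num_shards, col % num_shards)      # which shard
        local = (row // num_shards, col // num_shards)  # offset inside it
        shards[key].append((local[0], local[1], count))
    return shards
```

Each `shards[(i, j)]` list would then be serialized as one tf.Example (sparse local row/col indices plus float count values) in the same layout prep.py writes to disk.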
[A commit was referenced here: "…. (Roundabout way to address issue tensorflow#127)."]
Okay, I just tried building from source on Mac and... I at least got past this part. Here is (more or less) what I did. YMMV because it's hard to know how your system is configured. I followed the OS/X "install from source" directions. This involved installing JDK8 and Homebrew. Then I used Homebrew to install bazel and swig, and pip to install six, numpy, and wheel. Then I pulled the TF source repo, configured it, and built the pip package. Then I built and installed protocol buffers. Finally, I was able to compile fastprep.cc, but failed to link it. I suspect that I need to do some libtool magic to make that work.
LMK if you can get that far; if not, it might make sense to open another issue on the core TF project. In the meantime, there is certainly a real bug here in that fastprep.cc can't link. Also, for what it's worth, note that TF may not support GPUs on OS/X. If that's the case, you may find that Swivel is not going to provide much of a speedup over GloVe training.
Okay, by changing the linked library name to its ".pic" variant, I can get fastprep.cc to compile and link. So I need to figure out why TF on OS/X names its libraries with ".pic" and see if there's a clean way to have fastprep.mk choose the right one.
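One way fastprep.mk could choose the right archive automatically is a wildcard fallback. This is a hypothetical Makefile fragment; the variable names and the library name shown are illustrative, not the actual contents of fastprep.mk:

```make
# Prefer the ".pic" archive if it exists (as TF's OS X build produces),
# otherwise fall back to the plain name (as on Linux).
TF_PROTO_LIB := $(firstword \
    $(wildcard $(TF_DIR)/libprotobuf.pic.a) \
    $(TF_DIR)/libprotobuf.a)
```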
Sorry for the late reply @waterson. Your change got fastprep.cc to compile and link for me. Thanks!
And thanks @waterson for sharing the script to port GloVe co-occurrences to Swivel. I will try that as well. Not sure how I managed to miss the notification from your response -- gotta check my GitHub settings. I really appreciate it. Excited to try these comparisons!
Hi @waterson, unfortunately even though fastprep compiles, it seems to crash in the co-occurrence phase when writing shards. Have you seen this error before? For comparison, prep.py works -- albeit very, very slowly, so I ran it on a 10% sample of this text (which is itself a sample). In any case, I will also separately search for this protobuf error. Thanks!

```
$ ./fastprep --output_dir /tmp/swivel --input /Volumes/xxx/rawtweets_10pct.txt --shard_size 16384 --min_count 100 --window_size 15
Computing vocabulary: 100.0% complete...56016451 distinct tokens
Generating Swivel co-occurrence data into /tmp/swivel
Shard size: 16384x16384
Vocab size: 98304
Computing co-occurrences: 100.0% complete...done.
writing marginals...
writing shards...
writing shard 1/36
[libprotobuf FATAL google/protobuf/message_lite.cc:68] CHECK failed: (bytes_produced_by_serialization) == (byte_size_before_serialization): Byte size calculation and serialization were inconsistent. This may indicate a bug in protocol buffers or it may be caused by concurrent modification of the message.
libc++abi.dylib: terminating with uncaught exception of type google::protobuf::FatalException: CHECK failed: (bytes_produced_by_serialization) == (byte_size_before_serialization): Byte size calculation and serialization were inconsistent. This may indicate a bug in protocol buffers or it may be caused by concurrent modification of the message.
Abort trap: 6
```
@waterson Any update on this? |
Sorry, no update. I notice that a 16K x 16K shard seems very large; you may be blowing out a protocol buffer limit somewhere, especially if the shards are dense. Have you tried using a smaller shard size, e.g., the default of 4096 x 4096? If you want to point me at the data you're using, I'm happy to try to reproduce the problem. (And, FWIW, to confirm: this is OS/X-specific, correct?)
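To see why 16K shards are suspect, here is a back-of-envelope sketch. The 12-bytes-per-entry figure is a rough assumption (two 4-byte local indices plus one 4-byte float, ignoring varint compression and proto framing), while the 2 GiB figure is protobuf's hard per-message size limit:

```python
PROTO_HARD_LIMIT = 2**31  # protobuf cannot serialize a message of 2 GiB or more

def worst_case_shard_bytes(dim, bytes_per_entry=12):
    """Rough worst-case serialized size of a fully dense dim x dim shard."""
    return dim * dim * bytes_per_entry

print(worst_case_shard_bytes(16384))  # 3221225472 -- over the 2 GiB limit
print(worst_case_shard_bytes(4096))   # 201326592  -- comfortably under it
```

So a dense 16384 x 16384 shard could plausibly overflow a single protobuf message, while the default 4096 x 4096 stays well inside the limit even in the worst case.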
Interesting. Yes, this is OS/X-specific, though I did not try running it on Linux. I would like to get back to this at some point soon, or will ask if a colleague has time for it. We've found that the original, vanilla Word2Vec works pretty well on our large corpus (several billion lines of short text), while GloVe runs into several issues. I don't doubt that Swivel might solve some of these, but that becomes moot if it cannot handle large amounts of data and a vocabulary of 5-10 million words. Please ping me if any solutions emerge or there is an update to address it; I will also keep my eyes out. Thanks for your help, and sorry it did not work this time around, even with your helpful fixes. I definitely got further, though.
@waterson What's the status of this issue? |
So, my understanding is that there seems to be a problem with very large protocol buffers, and that a few pretty easy workarounds exist (e.g., use a smaller shard size or use Linux). I'm happy to accept PRs, but I'm not spending any cycles on it at the moment. If I've misunderstood, please let me know.
I'll close for now due to age, but I'm happy to reopen if new information surfaces. |
Hey guys, love the Swivel library. That said, prep.py is too slow (on a 1B-line text dataset), so I am trying to build the fastprep version.
I get stuck on the "rebuild TensorFlow from source" part. It says to build a pip package, but then I get this error:

```
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
cp: bazel-bin/tensorflow/tools/pip_package/build_pip_package.runfiles/tensorflow: No such file or directory
cp: bazel-bin/tensorflow/tools/pip_package/build_pip_package.runfiles/external: No such file or directory
```
Indeed, those files do not exist. I could not find this error in other people's bug reports. A pointer would be appreciated!
Also... if I have vocab and co-occurrences cached from GloVe, can I skip the prep phase?
Thanks!