
allow nightly failures
Lyndon White committed Nov 24, 2017
1 parent 672993d commit c0def20
Showing 2 changed files with 5 additions and 3 deletions.
.travis.yml: 2 changes (1 addition & 1 deletion)
@@ -13,7 +13,7 @@ git:

## uncomment the following lines to allow failures on nightly julia
## (tests will run but not make your overall status red)
-#matrix:
+matrix:
  allow_failures:
  - julia: nightly
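For reference, the relevant part of a .travis.yml set up this way ends up looking roughly like the sketch below (the listed julia versions are placeholders, not taken from this repository):

```yaml
language: julia
julia:
  - 0.6        # assumed release version under test
  - nightly    # nightly still runs, but its failures won't turn the build red
matrix:
  allow_failures:
    - julia: nightly
```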

README.md: 6 changes (4 additions & 2 deletions)
@@ -71,7 +71,7 @@
probably >10% of the content (Ask Zipf).
you might want to retrieve just the 10 most common verbs that occur in a sentence with 4 or more nouns.
- You only need about 100 bytes of content, but with SubStrings you are keeping all 100MB in memory.
- Another case: if you have trained word embeddings, you need a dictionary whose keys are the set of tokens in the vocabulary.
-  - 10,000,000 (10⁷) words probably only has about 50,000 unique words tops, so you only need to be keeping 5×10⁵ bytes of content, not 10⁸.
+  - 10,000,000 (10⁷) words probably only has about 50,000 unique words (after cleaning out rare words), so you only need to be keeping 5×10⁵ bytes of content, not 10⁸.
In all these cases you are keeping a lot more memory than you need.
If you are smart you will spot this and convert the SubStrings to Strings (as in the sketch below), so the original content can be garbage collected.
But I am not smart, and have made that mistake many times.
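A minimal sketch of the trap, with toy sizes: `split` returns `SubString`s, and every one of them keeps the parent string alive.

```julia
# A toy "document": repeat a sentence to stand in for a ~100MB corpus.
doc = repeat("the quick brown fox jumps over the lazy dog ", 10_000)

tokens = split(doc)           # Vector{SubString{String}}: every token points into `doc`
keep   = tokens[1:9]          # even 9 kept tokens pin the entire parent string in memory

keep   = String.(keep)        # copy them out to plain Strings instead;
tokens = SubString{String}[]  # once nothing references `doc` or its SubStrings,
doc    = ""                   # the big parent string can be garbage collected
```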
@@ -115,7 +115,9 @@
removing the copy in the interning pool will be handled automatically (it is a W

Finally, point **4**.
As I said before.
-The original 10⁸ byte document, with 10⁷ words, probably only has about 50,000 (5×10⁴) unique words.
+The original 10⁸ byte document, with 10⁷ words, probably only has about 50,000 (5×10⁴) unique words after cleaning.
+(Looking at real-world data, the first 10⁷ tokens of wikipedia
+have 3.5×10⁵ unique words, but that is before rare words, numbers etc. are removed.)
At an average of 10 bytes per word you only need to be keeping 5×10⁵ bytes of content,
plus 8 bytes each for pointers/length markers (4×10⁵), plus 1 byte each to null-terminate them all. (Grand total: about 9.5×10⁵ bytes vs the original 10⁸ + 9 bytes.)
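The same back-of-envelope arithmetic, spelled out with the assumed round numbers from the text:

```julia
# Assumed round numbers from the discussion above.
n_unique       = 5 * 10^4   # ~50,000 unique words after cleaning
avg_word_bytes = 10         # assumed average word length
ptr_bytes      = 8          # pointer/length marker per stored string
nul_bytes      = 1          # null terminator per stored string

content  = n_unique * avg_word_bytes            # 5.0e5 bytes of actual characters
overhead = n_unique * (ptr_bytes + nul_bytes)   # 4.5e5 bytes of bookkeeping
total    = content + overhead                   # 9.5e5 bytes kept

original = 10^8 + ptr_bytes + nul_bytes         # one big String: 10⁸ + 9 bytes
```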

