Problems loading glove.840B.300d/glove.840B.300d.txt (GloVe{:en}, 6) #24

Closed
robertfeldt opened this issue Nov 12, 2019 · 5 comments · Fixed by #26

Comments

@robertfeldt

Thanks for the Embeddings.jl package; it is great! I'm building a Word Mover's Distance implementation using Sinkhorn Distance approximation, on top of it.

I also wanted to try one of the larger embeddings, so I ran this on my 2015 MacBook Pro with Julia 1.2:

using Embeddings
e = load_embeddings(GloVe{:en}, 6)

After the long download, I get:

ERROR: ArgumentError: cannot parse "." as Float32
Stacktrace:
 [1] _parse_failure(::Type, ::SubString{String}, ::Int64, ::Int64) at ./parse.jl:372 (repeats 2 times)
 [2] #tryparse_internal#351 at ./parse.jl:368 [inlined]
 [3] tryparse_internal at ./parse.jl:366 [inlined]
 [4] #parse#352 at ./parse.jl:378 [inlined]
 [5] parse at ./parse.jl:378 [inlined]
 [6] _broadcast_getindex_evalf at ./broadcast.jl:625 [inlined]
 [7] _broadcast_getindex at ./broadcast.jl:608 [inlined]
 [8] getindex at ./broadcast.jl:558 [inlined]
 [9] macro expansion at ./broadcast.jl:888 [inlined]
 [10] macro expansion at ./simdloop.jl:77 [inlined]
 [11] copyto! at ./broadcast.jl:887 [inlined]
 [12] copyto! at ./broadcast.jl:842 [inlined]
 [13] copy at ./broadcast.jl:818 [inlined]
 [14] materialize at ./broadcast.jl:798 [inlined]
 [15] (::getfield(Embeddings, Symbol("##5#6")){Set{Any},Array{String,1},Array{Array{Float32,1},1}})(::IOStream) at /Users/feldt/.julia/packages/Embeddings/awjFJ/src/glove.jl:62
 [16] #open#312(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(open), ::getfield(Embeddings, Symbol("##5#6")){Set{Any},Array{String,1},Array{Array{Float32,1},1}}, ::String) at ./iostream.jl:375
 [17] open at ./iostream.jl:373 [inlined]
 [18] _load_embeddings at /Users/feldt/.julia/packages/Embeddings/awjFJ/src/glove.jl:54 [inlined]
 [19] #load_embeddings#12 at /Users/feldt/.julia/packages/Embeddings/awjFJ/src/Embeddings.jl:99 [inlined]
 [20] #load_embeddings at ./none:0 [inlined]
 [21] #load_embeddings#11(::Int64, ::Set{Any}, ::typeof(load_embeddings), ::Type{GloVe{:en}}, ::Int64) at /Users/feldt/.julia/packages/Embeddings/awjFJ/src/Embeddings.jl:91
 [22] load_embeddings(::Type{GloVe{:en}}, ::Int64) at /Users/feldt/.julia/packages/Embeddings/awjFJ/src/Embeddings.jl:90
 [23] top-level scope at REPL[2]:1

Loading, for example, load_embeddings(GloVe{:en}, 4) works without any problem. Has anyone else had a similar problem with glove.840B.300d, and is there a workaround?

I also wonder whether the default Word2Vec embeddings could be loaded lazily, since that could cut the time taken when first executing using Embeddings. It would simplify testing and use in "downstream" packages that only optionally use the embeddings.

@oxinabox
Member

I have reproduced this.

What seems to be happening is that somewhere between the 5*10^4 and 6*10^4 entries in that file there is an invalid line.
I guess it has a float that is just .

julia> e = load_embeddings(GloVe{:en}, 6; max_vocab_size=5*10^4)
Embeddings.EmbeddingTable{Array{Float32,2},Array{String,1}}(Float32[-0.082752 0.012001 … 0.76153 -0.081173; 0.67204 0.20751 … 0.079967 -0.84939; … ; -0.37846 -0.36049 … -0.30543 0.23822; -0.06589 -0.035 … -0.32546 0.32911], [",", ".", "the", "and", "to", "of", "a", "in", "\"", ":"  …  "interpolation",
 "McHenry", "monotonous", "M\$", "wc", "rosewood", "speculators", "Illini", "breather", "full-text"])

julia> e = load_embeddings(GloVe{:en}, 6; max_vocab_size=6*10^4)
ERROR: ArgumentError: cannot parse "." as Float32
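Bisecting with max_vocab_size narrows the range; a direct scan can pinpoint the exact line. The following is a minimal sketch, assuming the loader splits each line on any whitespace and parses every field after the first as Float32. The helper name first_bad_line is hypothetical and not part of Embeddings.jl, and the sample data is made up for illustration.

```julia
# Hypothetical helper (not part of Embeddings.jl): return the 1-based index of the
# first line whose fields after the word do not all parse as Float32, or nothing
# if every line parses cleanly.
function first_bad_line(io::IO)
    for (i, line) in enumerate(eachline(io))
        fields = split(line)  # assumption: split on any whitespace, as the failing loader does
        if any(f -> tryparse(Float32, f) === nothing, fields[2:end])
            return i
        end
    end
    return nothing
end

# Made-up two-line sample; the second word contains NO-BREAK SPACEs (U+00A0).
sample = IOBuffer("the 0.1 0.2\n.\u00a0.\u00a0. 0.3 0.4\n")
bad = first_bad_line(sample)
```

On the real file this could be called as open(first_bad_line, path), where path points at the downloaded glove.840B.300d.txt.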

I guess the solution is to change this line

push!(LL, parse.(Float32, xs[2:end]))

to call some function that does
glove_float_parse(x) = x == "." ? 0f0 : parse(Float32, x)
or maybe

glove_float_parse(x) = something(tryparse(Float32, x), 0f0)

PR would be welcome if you care to investigate further
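As a rough illustration of the tolerant-parsing idea, here is a small sketch. It assumes a zero fallback is acceptable for unparseable fields. Note that tryparse returns nothing on failure (not missing), so something is the function that supplies the default. The sample fields are made up.

```julia
# Sketch of the suggested fallback parser: use 0f0 when a field cannot be parsed.
# tryparse returns `nothing` on failure, and `something` returns its first
# non-nothing argument.
glove_float_parse(x) = something(tryparse(Float32, x), 0f0)

# Made-up fields, including the "." that triggered the error.
fields = ["0.25", ".", "-1.5"]
parsed = glove_float_parse.(fields)
```

A fallback like this silently substitutes zeros, so it hides malformed lines rather than diagnosing them.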

@oxinabox
Member

Right, found it.
There is a line:

. . . -0.1573 -0.29517 0.30453 -0.54773 0.098293 -0.1776 0.21662 0.19261 -0.2110    1 0.53788 -0.047755 0.40675 0.023592 -0.32814 0.046858 0.19367 0.25565 -0.021019     -0.15957 -0.1023 0.20303 -0.043333 0.11618 -0.18486 0.0011948 -0.052301 0.34587     0.052335 0.167

The word is . . .
where the spaces are U+00A0 : NO-BREAK SPACE [NBSP]

So my fix above is wrong
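The effect can be reproduced on a shortened, made-up version of that line. This is only a sketch: the three-number vector is illustrative, and the variable names are not taken from glove.jl. Julia's default split uses isspace, which is true for U+00A0, so the word itself is torn into three "." tokens; splitting on the ASCII space alone keeps it whole.

```julia
# Shortened, made-up GloVe-style line: the word is ". . ." written with
# NO-BREAK SPACEs (U+00A0), followed by ordinary-space-separated floats.
line = ".\u00a0.\u00a0. -0.1573 -0.29517 0.30453"

# Default split treats U+00A0 as whitespace, so the word becomes three tokens
# and the second token "." is then (wrongly) parsed as a number.
naive = split(line)

# Splitting only on the ASCII space keeps the word intact.
fields = split(line, ' ')
word = fields[1]
embedding = parse.(Float32, fields[2:end])
```

With the naive split, naive[2] is "." and parse(Float32, ".") raises the ArgumentError from the report; with the space-only split, every field after the word is a valid number.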

@robertfeldt
Author

It sounds almost easier to "patch" this specific file than to introduce potentially slower parsing overall to handle this?

@oxinabox
Member

Nah the parsing change in #24 actually will speed it up

@robertfeldt
Author

Ok, nice. :)

oxinabox added a commit that referenced this issue Nov 19, 2019
only split on spaces only in Glove, not any whitespace fixes #24