Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize storage in Embeddings constructor. #94

Closed
sebpuetz opened this issue May 27, 2020 · 3 comments
Closed

Normalize storage in Embeddings constructor. #94

sebpuetz opened this issue May 27, 2020 · 3 comments

Comments

@sebpuetz
Copy link
Member

If no norms are passed to Embeddings, normalize embeddings and add norms to the embeddings.

@sebpuetz sebpuetz changed the title Consequently handle norms Handle normalization equally in all loading May 28, 2020
@sebpuetz sebpuetz transferred this issue from finalfusion/ffp May 31, 2020
@sebpuetz sebpuetz changed the title Handle normalization equally in all loading Normalize storage in Embeddings constructor. May 31, 2020
@sebpuetz
Copy link
Member Author

sebpuetz commented Jun 2, 2020

Probably a bad idea, we'd have to treat quantized and array storages differently. What do you think @danieldk ?

@danieldk
Copy link
Member

danieldk commented Jun 3, 2020

Probably a bad idea, we'd have to treat quantized and array storages differently.

Not only quantized storages, but also mmap'ed storages.

finalfusion embeddings should already be l2-normalized according to the spec. We have some historical embeddings with are normalized, but do not have a norms chunk. And it would be senseless to renormalize them again and then store all-1 norms.

It's should be part of the interface that callers only provide l2-normalized embeddings. Ideally, we would check this as an invariant, but the invariant is too expensive (and also goes against mmap'ing embeddings). I guess we could consider validate the initial n embeddings as a sanity check.

@sebpuetz
Copy link
Member Author

sebpuetz commented Jun 3, 2020

Probably a bad idea, we'd have to treat quantized and array storages differently.

Not only quantized storages, but also mmap'ed storages.

For mmap'ed storages it'd at least raise an exception, for quantized storages it just wouldn't do anything to the storage but still store the norms. It'd be reconstructed by virtue of __array__(), normalized and thrown away. The Norms would be stored anyways. So definitely don't want to introduce this kind of unexpected behaviour.

It's should be part of the interface that callers only provide l2-normalized embeddings. Ideally, we would check this as an invariant, but the invariant is too expensive (and also goes against mmap'ing embeddings). I guess we could consider validate the initial n embeddings as a sanity check.

Might add something to the docs, but I don't think going further makes much sense. The storage array is writable and so is the norms array, we can't really enforce normalization unless the arrays are made immutable (I think there's also a flag for in-memory arrays to prevent modification).

@sebpuetz sebpuetz closed this as completed Jun 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants