-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add BSON Binary Subtype 9 support for vector storage and retrieval. #1528
Conversation
- Implement encoding and decoding methods for new BSON binary subtype 9. - Add support for INT8, FLOAT32, and PACKED_BIT dtypes with padding. - Provide API methods for encoding vectors to BSON binary and decoding BSON binary to vectors. JAVA-5544
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just giving this an initial review as part of my review of the spec.
JAVA-5544
JAVA-5544
JAVA-5544
JAVA-5544
# Conflicts: # bson/src/main/org/bson/BsonBinarySubType.java # bson/src/test/unit/org/bson/BsonBinarySubTypeSpecification.groovy
JAVA-5544
JAVA-5544
Rename methods. JAVA-5544
JAVA-5544
@ParameterizedTest | ||
@MethodSource("provideValidVectors") | ||
void shouldStoreAndRetrieveValidVectorWithCodec(final Vector actualVector) { | ||
// Given | ||
Document documentToInsert = new Document(FIELD_VECTOR, actualVector); | ||
documentCollection.insertOne(documentToInsert); | ||
|
||
// when & then | ||
Binary vectorBinary = findExactlyOne(documentCollection) | ||
.get(FIELD_VECTOR, Binary.class); | ||
|
||
Assertions.assertEquals(actualVector, vectorBinary.asVector()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notice that the tests directly send a Vector
object rather than wrapping it in Binary
due to a codec that transforms Vector
to Binary
. However, during retrieval, BinaryCodec
is activated, resulting in a Document
that contains Binary
instead of Vector
. This inconsistency arises because BsonTypeClassMap
maps BsonType.BINARY
to Binary.class
, but does not account for Binary
subtypes similarly.
To ensure consistency between write and read operations, we could modify ContainerCodecHelper
to prioritize the VectorCodec
during decoding, akin to our handling of the UUID subtype when the representation is not UNSPECIFIED. See the implementation here: ContainerCodecHelper
.
However, this enforcement could still result in inconsistencies. For example, if a user initially writes a Vector
wrapped in Binary
, it would not be decoded back as Binary
, but as Vector
. This creates a reverse inconsistency from the current scenario where Vector
is automatically converted to Binary
.
We have several options:
- Retain the current behavior without any changes.
Vector -> raw binary
raw binary - > Binary - Enforce decoding to
Vector
if subtype is 9 (vector) inContainerCodecHelper
.
Vector -> raw binary
raw binary -> Vector - Make the behavior of
VectorCodec
configurable via the constructor, allowing users to determine the behaviour they want. This would necessitate detailed documentation to assist in discovery and understanding.
I suggest we maintain the existing functionality for now. If this becomes an issue for users, we can reevaluate.
UPD: I suggest we enforce decoding to Vector
if the subtype is 9 (vector) in ContainerCodecHelper, similar to how we handle the UUID subtype. This should provide a better user experience. I have updated ContainerHelper
to parse to Vector
if there is a VectorCodec
available. If a VectorCodec is not present, due to the user explicitly providing a registry without one, we will default to using the BinaryCodec.
Note that when a UUID codec
is not present, decoding fails rather than defaulting to Binary
. I think we should not let the decoding fail in the case of UUID
and Vector
codec not being preset and continue decoding to Binary. However, I have not made any changes related to UUID here; we could consider those changes in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that static checks don't pass. Please make sure there are no new warnings when building the project.
bson/src/test/unit/org/bson/codecs/ValueCodecProviderSpecification.groovy
Show resolved
Hide resolved
Co-authored-by: Valentin Kovalenko <[email protected]>
Co-authored-by: Valentin Kovalenko <[email protected]>
Add javadocs. Use state verification instead of behaviour verification in tests. JAVA-5544
JAVA-5544
JAVA-5544
JAVA-5544
JAVA-5544
JAVA-5544
JAVA-5544
JAVA-5544
Description
This PR introduces the implementation of BSON Binary Subtype 9 - Vector, based on the proposed specification for efficiently storing and retrieving densely packed arrays of numbers.
Key Changes:
BSON Binary subtype: Implements a new BSON binary subtype (value "\x09") for vectors to better manage dense numeric arrays.
API enhancements:
Updates include idiomatic APIs that facilitate the translation between arrays of numbers and their BSON binary representations.
Data type support: Supports various data types (dtypes), including
INT8
,FLOAT32
, andPACKED_BIT
.Implementation details
Encoding and decoding: Adds methods to encode and decode vectors according to the specified dtype and padding.
Draft status and review:
The specification this PR depends on is under review. You can review the specification here: BSON Binary Subtype 9 Specification PR. Substantial changes to the specification are not expected, therefore your feedback on this preliminary implementation would be valuable.
JAVA-5544