Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BSON Binary Subtype 9 support for vector storage and retrieval. #1528

Merged
merged 25 commits into from
Oct 30, 2024

Conversation

vbabanin
Copy link
Member

@vbabanin vbabanin commented Oct 10, 2024

Description

This PR introduces the implementation of BSON Binary Subtype 9 - Vector, based on the proposed specification for efficiently storing and retrieving densely packed arrays of numbers.

Key Changes:

BSON Binary subtype: Implements a new BSON binary subtype (value "\x09") for vectors to better manage dense numeric arrays.

API enhancements:
Updates include idiomatic APIs that facilitate the translation between arrays of numbers and their BSON binary representations.

Data type support: Supports various data types (dtypes), including INT8, FLOAT32, and PACKED_BIT.

Implementation details

Encoding and decoding: Adds methods to encode and decode vectors according to the specified dtype and padding.

Draft status and review:

The specification this PR depends on is under review. You can review the specification here: BSON Binary Subtype 9 Specification PR. Substantial changes to the specification are not expected, therefore your feedback on this preliminary implementation would be valuable.

JAVA-5544

- Implement encoding and decoding methods for new BSON binary subtype 9.
- Add support for INT8, FLOAT32, and PACKED_BIT dtypes with padding.
- Provide API methods for encoding vectors to BSON binary and decoding BSON binary to vectors.

JAVA-5544
@vbabanin vbabanin self-assigned this Oct 11, 2024
Copy link
Collaborator

@jyemin jyemin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just giving this an initial review as part of my review of the spec.

bson/src/main/org/bson/Vector.java Outdated Show resolved Hide resolved
bson/src/main/org/bson/Vector.java Outdated Show resolved Hide resolved
bson/src/main/org/bson/Vector.java Outdated Show resolved Hide resolved
bson/src/main/org/bson/Vector.java Outdated Show resolved Hide resolved
bson/src/main/org/bson/Vector.java Outdated Show resolved Hide resolved
bson/src/main/org/bson/BsonBinary.java Show resolved Hide resolved
bson/src/main/org/bson/Vector.java Outdated Show resolved Hide resolved
# Conflicts:
#	bson/src/main/org/bson/BsonBinarySubType.java
#	bson/src/test/unit/org/bson/BsonBinarySubTypeSpecification.groovy
JAVA-5544
@vbabanin vbabanin requested review from jyemin and stIncMale October 17, 2024 23:50
Rename methods.

JAVA-5544
@vbabanin vbabanin marked this pull request as ready for review October 22, 2024 19:44
Comment on lines 164 to 176
@ParameterizedTest
@MethodSource("provideValidVectors")
void shouldStoreAndRetrieveValidVectorWithCodec(final Vector actualVector) {
// Given
Document documentToInsert = new Document(FIELD_VECTOR, actualVector);
documentCollection.insertOne(documentToInsert);

// when & then
Binary vectorBinary = findExactlyOne(documentCollection)
.get(FIELD_VECTOR, Binary.class);

Assertions.assertEquals(actualVector, vectorBinary.asVector());
}
Copy link
Member Author

@vbabanin vbabanin Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice that the tests directly send a Vector object rather than wrapping it in Binary due to a codec that transforms Vector to Binary. However, during retrieval, BinaryCodec is activated, resulting in a Document that contains Binary instead of Vector. This inconsistency arises because BsonTypeClassMap maps BsonType.BINARY to Binary.class, but does not account for Binary subtypes similarly.

To ensure consistency between write and read operations, we could modify ContainerCodecHelper to prioritize the VectorCodec during decoding, akin to our handling of the UUID subtype when the representation is not UNSPECIFIED. See the implementation here: ContainerCodecHelper.

However, this enforcement could still result in inconsistencies. For example, if a user initially writes a Vector wrapped in Binary, it would not be decoded back as Binary, but as Vector. This creates a reverse inconsistency from the current scenario where Vector is automatically converted to Binary.

We have several options:

  • Retain the current behavior without any changes.
    Vector -> raw binary
    raw binary - > Binary
  • Enforce decoding to Vector if subtype is 9 (vector) in ContainerCodecHelper.
    Vector -> raw binary
    raw binary -> Vector
  • Make the behavior of VectorCodec configurable via the constructor, allowing users to determine the behaviour they want. This would necessitate detailed documentation to assist in discovery and understanding.

I suggest we maintain the existing functionality for now. If this becomes an issue for users, we can reevaluate.

UPD: I suggest we enforce decoding to Vector if the subtype is 9 (vector) in ContainerCodecHelper, similar to how we handle the UUID subtype. This should provide a better user experience. I have updated ContainerHelper to parse to Vector if there is a VectorCodec available. If a VectorCodec is not present, due to the user explicitly providing a registry without one, we will default to using the BinaryCodec.

Note that when a UUID codec is not present, decoding fails rather than defaulting to Binary. I think we should not let the decoding fail in the case of UUID and Vector codec not being preset and continue decoding to Binary. However, I have not made any changes related to UUID here; we could consider those changes in a separate PR.

Copy link
Member

@stIncMale stIncMale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that static checks don't pass. Please make sure there are no new warnings when building the project.

bson/src/test/resources/bson-binary-vector/README.md Outdated Show resolved Hide resolved
bson/src/main/org/bson/BsonBinary.java Show resolved Hide resolved
bson/src/main/org/bson/BsonBinary.java Outdated Show resolved Hide resolved
bson/src/main/org/bson/BsonBinary.java Outdated Show resolved Hide resolved
bson/src/test/unit/org/bson/codecs/VectorCodecTest.java Outdated Show resolved Hide resolved
bson/src/test/unit/org/bson/codecs/VectorCodecTest.java Outdated Show resolved Hide resolved
bson/src/test/unit/org/bson/codecs/VectorCodecTest.java Outdated Show resolved Hide resolved
vbabanin and others added 3 commits October 23, 2024 18:49
Co-authored-by: Valentin Kovalenko <[email protected]>
Add javadocs.
Use state verification instead of behaviour verification in tests.

JAVA-5544
@vbabanin vbabanin requested a review from stIncMale October 26, 2024 00:22
@vbabanin vbabanin requested a review from stIncMale October 30, 2024 18:02
@vbabanin vbabanin merged commit b34c0a6 into mongodb:main Oct 30, 2024
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants