Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🚚 Add ability to have multiple indexes with same PK in Dynamo layer #6034

Merged
merged 8 commits into from
Dec 10, 2024

Conversation

rix0rrr
Copy link
Collaborator

@rix0rrr rix0rrr commented Dec 10, 2024

In the past, we used to disallow having multiple indexes with the
same partition key; the reason was that we wanted to be able to
determine from any { "input": "dict" } immediately which index
we should use.

And since sort keys are optional, the simplest way we could do that was
to make sure all PKs were unique: that way an index would always be
unambiguously identified.

This now blocks a use case we have, where we want to have 2 indexes with
the same partition key but with 2 different sort keys (for the purposes of
sorting the results differently).

In this PR, make it possible to have multiple indexes with the same
partition key, as long as the sort keys are different. This now
does pose a problem if you query with only a partition key and there
are two indexes that include that partition key:

indexes = [
    Index('epoch', sort_key='created'), # Index 1
    Index('epoch', sort_key='username'), # Index 2
]

results = table.get_many({ "epoch": 1 })

Should this use index 1 (return users sorted by timestamp) or should
this use index 2 (return users sorted by username?).

I considered picking the first index in the array that applies because
this would be backwards compatible when adding indexes to the end of
the array... but I'm concerned that this is too implicit and could lead
to unexpected results or action-at-a-distance problems: changing the
order of the indexes array could all of a sudden make another part of
the code base incorrect.

So the query above is now an error. When there are multiple indexes
that could apply, you need to explicitly disambiguate between
them by putting a dummy condition on the sort key field you want
to use:

results = table.get_many({
    "epoch": 1,
    "created": UseThisIndex(),
})

This makes it clear which index should be used, at the expense of having
to annotate all existing get_many() calls in the code base when you
add a conflicting index. The maintainability advantages of the extra
explicitness should hopefully win out in the long run.

How to test

No user visible changes, purely a feature change for @jpelay

In the past, we used to disallow having multiple indexes with the
same partition key; the reason was that we wanted to be able to
determine from any `{ "input": "dict" }` immediately which index
we should use.

And since sort keys are optional, the simplest way we could do that was
to make sure all PKs were unique: that way an index would always be
unambiguously identified.

This now blocks a use case we have, where we want to have 2 indexes with
the same partition key but with 2 different sort keys (for the purposes of
sorting the results differently).

In this PR, make it possible to have multiple indexes with the same
partition key, as long as the sort keys are different. This now
does pose a problem if you query with only a partition key and there
are two indexes that include that partition key:

```py
indexes = [
    Index('epoch', sort_key='created'), # Index 1
    Index('epoch', sort_key='username'), # Index 2
]

results = table.get_many({ "epoch": 1 })
```

Should this use index 1 (return users sorted by timestamp) or should
this use index 2 (return users sorted by username?).

I considered picking the first index in the array that applies because
this would be backwards compatible when adding indexes to the end of
the array... but I'm concerned that this is too implicit and could lead
to unexpected results or action-at-a-distance problems: changing the
order of the `indexes` array could all of a sudden make another part of
the code base incorrect.

So the query above is now an error. When there are multiple indexes
that could apply, you need to explicitly disambiguate between
them by putting a dummy condition on the sort key field you want
to use:

```py
results = table.get_many({
    "epoch": 1,
    "created": UseThisIndex(),
})
```

This makes it clear which index should be used, at the expense of having
to annotate all existing `get_many()` calls in the code base when you
add a conflicting index. The maintainability advantages of the extra
explicitness should hopefully win out in the long run.
@rix0rrr rix0rrr requested a review from jpelay December 10, 2024 08:47
Copy link
Member

@jpelay jpelay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Contributor

mergify bot commented Dec 10, 2024

Thank you for contributing! Your pull request is now going on the merge train (choo choo! Do not click update from main anymore, and be sure to allow changes to be pushed to your fork).

Copy link
Contributor

mergify bot commented Dec 10, 2024

Thank you for contributing! Your pull request is now going on the merge train (choo choo! Do not click update from main anymore, and be sure to allow changes to be pushed to your fork).

@mergify mergify bot merged commit baf2815 into main Dec 10, 2024
11 checks passed
@mergify mergify bot deleted the no-condition-on-partition branch December 10, 2024 15:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants