
Remove SerializedValues from public API of scylla crate. #1252

Merged · 13 commits · Mar 3, 2025

Conversation

Lorak-mmk
Collaborator

@Lorak-mmk Lorak-mmk commented Feb 18, 2025

Some APIs on ClusterState still accept SerializedValues instead of impl SerializeRow / &dyn SerializeRow.
This is a leftover from the old serialization API and should be changed. Why? Because there is no easy way for the user to create the SerializedValues.
That means that if a user wants to calculate the token / replicas for some partition key of some table, they need to:

  • Get the Table struct from ClusterState
  • Create new SerializedValues
  • Iterate over partition_key field and retrieve relevant ColumnSpecs from columns field.
  • For each column, call add_value on the SerializedValues and handle the error
  • Pass the result to compute_token / get_endpoints

This is a lot of unnecessary work.
This PR changes those methods to accept &dyn SerializeRow.
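As a toy illustration of the API shape (simplified stand-ins, not the driver's real types or partitioner), the change lets callers hand ordinary values straight to the token-calculation method instead of hand-building a SerializedValues buffer:

```rust
// Toy sketch only: `SerializeRow`, `compute_token`, and the tuple impl
// are illustrative stand-ins, not the scylla crate's actual API.
trait SerializeRow {
    fn serialize(&self, out: &mut Vec<u8>);
}

impl<'a> SerializeRow for (i32, &'a str) {
    fn serialize(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.0.to_be_bytes());
        out.extend_from_slice(self.1.as_bytes());
    }
}

// After this PR: the method takes `&dyn SerializeRow` and serializes
// internally, so the caller never touches `SerializedValues`.
fn compute_token(partition_key: &dyn SerializeRow) -> u64 {
    let mut buf = Vec::new();
    partition_key.serialize(&mut buf);
    // Stand-in for a real partitioner hash (NOT Murmur3).
    buf.iter()
        .fold(0u64, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u64))
}
```

A caller would then write `compute_token(&(42, "pk"))` directly, with no intermediate buffer-building steps.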

As a small side change I removed RowSerializationContext::column_by_name.
Columns need to be serialized in order, so there is no reason to get the column out of order.
This is confirmed by this method being unused.

Creating RowSerializationContext from Table

See the commit "Table: Add Vec of primary key column specs" for an explanation.
I'm not sure that the approach I've taken is the best one, so I'm open to suggestions in this matter.

Fixes: #1152

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@Lorak-mmk Lorak-mmk requested a review from muzarski February 18, 2025 12:46
@github-actions github-actions bot added the semver-checks-breaking cargo-semver-checks reports that this PR introduces breaking API changes label Feb 18, 2025

github-actions bot commented Feb 18, 2025

cargo semver-checks detected some API incompatibilities in this PR.
Checked commit: 473cc84

See the following report for details:

cargo semver-checks output
./scripts/semver-checks.sh --baseline-rev e808345d7ab1e80970a8bd8371e367aec0e5cdbf
+ cargo semver-checks -p scylla -p scylla-cql --baseline-rev e808345d7ab1e80970a8bd8371e367aec0e5cdbf
     Cloning e808345d7ab1e80970a8bd8371e367aec0e5cdbf
    Building scylla v0.15.0 (current)
       Built [  36.511s] (current)
     Parsing scylla v0.15.0 (current)
      Parsed [   0.050s] (current)
    Building scylla v0.15.0 (baseline)
       Built [  34.173s] (baseline)
     Parsing scylla v0.15.0 (baseline)
      Parsed [   0.050s] (baseline)
    Checking scylla v0.15.0 -> v0.15.0 (no change)
     Checked [   0.141s] 127 checks: 125 pass, 2 fail, 0 warn, 0 skip

--- failure constructible_struct_adds_private_field: struct no longer constructible due to new private field ---

Description:
A struct constructible with a struct literal has a new non-public field. It can no longer be constructed using a struct literal outside of its crate.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.39.0/src/lints/constructible_struct_adds_private_field.ron

Failed in:
  field Table.pk_column_specs in /home/runner/work/scylla-rust-driver/scylla-rust-driver/scylla/src/cluster/metadata.rs:208

--- failure function_missing: pub fn removed or renamed ---

Description:
A publicly-visible function cannot be imported by its prior path. A `pub use` may have been removed, or the function itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.39.0/src/lints/function_missing.ron

Failed in:
  function scylla::routing::partitioner::calculate_token_for_partition_key, previously in file /home/runner/work/scylla-rust-driver/scylla-rust-driver/target/semver-checks/git-e808345d7ab1e80970a8bd8371e367aec0e5cdbf/b0de710f68985dca38fbe4c6cffd9ea9bfd6e808/scylla/src/routing/partitioner.rs:352

     Summary semver requires new major version: 2 major and 0 minor checks failed
    Finished [  72.261s] scylla
    Building scylla-cql v0.4.0 (current)
       Built [  10.843s] (current)
     Parsing scylla-cql v0.4.0 (current)
      Parsed [   0.029s] (current)
    Building scylla-cql v0.4.0 (baseline)
       Built [  10.979s] (baseline)
     Parsing scylla-cql v0.4.0 (baseline)
      Parsed [   0.029s] (baseline)
    Checking scylla-cql v0.4.0 -> v0.4.0 (no change)
     Checked [   0.115s] 127 checks: 126 pass, 1 fail, 0 warn, 0 skip

--- failure inherent_method_missing: pub method removed or renamed ---

Description:
A publicly-visible method or associated fn is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.39.0/src/lints/inherent_method_missing.ron

Failed in:
  RowSerializationContext::column_by_name, previously in file /home/runner/work/scylla-rust-driver/scylla-rust-driver/target/semver-checks/git-e808345d7ab1e80970a8bd8371e367aec0e5cdbf/b0de710f68985dca38fbe4c6cffd9ea9bfd6e808/scylla-cql/src/serialize/row.rs:59

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  23.055s] scylla-cql
make: *** [Makefile:64: semver-rev] Error 1

@Lorak-mmk Lorak-mmk force-pushed the cluster-state-api-fixes branch from 355b39c to 37e2859 on February 18, 2025 13:11
@Lorak-mmk Lorak-mmk self-assigned this Feb 18, 2025
@Lorak-mmk Lorak-mmk added this to the 1.0.0 milestone Feb 18, 2025
Contributor

@muzarski muzarski left a comment


I haven't yet thought of an alternative way to solve the RowSerializationContext / Table issue. I'll do it later.

-    partition_key: &SerializedValues,
+    partition_key: &dyn SerializeRow,
 ) -> Result<Token, TokenCalculationError> {
     let partitioner = self
Contributor


We accept &dyn SerializeRow here, but impl SerializeRow in the Session::[query/execute]_* API. We already discussed this briefly at the meeting, but I want to bring attention to it in case it requires further discussion.

Collaborator Author


@wprzytula wdyt, should we change Session and CachingSession that way?
It makes sense especially for do_query_iter, where we can't just serialize at the beginning (like we do for execute_*) because we don't have the context yet. This may otherwise result in a lot of code duplication.
Serializing will be just one virtual call per execution, so it should not affect performance in any way.

Collaborator


I believe virtual calls are cheap enough not to influence the request execution time if done once or a couple of times on the execution path. At the same time, monomorphised code bloat clutters cache lines on the request path.
My conclusion: where possible, let's move to &dyn.
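The trade-off described here — monomorphized code bloat versus a cheap virtual call — can be sketched with a self-contained toy (names are illustrative, not the driver's API):

```rust
// Toy trait standing in for `SerializeRow`.
trait SerializeRow {
    fn serialized_len(&self) -> usize;
}

struct Pair(i32, i64);
impl SerializeRow for Pair {
    fn serialized_len(&self) -> usize {
        4 + 8 // i32 + i64, big-endian wire sizes
    }
}

// Generic version: monomorphized, i.e. a separate copy of this function
// is compiled for every concrete `T` it is instantiated with, which is
// the "code bloat cluttering cache lines" concern above.
fn exec_generic<T: SerializeRow>(row: &T) -> usize {
    row.serialized_len()
}

// Trait-object version: exactly one compiled instance; the call goes
// through the vtable (one virtual call per execution).
fn exec_dyn(row: &dyn SerializeRow) -> usize {
    row.serialized_len()
}
```

Both return the same result for the same value; the difference is purely in how many copies of the function the compiler emits.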

@Lorak-mmk Lorak-mmk force-pushed the cluster-state-api-fixes branch from 37e2859 to 4727d0a on February 19, 2025 17:31
@Lorak-mmk Lorak-mmk requested a review from muzarski February 19, 2025 17:32
@@ -196,6 +197,7 @@ pub struct Table {
     pub partition_key: Vec<String>,
     pub clustering_key: Vec<String>,
     pub partitioner: Option<String>,
+    pub(crate) pk_column_specs: Vec<ColumnSpec<'static>>,
Collaborator


🆗 I believe the taken approach is perfectly acceptable.

💭 What I'm not convinced about is having a HashMap of columns. HashMaps are generally heavyweight structures, and tables never contain so many columns that a HashMap's performance would be better than a Vec's.

🔧 Another concern of mine is the metadata-related structs in metadata.rs, which are pub and have pub fields. For now, at least Column and MaterializedView are such structs and are not #[non_exhaustive], which might be a potential issue in the future.
I strongly suggest pub(crate)'ing all those fields and exposing only getters for them. This leaves maximum freedom for us in the future.

Contributor


💭 What I'm not convinced about is having a HashMap of columns. HashMaps are generally heavyweight structures, and tables never contain so many columns that a HashMap's performance would be better than a Vec's.

This would also allow us to implement cass_table_meta_column

🔧 Another concern of mine is the metadata-related structs in metadata.rs, which are pub and have pub fields. For now, at least Column and MaterializedView are such structs and are not #[non_exhaustive], which might be a potential issue in the future. I strongly suggest pub(crate)'ing all those fields and exposing only getters for them. This leaves maximum freedom for us in the future.

I agree

Collaborator Author


This would also allow us to implement cass_table_meta_column

It is possible (but problematic) to implement it now.
Afaik, the order of the columns is always:

  • partition key columns (in order defined in the key)
  • clustering key columns (in order defined in the key)
  • the rest of the columns, in alphabetical order

cass_table_meta_column could sort the columns according to this and retrieve the nth one. This is of course really inefficient.
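The ordering rule above can be sketched as a hypothetical helper (toy code, not part of the driver):

```rust
// Hypothetical helper implementing the ordering rule described above:
// partition key columns first (in key order), then clustering key
// columns (in key order), then the remaining columns alphabetically.
fn sort_columns(pk: &[&str], ck: &[&str], columns: &mut Vec<String>) {
    let rank = |name: &str| -> (u8, usize, String) {
        if let Some(i) = pk.iter().position(|c| *c == name) {
            (0, i, String::new()) // partition key: ordered by key position
        } else if let Some(i) = ck.iter().position(|c| *c == name) {
            (1, i, String::new()) // clustering key: ordered by key position
        } else {
            (2, 0, name.to_owned()) // everything else: alphabetical
        }
    };
    columns.sort_by_key(|c| rank(c));
}
```

For example, with pk = `["p"]`, ck = `["c"]`, and columns `["b", "c", "a", "p"]`, this yields `["p", "c", "a", "b"]` — which also shows why doing this on every cass_table_meta_column call would be wasteful.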

💭 What I'm not convinced about is having a HashMap of columns. HashMaps are generally heavyweight structures, and tables never contain so many columns that a HashMap's performance would be better than a Vec's.

Are you talking about the performance of retrieving columns by name? Do you know at what size HashMaps start outperforming Vecs (key length needs to be taken into consideration)?
Afaik it is not very unusual to have dozens or even hundreds of columns in a table - Scylla / Cassandra design promotes using denormalized data.

Collaborator Author


🔧 Another concern of mine is the metadata-related structs in metadata.rs, which are pub and have pub fields. For now, at least Column and MaterializedView are such structs and are not #[non_exhaustive], which might be a potential issue in the future.
I strongly suggest pub(crate)'ing all those fields and exposing only getters for them. This leaves maximum freedom for us in the future.

Makes sense.

Collaborator Author


Regarding HashMaps vs Vec, the argument that convinces me more than performance is that we don't expose all the information about the table (we don't expose the column order used by Scylla).
Do you have an idea of what the Table struct should look like?

Collaborator


Why not extend Column with name: String and then just keep a Vec<Column> in Table?
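That suggestion could look roughly like this (a simplified sketch; the field types are stand-ins, not the real metadata.rs definitions):

```rust
// Toy sketch of the suggestion above: store the name inside `Column`
// and keep a `Vec<Column>` in `Table`, preserving server-side order.
struct Column {
    name: String,
    type_name: String, // stand-in for the real CQL type representation
}

struct Table {
    columns: Vec<Column>,
}

impl Table {
    // By-name lookup becomes a linear scan, which stays cheap for the
    // column counts real tables have.
    fn column(&self, name: &str) -> Option<&Column> {
        self.columns.iter().find(|c| c.name == name)
    }

    // By-index lookup (e.g. for cass_table_meta_column) becomes O(1),
    // with no sorting needed.
    fn column_at(&self, i: usize) -> Option<&Column> {
        self.columns.get(i)
    }
}
```

This keeps the server's column order as public information while supporting both lookup styles, which the HashMap layout cannot do.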

@Lorak-mmk Lorak-mmk force-pushed the cluster-state-api-fixes branch 3 times, most recently from f46fb5c to 15c210a on February 26, 2025 16:16
@Lorak-mmk Lorak-mmk requested a review from wprzytula February 26, 2025 16:30
@Lorak-mmk Lorak-mmk force-pushed the cluster-state-api-fixes branch 5 times, most recently from 887dac8 to dbc4a13 on February 28, 2025 19:16
This method is not used anywhere, and it is not useful for serialization
because columns must be serialized in the correct order anyway.

This function only accepts `T` through a reference, so the `T` itself
doesn't need to have a known size.
@Lorak-mmk Lorak-mmk force-pushed the cluster-state-api-fixes branch from 2c76f0c to 73ca2cc on March 1, 2025 16:33
@Lorak-mmk
Collaborator Author

Rebased on main

Lorak-mmk and others added 6 commits March 1, 2025 17:45
In the further commits we will introduce another error condition that
doesn't fail the whole metadata fetch, but only a single keyspace.
HashMap was used before, but it provides no benefit over a Vec here.
Vec will make it easier to verify that we received all of pk and ck
columns, which the next commits will do.

Co-authored-by: Wojciech Przytuła <[email protected]>
Previously it was possible that some positions of `partition_key` and
`clustering_key` would remain unfilled and thus be left as empty strings.

The probability was very low - Scylla would have to return very weird
data - but the possibility was there.

This commit verifies that this is not happening, and returns an error
if it is.

Co-authored-by: Wojciech Przytuła <[email protected]>
Now that we verify there are no gaps in pk and ck, we can explicitly
provide this guarantee to users.
We want to move token calculation methods in ClusterState to accept
SerializeRow. This in turn means we need to serialize those values,
so we have to create RowSerializationContext from the Table struct.

RowSerializationContext currently needs a slice of ColumnSpec.
Table has no such slice. Instead it has a hashmap from column name to
a ColumnSpec, and a Vec of primary key column names.

We have three options:
- Add a field with the required slice to Table struct.
- Modify RowSerializationContext somehow so it can be created from the
  data that we already have in Table. I'm not sure how to do that; ideas
  would be appreciated.
- Hybrid: Modify both Table and RowSerializationContext to make them
  work together.

This commit takes the first approach because it seemed to be the easiest
one. Doing it a different way is of course open for discussion.

Co-authored-by: Wojciech Przytuła <[email protected]>
There is no easy way for the users to create SerializedValues, which
makes the current APIs cumbersome to use. Instead they should accept
`&dyn SerializeRow` and perform serialization based on table metadata.

This change means that those methods can now also return
SerializationError, and also need to handle a table missing from metadata,
preferably also returning an error in this case.
No existing error type fits here, so either we need to extend an
existing one or create a new one. The first idea was extending
PartitionKeyError, but it needs to be convertible to ExecutionError,
in which we don't need those new variants.
For that reason I introduced a new error type for those methods,
called ClusterStateTokenError.
We may sometimes need to benchmark / integration test some private stuff.
There is no good way to do this. The least bad way seems to be to
re-export such private APIs under an unstable feature, and mark tests or
benchmarks that require them as depending on this feature.

It's worth noting that using "required-features" cargo attribute won't
automatically enable the feature - it will just prevent the target from
existing unless the feature is enabled.
It means that after this commit `cargo bench` won't run this benchmark.
Instead you need to run "cargo bench --features unstable-testing" or
"cargo bench --all-features".
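For reference, a hedged sketch of what such a manifest entry might look like (feature and target names assumed, not copied from the PR):

```toml
# Hypothetical Cargo.toml fragment illustrating the mechanism described
# above: the bench target only exists when the feature is enabled, and
# cargo will NOT enable the feature automatically.
[features]
unstable-testing = []

[[bench]]
name = "benchmark"
harness = false
required-features = ["unstable-testing"]
```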
…e flag

This will allow us to unpub the original calculate_token_for_partition_key,
which is the last step towards eliminating SerializedValues from public API.
This function operates on SerializedValues, and so is not type-safe.
Given that we already provide token-calculation API on PreparedStatement
and ClusterState, exposing this method seems redundant.

If it turns out that there are users who need this method and can't use
existing APIs then we can think about providing appropriate safe API,
or making this pub once again as a last resort.
@Lorak-mmk Lorak-mmk force-pushed the cluster-state-api-fixes branch from 73ca2cc to 473cc84 on March 1, 2025 17:17
@Lorak-mmk
Collaborator Author

Incorporated changes proposed by @wprzytula in Lorak-mmk#13

@Lorak-mmk Lorak-mmk requested a review from wprzytula March 1, 2025 17:17
Comment on lines +261 to +278
#[cfg(feature = "unstable-testing")]
pub mod internal_testing {
    use scylla_cql::serialize::row::SerializedValues;

    use crate::routing::partitioner::PartitionerName;
    use crate::routing::Token;
    use crate::statement::prepared::TokenCalculationError;

    pub fn calculate_token_for_partition_key(
        serialized_partition_key_values: &SerializedValues,
        partitioner: &PartitionerName,
    ) -> Result<Token, TokenCalculationError> {
        crate::routing::partitioner::calculate_token_for_partition_key(
            serialized_partition_key_values,
            partitioner,
        )
    }
}
Collaborator


❓ Once we have this, can we deduplicate some test utilities that used to be present for both scylla unit tests and integration tests?

Collaborator Author


Theoretically we could deduplicate them by only having them in the lib (instead of having them in integration tests) under this feature, like we have calculate_token_for_partition_key now.
This is a tradeoff: running integration tests would require passing a flag to cargo test to enable this feature, otherwise the integration tests target will not exist at all.
It's not a problem for CI, nor when using the Makefile, but it makes calling cargo test directly less convenient.

For benchmark doing this seemed ok to me (because of how rarely we run them). For integration tests I'm not sure.

@Lorak-mmk Lorak-mmk merged commit d65e7cf into scylladb:main Mar 3, 2025
12 checks passed
@Lorak-mmk Lorak-mmk mentioned this pull request Mar 4, 2025
8 tasks

Successfully merging this pull request may close these issues.

Remove uses of SerializedValues from public API