Remove SerializedValues from public API of scylla crate #1252

Conversation
See the following report for details: cargo semver-checks output
Force-pushed from 355b39c to 37e2859
I haven't yet thought of an alternative way to solve the `RowSerializationContext` / `Table` issue. I'll do it later.
```diff
-        partition_key: &SerializedValues,
+        partition_key: &dyn SerializeRow,
     ) -> Result<Token, TokenCalculationError> {
         let partitioner = self
```
We accept `&dyn SerializeRow` here, but `impl SerializeRow` in the `Session::[query/execute]_*` API. We already discussed this briefly at the meeting, but I want to bring attention to it in case it requires further discussion.
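For reference, a minimal sketch of the two parameter styles being contrasted; the function names are made up, and only the `SerializeRow` trait path is assumed to match the crate's re-export:

```rust
use scylla::serialize::row::SerializeRow;

// Dynamic dispatch: one compiled copy of the function; serialization costs
// one virtual call per invocation.
fn token_from_dyn(partition_key: &dyn SerializeRow) {
    let _ = partition_key;
}

// Static dispatch: a separate copy is monomorphised for every concrete
// partition-key type used by callers.
fn token_from_impl(partition_key: &impl SerializeRow) {
    let _ = partition_key;
}
```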
@wprzytula wdyt, should we change Session and CachingSession that way?
It makes sense especially for `do_query_iter`, where we can't just serialize at the beginning (like we do for `execute_*`) because we don't have the context yet. This may result in a lot of code duplication.
Serializing would be just one virtual call per execution, so it should not affect performance in any way.
I believe virtual calls are cheap enough not to influence the request execution time if done once or a couple of times on the execution path. At the same time, monomorphised code bloat clutters cache lines on the request path.
My conclusion: where possible, let's move to `&dyn`.
Force-pushed from 37e2859 to 4727d0a
```diff
@@ -196,6 +197,7 @@ pub struct Table {
     pub partition_key: Vec<String>,
     pub clustering_key: Vec<String>,
     pub partitioner: Option<String>,
+    pub(crate) pk_column_specs: Vec<ColumnSpec<'static>>,
```
🆗 I believe the taken approach is perfectly acceptable.
💭 What I'm not convinced about is having a HashMap of columns. HashMaps are generally heavyweight structures, whereas tables never contain so many columns that a HashMap's performance would beat a Vec's.
🔧 Another concern of mine are the metadata-related structs in `metadata.rs` which are `pub` and have `pub` fields. For now, at least `Column` and `MaterializedView` are such structs and are not `#[non_exhaustive]`, which might be a potential issue in the future.
I strongly suggest `pub(crate)`'ing all those fields and exposing only getters for them. This leaves maximum freedom for us in the future.
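As a rough sketch of the suggested pattern (field names here are illustrative, and `ColumnType`/`ColumnKind` are assumed to be in scope):

```rust
pub struct Column {
    // Fields are crate-private, so the layout can change freely later.
    pub(crate) typ: ColumnType,
    pub(crate) kind: ColumnKind,
}

impl Column {
    // Only getters are exposed publicly, which keeps the API stable even if
    // the internal representation changes.
    pub fn typ(&self) -> &ColumnType {
        &self.typ
    }

    pub fn kind(&self) -> &ColumnKind {
        &self.kind
    }
}
```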
> 💭 What I'm not convinced about is having a HashMap of columns. HashMaps are generally heavyweight structures, whereas tables never contain so many columns that a HashMap's performance would beat a Vec's.

This would also allow us to implement `cass_table_meta_column`.

> 🔧 Another concern of mine are the metadata-related structs in `metadata.rs` which are `pub` and have `pub` fields. For now, at least `Column` and `MaterializedView` are such structs and are not `#[non_exhaustive]`, which might be a potential issue in the future. I strongly suggest `pub(crate)`'ing all those fields and exposing only getters for them. This leaves maximum freedom for us in the future.

I agree.
> This would also allow us to implement `cass_table_meta_column`

It is possible (but problematic) to implement it now.
Afaik, the order of the columns is always:
- partition key columns (in the order defined in the key)
- clustering key columns (in the order defined in the key)
- the rest of the columns, in alphabetical order

`cass_table_meta_column` could sort the columns according to this and retrieve the nth one. This is of course really inefficient.
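A hedged sketch of that approach, assuming the `Table` fields shown in the diff above (`partition_key`/`clustering_key` name vectors plus a `columns` map from name to `Column`); the helper name is made up:

```rust
fn column_name_at_index(table: &Table, index: usize) -> Option<&str> {
    // Columns outside the partition and clustering keys go last,
    // in alphabetical order.
    let mut rest: Vec<&str> = table
        .columns
        .keys()
        .map(String::as_str)
        .filter(|name| {
            !table.partition_key.iter().any(|pk| pk == name)
                && !table.clustering_key.iter().any(|ck| ck == name)
        })
        .collect();
    rest.sort_unstable();

    // Partition key columns first (in key order), then clustering key
    // columns (in key order), then the remaining columns.
    table
        .partition_key
        .iter()
        .map(String::as_str)
        .chain(table.clustering_key.iter().map(String::as_str))
        .chain(rest)
        .nth(index)
}
```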
> 💭 What I'm not convinced about is having a HashMap of columns. HashMaps are generally heavyweight structures, whereas tables never contain so many columns that a HashMap's performance would beat a Vec's.

Are you talking about the performance of retrieving columns by name? Do you know at which size hashmaps start outperforming vectors (key length needs to be taken into consideration)?
Afaik it is not very unusual to have dozens or even hundreds of columns in a table - Scylla / Cassandra design promotes using denormalized data.
> 🔧 Another concern of mine are the metadata-related structs in `metadata.rs` which are `pub` and have `pub` fields. For now, at least `Column` and `MaterializedView` are such structs and are not `#[non_exhaustive]`, which might be a potential issue in the future.
> I strongly suggest `pub(crate)`'ing all those fields and exposing only getters for them. This leaves maximum freedom for us in the future.

Makes sense.
Regarding HashMaps vs Vec, the argument that convinces me more than performance is that we don't expose all the information about the table (we don't expose the column order used by Scylla).
Do you have an idea of what the `Table` struct should look like?
Why not extend `Column` with `name: String` and then just keep a `Vec<Column>` in `Table`?
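Something along these lines, as an illustrative sketch (not the current definitions):

```rust
pub struct Column {
    pub(crate) name: String,
    // ... type, kind, etc.
}

pub struct Table {
    // Columns kept in the order reported by the server.
    pub(crate) columns: Vec<Column>,
    // ... partition_key, clustering_key, partitioner, ...
}

impl Table {
    // Lookup by name becomes a linear scan; for realistic column counts this
    // should be competitive with a HashMap lookup.
    pub fn column(&self, name: &str) -> Option<&Column> {
        self.columns.iter().find(|col| col.name == name)
    }
}
```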
Force-pushed from f46fb5c to 15c210a
Force-pushed from 887dac8 to dbc4a13
This method is not used anywhere, and it is not useful for serialization because columns must be serialized in the correct order anyway.
This function only accepts `T` through a reference, so the `T` itself doesn't need to have a known size.
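An illustrative example of the relaxation this commit describes (the function name is made up):

```rust
use scylla::serialize::row::SerializeRow;

// `?Sized` removes the implicit `Sized` bound; only a reference to `T` is
// used, so `T` itself never needs a statically known size.
fn takes_row<T: SerializeRow + ?Sized>(partition_key: &T) {
    let _ = partition_key;
}

fn example(row: &dyn SerializeRow) {
    takes_row(row); // T = dyn SerializeRow, an unsized type
}
```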
Force-pushed from 2c76f0c to 73ca2cc
Rebased on main
In further commits we will introduce another error condition that doesn't fail the whole metadata fetch, but only a single keyspace.
HashMap was used before, but it provides no benefit over a Vec here. Vec will make it easier to verify that we received all of the pk and ck columns, which the next commits will do. Co-authored-by: Wojciech Przytuła <[email protected]>
Previously it was possible that some positions of `partition_key` and `clustering_key` would remain unfilled and thus empty strings. The probability was very low - Scylla would have to return very weird data - but the possibility was there. This commit verifies that this is not happening, and returns an error if it is. Co-authored-by: Wojciech Przytuła <[email protected]>
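A hedged sketch of that check (not the driver's actual code), assuming key columns arrive as (position, name) pairs:

```rust
fn build_key_columns(mut columns: Vec<(usize, String)>) -> Result<Vec<String>, String> {
    // Sort by the position within the key, then require the positions to form
    // a contiguous 0..n range - otherwise some slot would stay unfilled.
    columns.sort_unstable_by_key(|(pos, _)| *pos);
    for (expected, (pos, name)) in columns.iter().enumerate() {
        if *pos != expected {
            return Err(format!(
                "key column {name} has position {pos}, but position {expected} is missing"
            ));
        }
    }
    Ok(columns.into_iter().map(|(_, name)| name).collect())
}
```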
Now that we verify there are no gaps in pk and ck, we can explicitly provide this guarantee to users.
We want to move the token calculation methods in ClusterState to accept SerializeRow. This in turn means we need to serialize those values, so we have to create a RowSerializationContext from the Table struct. RowSerializationContext currently needs a slice of ColumnSpec. Table has no such slice; instead it has a hashmap from column name to a ColumnSpec, and a Vec of primary key column names. We have three options:
- Add a field with the required slice to the Table struct.
- Modify RowSerializationContext somehow so it can be created from the data that we already have in Table. I'm not sure how to do that, ideas would be appreciated.
- Hybrid: modify both Table and RowSerializationContext to make them work together.

This commit takes the first approach because it seemed to be the easiest one. Doing it a different way is of course open for discussion. Co-authored-by: Wojciech Przytuła <[email protected]>
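A hedged sketch of the first option, which this commit takes; the `from_specs` constructor and the helper shown here are illustrative, not necessarily the exact scylla-cql API:

```rust
use scylla_cql::serialize::row::{RowSerializationContext, SerializeRow, SerializedValues};
use scylla_cql::serialize::SerializationError;

fn serialize_partition_key(
    table: &Table, // assumed to carry `pk_column_specs: Vec<ColumnSpec<'static>>`
    partition_key: &impl SerializeRow,
) -> Result<SerializedValues, SerializationError> {
    // The new field gives exactly the slice a RowSerializationContext needs.
    let ctx = RowSerializationContext::from_specs(&table.pk_column_specs);
    SerializedValues::from_serializable(&ctx, partition_key)
}
```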
There is no easy way for users to create SerializedValues, which makes the current APIs cumbersome to use. Instead they should accept `&dyn SerializeRow` and perform serialization based on table metadata. This change means that those methods can now also return SerializationError, and also need to handle a table missing from metadata, preferably also returning an error in this case. No existing error type fits here, so either we need to extend an existing one, or create a new one. The first idea was extending PartitionKeyError, but it needs to be convertible to ExecutionError, in which we don't need those new variants. For that reason I introduced a new error type for those methods, called ClusterStateTokenError.
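Purely as an illustration, such an error type could look roughly like this (variant names are assumptions, and thiserror is used because the driver's other error types use it):

```rust
use scylla_cql::serialize::SerializationError;

use crate::statement::prepared::TokenCalculationError;

#[derive(Debug, thiserror::Error)]
pub enum ClusterStateTokenError {
    #[error("failed to calculate the token: {0}")]
    TokenCalculation(#[from] TokenCalculationError),
    #[error("failed to serialize the partition key: {0}")]
    Serialization(#[from] SerializationError),
    #[error("table {keyspace}.{table} not found in metadata")]
    UnknownTable { keyspace: String, table: String },
}
```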
We may sometimes need to benchmark / integration test some private stuff. There is no good way to do this. The least bad way seems to be to re-export such private APIs under an unstable feature, and mark tests or benchmarks that require them as depending on this feature. It's worth noting that using the "required-features" cargo attribute won't automatically enable the feature - it will just prevent the target from existing unless the feature is enabled. It means that after this commit `cargo bench` won't run this benchmark. Instead you need to run "cargo bench --features unstable-testing" or "cargo bench --all-features".
…e flag

This will allow us to unpub the original calculate_token_for_partition_key, which is the last step towards eliminating SerializedValues from the public API.
This function operates on SerializedValues, and so is not type-safe. Given that we already provide a token-calculation API on PreparedStatement and ClusterState, exposing this method seems redundant. If it turns out that there are users who need this method and can't use the existing APIs, then we can think about providing an appropriate safe API, or making this pub once again as a last resort.
Those are no longer used in any public API.
Force-pushed from 73ca2cc to 473cc84
Incorporated changes proposed by @wprzytula in Lorak-mmk#13
```rust
#[cfg(feature = "unstable-testing")]
pub mod internal_testing {
    use scylla_cql::serialize::row::SerializedValues;

    use crate::routing::partitioner::PartitionerName;
    use crate::routing::Token;
    use crate::statement::prepared::TokenCalculationError;

    pub fn calculate_token_for_partition_key(
        serialized_partition_key_values: &SerializedValues,
        partitioner: &PartitionerName,
    ) -> Result<Token, TokenCalculationError> {
        crate::routing::partitioner::calculate_token_for_partition_key(
            serialized_partition_key_values,
            partitioner,
        )
    }
}
```
❓ Once we have this, can we deduplicate some test utilities that used to be present for both scylla unit tests and integration tests?
Theoretically we could deduplicate them by only having them in the lib (instead of in the integration tests) under this feature, like we have `calculate_token_for_partition_key` now.
This is a tradeoff: running integration tests would require passing a flag to `cargo test` to enable this feature, otherwise the integration tests target will not exist at all.
It's not a problem for CI, nor when using the Makefile, but it makes it less convenient to call `cargo test` directly.
For benchmarks doing this seemed OK to me (because of how rarely we run them). For integration tests I'm not sure.
Some APIs on ClusterState still accept SerializedValues instead of `impl SerializeRow` / `&dyn SerializeRow`. This is a leftover from the old serialization API, and should be changed. Why? Because there is no easy way for the user to create the SerializedValues.
That means that if a user wants to calculate the token / replicas for some partition key of some table, they need to:
- Get the `Table` struct from `ClusterState`
- Look at the `partition_key` field and retrieve the relevant `ColumnSpec`s from the `columns` field
- Call `add_value` on the SerializedValues and handle the error
- Call `compute_token` / `get_endpoints`

This is a lot of unnecessary work.
This PR changes those methods to accept `&dyn SerializeRow`, as sketched below.
As a small side change I removed `RowSerializationContext::column_by_name`. Columns need to be serialized in order, so there is no reason to get a column out of order.
This is confirmed by this method being unused.
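A hedged usage sketch of the post-change API; the method name follows this PR's description, the import paths and keyspace/table names are assumptions, and exact signatures may differ:

```rust
use scylla::cluster::ClusterState;
use scylla::routing::Token;

fn token_for_event(
    state: &ClusterState,
    user_id: &str,
    day: i32,
) -> Result<Token, Box<dyn std::error::Error>> {
    // The partition key is passed as any SerializeRow value (here a tuple);
    // no manual SerializedValues construction is needed anymore.
    let token = state.compute_token("examples_ks", "events", &(user_id, day))?;
    Ok(token)
}
```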
Creating RowSerializationContext from Table
See the commit "Table: Add Vec of primary key column specs" for an explanation.
I'm not sure that the approach I've taken is the best one, so I'm open to suggestions in this matter.
Fixes: #1152
Pre-review checklist
- I added relevant tests for new features and bug fixes.
- I have provided docstrings for the public items that I want to introduce.
- I have adjusted the documentation in ./docs/source/.
- I added Fixes: annotations to the PR description.