[Query Planning Refactor]: SimpleExecute #1008
Conversation
This looks quite good, thank you for putting it together!
Since this is still a draft and we'll want a couple more iterations to finalize everything, it might be good to try splitting it up into a few parts. For example, the following parts have straightforward mutual dependencies and could in principle merge separately, shrinking the code size and making iteration and reviews easier:
- the namedtuple -> dataclass part (see the sketch after this list)
- the "move existing planning functionality to a new module" part
- the new typedefs
- the code that uses the new typedefs to do query planning
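To make the first item concrete, here is a minimal sketch of what the namedtuple -> dataclass conversion might look like, using `QueryStringWithParameters` as the example type; the actual fields and types in this PR may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, NamedTuple


# Before: a NamedTuple, constructible positionally and usable as a plain tuple.
class QueryStringWithParametersTuple(NamedTuple):
    query_string: str
    parameters: Dict[str, Any]


# After: a frozen dataclass keeps the immutability but drops the accidental
# tuple behaviors (indexing, iteration, comparison against plain tuples).
@dataclass(frozen=True)
class QueryStringWithParameters:
    query_string: str
    parameters: Dict[str, Any]
```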
query_and_parameters: QueryStringWithParameters,
schema_info: Union[CommonSchemaInfo, SQLAlchemySchemaInfo],
provider_id: str,
backend_type: Optional[str],
I'm not in love with the fact that this is `Optional`, but I don't have a concrete suggestion in mind. Maybe we can work together to figure something out?
I switched this to use the `BackendType` enum, so I think it makes sense for it to not be optional. It is still a bit weird because, when `schema_info` is of type `SQLAlchemySchemaInfo`, I currently check that `backend_type` is either MSSQL or PostgreSQL, but I don't check that it matches the dialect in the schema info. If I add that check, we could run into the problem that you mentioned:

> a future SQLAlchemy version adds a new dialect for MSSQL that doesn't inherit from `MSDialect` — our code immediately becomes wrong in yet another place

so I'm not sure what the best solution is...
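For discussion purposes, here is a minimal sketch of the kind of strict consistency check being debated; the `BackendType` values and the helper function are assumptions, not code from this PR. Note that it still relies on `isinstance`, so a future MSSQL dialect that doesn't inherit from `MSDialect` would defeat it, which is exactly the concern above.

```python
from enum import Enum, unique

from sqlalchemy.dialects.mssql.base import MSDialect


@unique
class BackendType(Enum):
    MSSQL = "mssql"
    POSTGRESQL = "postgresql"


def ensure_backend_type_matches_dialect(schema_info, backend_type: BackendType) -> None:
    """Raise if the declared backend type contradicts the schema info's dialect."""
    # The stricter (currently unimplemented) check: instead of only verifying
    # that backend_type is one of the SQL values, compare it to the dialect.
    if isinstance(schema_info.dialect, MSDialect) and backend_type is not BackendType.MSSQL:
        raise AssertionError(
            f"Schema info has an MSSQL dialect, but backend_type is {backend_type}."
        )
```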
provider_metadata = ProviderMetadata(
    backend_type=schema_info.dialect.name,
    requires_fold_postprocessing=isinstance(schema_info.dialect, MSDialect),
)
One of the reasons why I'd love `backend_type` to be non-optional and an enum is so that we can make this check stricter. Here's what I'm worried about: imagine a future SQLAlchemy version adds a new dialect for MSSQL that doesn't inherit from `MSDialect` — our code immediately becomes wrong in yet another place.

Another (and perhaps even better) alternative would be to add `requires_fold_postprocessing` or something like it to the `SQLAlchemySchemaInfo`. After all, that object knows the dialect (and therefore the emitted SQL) better than anything else, so realistically it should be the one specifying whether post-processing is required or not. From that perspective, `ProviderMetadata` would be viewed as simply representing the parts of `SQLAlchemySchemaInfo` that the client may need to know as part of executing the query, which is appealing.
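A rough sketch of that alternative, with hypothetical simplified shapes for both classes (the real `SQLAlchemySchemaInfo` holds much more than a dialect):

```python
from dataclasses import dataclass
from typing import Any

from sqlalchemy.dialects.mssql.base import MSDialect


# Hypothetical, simplified shape, just to show where the flag could live.
@dataclass(frozen=True)
class SQLAlchemySchemaInfo:
    dialect: Any  # in reality, a SQLAlchemy Dialect instance

    @property
    def requires_fold_postprocessing(self) -> bool:
        # The schema info owns the dialect (and therefore the emitted SQL),
        # so it decides whether fold post-processing is needed.
        return isinstance(self.dialect, MSDialect)


@dataclass(frozen=True)
class ProviderMetadata:
    backend_type: str
    requires_fold_postprocessing: bool


def provider_metadata_from(schema_info: SQLAlchemySchemaInfo) -> ProviderMetadata:
    # ProviderMetadata becomes a plain projection of the schema info parts
    # that the client needs in order to execute the query.
    return ProviderMetadata(
        backend_type=schema_info.dialect.name,
        requires_fold_postprocessing=schema_info.requires_fold_postprocessing,
    )
```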
I like this idea! I will work on a separate PR to incorporate it.
query_and_parameters: QueryStringWithParameters,
provider_id: str,
*,
desired_page_size: Optional[int] = 5000,
I'm kind of on the fence about setting a default here. The right number will depend on each user's statistics collection configuration — aside from some special cases, we can't generate pages that are more fine-grained than the statistics backing them. Within Kensho, we always know and control what that configuration might be. But in general, we might not.
On the flip side, not setting a number would make pagination disabled by default, and that's also a default value — and not a good one. Even if the statistics aren't fine-grained enough, the compiler will emit advisories when paginating, which should be an obvious and "soft" way to communicate the issue. And 5000 is a pretty decent compromise between large page sizes (for high throughput) and not blowing up the server or client.
What do you think?
I guess the alternative is that we make this a required param? Do you think it would be better to force the user to make a conscious decision here, since it does depend on the user's statistics?
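To make the trade-off concrete, here is a sketch of the two signatures under discussion; the function names and the untyped first parameter are placeholders:

```python
from typing import Any, Optional


# Option A (current draft): pagination enabled by default with a compromise value.
def execute_query_with_default(
    query_and_parameters: Any,
    provider_id: str,
    *,
    desired_page_size: Optional[int] = 5000,
) -> None:
    ...


# Option B: no default, so every caller makes a conscious choice; passing
# None disables pagination explicitly rather than by omission.
def execute_query_with_required_arg(
    query_and_parameters: Any,
    provider_id: str,
    *,
    desired_page_size: Optional[int],
) -> None:
    ...
```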
compilation_result = compile_graphql_to_match(
    schema_info, query_and_parameters.query_string
)
# TODO(selene): add InterpreterAdapter based backends
I'm unsure of how to deal with InterpreterAdapters. The public methods are `interpret_ir` and `interpret_query`. `SimpleExecute` is set up to contain a query string, which is the input of `interpret_query`, but that function also requires the schema to be passed. It seems like we'd want to use the schema to compile the query in this function, but then our output (the equivalent of the `query` in `SimpleExecute`) would be `IrAndMetadata` rather than a query string. Making the type of `query` be `Union[str, IrAndMetadata]` doesn't really seem to make sense, but maybe it would be more tolerable if we renamed `query` to `backend_input` or something similar? Thoughts?
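A sketch of what that rename could look like, with `IrAndMetadata` stubbed out and the field set reduced to just the parts under discussion:

```python
from dataclasses import dataclass
from typing import Union


class IrAndMetadata:
    """Stand-in for the compiler's IR-plus-metadata bundle."""


@dataclass(frozen=True)
class SimpleExecute:
    provider_id: str
    # With the rename, the Union reads more naturally: string-based backends
    # receive their query text, interpreter-based backends receive compiled IR.
    backend_input: Union[str, IrAndMetadata]
```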
This is part of a larger refactor to streamline the way we generate query plans. This is the simplest form of query plan where the given query is executed against the database.
Future work includes adding paginated queries and cross-database queries.
There are some outstanding TODOs, mainly around how best to use `provider_id` to look up both the corresponding schema and the corresponding database query execution function.
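As a discussion aid for that TODO, here is one hypothetical shape for the `provider_id` lookup: a registry mapping each provider to its schema info and execution function. None of these names exist in the codebase; this is just a sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


# Hypothetical registry: a single lookup from provider_id to everything
# needed to plan and execute queries against that provider.
@dataclass(frozen=True)
class ProviderRegistration:
    schema_info: Any  # a CommonSchemaInfo or SQLAlchemySchemaInfo
    execute_query: Callable[..., Any]  # the provider's query execution function


_PROVIDER_REGISTRY: Dict[str, ProviderRegistration] = {}


def register_provider(provider_id: str, registration: ProviderRegistration) -> None:
    if provider_id in _PROVIDER_REGISTRY:
        raise AssertionError(f"Provider {provider_id!r} is already registered.")
    _PROVIDER_REGISTRY[provider_id] = registration


def lookup_provider(provider_id: str) -> ProviderRegistration:
    return _PROVIDER_REGISTRY[provider_id]
```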