[Query Planning Refactor]: SimpleExecute #1008

Draft: wants to merge 10 commits into main

Conversation

chewselene (Collaborator) commented:

This is part of a larger refactor to streamline the way we generate query plans. This PR adds the simplest form of query plan, in which the given query is executed directly against the database.

Future work includes adding paginated queries and cross-database queries.

There are some outstanding TODOs, mainly around how best to use provider_id to look up both the corresponding schema and the corresponding database query execution function.

@obi1kenobi (Contributor) left a comment:

This looks quite good, thank you for putting it together!

Since this is still a draft and we'll want a couple more iterations to finalize everything, it might be good to try splitting it up into a few parts. For example, the following parts have straightforward mutual dependencies and could in principle be merged separately, shrinking the size of each change and making iteration and review easier:

  • the namedtuple -> dataclass part (a minimal sketch follows this list)
  • the "move existing planning functionality to a new module" part
  • the new typedefs
  • the code that uses the new typedefs to do query planning
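
For concreteness, a minimal sketch of the first bullet, using QueryStringWithParameters from this PR as the example (its exact fields here are assumed, not verbatim from the diff):

    from collections import namedtuple
    from dataclasses import dataclass
    from typing import Any, Dict

    # Before: an immutable record, but with no type information on its fields.
    QueryStringWithParameters = namedtuple(
        "QueryStringWithParameters", ["query_string", "parameters"]
    )

    # After: still immutable (frozen), but with typed, self-documenting fields.
    @dataclass(frozen=True)
    class QueryStringWithParameters:
        query_string: str
        parameters: Dict[str, Any]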

(Several review threads on graphql_compiler/query_planning_and_execution/typedefs.py and one on setup.py were marked outdated and resolved.)
query_and_parameters: QueryStringWithParameters,
schema_info: Union[CommonSchemaInfo, SQLAlchemySchemaInfo],
provider_id: str,
backend_type: Optional[str],
@obi1kenobi (Contributor) commented:

I'm not in love with the fact that this is Optional, but I don't have a concrete suggestion in mind. Maybe we can work together to figure something out?

@chewselene (Collaborator, Author) replied:

I switched this to use the BackendType enum, so I think it makes sense for it to not be optional. It is still a bit weird: when schema_info is a SQLAlchemySchemaInfo, I currently check that backend_type is either MSSQL or PostgreSQL, but I don't check that it matches the dialect in the schema info. If I added that check, we could run into the problem you mentioned:

    "imagine a future SQLAlchemy version adds a new dialect for MSSQL that doesn't inherit from MSDialect — our code immediately becomes wrong in yet another place"

so I'm not sure what the best solution is...
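
For reference, a minimal sketch of the check described above (the enum members and the import path are assumptions, not verbatim from this PR):

    from enum import Enum, unique
    from typing import Union

    from graphql_compiler.schema.schema_info import CommonSchemaInfo, SQLAlchemySchemaInfo

    @unique
    class BackendType(Enum):
        """Kinds of backends a query plan can target."""
        MATCH = "match"
        MSSQL = "mssql"
        POSTGRESQL = "postgresql"

    def _ensure_backend_type_is_plausible(
        schema_info: Union[CommonSchemaInfo, SQLAlchemySchemaInfo],
        backend_type: BackendType,
    ) -> None:
        # SQLAlchemy schemas must target a SQL backend. This deliberately does
        # not compare backend_type against schema_info.dialect, to avoid the
        # dialect-subclassing problem discussed above.
        if isinstance(schema_info, SQLAlchemySchemaInfo):
            if backend_type not in (BackendType.MSSQL, BackendType.POSTGRESQL):
                raise AssertionError(
                    f"SQLAlchemySchemaInfo requires a SQL backend type, "
                    f"got {backend_type}."
                )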


provider_metadata = ProviderMetadata(
backend_type=schema_info.dialect.name,
requires_fold_postprocessing=isinstance(schema_info.dialect, MSDialect),
@obi1kenobi (Contributor) commented:

One of the reasons why I'd love backend_type to be non-optional and an enum is so that we can make this check stricter. Here's what I'm worried about: imagine a future SQLAlchemy version adds a new dialect for MSSQL that doesn't inherit from MSDialect — our code immediately becomes wrong in yet another place.

Another (and perhaps even better) alternative would be to add requires_fold_postprocessing or something like it to the SQLAlchemySchemaInfo. After all, that object knows the dialect (and therefore the emitted SQL) better than anything else, so realistically it should be the one specifying whether post-processing is required or not. From that perspective, ProviderMetadata would be viewed as simply representing the parts of SQLAlchemySchemaInfo that the client may need to know as part of executing the query, which is appealing.
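
A rough sketch of that alternative, with both classes heavily simplified (only the property name comes from this thread; the rest is assumed):

    from dataclasses import dataclass

    from sqlalchemy.dialects.mssql.base import MSDialect

    @dataclass(frozen=True)
    class ProviderMetadata:  # simplified stand-in for the class in this PR
        backend_type: str
        requires_fold_postprocessing: bool

    class SQLAlchemySchemaInfo:  # simplified: the real class has more fields
        def __init__(self, dialect) -> None:
            self.dialect = dialect

        @property
        def requires_fold_postprocessing(self) -> bool:
            # The schema info knows the dialect (and therefore the emitted
            # SQL), so it decides whether fold post-processing is needed,
            # rather than scattering isinstance() checks at call sites.
            return isinstance(self.dialect, MSDialect)

    def make_provider_metadata(schema_info: SQLAlchemySchemaInfo) -> ProviderMetadata:
        # ProviderMetadata simply copies the flag instead of re-deriving it.
        return ProviderMetadata(
            backend_type=schema_info.dialect.name,
            requires_fold_postprocessing=schema_info.requires_fold_postprocessing,
        )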

@chewselene (Collaborator, Author) replied:

I like this idea! I will work on a separate PR to incorporate it.

query_and_parameters: QueryStringWithParameters,
provider_id: str,
*,
desired_page_size: Optional[int] = 5000,
@obi1kenobi (Contributor) commented:

I'm kind of on the fence about setting a default here. The right number will depend on each user's statistics collection configuration — aside from some special cases, we can't generate pages that are more fine-grained than the statistics backing them. Within Kensho, we always know and control what that configuration might be. But in general, we might not.

On the flip side, not setting a number would make pagination disabled by default, and that's also a default value — and not a good one. Even if the statistics aren't fine-grained enough, the compiler will emit advisories when paginating which should be an obvious and "soft" way to communicate the issue. And 5000 is a pretty decent compromise between large page sizes (for high throughput) and not blowing up the server or client.

What do you think?

@chewselene (Collaborator, Author) replied:

I guess the alternative is that we make this a required parameter? Do you think it would be better to force the user to make a conscious decision here, since the right value depends on their statistics configuration?
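
The two options side by side, roughly (the function name is illustrative, and the signature is trimmed to the relevant parts):

    from typing import Optional

    # Option A (current): pagination on by default, with a compromise size.
    def plan_read_query_with_default(
        provider_id: str, *, desired_page_size: Optional[int] = 5000
    ) -> None: ...

    # Option B: keyword-only and required, so every caller makes a conscious
    # choice: an int tuned to their statistics, or None to disable pagination.
    def plan_read_query_explicit(
        provider_id: str, *, desired_page_size: Optional[int]
    ) -> None: ...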

(A review thread on graphql_compiler/query_planning_and_execution/__init__.py was marked outdated and resolved.)
@chewselene mentioned this pull request on Apr 26, 2021.
compilation_result = compile_graphql_to_match(
schema_info, query_and_parameters.query_string
)
# TODO(selene): add InterpreterAdapter based backends
@chewselene (Collaborator, Author) commented:

I'm unsure of how to deal with InterpreterAdapters. The public methods are interpret_ir and interpret_query. SimpleExecute is set up to contain a query string, which is the input of interpret_query, but that function also requires the schema to be passed. It seems like we'd want to use the schema to compile the query inside this function, but then our output (the equivalent of the query in SimpleExecute) would be IrAndMetadata rather than a query string. Making the type of query be Union[str, IrAndMetadata] doesn't really seem to make sense, but maybe it would be more tolerable if we renamed query to backend_input or something similar? Thoughts?
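
To make the trade-off concrete, a sketch of the rename (the field set is abbreviated, and the IrAndMetadata import path is assumed):

    from dataclasses import dataclass
    from typing import Union

    from graphql_compiler.compiler.compiler_frontend import IrAndMetadata

    @dataclass(frozen=True)
    class SimpleExecute:  # abbreviated: the real class carries more fields
        provider_id: str
        # "backend_input" rather than "query": a query string for backends
        # that execute text (e.g. SQL or MATCH), or IrAndMetadata for
        # InterpreterAdapter-based backends that consume compiled IR directly.
        backend_input: Union[str, IrAndMetadata]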
