WIP: Plugable Query Result Storage-Cache #3335
Conversation
Thank you, @ismailsimsek! There are some implementation-specific comments below, but I think it's worth discussing the general direction of the implementation first --
- The API should allow for "streaming" load and save of results, to really allow handling large results while minimizing memory usage during both operations.
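A streaming-friendly interface might look something like this. This is only a sketch of the idea; the class and method names (`QueryResultData`, `save_rows`, `iter_rows`) are hypothetical, not the PR's actual API:

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator


class QueryResultData(ABC):
    """Abstract storage backend for query results.

    Rows are consumed and produced as iterators, so a large result
    set never has to be fully materialized in memory at once.
    """

    @abstractmethod
    def save_rows(self, columns: list, rows: Iterable[dict]) -> str:
        """Stream rows into the backend; return a pointer to the stored data."""

    @abstractmethod
    def iter_rows(self, pointer: str) -> Iterator[dict]:
        """Stream rows back from the backend, one at a time."""
```

A concrete 'db', 'file', or 's3' backend would then subclass this and implement both methods against its storage.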
- I'm not sure there is a point in splitting the results into two tables when using DB persistence. Postgres has its own mechanism (TOAST) to make sure rows stay within a reasonable size. Also, running a migration on query_results
might not be feasible on some deployments (because of its size), so it might be worth considering an implementation that doesn't require one. I was thinking about repurposing the data
column to store a "pointer" to the actual results location. It could then also include some metadata (like row count, column metadata, and maybe even a preview of the data).
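Such a pointer could be a small JSON document stored in the existing `data` column. A hypothetical sketch (the field names and path are illustrative, not anything defined by the PR):

```python
import json

# Hypothetical pointer record replacing the inline result payload in the
# existing `data` column: it locates the real results and carries cheap
# metadata, so list views never need to load the full result set.
pointer = {
    "handler": "file",  # which storage backend holds the rows
    "location": "/var/lib/redash/results/12345.jsonl",  # illustrative path
    "row_count": 128000,
    "columns": [{"name": "id", "type": "integer"}],
    "preview": [{"id": 1}],  # first few rows for quick display
}

serialized = json.dumps(pointer)   # small enough to live in the DB row
restored = json.loads(serialized)  # cheap to read without touching results
```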
Agree. One way I can think of is passing a QueryResultData object around the application and streaming the data only when it is needed, using save and load methods.
Yep, something along these lines.
Yes, I was thinking about JSON Lines. It allows for streaming and is more efficient storage-wise, as column names are not repeated on each line. It might be a bit harder to read on the client side, but shouldn't be that much harder.
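For illustration, a minimal JSON Lines round-trip with the standard library, where the first line carries the column names and each following line is a row as a plain array (so names are not repeated per row). The function names are hypothetical:

```python
import io
import json


def write_jsonl(stream, columns, rows):
    """First line holds the column names; each following line is one row
    serialized as a plain JSON array, so names appear only once."""
    stream.write(json.dumps(columns) + "\n")
    for row in rows:
        stream.write(json.dumps([row[c] for c in columns]) + "\n")


def read_jsonl(stream):
    """Lazily reassemble rows into dicts, reading one line at a time."""
    columns = json.loads(next(stream))
    for line in stream:
        yield dict(zip(columns, json.loads(line)))


buf = io.StringIO()
write_jsonl(buf, ["id", "name"], [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
buf.seek(0)
rows = list(read_jsonl(buf))  # rows come back as dicts
```

Because both sides operate line by line, neither writing nor reading needs the whole result set in memory.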
data_handler = "file"
sample_row_limit = 50
def __init__(self, data=None):
Function __init__ has a Cognitive Complexity of 6 (exceeds 5 allowed). Consider refactoring.
class QueryResultDataFactory(object):

    @staticmethod
    def get_handler(data=None):
Function get_handler has a Cognitive Complexity of 7 (exceeds 5 allowed). Consider refactoring.
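One way to keep such a factory's complexity down is a plain dispatch table. This is a sketch, not the PR's code; the handler classes are stand-ins, and in the PR the choice would come from the `data_handler` setting shown earlier:

```python
class DBQueryResultData:
    """Default behaviour: keep the serialized results in the database row."""
    name = "db"


class FileQueryResultData:
    """Store results on local disk; the DB row keeps only a pointer."""
    name = "file"


class QueryResultDataFactory:
    # Dispatch table: one lookup instead of an if/elif chain,
    # and new backends (e.g. 's3') register with a single entry.
    _handlers = {"db": DBQueryResultData, "file": FileQueryResultData}

    @staticmethod
    def get_handler(data_handler="db"):
        """Return an instance of the handler configured for this deployment."""
        try:
            return QueryResultDataFactory._handlers[data_handler]()
        except KeyError:
            raise ValueError("unknown data_handler: %s" % data_handler)
```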
redash/query_runner/__init__.py
Outdated
@@ -136,6 +137,14 @@ def get_schema(self, get_stats=False):
        self._get_tables_stats(schema_dict)
        return schema_dict.values()


def handle_result_data(self, cursor, columns):
Too many blank lines (2)
redash/query_runner/__init__.py
Outdated
def handle_result_data(self, cursor, columns):
    # @TODO all stream-ready query_runner runners should use QueryResultDataFactory
    # if not, they can override this function and use the old behaviour, which will be 'db'
    data_handler=QueryResultDataFactory().get_handler()
Missing whitespace around operator
redash/query_runner/__init__.py
Outdated
# @TODO all stream ready query_runner runners shuld use QueryResultDataFactory | ||
# if not then they can override this function and use old behaviour which will be 'db' | ||
data_handler=QueryResultDataFactory().get_handler() | ||
json_data = data_handler.save(cursor,columns) |
Missing whitespace after ','
@arikfr The high-level design is ready. Do you have time to have a look at it?
Thank you for your effort, @ismailsimsek. We decided to take a bit of a different approach in #4147.
This pull request adds a pluggable QueryResultData class that is responsible for managing query_result.data (a large Python object).
The model is flexible and can easily be extended, for example to use S3. The current implementations are 'db' (the default) and 'file', which uses local disk to store query_result.data.
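A minimal 'file' handler along these lines could look as follows. This is a sketch under assumptions, not the PR's implementation: the class name, the spool directory, and the save/load signatures are all illustrative.

```python
import json
import os
import tempfile


class FileResultStore:
    """Persist query_result.data as a JSON file on local disk; the
    database row then only needs to hold the file path."""

    def __init__(self, directory=None):
        # Illustrative default: a throwaway temp directory. A real
        # deployment would use a configured, durable location.
        self.directory = directory or tempfile.mkdtemp(prefix="redash-results-")

    def save(self, query_result_id, data):
        path = os.path.join(self.directory, "%s.json" % query_result_id)
        with open(path, "w") as f:
            json.dump(data, f)
        return path  # stored in the DB instead of the payload itself

    def load(self, path):
        with open(path) as f:
            return json.load(f)
```

Swapping 'file' for 's3' would mean replacing the open/write calls with object-store puts and gets while keeping the same save/load interface.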
Why?
Making sure the application can handle big query results, and hoping to improve application performance with big result sets.
Implementation notes
What's next
Changing the query result serialization to a more performant format: Avro, MessagePack, ...