
WIP: Plugable Query Result Storage-Cache #3335

Closed
wants to merge 1 commit into from

Conversation

@ismailsimsek

This pull request adds a pluggable QueryResultData class, which is responsible for managing query_result.data (a large Python object).
The model is flexible and can easily be extended, for example with S3. The current implementations are 'db' (the default) and 'file', which uses the local disk to store query_result.data.

Why?
Making sure the application can handle big query results, and hopefully improving application performance with big result sets.

Implementation notes

  • By default it uses ‘db’, and the code is backward compatible.
  • QueryResultData is used only when it's necessary; the rest of the QueryResult operations are done without ‘data’ (the large Python object).
    • Since query_result_data rows are much smaller now, any query hitting this table is going to be fast, as it doesn't need to read the “data” field from disk.
  • QueryResultData uses QueryResult.id as the key of the result object; in the case of ‘db’ it's the PK.
  • QueryResultData is used only from QueryResult, so we get minimum refactoring impact.
  • QueryResultData is only responsible for storing, retrieving, or deleting the data object; it is independent of the type of the ‘data’ object (see the interface sketch below).
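
To make the handler contract concrete, here is a minimal sketch of what the pluggable interface could look like. QueryResultData and the 'db'/'file' handler split come from this PR; the exact method signatures and the FileQueryResultData details below are illustrative assumptions, not the code in this branch.

import json
import os


class QueryResultData(object):
    # Abstract handler: stores, retrieves and deletes the 'data' blob,
    # keyed by QueryResult.id. It knows nothing about the shape of 'data'.

    def save(self, result_id, data):
        raise NotImplementedError

    def load(self, result_id):
        raise NotImplementedError

    def delete(self, result_id):
        raise NotImplementedError


class FileQueryResultData(QueryResultData):
    # 'file' handler: keeps each result as a JSON file on the local disk.

    def __init__(self, base_dir="/tmp/query_results"):
        self.base_dir = base_dir
        if not os.path.isdir(base_dir):
            os.makedirs(base_dir)

    def _path(self, result_id):
        return os.path.join(self.base_dir, "%s.json" % result_id)

    def save(self, result_id, data):
        with open(self._path(result_id), "w") as f:
            json.dump(data, f)

    def load(self, result_id):
        with open(self._path(result_id)) as f:
            return json.load(f)

    def delete(self, result_id):
        os.remove(self._path(result_id))

Keying everything by QueryResult.id is what lets a handler be swapped (e.g. for S3) without touching the rest of the model.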

What's next
Changing the query result object to a more performant serialization format: Avro, MessagePack, etc.
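
As a rough illustration of the kind of win a binary format could bring, here is a size-comparison sketch (this assumes the msgpack Python package, which is not part of this PR):

import json

import msgpack  # pip install msgpack

rows = [{"id": i, "name": "row-%d" % i} for i in range(1000)]

as_json = json.dumps(rows)
as_msgpack = msgpack.packb(rows)

# msgpack output is typically noticeably smaller than the JSON equivalent
print(len(as_json), len(as_msgpack))

# and it round-trips back to the same Python objects
assert msgpack.unpackb(as_msgpack, raw=False) == rows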

@ghost ghost added the in progress label Jan 24, 2019
Member

@arikfr arikfr left a comment


Thank you, @ismailsimsek! There are some implementation-specific comments about this implementation.

But I think it's worth discussing the implementation direction in more general terms first --

  1. The API should allow for "streaming" load and save of results, to really allow handling large results while minimizing memory usage during both operations.

  2. I'm not sure there is a point in splitting the results into two tables when using DB persistence. Postgres has its own mechanism (TOAST) to make sure rows stay within a reasonable size.

Also, running a migration on query_results might not be feasible on some deployments (because of size), so it might be worth considering an implementation that doesn't require one. I was thinking about repurposing the data column to store a "pointer" to the actual results location. Then it can also include some more metadata (like row count, column metadata, and maybe even a preview of the data).
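
To illustrate the pointer idea, the repurposed data column could hold a small JSON document along these lines; every field name here is hypothetical:

import json

# Hypothetical "pointer" payload stored in query_results.data instead of the
# full result set; the rows themselves live wherever `location` points.
pointer = {
    "storage": "file",                  # which handler owns the data
    "location": "/results/4242.jsonl",  # handler-specific key or path
    "rows_count": 125000,
    "columns": [
        {"name": "id", "type": "integer"},
        {"name": "name", "type": "string"},
    ],
    "preview": [[1, "first row"], [2, "second row"]],
}

data_column_value = json.dumps(pointer)

Because the pointer is small, existing rows never need to be rewritten in bulk, which is the point of avoiding the migration.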

@arikfr arikfr changed the title Plugable Query Result Storage-Cache WIP: Plugable Query Result Storage-Cache Jan 24, 2019
@ismailsimsek
Author

1. The API should allow for "streaming" load and save of results, to really allow handling large results while minimizing memory usage during both operations.

Agreed. One way I can think of is passing the QueryResultData object around in the application and streaming the data only when it is needed, using the save and load methods.
Do you have any serialization format in mind? With JSON, the only way of doing it is appending multiple JSON objects (each one a row) to a file, which is problematic when reading it back.
Avro has the DataFileWriter.appendTo option, and you can iterate over records when reading. But defining the Avro schema might be problematic; data types derived from the columns might not be compatible.

@arikfr
Member

arikfr commented Jan 27, 2019

Agreed. One way I can think of is passing the QueryResultData object around in the application and streaming the data only when it is needed, using the save and load methods.

Yep, something along these lines.

Do you have any serialization format in mind? With JSON, the only way of doing it is appending multiple JSON objects (each one a row) to a file, which is problematic when reading it back.

Yes, I was thinking about JSON Lines. It allows for streaming and is more efficient storage-wise, as there are no column names in each line. It might be a bit harder to read on the client side, but it shouldn't be that much harder.
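
A minimal sketch of that scheme, assuming one header line with the column names followed by one JSON array per row; the function names and format details below are illustrative, not settled:

import json


def save_jsonl(path, columns, cursor):
    # Stream rows from a DB cursor to disk without materializing them all.
    with open(path, "w") as f:
        f.write(json.dumps(columns) + "\n")  # column names appear only once
        for row in cursor:
            f.write(json.dumps(list(row)) + "\n")


def load_jsonl(path):
    # Yield rows one at a time as dicts, re-attaching the column names.
    with open(path) as f:
        columns = json.loads(next(f))
        for line in f:
            yield dict(zip(columns, json.loads(line)))

Writing row by row keeps memory flat on save, and the generator on load lets the caller consume rows without parsing the whole file.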

data_handler = "file"
sample_row_limit = 50

def __init__(self, data=None):

Function __init__ has a Cognitive Complexity of 6 (exceeds 5 allowed). Consider refactoring.

class QueryResultDataFactory(object):

@staticmethod
def get_handler(data=None):

Function get_handler has a Cognitive Complexity of 7 (exceeds 5 allowed). Consider refactoring.

@@ -136,6 +137,14 @@ def get_schema(self, get_stats=False):
self._get_tables_stats(schema_dict)
return schema_dict.values()


def handle_result_data(self, cursor, columns):

Too many blank lines (2)

def handle_result_data(self, cursor, columns):
# @TODO all stream ready query_runner runners shuld use QueryResultDataFactory
# if not then they can override this function and use old behaviour which will be 'db'
data_handler=QueryResultDataFactory().get_handler()

Missing whitespace around operator

json_data = data_handler.save(cursor,columns)

Missing whitespace after ','

@ismailsimsek
Author

@arikfr the high-level design is ready; do you have time to take a look at it?

@arikfr
Member

arikfr commented Sep 24, 2019

Thank you for your effort, @ismailsimsek. We decided to take a bit of a different approach in #4147.

@arikfr arikfr closed this Sep 24, 2019