
WIP: Plugable Query Result Storage-Cache #3335

Closed
wants to merge 1 commit into from

Conversation

@ismailsimsek

This pull request adds a pluggable QueryResultData class, which is responsible for managing query_result.data (a large Python object).
The model is flexible and can easily be extended, for example with S3. The current implementations are 'db' (the default) and 'file', which uses the local disk to store query_result.data.

Why?
Making sure the application can handle big query results, and hopefully improving application performance with big result sets.

Implementation notes

  • By default it uses ‘db’, and the code is backward compatible.
  • QueryResultData is used only when it's necessary; the rest of the QueryResult operations are done without ‘data’ (the large Python object).
    • Since query_result_data rows are much smaller now, any query hitting this table is going to be fast, as it doesn't need to read the “data” field from disk.
  • QueryResultData uses QueryResult.id as the key of the result object; in the case of ‘db’ it's the PK.
  • QueryResultData is used only from QueryResult, so we get minimum refactoring impact.
  • QueryResultData is only responsible for storing, retrieving, or deleting the data object; it is independent of the type of the ‘data’ object (see the interface sketch below).
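
To make the handler contract concrete, here is a minimal sketch of what the pluggable interface could look like. QueryResultData and the 'db'/'file' handler split come from this PR; the exact method signatures and the FileQueryResultData details below are illustrative assumptions, not the code in this branch.

import json
import os


class QueryResultData(object):
    # Abstract handler: stores, retrieves and deletes the 'data' blob,
    # keyed by QueryResult.id. It knows nothing about the shape of 'data'.

    def save(self, result_id, data):
        raise NotImplementedError

    def load(self, result_id):
        raise NotImplementedError

    def delete(self, result_id):
        raise NotImplementedError


class FileQueryResultData(QueryResultData):
    # 'file' handler: keeps each result as a JSON file on the local disk.

    def __init__(self, base_dir="/tmp/query_results"):
        self.base_dir = base_dir
        if not os.path.isdir(base_dir):
            os.makedirs(base_dir)

    def _path(self, result_id):
        return os.path.join(self.base_dir, "%s.json" % result_id)

    def save(self, result_id, data):
        with open(self._path(result_id), "w") as f:
            json.dump(data, f)

    def load(self, result_id):
        with open(self._path(result_id)) as f:
            return json.load(f)

    def delete(self, result_id):
        os.remove(self._path(result_id))

Keying everything by QueryResult.id is what lets a handler be swapped (e.g. for S3) without touching the rest of the model.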

What's next
Changing the query result object to a more performant serialization format: Avro, MessagePack, etc.
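
As a rough illustration of the kind of win a binary format could bring, here is a size-comparison sketch (this assumes the msgpack Python package, which is not part of this PR):

import json

import msgpack  # pip install msgpack

rows = [{"id": i, "name": "row-%d" % i} for i in range(1000)]

as_json = json.dumps(rows)
as_msgpack = msgpack.packb(rows)

# msgpack output is typically noticeably smaller than the JSON equivalent
print(len(as_json), len(as_msgpack))

# and it round-trips back to the same Python objects
assert msgpack.unpackb(as_msgpack, raw=False) == rows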

@ghost ghost added the in progress label Jan 24, 2019
Member

@arikfr arikfr left a comment


Thank you, @ismailsimsek! There are some implementation-specific comments about this implementation.

But I think it's worth discussing the implementation direction in more general terms first --

  1. The API should allow for "streaming" load and save of results, to really allow handling large results while minimizing memory usage during both operations.

  2. I'm not sure there is a point in splitting the results into two tables when using DB persistence. Postgres has its own mechanism (TOAST) to make sure rows stay within a reasonable size.

Also, running a migration on query_results might not be feasible on some deployments (because of size), so it might be worth considering an implementation that doesn't require one. I was thinking about repurposing the data column to store a "pointer" to the actual results location. Then it can also include some more metadata (like row count, column metadata, and maybe even a preview of the data).
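
To illustrate the pointer idea, the repurposed data column could hold a small JSON document along these lines; every field name here is hypothetical:

import json

# Hypothetical "pointer" payload stored in query_results.data instead of the
# full result set; the rows themselves live wherever `location` points.
pointer = {
    "storage": "file",                  # which handler owns the data
    "location": "/results/4242.jsonl",  # handler-specific key or path
    "rows_count": 125000,
    "columns": [
        {"name": "id", "type": "integer"},
        {"name": "name", "type": "string"},
    ],
    "preview": [[1, "first row"], [2, "second row"]],
}

data_column_value = json.dumps(pointer)

Because the pointer is small, existing rows never need to be rewritten in bulk, which is the point of avoiding the migration.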

@arikfr arikfr changed the title Plugable Query Result Storage-Cache WIP: Plugable Query Result Storage-Cache Jan 24, 2019
@ismailsimsek
Author

1. The API should allow for "streaming" load and save of results, to really allow handling large results while minimizing memory usage during both operations.

Agreed. One way I can think of is passing the QueryResultData object around in the application and streaming the data only when it is needed, using the save and load methods.
Do you have any serialization format in mind? With JSON, the only way of doing it is appending multiple JSON objects (each one a row) to a file, which is problematic when reading it back.
Avro has the DataFileWriter.appendTo option, and you can iterate over records when reading. But defining the Avro schema might be problematic; data types derived from the columns might not be compatible.

@arikfr
Member

arikfr commented Jan 27, 2019

Agreed. One way I can think of is passing the QueryResultData object around in the application and streaming the data only when it is needed, using the save and load methods.

Yep, something along these lines.

Do you have any serialization format in mind? With JSON, the only way of doing it is appending multiple JSON objects (each one a row) to a file, which is problematic when reading it back.

Yes, I was thinking about JSON Lines. It allows for streaming and is more efficient storage-wise, as there are no column names in each line. It might be a bit harder to read on the client side, but it shouldn't be that much harder.
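
A minimal sketch of that scheme, assuming one header line with the column names followed by one JSON array per row; the function names and format details below are illustrative, not settled:

import json


def save_jsonl(path, columns, cursor):
    # Stream rows from a DB cursor to disk without materializing them all.
    with open(path, "w") as f:
        f.write(json.dumps(columns) + "\n")  # column names appear only once
        for row in cursor:
            f.write(json.dumps(list(row)) + "\n")


def load_jsonl(path):
    # Yield rows one at a time as dicts, re-attaching the column names.
    with open(path) as f:
        columns = json.loads(next(f))
        for line in f:
            yield dict(zip(columns, json.loads(line)))

Writing row by row keeps memory flat on save, and the generator on load lets the caller consume rows without parsing the whole file.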

data_handler = "file"
sample_row_limit = 50

def __init__(self, data=None):

Function __init__ has a Cognitive Complexity of 6 (exceeds 5 allowed). Consider refactoring.

class QueryResultDataFactory(object):

@staticmethod
def get_handler(data=None):

Function get_handler has a Cognitive Complexity of 7 (exceeds 5 allowed). Consider refactoring.

@@ -136,6 +137,14 @@ def get_schema(self, get_stats=False):
self._get_tables_stats(schema_dict)
return schema_dict.values()


def handle_result_data(self, cursor, columns):

Too many blank lines (2)

def handle_result_data(self, cursor, columns):
# @TODO all stream ready query_runner runners shuld use QueryResultDataFactory
# if not then they can override this function and use old behaviour which will be 'db'
data_handler=QueryResultDataFactory().get_handler()

Missing whitespace around operator

json_data = data_handler.save(cursor,columns)

Missing whitespace after ','

@ismailsimsek
Author

@arikfr the high-level design is ready; do you have time to take a look at it?

@arikfr
Member

arikfr commented Sep 24, 2019

Thank you for your effort, @ismailsimsek. We decided to take a bit of a different approach in #4147.

@arikfr arikfr closed this Sep 24, 2019