Optimize the experimental python types flag #206

lpoulain · 2022-07-21T20:46:54Z

Optimizes the experimental_python_types flag. Instead of checking the type for each row, check the type once for each fetch() call and compute a list of lambdas which are to be applied to the values from each row.

A new RowMapperFactory class was created to wrap this behavior. The columns content from the status is analyzed and a list of lambda functions is created (one lambda per column)
The experimental_python_types flag is now processed in the TrinoQuery class instead of the TrinoResult class

tests/integration/test_types_integration.py

trino/client.py

mdesmet · 2022-07-22T10:09:27Z

trino/client.py

+    def _map_row(self, row):
+        return [self._map_value(value, self.mapping_funcs[idx]) for idx, value in enumerate(row)]
+
+    def _map_value(self, value, col_mapping_func):
+        if value is None:
+            return None
+
+        try:
+            return col_mapping_func(value)
+        except ValueError as e:
+            error_str = f"Could not convert '{value}' into the associated python type"
+            raise trino.exceptions.TrinoDataError(error_str) from e


Move to separate class RowMapper. If no col_mapping_funcs are provided in constructor, return the row value as is.

mdesmet · 2022-07-22T10:17:29Z

trino/client.py

@@ -776,6 +701,11 @@ def fetch(self) -> List[List[Any]]:
        self._response_headers = response.headers
        if status.next_uri is None:
            self._finished = True
+
+        if self._experimental_python_types:
+            factory = RowMapperFactory(status.columns)


RowMapperFactory().create(...) should only be called once when the columns are known. Maybe move this logic to the _update_state method. Add the created RowMapper instance on an instance field of TrinoQuery.

I modified the code as best as I could, because in execute() function expects some row (maybe) yet the columns may be not known yet.

Yeah, the columns are only known when Trino is starting to send data, which can be in the first execute or in any of the subsequent fetch calls. We should delay the creation of the RowMapper until the columns are present and reuse that cached RowMapper on the fetched data to get the performance benefit.

mdesmet · 2022-07-22T10:21:49Z

trino/client.py

+        if self._experimental_python_types:
+            factory = RowMapperFactory(status.columns)
+            rows = factory.map_rows(status.rows)
+        else:
+            rows = status.rows
+
+        self._result = TrinoResult(self, rows)


Also applies to the fetch method.

Suggested change

if self._experimental_python_types:

factory = RowMapperFactory(status.columns)

rows = factory.map_rows(status.rows)

else:

rows = status.rows

self._result = TrinoResult(self, rows)

assert self._row_mapper is not None

self._result = TrinoResult(self, self._row_mapper(rows))

Actually if we want to call RowMapperFactory() only once we know the column, then we shouldn't call this in execute(). At this point we have no guarantee we'll have columns at this point.

Correct, that's why I suggested to add this in the _update_state method to call it only when you actually have the columns.

hovaesco

Could you please add some docstrings/comments to the new classes?

hovaesco · 2022-07-27T11:49:38Z

trino/client.py

+    def _col_func(self, column):
+        col_type = column['rawType']
+
+        if col_type == 'array':


Can this be splitted to separate functions per data type?

That way we could do unit tests. especially on the date and time types, that would be beneficial.

hovaesco · 2022-07-27T11:51:57Z

trino/client.py

+            if ms_size > 0:
+                pattern += ".%f"
+
+            if col_type.startswith('timestamp'):


Let's try to avoid these nested conditionals, it would improve code visibility.

hovaesco · 2022-07-27T11:52:49Z

trino/client.py

+        col_type = column['rawType']
+
+        if col_type == 'array':
+            elt_mapping_func = self._col_func(column['arguments'][0]['value'])


What is the meaning of the name elt_mapping_func?

That's the mapping function for each element of the array

mdesmet · 2022-08-01T07:32:00Z

trino/client.py

        return self._result

    def _update_state(self, status):
        self._stats.update(status.stats)
        self._update_type = status.update_type
+        self._row_mapper = RowMapperFactory().create(columns=status.columns,


_update_state will be called on every request, in order to get the benefit of only initializing the row mapper once. we would need to to ensure it's only called once.

Suggested change

self._row_mapper = RowMapperFactory().create(columns=status.columns,

if self._row_mapper is None and status.columns is not None:

self._row_mapper = RowMapperFactory().create(columns=status.columns,

That way the RowMapperFactory().create() would only be called once when the columns are set.

mdesmet · 2022-08-01T14:38:06Z

trino/client.py

@@ -776,7 +701,8 @@ def fetch(self) -> List[List[Any]]:
        self._response_headers = response.headers
        if status.next_uri is None:
            self._finished = True
-        return status.rows
+
+        return self._row_mapper.map(status.rows) if self._row_mapper else status.rows


At this stage the self._row_mapper should never be undefined. When rows are returned, the columns are also known.

Actually _row_mapper may NOT be defined at this point. This is when the query is queued and we're still waiting for results.

mdesmet · 2022-08-01T14:40:39Z

trino/client.py

        return self._result

    def _update_state(self, status):
        self._stats.update(status.stats)
        self._update_type = status.update_type
+        if not self._row_mapper:


I think it's possible that the columns are not yet known at this point. The row mapper would not be recreated in case the columns arrive later.

That is correct, but then the RowMapperFactory will just return None

ebyhr

Let me defer the final review to @hashhar.

tests/integration/test_types_integration.py

hashhar

A quick pass.

tests/integration/test_types_integration.py

hashhar

Some more comments.

I'd recommend focusing on everything except temporal types for now since temporal types are a rabbit hole and need to be reviewed in more depth.

tests/integration/test_types_integration.py

hashhar · 2022-08-17T20:15:17Z

tests/integration/test_types_integration.py

+        .add_field(sql="cast(null AS VARBINARY)", python=None) \
+        .add_field(sql="cast('{}' AS JSON)", python='"{}"') \
+        .add_field(sql="cast('null' AS JSON)", python='"null"') \
+        .add_field(sql="cast(null AS JSON)", python=None) \


varbinary and JSON into their own methods.

Also let's test some varbinary values which don't represent valid strings.

Also empty varbinary (not null) is special as well.

for JSON we should also test float values, strings, strings with special chars (", ', :), empty array as well.

Actually right now we don't seem to perform anything special treatment for JSON

Done for splitting each type in its own test

Either add TODO to add the missing tests or add the tests for the other values and assert that they fail.

What test are you talking about here? JSON tests (even though we just convert them to strings), varbinary which represent invalid strings (e.g. non-UTF-8?) or something else?

Both.

Add test for varbinary which doesn't represent a string

Add test for JSON with:

JSON '3.14'

JSON 'a string'

JSON 'a " complex ' string :' (I'm not sure if I escaped that ' correctly)

JSON '[]'

Even if the test fails we should assert on that failure so that when something like #214 changes behaviour we know to update the tests. It also serves as documentation of the "issues" we need to fix.

I'm not entirely sure I know exactly what you have in mind but I've added such tests.

tests/integration/test_types_integration.py

hashhar · 2022-08-18T07:51:54Z

trino/client.py

+        return RowMapper()
+
+    def _col_func(self, column):
+        col_type = column['rawType']


What other fields does the typeSignature have available? I'm asking because it's possible for startswith to fail in presence of user contributed types.

typeSignature has rawType and arguments fields (the latter is used for types such as decimal(3, 1))

Then can the checks below use rawType == 'decimal' instead of existing rawType.startswith('decimal') etc.?

I've seen cases where the value is decimal(18, 18)

👀 Thanks. I'll need to re-read some code on the server side to have better understanding. 🙂

trino/client.py

hashhar · 2022-08-18T07:57:15Z

trino/client.py

+                return lambda val: \
+                    [datetime.strptime(val[:dt_size] + val[dt_tz_offset:], pattern + ' %z')
+                        if tz.startswith('+') or tz.startswith('-')
+                        else datetime.strptime(dt[:dt_size] + dt[dt_tz_offset:], pattern)
+                                     .replace(tzinfo=pytz.timezone(tz))
+                        for dt, tz in [val.rsplit(' ', 1)]][0]


while this is very concise a proper function would be more readable.

Would it? The function relies on multiple variables computed before the lambda (e.g. dt_tz_offset, tz, etc.)

trino/client.py

tests/integration/test_types_integration.py

trino/client.py

hashhar

LGTM.

@ebyhr FYI I've asked timestamp related tests to be added in separate PR to help with reviewing them.

hashhar · 2022-08-19T14:24:04Z

@lpoulain Can you squash the "fixups" together?

lpoulain · 2022-08-19T14:35:01Z

@hashhar done

hashhar · 2022-08-23T13:11:34Z

@lpoulain "All commits" do not need to be squashed. Just the logically related ones.

For future PRs see https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#git-merge-strategy

hashhar · 2022-08-23T13:17:25Z

@lpoulain Please reword the commit message as:

Optimize experimental_python_types and add type-mapping tests

Instead of checking the type for each row, check the type once for each
fetch() call and compute a list of lambdas which are to be applied to
the values from each row. A new RowMapperFactory class is created to
wrap this behavior.
The experimental_python_types flag is now processed in the TrinoQuery
class instead of the TrinoResult class.

Type mapping tests for each lambda which maps rows to Python types is
added.

hashhar

Looks good but please reword commit message to have useful information.

Instead of checking the type for each row, check the type once for each fetch() call and compute a list of lambdas which are to be applied to the values from each row. A new RowMapperFactory class is created to wrap this behavior. The experimental_python_types flag is now processed in the TrinoQuery class instead of the TrinoResult class. Type mapping tests for each lambda which maps rows to Python types is added.

lpoulain · 2022-08-23T13:25:36Z

Looks good but please reword commit message to have useful information.

@hashhar done

cla-bot bot added the cla-signed label Jul 21, 2022

lpoulain requested review from mdesmet, ebyhr and hashhar July 21, 2022 20:47

ebyhr reviewed Jul 21, 2022

View reviewed changes

tests/integration/test_types_integration.py Outdated Show resolved Hide resolved

mdesmet reviewed Jul 22, 2022

View reviewed changes

hovaesco reviewed Jul 27, 2022

View reviewed changes

mdesmet reviewed Aug 1, 2022

View reviewed changes

hovaesco approved these changes Aug 2, 2022

View reviewed changes

mdesmet requested a review from ebyhr August 3, 2022 10:56

ebyhr reviewed Aug 16, 2022

View reviewed changes

tests/integration/test_types_integration.py Outdated Show resolved Hide resolved

tests/integration/test_types_integration.py Outdated Show resolved Hide resolved

tests/integration/test_types_integration.py Outdated Show resolved Hide resolved

hashhar reviewed Aug 16, 2022

View reviewed changes

hashhar reviewed Aug 17, 2022

View reviewed changes

hashhar reviewed Aug 18, 2022

View reviewed changes

tests/integration/test_types_integration.py Show resolved Hide resolved

hashhar reviewed Aug 18, 2022

View reviewed changes

trino/client.py Show resolved Hide resolved

hashhar reviewed Aug 18, 2022

View reviewed changes

trino/client.py Outdated Show resolved Hide resolved

hashhar reviewed Aug 18, 2022

View reviewed changes

trino/client.py Outdated Show resolved Hide resolved

hashhar reviewed Aug 18, 2022

View reviewed changes

trino/client.py Outdated Show resolved Hide resolved

hashhar reviewed Aug 19, 2022

View reviewed changes

tests/integration/test_types_integration.py Show resolved Hide resolved

trino/client.py Show resolved Hide resolved

trino/client.py Outdated Show resolved Hide resolved

hashhar approved these changes Aug 19, 2022

View reviewed changes

lpoulain force-pushed the optimize-experimental-type-processing branch from fe93827 to 33a7cd5 Compare August 19, 2022 14:34

hashhar approved these changes Aug 23, 2022

View reviewed changes

lpoulain force-pushed the optimize-experimental-type-processing branch from 33a7cd5 to f0da713 Compare August 23, 2022 13:24

hashhar merged commit cffd2b2 into trinodb:master Aug 23, 2022

	self._row_mapper = RowMapperFactory().create(columns=status.columns,
	if self._row_mapper is None and status.columns is not None:
	self._row_mapper = RowMapperFactory().create(columns=status.columns,

Optimize the experimental python types flag #206

Optimize the experimental python types flag #206

Conversation

lpoulain commented Jul 21, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hovaesco left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lpoulain Aug 1, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ebyhr left a comment

Choose a reason for hiding this comment

hashhar left a comment

Choose a reason for hiding this comment

hashhar left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hashhar left a comment

Choose a reason for hiding this comment

hashhar commented Aug 19, 2022

lpoulain commented Aug 19, 2022

hashhar commented Aug 23, 2022

hashhar commented Aug 23, 2022 • edited Loading

hashhar left a comment

Choose a reason for hiding this comment

lpoulain commented Aug 23, 2022

lpoulain Aug 1, 2022 •

edited

Loading

hashhar commented Aug 23, 2022 •

edited

Loading