BigTable: On read_row(), provide default to retrieve only most recent cell values #4468

zakons · 2017-11-28T03:27:57Z

The current signature on the Bigtable Table read_row() method has a default of filter_=None:

def read_row(self, row_key, filter_=None):
    ...

In cases where a cell value has been updated multiple times, the default returns the full time series, with a timestamp for each value, which can slow down read performance in a non-obvious way.

In the current Python API, the cells property on row_data (PartialRowData) makes a deep copy of the cells, which compounds the performance issue.

@property
def cells(self):
    """Property returning all the cells accumulated on this partial row.

    :rtype: dict
    :returns: Dictionary of the :class:`Cell` objects accumulated. This
              dictionary has two-levels of keys (first for column families
              and second for column names/qualifiers within a family). For
              a given column, a list of :class:`Cell` objects is stored.
    """
    return copy.deepcopy(self._cells)

Consider:

1. Making a default filter on read_row() to retrieve only the most recent value of any cell unless the full or partial time series is requested.
2. Allowing a ColumnFamily to implicitly or explicitly limit cells to only one value (no time series).
3. Adding a cell_value(column_family_id, column, index=0) method to row_data (PartialRowData) to allow more efficient retrieval of a single cell value.
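To make the third suggestion concrete, here is a minimal sketch of what cell_value() could look like. `RowDataSketch` and this `Cell` namedtuple are hypothetical stand-ins, not the client's real classes, and the sketch assumes the cells for a column are stored newest first.

```python
from collections import namedtuple

# Hypothetical stand-in for the client's Cell: value bytes plus a timestamp.
Cell = namedtuple("Cell", ["value", "timestamp_micros"])


class RowDataSketch:
    """Hypothetical stand-in for PartialRowData, for illustration only."""

    def __init__(self, cells):
        # {column_family_id: {column: [Cell, ...]}}; assumed newest first.
        self._cells = cells

    def cell_value(self, column_family_id, column, index=0):
        """Return a single cell's value without copying the whole row."""
        return self._cells[column_family_id][column][index].value


row = RowDataSketch({
    "stats": {b"views": [Cell(b"42", 2000), Cell(b"41", 1000)]},
})
latest = row.cell_value("stats", b"views")             # index 0: newest value
previous = row.cell_value("stats", b"views", index=1)  # one version older
```

Because only one bytes value is returned, nothing needs to be deep-copied for this access path.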


dhermes · 2017-11-28T03:41:18Z

Thanks for filing @zakons!

1. I don't think it's a good idea to have any default filter. However, I think we could provide some kind of easy-to-access filter, e.g. as a constant in the package (google.cloud.bigtable.MOST_RECENT_FILTER) or maybe as a class constant in Table, so you could write code like
   ```
   row_data = table.read_row(row_key, filter_=table.MOST_RECENT_FILTER)
   ```
2. I'm not quite clear on your ColumnFamily suggestion. The feature set of the actual backend API is all we can really expose. Is there a particular method where you'd like to see different behavior? (Most API calls that actually refer to a column family just use a string ID for the column family, not an actual ColumnFamily instance.)
3. Adding cell_value() seems very easy to implement and not at all controversial to add to the API surface.
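As a rough illustration of the opt-in constant idea, here is an in-memory sketch. `TableSketch` and `CellsPerColumnLimit` are hypothetical stand-ins that apply the limit locally; in the real client the filter would be sent to the backend (I believe `CellsColumnLimitFilter` in `google.cloud.bigtable.row_filters` is the existing filter that limits cells per column), and `MOST_RECENT_FILTER` is only the name proposed in this thread.

```python
class CellsPerColumnLimit:
    """Hypothetical filter object: keep at most `limit` cells per column."""

    def __init__(self, limit):
        self.limit = limit

    def apply(self, families):
        # Cell lists are assumed newest first, so slicing keeps the latest.
        return {
            family: {col: cells[: self.limit] for col, cells in columns.items()}
            for family, columns in families.items()
        }


class TableSketch:
    """In-memory stand-in for Table; real filtering happens server-side."""

    # Opt-in constant as proposed; the default for filter_ stays None.
    MOST_RECENT_FILTER = CellsPerColumnLimit(1)

    def __init__(self, rows):
        # {row_key: {family: {column: [values, newest first]}}}
        self._rows = rows

    def read_row(self, row_key, filter_=None):
        families = self._rows.get(row_key)
        if families is None:
            return None
        if filter_ is None:
            return families  # full time series: unchanged default behavior
        return filter_.apply(families)


table = TableSketch({b"row-1": {"stats": {b"views": [b"42", b"41", b"40"]}}})
full = table.read_row(b"row-1")
recent = table.read_row(b"row-1", filter_=table.MOST_RECENT_FILTER)
```

The point of the design is that existing callers see no change, while callers who only want the latest value ask for it explicitly.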

/cc @garye

sduskis · 2017-11-28T04:16:38Z

1. I like the table.MOST_RECENT_FILTER approach. We can also add in table.KEY_ONLY_FILTER, which is useful for just doing things like row counts.
2. I'm not certain about ColumnFamily either. Is it the same as 1.?
3. +1

@dhermes, @garye is doing less work on the client side these days. I am working with other developers across all languages for Cloud Bigtable.

zakons · 2017-11-28T05:55:57Z

Thanks.

1. I believe the name table.MOST_RECENT_VALUES_FILTER, or something similar that conveys what it is that is recent, would be clearer.
2. I agree with your point that we should not be suppressing any existing backend functionality in our API.

Appreciate the quick feedback.

zakons · 2017-12-14T05:09:21Z

Note that the cell_value(column_family_id, column, index=0) method returns an immutable value, so the underlying Cells, which are mutable, are not exposed to the user.

Also, add a cell_values(column_family_id, column) iterator or generator method to row_data (PartialRowData) that iterates over all the cell values, returning index, timestamp, and value for each iteration, each of which is immutable. The first index in the iteration would be 0, the most recent value.
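A minimal sketch of the cell_values() generator described above, again using hypothetical stand-ins (`RowDataSketch`, a `Cell` namedtuple) rather than the client's real classes, and assuming cells are stored newest first:

```python
from collections import namedtuple

# Hypothetical stand-in for the client's Cell.
Cell = namedtuple("Cell", ["value", "timestamp_micros"])


class RowDataSketch:
    """Hypothetical stand-in for PartialRowData, for illustration only."""

    def __init__(self, cells):
        # {column_family_id: {column: [Cell, ...]}}; assumed newest first.
        self._cells = cells

    def cell_values(self, column_family_id, column):
        """Yield (index, timestamp, value) per cell; index 0 is the newest."""
        for index, cell in enumerate(self._cells[column_family_id][column]):
            # Only ints and bytes are yielded, so callers never receive
            # the underlying mutable cell objects.
            yield index, cell.timestamp_micros, cell.value


row = RowDataSketch({
    "stats": {b"views": [Cell(b"42", 2000), Cell(b"41", 1000)]},
})
history = list(row.cell_values("stats", b"views"))
```

Since this is a generator, a caller that only needs the first few versions can stop iterating early without touching the rest.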

sduskis · 2017-12-14T13:33:56Z

@dhermes, can we please discuss design issues for PR #4564 here?

dhermes · 2017-12-14T20:35:32Z

Sure. Go ahead and discuss? I am 100% for dropping the deepcopy, but I'd prefer to do it as an opt-in, so users can knowingly get mutable values.
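One way the opt-in could look, as a sketch only: the existing cells property keeps its deep copy, and a separate accessor hands back the internal structure. `RowDataSketch` is a hypothetical stand-in and `cells_no_copy()` is an invented name, not something in the client.

```python
import copy


class RowDataSketch:
    """Hypothetical stand-in for PartialRowData, for illustration only."""

    def __init__(self, cells):
        self._cells = cells

    @property
    def cells(self):
        # Existing behavior: callers get a private copy they can mutate.
        return copy.deepcopy(self._cells)

    def cells_no_copy(self):
        # Hypothetical opt-in: no copy, so mutations affect the row itself.
        return self._cells


row = RowDataSketch({"stats": {b"views": [b"42"]}})

safe = row.cells
safe["stats"][b"views"].append(b"scratch")  # only changes the copy
count_after_safe_edit = len(row.cells_no_copy()["stats"][b"views"])

shared = row.cells_no_copy()
shared["stats"][b"views"].append(b"43")     # changes the row's own data
count_after_shared_edit = len(row.cells_no_copy()["stats"][b"views"])
```

The default stays safe, and the faster path is something a user has to ask for by name.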

sduskis · 2018-05-25T14:39:12Z

@zakons: can this be closed?

dhermes added the api: bigtable, type: feature request, and performance labels Nov 28, 2017

chemelnucfin changed the title ~~On BigTable read_row(), provide default to retrieve only most recent cell values~~ BigTable: On read_row(), provide default to retrieve only most recent cell values Nov 29, 2017

chemelnucfin assigned tseaver, dhermes and chemelnucfin Jan 15, 2018

tseaver unassigned tseaver, dhermes and chemelnucfin Apr 10, 2018

zakons closed this as completed May 29, 2018

sduskis mentioned this issue Aug 1, 2018

Bigtable: reads are much slower in latest release than previous release. #5725

Closed

sduskis mentioned this issue Nov 22, 2018

BigTable Row Data cells property not copy #6643

Closed

JustinBeckwith assigned zakons Feb 1, 2021
