Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigTable: On read_row(), provide default to retrieve only most recent cell values #4468

Closed
zakons opened this issue Nov 28, 2017 · 7 comments
Assignees
Labels
api: bigtable Issues related to the Bigtable API. performance type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@zakons
Copy link
Contributor

zakons commented Nov 28, 2017

The current signature on the Bigtable Table read_row() method has a default of filter_=None:

def read_row(self, row_key, filter_=None):
    ...

In cases where a cell value may have been updated multiple times, the default will be to return the full time series with timestamps for each value which can slow down read performance in a non-obvious way.

In the current Python API the cells() method on row_data (PartialRowData) makes a deep copy of the cells, which compounds the performance issue.

@property
def cells(self):
    """Property returning all the cells accumulated on this partial row.

    :rtype: dict
    :returns: Dictionary of the :class:`Cell` objects accumulated. This
              dictionary has two-levels of keys (first for column families
              and second for column names/qualifiers within a family). For
              a given column, a list of :class:`Cell` objects is stored.
    """
    return copy.deepcopy(self._cells)

Consider:

  1. Making a default filter on read_row() to retrieve only the most recent value of any cell unless the full or partial time series is requested.
  2. Allowing a ColumnFamily to implicitly or explicitly limit cells to only one value (no timeseries).
  3. Adding a cell_value(column_family_id, column, index=0) method to row_data (PartialRowData) to allow more efficient retrieval of a single cell value.
@dhermes dhermes added api: bigtable Issues related to the Bigtable API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. performance labels Nov 28, 2017
@dhermes
Copy link
Contributor

dhermes commented Nov 28, 2017

Thanks for filing @zakons!

  1. I don't think it's a good idea to have any default filter. However, I think we could provide some kind of easy to access filter, i.e. as a constant in the package google.cloud.bigtable.MOST_RECENT_FILTER or maybe as a class constant in Table so you could write code like

    row_data = table.read_row(row_key, filter_=table.MOST_RECENT_FILTER)
  2. I'm not quite clear on your ColumnFamily suggestion. The feature set of the actual backend API is all we can really expose. Is there a particular method where you'd like to see different behavior? (Most API calls that actually refer to a column family just use a string ID for the column family, not an actual ColumnFamily instance.)

  3. Adding cell_value() seems very easy to implement and not at all controversial to add to the API surface.

/cc @garye

@sduskis
Copy link
Contributor

sduskis commented Nov 28, 2017

  1. I like the table.MOST_RECENT_FILTER approach. We can also add in table.KEY_ONLY_FILTER, which is useful for just doing things like row counts.

  2. I'm not certain about ColumnFamily either. Is it the same as 1.?

  3. +1

@dhermes, @garye is less work on the client-side these days. I am working with other developers across all languages for Cloud Bigtable.

@zakons
Copy link
Contributor Author

zakons commented Nov 28, 2017

Thanks.

  1. I believe the name table.MOST_RECENT_VALUES_FILTER or something similar connoting what it is that is recent would be more clear.
  2. I agree with your point that we should not be suppressing any existing backend functionality in our API.

Appreciate the quick feedback.

@chemelnucfin chemelnucfin changed the title On BigTable read_row(), provide default to retrieve only most recent cell values BigTable: On read_row(), provide default to retrieve only most recent cell values Nov 29, 2017
@zakons
Copy link
Contributor Author

zakons commented Dec 14, 2017

Note that the cell_value(column_family_id, column, index=0) method returns an immutable value, so the underlying Cell's, which are mutable, are not exposed to the user.

Also, add a cell_values(column_family_id, column) iterator or generator method to row_data (PartialRowData) which will provide an iteration over all the cell values, returning index, timestamp, value for each iteration - each of which are immutable. The first index in the iteration would be 0, or most recent value.

@sduskis
Copy link
Contributor

sduskis commented Dec 14, 2017

@dhermes, can we please discuss design issues for PR #4564 here?

@dhermes
Copy link
Contributor

dhermes commented Dec 14, 2017

Sure. Go ahead and discuss? I am 100% for dropping the deepcopy, but I'd prefer to do it as an opt-in, so users can knowingly get mutate-able values.

@sduskis
Copy link
Contributor

sduskis commented May 25, 2018

@zakons: can this be closed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigtable Issues related to the Bigtable API. performance type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
Development

No branches or pull requests

5 participants