Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigQuery: Add tqdm progress bar for downloads #7552

Merged
merged 12 commits into from
Mar 28, 2019
13 changes: 13 additions & 0 deletions bigquery/google/cloud/bigquery/table.py
Original file line number Diff line number Diff line change
Expand Up @@ -1334,8 +1334,21 @@ def _to_dataframe_tabledata_list(self, dtypes):
"""Use (slower, but free) tabledata.list to construct a DataFrame."""
column_names = [field.name for field in self.schema]
frames = []

# report progress if tqdm installed
try:
from tqdm import tqdm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aim for 100% unit test coverage, so could you follow the pattern that we do with the optional pandas dependency where we import it at the top of the module and catch import errors there?

That way we can install tqdm for our unit tests but then mock it out in another test to check that it doesn't fail when the tqdm module can't be loaded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

pbar = tqdm(desc="Downloading", total=self.total_rows, unit="rows")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this do something sensible when total_rows is not populated? As discussed in #7217 total_rows is None until iteration starts.

Copy link
Contributor Author

@JohnPaton JohnPaton Mar 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I'll update pbar.total in the loop if it's unset. tqdm handles this just fine.

except (ImportError, KeyError, TypeError):
pbar = None

for page in iter(self.pages):
frames.append(self._to_dataframe_dtypes(page, column_names, dtypes))

if pbar is not None:
# update progress bar with number of rows in last frame
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is unnecessary, since it just states what is happening in the next line. Comments in this codebase should state why or explain something that looks "wrong".

This comment indicates to me that we need to rename pbar to progress_bar. Also, we could add some intermediate variables above to make this line more self-explanatory.

frames.append(self._to_dataframe_dtypes(page, column_names, dtypes))

     |
turns into
     |
     V

current_frame = self._to_dataframe_dtypes(page, column_names, dtypes)
frames.append(current_frame)
...
    progress_bar.update(len(current_frame))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome feedback, thank you :)

pbar.update(len(frames[-1]))

return pandas.concat(frames)

def _to_dataframe_bqstorage(self, bqstorage_client, dtypes):
Expand Down