Update workflow, pyproject.toml, and blog post
rich-iannone committed Feb 10, 2025
1 parent cd24b0f commit 8dba124
Showing 4 changed files with 74 additions and 60 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/ci-docs.yaml
@@ -21,7 +21,8 @@ jobs:
python-version: "3.10"
- name: Install dependencies
run: |
python -m pip install ".[all,blog]"
python -m pip install ".[all]"
python -m pip install pointblank
- uses: quarto-dev/quarto-actions/setup@v2
with:
tinytex: true
16 changes: 16 additions & 0 deletions docs/_freeze/blog/pointblank-intro/index/execute-results/html.json

Large diffs are not rendered by default.

111 changes: 56 additions & 55 deletions docs/blog/pointblank-intro/index.qmd
@@ -11,21 +11,27 @@ The Great Tables package allows you to make tables, and they're really great whe

To go a bit further, we are working on a new Python package that generates tables (c/o Great Tables) as reporting objects. This package is called [Pointblank](https://github.com/posit-dev/pointblank); its focus is data validation, and the reporting tables it produces inform users of the results of a data validation workflow. In this post we'll highlight the tables it can make and, in doing so, convince you that such outputs can be useful and worth the effort on the part of the maintainer.

### The table report for a data validation
This article will describe:

- how Pointblank enables you to validate many types of DataFrames and SQL databases
- its easy-to-understand validation result tables and thorough drilldowns
- its nice previews of data tables across a range of backends

### Validating data with Pointblank

Just as with Great Tables, Pointblank's primary input is a table, and its goal is to perform checks on that tabular data. Other libraries in this domain include [Great Expectations](https://github.com/great-expectations/great_expectations), [pandera](https://github.com/unionai-oss/pandera), [Soda](https://github.com/sodadata/soda-core?tab=readme-ov-file), and [PyDeequ](https://github.com/awslabs/python-deequ). Let's look at the main reporting table that users are likely to see quite often.

```{python}
# | code-fold: true
# | code-summary: "Show the code"
#| code-fold: true
#| code-summary: "Show the code"
import pointblank as pb
validation = (
pb.Validate(
data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
label="An example validation",
thresholds=(0.1, 0.2, 0.5)
thresholds=(0.1, 0.2, 0.5),
)
.col_vals_gt(columns="d", value=1000)
.col_vals_le(columns="c", value=5)
@@ -46,11 +52,53 @@ The table is chock full of the information you need when doing data validation t

It's a nice table, and it scales well to the large variety of validation types and options available in the Pointblank library. Viewing this table is a central part of using that library, and because the reporting is itself a table, it can be shared by placing it in a publication environment of your choosing (for example, it could be put in a Quarto document).

We didn't stop there however... we went ahead and made it possible to view other artifacts as tables.
Here is the code that was used to generate the data validation above.

```{python}
#| eval: false
#| code-fold: true
#| code-summary: "Show the code"
import pointblank as pb
validation = (
pb.Validate(
data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
label="An example validation",
thresholds=(0.1, 0.2, 0.5),
)
.col_vals_gt(columns="d", value=1000)
.col_vals_le(columns="c", value=5)
.col_exists(columns=["date", "date_time"])
.interrogate()
)
validation
```

Pointblank makes it easy to get started: there is a simple entry point, and you can chain as many validation steps as needed.

Pointblank enables you to validate many types of DataFrames and SQL databases. Pandas and Polars are supported through Narwhals, and numerous backends (like DuckDB and MySQL) are also supported through our Ibis integration.
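
A minimal sketch of that flexibility (not evaluated here): the same validation pattern applied to a Pandas DataFrame passed directly to `Validate()`, which the Narwhals support described above implies. The small DataFrame is a made-up placeholder.

```{python}
#| eval: false
import pandas as pd
import pointblank as pb

# A made-up Pandas DataFrame standing in for your own data
df_pd = pd.DataFrame({"a": [1, 2, 3], "d": [1500, 2500, 3500]})

validation_pd = (
    pb.Validate(data=df_pd, label="Same validation, Pandas backend")
    .col_vals_gt(columns="d", value=1000)
    .interrogate()
)
validation_pd
```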

### Exploring data validation failures

Note that the above validation showed 6 failures in the first step. You might want to know exactly what failed, giving you a chance to fix data quality issues. To do that, you can use the `get_step_report()` method:

```{python}
validation.get_step_report(i=1)
```

The use of a table for reporting is ideal here! The main features of this step report table include:

### Preview of a dataset
1. a header with summarized information
2. the selected rows that contain the failures
3. a highlighted column of interest

Because Pointblank allows for the collection of data extracts (subsets of the target table where data quality issues were encountered), we found it useful to have a function (`preview()`) that provides a consistent view of this tabular data. It also just works with any type of table that Pointblank supports (which is a lot). Here is how that looks with a 2,000 row dataset included in the package (`game_revenue`):
Different types of validation methods will have step report tables that organize the pertinent information in a way that makes sense for the validation performed.
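
As a quick sketch (assuming a step report can be requested for any step of the validation above, not only the failing one), the report for one of the column-existence steps is retrieved the same way but lays out its information differently:

```{python}
#| eval: false
# Step 3 of the validation above comes from the `col_exists()` check
validation.get_step_report(i=3)
```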

### Previewing datasets across backends

Because Pointblank supports many backends, each with its own way of displaying the underlying data, Pointblank provides the `preview()` function, which gives a beautiful and consistent view of any data table. Here is how that looks against a 2,000-row DuckDB table that's included in the package (`game_revenue`):

```{python}
# | code-fold: true
@@ -78,53 +126,6 @@ pb.load_dataset(dataset="game_revenue", tbl_type="duckdb")

Which is not nearly as good.
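
To underline the consistency point, here is a small sketch (assuming `preview()` takes the table as its first argument) of the same preview against the Polars flavor of the dataset, which should yield the same style of output:

```{python}
#| eval: false
# The same preview, but with the Polars version of `game_revenue`
pb.preview(pb.load_dataset(dataset="game_revenue", tbl_type="polars"))
```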

### Explaining the result of a particular validation step, with a table!

We were clearly excited about the possibilities of Great Tables tables in Pointblank, because we did even more. Data validations are performed as distinct steps (e.g., a step could check that all values in a specific column are greater than a fixed value), and while you get a report of atomic successes and failures for each step, it's better to see exactly what failed. This is all in the service of helping the user get to the root causes of a data quality issue. So, we have a method called `get_step_report()` that gives you a custom view of failures on a stepwise basis. Of course, it uses a table to get the job done.

Let's look at an example where you might check a table against an expected schema for that table. It turns out that the schema expectation doesn't match the schema of the actual table, and the report for this step shows which elements don't match up:

```{python}
# | code-fold: true
# | code-summary: "Show the code"
# Create a schema for the target table (`small_table` as a DuckDB table)
schema = pb.Schema(
columns=[
("date_time", "timestamp"),
("dates", "date"),
("a", "int64"),
("b",),
("c",),
("d", "float64"),
("e", ["bool", "boolean"]),
("f", "str"),
]
)
# Use the `col_schema_match()` validation method to perform a schema check
validation = (
pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="duckdb"))
.col_schema_match(schema=schema)
.interrogate()
)
validation.get_step_report(i=1)
```

With just a basic report of steps, you'd see a failure and be left wondering what went wrong. The tabular reporting of the step report above reveals the issues in an easy-to-understand manner.

The use of a table is so ideal here! On the left are the column names and data types of the target table. On the right are the elements of the expected schema. We can very quickly see three places where the expectation doesn't match the actual schema:

1. the dtype for the first column `date_time` is incorrect
2. the column name of second column `date` is misspelled (as `"dates"`)
3. the dtype for the last column is incorrect (`"str"` instead of `"string"`)

This reporting can scale nicely to very large tables since the width of the table in this case will always be fixed (schema column comparisons are represented in rows). Other nice touches include a robust header with information on the schema comparison settings, the step number, and an indication of the overall pass/fail status (here, a large red cross mark).

There are many types of validations and so naturally there are different types of step reports, but the common thread is that they all use Great Tables to provide reporting in a sensible fashion.

### In closing

We hope this post provides some insight on how Great Tables can be versatile enough to be used within Python libraries. The added benefit is that outputs that are GT objects can be further modified or styled by the user of the library producing them.
We hope this post is a good introduction to Pointblank and that it provides some insight on how Great Tables makes sense for reporting in a different library. If you'd like to learn more about Pointblank, please visit the [project website](https://posit-dev.github.io/pointblank/) and check out the many [examples](https://posit-dev.github.io/pointblank/demos/).
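
As a final aside, here is a minimal sketch of that styling flexibility (assuming the report returned by `get_step_report()` is a GT object, as described above, so that ordinary Great Tables methods can be chained onto it):

```{python}
#| eval: false
# Restyle a Pointblank step report with a standard Great Tables method
report = validation.get_step_report(i=1)
report.tab_header(title="Step 1: values in `d` greater than 1000")
```
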
4 changes: 0 additions & 4 deletions pyproject.toml
@@ -66,10 +66,6 @@ dev = [
"pandas",
]

blog = [
"pointblank>=0.5.0",
]

dev-no-pandas = [
"ruff==0.8.0",
"jupyter",
