3.3.0

@jqnatividad jqnatividad released this 23 Mar 17:05
· 65 commits to master since this release

[3.3.0] - 2025-03-23

Highlights:

  • stats got another round of improvements:
    • boolean inferencing is now configurable!
      Before, it was limited to a simple, English-centric heuristic:
      • When a column's cardinality is 2 and the first characters of its two values are 0/1, t/f or y/n (case-insensitive), the column's data type is inferred as boolean.
      • With the new --boolean-patterns <arg> option, we can now specify arbitrary true_pattern:false_pattern pairs. Each pattern is a case-insensitive string of one or more characters. If a pattern ends with "*", it is treated as a prefix.
        For example, t*:f* matches "true", "Truthy" and "T" as boolean true, so long as the corresponding false pattern is also matched (e.g. by "Fake", "False" or "f"). Bear in mind that the cardinality still needs to be 2, so a column whose values match several variants of the same pattern is disqualified as boolean if its cardinality is > 2. E.g. if a column's domain is "True", "truthy" and "False", it doesn't qualify, as its cardinality is 3. On the other hand, if it's "True", "true", "False", "false" and "FALSE", it still qualifies, as the values resolve to just "true"/"false" case-insensitively.
        For backwards compatibility, the default true/false pairs are 1:0,t*:f*,y*:n*.
    • percentiles can now be computed!
      When the --percentiles flag is enabled, stats returns the 5th, 10th, 40th, 60th, 90th and 95th percentiles by default for all numeric and date/datetime columns, computed with the nearest-rank method. A different set of percentiles can be requested with the --percentile-list <arg> option.
      Note that the method for computing quartiles (Method 3) is basically a specialized implementation of the nearest-rank method for q1 (25th percentile), q2 (50th percentile or median) and q3 (75th percentile), hence the choice of non-overlapping defaults for --percentile-list.
  • frequency: now uses qsv-stats 0.32.0, which uses the more memory-efficient, often faster foldhash crate
  • in the same vein, by replacing ahash with foldhash suite-wide, qsv got a lot more memory-efficient and often faster when doing hash lookups
  • sample: "streaming" Bernoulli sampling now works for any remotely hosted CSV whose server supports chunked downloads, without requiring range request support.
  • we're now using the latest Polars engine - v0.46.0 at the py-1.26.0 tag.
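
To make the boolean inferencing rules above concrete, here is a minimal Python sketch of the matching logic as described in these notes. It is not qsv's implementation (qsv is written in Rust); the function names parse_patterns, matches and looks_boolean are hypothetical, and counting cardinality case-insensitively is an assumption drawn from the "True"/"true"/"FALSE" example above.

```python
def parse_patterns(spec):
    """Parse 'true:false,true:false' into a list of (true_pat, false_pat) pairs."""
    pairs = []
    for pair in spec.split(","):
        true_pat, false_pat = pair.split(":", 1)
        pairs.append((true_pat.strip().lower(), false_pat.strip().lower()))
    return pairs


def matches(pattern, value):
    """Case-insensitive match; a trailing '*' turns the pattern into a prefix."""
    value = value.strip().lower()
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return value == pattern


def looks_boolean(values, spec="1:0,t*:f*,y*:n*"):
    """Return True if a column's values would qualify as boolean under `spec`."""
    # Assumption: cardinality is counted after case-folding.
    distinct = {v.strip().lower() for v in values}
    if len(distinct) != 2:
        return False
    a, b = sorted(distinct)
    for true_pat, false_pat in parse_patterns(spec):
        # One value must match the true pattern and the other the false pattern.
        if (matches(true_pat, a) and matches(false_pat, b)) or (
            matches(true_pat, b) and matches(false_pat, a)
        ):
            return True
    return False
```

Under these assumptions, a column containing "Oui" and "NON" is not inferred as boolean with the default pairs, but it is once the spec is set to "oui:non".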

Added

  • stats: add configurable boolean inferencing #2595
  • stats: add --percentiles option #2617
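
For readers unfamiliar with the nearest-rank method mentioned in the highlights, the following Python sketch shows the textbook definition: the p-th percentile of N sorted values is the value at rank ceil(p/100 × N). It is an illustration only, not qsv's code; nearest_rank, percentiles and DEFAULT_PERCENTILES are hypothetical names, with the default list taken from the highlights above.

```python
import math

# Default percentiles listed in the release highlights.
DEFAULT_PERCENTILES = (5, 10, 40, 60, 90, 95)


def nearest_rank(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("cannot compute a percentile of an empty column")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    # 1-based rank; multiplying before dividing keeps exact multiples exact.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def percentiles(values, wanted=DEFAULT_PERCENTILES):
    """Map each requested percentile to its nearest-rank value."""
    return {p: nearest_rank(values, p) for p in wanted}
```

Because the method always returns a value that is actually present in the data, it applies equally to any orderable type, which is consistent with percentiles being reported for date/datetime columns as well as numeric ones.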

Changed

  • refactor: replace ahash with faster foldhash #2619
  • replace std assert_eq! macro with similar_asserts::assert_eq! macro for easier debugging #2605
  • deps: bump polars to 0.46.0 at py-1.25.2 tag #2604
  • deps: bump Polars to v0.46.0 at py-1.26.0 tag #2621
  • build(deps): bump actix-web from 4.9.0 to 4.10.2 by @dependabot in #2591
  • build(deps): bump indexmap from 2.7.1 to 2.8.0 by @dependabot in #2592
  • build(deps): bump mimalloc from 0.1.43 to 0.1.44 by @dependabot in #2608
  • build(deps): bump qsv-stats from 0.30.0 to 0.31.0 by @dependabot in #2603
  • build(deps): bump qsv-stats from 0.31.0 to 0.32.0 by @dependabot in #2620
  • build(deps): bump reqwest from 0.12.12 to 0.12.13 by @dependabot in #2593
  • build(deps): bump reqwest from 0.12.13 to 0.12.14 by @dependabot in #2596
  • build(deps): bump reqwest from 0.12.14 to 0.12.15 by @dependabot in #2609
  • build(deps): bump rfd from 0.15.2 to 0.15.3 by @dependabot in #2597
  • build(deps): bump rust_decimal from 1.37.0 to 1.37.1 by @dependabot in #2616
  • build(deps): bump simd-json from 0.14.3 to 0.15.0 by @dependabot in #2615
  • build(deps): bump tempfile from 3.18.0 to 3.19.0 by @dependabot in #2602
  • build(deps): bump tempfile from 3.19.0 to 3.19.1 by @dependabot in #2612
  • build(deps): bump uuid from 1.15.1 to 1.16.0 by @dependabot in #2601
  • build(deps): bump zip from 2.2.3 to 2.4.1 by @dependabot in #2607
  • apply select clippy lint suggestions
  • bumped indirect dependencies to latest version
  • set Rust nightly to 2025-03-07, the same version Polars uses 17f6bdb

Fixed

  • updated lock file, primarily to fix CVE-2025-29787 e44e5df
  • luau: fix flaky register_lookup_table CI test that failed only intermittently on Windows, by using a buffered writer in the lookup write_cache_file helper f494b46
  • sample: refactor "streaming" Bernoulli sampling, so it actually works without requiring range requests support #2600
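
As background on why Bernoulli sampling suits streaming input, the Python sketch below shows the general technique: each record is kept independently with a fixed probability, in a single forward pass, so neither the total row count nor random access to the source is needed. This is a generic illustration under those assumptions, not the code from #2600; bernoulli_sample and the in-memory demo data are hypothetical.

```python
import csv
import io
import random


def bernoulli_sample(rows, probability, seed=None):
    """Yield each row independently with the given probability (single pass)."""
    rng = random.Random(seed)
    for row in rows:
        # rng.random() is in [0.0, 1.0), so 1.0 keeps every row and 0.0 keeps none.
        if rng.random() < probability:
            yield row


# Demo: the reader consumes lines one at a time, as it would from a chunked download.
stream = io.StringIO("id,name\n1,a\n2,b\n3,c\n4,d\n")
reader = csv.reader(stream)
header = next(reader)
sampled = list(bernoulli_sample(reader, probability=0.5, seed=42))
```

The sample size varies from run to run unless a seed is fixed; only its expected value (probability × number of rows) is known in advance.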

Full Changelog: 3.2.0...3.3.0