
colexec: optimize all types #42043

Open · 4 of 13 tasks
yuzefovich opened this issue Oct 30, 2019 · 5 comments

Labels
A-sql-vec SQL vectorized engine · C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) · meta-issue Contains a list of several other issues. · T-sql-queries SQL Queries Team

Comments

@yuzefovich (Member)

yuzefovich commented Oct 30, 2019

With the addition of datumVec, we now support all types via either a tree.Datum-backed representation or an optimized "native" representation. The latter is more performant, and this issue tracks adding a native representation for the remaining types.

  • INTERVAL
  • TIMESTAMPTZ
  • JSONB
  • Enum
  • ARRAY
  • TIME
  • INET
  • COLLATEDSTRING
  • TIMETZ
  • Geometry
  • Geography
  • Tuple
  • Bit

ARRAY seems to be the most frequently used of the currently unimplemented types, followed by INET and TIME.

Jira issue: CRDB-5397
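The performance gap between the two representations mentioned above comes down to memory layout. Not from the CockroachDB codebase — a minimal standalone sketch with hypothetical names (`Datum`, `sumDatumVec`, `sumNativeVec`), contrasting an interface-boxed (datum-style) vector with a flat native vector:

```go
package main

import "fmt"

// Datum mimics the role of tree.Datum: an interface type, so every
// element is individually boxed and accessed via dynamic dispatch.
type Datum interface{ Int() int64 }

type dInt int64

func (d dInt) Int() int64 { return int64(d) }

// sumDatumVec sums a datum-backed vector: each access goes through an
// interface, costing a pointer chase and a method call per element.
func sumDatumVec(v []Datum) int64 {
	var s int64
	for _, d := range v {
		s += d.Int()
	}
	return s
}

// sumNativeVec sums a native vector: a flat []int64 the CPU can
// stream through cache line by cache line, with no indirection.
func sumNativeVec(v []int64) int64 {
	var s int64
	for _, x := range v {
		s += x
	}
	return s
}

func main() {
	datums := []Datum{dInt(1), dInt(2), dInt(3)}
	native := []int64{1, 2, 3}
	fmt.Println(sumDatumVec(datums), sumNativeVec(native)) // both 6
}
```

Both loops compute the same sum; the native version avoids per-element boxing, which is the gist of why a native representation is faster in tight columnar loops.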

@yuzefovich yuzefovich added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Oct 30, 2019
@asubiotto asubiotto added A-sql-vec SQL vectorized engine meta-issue Contains a list of several other issues. labels Oct 30, 2019
@glennfawcett

I have a customer with a table that has ARRAY and JSONB columns. These columns are NOT part of the aggregate query and are not retrieved, but the query still doesn't benefit from vectorized execution. It would be nice to allow vectorization in this case.

@awoods187 (Contributor)

@glennfawcett could you provide an example query? I wonder if we decode the query into an unsupported type. This seems similar to what @jseldess was reporting. You can also try casting the column to a supported type.

@bladefist

We're unable to use vectorization across the board due to the missing TIMESTAMPTZ support. Can this be prioritized in any way? It should unlock a lot of performance gains for us. I don't think we're missing any of the other types.

craig bot pushed a commit that referenced this issue Dec 31, 2019
43514: colexec: support TIMESTAMPTZ type r=yuzefovich a=yuzefovich

**colexec: support TIMESTAMPTZ type**

This commit adds support for the TIMESTAMPTZ data type, which is
represented the same way as TIMESTAMP (as `time.Time`). We already
had everything in place, so only the type conversion was needed.

Addresses: #42043.

Release note (sql change): vectorized engine now supports TIMESTAMPTZ
data type.

**sqlsmith: add several types to vecSeedTable**

This commit adds the previously supported INT2 and INT4 types to
vecSeedTable, as well as the newly supported TIMESTAMPTZ.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
@awoods187 (Contributor)

We have merged TIMESTAMPTZ support into 20.1, but it is a bit risky to backport it to 19.2. Does every one of your tables use it? For all queries?

@bladefist

@awoods187 Yes, essentially our primary user-profile table uses it, so all queries join to it. We can wait a little if 20.1 is coming soon. Thank you!

craig bot pushed a commit that referenced this issue Jan 30, 2020
43517: colexec, coldata: add support for INTERVAL type r=yuzefovich a=yuzefovich

**pgerror: clean up build deps**

The pgerror (and pgcode) packages are (perhaps inadvisably) used in
low-level utility packages. They had some pretty heavyweight build deps,
but this wasn't fundamentally necessary. Clean it up a bit and make
these packages more lightweight.

Release note: None

**colexec, coldata: add support for INTERVAL type**

This commit adds support for the INTERVAL type, which is represented by
`duration.Duration`. Only comparison projections are currently supported,
and serialization is still missing.

Addresses: #42043.

Release note: None

Co-authored-by: Daniel Harrison <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
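CockroachDB's `duration.Duration` keeps months, days, and nanoseconds as separate fields (they don't convert into each other exactly). Not the real implementation — a simplified sketch with a hypothetical `Interval` type of the kind of comparison a comparison projection needs, assuming the usual 30-day-month / 24-hour-day normalization for ordering:

```go
package main

import "fmt"

// Interval is a simplified stand-in for duration.Duration: months,
// days, and nanoseconds are stored separately because calendar units
// have no exact common denominator.
type Interval struct {
	Months, Days, Nanos int64
}

// Compare normalizes both sides to an approximate nanosecond count
// (30-day months, 24-hour days) and orders by that, sketching what a
// vectorized comparison projection does element by element.
func (a Interval) Compare(b Interval) int {
	const dayNanos = 24 * 3600 * 1_000_000_000
	an := (a.Months*30+a.Days)*dayNanos + a.Nanos
	bn := (b.Months*30+b.Days)*dayNanos + b.Nanos
	switch {
	case an < bn:
		return -1
	case an > bn:
		return 1
	}
	return 0
}

func main() {
	// One 30-day month sorts before 31 days under this normalization.
	fmt.Println(Interval{Months: 1}.Compare(Interval{Days: 31})) // -1
}
```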
@yuzefovich yuzefovich changed the title colexec: add unsupported types colexec: optimize all types Jun 15, 2020
craig bot pushed a commit that referenced this issue Apr 22, 2021
63770: colexec: add builtin json datatype r=jordanlewis a=jordanlewis

This commit adds a built-in JSON datatype to the colexec package. It is
implemented using the Bytes data structure and lazily deserializes JSON
objects for processing.

There's an inefficiency here, which is that forming a JSON object costs
an allocation. A future commit can make a cheaper "lazy JSON" object
that doesn't cache or require up-front allocations.

Addresses: #42043.
Fixes: #49470.
Fixes: #49472.

Release note (performance improvement): improves the speed of JSON
processing in the vectorized execution engine.

Co-authored-by: Jordan Lewis <[email protected]>
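The "lazy deserialization from Bytes" idea in the commit above can be sketched in a few lines. Not the colexec implementation — a hypothetical `jsonVec` type using the standard library's `encoding/json`, showing why forming the object costs an allocation per access:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonVec stores JSON values as raw serialized bytes (playing the
// role of the colexec Bytes vector) and deserializes a value only
// when an operation actually needs the object.
type jsonVec struct {
	data [][]byte // one serialized JSON value per row
}

// Get lazily parses row i. The Unmarshal allocates a fresh object
// each call, which is the inefficiency the commit message notes; a
// cheaper "lazy JSON" would avoid that up-front allocation.
func (v *jsonVec) Get(i int) (any, error) {
	var out any
	err := json.Unmarshal(v.data[i], &out)
	return out, err
}

func main() {
	v := &jsonVec{data: [][]byte{
		[]byte(`{"a": 1}`),
		[]byte(`[1, 2]`),
	}}
	obj, err := v.Get(0)
	fmt.Println(obj, err)
}
```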
@jlinder jlinder added the T-sql-queries SQL Queries Team label Jun 16, 2021
craig bot pushed a commit that referenced this issue Dec 14, 2022
93400: coldata: add native support of enums r=yuzefovich a=yuzefovich

This commit adds native support for enum types to the vectorized
engine. We store them via their physical representation, so we can
easily reuse the `Bytes` vector for almost all operations and thus
just mark the enum family as having the bytes family as its canonical
representation. There are only a handful of places where we need to go
from the physical representation to either the logical one or to a
`DEnum`:
- when constructing the pgwire message to the client (in both the text
and binary formats the logical representation is used)
- when converting from columnar to row-by-row format (a fully-fledged
`DEnum` is constructed)
- casts.

In all of these places we already have access to the precise typing
information (similar to what we have for UUIDs which are supported via
the bytes canonical type family already).

I can really see only one downside to such an implementation: in some
places the resolution based on the canonical (rather than actual) type
family might be too coarse. For example, we have the `<bytes> || <bytes>`
binary operator (`concat`). As it currently stands, the execution will
proceed to perform the concatenation between two UUIDs or between a
BYTES value and a UUID, and now we'll be adding enums into the mix.
However, type checking is performed earlier on the query execution
path, so I think this is acceptable since the execution should never
reach such a setup.

An additional benefit of this work is that we'll be able to support
KV projection pushdown in the presence of enums: on the KV server side
we'll just operate on the physical representations and won't need
access to the hydrated type, whereas on the client side we'll have
the hydrated type, so we'll be able to perform all operations.

Addresses: #42043.
Informs: #92954.

Epic: CRDB-14837

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
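The physical-vs-logical split described above can be illustrated compactly. Not the CockroachDB implementation — a sketch with hypothetical names (`enumVec`, `Label`) and made-up physical byte values, showing the engine operating on opaque bytes while the hydrated type metadata resolves labels only at the edges:

```go
package main

import "fmt"

// enumVec stores enum values by their physical (bytes) representation,
// which is all the vectorized engine needs for comparisons, grouping,
// etc. The hydrated type metadata maps physical bytes to logical
// labels only at the edges: pgwire output, row conversion, and casts.
type enumVec struct {
	physical [][]byte          // what vectorized operators work on
	logical  map[string]string // hydrated type info: physical -> label
}

// Label resolves row i to its logical representation, as would happen
// when building a pgwire message or constructing a DEnum-style datum.
func (v *enumVec) Label(i int) string {
	return v.logical[string(v.physical[i])]
}

func main() {
	v := &enumVec{
		// Byte values here are illustrative, not CRDB's actual encoding.
		physical: [][]byte{{0x40}, {0x80}},
		logical:  map[string]string{"\x40": "open", "\x80": "closed"},
	}
	fmt.Println(v.Label(0), v.Label(1)) // open closed
}
```

Because the physical bytes order consistently with the enum's declared order in a scheme like this, the engine can sort and compare without ever consulting the logical labels.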
@yuzefovich yuzefovich moved this to Backlog in SQL Queries May 2, 2024