New `gcp_big_query` sink #1536
I have this done and minimally tested against GCP, but I have a few questions. We now have transforms that can alter the schema of the log records. Should this sink be writing the logs into the table as-is, or still follow the rewriting described in the initial issue (using a JSON encoded column)? How important is being able to configure the column names within the BigQuery sink itself?
Good questions! I have a lot of thoughts around this, but I didn't want to distract from a simple first version. Here are a couple of ways this can work today:
Long term, I would like to improve the UX by:
Just noting that a user asked for this in Gitter today: https://gitter.im/timberio-vector/community?at=5f3d3d31582470633b670c43
Any update on this? I prototyped a simple implementation here: https://github.com/seeyarh/tobq/blob/main/src/lib.rs
I'd like to work on this sink, if possible.
@seeyarh nice! That seems like a good start. This issue hasn't been scheduled yet, but we are happy to review a PR to Vector if you wanted to try implementing it.
@jszwedko @seeyarh @binarylogic, any news on this? We are actively migrating our data warehouse to Google BigQuery and would like to have this feature.
Hi @gartemiev! Nothing yet, unfortunately.
https://cloud.google.com/blog/products/data-analytics/bigquery-now-natively-supports-semi-structured-data
@jszwedko @seeyarh @binarylogic, is there any news on when this will be available, approximately?
It's currently not on our roadmap, but we'd be open to discussing a community contribution if anyone was interested in working on that.
@jszwedko @spencergilbert I would like to pick this up, as we have also stumbled into a need to write directly to BigQuery. I get the sense that this would be similar to other …
Hi @goakley, thanks for letting us know you're interested in working on this. I think right now it's OK to chat about the approach first, before implementing. We also have a formal process for evaluating proposed new components that we may draw some questions from to ask you. There has been good demand for this component already, which is good.

Last December I took a look at this, and the main concern is that Google seems to be steering new development towards the Storage Write API (gRPC based). It would definitely be a heavier lift, that is for sure. I think the trade-off worth evaluating is whether we get some nice performance improvements with the gRPC-based API; also, if support is dropped for the legacy API, that could imply a rewrite to the gRPC-based API down the line.
Thank you for the feedback @neuronull. Using the gRPC interface requires a more complicated setup, but I believe it can be done as follows.

In addition to the "usual" sink settings (batch size, credentials, etc.), users will be required to specify the project, dataset, and table name for the table in which they want to store their data, plus a representation of the table schema, which is necessary for calling the Storage Write API.

By default, Vector will write to the table's default stream. In the future, we could extend the sink with a new config setting that allows writing to an application-created stream. I would prefer to keep the initial implementation simple and not add application-created streams at the start. Google itself recommends using the default stream when possible; however, the default stream does not support exactly-once write semantics, which some use cases might require. In any case, Vector will not manage streams: streams should be created independently and then provided to Vector via the config file.

Otherwise, the flow here is predictable. Data reaches the sink, gets transformed into the correct structure as specified by the config's protobuf schema, and is streamed to BigQuery with the usual backoff/retry mechanics that Vector uses in other sinks.
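As a rough illustration of that configuration surface, a config struct along these lines could capture it. This is only a sketch; the struct and field names are assumptions made for the example, not the sink's actual options:

```rust
use serde::Deserialize;
use std::path::PathBuf;

/// Hypothetical configuration for a `gcp_big_query` sink, sketching the
/// settings discussed above (names are illustrative, not final).
#[derive(Debug, Deserialize)]
struct BigQuerySinkConfig {
    /// Destination table, identified as project + dataset + table.
    project: String,
    dataset: String,
    table: String,
    /// Table schema, supplied as a compiled protobuf descriptor file plus
    /// the fully qualified message name to use from it.
    descriptor_file: PathBuf,
    message_type: String,
    /// Optional application-created write stream; when unset, the sink
    /// would use the table's default stream.
    write_stream: Option<String>,
    /// Credentials file, as in the other GCP sinks.
    credentials_path: Option<PathBuf>,
}
```

Supplying the schema as a descriptor file plus a message name lines up with the file-path approach discussed in the next comments.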
Hey @goakley, thanks for the follow-up. Ah yes, needing to supply the schema; I remember that now. What you outlined makes sense to me, and I agree with keeping it simple and starting with the default stream (though perhaps wiring it up with the possibility in mind that the default stream may not always be the one used).
I would opt for the file path approach to keep the Vector config cleaner. We recently had a contribution of a protobuf codec that does this for …
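For illustration, reading a schema supplied via a file path could look roughly like this, assuming the descriptor set was produced with `protoc --descriptor_set_out`. This is a sketch, not the protobuf codec's actual code:

```rust
use prost::Message;
use prost_types::FileDescriptorSet;
use std::{fs, path::Path};

/// Load a compiled protobuf descriptor set from the path given in the config.
/// The file is itself a protobuf message, so prost can decode it directly.
fn load_descriptor_set(path: &Path) -> Result<FileDescriptorSet, Box<dyn std::error::Error>> {
    let bytes = fs::read(path)?;
    let set = FileDescriptorSet::decode(bytes.as_slice())?;
    Ok(set)
}
```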
Thanks for the nudge in the right direction, @neuronull. I've split out the protobuf-specific serialization logic into its own PR: #18598. Once that's merged, I will follow up with a proposed BigQuery sink implementation.
Hey @neuronull, I've been reading through the Vector code and trying some things out. Of particular interest is the fact that BigQuery's Storage Write API is a gRPC streaming service rather than a simple request/response API. A few questions: (1) By calling … (2) … (3) The …
👋 Hey @goakley, those are good questions to be asking. Sorry I didn't get to respond earlier. I will get back to you on this on Monday.
Indeed, most of the sinks we have are built on that request-response model. An example of one that differs from that is the …
We do have some infrastructure for retrying requests; see vector/src/sinks/util/retries.rs, lines 25 to 35 (at 0d09898).
See also vector/src/sinks/aws_s3/config.rs, lines 197 to 199 (at 0d09898), for an example in an existing sink.
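To make the retry hook concrete, here is a sketch of how a BigQuery sink might classify retriable gRPC errors. The trait below is a simplified stand-in written for this example (see retries.rs above for Vector's real definition), and `BigQueryRetryLogic` is a hypothetical name:

```rust
/// Simplified stand-in for Vector's retry-logic trait, reduced to the one
/// decision that matters here: is a failed request worth retrying?
trait RetryLogicSketch: Clone {
    type Error;
    fn is_retriable_error(&self, error: &Self::Error) -> bool;
}

#[derive(Clone)]
struct BigQueryRetryLogic;

impl RetryLogicSketch for BigQueryRetryLogic {
    type Error = tonic::Status;

    fn is_retriable_error(&self, error: &Self::Error) -> bool {
        // Retry transient gRPC statuses; treat everything else as permanent.
        matches!(
            error.code(),
            tonic::Code::Unavailable
                | tonic::Code::ResourceExhausted
                | tonic::Code::DeadlineExceeded
                | tonic::Code::Aborted
        )
    }
}
```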
Yes. ICYMI, there is a detailed tutorial for new-style sinks here: https://github.com/vectordotdev/vector/tree/master/docs/tutorials/sinks. It centers on the request-response model, but I think it does a good job of explaining the infrastructure that is used.
😅 There are definitely some obscure compilation errors that can crop up, and they can be tricky to track down. If you get stuck, you're welcome to share your branch and I can take a look at the error(s); I have done that in the past for others.
Thank you @neuronull, that is helpful! I've tried to keep things relatively simple in this initial PR, which does function as expected: #18886. Adding … (Oh, and the higher-ranked lifetime error was because …)
Awesome! I'll queue that PR up for review~
👋 I haven't given that a deep review yet, but I looked deep enough to see the manifestation of this …, and that is the main thing I think needs deciding on before diving deeper into the review.

To fully utilize the gRPC Stream service model that the Storage Write API has would mean going against the unary request/response model that the Vector stream sink driver framework relies on (which we've touched on a bit already). I think it's still possible to do, though. I do see your point about it being a trade-off with complexity. My bad for pointing to the Vector sink, which doesn't utilize the Stream service model; a better reference point would probably be the …

I am curious what the performance of the current design is. It's one thing to have a more robust design leveraging the Stream service model, but it would add to the argument for it if the performance of the current model was noticeably poor.
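To make the trade-off concrete, here is a rough sketch of the two call shapes being compared. The types are hypothetical stand-ins for the generated Storage Write API bindings, not Vector or Google client code:

```rust
use futures::stream::BoxStream;

// Hypothetical stand-ins for the request/response messages of the Storage
// Write API's AppendRows call.
struct AppendRowsRequest;  // serialized rows plus the target write stream name
struct AppendRowsResponse; // per-batch result (offsets, row errors, ...)

// Unary shape: one request and one response per batch. This matches the
// request/response model Vector's stream sink driver framework is built on.
trait UnaryAppend {
    async fn append(
        &mut self,
        request: AppendRowsRequest,
    ) -> Result<AppendRowsResponse, tonic::Status>;
}

// Stream-service shape: one long-lived bidirectional stream; the sink keeps
// pushing requests into it and reads one response per appended batch.
trait StreamingAppend {
    async fn append_rows(
        &mut self,
        requests: BoxStream<'static, AppendRowsRequest>,
    ) -> Result<BoxStream<'static, Result<AppendRowsResponse, tonic::Status>>, tonic::Status>;
}
```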
@neuronull I'm not sure what performance we're looking for, but my team is currently using this branch in production (don't tell on us) to push an average of 3.3k events per second to BigQuery from a single …
I'm glad to hear it is working well for your use case; thank you for sharing that (🙈). I raised this with the team and we have a proposal for you, @goakley.

We are working on a formalized process for community-driven components in Vector, but in the interim, would you / your company be willing to informally volunteer ownership of the component in its current state? That essentially implies that, aside from routine maintenance, the Vector team would be "hands-off" on this component, relying on you and your team for bug fixes reported by the community, etc. For an example of that, you can see the history of the …

In the future, we may want to adopt this into a "core" Vector component that we would re-assume ownership of, at which time we could further investigate the stream service approach that we've been discussing in this thread.

If this is agreeable to you/your company, we will dive into your PR for an in-depth code review. How does that sound?
GCP BigQuery is a powerful service for analyzing large amounts of structured data. Used correctly, it can be cost-effective storage for log data. I would like to see Vector support this service as a sink, but it'll require careful planning due to the many different ways BigQuery can be used.
Context
BigQuery is flexible, and we should consider the following features for our BigQuery sink:
This, of course, is not inclusive of all factors we should consider for BigQuery, but it helps to demonstrate the variety of options.
Starting simple
v1 of this sink should solve the simplest implementation:
- The table should contain two columns: `timestamp` and `json_event`.
- The `timestamp` column should map to our own internal `timestamp` column (`log_schema.timestamp_key`).
- The `json_event` column should contain a JSON encoded representation of our event.
- It is worth thinking about a generic column mapping configuration scheme so that users could map other custom fields to BigQuery columns.
- The table should be partitioned by `timestamp` day.
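A minimal sketch of that v1 row mapping, assuming the event reaches the sink as a `serde_json::Value` and that the helper name is purely illustrative:

```rust
use serde_json::{json, Value};

/// Build the two-column row described above: `timestamp` taken from the
/// event's configured timestamp key, and `json_event` holding the whole
/// event encoded as JSON.
fn to_bigquery_row(event: &Value, timestamp_key: &str) -> Value {
    json!({
        "timestamp": event.get(timestamp_key).cloned().unwrap_or(Value::Null),
        "json_event": event.to_string(),
    })
}

// Example: an event `{"timestamp": "2020-01-14T00:00:00Z", "message": "hi"}`
// becomes a row whose `json_event` column holds that event as a JSON string.
```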
Long Term
We might consider the following features for long-term development: