New `merge` transform #1488

binarylogic · 2020-01-07T16:49:14Z

I'm opening this issue to explicitly represent the work of a merge transform (I'm open to better names). The purpose of this transform is to merge log lines that were split upstream. For example, the docker source caps logs at 16kb and therefore splits logs that exceed this size. We need a transform that allows users to merge these lines together to form the single log line they are meant to represent.

Use cases

Ability to merge docker log events that were split due to size
Ability to merge journald log events that were split due to size
Ability to merge file log events based on a start and stop pattern (the current message_start_indicator option)

Spec

This spec is largely based on the comments from @MOZGIII. See #1436 (comment), #1436 (comment), and #1436 (comment)).

This should be built as a separate configurable transform, enabling users merge logs from any source and cover a wide variety of use cases that we are not thinking about.
The docker source should come pre-configured with this transform since we know Docker will split lines.
The journald source should come pre-configured with this transform since we know Journald splits lines.
message_start_indicator should be removed from the file source in favor of using this new transform. We can provide guides that educate the users on this.

Let me know if these requirements are not accurate.

The text was updated successfully, but these errors were encountered:

lukesteensen · 2020-01-07T17:43:05Z

message_start_indicator should be removed from the file source in favor of using this new transform. We can provide guides that educate the users on this.

I'm not sure about this part. Would we roll that functionality into the new transform? There are likely quite a few situations where it would be a better fit than the new strategy of the source detecting partial messages. Or we could keep it as a way to have sources identify which messages are partial?

More generally, while I like the idea of implementing this in a way that's not tied to a specific source, I am slightly skeptical of requiring the user to configure a separate transform with this. Could we not just expose it through options on the sources like message_start_indicator that use a common implementation? You've already hinted at this by saying some sources will be preconfigured with it.

MOZGIII · 2020-01-09T14:08:15Z

Before I start with the actual implementation, I'd like to prepare some test cases. Is there any place where I could put this setup? Not sure it should belong to the main vector repo.
I plan to replicate the docker line splitting behavior - so I'd need a specially crafted Dockerfile with long echos and a vector config.

lukesteensen · 2020-01-09T17:11:30Z

Before I start with the actual implementation, I'd like to prepare some test cases. Is there any place where I could put this setup?

This could fit as a correctness test in our test harness /cc @binarylogic

MOZGIII · 2020-01-09T17:13:30Z

Great. I created a repo in the meantime: https://github.com/MOZGIII/vector-merge-test-setup
I'll see if I can migrate it to the test harness repo.

binarylogic · 2020-01-09T19:21:03Z

Yeah, that would be ideal. There's a little more involved there since the test harness involves terraform and ansible scripts. I'd say keep moving forward with you repo and then we can pair on transitioning it to our test hardness when you're ready.

MOZGIII · 2020-01-09T19:31:07Z

Sounds good! I've used Ansible and Terraform before, but still, I'd appreciate guidance! Before we can move it to test harness, I need to figure out a way to programmatically assert the behavior. With my current setup that'd be problematic, however, I imagine it can be done better if we can spare a full VM for it.

binarylogic · 2020-01-09T19:41:22Z

Yep. For example, here's a very simple nested JSON test:

https://github.com/timberio/vector-test-harness/blob/master/cases/wrapped_json_correctness/ansible/run/vector.yml

You can read about the test here:

https://github.com/timberio/vector-test-harness/tree/master/cases/wrapped_json_correctness

I don't want to distract you with this stuff right now, but I think it's good for you to see how we're using Ansible to run test cases. When you're ready we can pair and get it set up.

MOZGIII · 2020-01-09T22:23:46Z

message_start_indicator should be removed from the file source in favor of using this new transform. We can provide guides that educate the users on this.

I'm not sure about this part. Would we roll that functionality into the new transform? There are likely quite a few situations where it would be a better fit than the new strategy of the source detecting partial messages. Or we could keep it as a way to have sources identify which messages are partial?

I'm also not sure about the deprecation since I don't fully understand yet if we're covering all the use cases of the message_start_indicator with the merge functionality. It might be that we don't. I'll think about it tomorrow.

More generally, while I like the idea of implementing this in a way that's not tied to a specific source, I am slightly skeptical of requiring the user to configure a separate transform with this. Could we not just expose it through options on the sources like message_start_indicator that use a common implementation? You've already hinted at this by saying some sources will be preconfigured with it.

The main beauty of having this a separate transform is it won't be limited to just merging the raw input from the sources. For instance in the JSON-in-JSON case, one would be able to unwrap the "JSON envelope" and merge over a field from the wrapped document. I think that'd be cool.

MOZGIII · 2020-01-10T17:02:56Z

If anyone happen to have a VM dump with journald configured for testing with vector - please share it with me 😄 I need to replicate the message splitting behavior for journald.
UPD: I'll be working on my own setup tomorrow if nothing comes up by then.

MOZGIII · 2020-01-10T17:32:19Z

Also, regarding journald setup, as was discussed here, the partial messages are specific to the docker use-case - when docker outputs the container logs to the journald, it adds it's own partial marker - CONTAINER_PARTIAL_MESSAGE.
This is covered by adding a merge transform manually. I doubt we want this docker-specific logic to be built-in to a journald source (which would be an odd coupling, since journald source is currently generic, and not specific to docker). I'd consider this use-case covered by just the addition of the transform, and not provide any specific details.
We can hint to this issue with docker in the documentation for the journald source and suggest adding a merge transform there.

MOZGIII · 2020-01-10T17:41:29Z

I also realized that we should probably take a look at k8s source as well. It's implemented separately, but the design goal there, as noted in the comments, is for it to be interchangeable with docker source. So, whatever defaults we'll apply to docker we have to make sure the same are applied to kubernetes.

binarylogic · 2020-01-10T17:44:53Z

Also, regarding journald setup, as was discussed here, the partial messages are specific to the docker use-case - when docker outputs the container logs to the journald, it adds it's own partial marker - CONTAINER_PARTIAL_MESSAGE. This is covered by adding a merge transform manually.

Sure, but the CONTAINER_PARTIAL_MESSAGE is an explicit flag that is set, right? I know it's set by Docker, but I'm wondering if we can treat it like a generic flag? Then it's not Docker specific (if that makes sense).

So, whatever defaults we'll apply to docker we have to make sure the same are applied to kubernetes.

Make sense to me.

MOZGIII · 2020-01-10T17:55:14Z

Also, regarding journald setup, as was discussed here, the partial messages are specific to the docker use-case - when docker outputs the container logs to the journald, it adds it's own partial marker - CONTAINER_PARTIAL_MESSAGE. This is covered by adding a merge transform manually.

Sure, but the CONTAINER_PARTIAL_MESSAGE is an explicit flag that is set, right? I know it's set by Docker, but I'm wondering if we can treat it like a generic flag? Then it's not Docker specific (if that makes sense).

Hmm, that's interesting. To my knowledge, there's no spec that rules container runtime to add CONTAINER_PARTIAL_MESSAGE for the partial messages. In practice, rkt doesn't do it, but cri-o added it recently: cri-o/cri-o@17cbc3a. So, we might as well consider it the emerging standard. But still, this would be a container-runtime specific thing. Do you think it's a good idea to make it enabled by default? Though, most likely it won't hurt, as long as we provide means to opt-out...

MOZGIII · 2020-01-11T18:21:53Z

I just realized the docker approach to partial lines (partial message marker by the presence or absence of the \n at the end) is unsound. Consider the following example:

{"log":"{ \"long_value\": \"long value ... \\n", ...}
{"log":" ... \" }", ...}

It might so happen that a long log message is split such that the end of the first chunk is \\n. This would break the invariant docker implies, and in turn, cause a false-negative partial message detection. This is still better than nothing, but it's more of a heuristic than a correct solution. And this one is on docker, we don't seem to have any way of fixing this for general case.

MOZGIII · 2020-01-15T21:51:36Z

I just realized the docker approach to partial lines (partial message marker by the presence or absence of the \n at the end) is unsound. Consider the following example:
{"log":"{ \"long_value\": \"long value ... \\n", ...}
{"log":" ... \" }", ...}
It might so happen that a long log message is split such that the end of the first chunk is \\n. This would break the invariant docker implies, and in turn, cause a false-negative partial message detection. This is still better than nothing, but it's more of a heuristic than a correct solution. And this one is on docker, we don't seem to have any way of fixing this for general case.

Actually, we can special casing for \\n and make it sound this on our end. I'll create an issue for this.

MOZGIII · 2020-01-16T02:21:46Z

So, the summary of the progress on this issue:

the merge transform is implemented at feat(new transform): New merge transform #1504
automatic merging of partial events for docker source is also implemented at feat(new transform): New merge transform #1504
defaults for journald and kubernetes sources are not yet covered; journald worth a separate issue and discussion, for kubernetes - I just did the research - and the same logic we use for docker is applicable
a message_start_indicator for file source is still there, let's create a separate issue for it too; it might be that lua script is enough, otherwise we'll have to add a dedicated partial message marker transform to get rid of message_start_indicator
test harness integration - we can add a dedicated issue for this too to track the progress and have a place to discuss; maybe at the test harness repo even
as mentioned at Batch multiple events as a single record in aws_kinesis_firehose sink #1407 (comment), it'd be great to conduct a research to test the limits of the current design; the end goal is to provide a set of vector configs (that we could add to the documentation as examples) to cover various use cases - with journald, other sources, sinks that can benefit from merging the events together and so on; this will also help with the issues mentioned above to enable partial message merging for journald - it'd be easier to implement with if we've known what to aim for

binarylogic added the type: new feature label Jan 7, 2020

binarylogic added this to the Improve log continuation strategy milestone Jan 7, 2020

binarylogic assigned MOZGIII Jan 7, 2020

MOZGIII mentioned this issue Jan 9, 2020

feat(new transform): New merge transform #1504

Merged

MOZGIII mentioned this issue Jan 15, 2020

Batch multiple events as a single record in aws_kinesis_firehose sink #1407

Closed

lukesteensen mentioned this issue Jan 24, 2020

Transforms as tasks #1598

Closed

binarylogic closed this as completed in #1504 Feb 5, 2020

This was referenced Feb 5, 2020

Merge partial events at kubernetes source #1718

Closed

Merge partial events at journald source #1719

Open

binarylogic added type: feature A value-adding code addition that introduce new functionality. and removed type: new feature labels Jun 16, 2020

nivekuil mentioned this issue Dec 20, 2020

Reconstruct split docker log lines in Promtail grafana/loki#2281

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New `merge` transform #1488

New `merge` transform #1488

binarylogic commented Jan 7, 2020 •

edited by MOZGIII

Loading

lukesteensen commented Jan 7, 2020

MOZGIII commented Jan 9, 2020

lukesteensen commented Jan 9, 2020

MOZGIII commented Jan 9, 2020

binarylogic commented Jan 9, 2020

MOZGIII commented Jan 9, 2020

binarylogic commented Jan 9, 2020

MOZGIII commented Jan 9, 2020

MOZGIII commented Jan 10, 2020 •

edited

Loading

MOZGIII commented Jan 10, 2020 •

edited

Loading

MOZGIII commented Jan 10, 2020

binarylogic commented Jan 10, 2020

MOZGIII commented Jan 10, 2020

MOZGIII commented Jan 11, 2020 •

edited

Loading

MOZGIII commented Jan 15, 2020

MOZGIII commented Jan 16, 2020 •

edited

Loading

New merge transform #1488

New merge transform #1488

Comments

binarylogic commented Jan 7, 2020 • edited by MOZGIII Loading

Use cases

Spec

lukesteensen commented Jan 7, 2020

MOZGIII commented Jan 9, 2020

lukesteensen commented Jan 9, 2020

MOZGIII commented Jan 9, 2020

binarylogic commented Jan 9, 2020

MOZGIII commented Jan 9, 2020

binarylogic commented Jan 9, 2020

MOZGIII commented Jan 9, 2020

MOZGIII commented Jan 10, 2020 • edited Loading

MOZGIII commented Jan 10, 2020 • edited Loading

MOZGIII commented Jan 10, 2020

binarylogic commented Jan 10, 2020

MOZGIII commented Jan 10, 2020

MOZGIII commented Jan 11, 2020 • edited Loading

MOZGIII commented Jan 15, 2020

MOZGIII commented Jan 16, 2020 • edited Loading

New `merge` transform #1488

New `merge` transform #1488

binarylogic commented Jan 7, 2020 •

edited by MOZGIII

Loading

MOZGIII commented Jan 10, 2020 •

edited

Loading

MOZGIII commented Jan 10, 2020 •

edited

Loading

MOZGIII commented Jan 11, 2020 •

edited

Loading

MOZGIII commented Jan 16, 2020 •

edited

Loading