Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New merge transform #1488

Closed
4 of 7 tasks
binarylogic opened this issue Jan 7, 2020 · 16 comments · Fixed by #1504
Closed
4 of 7 tasks

New merge transform #1488

binarylogic opened this issue Jan 7, 2020 · 16 comments · Fixed by #1504
Assignees
Labels
type: feature A value-adding code addition that introduce new functionality.

Comments

@binarylogic
Copy link
Contributor

binarylogic commented Jan 7, 2020

I'm opening this issue to explicitly represent the work of a merge transform (I'm open to better names). The purpose of this transform is to merge log lines that were split upstream. For example, the docker source caps logs at 16kb and therefore splits logs that exceed this size. We need a transform that allows users to merge these lines together to form the single log line they are meant to represent.

Use cases

  • Ability to merge docker log events that were split due to size
  • Ability to merge journald log events that were split due to size
  • Ability to merge file log events based on a start and stop pattern (the current message_start_indicator option)

Spec

This spec is largely based on the comments from @MOZGIII. See #1436 (comment), #1436 (comment), and #1436 (comment)).

  • This should be built as a separate configurable transform, enabling users merge logs from any source and cover a wide variety of use cases that we are not thinking about.
  • The docker source should come pre-configured with this transform since we know Docker will split lines.
  • The journald source should come pre-configured with this transform since we know Journald splits lines.
  • message_start_indicator should be removed from the file source in favor of using this new transform. We can provide guides that educate the users on this.

Let me know if these requirements are not accurate.

@binarylogic binarylogic added this to the Improve log continuation strategy milestone Jan 7, 2020
@lukesteensen
Copy link
Member

message_start_indicator should be removed from the file source in favor of using this new transform. We can provide guides that educate the users on this.

I'm not sure about this part. Would we roll that functionality into the new transform? There are likely quite a few situations where it would be a better fit than the new strategy of the source detecting partial messages. Or we could keep it as a way to have sources identify which messages are partial?

More generally, while I like the idea of implementing this in a way that's not tied to a specific source, I am slightly skeptical of requiring the user to configure a separate transform with this. Could we not just expose it through options on the sources like message_start_indicator that use a common implementation? You've already hinted at this by saying some sources will be preconfigured with it.

@MOZGIII
Copy link
Contributor

MOZGIII commented Jan 9, 2020

Before I start with the actual implementation, I'd like to prepare some test cases. Is there any place where I could put this setup? Not sure it should belong to the main vector repo.
I plan to replicate the docker line splitting behavior - so I'd need a specially crafted Dockerfile with long echos and a vector config.

@lukesteensen
Copy link
Member

Before I start with the actual implementation, I'd like to prepare some test cases. Is there any place where I could put this setup?

This could fit as a correctness test in our test harness /cc @binarylogic

@MOZGIII
Copy link
Contributor

MOZGIII commented Jan 9, 2020

Great. I created a repo in the meantime: https://github.com/MOZGIII/vector-merge-test-setup
I'll see if I can migrate it to the test harness repo.

@binarylogic
Copy link
Contributor Author

Yeah, that would be ideal. There's a little more involved there since the test harness involves terraform and ansible scripts. I'd say keep moving forward with you repo and then we can pair on transitioning it to our test hardness when you're ready.

@MOZGIII
Copy link
Contributor

MOZGIII commented Jan 9, 2020

Sounds good! I've used Ansible and Terraform before, but still, I'd appreciate guidance! Before we can move it to test harness, I need to figure out a way to programmatically assert the behavior. With my current setup that'd be problematic, however, I imagine it can be done better if we can spare a full VM for it.

@binarylogic
Copy link
Contributor Author

Yep. For example, here's a very simple nested JSON test:

https://github.com/timberio/vector-test-harness/blob/master/cases/wrapped_json_correctness/ansible/run/vector.yml

You can read about the test here:

https://github.com/timberio/vector-test-harness/tree/master/cases/wrapped_json_correctness

I don't want to distract you with this stuff right now, but I think it's good for you to see how we're using Ansible to run test cases. When you're ready we can pair and get it set up.

@MOZGIII
Copy link
Contributor

MOZGIII commented Jan 9, 2020

message_start_indicator should be removed from the file source in favor of using this new transform. We can provide guides that educate the users on this.

I'm not sure about this part. Would we roll that functionality into the new transform? There are likely quite a few situations where it would be a better fit than the new strategy of the source detecting partial messages. Or we could keep it as a way to have sources identify which messages are partial?

I'm also not sure about the deprecation since I don't fully understand yet if we're covering all the use cases of the message_start_indicator with the merge functionality. It might be that we don't. I'll think about it tomorrow.

More generally, while I like the idea of implementing this in a way that's not tied to a specific source, I am slightly skeptical of requiring the user to configure a separate transform with this. Could we not just expose it through options on the sources like message_start_indicator that use a common implementation? You've already hinted at this by saying some sources will be preconfigured with it.

The main beauty of having this a separate transform is it won't be limited to just merging the raw input from the sources. For instance in the JSON-in-JSON case, one would be able to unwrap the "JSON envelope" and merge over a field from the wrapped document. I think that'd be cool.

@MOZGIII
Copy link
Contributor

MOZGIII commented Jan 10, 2020

If anyone happen to have a VM dump with journald configured for testing with vector - please share it with me 😄 I need to replicate the message splitting behavior for journald.
UPD: I'll be working on my own setup tomorrow if nothing comes up by then.

@MOZGIII
Copy link
Contributor

MOZGIII commented Jan 10, 2020

Also, regarding journald setup, as was discussed here, the partial messages are specific to the docker use-case - when docker outputs the container logs to the journald, it adds it's own partial marker - CONTAINER_PARTIAL_MESSAGE.
This is covered by adding a merge transform manually. I doubt we want this docker-specific logic to be built-in to a journald source (which would be an odd coupling, since journald source is currently generic, and not specific to docker). I'd consider this use-case covered by just the addition of the transform, and not provide any specific details.
We can hint to this issue with docker in the documentation for the journald source and suggest adding a merge transform there.

@MOZGIII
Copy link
Contributor

MOZGIII commented Jan 10, 2020

I also realized that we should probably take a look at k8s source as well. It's implemented separately, but the design goal there, as noted in the comments, is for it to be interchangeable with docker source. So, whatever defaults we'll apply to docker we have to make sure the same are applied to kubernetes.

@binarylogic
Copy link
Contributor Author

Also, regarding journald setup, as was discussed here, the partial messages are specific to the docker use-case - when docker outputs the container logs to the journald, it adds it's own partial marker - CONTAINER_PARTIAL_MESSAGE. This is covered by adding a merge transform manually.

Sure, but the CONTAINER_PARTIAL_MESSAGE is an explicit flag that is set, right? I know it's set by Docker, but I'm wondering if we can treat it like a generic flag? Then it's not Docker specific (if that makes sense).

So, whatever defaults we'll apply to docker we have to make sure the same are applied to kubernetes.

Make sense to me.

@MOZGIII
Copy link
Contributor

MOZGIII commented Jan 10, 2020

Also, regarding journald setup, as was discussed here, the partial messages are specific to the docker use-case - when docker outputs the container logs to the journald, it adds it's own partial marker - CONTAINER_PARTIAL_MESSAGE. This is covered by adding a merge transform manually.

Sure, but the CONTAINER_PARTIAL_MESSAGE is an explicit flag that is set, right? I know it's set by Docker, but I'm wondering if we can treat it like a generic flag? Then it's not Docker specific (if that makes sense).

Hmm, that's interesting. To my knowledge, there's no spec that rules container runtime to add CONTAINER_PARTIAL_MESSAGE for the partial messages. In practice, rkt doesn't do it, but cri-o added it recently: cri-o/cri-o@17cbc3a. So, we might as well consider it the emerging standard. But still, this would be a container-runtime specific thing. Do you think it's a good idea to make it enabled by default? Though, most likely it won't hurt, as long as we provide means to opt-out...

@MOZGIII
Copy link
Contributor

MOZGIII commented Jan 11, 2020

I just realized the docker approach to partial lines (partial message marker by the presence or absence of the \n at the end) is unsound. Consider the following example:

{"log":"{ \"long_value\": \"long value ... \\n", ...}
{"log":" ... \" }", ...}

It might so happen that a long log message is split such that the end of the first chunk is \\n. This would break the invariant docker implies, and in turn, cause a false-negative partial message detection. This is still better than nothing, but it's more of a heuristic than a correct solution. And this one is on docker, we don't seem to have any way of fixing this for general case.

@MOZGIII
Copy link
Contributor

MOZGIII commented Jan 15, 2020

I just realized the docker approach to partial lines (partial message marker by the presence or absence of the \n at the end) is unsound. Consider the following example:

{"log":"{ \"long_value\": \"long value ... \\n", ...}
{"log":" ... \" }", ...}

It might so happen that a long log message is split such that the end of the first chunk is \\n. This would break the invariant docker implies, and in turn, cause a false-negative partial message detection. This is still better than nothing, but it's more of a heuristic than a correct solution. And this one is on docker, we don't seem to have any way of fixing this for general case.

Actually, we can special casing for \\n and make it sound this on our end. I'll create an issue for this.

@MOZGIII
Copy link
Contributor

MOZGIII commented Jan 16, 2020

So, the summary of the progress on this issue:

  • the merge transform is implemented at feat(new transform): New merge transform #1504
  • automatic merging of partial events for docker source is also implemented at feat(new transform): New merge transform #1504
  • defaults for journald and kubernetes sources are not yet covered; journald worth a separate issue and discussion, for kubernetes - I just did the research - and the same logic we use for docker is applicable
  • a message_start_indicator for file source is still there, let's create a separate issue for it too; it might be that lua script is enough, otherwise we'll have to add a dedicated partial message marker transform to get rid of message_start_indicator
  • test harness integration - we can add a dedicated issue for this too to track the progress and have a place to discuss; maybe at the test harness repo even
  • as mentioned at Batch multiple events as a single record in aws_kinesis_firehose sink #1407 (comment), it'd be great to conduct a research to test the limits of the current design; the end goal is to provide a set of vector configs (that we could add to the documentation as examples) to cover various use cases - with journald, other sources, sinks that can benefit from merging the events together and so on; this will also help with the issues mentioned above to enable partial message merging for journald - it'd be easier to implement with if we've known what to aim for

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: feature A value-adding code addition that introduce new functionality.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants