Adds a new parser for XML data #7460

Closed · wants to merge 37 commits

Conversation

@M0rdecay (Contributor) commented May 5, 2020

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

This is a draft of an XML parser (related issues #6968, #1758). Please review the proposed solution and let us know if improvements are required.

@M0rdecay (Contributor Author)

@danielnelson, hello!
Sorry to disturb you, but is there a chance this PR will end up in 1.16? Should I fix the merge conflicts?

@danielnelson (Contributor)

I think what I would like to see on this pull request is more evidence that it will be useful across different documents. It would help if we could get more community feedback on whether this plugin is sufficient for XML processing.

It is always helpful to resolve any merge conflicts; that way we know the build works and the branch can be merged. If we see conflicts or test failures, we sometimes assume it is a work in progress.

I'm going to be stepping away from being the Telegraf maintainer; @reimda and @ssoroka will be taking over and will help you further.

@M0rdecay (Contributor Author)

@ssoroka, hello!
Merge conflicts resolved. What should I do next?

@ssoroka (Contributor) commented Aug 11, 2020

I think this runs into the same problems as the json parser. It's fine for pulling out a specific value, but it's difficult to iterate over parts of the document to build one or more metrics for any document that's non-trivial.

I'm not really sure what the answer to this is. I don't think XPath really helps here. I'd like to do a bit more research to find a flexible approach for both.

@M0rdecay (Contributor Author) commented Aug 17, 2020

@ssoroka you are right, I thought about it too.

Maybe we should try something like this:

[[inputs.file]]
...
  data_format = "xml"

  [[inputs.file.xml]]
    query = ...

  [[inputs.file.xml]]
    query = ...

This way we could analyze the same document several times.
At first glance, this shouldn't be difficult to implement.
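For example, a hypothetical filled-in version (the queries below are made up, purely to illustrate two passes over the same document):

[[inputs.file]]
  files = [ "status.xml" ]
  data_format = "xml"

  # First pass: top-level counters (hypothetical query)
  [[inputs.file.xml]]
    query = "/root/summary"

  # Second pass: repeated child elements (hypothetical query)
  [[inputs.file.xml]]
    query = "//item"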

@ssoroka (Contributor) commented Aug 19, 2020

That definitely helps, but it's still not clear how to go to a specific path and output one metric per element in the array, for example, or what to do if the fields are spread across the document (which they often are), or if part of the XPath needs to be extracted as a field/tag name or value.

I wonder if there are object/structure mappers that already do this well?

@M0rdecay (Contributor Author) commented Aug 25, 2020

but it's still not clear how to go to a specific path and output one metric per element in the array

This task can be solved by explicitly specifying the name of the node that acts as the key for the elements of the array.
For example:

<WowzaStreamingEngine>
    <ConnectionsCurrent>1</ConnectionsCurrent>
    <ConnectionsTotal>1</ConnectionsTotal>
    <ConnectionsTotalAccepted>1</ConnectionsTotalAccepted>
    <ConnectionsTotalRejected>1</ConnectionsTotalRejected>
    <MessagesInBytesRate>1.0</MessagesInBytesRate>
    <MessagesOutBytesRate>1.0</MessagesOutBytesRate>
    <VHost>
        <Name>host_1</Name>
        <TimeRunning>64679.885</TimeRunning>
        <ConnectionsLimit>0</ConnectionsLimit>
        <ConnectionsCurrent>0</ConnectionsCurrent>
        <ConnectionsTotal>0</ConnectionsTotal>
        <ConnectionsTotalAccepted>0</ConnectionsTotalAccepted>
        <ConnectionsTotalRejected>0</ConnectionsTotalRejected>
        <MessagesInBytesRate>0.0</MessagesInBytesRate>
        <MessagesOutBytesRate>0.0</MessagesOutBytesRate>
    </VHost>
    <VHost>
        <Name>host_2</Name>
        <TimeRunning>67777.885</TimeRunning>
        <ConnectionsLimit>2</ConnectionsLimit>
        <ConnectionsCurrent>2</ConnectionsCurrent>
        <ConnectionsTotal>2</ConnectionsTotal>
        <ConnectionsTotalAccepted>2</ConnectionsTotalAccepted>
        <ConnectionsTotalRejected>2</ConnectionsTotalRejected>
        <MessagesInBytesRate>2.2</MessagesInBytesRate>
        <MessagesOutBytesRate>2.2</MessagesOutBytesRate>
    </VHost>
</WowzaStreamingEngine>
[[inputs.file]]
  files = [ "data.xml" ]
  data_format = "xml"
  xml_query = "//VHost/"
  xml_merge_nodes = true
  tag_keys = [ "Name" ]
  xml_array_key = "VHost" # <- This functionality needs to be added
  xml_array = true # <- Or this. In this case, it is worth going through all the nested keys of the first level and creating a separate metric from each.

After minor improvements, it will be possible to achieve the following result:

file,host=host.local,Name=host_1 TimeRunning=64679.885,ConnectionsTotalAccepted=0i,MessagesInBytesRate=0,MessagesOutBytesRate=0,ConnectionsCurrent=0i,ConnectionsTotal=0i,ConnectionsTotalRejected=0i,ConnectionsLimit=0i 1598357710000000000
file,host=host.local,Name=host_2 ConnectionsTotal=2i,MessagesInBytesRate=2.2,ConnectionsTotalAccepted=2i,ConnectionsTotalRejected=2i,MessagesOutBytesRate=2.2,TimeRunning=67777.885,ConnectionsLimit=2i,ConnectionsCurrent=2i 1598357710000000000

As for the rest, unfortunately it seems to me that this degree of flexibility cannot be achieved with a purely declarative approach.
I would also like to avoid using XSLT to keep the configuration simple.
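Coming back to the "one metric per array element" point: purely for illustration, here is a rough standalone Go sketch of that idea (not the parser code from this PR), using only the standard encoding/xml package. The struct and field names are chosen just for this example.

package main

import (
	"encoding/xml"
	"fmt"
)

// vhost mirrors a single <VHost> element; only a few of the child nodes are shown.
type vhost struct {
	Name               string  `xml:"Name"`
	TimeRunning        float64 `xml:"TimeRunning"`
	ConnectionsCurrent int64   `xml:"ConnectionsCurrent"`
}

// report collects every <VHost> child of the document root.
type report struct {
	VHosts []vhost `xml:"VHost"`
}

func main() {
	data := []byte(`<WowzaStreamingEngine>
  <VHost><Name>host_1</Name><TimeRunning>64679.885</TimeRunning><ConnectionsCurrent>0</ConnectionsCurrent></VHost>
  <VHost><Name>host_2</Name><TimeRunning>67777.885</TimeRunning><ConnectionsCurrent>2</ConnectionsCurrent></VHost>
</WowzaStreamingEngine>`)

	var r report
	if err := xml.Unmarshal(data, &r); err != nil {
		panic(err)
	}
	// One metric per <VHost>: Name becomes a tag, the remaining children become fields.
	for _, v := range r.VHosts {
		fmt.Printf("file,Name=%s TimeRunning=%v,ConnectionsCurrent=%di\n",
			v.Name, v.TimeRunning, v.ConnectionsCurrent)
	}
}

The parser would have to do the same thing generically, without per-document structs, which is what the xml_array_key / xml_array options above are meant to cover.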

@M0rdecay (Contributor Author)

@ssoroka hello!
What do you think of the information above?
If this sounds good to you, I would gladly take on the implementation.

@ssoroka (Contributor) commented Aug 27, 2020

So we're going to go ahead with your suggestion; this will be roughly equivalent to the approach in the JSON parsing plugin, so it seems reasonable. Neither approach solves the problem well for non-trivial documents. If possible, could you add more usage examples to the README, and add a note that if your XML document is complex you may need to move over to an execd plugin? One way to do this is to use the value parser, then use a processors.execd script to parse that into the proper metric.
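A rough sketch of that fallback path, assuming a hypothetical parse_xml.py script that reads each raw line and writes line protocol back to stdout (option names per the value parser and execd processor; double-check them for your Telegraf version):

# Pass the raw XML through as a single string field...
[[inputs.file]]
  files = [ "data.xml" ]
  data_format = "value"
  data_type = "string"

# ...and hand each metric to an external script that does the real XML parsing.
[[processors.execd]]
  command = ["python", "parse_xml.py"]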

@M0rdecay (Contributor Author)

Thanks for the answer!
I will be updating the branch shortly and will write when everything is ready.

@M0rdecay (Contributor Author) commented Aug 31, 2020

@srebhan (I apologize, the first time around I invited the wrong account to the discussion.)
I also looked at the solution suggested in #8047.

I like the functionality to get a custom timestamp from the document and the ability to set the measurement name based on a value from the document.

But the need to enumerate fields and specify their types manually seems inconvenient, and it also makes the configuration heavy (in our case the analyzed node can have 30+ child nodes and attributes).

If needed, I can add measurement name and timestamp extraction to this implementation.

Your opinion?

Upd. For your example, I get the following configuration:

[[inputs.file]]
  files = [ "sensors.xml" ]
  data_format = "xml"
  xml_query = "/Bus/Sensor"
  xml_array = true
  tag_keys = [ "name" ]

[[processors.enum]]
  [[processors.enum.mapping]]
    field = "Mode"
    default = 1
    [processors.enum.mapping.value_mappings]
      error = 0

With result:

file,name=Sensor\ Facility\ A Mode=1i,consumers=3i,frequency=49.78,power=123.4,temperature=20 1598891953000000000
file,name=Sensor\ Facility\ B Mode=1i,consumers=1i,frequency=49.78,power=14.3,temperature=23.1 1598891953000000000
file,name=Sensor\ Facility\ C Mode=0i,consumers=0i,frequency=49.78,power=0.02,temperature=19.7 1598891953000000000

@M0rdecay mentioned this pull request Aug 31, 2020
@srebhan (Member) commented Sep 1, 2020

Please rebase on master to fix the AppVeyor problem.

@M0rdecay (Contributor Author) commented Sep 1, 2020

@srebhan great, thanks!

@M0rdecay (Contributor Author) commented Sep 4, 2020

@reimda, hello!

Unfortunately, Steven hasn't been in touch yet.
Could you please consider the proposed parsers and suggest which one you think is more suitable?
We would like to see either of them upstream; the main thing is to get a tool for processing XML.

@M0rdecay (Contributor Author) commented Sep 8, 2020

Okay, I guess I'm missing the ability to get tags or fields from an arbitrary place in the document and add them to each generated metric.
Something like this should work:

xml_tags = [ "../node", "//data" ]

This feature should replace the current functionality of getting the name of the measurement.
To turn a tag or field into a metric name, we can use the converter processor.
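For example, a sketch of that converter step, assuming the parser has already attached a tag (hypothetically named metric_name here) taken from the document:

[[processors.converter]]
  [processors.converter.tags]
    # Promote the extracted tag to the measurement name.
    measurement = ["metric_name"]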

Commits: "… ability to get fields and tags from an arbitrary place in the document", "Removed commented code snippet"
@M0rdecay (Contributor Author) commented Sep 9, 2020

I think it's done

@M0rdecay (Contributor Author)

@ssoroka, hello!

I apologize for the frequent mentions; we are really looking forward to an XML parser in Telegraf.

Can we expect this or an alternative solution to be included in master soon? Maybe something needs to be improved?

Right now we use the parser through a wrapper as an external processor, but it's not very convenient...

@srebhan (Member) commented Sep 11, 2020

Just two comments from my side where I see problems with the approach you are taking @M0rdecay:

  1. Implicit type conversions:
    You are trying to "automagically" figure out the type of a value, where you have int (taking precedence) and float. Imagine a weird service that outputs floating-point values, but whenever the float ends with .0 (e.g. 42.0) it will give you 42. In the next iteration you get 23.5. Now you parse the first value as an int and the latter one as a float... (see the sketch after this list)
  2. "Reimplementation" of XPath functionality:
    While your code reads nice and clean, I have the feeling that you are partially reimplementing what XPath was meant for.
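For illustration, a minimal Go sketch of the ambiguity described in point 1 (not code from either PR, just the general idea of guessing types with the standard strconv package):

package main

import (
	"fmt"
	"strconv"
)

// guessType mimics "automagic" typing: try int first, then float, otherwise keep the string.
func guessType(s string) interface{} {
	if i, err := strconv.ParseInt(s, 10, 64); err == nil {
		return i
	}
	if f, err := strconv.ParseFloat(s, 64); err == nil {
		return f
	}
	return s
}

func main() {
	// The same XML field across two polls of the service: a whole float emitted
	// without ".0", then a fractional one.
	fmt.Printf("%T\n", guessType("42"))   // int64   -> integer field
	fmt.Printf("%T\n", guessType("23.5")) // float64 -> float field (type conflict)
}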

However, I see the advantage of your approach for some use cases and really would also like to merge this into my PR #8047. But we have to agree on the underlying XML library first, I guess...

@M0rdecay (Contributor Author) commented Sep 11, 2020

@srebhan good morning!

On the first point - of course, such a situation is possible, but it would violate the XML schema. I consider the probability of such a situation to be low, but, of course, not zero. This is a rather rare case; such situations can be handled through a converter processor.
Here I will also add that if 42.0 is in XML, the conversion will be correct - https://play.golang.org/p/IdI6XxI7Aqn

About function reimplementation: indeed, the library I am using is not as capable in the XPath part as the one you are using, so I was forced to add attribute extraction. In general, the libraries you are using seem more suitable, so we can use them.

Upd. We can continue the discussion in the InfluxDB Community Slack, but I would prefer to continue here so that others can see the correspondence.

@srebhan (Member) commented Sep 11, 2020

On the first point - of course, such a situation is possible, but it would violate the XML schema. I consider the probability of such a situation to be low, but, of course, not zero.

And I have exactly that kind of service running here... :-) Could you explain how it violates the XML schema? Do you have any reference? To my knowledge, XML does not make any assumptions about node content, so in general it is text. Otherwise you need a DTD, I think.

This is a rather rare case; such situations can be handled through a converter processor.

Wouldn't it be better to leave out the automatic conversion and use the converter processor by default? Alternatively you should be able to explicitly specify the output type.

Here I will also add that if 42.0 is in XML, the conversion will be correct - https://play.golang.org/p/IdI6XxI7Aqn

Sure, but my point was the possible confusion with an int.

Upd. We can continue the discussion in the InfluxDB Community Slack, but I would prefer to continue here so that others can see the correspondence.

As you wish. :-)

@M0rdecay (Contributor Author) commented Sep 11, 2020

XSD unambiguously declares the type xsd:double - are you using something different for pre-validation before XML posting?

Upd. It looks like I'm wrong here. According to XSD version 1, a number without a dot can also satisfy this type.
In this case, indeed, we must either specify the type in advance, or use a converter.

@srebhan (Member) commented Sep 11, 2020

True, XSD can strictly define the type! The important thing is that this is completely optional. It is OK to send out XML without any validation. The service I'm forced to use obviously does that... :-(

@M0rdecay (Contributor Author)

Well, I see you are right. In this case, your solution becomes more versatile at the expense of the desired ease of configuration.

I would like to know what the upstream owners think about this.
For me, the main goal here is to see any of the solutions in the master branch.

@M0rdecay (Contributor Author)

We discussed data typing with @srebhan and came to the following conclusion: a flag that indicates whether the parser will dynamically type the values of the nodes may be a suitable solution.

The last commit adds this flag, so far without changing the library for working with XML.
I would like to change the library only if actual use of the parser shows that it is necessary; this is mentioned in the README.
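A hypothetical illustration of the idea (the actual option name is defined in the PR and may differ):

[[inputs.file]]
  files = [ "data.xml" ]
  data_format = "xml"
  xml_query = "//VHost"
  # Hypothetical flag name: when false, all node values stay strings and
  # explicit typing is left to a converter processor.
  xml_cast_types = true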

@M0rdecay (Contributor Author)

Let's continue at #8121

It seems the second attempt turned out better.

@M0rdecay closed this Sep 14, 2020