Adds a new parser for XML data #7460

Closed · wants to merge 37 commits

Conversation

@M0rdecay (Contributor) commented May 5, 2020

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

This is a draft of an XML parser (related issues #6968, #1758). Please review the proposed solution and let us know if improvements are required.

@M0rdecay (Contributor Author)

@danielnelson, hello!
Sorry to disturb you, but is there a chance this PR will end up in 1.16? Should I fix the merge conflicts?

@danielnelson (Contributor)

I think what I would like to see on this pull request is more evidence that it will be useful across different documents. It would help if we could get more community feedback on whether this plugin is sufficient for XML processing.

It is always helpful to resolve any merge conflicts; that way we know the build works and the branch can be merged. If we see conflicts or test failures, we sometimes assume it is a work in progress.

I'm going to be stepping away from being the Telegraf maintainer; @reimda and @ssoroka will be taking over and will help you further.

@M0rdecay (Contributor Author)

@ssoroka, hello!
Merge conflicts resolved. What should I do next?

@ssoroka (Contributor) commented Aug 11, 2020

I think this runs into the same problems as the json parser. It's fine for pulling out a specific value, but it's difficult to iterate over parts of the document to build one or more metrics for any document that's non-trivial.

I'm not really sure what the answer to this is. I don't think XPath really helps here. I'd like to do a bit more research to find a flexible approach for both.

@M0rdecay (Contributor Author) commented Aug 17, 2020

@ssoroka you are right, I thought about it too.

Maybe we should try something like this:

[[inputs.file]]
...
  data_format = "xml"

  [[inputs.file.xml]]
    query = ...

  [[inputs.file.xml]]
    query = ...

This way we could analyze the same document several times.
At first glance, this shouldn't be difficult to implement.
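For example, a hypothetical filled-in version (the queries below are made up, purely to illustrate two passes over the same document):

[[inputs.file]]
  files = [ "status.xml" ]
  data_format = "xml"

  # First pass: top-level counters (hypothetical query)
  [[inputs.file.xml]]
    query = "/root/summary"

  # Second pass: repeated child elements (hypothetical query)
  [[inputs.file.xml]]
    query = "//item"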

@ssoroka (Contributor) commented Aug 19, 2020

That definitely helps, but it's still not clear how to go to a specific path and output one metric per element in the array, for example, or what to do if the fields are spread across the document (which they often are), or if part of the XPath needs to be extracted as a field/tag name or value.

I wonder if there are object/structure mappers that already do this well?

@M0rdecay (Contributor Author) commented Aug 25, 2020

but it's still not clear how to go to a specific path and output one metric per element in the array

This task can be solved by explicitly specifying the name of the node that acts as the key for the elements of the array.
For example:

<WowzaStreamingEngine>
    <ConnectionsCurrent>1</ConnectionsCurrent>
    <ConnectionsTotal>1</ConnectionsTotal>
    <ConnectionsTotalAccepted>1</ConnectionsTotalAccepted>
    <ConnectionsTotalRejected>1</ConnectionsTotalRejected>
    <MessagesInBytesRate>1.0</MessagesInBytesRate>
    <MessagesOutBytesRate>1.0</MessagesOutBytesRate>
    <VHost>
        <Name>host_1</Name>
        <TimeRunning>64679.885</TimeRunning>
        <ConnectionsLimit>0</ConnectionsLimit>
        <ConnectionsCurrent>0</ConnectionsCurrent>
        <ConnectionsTotal>0</ConnectionsTotal>
        <ConnectionsTotalAccepted>0</ConnectionsTotalAccepted>
        <ConnectionsTotalRejected>0</ConnectionsTotalRejected>
        <MessagesInBytesRate>0.0</MessagesInBytesRate>
        <MessagesOutBytesRate>0.0</MessagesOutBytesRate>
    </VHost>
    <VHost>
        <Name>host_2</Name>
        <TimeRunning>67777.885</TimeRunning>
        <ConnectionsLimit>2</ConnectionsLimit>
        <ConnectionsCurrent>2</ConnectionsCurrent>
        <ConnectionsTotal>2</ConnectionsTotal>
        <ConnectionsTotalAccepted>2</ConnectionsTotalAccepted>
        <ConnectionsTotalRejected>2</ConnectionsTotalRejected>
        <MessagesInBytesRate>2.2</MessagesInBytesRate>
        <MessagesOutBytesRate>2.2</MessagesOutBytesRate>
    </VHost>
</WowzaStreamingEngine>
[[inputs.file]]
  files = [ "data.xml" ]
  data_format = "xml"
  xml_query = "//VHost/"
  xml_merge_nodes = true
  tag_keys = [ "Name" ]
  xml_array_key = "VHost" # <- This functionality needs to be added
  xml_array = true # <- Or this. In this case, it is worth going through all the nested keys of the first level and creating a separate metric from each.

After minor improvements, it will be possible to achieve the following result:

file,host=host.local,Name=host_1 TimeRunning=64679.885,ConnectionsTotalAccepted=0i,MessagesInBytesRate=0,MessagesOutBytesRate=0,ConnectionsCurrent=0i,ConnectionsTotal=0i,ConnectionsTotalRejected=0i,ConnectionsLimit=0i 1598357710000000000
file,host=host.local,Name=host_2 ConnectionsTotal=2i,MessagesInBytesRate=2.2,ConnectionsTotalAccepted=2i,ConnectionsTotalRejected=2i,MessagesOutBytesRate=2.2,TimeRunning=67777.885,ConnectionsLimit=2i,ConnectionsCurrent=2i 1598357710000000000

As for the rest, unfortunately it seems to me that this degree of flexibility cannot be achieved with a purely declarative approach.
I would also like to avoid using XSLT to keep the configuration simple.
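Coming back to the "one metric per array element" point: purely for illustration, here is a rough standalone Go sketch of that idea (not the parser code from this PR), using only the standard encoding/xml package. The struct and field names are chosen just for this example.

package main

import (
	"encoding/xml"
	"fmt"
)

// vhost mirrors a single <VHost> element; only a few of the child nodes are shown.
type vhost struct {
	Name               string  `xml:"Name"`
	TimeRunning        float64 `xml:"TimeRunning"`
	ConnectionsCurrent int64   `xml:"ConnectionsCurrent"`
}

// report collects every <VHost> child of the document root.
type report struct {
	VHosts []vhost `xml:"VHost"`
}

func main() {
	data := []byte(`<WowzaStreamingEngine>
  <VHost><Name>host_1</Name><TimeRunning>64679.885</TimeRunning><ConnectionsCurrent>0</ConnectionsCurrent></VHost>
  <VHost><Name>host_2</Name><TimeRunning>67777.885</TimeRunning><ConnectionsCurrent>2</ConnectionsCurrent></VHost>
</WowzaStreamingEngine>`)

	var r report
	if err := xml.Unmarshal(data, &r); err != nil {
		panic(err)
	}
	// One metric per <VHost>: Name becomes a tag, the remaining children become fields.
	for _, v := range r.VHosts {
		fmt.Printf("file,Name=%s TimeRunning=%v,ConnectionsCurrent=%di\n",
			v.Name, v.TimeRunning, v.ConnectionsCurrent)
	}
}

The parser would have to do the same thing generically, without per-document structs, which is what the xml_array_key / xml_array options above are meant to cover.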

@M0rdecay (Contributor Author)

@ssoroka hello!
What do you think of the information above?
If this sounds good to you, I would gladly take on the implementation.

@ssoroka (Contributor) commented Aug 27, 2020

So we're going to go ahead with your suggestion; this will be roughly equivalent to the approach in the JSON parsing plugin, so it seems reasonable. Neither approach solves the problem well for non-trivial documents. If possible, could you add more usage examples to the README, and add a note that if your XML document is complex you may need to move over to an execd plugin? One way to do this is to use the value parser, then use a processors.execd script to parse that into the proper metric.
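A rough sketch of that fallback path, assuming a hypothetical parse_xml.py script that reads each raw line and writes line protocol back to stdout (option names per the value parser and execd processor; double-check them for your Telegraf version):

# Pass the raw XML through as a single string field...
[[inputs.file]]
  files = [ "data.xml" ]
  data_format = "value"
  data_type = "string"

# ...and hand each metric to an external script that does the real XML parsing.
[[processors.execd]]
  command = ["python", "parse_xml.py"]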

@M0rdecay (Contributor Author)

Thanks for the answer!
I will be updating the branch shortly and will write when everything is ready.

@M0rdecay (Contributor Author) commented Aug 31, 2020

@srebhan (I apologize, the first time around I invited the wrong account to the discussion.)
I also looked at the solution suggested in #8047.

I like the functionality to get a custom timestamp from the document and the ability to set the measurement name based on a value from the document.

But the need to enumerate fields and specify their types manually seems inconvenient, and it also makes the configuration heavy (in our case the analyzed node can have 30+ child nodes and attributes).

If needed, I can add measurement name and timestamp extraction to this implementation.

Your opinion?

Upd. For your example, I get the following configuration:

[[inputs.file]]
  files = [ "sensors.xml" ]
  data_format = "xml"
  xml_query = "/Bus/Sensor"
  xml_array = true
  tag_keys = [ "name" ]

[[processors.enum]]
  [[processors.enum.mapping]]
    field = "Mode"
    default = 1
    [processors.enum.mapping.value_mappings]
      error = 0

With result:

file,name=Sensor\ Facility\ A Mode=1i,consumers=3i,frequency=49.78,power=123.4,temperature=20 1598891953000000000
file,name=Sensor\ Facility\ B Mode=1i,consumers=1i,frequency=49.78,power=14.3,temperature=23.1 1598891953000000000
file,name=Sensor\ Facility\ C Mode=0i,consumers=0i,frequency=49.78,power=0.02,temperature=19.7 1598891953000000000

@M0rdecay mentioned this pull request Aug 31, 2020
@srebhan (Member) commented Sep 1, 2020

Please rebase on master to fix the AppVeyor problem.

@M0rdecay (Contributor Author) commented Sep 1, 2020

@srebhan great, thanks!

@M0rdecay (Contributor Author) commented Sep 4, 2020

@reimda, hello!

Unfortunately, Steven hasn't been in touch yet.
Could you please consider the proposed parsers and suggest which one you think is more suitable?
We would like to see either of them upstream; the main thing is to get a tool for processing XML.

@M0rdecay (Contributor Author) commented Sep 8, 2020

Okay, I guess I'm missing the ability to get tags or fields from an arbitrary place in the document and add them to each generated metric.
Something like this should work:

xml_tags = [ "../node", "//data" ]

This feature should replace the current functionality of getting the name of the measurement.
To turn a tag or field into a metric name, we can use the converter processor.
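For example, a sketch of that converter step, assuming the parser has already attached a tag (hypothetically named metric_name here) taken from the document:

[[processors.converter]]
  [processors.converter.tags]
    # Promote the extracted tag to the measurement name.
    measurement = ["metric_name"]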

Commits: "… ability to get fields and tags from an arbitrary place in the document", "Removed commented code snippet"
@M0rdecay (Contributor Author) commented Sep 9, 2020

I think it's done

@M0rdecay (Contributor Author)

@ssoroka, hello!

I apologize for the frequent mentions; we are really looking forward to an XML parser in Telegraf.

Can we expect this or an alternative solution to be included in master soon? Maybe something needs to be improved?

Right now we use the parser through a wrapper as an external processor, but it's not very convenient...

@srebhan (Member) commented Sep 11, 2020

Just two comments from my side where I see problems with the approach you are taking @M0rdecay:

  1. Implicit type conversions:
    You are trying to "automagically" figure out the type of a value, where you have int (taking precedence) and float. Imagine a weird service that outputs floating-point values, but whenever the float ends with .0 (e.g. 42.0) it will give you 42. In the next iteration you get 23.5. Now you parse the first value as an int and the latter one as a float... (see the sketch after this list)
  2. "Reimplementation" of XPath functionality:
    While your code reads nice and clean, I have the feeling that you are partially reimplementing what XPath was meant for.
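For illustration, a minimal Go sketch of the ambiguity described in point 1 (not code from either PR, just the general idea of guessing types with the standard strconv package):

package main

import (
	"fmt"
	"strconv"
)

// guessType mimics "automagic" typing: try int first, then float, otherwise keep the string.
func guessType(s string) interface{} {
	if i, err := strconv.ParseInt(s, 10, 64); err == nil {
		return i
	}
	if f, err := strconv.ParseFloat(s, 64); err == nil {
		return f
	}
	return s
}

func main() {
	// The same XML field across two polls of the service: a whole float emitted
	// without ".0", then a fractional one.
	fmt.Printf("%T\n", guessType("42"))   // int64   -> integer field
	fmt.Printf("%T\n", guessType("23.5")) // float64 -> float field (type conflict)
}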

However, I see the advantage of your approach for some use cases and really would also like to merge this into my PR #8047. But we have to agree on the underlying XML library first, I guess...

@M0rdecay (Contributor Author) commented Sep 11, 2020

@srebhan good morning!

On the first point - of course, such a situation is possible, but it would violate the XML schema. I consider the probability of such a situation to be low, but, of course, not zero. This is a rather rare case; such situations can be handled through a converter processor.
Here I will also add that if 42.0 is in XML, the conversion will be correct - https://play.golang.org/p/IdI6XxI7Aqn

About function reimplementation: indeed, the library I am using is not as capable in the XPath part as the one you are using, so I was forced to add attribute extraction. In general, the libraries you are using seem more suitable, so we can use them.

Upd. We can continue the discussion in the InfluxDB Community Slack, but I would prefer to continue here so that others can see the correspondence.

@srebhan (Member) commented Sep 11, 2020

On the first point - of course, such a situation is possible, but it would violate the XML schema. I consider the probability of such a situation to be low, but, of course, not zero.

And I have exactly that kind of service running here... :-) Could you explain how it violates the XML schema? Do you have any reference? To my knowledge, XML does not make any assumptions about node content, so in general it is text. Otherwise you need a DTD, I think.

This is a rather rare case; such situations can be handled through a converter processor.

Wouldn't it be better to leave out the automatic conversion and use the converter processor by default? Alternatively you should be able to explicitly specify the output type.

Here I will also add that if 42.0 is in XML, the conversion will be correct - https://play.golang.org/p/IdI6XxI7Aqn

Sure, but my point was the possible confusion with an int.

Upd. We can continue the discussion in the InfluxDB Community Slack, but I would prefer to continue here so that others can see the correspondence.

As you wish. :-)

@M0rdecay (Contributor Author) commented Sep 11, 2020

XSD unambiguously declares the type xsd:double - are you using something different for pre-validation before XML posting?

Upd. It looks like I'm wrong here. According to XSD version 1, a number without a dot can also satisfy this type.
In this case, indeed, we must either specify the type in advance, or use a converter.

@srebhan (Member) commented Sep 11, 2020

True, XSD can strictly define the type! The important thing is that this is completely optional. It is OK to send out XML without any validation. The service I'm forced to use obviously does that... :-(

@M0rdecay (Contributor Author)

Well, I see you are right. In this case, your solution becomes more versatile at the expense of the desired ease of configuration.

I would like to know what the upstream owners think about this.
For me, the main goal here is to see any of the solutions in the master branch.

@M0rdecay (Contributor Author)

We discussed data typing with @srebhan and came to the following conclusion: a flag that indicates whether the parser will dynamically type the values of the nodes may be a suitable solution.

The last commit adds this flag, so far without changing the library for working with XML.
I would like to change the library only if actual use of the parser shows that it is necessary; this is mentioned in the README.
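A hypothetical illustration of the idea (the actual option name is defined in the PR and may differ):

[[inputs.file]]
  files = [ "data.xml" ]
  data_format = "xml"
  xml_query = "//VHost"
  # Hypothetical flag name: when false, all node values stay strings and
  # explicit typing is left to a converter processor.
  xml_cast_types = true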

@M0rdecay (Contributor Author)

Let's continue at #8121

It seems the second attempt turned out better.

@M0rdecay closed this Sep 14, 2020