Adds a new parser for XML data #7460
@danielnelson, hello! |
What I would like to see on this pull request is more evidence that it will be useful across different documents. It would help to get more community feedback on whether this plugin is sufficient for XML processing. It is also always helpful to resolve any merge conflicts, so that we know the build works and the PR can be merged. If we see conflicts or test failures, we sometimes assume it is a work in progress. I'm going to be stepping away from being the Telegraf maintainer; @reimda and @ssoroka will be taking over and will help you further. |
Fix license check
Move licence info to preserve lexicographic order
@ssoroka, hello! |
I think this runs into the same problems as the json parser. It's fine for pulling out a specific value, but it's difficult to iterate over parts of the document to build one or more metrics for any document that's non-trivial. I'm not really sure what the answer to this is. I don't think XPath really helps here. I'd like to do a bit more research to find a flexible approach for both. |
@ssoroka you are right, I thought about it too. Maybe we should try something like this:
[[inputs.file]]
...
data_format = "xml"
[[inputs.file.xml]]
query = ...
[[inputs.file.xml]]
query = ...
This way we could analyze the same document several times. |
That definitely helps, but it's still not clear how to go to a specific path and output one metric per element in an array, for example, or what to do if the fields are spread across the document (which they often are), or if part of the XPath needs to be extracted as a field/tag name or value. I wonder if there are object/structure mappers that already do this well? |
This task can be solved by explicitly specifying the name of the node, which will be the key to the element in the array.
<WowzaStreamingEngine>
<ConnectionsCurrent>1</ConnectionsCurrent>
<ConnectionsTotal>1</ConnectionsTotal>
<ConnectionsTotalAccepted>1</ConnectionsTotalAccepted>
<ConnectionsTotalRejected>1</ConnectionsTotalRejected>
<MessagesInBytesRate>1.0</MessagesInBytesRate>
<MessagesOutBytesRate>1.0</MessagesOutBytesRate>
<VHost>
<Name>host_1</Name>
<TimeRunning>64679.885</TimeRunning>
<ConnectionsLimit>0</ConnectionsLimit>
<ConnectionsCurrent>0</ConnectionsCurrent>
<ConnectionsTotal>0</ConnectionsTotal>
<ConnectionsTotalAccepted>0</ConnectionsTotalAccepted>
<ConnectionsTotalRejected>0</ConnectionsTotalRejected>
<MessagesInBytesRate>0.0</MessagesInBytesRate>
<MessagesOutBytesRate>0.0</MessagesOutBytesRate>
</VHost>
<VHost>
<Name>host_2</Name>
<TimeRunning>67777.885</TimeRunning>
<ConnectionsLimit>2</ConnectionsLimit>
<ConnectionsCurrent>2</ConnectionsCurrent>
<ConnectionsTotal>2</ConnectionsTotal>
<ConnectionsTotalAccepted>2</ConnectionsTotalAccepted>
<ConnectionsTotalRejected>2</ConnectionsTotalRejected>
<MessagesInBytesRate>2.2</MessagesInBytesRate>
<MessagesOutBytesRate>2.2</MessagesOutBytesRate>
</VHost>
</WowzaStreamingEngine>
[[inputs.file]]
files = [ "data.xml" ]
data_format = "xml"
xml_query = "//VHost/"
xml_merge_nodes = true
tag_keys = [ "Name" ]
xml_array_key = "VHost" # <- This functionality needs to be added
xml_array = true # <- Or this. In this case, it is worth going through all the nested keys of the first level and creating a separate metric from each.
After minor improvements, it will be possible to achieve the following result:
As for the rest, unfortunately, I don't think that kind of flexibility can be achieved with a purely declarative approach. |
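To make the "one metric per array element" idea concrete, here is a minimal Python sketch (not the plugin's Go code) that walks a shortened copy of the Wowza document above and emits one tags/fields record per VHost node. The function name `metrics_per_element` and the dict layout are assumptions made for illustration; values are left as strings, since typing is a separate question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the Wowza document from the comment above.
DOC = """
<WowzaStreamingEngine>
  <ConnectionsCurrent>1</ConnectionsCurrent>
  <VHost>
    <Name>host_1</Name>
    <TimeRunning>64679.885</TimeRunning>
    <ConnectionsCurrent>0</ConnectionsCurrent>
  </VHost>
  <VHost>
    <Name>host_2</Name>
    <TimeRunning>67777.885</TimeRunning>
    <ConnectionsCurrent>2</ConnectionsCurrent>
  </VHost>
</WowzaStreamingEngine>
"""


def metrics_per_element(xml_text, array_path, tag_keys):
    """Return one {"tags": ..., "fields": ...} dict per node matching array_path.

    Hypothetical helper: child nodes named in tag_keys become tags,
    every other direct child becomes a (string) field.
    """
    root = ET.fromstring(xml_text.strip())
    metrics = []
    for node in root.findall(array_path):
        tags, fields = {}, {}
        for child in node:
            text = (child.text or "").strip()
            if child.tag in tag_keys:
                tags[child.tag] = text
            else:
                fields[child.tag] = text
        metrics.append({"tags": tags, "fields": fields})
    return metrics


vhost_metrics = metrics_per_element(DOC, "./VHost", {"Name"})
for metric in vhost_metrics:
    print(metric)
```

Each VHost yields its own record tagged with its Name, which is roughly what the proposed `xml_array` option would have to do inside the parser.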
@ssoroka hello! |
So we're going to go ahead with your suggestion, and this will be about equal to the approach taken by the json parsing plugin, so it seems reasonable. Neither approach solves the problem well for non-trivial documents. If possible, could you add more usage examples to the readme, and add a note that if your XML document is complex you may need to move over to an execd plugin? One way to do this is to use the value parser, then use a processors.execd script to parse that into the proper metric. |
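For readers considering the execd route, the script's main job after parsing the XML is to print metrics back in a form Telegraf understands. The following is a rough Python sketch of a line protocol formatter, assuming line protocol is the exchange format; the helper names (`escape`, `format_field`, `to_line_protocol`) and the measurement name `vhost` are hypothetical, and timestamps are omitted.

```python
def escape(value, chars):
    """Backslash-escape each character in chars (commas, spaces, equals signs)."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def format_field(value):
    """Render a field value: booleans, integers (i suffix), floats, quoted strings."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(name, tags, fields):
    """Build one line: measurement,tag=value field=value (no timestamp)."""
    head = escape(name, ", ")
    for key in sorted(tags):
        head += "," + escape(key, ",= ") + "=" + escape(tags[key], ",= ")
    body = ",".join(
        escape(key, ",= ") + "=" + format_field(value)
        for key, value in sorted(fields.items())
    )
    return head + " " + body


line = to_line_protocol(
    "vhost",
    {"Name": "host_1"},
    {"ConnectionsCurrent": 0, "TimeRunning": 64679.885},
)
print(line)
```

A real script would read each incoming metric from stdin, parse the XML held in its value field, and write one such line per generated metric to stdout.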
Thanks for the answer! |
Fix expected tag key name
@srebhan (I apologize, the first time I invited the wrong account to the discussion) I like the ability to take a custom timestamp from the document and to set the measurement name based on a value from the document. But having to enumerate fields and specify their types manually seems inconvenient, and it also makes the configuration heavy (in our cases the analyzed node can have 30+ child nodes and attributes). If needed, I can add measurement name and timestamp extraction to this implementation. What do you think? Upd. For your example, I get the following configuration:
[[inputs.file]]
files = [ "sensors.xml" ]
data_format = "xml"
xml_query = "/Bus/Sensor"
xml_array = true
tag_keys = [ "name" ]
[[processors.enum]]
[[processors.enum.mapping]]
field = "Mode"
default = 1
[processors.enum.mapping.value_mappings]
error = 0
With result:
|
Please rebase on master to fix the AppVeyor problem. |
@srebhan great, thanks! |
@reimda, hello! Unfortunately, Steven hasn't been in touch yet. |
Okay, I guess I'm missing the ability to get tags or fields from an arbitrary place in the document and add them to each generated metric.
xml_tags = [ "../node", "//data" ]
This feature should replace the current functionality of getting the name of the measurement. |
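As an illustration of what such an option would need to do, the Python sketch below attaches two shared tags to every generated metric: one found anywhere in the document (like `//data`) and one found next to the item's parent (like `../node`). The sample document, the element names and the function `metrics_with_shared_tags` are all invented for this example and do not come from the PR.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, made up for this sketch.
DOC = """
<root>
  <data>site-a</data>
  <group>
    <node>rack-1</node>
    <item><value>1.5</value></item>
    <item><value>2.5</value></item>
  </group>
</root>
"""


def metrics_with_shared_tags(xml_text, item_path):
    """Build one metric per item and copy in tags found elsewhere in the tree."""
    root = ET.fromstring(xml_text.strip())
    # ElementTree elements carry no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # Rough equivalent of "//data": the first <data> node anywhere below the root.
    data = root.find(".//data")
    metrics = []
    for item in root.findall(item_path):
        tags = {}
        if data is not None:
            tags["data"] = data.text
        # Rough equivalent of "../node": a <node> child of the item's parent.
        parent = parent_of.get(item)
        node = parent.find("node") if parent is not None else None
        if node is not None:
            tags["node"] = node.text
        fields = {child.tag: child.text for child in item}
        metrics.append({"tags": tags, "fields": fields})
    return metrics


shared = metrics_with_shared_tags(DOC, "./group/item")
for metric in shared:
    print(metric)
```

The parent map is needed only because Python's ElementTree cannot step upward from a child; an XPath library with full axis support would evaluate `../node` directly.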
… ability to get fields and tags from an arbitrary place in the document
Removed commented code snippet
I think it's done |
@ssoroka, hello! I apologize for the frequent mentions; we are really looking forward to an XML parser in Telegraf. Can we expect this or an alternative solution to be merged into master soon? Right now we use the parser through a wrapper as an external processor, but it's not very convenient ... |
Just two comments from my side where I see problems with the approach you are taking @M0rdecay:
However, I see the advantage of your approach for some use-cases and really would also like to merge this into my PR #8047. But we have to agree on the underlying XML-library first I guess... |
@srebhan good morning! On the first point: of course, such a situation is possible, but it would violate the XML schema. I consider the probability of such a situation to be low, but, of course, not zero. This is a rather rare case; such situations can be handled through a
About function reimplementation: indeed, the library I am using is not as functional in the
Upd. We can continue the discussion in the InfluxDB Community Slack, but I would prefer to keep it here so that others can see the correspondence. |
And I have exactly that kind of service running here... :-) Could you explain how it violates the XML schema? Do you have any reference? To my knowledge, XML does not make any assumptions about the node content, so in general it is text. Otherwise you need a DTD, I think.
Wouldn't it be better to leave out the automatic conversion and use the
Sure, but my point was the possible confusion with an int.
As you wish. :-) |
XSD unambiguously declares the type xsd:double - are you using something different for pre-validation before XML posting? Upd. It looks like I'm wrong here. According to xsd version 1, a number without a dot can also satisfy this type. |
True, XSD can strictly define the type! The important thing is that this is completely optional. It is OK to send out XML without any validation. The service I'm forced to use obviously does that... :-( |
Well, I see you are right. In this case, your solution becomes more versatile at the expense of the desired ease of configuration. I would like to know what the upstream owners think about this. |
We discussed data typing with @srebhan and came to the following conclusion: a flag that indicates whether the parser dynamically types node values may be a suitable solution. The last commit adds this flag, so far without changing the library used for working with XML. |
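The typing concern can be shown with a small Python sketch. The function `convert` and its `dynamic_typing` parameter are hypothetical names chosen for this example; the flag's actual name in the PR is not stated in this thread.

```python
def convert(text, dynamic_typing=True):
    """Guess a type for a node's text, or return it unchanged when typing is off."""
    text = text.strip()
    if not dynamic_typing:
        return text
    # Try the narrowest numeric type first, then fall back to float.
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    if text.lower() in ("true", "false"):
        return text.lower() == "true"
    return text


# The same node may report "1.0" in one document and "1" in the next.
first = convert("1.0")
second = convert("1")
print(type(first).__name__, type(second).__name__)

# With dynamic typing switched off, both stay strings and can be typed explicitly later.
print(repr(convert("1", dynamic_typing=False)))
```

With guessing enabled, the two readings of the same node come back as a float and an int, which a typed store such as InfluxDB can reject as a field type conflict. Switching the guessing off trades that risk for more explicit configuration.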
Let's continue at #8121. It seems that the second attempt came out more successful. |
This is a draft of an XML parser (related issues #6968, #1758). Please review the proposed solution and let us know if improvements are required.