-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove support for Visio and potm files #22079
Remove support for Visio and potm files #22079
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry @dadoonet.
@@ -127,7 +127,7 @@ public void execute(IngestDocument ingestDocument) { | |||
} | |||
additionalFields.put(Property.CONTENT_LENGTH.toLowerCase(), length); | |||
} | |||
} catch (Exception e) { | |||
} catch (Throwable e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not do this. We worked incredibly hard to remove it from the codebase, see #19231. It just hides bugs, and swallows truly fatal errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argh... Indeed...
What is the other solution then in our case? Catch the Throwable, look if it's caused by a ClassNotFound exception, then wrap it and if not a ClassNotFound, just throw the Throwable?
@jasontedor Can you tell me what you think of the latest changes? Thanks! |
Why are we doing this instead of adding the missing dependency? |
I totally agree on not catching Throwable and the effort made to change that in our codebase. That said in this specific case it seems pretty awkward that the node crashes. Is it right to let it go down for a missing dependency problem triggered by some specific call? I guess there is no way to check for this earlier when the node is started? |
Well. We reduced the number of dependencies with this PR: elastic/elasticsearch-mapper-attachments#186. We ended up saying that we will only support a subset of files. Here the user is sending a Word document which is supported but within the Word document he is having an embedded Visio Chart, which we don't support. I think that killing a node because a end user is sending a non supported format should be prevented. @javanna No we can't check that on startup because it depends on the end user's request. |
Yes, it is. We (developers) did not setup the system correctly. We didn't find the problem because of inadequate tests. Let's fix the tests, not try to mask the problems. The underlying issue is the OOXMLParser in tika is poorly designed (all static functions) and does not allow overriding which embedded formats are supported. If we are going to handle this by catching (instead of adding the dependency), then it should be completely local and specific to this case, not general purpose for all of tika (so we don't mask other issues). OOXMLParser is subclassable, where we can catch the NCDFE, not throwable. But such a hack should only be done after opening an issue in tika (and referencing in a comment with the hacky catch). We also desperately need more tests, ideally each type of embedded file (and there are a limited number of these that are possible, so we should try to cover each one). I believe tika has an extensive collection of files for testing, perhaps we can reuse some of those? |
We don't mask it here. We fail the ingestion with an error message which tells what is happening. throw new ElasticsearchParseException("Missing classes to parse document in field [{}]", e, field); We just don't kill anymore the node. Definitely I don't want to hide any error. |
I think everybody agrees that we should not blindly catch Throwable in our code. We made efforts to not hide OOM errors and such, and we don't want to go back there for sure. But, maybe I didn't state it clearly, I don't think it is acceptable that a node crashes because of a missing dependency. We are currently exposing an api that can kill a node, that is a big concerning problem to me. I am not confident that this category of errors with the attachment processor can be solved by testing, cause so many things can go wrong with tika and the many formats it supports. For sure catching |
+1 |
I think if we should add the dependency we should add the dependency and consider that the bug that caused the thing to go down, not the lack of catch. If we don't want the dependency we should figure out some way to work around not having it. I don't think we should have a global "catch NoClassDefError" across all of our code because we don't want to hide other problems. If the way work around this is by catching the Error then I'm all for adding it, but I agree with @rjernst that we should file an issue with tika. If they intend for us to be able to run without the dependency then they should throw a friendlier exception. If they don't intend for us to be able to run without the dependency then we have more thinking to do.... |
David is introducing a catch in the |
Right. I'm cool with it.
That argument makes sense to me. Something along the lines of "Tika throws these instead of UnsupportedOperationException when you are missing an optional dependency." is fine reasoning for adding it. I think we should file an issue with Tika about it. I'd like to be able to remove it one day and catch something that is less likely to mask other weird stuff. |
So what is the outcome on this? Do we want to merge my PR? |
I agree that taking a node down in this situation is bad, but I disagree with the conclusion that we should catch
Like @rjernst said, we need to work this one on the Tika side. |
I'm going to test the workaround proposed on the JIRA and will update the PR if successful. |
Thanks @dadoonet. |
For information, the proposed workaround does not work in our case as I explained at https://issues.apache.org/jira/browse/TIKA-2208?focusedCommentId=15753790&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15753790 Basically, the Tika checks are made when Tika instance is created not when it is used. I can obviously fix the current problem by adding the missing library but I think that it will hide potential other issues with missing JARs. |
235a634
to
190834d
Compare
Actually I was wrong. This PR removes support for Visio files even embedded. I'm going to update the PR soonish. |
Actually we never supported Visio files but we are failing hard (kill a node) when that kind of file is provided. See elastic#22079 (comment) This commits excludes Visio parsing from Tika so it does not fail anymore but returns empty content instead. As a side effect, it also removes support for POTM files. Closes elastic#22077.
929e428
to
2b15d20
Compare
@jasontedor So I removed support for POTM files. I rebased on master and squashed everything. Could you check again this PR? |
I am missing in this pull request where support for POTM files was removed? Also, please do not squash commits in the middle of review cycles. You can merge master into your branch instead of rebasing, too. |
@jasontedor |
I'm going to push a new commit to my branch because I just merged #22959 so you will see in a more obvious way the POTM file removal. |
# Conflicts: # plugins/ingest-attachment/src/test/resources/org/elasticsearch/ingest/attachment/test/tika-files.zip
@jasontedor I merged with master so it's ready. Can you review it please? |
@dadoonet thank you repaired, and now we do not have such an error! |
@sergeytkachenko Do you mean that you tried out that branch? |
@dadoonet Yes! |
Thanks for testing it @sergeytkachenko. You can apply the same fix on 5.2.0 tag and build another version of the plugin if it's a blocker for your production. |
If no one objects in the meantime, I'll merge that change tomorrow morning CET. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Thanks @jasontedor ! |
* Send a non supported document to an ingest pipeline using `ingest-attachment` * If Tika is not able to parse the document because of a missing class (we are not importing all jars needed by Tika), Tika throws a Throwable which is not catch. This commit removes support for Visio and POTM office files. So elasticsearch is not killed anymore when you run a command like: ``` GET _ingest/pipeline/_simulate { "pipeline" : { "processors" : [ { "attachment" : { "field" : "file" } } ] }, "docs" : [ { "_source" : { "file" : "BASE64CONTENT" } } ] } ``` The good news is that it does not kill the node anymore and allows to extract the text which is in the Office document even if we have a Visio content (which is not extracted anymore). Related to elastic#22077 Backport of elastic#22079 in 5.x branch (5.3)
* Parse a non supported document using `mapper-attachments` * If Tika is not able to parse the document because of a missing class (we are not importing all jars needed by Tika), Tika throws a Throwable which is not catch. This commit removes support for Visio and POTM office files. The good news is that it does not kill the node anymore and allows to extract the text which is in the Office document even if we have a Visio content (which is not extracted anymore). Related to elastic#22077 and elastic#22079 for mapper-attachments plugin
* Parse a non supported document using `mapper-attachments` * If Tika is not able to parse the document because of a missing class (we are not importing all jars needed by Tika), Tika throws a Throwable which is not catch. This commit removes support for Visio and POTM office files. The good news is that it does not kill the node anymore and allows to extract the text which is in the Office document even if we have a Visio content (which is not extracted anymore). Related to #22077 and #22079 for mapper-attachments plugin Backport of #23214 in 5.2 branch
ingest-attachment
This commit removes extracting embedded content from Office XML docs.
So elasticsearch is not killed anymore when you run a command like:
The good news is that it does not kill the node anymore and allows to extract the text which is in the Office document.
Closes #22077.