-
Notifications
You must be signed in to change notification settings - Fork 73
WIP: Draft - Config feature to allow XMLConversionControl to preserve CRLF and CR in data #912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
14878d8
to
01f2964
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two main things that I'm unsure about in his change:
-
Lots of backwards incompatible changes with he infoset inputters/outputters now requiring XCC. I don't think this should be mandatory, most people don't deal with CR/LF acually in their data the default should be fine.
-
I'm no sure about storing the config in the dataprocessor. Feel like XCC should just be another paramter that the CLI/TDMLRunner etc extracts from the config (or not) just like variable/tunables/ec and passes it to the infoset inputter/ouputter.
I also sort of wish this was split up into separae PRs for refactoring of the remapping stuff and changing how the config is handled and API changes. The refactoring is uncontroversial, but the config/API changes feel more controversial and maybe there's a better approach to consider.
@@ -0,0 +1,3 @@ | |||
abc | |||
def | |||
ghi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you can rely on git not munging these CRLFs on windows. We need an alternative method to ensure that we actually send CRLF as test data. One option is to use withTempFile, write the bytes directly to that and then parse with the input file being the temp file.
// rename the src/test dir to src/test1. Rename the src/it dir to src/test. | ||
// Then modify this val to be "test". | ||
// Then you can run these as ordinary junit-style tests under the IDE. | ||
val test = "it" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer we solve this issue and update our entire integration suite in a separate PR. Making one-off change to a subset of our integration tests made it really difficult to update everything to be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm of the opinion that the 267 integration tests run fast enough now that we should just lump them in under regular src/test now, and not bother with the separate src/it.
However, it would be good to move the stage build tasks so that you don't have to wait for the javadoc/scaladoc to build every time you want to make a small changes and then rerun the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created https://issues.apache.org/jira/browse/DAFFODIL-2772 about fixing the src/it tests to be ordinary src/test.
withTempFile { output => | ||
val schema = path(s"daffodil-cli/src/$test/resources/org/apache/daffodil/CLI/aString.dfdl.xsd") | ||
val config = path(s"daffodil-cli/src/$test/resources/org/apache/daffodil/CLI/config-convertCR.cfg.xml") | ||
val input = path(s"daffodil-cli/src/$test/resources/org/apache/daffodil/CLI/input/inputWithCRLFs.txt.dat") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest moving these outside the withTempFile block since they don't need it. Makes it more consistent with the rest of our test suite.
|
||
|
||
val xml = FileUtils.readFileToString(xmlOut.toFile, StandardCharsets.UTF_8) | ||
println(xml) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove prinln
|
||
var cmd = args"parse -s $schema -c $config --root a -o $xmlOut $input " | ||
runCLI(cmd) { cli => | ||
cli.expect(s"") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The body of he runCLI function can be empty. You don't need to expect the empty string.
</xs:all> | ||
</xs:complexType> | ||
</xs:element> | ||
<xs:element name="tunables" type="tns:tunablesType"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change? It's hard to tell if there are actual changes to the tunables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm inclined to remove all changes to the config file stuff, switch to a simple properties file for the xml controls, and treat the existing config file stuff as legacy.
However, it has become clear to me that some schemas just require a config file in order to be used, and the same will hold for these xml conversion control parameters. Without them the CLI parse/unparse, the TDML runner, or using them via API are going to always require a config along with certain XML conversion controls.
<!-- | ||
TODO: really these all should have been XML attributes | ||
--> | ||
<xs:element name="allowExpressionResultCoercion" type="xs:boolean" default="true" minOccurs="0" form="unqualified"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this all copy paste from above, just with a different type? Is there a way around this? This is going to be a pain to keep these in sync.
<xs:choice> | ||
<!-- allow qualified or unqualified form --> | ||
<xs:element ref="tns:tunables" minOccurs="0"/> | ||
<xs:element name="tunables" minOccurs="0" type="tns:tunablesTypeUnqualified" form="unqualified"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we just pick one and go with it (or switch to a easier config format like .properties) and not worry about backwards compa for config files. I don't think they are used that often, and it's easy to change them.
val bos = new java.io.ByteArrayOutputStream() | ||
val xml = new XMLTextInfosetOutputter(bos, true) | ||
val xml = new XMLTextInfosetOutputter(bos, true, daffodilConfig.xmlConversionControl) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest the debugger is just opinionated about this XML conversions, or if we do want it to be modifiable then it be a debugger setting with the set
command. Having to pass around this config for one option feels unnecessary.
override val xmlConversionControl: XMLConversionControl) extends InfosetInputter | ||
with XMLInfosetInputterMixin { | ||
|
||
lazy val events: Array[NullInfosetInputter.Event] = toEvents(is) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is incorrect. Converting to an array of events must happen outside of the creation of the NullInfosetInputer to avoid including converting the stream to events in performance metrics. Which means converting it to events must know about the XCC and not the infoset inputter.
I think changes to this file need to be reverted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switching this back to a WIP/Draft PR.
There is much work to do on this before it is sensible to review it again.
@@ -1 +1 @@ | |||
Hello World | |||
Hello World |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why this file showed up here. It is identical to prior version.
@@ -1,56 +1,56 @@ | |||
<?xml version="1.0" encoding="UTF-8"?> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file also identical. I don't know why it is here in this changeset.
// rename the src/test dir to src/test1. Rename the src/it dir to src/test. | ||
// Then modify this val to be "test". | ||
// Then you can run these as ordinary junit-style tests under the IDE. | ||
val test = "it" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm of the opinion that the 267 integration tests run fast enough now that we should just lump them in under regular src/test now, and not bother with the separate src/it.
However, it would be good to move the stage build tasks so that you don't have to wait for the javadoc/scaladoc to build every time you want to make a small changes and then rerun the tests.
} | ||
|
||
// | ||
// Illustrates a problem with the expect library (perhaps?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working as designed then. I will just remove this test.
@@ -332,7 +332,7 @@ class TestExternalVariables { | |||
val dp1 = pf.onPath("/") | |||
val dp2 = pf.onPath("/").withExternalVariables(variables) | |||
|
|||
val outputter = new ScalaXMLInfosetOutputter() | |||
val outputter = new ScalaXMLInfosetOutputter(dp2.daffodilConfig.xmlConversionControl) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point.
I only attached the config to the dp because i learned by trying it, that the xmlConversionControl object needed to get passed to exactly everywhere the dp was already being passed. Because to run something, in the CLI, or in the TDML runner, you have to configure both the daffodil DP, AND at least one infoset inputter/outputter, which usually means one of the XML ones. So the two objects (DP and config info) just go together everywhere.
But I will look into refactoring this so that the DP object doesn't carry the config at all.
That will straighten out the NullInfosetInputter as well.
if (len != 0) | ||
while (pos < len) { | ||
c = s(pos) | ||
if (remapChar(c) != c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug. Yes, a lone surrogate (which you can only tell from context) must be remapped.
*/ | ||
protected def remap (prev: Int, curr: Int, next: Int): Int | ||
|
||
private def needsRemapping(s: String): Boolean = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good fix. Avoids on average 1/2 a pass over the data. Fixes surrogate bug with needsRemapping.
</xs:all> | ||
</xs:complexType> | ||
</xs:element> | ||
<xs:element name="tunables" type="tns:tunablesType"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm inclined to remove all changes to the config file stuff, switch to a simple properties file for the xml controls, and treat the existing config file stuff as legacy.
However, it has become clear to me that some schemas just require a config file in order to be used, and the same will hold for these xml conversion control parameters. Without them the CLI parse/unparse, the TDML runner, or using them via API are going to always require a config along with certain XML conversion controls.
bd3bcba
to
d1a9e7f
Compare
DAFFODIL-1559
d1a9e7f
to
b106d7b
Compare
Scrapping this PR. |
Review ideas.
Ignore JAPI/SAPI changes for now.
Look only at XMLTextInfosetInputter and XMLTextInfosetOutputter, and ignore all the other variants thereof.
Because the changes across JAPI/SAPI and these inputters/outputters are very redundant. All same everywhere.
The real guts of this thing are:
changes to config file so that we can add
<xmlConversionControl carriageReturnMapping="PreserveCR"/>
Which causes CRs to be presreved by remapping to the PUA.
This can be extended with other attributes to control other aspects of XML Conversion as is needed for other open tickets (DAFFODIL-2346).
The actual remapping code has been consolidated and revised, but is mostly the same.
DAFFODIL-1559