Skip to content

WIP: Draft - Config feature to allow XMLConversionControl to preserve CRLF and CR in data #912

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

mbeckerle
Copy link
Contributor

@mbeckerle mbeckerle commented Jan 14, 2023

Review ideas.

Ignore JAPI/SAPI changes for now.

Look only at XMLTextInfosetInputter and XMLTextInfosetOutputter, and ignore all the other variants thereof.

Because the changes across JAPI/SAPI and these inputters/outputters are very redundant. All same everywhere.

The real guts of this thing are:

changes to config file so that we can add

<xmlConversionControl carriageReturnMapping="PreserveCR"/>

Which causes CRs to be presreved by remapping to the PUA.

This can be extended with other attributes to control other aspects of XML Conversion as is needed for other open tickets (DAFFODIL-2346).

The actual remapping code has been consolidated and revised, but is mostly the same.

DAFFODIL-1559

@mbeckerle mbeckerle force-pushed the daf-1559-crlf branch 4 times, most recently from 14878d8 to 01f2964 Compare January 14, 2023 10:03
@mbeckerle mbeckerle changed the title DRAFT: Checkpoint on work to disable CRLF->LF Config feature to allow XMLConversionControl to preserve CRLF and CR in data Jan 17, 2023
Copy link
Member

@stevedlawrence stevedlawrence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two main things that I'm unsure about in his change:

  1. Lots of backwards incompatible changes with he infoset inputters/outputters now requiring XCC. I don't think this should be mandatory, most people don't deal with CR/LF acually in their data the default should be fine.

  2. I'm no sure about storing the config in the dataprocessor. Feel like XCC should just be another paramter that the CLI/TDMLRunner etc extracts from the config (or not) just like variable/tunables/ec and passes it to the infoset inputter/ouputter.

I also sort of wish this was split up into separae PRs for refactoring of the remapping stuff and changing how the config is handled and API changes. The refactoring is uncontroversial, but the config/API changes feel more controversial and maybe there's a better approach to consider.

@@ -0,0 +1,3 @@
abc
def
ghi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you can rely on git not munging these CRLFs on windows. We need an alternative method to ensure that we actually send CRLF as test data. One option is to use withTempFile, write the bytes directly to that and then parse with the input file being the temp file.

// rename the src/test dir to src/test1. Rename the src/it dir to src/test.
// Then modify this val to be "test".
// Then you can run these as ordinary junit-style tests under the IDE.
val test = "it"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer we solve this issue and update our entire integration suite in a separate PR. Making one-off change to a subset of our integration tests made it really difficult to update everything to be consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm of the opinion that the 267 integration tests run fast enough now that we should just lump them in under regular src/test now, and not bother with the separate src/it.

However, it would be good to move the stage build tasks so that you don't have to wait for the javadoc/scaladoc to build every time you want to make a small changes and then rerun the tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created https://issues.apache.org/jira/browse/DAFFODIL-2772 about fixing the src/it tests to be ordinary src/test.

withTempFile { output =>
val schema = path(s"daffodil-cli/src/$test/resources/org/apache/daffodil/CLI/aString.dfdl.xsd")
val config = path(s"daffodil-cli/src/$test/resources/org/apache/daffodil/CLI/config-convertCR.cfg.xml")
val input = path(s"daffodil-cli/src/$test/resources/org/apache/daffodil/CLI/input/inputWithCRLFs.txt.dat")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest moving these outside the withTempFile block since they don't need it. Makes it more consistent with the rest of our test suite.



val xml = FileUtils.readFileToString(xmlOut.toFile, StandardCharsets.UTF_8)
println(xml)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove prinln


var cmd = args"parse -s $schema -c $config --root a -o $xmlOut $input "
runCLI(cmd) { cli =>
cli.expect(s"")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The body of he runCLI function can be empty. You don't need to expect the empty string.

</xs:all>
</xs:complexType>
</xs:element>
<xs:element name="tunables" type="tns:tunablesType"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change? It's hard to tell if there are actual changes to the tunables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm inclined to remove all changes to the config file stuff, switch to a simple properties file for the xml controls, and treat the existing config file stuff as legacy.

However, it has become clear to me that some schemas just require a config file in order to be used, and the same will hold for these xml conversion control parameters. Without them the CLI parse/unparse, the TDML runner, or using them via API are going to always require a config along with certain XML conversion controls.

<!--
TODO: really these all should have been XML attributes
-->
<xs:element name="allowExpressionResultCoercion" type="xs:boolean" default="true" minOccurs="0" form="unqualified">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this all copy paste from above, just with a different type? Is there a way around this? This is going to be a pain to keep these in sync.

<xs:choice>
<!-- allow qualified or unqualified form -->
<xs:element ref="tns:tunables" minOccurs="0"/>
<xs:element name="tunables" minOccurs="0" type="tns:tunablesTypeUnqualified" form="unqualified"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we just pick one and go with it (or switch to a easier config format like .properties) and not worry about backwards compa for config files. I don't think they are used that often, and it's easy to change them.

val bos = new java.io.ByteArrayOutputStream()
val xml = new XMLTextInfosetOutputter(bos, true)
val xml = new XMLTextInfosetOutputter(bos, true, daffodilConfig.xmlConversionControl)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest the debugger is just opinionated about this XML conversions, or if we do want it to be modifiable then it be a debugger setting with the set command. Having to pass around this config for one option feels unnecessary.

override val xmlConversionControl: XMLConversionControl) extends InfosetInputter
with XMLInfosetInputterMixin {

lazy val events: Array[NullInfosetInputter.Event] = toEvents(is)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is incorrect. Converting to an array of events must happen outside of the creation of the NullInfosetInputer to avoid including converting the stream to events in performance metrics. Which means converting it to events must know about the XCC and not the infoset inputter.

I think changes to this file need to be reverted.

@mbeckerle mbeckerle changed the title Config feature to allow XMLConversionControl to preserve CRLF and CR in data WIP: Draft - Config feature to allow XMLConversionControl to preserve CRLF and CR in data Jan 17, 2023
Copy link
Contributor Author

@mbeckerle mbeckerle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching this back to a WIP/Draft PR.

There is much work to do on this before it is sensible to review it again.

@@ -1 +1 @@
Hello World
Hello World
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why this file showed up here. It is identical to prior version.

@@ -1,56 +1,56 @@
<?xml version="1.0" encoding="UTF-8"?>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file also identical. I don't know why it is here in this changeset.

// rename the src/test dir to src/test1. Rename the src/it dir to src/test.
// Then modify this val to be "test".
// Then you can run these as ordinary junit-style tests under the IDE.
val test = "it"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm of the opinion that the 267 integration tests run fast enough now that we should just lump them in under regular src/test now, and not bother with the separate src/it.

However, it would be good to move the stage build tasks so that you don't have to wait for the javadoc/scaladoc to build every time you want to make a small changes and then rerun the tests.

}

//
// Illustrates a problem with the expect library (perhaps?)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working as designed then. I will just remove this test.

@@ -332,7 +332,7 @@ class TestExternalVariables {
val dp1 = pf.onPath("/")
val dp2 = pf.onPath("/").withExternalVariables(variables)

val outputter = new ScalaXMLInfosetOutputter()
val outputter = new ScalaXMLInfosetOutputter(dp2.daffodilConfig.xmlConversionControl)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point.

I only attached the config to the dp because i learned by trying it, that the xmlConversionControl object needed to get passed to exactly everywhere the dp was already being passed. Because to run something, in the CLI, or in the TDML runner, you have to configure both the daffodil DP, AND at least one infoset inputter/outputter, which usually means one of the XML ones. So the two objects (DP and config info) just go together everywhere.

But I will look into refactoring this so that the DP object doesn't carry the config at all.

That will straighten out the NullInfosetInputter as well.

if (len != 0)
while (pos < len) {
c = s(pos)
if (remapChar(c) != c)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug. Yes, a lone surrogate (which you can only tell from context) must be remapped.

*/
protected def remap (prev: Int, curr: Int, next: Int): Int

private def needsRemapping(s: String): Boolean = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good fix. Avoids on average 1/2 a pass over the data. Fixes surrogate bug with needsRemapping.

</xs:all>
</xs:complexType>
</xs:element>
<xs:element name="tunables" type="tns:tunablesType"/>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm inclined to remove all changes to the config file stuff, switch to a simple properties file for the xml controls, and treat the existing config file stuff as legacy.

However, it has become clear to me that some schemas just require a config file in order to be used, and the same will hold for these xml conversion control parameters. Without them the CLI parse/unparse, the TDML runner, or using them via API are going to always require a config along with certain XML conversion controls.

@mbeckerle mbeckerle removed the request for review from jadams-tresys January 17, 2023 21:51
@mbeckerle mbeckerle marked this pull request as draft January 17, 2023 21:51
@mbeckerle
Copy link
Contributor Author

Scrapping this PR.

@mbeckerle mbeckerle closed this Aug 16, 2023
@mbeckerle mbeckerle deleted the daf-1559-crlf branch August 17, 2023 11:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants