Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RabbitInAHat fails to load a custom model if BOM is set #411

Closed
BillCM opened this issue Apr 24, 2024 · 7 comments
Closed

RabbitInAHat fails to load a custom model if BOM is set #411

BillCM opened this issue Apr 24, 2024 · 7 comments
Assignees
Labels
bug Rabbit-In-A-Hat Issue concerning the specification of mapping rules

Comments

@BillCM
Copy link

BillCM commented Apr 24, 2024

Describe the bug
I created a custom model in Excel (XLSX) and exported to CSV. This file failed to load and resulted in this error

java.lang.IllegalArgumentException: Mapping for table not found, expected one of [table, field, required, type, schema, description] at org.apache.commons.csv.CSVRecord.get(CSVRecord.java:121) at org.ohdsi.rabbitInAHat.dataModel.Database.generateModelFromCSV(Database.java:117) at org.ohdsi.rabbitInAHat.RabbitInAHatMain.doSetTargetCustom(RabbitInAHatMain.java:465) at org.ohdsi.rabbitInAHat.RabbitInAHatMain.lambda$createMenuBar$9(RabbitInAHatMain.java:268)

The problem is that CSV parsing does not account for the Byte Order Mark (BOM).

To Reproduce
Steps to reproduce the behavior:

  1. Create a custom model in Excel (XLSX)
  2. Export the model to CSV
  3. Use Edit -> Set Target Database -> Load Custom
  4. Select the exported CSV, receive a popup error that the column headers were not found.

Expected behavior
CSV opens correctly.

Workaround
Open the exported CSV in a text editor and change the encoding from "UTF-8 with BOM" to "UTF-8"

Desktop (please complete the following information):

  • OS: OSX 14.4.1
  • Java: 11.0.21
  • Rabbit Version : 10.8a

Additional context
The issue

@BillCM BillCM added the bug label Apr 24, 2024
@janblom janblom self-assigned this Apr 24, 2024
@janblom janblom added the Rabbit-In-A-Hat Issue concerning the specification of mapping rules label Apr 24, 2024
@janblom
Copy link
Collaborator

janblom commented Apr 24, 2024

@BillCM it is possible to attach the CSV file that causes the problem to this issue (or a stripped down one, as long as it causes the same problem). This can can save me some time when building a test case.

Thanks,
Jan

janblom added a commit to thehyve/OHDSI-WhiteRabbit that referenced this issue Apr 25, 2024
@BillCM
Copy link
Author

BillCM commented Apr 29, 2024

@janblom Correct. CSV exported from Excel have the Byte Order Mark set. The only way to make RabbitInAHat to read the file is to remove the BOM by changing the encoding. Perhaps this it worth a note in the docs?

@janblom
Copy link
Collaborator

janblom commented Apr 29, 2024

Hi @BillCM , thank you for reporting this issue.

I have prepared a fix already which adds flexibility, so that RabbitInAHat can read CSV's with and without a BOM. This will be part of the upcoming 1.0 release. (Unfortunately testing another aspect of that release is taking some time. )

Since this issue will be fixed, it is not necessary to update the docs. This issue will serve as the (temporary) documentation until the fix is released, and the issue closed. (the fix is in my employers public repo until I have it approved and merged into the OHDSI repo).

If possible, could you attach a CSV to this issue that I can use to reproduce the bug? While I am fairly confident that the upcoming fix will cover your case, there is nothing better than having the certainty :-)

Thanks,
Jan

@BillCM
Copy link
Author

BillCM commented Apr 30, 2024

@janblom I think this very issue is causing the build to break. It appears that the embedded CSVs for CDM5.0 and CDM5.1 and their stem models are all being identified as Excel encoded with BOM. This is causing the mvn build to fail for main branch on my machine.

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-resources-plugin:3.3.1:resources (default-resources) on project rabbitinahat: filtering /Users/bill/ext/WhiteRabbit/rabbitinahat/src/main/resources/org/ohdsi/rabbitInAHat/dataModel/StemTableV5.0.csv to /Users/bill/ext/WhiteRabbit/rabbitinahat/target/classes/org/ohdsi/rabbitInAHat/dataModel/StemTableV5.0.csv failed with MalformedInputException: Input length = 1 -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-resources-plugin:3.3.1:resources (default-resources) on project rabbitinahat: filtering /Users/bill/ext/WhiteRabbit/rabbitinahat/src/main/resources/org/ohdsi/rabbitInAHat/dataModel/StemTableV5.0.csv to /Users/bill/ext/WhiteRabbit/rabbitinahat/target/classes/org/ohdsi/rabbitInAHat/dataModel/StemTableV5.0.csv failed with MalformedInputException

Upon opening the code in IntelliJ, the 4 files in question are linked to the Excel icon and will not open for editing.
Screenshot 2024-04-30 at 2 00 50 PM

After converting the files to UTF-8, the build works.

@janblom
Copy link
Collaborator

janblom commented May 1, 2024

I am unable to reproduce the last report, both on Linux and MacOS. Could it be that the csv files related to this were inadvertedly changed? I suspect an encoding problem (setting in your machine, such as locale) but I am unable to verify that. Since this is very likely not related to the issue reported here first, I will not investigate this further in this context. If you do think this is a problem of the WhiteRabbit project, please report this in a separate issue.

It is in any case not related to the first problem reported in this issue (I was able to confirm that). The original problem is now fixed in the release-1.0.0 branch, including a second BOM related issue in WhiteRabbit. It will be in the planned 1.0.0 release, hopefully soon.

@janblom
Copy link
Collaborator

janblom commented May 3, 2024

A fix for the first issue reported in this thread is included with the second release candidate of version 1.0.0

janblom added a commit that referenced this issue Aug 5, 2024
* Create release 1.0.0

* Enforce consistent ordering of the tables in the scan report (solves issue #236)

* Snowflake: always use database and schema when accessing table (meta)data. Fixes issue #409

* Update Snowflake JDBC version and activate+fix Snowflake integration tests

* Upgrade dependency, testcontainer version and fix MSSqlServer integration test.

* Only run Snowflake integration tests when a Snowflake access configuartion is available

* Switch to SQL for obtaining field metadata for Snowflake (default, JDBC can still be used through a system property or env.var)

* Fix for #411 (can't process custom models with UTF8 BOM in csv file)

* Better method naming and clearer logging for SnowflakeHandler

* Add UTF BOM handling code reading of csv's

* Change to ojdbc8 version 19.23.0.0 (for Oracle). Different (sub)repo, more recently published, solves issue #415

* Avoid testing results for integration test with externally loaded BigQuery JDBC jar: makes setup more simple
@janblom
Copy link
Collaborator

janblom commented Aug 5, 2024

Seolved in WhiteRabbit version 1.0.0

@janblom janblom closed this as completed Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Rabbit-In-A-Hat Issue concerning the specification of mapping rules
Projects
None yet
Development

No branches or pull requests

2 participants