RabbitInAHat fails to load a custom model if BOM is set #411

BillCM · 2024-04-24T16:43:35Z

Describe the bug
I created a custom model in Excel (XLSX) and exported to CSV. This file failed to load and resulted in this error

java.lang.IllegalArgumentException: Mapping for table not found, expected one of [table, field, required, type, schema, description] at org.apache.commons.csv.CSVRecord.get(CSVRecord.java:121) at org.ohdsi.rabbitInAHat.dataModel.Database.generateModelFromCSV(Database.java:117) at org.ohdsi.rabbitInAHat.RabbitInAHatMain.doSetTargetCustom(RabbitInAHatMain.java:465) at org.ohdsi.rabbitInAHat.RabbitInAHatMain.lambda$createMenuBar$9(RabbitInAHatMain.java:268)

The problem is that CSV parsing does not account for the Byte Order Mark (BOM).

To Reproduce
Steps to reproduce the behavior:

Create a custom model in Excel (XLSX)
Export the model to CSV
Use Edit -> Set Target Database -> Load Custom
Select the exported CSV, receive a popup error that the column headers were not found.

Expected behavior
CSV opens correctly.

Workaround
Open the exported CSV in a text editor and change the encoding from "UTF-8 with BOM" to "UTF-8"

Desktop (please complete the following information):

OS: OSX 14.4.1
Java: 11.0.21
Rabbit Version : 10.8a

Additional context
The issue

The text was updated successfully, but these errors were encountered:

janblom · 2024-04-24T16:51:00Z

@BillCM it is possible to attach the CSV file that causes the problem to this issue (or a stripped down one, as long as it causes the same problem). This can can save me some time when building a test case.

Thanks,
Jan

…prefix)

BillCM · 2024-04-29T13:32:37Z

@janblom Correct. CSV exported from Excel have the Byte Order Mark set. The only way to make RabbitInAHat to read the file is to remove the BOM by changing the encoding. Perhaps this it worth a note in the docs?

janblom · 2024-04-29T13:44:53Z

Hi @BillCM , thank you for reporting this issue.

I have prepared a fix already which adds flexibility, so that RabbitInAHat can read CSV's with and without a BOM. This will be part of the upcoming 1.0 release. (Unfortunately testing another aspect of that release is taking some time. )

Since this issue will be fixed, it is not necessary to update the docs. This issue will serve as the (temporary) documentation until the fix is released, and the issue closed. (the fix is in my employers public repo until I have it approved and merged into the OHDSI repo).

If possible, could you attach a CSV to this issue that I can use to reproduce the bug? While I am fairly confident that the upcoming fix will cover your case, there is nothing better than having the certainty :-)

Thanks,
Jan

BillCM · 2024-04-30T18:11:54Z

@janblom I think this very issue is causing the build to break. It appears that the embedded CSVs for CDM5.0 and CDM5.1 and their stem models are all being identified as Excel encoded with BOM. This is causing the mvn build to fail for main branch on my machine.

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-resources-plugin:3.3.1:resources (default-resources) on project rabbitinahat: filtering /Users/bill/ext/WhiteRabbit/rabbitinahat/src/main/resources/org/ohdsi/rabbitInAHat/dataModel/StemTableV5.0.csv to /Users/bill/ext/WhiteRabbit/rabbitinahat/target/classes/org/ohdsi/rabbitInAHat/dataModel/StemTableV5.0.csv failed with MalformedInputException: Input length = 1 -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-resources-plugin:3.3.1:resources (default-resources) on project rabbitinahat: filtering /Users/bill/ext/WhiteRabbit/rabbitinahat/src/main/resources/org/ohdsi/rabbitInAHat/dataModel/StemTableV5.0.csv to /Users/bill/ext/WhiteRabbit/rabbitinahat/target/classes/org/ohdsi/rabbitInAHat/dataModel/StemTableV5.0.csv failed with MalformedInputException

Upon opening the code in IntelliJ, the 4 files in question are linked to the Excel icon and will not open for editing.

After converting the files to UTF-8, the build works.

janblom · 2024-05-01T18:26:39Z

I am unable to reproduce the last report, both on Linux and MacOS. Could it be that the csv files related to this were inadvertedly changed? I suspect an encoding problem (setting in your machine, such as locale) but I am unable to verify that. Since this is very likely not related to the issue reported here first, I will not investigate this further in this context. If you do think this is a problem of the WhiteRabbit project, please report this in a separate issue.

It is in any case not related to the first problem reported in this issue (I was able to confirm that). The original problem is now fixed in the release-1.0.0 branch, including a second BOM related issue in WhiteRabbit. It will be in the planned 1.0.0 release, hopefully soon.

janblom · 2024-05-03T12:43:21Z

A fix for the first issue reported in this thread is included with the second release candidate of version 1.0.0

* Create release 1.0.0 * Enforce consistent ordering of the tables in the scan report (solves issue #236) * Snowflake: always use database and schema when accessing table (meta)data. Fixes issue #409 * Update Snowflake JDBC version and activate+fix Snowflake integration tests * Upgrade dependency, testcontainer version and fix MSSqlServer integration test. * Only run Snowflake integration tests when a Snowflake access configuartion is available * Switch to SQL for obtaining field metadata for Snowflake (default, JDBC can still be used through a system property or env.var) * Fix for #411 (can't process custom models with UTF8 BOM in csv file) * Better method naming and clearer logging for SnowflakeHandler * Add UTF BOM handling code reading of csv's * Change to ojdbc8 version 19.23.0.0 (for Oracle). Different (sub)repo, more recently published, solves issue #415 * Avoid testing results for integration test with externally loaded BigQuery JDBC jar: makes setup more simple

janblom · 2024-08-05T13:38:13Z

Seolved in WhiteRabbit version 1.0.0

BillCM added the bug label Apr 24, 2024

janblom self-assigned this Apr 24, 2024

janblom added the Rabbit-In-A-Hat Issue concerning the specification of mapping rules label Apr 24, 2024

janblom added a commit to thehyve/OHDSI-WhiteRabbit that referenced this issue Apr 25, 2024

Fix for issue OHDSI#411 (crash on custom models in CSV with UTF8 BOM …

df1ff30

…prefix)

janblom added a commit that referenced this issue May 1, 2024

Fix for #411 (can't process custom models with UTF8 BOM in csv file)

3df5cc2

janblom closed this as completed Aug 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RabbitInAHat fails to load a custom model if BOM is set #411

RabbitInAHat fails to load a custom model if BOM is set #411

BillCM commented Apr 24, 2024

janblom commented Apr 24, 2024

BillCM commented Apr 29, 2024

janblom commented Apr 29, 2024 •

edited

Loading

BillCM commented Apr 30, 2024 •

edited

Loading

janblom commented May 1, 2024

janblom commented May 3, 2024

janblom commented Aug 5, 2024

RabbitInAHat fails to load a custom model if BOM is set #411

RabbitInAHat fails to load a custom model if BOM is set #411

Comments

BillCM commented Apr 24, 2024

janblom commented Apr 24, 2024

BillCM commented Apr 29, 2024

janblom commented Apr 29, 2024 • edited Loading

BillCM commented Apr 30, 2024 • edited Loading

janblom commented May 1, 2024

janblom commented May 3, 2024

janblom commented Aug 5, 2024

janblom commented Apr 29, 2024 •

edited

Loading

BillCM commented Apr 30, 2024 •

edited

Loading