Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading a Delimited Files with extra columns #12186

Closed
jdunkerley opened this issue Jan 29, 2025 · 6 comments · Fixed by #12231
Closed

Reading a Delimited Files with extra columns #12186

jdunkerley opened this issue Jan 29, 2025 · 6 comments · Fixed by #12231
Assignees
Labels
-libs Libraries: New libraries to be implemented

Comments

@jdunkerley
Copy link
Member

Currently, if reading a file with extra columns after the first row it is only possible to get a warning or drop the rows.

Want to change the Delimited argument for keep_invalid_rows to an atom type:

  • Invalid_Rows.Drop_Invalid_Rows
  • Invalid_Rows.Keep_Invalid_Rows
  • Invalid_Rows.Add_Extra_Column (adds new columns with default names and nulls above and below).

In all cases we should still warn about it.

@jdunkerley jdunkerley added the -libs Libraries: New libraries to be implemented label Jan 29, 2025
@jdunkerley
Copy link
Member Author

A,B,C.csv

@radeusgd radeusgd moved this from ❓New to 📤 Backlog in Issues Board Jan 29, 2025
@radeusgd radeusgd moved this from 📤 Backlog to 🔧 Implementation in Issues Board Feb 3, 2025
@enso-bot
Copy link

enso-bot bot commented Feb 4, 2025

Radosław Waśko reports a new STANDUP for yesterday (2025-02-03):

Progress: Catching up and updating single-column Table PR after reviews. Experiment with PoC of JUnit tests for std-bits. Start on extra columns in Delimited - added new types and updated API. It should be finished by 2025-02-05.

Next Day: Next day I will be working on the same task. Update tests, finish updating the logic.

@radeusgd
Copy link
Member

radeusgd commented Feb 4, 2025

On the backlog meeting we have discussed that it will probably make more sense to rename the keep_invalid_rows argument into on_invalid_rows.

I'd also suggest to consider on_column_count_mismatch

The problem with the rename is that it would be a breaking change.

We considered if we can keep the old name as an optional argument, but because Delimited is a constructor, that means that the Delimited_Format datatype will have two conflicting fields - new on_invalid_rows and the old one (keep_invalid_rows) that drive the same behaviour, 'denormalizing' the data type by keeping redundant information. That seemed not ideal.

(If it was not a constructor but a method we could check if only one argument is set and then select which one to use, but in a constructor we cannot run custom logic).

Because keeping the old field as a 'fallback' seems to cause more harm, we considered if we should move forward with the breaking change - renaming the field in a breaking way. @jdunkerley we wanted to hear your opinion on this, what do you think?

The biggest trouble with this is that if there are any workflows that used the keep_invalid_rows argument by name (e.g. Delimited ',' keep_invalid_rows=False) which happens in the IDE if only some arguments of the atom are customized, then the error that they will be getting will be:

Type error: expected a function, but got Delimited.

which unfortunately is very confusing. This was already discussed in issue #7359 a long time ago, but due to the difficulty we reached only a partial solution. Perhaps this needs revisiting?

@jdunkerley
Copy link
Member Author

As discussed, lets call it on_invalid_rows and look to catch the "expected a function" in Delimited_Format.resolve.

@enso-bot
Copy link

enso-bot bot commented Feb 4, 2025

Radosław Waśko reports a new STANDUP for today (2025-02-04):

Progress: Finished the implementation and updated tests. Discussing argument naming and breaking changes. Put up a Draft PR. It should be finished by 2025-02-05.

Next Day: Next day I will be working on the same task. Rename the argument. Amend tests. Try issuing a more friendly error when old name is used.

@mergify mergify bot closed this as completed in #12231 Feb 5, 2025
mergify bot pushed a commit that referenced this issue Feb 5, 2025
@github-project-automation github-project-automation bot moved this from 🔧 Implementation to 🟢 Accepted in Issues Board Feb 5, 2025
@enso-bot
Copy link

enso-bot bot commented Feb 6, 2025

Radosław Waśko reports a new STANDUP for yesterday (2025-02-05):

Progress: Renamed the argument and described in changelog. Got the Delimited extra columns PR merged. Created a prototype that can improve the error messages on such argument name changes or typos. It should be finished by 2025-02-05.

Next Day: Next day I will be working on the #12136 task. Finish the prototype PR after review. Start work on key-pair auth for Snowflake.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
-libs Libraries: New libraries to be implemented
Projects
Status: 🟢 Accepted
Development

Successfully merging a pull request may close this issue.

2 participants