Question related to unsparsify #1495

johnkerl · 2024-02-13T14:38:54Z

Originally posted by @aborruso in #1418 (comment):

@johnkerl I have an unsparsify-related question.

I have this input csv:

id,Year,Neighbourhood_name,Category,Gender,Amount
1,2019,Emilstorp,0-5 years,Male,15
2,2019,Emilstorp,6-15 years,Female,25
3,2021,Emilstorp,0-5 years,Male,20

If I run

mlr --csv cut -x -f Gender then reshape -s Category,Amount input.csv

I get this error:

mlr: CSV schema change: first keys "id,Year,Neighbourhood_name,0-5 years"; current keys "id,Year,Neighbourhood_name,6-15 years"
mlr: exiting due to data error.

The reshape itself is wrong, because I should cut both Gender and id. But if I change the format flag to --c2m, I get no error, probably because unsparsify is not forced for that output format.

So the error is probably justified, but for a non-expert user the message does not help to find the solution. What do you think? I'm not suggesting an improvement, because I don't have one at the moment.

Thank you

johnkerl · 2024-02-18T17:29:25Z

@aborruso most definitely this cannot be accommodated within CSV output format:

$ mlr --c2j cut -x -f Gender then reshape -s Category,Amount input.csv
[
{
  "id": 1,
  "Year": 2019,
  "Neighbourhood_name": "Emilstorp",
  "0-5 years": 15
},
{
  "id": 2,
  "Year": 2019,
  "Neighbourhood_name": "Emilstorp",
  "6-15 years": 25
},
{
  "id": 3,
  "Year": 2021,
  "Neighbourhood_name": "Emilstorp",
  "0-5 years": 20
}
]

This is not at all related to unsparsify. It's because CSV output must have the same keys for all rows. I had really hoped the error message we're seeing now would be clear ... and to me it is ... but it is not clear to everyone ... 🤔
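
The constraint can be illustrated with a toy model in Python. This is only a sketch, not Miller's actual (Go) implementation: `write_csv_streaming` and `SchemaChangeError` are hypothetical names, and the records are the ones shown in the JSON output above. A writer that emits each row as soon as it arrives has already committed to the first record's header by the time the second record shows up with a different key.

```python
import csv
import io


class SchemaChangeError(Exception):
    """Raised when a record's keys differ from the header already written."""


def write_csv_streaming(records, out):
    """Write records one at a time; the first record fixes the header."""
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for rec in records:
        keys = list(rec)
        if header is None:
            # The header is emitted immediately, before later records are seen.
            header = keys
            writer.writerow(header)
        elif keys != header:
            # The header line is already written, so it cannot be widened.
            raise SchemaChangeError(f"first keys {header}; current keys {keys}")
        writer.writerow(rec.values())


# Records as produced by cut -x -f Gender then reshape -s Category,Amount.
records = [
    {"id": 1, "Year": 2019, "Neighbourhood_name": "Emilstorp", "0-5 years": 15},
    {"id": 2, "Year": 2019, "Neighbourhood_name": "Emilstorp", "6-15 years": 25},
    {"id": 3, "Year": 2021, "Neighbourhood_name": "Emilstorp", "0-5 years": 20},
]

buf = io.StringIO()
try:
    write_csv_streaming(records, buf)
    error = None
except SchemaChangeError as e:
    error = e

print(buf.getvalue(), end="")  # header and first row were already emitted
print("error:", error)
```

In this model the first row is written before the mismatch is detected, which is why the only options left at that point are to fail or to start a new header block.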

aborruso · 2024-02-18T17:44:13Z

> This is not at all related to unsparsify. It's because CSV output must have the same keys for all rows. I had really hoped the error message we're seeing now would be clear ... and to me it is ... but it is not clear to everyone ... 🤔

Please don't hate me if I write something stupid now. But wasn't automatic unsparsify introduced for CSV output?

I'll try to explain with examples. With Miller 5 I get this sparse output:

id,Year,Neighbourhood_name,0-5 years
1,2019,Emilstorp,15

id,Year,Neighbourhood_name,6-15 years
2,2019,Emilstorp,25

id,Year,Neighbourhood_name,0-5 years
3,2021,Emilstorp,20

With Miller 6 I get the error, because CSV output must have the same keys for all rows.
Why don't I get the same output as with Miller 5?

If I explicitly apply unsparsify in Miller 6, I get no error:

mlrgo --csv cut -x -f Gender then reshape -s Category,Amount then unsparsify input.csv

So I thought that, without unsparsify, Miller 6 could give one of these two outputs:

either the same output as Miller 5;
or the output with unsparsify applied automatically.

For the error message, you are right, it is understandable.

aborruso · 2024-02-25T12:05:04Z

Dear @johnkerl I probably wasn't very clear and I'll try to explain again.

If I have this input

{"a":3,"b":"hello"}
{"a":2}

I can write mlr --ijsonl --ocsv cat input.jsonl, without the need to add the unsparsify command. It is applied by default, since the output is CSV.

Instead if I have this input

id,Year,Neighbourhood_name,Category,Gender,Amount
1,2019,Emilstorp,0-5 years,Male,15
2,2019,Emilstorp,6-15 years,Female,25
3,2021,Emilstorp,0-5 years,Male,20

and I run mlrgo --csv cut -x -f Gender then reshape -s Category,Amount input.csv, I must add the verb unsparsify, although the output here is also CSV. Otherwise I get an error.

Couldn't unsparsify always be applied implicitly as the final verb whenever the output is a rectangular format?

Thank you

johnkerl · 2024-02-26T05:11:02Z

@aborruso this is one of those cases where we would need to read all output before producing any output, and I'm not comfortable doing that as a default behavior. That would break Miller's streaming-when-it-can feature, which is one of its great strengths, only to accommodate some corner-case data. Since the data being produced here are irregular, manually specified unsparsify is the correct approach.
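
Why this needs all records up front can be sketched in Python. This is a toy model under stated assumptions, not Miller's code: the `unsparsify` function below and its `fill` parameter are hypothetical, and the records are the ones from this thread. The union of keys is only known after the last record has been read, so not even the first output row can be written before then.

```python
def unsparsify(records, fill=""):
    """Give every record the union of all keys, filling the gaps with `fill`."""
    # Pass 1: hold every record in memory and collect keys in first-seen order.
    records = list(records)
    all_keys = {}
    for rec in records:
        for key in rec:
            all_keys.setdefault(key, None)  # dict keeps insertion order

    # Pass 2: only now can the first record be emitted with the full key set.
    return [{key: rec.get(key, fill) for key in all_keys} for rec in records]


# Records as produced by cut -x -f Gender then reshape -s Category,Amount.
records = [
    {"id": 1, "Year": 2019, "Neighbourhood_name": "Emilstorp", "0-5 years": 15},
    {"id": 2, "Year": 2019, "Neighbourhood_name": "Emilstorp", "6-15 years": 25},
    {"id": 3, "Year": 2021, "Neighbourhood_name": "Emilstorp", "0-5 years": 20},
]

result = unsparsify(records)
for rec in result:
    print(rec)
```

In this model memory use grows with the number of records, which is the cost that an implicit, always-on unsparsify would impose on every rectangular output.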

aborruso · 2024-02-26T06:19:32Z

Thank you very much

aborruso · 2024-02-26T07:29:52Z

@johnkerl for me you can close this.
I got confused because I could not find in the documentation that this automatic behavior only applies when Miller is "streaming-when-it-can".
I probably just missed it; I'm sure it is explained well there.

Thank you and sorry for this somewhat off-topic and erroneous issue

johnkerl mentioned this issue Feb 13, 2024

CSV to JSONL: wrong conversion? #1418

Closed

johnkerl closed this as completed Feb 26, 2024
