Question related to unsparsify #1495

Closed
johnkerl opened this issue Feb 13, 2024 · 6 comments

@johnkerl
Owner

johnkerl commented Feb 13, 2024

Originally posted by @aborruso in #1418 (comment):

@johnkerl I have an unsparsify-related question.

I have this input csv:

id,Year,Neighbourhood_name,Category,Gender,Amount
1,2019,Emilstorp,0-5 years,Male,15
2,2019,Emilstorp,6-15 years,Female,25
3,2021,Emilstorp,0-5 years,Male,20

If I run

mlr --csv cut -x -f Gender then reshape -s Category,Amount input.csv

I have this error:

mlr: CSV schema change: first keys "id,Year,Neighbourhood_name,0-5 years"; current keys "id,Year,Neighbourhood_name,6-15 years"
mlr: exiting due to data error.

The reshape itself is wrong, because I should cut both Gender and id. But if I change the format to --c2m, I get no error, probably because the unsparsify command is not forced.

So the error is probably correct, but for a non-expert user the message does not help to find the solution. What do you think? I'm not suggesting any improvement, because I don't have one at the moment.

Thank you

@johnkerl
Owner Author

@aborruso most definitely this cannot be accommodated within CSV output format:

$ mlr --c2j cut -x -f Gender then reshape -s Category,Amount input.csv
[
{
  "id": 1,
  "Year": 2019,
  "Neighbourhood_name": "Emilstorp",
  "0-5 years": 15
},
{
  "id": 2,
  "Year": 2019,
  "Neighbourhood_name": "Emilstorp",
  "6-15 years": 25
},
{
  "id": 3,
  "Year": 2021,
  "Neighbourhood_name": "Emilstorp",
  "0-5 years": 20
}
]

This is not at all related to unsparsify. It's because CSV output must have the same keys for all rows. I had really hoped the error message we're seeing now would be clear ... and to me it is ... but it is not clear to everyone ... 🤔
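To illustrate why a streaming CSV writer has to reject this output, here is a minimal Python sketch. It is not Miller's actual code (Miller 6 is written in Go); the names `write_csv_streaming` and `SchemaChangeError` are hypothetical, and the error text only imitates the message quoted above. The assumption is a writer that emits the header from the first record and cannot go back to change it once later records arrive.

```python
import csv
import io


class SchemaChangeError(Exception):
    """Raised when a record's keys differ from the header already written."""


def write_csv_streaming(records, out):
    """Write records as CSV one at a time, fixing the header from the first record."""
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for record in records:
        keys = list(record.keys())
        if header is None:
            # The header line is emitted immediately and cannot be revised later.
            header = keys
            writer.writerow(header)
        elif keys != header:
            raise SchemaChangeError(
                'schema change: first keys "{}"; current keys "{}"'.format(
                    ",".join(header), ",".join(keys)
                )
            )
        writer.writerow(record.values())


# The three records produced by cut + reshape in the example above.
records = [
    {"id": 1, "Year": 2019, "Neighbourhood_name": "Emilstorp", "0-5 years": 15},
    {"id": 2, "Year": 2019, "Neighbourhood_name": "Emilstorp", "6-15 years": 25},
    {"id": 3, "Year": 2021, "Neighbourhood_name": "Emilstorp", "0-5 years": 20},
]

out = io.StringIO()
try:
    write_csv_streaming(records, out)
except SchemaChangeError as err:
    error_message = str(err)
```

In this sketch the first record is written before the second one is seen, so by the time the key "6-15 years" shows up, the header line is already on the output and the only options are to fail or to start a new header block.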

@aborruso
Contributor

> This is not at all related to unsparsify. It's because CSV output must have the same keys for all rows. I had really hoped the error message we're seeing now would be clear ... and to me it is ... but it is not clear to everyone ... 🤔

Please don't hate me if I write something stupid now. But wasn't automatic unsparsify introduced for CSV output?

I'll try to explain with examples. If I use Miller 5, I get this sparse output:

id,Year,Neighbourhood_name,0-5 years
1,2019,Emilstorp,15

id,Year,Neighbourhood_name,6-15 years
2,2019,Emilstorp,25

id,Year,Neighbourhood_name,0-5 years
3,2021,Emilstorp,20

If I use Miller 6, I get the error, because CSV output must have the same keys for all rows.
Why don't I get the same output as in Miller 5?

If I do in fact apply unsparsify in Miller 6, I get no error:

mlrgo --csv cut -x -f Gender then reshape -s Category,Amount then unsparsify input.csv

So I thought that without unsparsify I could have in Miller 6 one of these two outputs:

  • either the one equal to Miller 5;
  • or the one with automatically applied unsparsify

For the error message, you are right, it is understandable.

@aborruso
Contributor

Dear @johnkerl I probably wasn't very clear and I'll try to explain again.

If I have this input

{"a":3,"b":"hello"}
{"a":2}

I can write mlr --ijsonl --ocsv cat input.jsonl, without the need to add the unsparsify command. It is applied by default, since the output is CSV.

Instead if I have this input

id,Year,Neighbourhood_name,Category,Gender,Amount
1,2019,Emilstorp,0-5 years,Male,15
2,2019,Emilstorp,6-15 years,Female,25
3,2021,Emilstorp,0-5 years,Male,20

and I run mlrgo --csv cut -x -f Gender then reshape -s Category,Amount input.csv, I must add the verb unsparsify, although the output here is also CSV. Otherwise I have an error.

Couldn't unsparsify always be applied implicitly as the final verb whenever the output is a rectangular format?

Thank you

@johnkerl
Owner Author

johnkerl commented Feb 26, 2024

@aborruso this is one of those cases where we would need to read all output before producing any output, and I'm not comfortable doing that as a default behavior. That would break Miller's streaming-when-it-can feature, which is one of its great strengths, only to accommodate some corner-case data. Since the data being produced here are irregular, manually specified unsparsify is the correct approach.
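The buffering cost can be seen in a minimal Python sketch of the general idea behind unsparsifying. This is an illustration under assumptions, not Miller's implementation: the function name `unsparsify` and the empty-string fill value are chosen here for the example. The point is that the full set of keys is only known after the last record has been read, so nothing can be written until then.

```python
import csv
import io


def unsparsify(records, fill=""):
    """Return records padded so that each one has the union of all keys seen."""
    buffered = list(records)  # the whole stream is held in memory here
    all_keys = {}
    for record in buffered:
        for key in record:
            all_keys.setdefault(key, None)  # a dict keeps first-seen key order
    return [{key: record.get(key, fill) for key in all_keys} for record in buffered]


# The three irregular records produced by cut + reshape in this thread.
records = [
    {"id": 1, "Year": 2019, "Neighbourhood_name": "Emilstorp", "0-5 years": 15},
    {"id": 2, "Year": 2019, "Neighbourhood_name": "Emilstorp", "6-15 years": 25},
    {"id": 3, "Year": 2021, "Neighbourhood_name": "Emilstorp", "0-5 years": 20},
]

rows = unsparsify(records)
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(rows[0]), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = out.getvalue()
```

With these three records the sketch yields a rectangular table with both "0-5 years" and "6-15 years" columns and empty cells where a record lacked a key. For three rows the buffering is harmless; for a large or unbounded input it means no output appears until the input ends, which is the trade-off described above.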

@aborruso
Contributor

Thank you very much

@aborruso
Contributor

@johnkerl as far as I'm concerned, you can close this.
I got confused because I could not find in the documentation that this automatic behavior only applies when Miller is "streaming-when-it-can".
I'm sure it is explained very well and I'm the one who doesn't see it.

Thank you, and sorry for this somewhat off-topic and erroneous issue.
