Bad y axis scaling with color aesthetic #560

Closed
durcan opened this issue Mar 12, 2015 · 8 comments

durcan commented Mar 12, 2015

This is all in current master of Gadfly and Julia 0.3.7.

Currently, using Scale.y_log10 with the color aesthetic bound produces bad y-axis scaling. The following code:

using Gadfly, RDatasets, Lazy

d = dataset("ggplot2", "diamonds")
@>> begin 
    plot(d,
    x=:Price,
    Scale.y_log10,
    Geom.histogram)
    draw(PNG(12cm, 12cm))
end

produces:
[image: aa_plt1]

but if we try to bind color:

@>> begin 
    plot(d,
    x=:Price,
    color=:Color,
    Scale.y_log10,
    Geom.histogram(position=:stack))
    draw(PNG(12cm, 12cm))
end

this is the result:
[image: aa_plt2]

Clearly the y scale is way off. A similar thing happens with a linear y axis and position=:dodge. The default stacked position scales correctly:

@>> begin 
    plot(d,
    x=:Price,
    color=:Color,
    Geom.histogram(position=:stack))
    draw(PNG(12cm, 12cm))
end

[image: aa_plt3]

But the dodge position does not:

@>> begin 
    plot(d,
    x=:Price,
    color=:Color,
    Geom.histogram(position=:dodge))
    draw(PNG(12cm, 12cm))
end

[image: aa_plt4]

The correct behavior can be manually simulated with layers (Alpha channel to the rescue!) but it is slightly awkward (and this is only for three of the seven colors):

@>> begin 
    plot(
    Scale.y_log10,
    layer(d[d[:Color].=="J",:],
    x=:Price,
    Theme(default_color=Color.AlphaColorValue(Color.color("Turquoise"), 0.5)),
    Geom.histogram),
    layer(d[d[:Color].=="I",:],
    x=:Price,
    Theme(default_color=Color.AlphaColorValue(Color.color("Coral"), 0.5)),
    Geom.histogram),
    layer(d[d[:Color].=="E",:],
    x=:Price,
    Theme(default_color=Color.AlphaColorValue(Color.color("DarkViolet"), 0.5)),
    Geom.histogram))
    draw(PNG(12cm, 12cm))
end

[image: aa_plt5]

As an aside (or maybe a separate issue), this sort of plot can work very well as a step plot (where only the outline of the histogram is shown) instead of a bar-style histogram. Sadly, Stat.histogram seems to clobber Stat.step, so Geom.step with Stat.histogram is the same as Geom.line with Stat.histogram:

@>> begin 
    plot(d,
    x=:Price,
    Scale.y_log10,
    Stat.histogram,
    Geom.step)
    draw(PNG(12cm, 12cm))
end

[image: aa_plt6]

Collaborator

dcjones commented Mar 13, 2015

Ok, I see what the problem is.

Honestly, apart from the broken y-axis scale extents, I'm a little unsure of what the correct behavior for stacked bar charts on a log y-axis really is.

Suppose you make a stacked bar chart with two values: a=5.0, b=5.0, and you plot it on the log10 scale. Depending on the order of a and b, you could get either:
[image: option1]
or
[image: option2]
The sensitivity to ordering, and the fact that the heights are no longer really additive, makes this feel misleading.

So I wonder: is it more or less misleading to put the height of the entire stack on a log scale and then divide the stack up linearly, so you get this?
[image: option3]
I can't find many examples of plots like this, so I'm not sure which is expected. Any thoughts?
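
For reference, the arithmetic behind the two approaches can be sketched in plain Julia (no Gadfly involved). The function names are hypothetical, the code is written for current Julia rather than 0.3, and it assumes the bar's baseline sits at y = 1 (that is, log10(1) = 0) and that all values are counts of at least 1:

```julia
# Segment heights, in log10 units, for a stacked bar.
# Assumption: the baseline is y = 1, so the bar starts at log10(1) = 0.

# Naive stacking: stack the raw values, then log-transform the boundaries.
function naive_segments(values)
    tops = log10.(cumsum(values))
    bottoms = vcat(0.0, tops[1:end-1])
    return tops .- bottoms
end

# Proportional stacking: log-transform the total, then split it linearly.
function proportional_segments(values)
    total = sum(values)
    return log10(total) .* (values ./ total)
end

a, b = 5.0, 5.0
naive = naive_segments([a, b])                # bottom segment is taller than the top one
proportional = proportional_segments([a, b])  # both segments have equal height
```

In both cases the segments sum to log10(a + b), so the top of the bar lands on the right tick. Only the naive version makes two equal values look unequal, and which one looks bigger depends on the stacking order.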

Collaborator

dcjones commented Mar 13, 2015

What ggplot2 does is separately compute log10(a) and log10(b) and then stack them:
[image: try]

This also seems confusing to me, because it makes it look like 5 + 5 > 10. I don't know what to believe anymore... 😕
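
A small sketch of why the stacked-logs version reads that way, using the same a = b = 5 example (plain Julia; the variable names are only illustrative). Adding the logs is the same as taking the log of the product, so the top of the bar sits at a * b rather than a + b:

```julia
a, b = 5.0, 5.0

# Height drawn when log10(a) and log10(b) are stacked on top of each other.
stacked_log_height = log10(a) + log10(b)

# Value a reader would read off the log axis at the top of the bar.
apparent_total = 10^stacked_log_height   # equals a * b, up to rounding

# Value the reader probably expects the top of the bar to show.
true_total = a + b
```

For these inputs the bar tops out near 25 on the axis, while the actual sum is 10.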

Collaborator

dcjones commented Mar 13, 2015

Fixed for dodge now. Still need to decide what to do with stack.

Your second problem, statistics clobbering, happens because only one statistic is currently allowed per layer, so they do get clobbered. I think I avoided allowing multiple stats because it would make the arguments to plot sensitive to ordering (i.e. in deciding which statistic is applied first), whereas order doesn't matter now. But that's not a very compelling reason to restrict this.

Author

durcan commented Mar 13, 2015

I had not really considered stacked histograms with a log scale. I agree that there is no optimal solution, but I feel that ggplot2's is more misleading than your naive approach (although in your examples I think the larger block of color goes on the bottom, not on top; it is possible that I am thinking about this wrong). In general I am not a huge fan of stacking. Even on a linear scale, stacked plots are pretty hard to look at. I mean, if you look at my stacked plot above, do you really feel like you have any idea what the distribution of I-colored diamonds is? I certainly do not. I guess what I am saying is: don't sweat it too much. It will be confusing no matter what you do.

With "dodge" (or my hand-rolled overlap) the only problem is what to do if you want to somehow normalize the different distributions. This can be useful if you want to compare the tails or something, but you happen to have a different number of samples from each population. density=true normalizes to unit area (I believe), which makes for some very crazy viewing on a log scale. Ideally each would be rescaled to a common area such that each bin had a height of at least one. Anyway, I am not sure what something like this would look like ideally.
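
One possible reading of that rescaling, sketched in plain Julia on precomputed bin counts. This is only an illustration, not anything Gadfly does: rescale_to_common_total is a hypothetical helper, the bins are assumed to be equal-width and shared across groups, and every group is assumed to have at least one observation. Scaling every group up to the largest group's total keeps each scale factor at 1 or more, so no nonempty bin drops below a height of one:

```julia
# counts maps a group name to its vector of bin counts.
# Assumptions: equal-width bins shared by all groups, no empty groups.
function rescale_to_common_total(counts::Dict{String,Vector{Int}})
    # Use the largest group total as the common target, so factors are >= 1.
    target = maximum(sum(c) for c in values(counts))
    return Dict(name => c .* (target / sum(c)) for (name, c) in counts)
end

counts = Dict("E" => [40, 30, 20, 10], "J" => [4, 3, 2, 1])
scaled = rescale_to_common_total(counts)
```

With equal-width bins, equal totals mean equal areas, so the groups become directly comparable while staying at or above one on a log axis.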

As for the clobbering, I thought something like that might be happening. I am not sure I have a brilliant idea about what general combinations of stats should mean. I do think that using Geom.step for histograms (especially when you want to plot a bunch of overlapping ones) is fairly desirable though.
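
To make the step-style histogram concrete, here is a minimal sketch of the outline vertices computed from bin edges and counts, in plain Julia. step_outline is a hypothetical helper, not part of Gadfly, and it assumes the binning has already been done elsewhere:

```julia
# Build the vertices of a histogram outline (top edge only).
# edges has one more element than counts: bin i spans edges[i]..edges[i+1].
function step_outline(edges::AbstractVector, counts::AbstractVector)
    length(edges) == length(counts) + 1 ||
        throw(ArgumentError("need exactly one more edge than counts"))
    xs = repeat(edges, inner=2)[2:end-1]  # each interior edge appears twice
    ys = repeat(counts, inner=2)          # each count spans its whole bin
    return xs, ys
end

xs, ys = step_outline([0, 1, 2, 3], [5, 2, 7])
```

Drawing a plain line through (xs, ys) then gives the stepped outline, which is what makes several overlapping distributions readable on one set of axes.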

Again, thanks for the speedy response.

@binarybana
Contributor

👍 to your proposal for proportional stacking with the correct summed heights for log histograms. Usually, the only useful information I can glean out of stacked plots is a rough sense of proportion for any given region of the histogram so that makes a lot of sense to me. And bar heights being order sensitive just gives me the willies.

Collaborator

dcjones commented Mar 27, 2015

👍 to your proposal for proportional stacking with the correct summed heights for log histograms.

Yeah, this seems like the best option (or maybe the worst option except for all the others). There's still the potential to mislead, but not as badly as the alternatives, and the resulting plot is pretty useful.

[image: try1]

@CorySimon

When I try the opaque bars in different layers, I get an error:

`convert` has no method matching convert(::Type{Union(Nothing,ColorValue{T})},     ::AlphaColorValue{RGB{Float64},Float64})
while loading In[149], in expression starting on line 1

  in Theme at /home/corymsimon/.julia/v0.3/Gadfly/src/varset.jl:53

Collaborator

dcjones commented Mar 28, 2015

You may have to be on Gadfly master to do that. I haven't tagged a new version in a while.
