Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Histogram support addition to distplot.py #47

Merged
merged 17 commits into from
Aug 13, 2024

Conversation

imperorrp
Copy link
Collaborator

@imperorrp imperorrp commented Apr 15, 2024

This PR aims to add support for histograms in distplot.py (Issue #24 ), allowing marginal densities to be visualized in this new form, apart from preexisting KDE and ECDF forms


📚 Documentation preview 📚: https://arviz-plots--47.org.readthedocs.build/en/47/

@imperorrp
Copy link
Collaborator Author

@OriolAbril I had to dig a bit deeper into understanding how the plotting actually works as coded, but I think I've managed to get step 1 (computing the histogram) roughly done- I'll probably have to combine these DataArrays into a Dataset and then create the visual element for mapping. Also, I'll remove the print statements I've added in later commits- just added them to help with seeing the outputs. Is something like this alright?- Looping through the data variables in distribution to get DataArrays and then passing that plus the dimensions to be reduced to the xarray_einstats.histogram() function.

Also, I was going through the flow of how density was computed for KDE as an example (in Arviz-Stats). I saw bandwidth computing was done for KDE- but that won't be necessary for plotting histograms, right?

@imperorrp
Copy link
Collaborator Author

Combined the xarray_einstats.numba.histogram produced DataArrays into a Dataset:

Dataset-image

(This is with the Schools dataset)

It's in the style of the KDE dataset returned by distribution.azstats.kde to density

@imperorrp
Copy link
Collaborator Author

imperorrp commented Apr 22, 2024

Changed the plot_axis coords to 'x' and 'y' in place of 'bin_midpoint' and 'bin_height':

image

Also added a visual element function, backend interface and matplotlib implementation with matplotlib.pyplot.bar:

image

The bar widths are different due to the different scales of the x-axis. I'm not sure how this can be fixed yet if keeping the width a consistent value is desired though. Also, the number of bins is set to 20 by default. Some bins have too few values to be visible as their respective bars are too low in height compared to the taller ones.

Copy link
Member

@OriolAbril OriolAbril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is looking very good already. The computation of the histogram is a bit trickier and would not be part of the gsoc project, but I think it will be helpful to get you more familiar with xarray.

I have tried to focus on the plotting and contributing workflow with the comments. Let me know if you have any doubts related to testing.

@@ -99,6 +99,29 @@ def create_plotting_grid(


# "geoms"
def hist(x, height, target, *, width=1.6, bottom=None, align="center", color=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given this is a histogram instead of a bar plot, I would require both left and right bin edges and use those for the bars, therefore removing both width and align parameter. We might consider something like rwidth like matplotlib (relative width) with a default to 1 in case we didn't want bins to be draws with their actual width, but I would leave that for later (if at all).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will require some changes to the data restructure utility function but I am not too worried about this, at some point we'll have a function in arviz-stats to take care of all this.


# number of bins is set to 20 by default
hist = histogram(
da=var_data, dims=density_dims, bins=20, **stats_kwargs.get("density", {})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the default number of bins should be taken care of by histogram function. Moreover, if we believed a different default were needed, it would have to happen via stats_kwargs and setdefault so a user could modify that value.

Copy link
Collaborator Author

@imperorrp imperorrp May 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just modified this in the latest commit to do so

Now by default, with no bins provided (azp.plot_dist(schools, kind='hist')) and the same example dataset, this is produced:
distplot-hist-matplotlib

And with bins=40 provided by a user (azp.plot_dist(schools, kind='hist', stats_kwargs={"density": {"bins":40}})), this is produced:
distplot-hist-matplotlib

Will look at your other comments and make changes for those too now

def hist(x, height, target, *, width=1.6, bottom=None, align="center", color=None):
"""Interface for a hist plot.

Parameters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We haven't been completely consistent adding the docstrings to this module, but the idea is for the documentation here so define the behaviour all backends have to comply with. It also isn't a general plotting backend interface but a common interface for our own plots to use which means we can be quite restrictive about the features.

For example, I would only accept array_like for x, height, width... I don't think the extra edge case of floats would be used anywhere. It also means the descriptions should not depend on any backend.

Comment on lines 119 to 120
color : color or array-like, optional
The colors of the bar faces.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might start with the same color for all rectangles, mostly because I am not sure how easy it will be to support the different color rectangles in all backends. If we then see it is feasible and of interest we can add that extra case.

There should also be a facecolor and an edgecolor arguments given there is a color argument and the generated form is "fillable" to match the expectations from https://arviz-plots.readthedocs.io/en/latest/api/backend/index.html#backend-interface-arguments

@imperorrp
Copy link
Collaborator Author

It is looking very good already. The computation of the histogram is a bit trickier and would not be part of the gsoc project, but I think it will be helpful to get you more familiar with xarray.

I have tried to focus on the plotting and contributing workflow with the comments. Let me know if you have any doubts related to testing.

Thank you, I will take a look at your reviews and advice!

@imperorrp
Copy link
Collaborator Author

@OriolAbril Latest commit has changes incorporating advice from your comments. Modified data restructure utility function to keep left and right edges info now until the matplotlib backend calculates midpoints from these at the end to create the bars, and I've updated the docstring and color/facecolor/edgecolor interface

Copy link
Member

@OriolAbril OriolAbril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a comment on how to use matplotlib bar function to generate histogram looking plot. Let me know if you need any help to generate the bokeh equivalent

@OriolAbril
Copy link
Member

By the way, I commented on the ess and mcse issues but forgot to comment here. There is a histogram function available in arviz-stats now. I think the output will have a similar (if not the same) shape to what you are using

@imperorrp
Copy link
Collaborator Author

By the way, I commented on the ess and mcse issues but forgot to comment here. There is a histogram function available in arviz-stats now. I think the output will have a similar (if not the same) shape to what you are using

Altered the code to make use of this function. Also, it seems like point estimate and credible intervals are being plotted so close to the x axis for histogram plots because the default 'y' aesthetic value/s returned by plot collection are pretty low in relation. The maximum y-axis values for the KDE/ECDF plots are way lower so it's not a problem for them. I initially thought there was some issue with this aesthetic mapping when type hist is picked, but it works fine- it's just really small values compared to the histogram heights. Should we mention this in the documentation, asking users to define custom 'y' aesthetic values (in pc_kwargs)when using kind=histogram for better visibility?

For the plot_dist example where one subplot is created for each variable:

azp.plot_dist(
        data,
        kind="kde",
        pc_kwargs={
            "cols": ["__variable__"],
            "aes": {"color": ["school"], "y": ["school"]},
            "y": np.linspace(0, 0.06, 9),
        },
        aes_map={
            "kde": ["color"],
            "point_estimate": ["color", "y"],
            "credible_interval": ["y"],
        },
    )
    plt.savefig("distplot-kde-cols-matplotlib.png")

image

Point estimates when switching from kind="kde" to kind="hist":

azp.plot_dist(
        data,
        kind="hist",
        pc_kwargs={
            "cols": ["__variable__"],
            "aes": {"color": ["school"], "y": ["school"]},
            "y": np.linspace(0, 0.06, 9),
        },
        aes_map={
            "kde": ["color"],
            "point_estimate": ["color", "y"],
            "credible_interval": ["y"],
        },
    )
    plt.savefig("distplot-hist-cols-matplotlib.png")

image

This is easily fixable with editing the np.linspace line though-

"y": np.linspace(0, 600, 9),

image

We could maybe add this in the documentation so users know to keep this in mind.

Comment on lines +257 to +281
density = distribution.azstats.histogram(
dims=density_dims, **stats_kwargs.get("density", {})
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should pass density=True as default here. This will normalize the histogram so its area is 1, and will therefore generate histograms with heights similar to those of the kde. That will should take care of the non-reusability of the overlayed example when changing hist/kde. For ecdf/dotplots it will probably break too, but I don't think it matters, ecdf always goes 0-1 whereas dotplots are particularly strange, they should not be expected to act as drop-in replacements of either kde nor hist.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah adding that seems to work for the overlay example, without having to alter the np.linspace parameters.

@OriolAbril OriolAbril force-pushed the plot_dist-histogram branch 2 times, most recently from 98e5d1f to 8f43767 Compare July 18, 2024 13:21
Copy link
Member

@OriolAbril OriolAbril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As always, go over all comments before starting to address any single one of them

@@ -97,6 +99,7 @@ def plot_dist(

* "kde" -> passed to :func:`~arviz_plots.visuals.line_xy`
* "ecdf" -> passed to :func:`~arviz_plots.visuals.ecdf_line`
* "hist" -> passed to :func: `~WIP`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -250,6 +253,16 @@ def plot_dist(
**density_kwargs,
)

elif kind == "hist":
density = distribution.azstats.histogram(
dims=density_dims, density=True, **stats_kwargs.get("density", {})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

density=True should be added to stats_kwargs with setdefault. Then, if a user wanted to they can set it to False, keep the axis and use the y axis to look at counts for example. The task itself isn't really important, what is is that it should be possible

@@ -352,7 +374,7 @@ def plot_dist(
labeller=labeller,
**title_kwargs,
)
if (kind == "kde") and (plot_kwargs.get("remove_axis", True) is not False):
if (kind in ("kde", "hist")) and (plot_kwargs.get("remove_axis", True) is not False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would modify this logic. Instead, I would set the value of the remove_axis depending on kind at the top, right after copying plot_kwargs.

if kind in ("hist", "ecdf"):
    plot_kwargs.setdefault("remove_axis", False)

I think as a default we'll want to remove the axis for kde and quantile dotplot but keep it for the others.

@@ -13,6 +13,21 @@
from arviz_base.labels import BaseLabeller


def plot_hist(da, target, backend, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use hist only to "reserve" the plot_xyz for batteries included functions, otherwise it might get confusing.

@@ -99,6 +99,33 @@ def create_plotting_grid(


# "geoms"
def hist(y, l_e, r_e, target, *, bottom=None, color=None, facecolor=None, edgecolor=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went over all potential keywords in https://arviz-plots.readthedocs.io/en/latest/api/backend/index.html#backend-interface-arguments and alpha should be added here too.

@imperorrp
Copy link
Collaborator Author

Just pushed changes per last review

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 57.81250% with 27 lines in your changes missing coverage. Please review.

Project coverage is 84.84%. Comparing base (1298e42) to head (2304364).
Report is 1 commits behind head on main.

Files Patch % Lines
src/arviz_plots/plots/utils.py 13.33% 13 Missing ⚠️
src/arviz_plots/backend/none/__init__.py 58.33% 5 Missing ⚠️
src/arviz_plots/backend/bokeh/__init__.py 55.55% 4 Missing ⚠️
src/arviz_plots/backend/matplotlib/__init__.py 63.63% 4 Missing ⚠️
src/arviz_plots/backend/__init__.py 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #47      +/-   ##
==========================================
- Coverage   85.85%   84.84%   -1.02%     
==========================================
  Files          17       17              
  Lines        1951     2012      +61     
==========================================
+ Hits         1675     1707      +32     
- Misses        276      305      +29     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

**artist_kws,
):
"""Interface to Bokeh for a histogram bar plot."""
artist_kws.setdefault("level", "glyph")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this necessary?



# WIP: dropping bin_midpoints calculation and keeping bin_edges
def restructure_hist_data(hist_dict):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be removed

@OriolAbril
Copy link
Member

This will need to be rebased and double checked to make it compatible with plotly. Take care of these last two comments left and after that I'll rebase and add more commits to the PR so it works with plotly

@OriolAbril OriolAbril force-pushed the plot_dist-histogram branch from 2bd3519 to f7a56c6 Compare August 13, 2024 23:12
@OriolAbril OriolAbril merged commit a1e32a5 into arviz-devs:main Aug 13, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants