Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding data guide. #268

Closed
6 tasks
seabbs opened this issue Apr 13, 2021 · 11 comments
Closed
6 tasks

Adding data guide. #268

seabbs opened this issue Apr 13, 2021 · 11 comments
Labels
documentation Improvements or additions to documentation

Comments

@seabbs
Copy link
Contributor

seabbs commented Apr 13, 2021

The adding data guide needs to be expanded. Current version is here: https://github.com/epiforecasts/covidregionaldata/blob/master/.github/ADDING-DATA.md

It should include:

  • How to download data
  • How to clean data when there is only one level
  • How to clean data with multiple levels
  • How to incrementally implement the prototype class
  • How to test the prototype class
  • A style guide on documentation and method use (i.e only use self in top level method calls where possible).
@RichardMN
Copy link
Collaborator

I'll put my hand up for some of this. I think I can do:

  • How to download data
  • How to clean data when there is only one level
  • How to clean data with multiple levels
  • How to incrementally implement the prototype class

Thinking about this, it almost feels something which might draft better with a wiki approach rather than concurrent PRs on a branch.

One suggestion which I think I would make would be to work through a flowchart/decision tree of the data available and then model the new code not on R/Country Template.R but on one of the existing classes. (This is essentially what I did for most of them.)

  • Q1: Do you have data only for level 1?
    • Yes
      • Model based on Cuba or Canada or SouthAfrica or Italy or India [or ...]
    • No
      • Q2: Is your data available is only for level 2 but then aggregated to get level 1 data?
        • Model on Lithuania or Brazil or Germany
      • Q3: Is your data available as separate sources for levels 1 and 2?
        • Model on USA, look at France
      • Q4: Is some data available only at certain levels?
        • Model on Belgium (see also France)
      • Q5: Are you doing something different from all the others?
        • Look at UK, but this probably isn't what you want to do.

This approach is more likely to give people a framework with the parts they need already in than providing a very generic framework which has many methods in place which may not be necessary for most use cases.

It may be useful to fill out the taxonomy above for general reference too.

What I don't think I can do apart from asking more questions is advise on testing or on debugging (two different things). Debugging the R6 code has been a learning experience, I'm not sure I'm doing it the best way possible but perhaps someone else may have some tips we could share.

@seabbs
Copy link
Contributor Author

seabbs commented Apr 14, 2021

That would be amazing Richard.

I really like the flow chart idea that seems like a much nicer way of handling it.

We could use GitHub wikis for both this and contributing guide (as suggested by @kathsherratt)? That might make changing them easier and more natural.

Totally agree on the testing/debugging issues. I also don't think I have found the optimal work flow. I think perhaps if we all chip on with suggestions we can agree some good working practices?

@seabbs
Copy link
Contributor Author

seabbs commented Apr 14, 2021

My personal workflow has been to implement each method in turn, load the class and then use n <- Class$new(); debug(n$download) to drop into debug in the method I am working on.

@RichardMN
Copy link
Collaborator

RichardMN commented Apr 14, 2021

Good - I am starting with a wiki formulation in https://github.com/epiforecasts/covidregionaldata/wiki/Adding-Data-(wiki-draft)

I've drawn in from @seabbs blob and will also pull in some of the text I'd written for a vignette on this topic for 0.8.3

Currently I'm littering it with headers as a way of outlining - please feel free to also edit on this text, including adding in comments/questions/suggestions. I would propose to draft on the wiki for a few days and then transfer into a PR (I think it's useful to have this documentation in the package) but we may choose to leave it on the wiki.

@seabbs
Copy link
Contributor Author

seabbs commented Apr 15, 2021

Just skimmed through where this is at. Looks really great currently. Really like the Q and A design. I think I am quite in favour in leaving it in the wiki (so that is a 100% course change from me) but very open to alternative views on that.

@RichardMN
Copy link
Collaborator

Thanks for the kind comments.

Can someone who is familiar with the ECDC / WHO / JRC code sketch out or write a version of what could be said about doing this sort of a system? I've looked a bit at these Classes but really am not familiar and don't know what counts as a good example and what counts as an edge case which is functional but to be avoided.

I think I know what I want to write more in the cleaning single levels and cleaning multiple levels, I think I've got a sense of what needs to go in the incremental and testing sections.

I have not written about only using self in top-level functions - I think I would include the clean_level_? functions in that category. At least some of my code uses self$clean to pass intermediate forms from one of these to the other. I don't know what arguments there are for passing things on the stack as opposed to them being referred to in the object they're on.

One other thing I might suggest is that if people know of open data sources which we don't yet include, we could start a "wish list" of these which could be added. It could be faster than me going looking for new data sources, but would have to come with the clear statement that listing something doesn't mean it's automatically, or even ever, going to be added. (There may be complications or limitations which aren't apparent, or which the nominator doesn't consider important.)

@seabbs
Copy link
Contributor Author

seabbs commented Apr 19, 2021

@joseph-palmer as you are handling adding the JHU and google data sketching out some docs for these kind of classes might be a good intro?

Maybe we should leave documenting the self usage for a bit as I agree I am not totally clear what the best approach is.

I thought we might wishlist datasets as a discussion section? Agree we should post to this and definitely stress that it is very much not a given anything will be implemented based on it being listed..

https://github.com/epiforecasts/covidregionaldata/discussions/categories/new-datasets

@joseph-palmer
Copy link
Collaborator

Iv'e put a page up https://github.com/epiforecasts/covidregionaldata/wiki/Adding-a-national-data-source and will keep going over it to make sure its clear. I will update as I add in JHU and Google data as I'm sure that will raise steps I've missed. Could even do a walk though of creating it which might be nice.

@seabbs
Copy link
Contributor Author

seabbs commented Apr 22, 2021

I'll look at this again at more detail today/tomorrow but are we happy with how the package links to these docs at the moment?

@RichardMN
Copy link
Collaborator

RichardMN commented Apr 22, 2021

I think it’s fine.

I just looked and I cannot find how the package links to these but I think that’s probably ok. I have to imagine that anyone who wants to add a data set will at least visit the GitHub page and may find their way to the wiki.

i note (and this is something which I guess may go into the pkgdown.yml) a typo in the following line:

Clases that define individual datasets and contain all the methods needed to download, process, and clean them.

classes.

Let’s also put in a sentence there saying that individual data classes may have params which are documented in their data class but can be passed to get_regional_data et al.

@joseph-palmer
Copy link
Collaborator

This looks like its finished so closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

3 participants