Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposed future workflow #161

Open
sbfnk opened this issue Nov 20, 2024 · 10 comments
Open

Proposed future workflow #161

sbfnk opened this issue Nov 20, 2024 · 10 comments
Labels

Comments

@sbfnk
Copy link
Collaborator

sbfnk commented Nov 20, 2024

In the process of addressing #131 workflows for estimating contact matrices will change. The proposal is to move some of the functionality hidden inside the contact_matrix() function into separate functions. This means that instead of (currently)

contact_matrix(polymod, countries = "United Kingdom", age.limits = c(0, 1, 5, 15), symmetric = TRUE)

One would do something like

uk_contacts <- polymod |>
  filter(country == "United Kingdom") |>
  process_age(age_limits = c(0, 1, 5, 15))

uk_pop <- uk_contacts |>
  country_population()

cm <- uk_contacts |>
  contact_matrix(by = "age", symmetric = TRUE, survey.pop = uk_pop)

It's quite a lot more typing but hopefully more explicit and less black box processing thus less likely to lead to misinterpretation / misunderstanding.

@sbfnk sbfnk added the question label Nov 20, 2024
@joshwlambert
Copy link

joshwlambert commented Nov 20, 2024

I think the current functionality is a lot cleaner and will be easier for new users coming to the package. It looks like at least one of the functions in the second code chunk are already exported from {socialmixr}, therefore I have two questions:

  1. Could you not offer both sets of functionality, i.e. retain the original function and allow users to do the second if they wish (the contact_matrix() function in the second code chunk would need a different name)?
  2. What are the misinterpretations/misunderstandings that you see users having with the current approach? As I'm not a regular user of this package it might help clarify how the proposed approach addresses these.

@sbfnk
Copy link
Collaborator Author

sbfnk commented Nov 20, 2024

Thanks, looks like some motivation/context was missing from the original post. The contact_matrix() function currently has 22 arguments and passes arguments in ellipses ... to three other functions, and 800 lines of code, making it unwieldy both from a user and developer perspective. The rationale for removing the processing of demographic data, filtering of surveys etc. out of the function is to expose more of the internals such as how age is processed, what demographic data is used etc. whilst having smaller functions with fewer arguments. The downside is more function calls for the user, especially when previously using arguments defaults.

One related question is whether we want to be able to address #143 - at the moment the contact_matrix() function does a lot of manipulation of age but if it is to allow generation of general contact matrices (e.g. by sex) then it really makes sense to move all the functionality for estimating/grouping ages into other functions because users might not be interested in age, but also because if not contact_matrix() will have to grow even bigger / need more arguments.

Re (1) yes we could and perhaps it's the way forward, but from a maintenance point of view it would help to be a bit more prescriptive on the workflow.

@bahadzie
Copy link

 uk_contacts <- polymod |>
  as_contact_survey() |>
  filter(country == "United Kingdom") |>
  process_age(age_limits = c(0, 1, 5, 15))

uk_pop <- uk_contacts |>
  country_population()

cm <- uk_contacts |>
  contact_matrix(by = "age", symmetric = TRUE, survey.pop = uk_pop)

I would change

contact_matrix(by = "age", symmetric = TRUE, survey.pop = uk_pop) can become

to

`cm(by = "age", symmetric = TRUE, survey.pop = uk_pop)

On CRAN {o2geosocial} and {finalsize} are reverse (Suggests:) dependencies. Not sure how we feel about potential breaking changes. If long term the idea is to move to {contactmatrix} then I think we should maintain contact_matrix()'s current function signature but still refactor the code to split it out into smaller functions that will reduce its cyclomatic complexity which at 143 is almost 10X the recommended {lintr} limit of 15.

IMO treating age as a variable useful for stratification simplifies implementing #143 . Of course we still need the existing age estimation functionality but this can be a separate function that is called by contact_matrix()

Section comments (below) from contact_matrix() are a useful blueprint for potential functions.

  ## === check arguments and define variables
  ## === check and clean survey
  ## === check if specific countries are requested (if a survey contains data from multiple countries)
  ## === age processing: deal with ranges and missing data
  ## === check if any filters have been requested
  ## === adjust age.group.brakes to the lower and upper ages in the survey
  ## === if split, symmetric or age weights are requested, get demographic data (survey population)
  ## === process weights
  ## === merge participants and contacts into a single data table
  ## === process contact age ranges / missing ages
  ## === sample participants randomly (if requested)
  ## === calculate weighted contact matrix
  ## === split contact matrix
  ## === option to add matrix per capita, i.e. the contact rate of age i with one individual of age j in the population.
  ## === option to return participant weights

Potential mappings are...

as_contact_survey(){
  ## === check and clean survey
  ## === check if specific countries are requested (if a survey contains data from multiple countries)
  ## === merge participants and contacts into a single data table
}
filter() {
  ## === check if any filters have been requested
}
process_age(){
  ## === age processing: deal with ranges and missing data
  ## === adjust age.group.brakes to the lower and upper ages in the survey
  ## === process contact age ranges / missing ages
}
as_contact_survey(){
  ## === check if specific countries are requested (if a survey contains data from multiple countries)
}

@joshwlambert
Copy link

Thanks for the additional context. The misunderstanding/misinterpretation comes from the number of arguments in contact_data() and the misuse of argument defaults?

One related question is whether we want to be able to address #143 - at the moment the contact_matrix() function does a lot of manipulation of age but if it is to allow generation of general contact matrices (e.g. by sex) then it really makes sense to move all the functionality for estimating/grouping ages into other functions because users might not be interested in age, but also because if not contact_matrix() will have to grow even bigger / need more arguments.

I agree that the proposed refactor is the right development direction, as the single point of user interaction (contact_matrix()) does not scale well as you generalise.

If long term the idea is to move to {contactmatrix} then I think we should maintain contact_matrix()'s current function signature

The plan is to use the <contactmatrix> class from {contactmatrix} in {socialmixr} but not move functionality currently in {socialmixr} into {contactmatrix}? Is my understanding of this correct @Bisaloo & @sbfnk?

@sbfnk
Copy link
Collaborator Author

sbfnk commented Nov 20, 2024

Thanks for the additional context. The misunderstanding/misinterpretation comes from the number of arguments in contact_data() and the misuse of argument defaults?

What I meant was the risk is from doing lots of processing inside the function that isn't really exposed to the user (e.g. there is no way for the user currently to see the result of interpolating within age ranges, or of puling in and manipulating demographic data - they only see the end result in the matrix). In that sense I see a risk of misinterpretation due to lack of visibility of what happens inside the contact_matrix() function. That said I have no evidence / examples of this actually having ever been an issue.

@adamkucharski
Copy link

Agree it can potentially be a bit confusing to have the demography generated by a contact_matrix function – reviewing the nice tutorial by @amanda-minter got me thinking more about how we present contact data vs contact matrix vs NGM etc. (especially as we previously had a few transformations that were given in a bit too much of a 'tell rather than show' way to the user)

Am I correct that the below line basically defines the raw POLYMOD data object with some aggregation?

uk_contacts <- polymod |>
  filter(country == "United Kingdom") |>
  process_age(age_limits = c(0, 1, 5, 15))

So perhaps clearer to have

uk_data <- polymod |>
  filter(country == "United Kingdom") |>
  process_age(age_limits = c(0, 1, 5, 15))

Then can extract contacts and/or demography as user needs, like you suggest?

@sbfnk
Copy link
Collaborator Author

sbfnk commented Nov 25, 2024

The demographic data isn't actually stored inside the polymod object. When calling

uk_pop <- uk_contacts |>
  country_population()

the country_population() function would look up country and year inside uk_contacts and try to find the corresponding demographics in the world population prospects data using wpp_age(). Perhaps that could be made clearer somehow, e.g. with a better function name.

@Bisaloo
Copy link
Contributor

Bisaloo commented Nov 26, 2024

The proposed workflow feels less "magic" but I think it's a good thing. There is a lot going on in socialmixr's code and in particular in contact_matrix() and decoupling the different conceptual steps into different functions makes sense to me.

The tradeoff, which is still definitely worth it in my opinion, is that users will have to write a couple extra more lines of code and spend more time to understand what is happening behind the scenes.

A nice other added benefit that was not yet mentioned in this thread is that this would resolve the case of arguments accepting multiple types (e.g., survey.pop accepting a string or a data.frame), which I always found hard to use and to maintain.

Beyond the general agreement in terms of direction, I think this also raises a lot of practical questions:

  • Implementing major breaking changes vs starting a new package (maybe similar to the reflection you had on bpmodels & epichains?)
  • This workflow feels more "tidy" (in the sense defined by the tidyverse). It may make sense to go all the way and try to follow their recommendation to use verbs as functions names. In particular, there is no specific reason to keep the contact_matrix() name since the function signature is likely to be completely different
  • Generally, I would keep an eye on NAMESPACE growth and try to handle the potential added complexity by being very careful in how functions are named
  • I would probably use this opportunity to tighten the scope of the package or split it (as proposed in more focused dependency options? #124). In particular, I am not sure if I would include getting the age data from wpp. I'm on the fence...

@lwillem
Copy link
Contributor

lwillem commented Dec 13, 2024

Great discussion! While breaking the functionality into steps could improve user understanding, it is currently very convenient for users to obtain a contact matrix from a survey dataset without having to deal with normalization, demographic adjustments, filtering, and other complexities. From an end-user perspective, I would therefore advocate for retaining the current contact_matrix(...) interface. However, the current implementation, with most functionality directly coded inside a single function, presents challenges for further development. For users looking to extend the code, this design results in a steep learning curve.

I would recommend reorganizing the contact_matrix() function into an "interface function" that delegates tasks to multiple sub-functions (potentially without including all features). Additionally, it would be beneficial to provide documentation guiding users through the stepwise process for more in-depth analyses or for utilizing all features when needed. This approach would enhance usability for both end-users and developers looking to extend or adapt the functionality.

@sbfnk
Copy link
Collaborator Author

sbfnk commented Dec 16, 2024

Thanks for all the comments. Synthesising the comments I would suggest to:

  • create functions for the individual parts of contact_matrix() so that the full workflow can be done in step-wise fashion
  • retain contact_matrix() in its current form at least for now, but not develop it further; the new function that creates the contact matrix from pre-processed contact data will be called something else, in line with tidyverse recommendations
  • split out the code to interface with the social contact data repository into a new package
  • defer the decision on whether to eventually deprecate contact_matrix()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants