Proposed future workflow #161

sbfnk · 2024-11-20T09:10:25Z

In the process of addressing #131 workflows for estimating contact matrices will change. The proposal is to move some of the functionality hidden inside the contact_matrix() function into separate functions. This means that instead of (currently)

contact_matrix(polymod, countries = "United Kingdom", age.limits = c(0, 1, 5, 15), symmetric = TRUE)

One would do something like

uk_contacts <- polymod |>
  filter(country == "United Kingdom") |>
  process_age(age_limits = c(0, 1, 5, 15))

uk_pop <- uk_contacts |>
  country_population()

cm <- uk_contacts |>
  contact_matrix(by = "age", symmetric = TRUE, survey.pop = uk_pop)

It's quite a lot more typing but hopefully more explicit and less black box processing thus less likely to lead to misinterpretation / misunderstanding.

The text was updated successfully, but these errors were encountered:

joshwlambert · 2024-11-20T10:41:12Z

I think the current functionality is a lot cleaner and will be easier for new users coming to the package. It looks like at least one of the functions in the second code chunk are already exported from {socialmixr}, therefore I have two questions:

Could you not offer both sets of functionality, i.e. retain the original function and allow users to do the second if they wish (the contact_matrix() function in the second code chunk would need a different name)?
What are the misinterpretations/misunderstandings that you see users having with the current approach? As I'm not a regular user of this package it might help clarify how the proposed approach addresses these.

sbfnk · 2024-11-20T11:07:28Z

Thanks, looks like some motivation/context was missing from the original post. The contact_matrix() function currently has 22 arguments and passes arguments in ellipses ... to three other functions, and 800 lines of code, making it unwieldy both from a user and developer perspective. The rationale for removing the processing of demographic data, filtering of surveys etc. out of the function is to expose more of the internals such as how age is processed, what demographic data is used etc. whilst having smaller functions with fewer arguments. The downside is more function calls for the user, especially when previously using arguments defaults.

One related question is whether we want to be able to address #143 - at the moment the contact_matrix() function does a lot of manipulation of age but if it is to allow generation of general contact matrices (e.g. by sex) then it really makes sense to move all the functionality for estimating/grouping ages into other functions because users might not be interested in age, but also because if not contact_matrix() will have to grow even bigger / need more arguments.

Re (1) yes we could and perhaps it's the way forward, but from a maintenance point of view it would help to be a bit more prescriptive on the workflow.

bahadzie · 2024-11-20T12:54:52Z

 uk_contacts <- polymod |>
  as_contact_survey() |>
  filter(country == "United Kingdom") |>
  process_age(age_limits = c(0, 1, 5, 15))

uk_pop <- uk_contacts |>
  country_population()

cm <- uk_contacts |>
  contact_matrix(by = "age", symmetric = TRUE, survey.pop = uk_pop)

I would change

contact_matrix(by = "age", symmetric = TRUE, survey.pop = uk_pop) can become

to

`cm(by = "age", symmetric = TRUE, survey.pop = uk_pop)

On CRAN {o2geosocial} and {finalsize} are reverse (Suggests:) dependencies. Not sure how we feel about potential breaking changes. If long term the idea is to move to {contactmatrix} then I think we should maintain contact_matrix()'s current function signature but still refactor the code to split it out into smaller functions that will reduce its cyclomatic complexity which at 143 is almost 10X the recommended {lintr} limit of 15.

IMO treating age as a variable useful for stratification simplifies implementing #143 . Of course we still need the existing age estimation functionality but this can be a separate function that is called by contact_matrix()

Section comments (below) from contact_matrix() are a useful blueprint for potential functions.

  ## === check arguments and define variables
  ## === check and clean survey
  ## === check if specific countries are requested (if a survey contains data from multiple countries)
  ## === age processing: deal with ranges and missing data
  ## === check if any filters have been requested
  ## === adjust age.group.brakes to the lower and upper ages in the survey
  ## === if split, symmetric or age weights are requested, get demographic data (survey population)
  ## === process weights
  ## === merge participants and contacts into a single data table
  ## === process contact age ranges / missing ages
  ## === sample participants randomly (if requested)
  ## === calculate weighted contact matrix
  ## === split contact matrix
  ## === option to add matrix per capita, i.e. the contact rate of age i with one individual of age j in the population.
  ## === option to return participant weights

Potential mappings are...

as_contact_survey(){
  ## === check and clean survey
  ## === check if specific countries are requested (if a survey contains data from multiple countries)
  ## === merge participants and contacts into a single data table
}

filter() {
  ## === check if any filters have been requested
}

process_age(){
  ## === age processing: deal with ranges and missing data
  ## === adjust age.group.brakes to the lower and upper ages in the survey
  ## === process contact age ranges / missing ages
}

as_contact_survey(){
  ## === check if specific countries are requested (if a survey contains data from multiple countries)
}

joshwlambert · 2024-11-20T13:48:49Z

Thanks for the additional context. The misunderstanding/misinterpretation comes from the number of arguments in contact_data() and the misuse of argument defaults?

One related question is whether we want to be able to address #143 - at the moment the contact_matrix() function does a lot of manipulation of age but if it is to allow generation of general contact matrices (e.g. by sex) then it really makes sense to move all the functionality for estimating/grouping ages into other functions because users might not be interested in age, but also because if not contact_matrix() will have to grow even bigger / need more arguments.

I agree that the proposed refactor is the right development direction, as the single point of user interaction (contact_matrix()) does not scale well as you generalise.

If long term the idea is to move to {contactmatrix} then I think we should maintain contact_matrix()'s current function signature

The plan is to use the <contactmatrix> class from {contactmatrix} in {socialmixr} but not move functionality currently in {socialmixr} into {contactmatrix}? Is my understanding of this correct @Bisaloo & @sbfnk?

sbfnk · 2024-11-20T14:01:24Z

Thanks for the additional context. The misunderstanding/misinterpretation comes from the number of arguments in contact_data() and the misuse of argument defaults?

What I meant was the risk is from doing lots of processing inside the function that isn't really exposed to the user (e.g. there is no way for the user currently to see the result of interpolating within age ranges, or of puling in and manipulating demographic data - they only see the end result in the matrix). In that sense I see a risk of misinterpretation due to lack of visibility of what happens inside the contact_matrix() function. That said I have no evidence / examples of this actually having ever been an issue.

adamkucharski · 2024-11-25T09:46:58Z

Agree it can potentially be a bit confusing to have the demography generated by a contact_matrix function – reviewing the nice tutorial by @amanda-minter got me thinking more about how we present contact data vs contact matrix vs NGM etc. (especially as we previously had a few transformations that were given in a bit too much of a 'tell rather than show' way to the user)

Am I correct that the below line basically defines the raw POLYMOD data object with some aggregation?

uk_contacts <- polymod |>
  filter(country == "United Kingdom") |>
  process_age(age_limits = c(0, 1, 5, 15))

So perhaps clearer to have

uk_data <- polymod |>
  filter(country == "United Kingdom") |>
  process_age(age_limits = c(0, 1, 5, 15))

Then can extract contacts and/or demography as user needs, like you suggest?

sbfnk · 2024-11-25T10:36:25Z

The demographic data isn't actually stored inside the polymod object. When calling

uk_pop <- uk_contacts |>
  country_population()

the country_population() function would look up country and year inside uk_contacts and try to find the corresponding demographics in the world population prospects data using wpp_age(). Perhaps that could be made clearer somehow, e.g. with a better function name.

Bisaloo · 2024-11-26T15:56:04Z

The proposed workflow feels less "magic" but I think it's a good thing. There is a lot going on in socialmixr's code and in particular in contact_matrix() and decoupling the different conceptual steps into different functions makes sense to me.

The tradeoff, which is still definitely worth it in my opinion, is that users will have to write a couple extra more lines of code and spend more time to understand what is happening behind the scenes.

A nice other added benefit that was not yet mentioned in this thread is that this would resolve the case of arguments accepting multiple types (e.g., survey.pop accepting a string or a data.frame), which I always found hard to use and to maintain.

Beyond the general agreement in terms of direction, I think this also raises a lot of practical questions:

Implementing major breaking changes vs starting a new package (maybe similar to the reflection you had on bpmodels & epichains?)
This workflow feels more "tidy" (in the sense defined by the tidyverse). It may make sense to go all the way and try to follow their recommendation to use verbs as functions names. In particular, there is no specific reason to keep the contact_matrix() name since the function signature is likely to be completely different
Generally, I would keep an eye on NAMESPACE growth and try to handle the potential added complexity by being very careful in how functions are named
I would probably use this opportunity to tighten the scope of the package or split it (as proposed in more focused dependency options? #124). In particular, I am not sure if I would include getting the age data from wpp. I'm on the fence...

lwillem · 2024-12-13T08:29:37Z

Great discussion! While breaking the functionality into steps could improve user understanding, it is currently very convenient for users to obtain a contact matrix from a survey dataset without having to deal with normalization, demographic adjustments, filtering, and other complexities. From an end-user perspective, I would therefore advocate for retaining the current contact_matrix(...) interface. However, the current implementation, with most functionality directly coded inside a single function, presents challenges for further development. For users looking to extend the code, this design results in a steep learning curve.

I would recommend reorganizing the contact_matrix() function into an "interface function" that delegates tasks to multiple sub-functions (potentially without including all features). Additionally, it would be beneficial to provide documentation guiding users through the stepwise process for more in-depth analyses or for utilizing all features when needed. This approach would enhance usability for both end-users and developers looking to extend or adapt the functionality.

sbfnk · 2024-12-16T10:09:25Z

Thanks for all the comments. Synthesising the comments I would suggest to:

create functions for the individual parts of contact_matrix() so that the full workflow can be done in step-wise fashion
retain contact_matrix() in its current form at least for now, but not develop it further; the new function that creates the contact matrix from pre-processed contact data will be called something else, in line with tidyverse recommendations
split out the code to interface with the social contact data repository into a new package
defer the decision on whether to eventually deprecate contact_matrix()

sbfnk added the question label Nov 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposed future workflow #161

Proposed future workflow #161

sbfnk commented Nov 20, 2024 •

edited

Loading

joshwlambert commented Nov 20, 2024 •

edited

Loading

sbfnk commented Nov 20, 2024

bahadzie commented Nov 20, 2024

joshwlambert commented Nov 20, 2024

sbfnk commented Nov 20, 2024 •

edited

Loading

adamkucharski commented Nov 25, 2024

sbfnk commented Nov 25, 2024

Bisaloo commented Nov 26, 2024

lwillem commented Dec 13, 2024

sbfnk commented Dec 16, 2024

Proposed future workflow #161

Proposed future workflow #161

Comments

sbfnk commented Nov 20, 2024 • edited Loading

joshwlambert commented Nov 20, 2024 • edited Loading

sbfnk commented Nov 20, 2024

bahadzie commented Nov 20, 2024

joshwlambert commented Nov 20, 2024

sbfnk commented Nov 20, 2024 • edited Loading

adamkucharski commented Nov 25, 2024

sbfnk commented Nov 25, 2024

Bisaloo commented Nov 26, 2024

lwillem commented Dec 13, 2024

sbfnk commented Dec 16, 2024

sbfnk commented Nov 20, 2024 •

edited

Loading

joshwlambert commented Nov 20, 2024 •

edited

Loading

sbfnk commented Nov 20, 2024 •

edited

Loading