Deprecate expect_similar() #18
Comments
To answer the issue title: no :P This was a rough, hacky attempt to incorporate the original frequency (to allow more leeway for high-frequency responses, since e.g. a jump from 2 to 12 percent usually means something very different in checking than a jump from 72 to 82). It should probably use an actual statistical measure instead.
Yeah, good point. What about …
Coming back to this,

```r
df1 <- data.frame(
  a = sample(1:5, 100000, TRUE),
  b = sample(c(rep(1:5, 5), 1:3), 100000, TRUE),
  c = sample(c(rep(1:5, 25), 1:3), 100000, TRUE),
  d = sample(c(rep(1:5, 125), 1:3), 100000, TRUE)
)

df1 %>% group_by(level = a) %>% summarise(n_a = n()) %>%
  left_join(
    df1 %>% group_by(level = b) %>% summarise(n_b = n()), "level"
  ) %>%
  left_join(
    df1 %>% group_by(level = c) %>% summarise(n_c = n()), "level"
  ) %>%
  left_join(
    df1 %>% group_by(level = d) %>% summarise(n_d = n()), "level"
  )
#> # A tibble: 5 x 5
#>   level   n_a   n_b   n_c   n_d
#>   <int> <int> <int> <int> <int>
#> 1     1 20180 21234 20241 20026
#> 2     2 19932 21382 20398 20159
#> 3     3 19820 21494 20255 20050
#> 4     4 20031 17956 19654 19905
#> 5     5 20037 17934 19452 19860

chisq.test(table(df1$a), p = table(df1$b), rescale.p = TRUE)
#>  Chi-squared test for given probabilities
#>
#> data:  table(df1$a)
#> X-squared = 767.42, df = 4, p-value < 2.2e-16

chisq.test(table(df1$a), p = table(df1$c), rescale.p = TRUE)
#>  Chi-squared test for given probabilities
#>
#> data:  table(df1$a)
#> X-squared = 44.997, df = 4, p-value = 3.982e-09

chisq.test(table(df1$a), p = table(df1$d), rescale.p = TRUE)
#>  Chi-squared test for given probabilities
#>
#> data:  table(df1$a)
#> X-squared = 8.7539, df = 4, p-value = 0.06755
```

If we consider the p-values, we would never say that …

We could potentially use the chi-squared statistic directly. But a small chi-squared value doesn't necessarily indicate that the distributions are similar, only that we can't confidently tell them apart: that could be because they are similar, or because there aren't many data points.
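One way around the sample-size problem sketched above would be an effect-size measure rather than a p-value. A minimal sketch using Cramér's V (this is an illustration, not anything in testdat; `cramers_v` is a hypothetical helper, and a sensible similarity threshold would still have to be chosen):

```r
# Cramér's V rescales the chi-squared statistic by n, so unlike the
# p-value it doesn't drift towards "significant" just because n is huge.
# 0 means identical observed frequencies; 1 means maximally different.
cramers_v <- function(observed, expected) {
  test <- chisq.test(observed, p = expected, rescale.p = TRUE)
  k <- length(observed)  # number of categories
  sqrt(unname(test$statistic) / (sum(observed) * (k - 1)))
}

cramers_v(table(df1$a), table(df1$b))  # ~0.044
cramers_v(table(df1$a), table(df1$d))  # ~0.005
```

On the data above, all of these come out tiny (conventionally, V below roughly 0.1 is a negligible effect), which matches the intuition that the distributions are practically similar even though the p-values for `b` and `c` are vanishingly small.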
Spoke to Andrew about this and he recommended using chi-square. I'm now questioning this function. It might be best to leave similarity testing to users, given that it's fairly hairy.
@wilcoxa @tonoplast RE: similarity testing discussion this morning
@wilcoxa I'm leaning towards deprecating this one. I've slapped an …

Once this is sorted out, 0.2.0 will be ready to review/merge/release. I think the other issues that are currently outstanding can wait.
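For reference, a minimal sketch of what a soft deprecation could look like with the lifecycle package (the version number, signature, and message here are assumptions, not the actual testdat code):

```r
# Hypothetical shim: warn on every call, then fall through to the old
# behaviour so existing test suites keep running during the
# deprecation window.
expect_similar <- function(object, expected, ...) {
  lifecycle::deprecate_warn(
    when = "0.2.0",
    what = "expect_similar()",
    details = "Similarity testing is being left to users; see issue #18."
  )
  # ... original comparison logic ...
}
```

Keeping the old logic behind the warning (rather than erroring immediately) gives downstream users a release cycle to migrate before the function is removed.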
@kinto-b Yeah
@gorcha I think this one should be deprecated before CRAN-ing
Yep, agreed
Is the `expect_similar()` comparison the best? Suppose we compare the following, …

How it currently works

What does `expect_similar()` do internally? First it does this:

testdat/R/expect-datacomp.R, lines 33 to 34 in 9f3ef72

which yields …

Then it does this:

testdat/R/expect-datacomp.R, lines 36 to 40 in 9f3ef72

which yields …

How I thought it would work

But I was expecting it to run the comparison in this way:

which yields …