Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Callbacks can execute in wrong order #150

Closed
wch opened this issue Jun 25, 2021 · 6 comments · Fixed by #151
Closed

Callbacks can execute in wrong order #150

wch opened this issue Jun 25, 2021 · 6 comments · Fixed by #151

Comments

@wch
Copy link
Member

wch commented Jun 25, 2021

Originally from rstudio/shiny#3429. I've been able to reproduce this problem on Docker for Mac, using versions 2.5.0.0 and above. Older versions seem fine. Also, I can't seem to reproduce it when running R directly on the Mac, or R on Linux, or R in Docker on Linux.

When the C API is used and tasks are scheduled very quickly, sometimes they do not execute in the correct order.

I've posted example code at https://gist.github.com/wch/d96b3c66644282a312c5df22433d46cd

@wch
Copy link
Member Author

wch commented Jun 25, 2021

Looking at the results from the example code above, this is what happens in each test run:

  • 1 million callbacks are scheduled.
  • When the callbacks execute, there are one or two blocks, each containing around 1200-1800 callbacks, that execute in an incorrect order.

Each test run generates some data. The expected values would be something like this:

   idx value
1    1     1
2    2     2
3    3     3
4    4     4
5    5     5
6    6     6
...

idx represents the order in which a callback is run. value represents the order in which a callback was scheduled. So each row should be the same.

Here are some actual results, with the "good" (matching) rows removed, and only the mismatched rows displayed:

     idx value
1  91172 91696
2  91173 91697
3  91174 91698
4  91175 91699
5  91176 91700
6  91177 91701
7  91178 91702
8  91179 91172
9  91180 91173
10 91181 91703
11 91182 91174
12 91183 91704
13 91184 91705
14 91185 91706
15 91186 91707
...

I suspect (but don't know for sure) that these mixups happen because of poor time resolution in the Docker container running in the VM, and because there's a bug with the callbackId for the Callback objects, either when generating them or when comparing them (when the when values for two Callback objects are identical, the callbackId is used to break ties and decide which one executes first.)

bool operator<(const Callback& other) const {
return this->when < other.when ||
(!(this->when > other.when) && this->callbackId < other.callbackId);
}
bool operator>(const Callback& other) const {
return other < *this;
}

@wch
Copy link
Member Author

wch commented Jun 25, 2021

I made a branch which recorded the timestamps of Callback objects as they are created, into a vector<Timestamp>. https://github.com/r-lib/later/compare/check-order

What I found is that the timestamps increase monotonically, except at the locations where the callback executes in the wrong order. For example, notice how these values increase until index 9:

> times[486410:486430] - times[1]
 [1] 14.90152 14.90182 14.90186 14.90188 14.90193 14.90196 14.90197 14.90200
 [9] 14.86805 14.86815 14.86816 14.86821 14.86836 14.86837 14.86857 14.86867
[17] 14.86868 14.86871 14.86875 14.86882 14.86882

So this means that the problem is that, when the Timestamp objects are created, they don't always have increasing values.

This is the code that's run when they're created. It boils down to a call to gettimeofday(), and then a conversion from a timeval to a timespec:

void get_current_time(timespec *ts) {
timeval tv;
gettimeofday(&tv, NULL);
*ts = timevalToTimespec(tv);
}
class TimestampImplPosix : public TimestampImpl {
private:
timespec time;
public:
TimestampImplPosix() {
get_current_time(&this->time);
}

I wonder if this is due to a bug in the Docker VM.

@wch
Copy link
Member Author

wch commented Jun 25, 2021

It looks like the decreasing gettimeofday() results may be due to time drift correction in the VM.

Here are some SO questions where the user experienced a similar problem with VirtualBox:
https://stackoverflow.com/questions/53081264/gettimeofday-returns-inaccurate-result
https://stackoverflow.com/questions/4801122/how-to-stop-time-from-running-backwards-on-linux/4801176

The second one links to an issue in the VirtualBox tracker: https://www.virtualbox.org/ticket/7915
And a blog post titled "gettimeofday() should never be used to measure time": https://blog.habets.se/2010/09/gettimeofday-should-never-be-used-to-measure-time.html

@wch
Copy link
Member Author

wch commented Jun 26, 2021

Here's an example adapted from one of the SO links above, which demonstrates that gettimeofday() is nonmonotonic -- the clock can go backwards. It basically calls gettimeofday() repeatedly and checks if the time is less than the previous time; if so, it prints out some information about it. With the code as written, it runs for 20 seconds and then returns.

docker run --rm -ti rocker/shiny /bin/bash
R
install.packages(c('cpp11', 'later', 'brio', 'callr', 'cli', 'decor', 'desc',
  'tibble', 'vctrs'))

library(cpp11)
cpp_source(
  code = '
#include <unistd.h>
#include <stdio.h>
#include <sys/time.h>

// Returns time in milliseconds
unsigned long GetTickCount() {
  struct timeval start, end;
  unsigned long myTickCount;
  gettimeofday(&start, NULL);
  myTickCount = start.tv_sec*1000 + start.tv_usec/1000;
  return myTickCount;
}


[[cpp11::register]]
void time_test() {
  int i = 0;
  unsigned long start = GetTickCount();
  unsigned long a = start;
  unsigned long b;
  // Run for 20 seconds
  while(a - start < 20000) {
    a = GetTickCount();
    // Sleep for 0.1ms
    usleep(100);
    b = GetTickCount();
    if (a > b) {
      printf(
        "iteration: %ld, a-start: %ld a: %ld, b:%ld, b-a=%ld\\n",
        i, a-start, a, b, b-a
      );
    }
    i++;
  }
}
')


time_test()
#> iteration: 93320, a-start: 12493 a: 1624722553189, b:1624722553168, b-a=-21

The result above is what happens in Docker 2.5.0.0 on Mac. After about 12.5 seconds, it recorded an instance where the clock went backward by 21 milliseconds.

I think the solution may be to use clock_gettime(CLOCK_MONOTONIC, ...).

@jcheng5
Copy link
Member

jcheng5 commented Jun 26, 2021

Great find!! Maybe CLOCK_BOOTTIME on Linux? https://stackoverflow.com/a/3527632

@wch
Copy link
Member Author

wch commented Jun 26, 2021

The Docker blog has a post about how they handle time drift on Mac:
https://www.docker.com/blog/addressing-time-drift-in-docker-desktop-for-mac/

As long as CLOCK_BOOTTIME is monotonic, it should fix the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants