Bug in Zephyr OS high-precision timings sub-system (function sys_cycle_get_32()) #3012
Comments
by Benjamin Walsh: I think it's a duplicate of https://jira.zephyrproject.org/browse/ZEP-1416. |
by Sergey Arkhipov: Probably, yes. It just extends to the sys_cycle_get_32() function, so it appears to be not very useful. |
by Andy Ross: OK, starting by repeating my understanding: the root cause is a straightforward race (in both loapic_timer and cortex_m_systick) between an "accumulated" cycle count, which is updated in an ISR, and a register read of the "current" cycle count since the last timer interrupt. Because the ISR is asynchronous, it may not have run by the time the timer count reaches zero, and so you can see either the correct/updated accumulated count or a stale-by-one-tick non-updated one alongside a zero cycle count. (Random aside: I can sort of see this for the APIC, as it's an "external" device, but on ARM the cycle counter is a CPU register: how did they manage NOT to specify delivery of the timer interrupt to be cycle-exact vs. reads of the cycle count register! Ugh.) I think I agree with Ben in ZEP-1416: there's no proper way to fix this except by putting an irq_lock() around the two reads, which I'm reasonably certain is going to be impossible for performance reasons. But it's possible to almost fix this with a heuristic: use an atomic flag to signal the begin/end of the read pair and check that it wasn't set from the ISR in the middle. Something like this pseudocode:
{code}
static volatile int isr_fired; /* Set to 1 by ISR */

uint32_t get_curr_cycle_count(void)
{
    /* read hardware count register */
}

uint32_t k_cycle_get_32(void)
{
    uint32_t acc, cur;

    do {
        isr_fired = 0;
        acc = accum_count;
        cur = get_curr_cycle_count();
    } while (isr_fired);

    return acc + cur;
}
{code}
(Note we don't strictly need the atomic API for this purpose, and a volatile int would be fine on both ARM and x86; we might want to do that to save the handful of cycles on ARM due to the assembly functions. I think x86 just resolves to single-instruction compiler builtins.) This would robustly detect the ISR race and redo both reads when it happens, so you never see a non-monotonic timer read. Under normal circumstances it's just that one extra read/test/branch, so it should be fast enough.
But it's not perfect: it fixes the case in this bug, where you have a thread rapidly polling the cycle count and likely to hit the overflow every time, but it can still race against other callers of k_cycle_get_32(). If you get preempted inside that critical section and the other thread calls k_cycle_get_32(), it will reset the flag for you even if the ISR has already triggered, and you'll miss the update you were trying to detect. Honestly, I'd argue that's rare enough to ignore. The race is a very small window, and it's only possible to see in code like this (which is reasonable code!) that spins on the cycle count making millions of calls per tick to get high precision. But until Zephyr gets true SMP, it's literally impossible for two threads to do this at once, so we can't actually hit the bug that way. Ben? |
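The flag-based heuristic described above can be sketched as a self-contained simulation. All names and the simulated hardware below are hypothetical: a real driver reads an actual count register, and the flag is set from a real timer ISR rather than from inside the read path.

```c
#include <assert.h>
#include <stdint.h>

#define CYCLES_PER_TICK 1000u

static volatile uint32_t accumulated_cycle_count; /* updated by tick ISR */
static volatile int tick_isr_ran;                 /* set to 1 by the ISR */

static uint32_t hw_count;    /* simulated elapsed cycles within a tick */
static int pending_wrap;     /* tick rolled over, ISR not delivered yet */

/* Simulated timer ISR: credits one full tick to the accumulator. */
static void timer_isr(void)
{
    accumulated_cycle_count += CYCLES_PER_TICK;
    tick_isr_ran = 1;
}

/* Simulated hardware read. When a wrap is pending, the counter has
 * already reloaded and the ISR lands *between* the caller's two reads,
 * which is exactly the race under discussion. */
static uint32_t get_curr_cycle_count(void)
{
    if (pending_wrap) {
        pending_wrap = 0;
        timer_isr();
        hw_count = 1;
    }
    return hw_count;
}

/* Detect-and-retry read: clear the flag, sample both values, and redo
 * the pair if the ISR fired in the middle. */
uint32_t k_cycle_get_32(void)
{
    uint32_t acc, cur;

    do {
        tick_isr_ran = 0;
        acc = accumulated_cycle_count;
        cur = get_curr_cycle_count();
    } while (tick_isr_ran);

    return acc + cur;
}
```

Without the retry loop, a read that straddles the simulated wrap would pair a stale accumulator with the freshly reloaded counter and return a value smaller than the previous read.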
by Andy Ross: Oh, and Sergey: which hardware are you seeing this on most reliably? I'm guessing interrupt delivery in qemu may not be crazy enough to trigger it? |
by Andy Ross: One final note: I just stumbled on this comment in loapic_timer: {code} If that's true, then we have one more test to do to clean up overflow conditions. Do we have any idea which platforms this is talking about? Sergey, does this match behavior you see or are you seeing only the on-tick aliasing described above? |
by Sergey Arkhipov: Andy, we are using a custom board with an Intel Curie (Quark SE C1000) on it. As performance is paramount in our application, we have to do thorough profiling of each and every section of code. That's when we hit this issue, which was driving us crazy for a while, giving us deltas of around 2^32 cycles. It happens consistently, every several minutes at best. The sample I've included with the issue description triggers the issue every millisecond or so. |
by Sergey Arkhipov: BTW, the ARC architecture also has the same issue, not only x86 and ARM. |
by Sergey Arkhipov: Another side effect you haven't mentioned is that the issue totally eliminates high-resolution timing readouts for Zephyr. What is the point of calling k_cycle_get_32() if it can't guarantee precision better than a system tick? |
by Sergey Arkhipov: Luckily, for the x86 architecture we can use _NanoTscRead(). But for ARC we had to settle for system-tick resolution. |
by Benjamin Walsh: Andy Ross I was looking at this issue this morning, and I was coming up with a somewhat similar solution. I was looking at the arcv2_timer0 driver especially. I don't think you need the ISR to do anything new: k_cycle_get_32() should be able to tell by reading accumulated_cycle_count and checking that it has not changed. Also, we should be able to avoid the race condition you mentioned by locking the scheduler. Here's an example of what I am talking about: {code}
} accumulated_cycle_count has to be made volatile for this to work too, as mentioned in https://jira.zephyrproject.org/browse/ZEP-1416. IIRC, the target that had an issue was Simics, 2-3 years ago. Might have been fixed by now. Sergey Arkhipov Yes, I was looking at the ARC timer0 driver, and I saw it has the same issue indeed. |
by Benjamin Walsh: Sergey Arkhipov Not sure why you're saying that this does not have sub-tick resolution? The API still reads the hardware counter, which is normally much higher resolution than the system tick... |
by Andy Ross: Good point; your idea is cleaner and lacks the cross-thread race that mine had. One review-ish comment: do you really want the k_sched_lock() there? It's not a correctness requirement; it just makes the second iteration due to preemption less likely. But the whole point is that the retry is going to be very rare, and easily detected. I think paying the extra cycles for the lock call (which I see is just the two function calls and a decrement/increment) isn't worth it, IMHO. |
by Benjamin Walsh: Andy Ross Yeah, I think you're right about not needing the sched_lock/unlock. Instead of tracking "cur" as well, this is even cleaner:
} |
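The accumulator-recheck approach discussed in the last few comments can be sketched as a self-contained simulation (illustrative names and simulated hardware; the real version reads the timer's count register):

```c
#include <assert.h>
#include <stdint.h>

#define CYCLES_PER_TICK 1000u

static volatile uint32_t accumulated_cycle_count; /* updated by tick ISR */
static uint32_t elapsed_in_tick;   /* simulated "current count" reading */
static int isr_pending;            /* tick rolled over, ISR not run yet */

/* Simulated hardware read: if a tick ISR is pending, deliver it here so
 * it lands between the accumulator read and the recheck. */
static uint32_t read_count_register(void)
{
    if (isr_pending) {
        isr_pending = 0;
        accumulated_cycle_count += CYCLES_PER_TICK;
        elapsed_in_tick = 2;
    }
    return elapsed_in_tick;
}

/* Re-read loop: no flag and no scheduler lock; just retry if the
 * accumulator changed while we sampled the hardware. */
uint32_t k_cycle_get_32(void)
{
    uint32_t acc, cnt;

    do {
        acc = accumulated_cycle_count;
        cnt = read_count_register();
    } while (acc != accumulated_cycle_count);

    return acc + cnt;
}
```

The retry condition piggybacks on the accumulator itself, which is why no extra state needs to be shared with the ISR.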
by Sergey Arkhipov: Benjamin Walsh Well, in our code I tried to make a workaround to beat the non-monotonic timer behavior for users of the k_cycle_get_32() function. I had to do something with deltas of about 2^32 in magnitude:
And the diff value was floating, especially when the non-monotonic timer issue occurred. Probably this happens when another task and/or some lengthy ISR preempts the current task for more than one system tick. |
by Benjamin Walsh: Sergey Arkhipov "One would expect a function like k_cycle_get_32() to be atomic in the sense that it returns the underlying timer value AT THE TIME it gets called." But that is what it's supposed to do, if not for this bug we are trying to fix. If you don't protect against preemption while trying to compute an amount of time between two points in the code flow, all bets are off. I still don't understand your comment about tick granularity, though. On some implementations of k_cycle_get_32(), basically on targets where the timer's current counter is reloaded every tick, and thus has a period that is (way) smaller than what would fill 32 bits, we have an accumulator that gets updated every tick (in the timer ISR), and when the cycle value is needed, the hw register is read and added to the accumulator. That gives hw-frequency precision/granularity, not system-tick granularity. On some other implementations that have a free-running 64-bit counter, like the HPET, we can return the low 32 bits directly from the hardware. In both cases, there is overflow after 32 bits, so the behaviour is (in theory) the same. |
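For the reload-every-tick case Ben describes, the arithmetic can be made concrete (a sketch with illustrative names; real drivers differ in detail). With a down-counter that reloads to cycles_per_tick each tick, the sub-tick contribution is the distance travelled since the reload, not the raw register value:

```c
#include <assert.h>
#include <stdint.h>

/* acc: cycles credited by the tick ISR so far.
 * ccr: current value of the down-counter (counts cycles_per_tick -> 0).
 * Returns a value with hw-frequency granularity, not tick granularity. */
static inline uint32_t cycle_count(uint32_t acc, uint32_t ccr,
                                   uint32_t cycles_per_tick)
{
    return acc + (cycles_per_tick - ccr);
}
```

For example, with acc = 3000 and a 1000-cycle tick, a CCR reading of 250 means 750 cycles have elapsed in the current tick, giving 3750.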
by Sergey Arkhipov: Benjamin Walsh Yes, I never had issues with the _NanoTscRead() function, which is kind of a bare-metal one. Probably this non-monotonic timer issue, which sometimes makes k_cycle_get_32() return very inconsistent results, gave me the idea that it can't be trusted with regard to precision. I'll try the fix you've suggested. Hopefully it resolves all the issues. |
by M R Rosen: Isn't the fundamental problem that the CCR register (at least for the x86 LOAPIC timer) is free-running, while accumulated_cycle_count depends on the interrupt to update correctly? The issue arises when the CCR value is mismatched with the accumulated_cycle_count value (using an accumulated_cycle_count value from the previous period but a CCR value from the current period), so the resulting value of the function fails to capture time correctly. As such, even with the suggested fix to k_cycle_get_32(), the function will not work correctly for subsequent calls within an ISR or a block of code where interrupts are blocked. If so, it would be important to note that with regards to this function (that it may only work reliably when interrupts are enabled). |
by Sergey Arkhipov: M R Rosen Exactly. That's why I would call the behavior of this function inherently non-atomic. It seems that even if we put 'interrupts disabled' brackets around the code to be profiled, including the k_cycle_get_32() calls, we cannot get precise enough results. We can't assume that we are only one accumulated_cycle_count update ahead of the value read from the CCR. In applications with very fast system ticks (we currently have a 1 ms system tick) and relatively slow code to be profiled, it is possible that several timer interrupts are requested while doing the measurement. In this case, it is not easy to recover the actual delta value. The more reliable way is to use non-portable, architecture-dependent functions like _NanoTscRead() for x86. That one is inherently atomic. |
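One nuance worth separating from the bug itself: the natural 32-bit overflow is harmless for delta measurements, because unsigned subtraction in C is modular. A tiny illustration (names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Correct elapsed-cycles computation even when t1 has wrapped past
 * zero, as long as the real elapsed time is under one full 2^32 period. */
static inline uint32_t cycle_delta(uint32_t t0, uint32_t t1)
{
    return t1 - t0; /* modular arithmetic handles the wrap */
}
```

This handles the by-design wraparound; it cannot, of course, recover a delta corrupted by the stale-accumulator race being discussed here.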
by Sergey Arkhipov: I did try the solution suggested for the hello-world sample application mentioned above, and it did improve the non-monotonic timer behaviour. However, the k_cycle_get_32() function still sometimes returns wrong values. Instead of doing so every several milliseconds as before the fix, it does so on start-up and sometimes after PRINT (which is mapped onto printk()) function calls. So the printk() calls now act as the triggers for the issue in the hello-world application. Maybe printk() does something with interrupts and thus affects the k_cycle_get_32() non-monotonic timer misbehaviour? This is the result:
Printouts with diff = 30..34 happen when the 32-bit timer overflows, every 134 seconds, and are absolutely valid. As can be seen, these printouts cause series of wrong diff values reflecting the non-monotonic timer issue. |
by Andy Ross: I'm guessing (maybe) you're seeing the bug I found commented in the loapic driver: apparently the counter doesn't stop at zero and keeps going, so near zero you can read an underflowed value. If that's true, then it should (heh, famous last words) be sufficient just to treat the timer value as a signed quantity in the subtraction, so that negative values get added to the accumulator. Honestly, that might be happening anyway if the 2's-complement math isn't otherwise doing tricky stuff, so I'm not sure. This gets hard in Jira. Can you post a patch to gerrit showing exactly what you're doing? A patch against 1.5 code will work fine to submit (obviously we won't be able to merge it), so no need to try to port anything. |
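Andy's signed-math suggestion can be sketched like this (hypothetical names; it assumes the down-counter underflows past zero and keeps counting, so a read just after zero appears as a huge unsigned value that is a small negative number when reinterpreted as signed):

```c
#include <assert.h>
#include <stdint.h>

/* Treat the down-counter reading as signed: a slightly negative
 * "remaining" value means we are past the reload point, and the extra
 * cycles get added to the accumulator instead of producing a bogus
 * small result. */
static inline uint32_t cycle_count_signed(uint32_t acc, uint32_t ccr,
                                          uint32_t cycles_per_tick)
{
    int32_t remaining = (int32_t)ccr; /* may be just below zero */

    return acc + (uint32_t)((int32_t)cycles_per_tick - remaining);
}
```

With acc = 5000 and a 1000-cycle tick, an underflowed CCR reading of 0xFFFFFFFE is interpreted as -2, yielding 5000 + 1002 rather than an absurdly large sum.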
by Sergey Arkhipov: Andy Ross Andy, you might be right. It is not printk(). I removed it from the loop:
and still get the following:
It looks like now with the fix applied we have the issue only when the 32-bit timer overflows. |
by Sergey Arkhipov: It looks like the fix suggested for k_cycle_get_32() fixes the issue for the ARC architecture. But it still occurs on x86. |
by Andrew Boie: Are we overthinking this?
|
by Andrew Boie: I'm seeing what I suspect is a race condition, not mentioned in this discussion previously, on the LOAPIC with CONFIG_TICKLESS_IDLE=n. In this scenario, the timer is configured as periodic. The interrupt handler is very simple: it adds cycles_per_tick to the accumulated_cycle_count. What I am seeing occasionally is that accumulated_cycle_count is the same between two calls (i.e. the interrupt handler hasn't run), but the CCR has reset itself for the next tick, so the second call to k_cycle_get_32() returns a smaller value. In every case, the delta is about 319900. Is there an execution window between when CCR resets itself and when the interrupt fires? This patch illustrates the problem: Note that in the above patch, if I remove the irq_lock()s in main.c, the problem still occurs; the printouts just aren't consistent, since accumulated_cycle_count may have been updated. But there still seems to be a window where accumulated_cycle_count has not been updated, yet CCR has been. Here is the output of my test. Note the acc values are the same. t0 is always acc + an amount just below cycles_per_tick, and t1 is acc plus a very small value.
In previous discussion it had been speculated that locking interrupts would solve all of this, but it appears to me that the reset of the CCR, and the firing of the interrupt, are decoupled. Locking interrupts around the body of k_cycle_get_32() is not fixing this for me. |
by Andy Ross: That doesn't sound 100% insane to me. Interrupt delivery is at least a little bit asynchronous always: there are inevitable hardware cycles between a device posting an interrupt or pulling the IRQ line down and the actual execution of the ISR (at the very least, the CPU needs to wait for the pipeline to clear). And in that time it's possible to read the state that caused the interrupt in the first place and see the values from "after" the interrupt was posted. Now, this timer is actually on the interrupt controller, so it would actually be possible to "hold" the timer rollover until after the interrupt delivery was signaled by the hardware. But I'll bet it doesn't. And in general we can't rely on that anyway. Isn't the solution just to treat the rollover as a valid condition, by e.g. doing the delta using signed math and allowing times from the "future" to be legal? |
by Andrew Boie: Andy Ross I'm not sure what you mean by signed math, there's no negative value in play. I don't see any way to detect this situation. With CCR being reset back to cycles_per_tick, for all intents and purposes we have moved back in time. I don't see a way to programmatically know "oh, CCR reset to initial value but the interrupt hasn't fired yet" |
by Andy Ross: Ah, OK. I understood that the timer had a wraparound behavior (documented as a "bug" in the loapic comments somewhere) and not a static reset. Yeah, that's a mess without some kind of extra clock (TSC maybe?) to reference against. |
by Andy Ross: I guess another possibility without using a second clock would be to define a "fudge window" around the top of the timer countdown where we "expect that the ISR hasn't fired". If you get a value in that range (which in your data above is never more than 44 cycles long), just busy-wait polling it until it gets low enough. That comes to a 0.03% chance that a call to k_cycle_get_32() will be delayed by 100 cycles or so. Maybe not too high a price. |
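The fudge-window idea might look roughly like this (entirely illustrative: simulated counter, made-up window size of 64 cycles; a real implementation would tune the window per platform and read real hardware):

```c
#include <assert.h>
#include <stdint.h>

#define CYCLES_PER_TICK 1000u
#define FUDGE_WINDOW 64u   /* made-up size of the suspect range */

static uint32_t sim_ccr = CYCLES_PER_TICK; /* simulated down-counter */

/* Simulated hardware: the counter moves 50 cycles per read. */
static uint32_t read_ccr(void)
{
    sim_ccr -= 50u;
    return sim_ccr;
}

/* If the counter reads within FUDGE_WINDOW of the reload value, the
 * tick ISR may not have run yet, so busy-wait until the counter leaves
 * the ambiguous range before trusting the value. */
uint32_t read_ccr_safe(void)
{
    uint32_t v = read_ccr();

    while (v > CYCLES_PER_TICK - FUDGE_WINDOW) {
        v = read_ccr();
    }
    return v;
}
```

The cost model is the one Andy describes: the overwhelming majority of calls take one read, and only reads landing in the narrow window pay for extra polling.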
by Andrew Boie: It's hard to use the TSC in tandem with the LOAPIC timer, as they use different units and wrap around at different times. The TSC counts CPU cycles, the LOAPIC bus cycles. Also, I've seen the window as large as 500 cycles in some runs of the test; I think a 'fudge window' on the order of 1000 cycles may be too large. Andy Ross Benjamin Walsh I'm wondering if we should take a simpler approach and have LOAPIC's k_cycle_get_32() simply return accumulated_cycle_count. Alternatively, just use the TSC for _k_cycle_get_32() and be done with it. |
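A minimal TSC read, assuming GCC/Clang on x86 (this shows the general direction, not the actual Zephyr patch; note again that TSC units are CPU cycles, not LOAPIC bus cycles):

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h> /* __rdtsc(): GCC/Clang, x86 only */

/* The TSC is a free-running 64-bit cycle counter: no accumulator and no
 * ISR cooperation, hence none of the races discussed in this thread. */
uint64_t tsc_read(void)
{
    return (uint64_t)__rdtsc();
}
```

Because the counter is 64 bits wide and free-running, consecutive reads are monotonic without any software bookkeeping.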
by M R Rosen: Andrew Boie I would recommend against having the function just return accumulated_cycle_count, because that would just give you a counter with kernel-tick resolution, which you already have an API for, right? I think using the TSC and other architecture-specific clocks might be the best way to implement what is promised by k_cycle_get_32(). I'm not sure there is a solid way to shore up the current mechanism, unfortunately :( |
by Andrew Boie: M R Rosen agreed. I'm looking into a fix which will use the TSC. |
by Andrew Boie: This issue is also reproducible in QEMU on Nios 2 and RISCV. Most timer drivers have the same design: they update an accumulator in the ISR, and k_cycle_get_32() returns the sum of the accumulator + the value in the timer's count register. I believe at this point that this design is fundamentally unsound. It's also worth noting that this bug has been with us forever. |
by Andrew Boie: Also reproducible on ARC, tested on the Arduino 101 ARC side. My expectation is that all implementations of k_cycle_get_32() that are based on an accumulated cycle count updated by the interrupt have this problem. Kumar Gala Maureen Helm I have no ARM hardware at my disposal, and all my colleagues who do are traveling internationally. If one of you has a chance, can you run this test https://gerrit.zephyrproject.org/r/#/c/11305/1 on an ARM board of your choosing and let me know the result? QEMU didn't report any issue, but QEMU didn't on x86 either; I had to run it on a real board. |
by M R Rosen: Andrew Boie If you have an Arduino 101, can't you flash the Nordic BLE core? That's an ARM Cortex-M0(+?) and should be supported by Zephyr 1.6+ (one module, 3 architectures :) ) |
by Andrew Boie: M R Rosen aha, I did not know this. Will give that a try. Thanks! |
by Andrew Boie: Maureen Helm thanks! For drivers that re-arm the timer with every interrupt, this can typically be fixed by eliminating the races, e.g. by locking interrupts around the body of k_cycle_get_32(). This fixed the problem on ARC. However, on ARM this looks like the situation on x86 and Nios II, where the timer is simply periodic and the interrupt just updates an accumulated cycle count and tells the system that a timer tick has elapsed. This introduces an unavoidable race where the periodic timer register can reset and be readable for a short period of time before the CPU gets the associated interrupt. In this case, a second clock source is needed. On x86 we used the CPU timestamp counter (TSC). On Nios II you can use another timer IP block. On ARM it looks like the right thing to do is use the ARM Cycle Count Register (CCNT). I'm looking to see if CMSIS has APIs for this. I also need to figure out whether the CCNT frequency is the same as CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC. I have patches up, either merged or under review, for all the other arches, so AFAIK ARM is the only arch left to fix. |
by Andrew Boie: M R Rosen I looked into the BLE CPU on Arduino 101, but "make flash" is unimplemented on it, and I could not find documentation on how to get an image onto the device https://wiki.zephyrproject.org/view/Supported_Boards. Do you have any pointers for getting an image onto the Arduino 101's ARM CPU and capturing console output? |
by Andrew Boie: Fixes for all arches now merged.
by M R Rosen: Andrew Boie Sorry, I missed your comment earlier, but just for your own reference in the future: you can assign the UART port on the nRF51-series chip in Curie to any digital GPIO pins (see the Kconfig for the driver for which flags control this). On the Arduino 101, however, the 3 digital pins from the SoC are not brought out (BLE_SDA, BLE_SCL, BT_GPIO can all be used for any purpose). You can use the interface with the Quark SE C1000, though, to bridge the gap, i.e. put a simple pass-through application on the x86 core that takes anything received on UART_0 and prints it out, and leave the pins at the defaults for the arduino_101_ble board. Just note that due to defconfig issues, it was difficult for me to set the baud rate to the standard 115200 on both cores; the arduino_101_ble kind of forces it to be 1000000, I think. As for flashing, you can put images on the board using dfu-util, flashing to alternate setting 8, i.e. ble_core. So you can directly call "dfu-util -a ble_core -d 8087:0aba -D $(OUT_DIR)/zephyr.bin" to flash the core. |
Duplicate of GH-2893 |
Reported by Sergey Arkhipov:
Bug in Zephyr OS high-precision timings sub-system (function sys_cycle_get_32())
Configuration
any
Seen on nano (v1.5.0), but potentially exists on any kernel type/revision
Bug detailed description
Subsequent calls to sys_cycle_get_32() may return logically wrong timestamps, where the value returned earlier is greater than the value returned later. This is NOT the 32-bit counter overflow that happens naturally by design, but rather a design flaw, which can be observed randomly, sometimes even every several milliseconds, i.e. way more often than the underlying counter overflows.
Test case to replicate
Set up 1 ms system tick.
Modify hello-world sample:
Expected result
Something like this, where consecutive 'diff' printouts happen approx. every 2 minutes (i.e. the 32-bit timer overflow period @ 32 MHz CPU clock):
Actual result
Something like this, where consecutive 'diff' printouts happen approx. every several milliseconds (i.e. well below the 32-bit timer overflow period @ 32 MHz CPU clock):
Notes
Basically when the next sequence occurs in the code
t0 = sys_cycle_get_32();
t1 = sys_cycle_get_32();
one would expect t0 to be less than t1, provided there is no underlying 32-bit counter overflow (which is a rather rare event, happening approx. every 2 min at a 32 MHz clock, and thus expected and easily handled).
Instead, within a timer overflow cycle, multiple events occur where t0 is greater than t1, basically limiting the OS high-precision timer service to one-system-tick resolution. The core problem is in the code, which is almost the same for all platforms/revisions. Here it is for x86:
The problem occurs when sys_cycle_get_32() is called at the moment the hardware down-counter reaches zero and gets auto-reloaded to its original value, but the 'accumulated_cycle_count' value has not been updated yet, because that happens asynchronously inside the ISR. Then t1 may be less than t0 by up to one tick's worth of cycles. This makes the OS high-precision timing sub-system useless.
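To make the race concrete, here is a self-contained simulation of the pattern these drivers share (illustrative names; the accumulator stays stale because the simulated ISR has not run, while the down-counter has already reloaded):

```c
#include <assert.h>
#include <stdint.h>

static volatile uint32_t accumulated_cycle_count; /* updated in tick ISR */
static uint32_t elapsed_in_tick; /* simulated sub-tick hardware count */

static uint32_t current_cycle_count(void)
{
    return elapsed_in_tick;
}

/* The racy read: nothing guarantees that the accumulator and the
 * hardware count belong to the same timer period. */
uint32_t sys_cycle_get_32(void)
{
    return accumulated_cycle_count + current_cycle_count();
}
```

Reading once just before the simulated reload and once just after it reproduces t1 < t0, i.e. the non-monotonic readout reported above.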
Temporary workaround
For x86 architecture use _NanoTscRead() instead of sys_cycle_get_32(). It works flawlessly.
(Imported from Jira ZEP-1546)