Coroutine-aware wall-time profiling #21
Actually, I think we've got this working now using the set_context_id_cbk() approach. Any ideas about how to handle this?
Hi,

First of all, profiling async code was on top of my todo list, so I am a bit lucky to have you guys trying to solve a similar issue with practical needs :). I would happily accept any PRs and will try my best to help with any kind of issue regarding profiling async code.

Actually, what I would be suggesting is set_context_id_cbk(), like you have already figured out. We have previously solved a similar problem in greenlet-profiler. See here: https://emptysqua.re/blog/greenletprofiler/

Now, you seem to have lots of background on the issue and I am pretty new to these contextvars. Could you elaborate the problem more? I remember we have done similar things in the past by simply shifting some time values back and forth, but I really do not remember the details. Could you explain the problem with a simple example?
I think I remember the code: I mentioned it here: fastapi/fastapi#701 (comment). I might have misunderstood the problem; let's verify.
@sumerc

```python
import asyncio
import time
from contextvars import ContextVar

import yappi

yappi_request_id = ContextVar('yappi_request_id')
yappi_request_id.set(-1)


def get_context_id() -> int:
    try:
        return yappi_request_id.get()
    except LookupError:
        return -2


yappi.set_context_id_callback(get_context_id)


async def func_to_profile(context_id: int) -> float:
    yappi_request_id.set(context_id)
    start = time.time()
    await asyncio.sleep(1)
    end = time.time()
    assert yappi_request_id.get() == context_id
    return end - start


yappi.set_clock_type("wall")
yappi.start()


async def main():
    context_ids = [1, 2, 3]
    tasks = [func_to_profile(i) for i in context_ids]
    actual_durations = await asyncio.gather(*tasks)
    yappi_durations = [
        yappi.get_func_stats({"name": "func_to_profile", "ctx_id": i}).pop().ttot
        for i in context_ids
    ]
    for context_id, actual_duration, yappi_duration in zip(context_ids, actual_durations, yappi_durations):
        print(f"Task {context_id}:")
        print(f"    Actual wall time: {actual_duration * 1000:>8.3f}ms")
        print(f"    Yappi wall time:  {yappi_duration * 1000:>8.3f}ms")


if __name__ == '__main__':
    asyncio.run(main())

"""
Task 1:
    Actual wall time: 1005.207ms
    Yappi wall time:     0.059ms
Task 2:
    Actual wall time: 1005.039ms
    Yappi wall time:     0.034ms
Task 3:
    Actual wall time: 1004.950ms
    Yappi wall time:     0.090ms
"""
```

If I understand correctly, if things were behaving properly, "Yappi wall time" should be larger than "Actual wall time".
I'm trying to read through how you got things working with greenlet, but I'm having a hard time figuring out how to get it to work with asyncio. I'll update if/when I make more progress; please let me know though if you think you see how to make it work.
@sumerc Oh, I think I see what's going on -- the point is that the greenlet profiler has a slightly modified version of yappi that does the pausing. (I didn't realize you had linked me to a different repo!) Am I understanding that correctly? Is your proposal that we do something similar to handle asyncio calls? Do you have any ideas about an API for this? It would be great if we could get it upstreamed into the main yappi, e.g., as a different clock option or similar (not sure what you have in mind though). But even if it requires essentially replicating
Also, I noticed in the greenlet modification of yappi, it only does the context pausing/resuming if the clock type is "cpu". So I'm a little worried something else might be going on here? In particular, I don't fully understand why the wall clock times are off in the example above. (I'll dig into this more eventually, I'll have to take a break on this for at least a few hours though.) (Or maybe the point is that we just need to make use of different versions of the _pause_ctx and _resume_ctx functions?) Edit: I definitely feel out of my depth trying to figure out what is going on here. Some potentially relevant info:
@dmontagu, I think your initial investigation is pretty useful, and you are correct that we will need to think about different threads in an event loop. I will be looking into this in more depth, now that I understand (and also remembered :)) the problem fully. I am currently not sure why wall time is not working for async functions; I will start looking into that.

BTW, let me ask you this: is using multiple threads for a single event loop a common use case? Could you also provide an example, or maybe just explain (without code), where it makes sense?
Yes, multiple threads in a single event loop is relatively common in an ASGI context -- it is basically the only way you can perform blocking IO (e.g., using blocking libraries, such as a non-async elasticsearch client or similar) inside an async endpoint.

Starlette, one of the more popular async frameworks, has a built-in function that allows you to basically convert a blocking, non-async function into a non-blocking call by running it in a threadpool. (For context, FastAPI (the project where the issue was initially raised) is basically a framework built on top of Starlette oriented around high-speed development and OpenAPI compatibility. So all of the examples related to Starlette are basically directly relevant to FastAPI too.)

While the Starlette helper covers many cases, another common example I've seen is performing password hashing/checking for login in a threadpool, since most cryptography libraries will release the GIL while hashing a password; this is critical to prevent the server from becoming non-responsive while someone is logging in (since, for security reasons, password hashing should take a macroscopic amount of time).

More rarely, it can be helpful to actually use a ProcessPoolExecutor, rather than a ThreadPoolExecutor (specifically when CPU bound), but I would imagine that would be harder/impossible for yappi to profile properly. Fortunately, given how much rarer that is, I think it is fine to require such situations to be handled in a more manual way.

That said, at least in the context of profiling a web application, I don't think one needs to be very directly concerned about the thread ID; typically more important would just be tracking the request ID (what I'm storing in the contextvar above).

(Obviously it would also be fine if you could track both the async context id and the thread id. And maybe this is possible already since it looks like the
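The threadpool pattern described above can be sketched with just the standard library; `blocking_io` and `handler` are illustrative names, not Starlette/FastAPI APIs:

```python
import asyncio
import time


def blocking_io():
    # stand-in for a call into a blocking library (e.g. a sync DB client)
    time.sleep(0.2)
    return "done"


async def handler():
    loop = asyncio.get_running_loop()
    # run the blocking call in the default ThreadPoolExecutor, so the
    # event loop stays free to service other tasks in the meantime
    return await loop.run_in_executor(None, blocking_io)


print(asyncio.run(handler()))  # → done
```

Starlette's threadpool helper is, roughly speaking, a convenience around this same idea.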
Thanks for the very detailed answer. Now I will probably be thinking about these in my head for days :)

The problem with threads is more about thinking about the shared vars/global structures rather than functionality. While designing the solution, I need to think all the time that this might be called from another thread, etc.

For the wall-time problem, I think I understand the issue (it is interesting that we have not seen it before): the profiler CALL/RETURN events are called for every context switch that occurs. Normally these events are generated when a function ends, but in the async case they are called whenever a task switch occurs. So that is why we cannot calculate the wall time correctly: whenever you call sleep(..), the profiler thinks it left the function.

BTW, this same behaviour is also present in cProfile. It also does not calculate wall time correctly. I mean, correct might be a wrong word here, as at the end of the day, when you call sleep, in fact there is nothing happening in the single-thread case, and for multiple threads, cProfile already says it does not support that.

So, I need to think about this more. I still feel that this is solvable somehow but might require some careful work. I am thinking about an API like set_context_backend("asyncio") which sets the context id callback and enables the correct profiling for the backend. And do the same for set_ctx_backend("gevent"). I initially did not like the idea of integrating yappi with different frameworks, but given the fact that there are lots of libs/use cases for these frameworks, we should add support for them, IMHO.
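The CALL/RETURN-on-every-switch behaviour described here can be observed with a plain generator and `sys.setprofile` (a minimal sketch; `gen` and `profiler` are illustrative names): every suspension is reported to the profiler as a 'return' event, and every resumption as a fresh 'call'.

```python
import sys

events = []


def profiler(frame, event, arg):
    # record only the events belonging to the generator's own frame
    if frame.f_code.co_name == "gen":
        events.append(event)


def gen():
    yield 1
    yield 2


sys.setprofile(profiler)
list(gen())  # exhaust the generator: two yields plus the final return
sys.setprofile(None)

print(events)  # each yield is reported as 'return', each resume as 'call'
```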
BTW, one of the good side effects of this behavior (calling CALL/LEAVE on context switches) is that you get correct CPU time per-task for free, which we struggled with for greenlet :)
Yeah, that makes sense to me. I think supporting this properly would definitely be a big value proposition over alternative profilers. Another question to consider: should we make it possible to track each of the following?
I think (but am not 100% sure) that each of these are distinct, and potentially useful. (For example, having a low 1, high 2, and high 3 probably means you have blocking IO that should be refactored; having a low 1, low 2, and high 3 probably means you just have slow IO.) Perhaps there is nothing to do to enable tracking 1 and 2 (if that's how things currently work?) -- just don't make a
Yes, I assume 1 and 2 are already covered. (I will also check on that.) I am also OK with call_count being incremented per context switch; it also makes some kind of sense. I am not sure if 2 has any benefits when 3 is available. Do you really think it is possible to have a low 2 and high 3? Or a high 2 and low 3?
So whenever a generator yields, it's considered to be exiting the function by profilers? That's going to be an issue... I get why they'd do this for profiling, since it ensures the time being measured is strictly limited to the time being spent in the function, but it also makes proper timing of asyncio functions a lot harder to achieve, since the timing stops whenever an await suspends the coroutine. We would need a different way to figure out the callstack of running tasks, then. After looking around, it looks like
Not necessarily. A solution to allow profiling
That is exactly what I will be looking into. We need to differentiate between a real coroutine exit and a coroutine task switch. But coming to the first question @dmontagu asked: is this really what we want to achieve? You guys know the async world better than me; do you think implementing 3 over 2 has practical benefits?
Yeah, but the problem is, you see, gevent does run until the end of the coroutine and does not call profiler events for context switches; that is something we have simulated in the past. But asyncio does. I am pretty sure all these libraries will have, one way or another, subtle differences like this which will probably make the implementation different. First things first: let's focus on asyncio. It is currently the most widely used one, am I right?
I would personally be fine with just 1 and 3 (tracking actual CPU time spent in the function + total time that function has spent in the callstack). 2 would be nice, I suppose, and it's the current behavior so implementing it wouldn't cost any effort, but considering one needs to be picked between 2 and 3 if we don't want to alter the YStats objects, I'd prefer 3.
From the way you're describing this, I'd consider this more of a "gevent quirk" (compared to native coroutines, anyway) than a framework difference that needs to be accounted for with various modes. I'm saying this having never used gevent before, though.
Asyncio is mainly just a series of I/O systems built on top of the Python 3.5+ coroutines (which mostly are just generators with special syntax to make sure they don't interfere with real generators), and I don't believe a profiler like yappi really needs to account for anything beyond that. But I do believe that asyncio has pretty much conquered the async space in Python (I haven't heard of Twisted or Tornado in quite some time, and Trio is still fairly niche for the time being, even though it shows a lot of promise). Limiting support to asyncio only, if that choice ends up having to be made, would probably work just fine for the vast majority of users (myself included). I was just saying this to point out that it might not actually be needed.
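The "coroutines are mostly generators" point can be shown directly: an `async def` coroutine can be driven by hand with `send()`, no event loop required (a sketch; `coro` is an illustrative name, and it relies on `asyncio.sleep(0)` happening to suspend via a single bare yield):

```python
import asyncio


async def coro():
    await asyncio.sleep(0)  # sleep(0) suspends exactly once
    return 42


c = coro()
c.send(None)      # runs up to the suspension point, like next() on a generator
try:
    c.send(None)  # resuming past the end raises StopIteration
except StopIteration as exc:
    print(exc.value)  # → 42
```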
It should never be possible to have a high 2 and low 3. However, the difference between low-2-high-3 and high-2-high-3 can be significant. You can confirm that both low-2-high-3 and high-2-high-3 are possible by converting the function in my example above from:

```python
async def func_to_profile(context_id: int) -> float:
    yappi_request_id.set(context_id)
    start = time.time()
    await asyncio.sleep(1)
    end = time.time()
    assert yappi_request_id.get() == context_id
    return end - start
```

to:

```python
async def func_to_profile(context_id: int) -> float:
    yappi_request_id.set(context_id)
    start = time.time()
    time.sleep(1)  # <-------------
    end = time.time()
    assert yappi_request_id.get() == context_id
    return end - start
```

Before the change, you get low-2-high-3; after the change, you get high-2-high-3. While this is one of the most basic possible examples, this specific issue has been brought up in multiple FastAPI issues. There are also plenty of other situations (especially when depending on libraries written without async support in mind).

High-2-high-3 means there are calls blocking your event loop, potentially massively reducing your ability to handle concurrent requests. Detecting this is especially important because even "small" amounts of blocking IO (to the tune of 10+ms per request) will have a catastrophic impact on the maximum number of requests you can handle per second. Typically, a low-1-high-2-high-3 situation means the problem is in the Python code you've written.

Low-2-high-3 means that you aren't wasting compute resources waiting on a blocked call, but you are still taking a long time to return a response. For example, maybe you are asynchronously executing a database query which could benefit from a (missing) index. Your Python server may not be wasting any time or compute, but you would still be getting a response to your users slower than you may like. How important this is would depend on the context and magnitude of the delay, but without an async-friendly profiler it may be difficult to discover precisely which call is causing the delay. Typically, a low-1-low-2-high-3 situation means the problem is NOT in the Python code you've written, but in something you are asynchronously accessing.
Thanks for the detailed explanation again. I will be reading the comment from time to time for reference.

So, for the next step, I think we all at least agree that having these metrics is very useful:

a. CPU time per-coroutine

Let's defer the discussion of having

Now, as I said before, a) is already available, as function exit events happen for every context switch.

PLEASE: bundle your tests as unittest.TestCase, because I will be adding them to the unit tests later on. I will be testing the behaviour on all Python versions starting from 3.5; this is something I would be doing, but if you have any chance of testing on those versions too, that would be great!
@dmontagu @sm-Fifteen The test branch name is coroutine-profiling. Please send your valuable feedback. Thanks!
iirc contextvars appears in 3.7 |
Yes, yes. The issue is not related to contextvars anymore. So, you can already use
my bad I thought you wanted to test it on 3.5 |
using @dmontagu snippet from here and coroutine-profiling branch (installed with I get
Yes, I'll write a couple unittest.TestCase examples tonight. I really appreciate the prompt responses and progress!
Yes. I have not tested it; I will share my tests here for you to verify in a following comment. But I know the problem here: the problem is when you set the context id inside the coroutine being profiled itself; set it in a wrapper instead:

```python
import asyncio
import time
from contextvars import ContextVar

import yappi

yappi_request_id = ContextVar('yappi_request_id')
yappi_request_id.set(-1)


def get_context_id() -> int:
    try:
        return yappi_request_id.get()
    except LookupError:
        return -2


yappi.set_context_id_callback(get_context_id)


async def _wrapper(context_id: int) -> float:
    yappi_request_id.set(context_id)
    assert yappi_request_id.get() == context_id
    return await func_to_profile()


async def func_to_profile() -> float:
    start = time.time()
    await asyncio.sleep(1)
    end = time.time()
    return end - start


yappi.set_clock_type("wall")
yappi.start()


async def main():
    context_ids = [1, 2, 3]
    tasks = [_wrapper(i) for i in context_ids]
    actual_durations = await asyncio.gather(*tasks)
    yappi_durations = [
        yappi.get_func_stats({
            "name": "func_to_profile",
            "ctx_id": i
        }).pop().ttot for i in context_ids
    ]
    for context_id, actual_duration, yappi_duration in zip(
        context_ids, actual_durations, yappi_durations
    ):
        print(f"Task {context_id}:")
        print(f"    Actual wall time: {actual_duration * 1000:>8.3f}ms")
        print(f"    Yappi wall time:  {yappi_duration * 1000:>8.3f}ms")


if __name__ == '__main__':
    asyncio.run(main())

'''
Task 1:
    Actual wall time: 1001.781ms
    Yappi wall time:  1001.787ms
Task 2:
    Actual wall time: 1001.832ms
    Yappi wall time:  1001.835ms
Task 3:
    Actual wall time: 1001.859ms
    Yappi wall time:  1001.862ms
'''
```

So I have just added a _wrapper coroutine that sets the context id before awaiting the profiled function.
@sumerc Awesome! This is not a problem at all; we'd be setting the contextvar's value (and therefore the context id) in a cheap middleware which wouldn't need to be profiled. Basically equivalent to the wrapper you've written, but integrated neatly with the server.
Below is one of the test files I have used. I have tried to think of corner cases (at least they are corner cases to me :)) like async generators, recursive async functions, old-style coroutines, etc. I am counting on you guys to find more of these :)

```python
import asyncio
import yappi
import time


def mygenerator(n):
    while (n):
        yield n
        n -= 1


# async generator
async def ticker(n):
    for i in range(n):
        yield i
        await asyncio.sleep(1.0)


@asyncio.coroutine
def old_style_coroutine():
    yield from asyncio.sleep(1)


async def myasyncfunc(n=2):
    if not n:
        return
    await asyncio.sleep(1.0)
    time.sleep(1.0)
    await old_style_coroutine()
    async for i in ticker(2):
        print("giter", i)
    await myasyncfunc(n - 1)


yappi.set_clock_type("wall")
yappi.start()
for i in mygenerator(5):
    print(i)
asyncio.run(myasyncfunc())
yappi.stop()
yappi.get_func_stats().print_all()
```
@sumerc Another thing to think about -- async context managers. Probably doesn't add new edge cases (and may ultimately be basically equivalent to async generators), but worth ensuring it behaves properly.
Yes. As yappi already has support for recursive functions, I think it will not be too much of a problem. But of course, in this case it is new and needs to be tested. It seems to be working, at least for simple cases.
Nice catch. Would be nice to test.
Shouldn't the issue be renamed into something like "Accurate wall-time for coroutines" or something along those lines? |
Update: there is an issue related to the time calculation of concurrent coroutines on the same function, but I think it is close to finished; I am still testing.
For what it's worth, here is a performance timer I made for Trio tasks. It measures wall time while the task is executing (i.e. excluding trio.sleep(), async I/O, etc.).

```python
def trio_perf_counter():
    """Trio task-local equivalent of time.perf_counter().

    For the current Trio task, return the value (in fractional seconds) of a
    performance counter, i.e. a clock with the highest available resolution to
    measure a short duration. It includes time elapsed during time.sleep,
    but not trio.sleep. The reference point of the returned value is
    undefined, so that only the difference between the results of consecutive
    calls is valid.
    """
    ...  # (see https://github.com/belm0/perf-timer#trio_perf_counter and source)
```

This wasn't too hard to implement for Trio, which has an instrumentation API. If you're able to implement something equivalent for asyncio, I'd like to add it to my PerfTimer package. I'm aware of one attempt which no longer works with recent Python versions.
Nice to know. I will notify you when the work is finished.
Hi @dmontagu,
thanks a lot @sumerc that looks amazing.
Edit: I thought it was PyCharm, but I got the same result running the script directly in a terminal.

from debugger in pycharm
This is great! I'll try it out when I get a chance (hopefully some time this week) and give you feedback. Thanks so much for all of your effort!
Please do not use the context id callback for tagging requests anymore; note that the code below uses set_tag_callback instead. The other problem, regarding the "import sys" error, is related to having yappi still running while the module is being unloaded. I will fix that, but it is mostly cosmetic. To fix it for now, simply call yappi.stop(). Here is a fully working example for you to play with:

```python
import asyncio
import time
from contextvars import ContextVar
import sys

import yappi

yappi_request_id = ContextVar('yappi_request_id')
yappi_request_id.set(-1)


def get_context_id() -> int:
    try:
        return yappi_request_id.get()
    except LookupError:
        return -2


yappi.set_tag_callback(get_context_id)


async def _wrapper(context_id: int) -> float:
    yappi_request_id.set(context_id)
    assert yappi_request_id.get() == context_id
    return await func_to_profile()


async def func_to_profile() -> float:
    start = time.time()
    await asyncio.sleep(1)
    end = time.time()
    return end - start


yappi.set_clock_type("wall")
yappi.start()


async def main():
    context_ids = [1, 2, 3]
    tasks = [_wrapper(i) for i in context_ids]
    actual_durations = await asyncio.gather(*tasks)
    yappi_durations = [
        yappi.get_func_stats({
            "name": "func_to_profile",
            "tag": i
        }).pop().ttot for i in context_ids
    ]
    for context_id, actual_duration, yappi_duration in zip(
        context_ids, actual_durations, yappi_durations
    ):
        print(f"Task {context_id}:")
        print(f"    Actual wall time: {actual_duration * 1000:>8.3f}ms")
        print(f"    Yappi wall time:  {yappi_duration * 1000:>8.3f}ms")


if __name__ == '__main__':
    asyncio.run(main())
    yappi.stop()
```
Using the snippet above I got it working in a terminal; however, when I run it in PyCharm I get this huge wall of warnings. Sorry if I'm doing (again) something silly!

Edit: sorry, I got the same warnings in PyCharm and in a terminal; I was trying another script.

from debugger in pycharm
No need to be sorry! These are valuable user experience issues and should be taken care of. I should be using a more uncommon default tag id.

Edit: I have fixed the problem with the "import sys" error on interpreter shutdown.
Ok. This will hopefully be my final update:

Anyway, I will be finalizing the tests and hopefully will be releasing a new version which includes lots of new, exciting stuff. Looking forward to your latest feedback and use cases in FastAPI, too, of course.
Relevant PR merged to master. Closing this gigantic issue... |
Having had some time to try this in the wild, I will say that this is extremely useful, although there are a few things that tend to throw off profiling results in async contexts:
Both of these affect threads as well as coroutines (and threads are often used to run sync code without blocking the asyncio thread, so there's also that), and might be worth looking into as potential improvements. Also, it's really interesting to see the extra attention this project has been seeing now that the word has gotten out that there's a Python profiler with coroutine support! :)
For asyncio, I do not know any way to differentiate between a task that is ready to run and one still waiting on I/O. Maybe there is a way, though. For the GIL: again, AFAIK there is no way of knowing if a thread is waiting on the GIL or on I/O. I know some profilers like py-spy show you statistically how much time is spent in functions related to C extensions; maybe you can use that to retrieve GIL-related information... Not sure. Having said that, I believe these ideas are worth digging into and I will be thinking about them, but I am not sure they can be incorporated into yappi, as solving these problems might require a different set of tools/approaches (walking the real stack trace of the application, or instrumenting user-level functions...).
I could not quite understand this one. Could you please provide a simple example?
Yes :) Thanks to you and the other guys at FastAPI.
Well, the way I'm thinking of would require always collecting stats for both wall and CPU time. You could get a pretty rough idea of the "time window to resume" or the "percentage of time blocked" (or however such a metric would be named) by just adding together the CPU time for all the other frames that ran during the lifetime of that function. This obviously wouldn't hold in the case of C extensions that are able to release the GIL on their thread, but it's a metric I'm sure a lot of people would appreciate having in a multithreaded context.

```python
>>> import yappi
>>> import asyncio
>>> async def test_coro():
...     await asyncio.sleep(5)
...
>>> async def amain():
...     coro1 = test_coro()
...     coro2 = test_coro()
...     await asyncio.gather(coro1, coro2)
...
>>> def do_profile():
...     yappi.set_clock_type("wall")
...     yappi.start()
...     asyncio.run(amain())
...     yappi.stop()
...     yappi.get_func_stats(filter={'name':'test_coro'}).print_all()
...
>>> yappi.clear_stats()
>>> do_profile()
```
OK, then I might have misunderstood. We call Wall time - CPU time the I/O time, and yes, this can be done by having a clock_type of both wall/cpu at the same time. It is quite possible. I had assumed you were asking to differentiate the wait on different objects during I/O: like how much we waited for the GIL, and how much we actually waited for the I/O operation to complete.
We can filter on the ctx_id and tag fields to get per-coroutine/thread info on any result. I am really not sure how to improve this more, honestly. The way yappi looks at this dimension is: when it measures wall time, for example, it measures it from the perspective of the thread/coroutine; that is why we always see the aggregated time. That is also probably why we need to have good filtering support...
Well, that would definitely be the ideal case, but I understand that those are all things that turn out to be very hard to test for at the moment, so a
Yeah, things like
For the sake of profiling ASGI apps, it would be really helpful if there was a way to make use of Python 3.7's contextvars while filtering via yappi.get_func_stats. If yappi supported this, you could get per-request-response-cycle profiling through an ASGI middleware.

I'm not sure if the necessary functionality already exists or not, but wanted to open an issue in case anyone else has thought about this, or in case someone is aware of an existing way to accomplish this.
See fastapi/fastapi#701 for more context on how we'd like to use this.