-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding platform telemetry #2109
Conversation
Some preliminary comments -
|
added global enable/disable of telemetry to c and cxx api add docs.
In reply to: 541264286 [](ancestors = 541264286) |
README.md
Outdated
@@ -204,3 +206,8 @@ or contact [[email protected]](mailto:[email protected]) with any addi | |||
*** | |||
# License | |||
[MIT License](LICENSE) | |||
|
|||
# Data/Telemetry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We must call out that it is disabled by default. #ByDesign
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My idea was to have the DOCS clear that telemetry will be on by default. then make the code obey that as and when we are ready . What I want to avoid is have the docs say that they are off by default, and then when we later turn the code on it's not super clear that you need to go an read the docs again (aka breaking change). in this manner, the docs and "design" are all such that it is clearly messaged telemetry is ON by default. it's a "bug" that we don't have the code behaving per spec yet. In that fashion we can "fix the code" later and all be good. It isnt a breaking change . you like ?
In reply to: 335297148 [](ancestors = 335297148)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we decided to leave the session option for this to be disabled by default in 1.0 release. The session option config should match the docs, right? Or I'm missing something? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs should reflect the current state. To start, it will be OFF by default. When it's out of preview and ready to fully turn on, we will update the documentation to reflect that state.
For the specific documentation, I suggest:
- main readme page mentions that telemetry collection is implemented in this project and to refer to the detailed data collection page. We don't need to change this documentation when default state changes.
- data collection page includes privacy policy as well as the documentation on how to enable/disable
- C API page just links to data collection page. We can still have a section heading, but it will contain nothing but the link to the data collection page. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change the docs to denote it is OFF by default during BETA, and to expect it to ON by default once it graduates from BETA (for transparency)
In reply to: 335777476 [](ancestors = 335777476)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this change like you suggested, just a heads up that now we have C API specific docs embedded in the privacy statement. If we ever have projections on these (like C# or WinRT) we might have to figure out where to put those docs. I had the turn/on and off in the C_API docs since it was specific to the C_API.
Let me know what you think with the new doc layout !
In reply to: 335829030 [](ancestors = 335829030)
docs/C_API.md
Outdated
# Telemetry | ||
This project collects usage data and sends it to Microsoft to help improve our products and services. Note however that no data collection is performed by default when using your private builds | ||
|
||
Telemetry is turned on by default when using the Official Builds. This is implemented via 'Platform Telemetry' per vendor platform providers (see telemetry.h). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turned "off" by default. Also, is it better to write the documentation in one place and link to it from everywhere else? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check out my comments on the other doc on why i like have the "spec" be ON by default, and the code will catch up later when we are ready (bug level fixes) . as to the docs spread around a bit, i went back and forth on it :) . pros/cons.
- other OSS components have multiple docs. having things like a "privacy statment" are nice.
- i like the main README being more of a TOC and link off. that way the root/readme doesn't just bloat and bloat with content.
- i like the idea of drill down for details. the top level docs have basic ideas. and you get to go down deeper (like the c_api doc) to learn how to use the actual C api in this fashion.
what ya think?
In reply to: 335297643 [](ancestors = 335297643)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's ok as long as we don't end up with inconsistent documentation. @faxu can probably review this and give some comments. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as mentioned above, the spec/documentation should reflect the current state for accuracy. that said, the top-level page (main readme) can be general enough to not require updating once default state changes. The intent of that page is to inform about the presence of telemetry-collecting code. All further details on default states can live on the Privacy info page.
The info on how to turn on/off should also be on that Privacy info page, I think. #Resolved
@@ -266,8 +266,14 @@ class PosixEnv : public Env { | |||
return filename; | |||
} | |||
|
|||
// \brief returns a provider that will handle telemetry on the current platform | |||
const Telemetry& GetTelemetryProvider() const override { | |||
return telemetryProvider_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
telemetry_provider_ #Resolved
@@ -211,6 +212,11 @@ class WindowsEnv : public Env { | |||
ORT_NOT_IMPLEMENTED(__FUNCTION__, " is not implemented"); | |||
} | |||
|
|||
// \brief returns a provider that will handle telemetry on the current platform | |||
const Telemetry& GetTelemetryProvider() const override { | |||
return telemetryProvider_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
telemetry_provider_ #Resolved
const std::unordered_map<std::string, std::string>& model_metadata, | ||
const std::string& loadedFrom, const std::vector<std::string>& execution_provider_ids) const; | ||
|
||
virtual void LogRuntimeError(uint32_t sessionId, const common::Status& status, const char* file, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
session_id #Resolved
@@ -95,6 +95,8 @@ inline std::basic_string<T> GetCurrentTimeString() { | |||
|
|||
} // namespace | |||
|
|||
std::atomic<uint32_t> InferenceSession::global_session_id_{1}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could 2 different users report the same session id in the events? How do we differentiate between them? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep. by 2 users we could have:
- 2 apps on the same user desktop machine login instance
- 2 apps running on different machines
the telemetry events have an "app session" global that is kept track of by the telemetry engines. it lets us know which machine/app combo is reporting in. so session_id is really the "ort session" differentiator inside a specific app instance (aka process ID).
In reply to: 335301237 [](ancestors = 335301237)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to have 2 different sessions within the same app instance? If yes, can we still differentiate? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory a long lived process could have global_session_id_ wrap around such that every 4B sessions would restart the numbering, and would maintain the same session id. We actually have a state machine that lives in the cloud @ (xiangting) is helping us to lock down. This is how we keep track of state and we can use to answer if the wrap around is ok, and also whether or not we need/want a "SessionEnd" event. I'll file an open issue for this.
In reply to: 335779060 [](ancestors = 335779060)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure I fully answers the quesiton :) is that what you meant? Wrap around to 4 billion? people can totally have 2 OrtSession's in the same app instance, and then we would differentiate using that uint32, right? (assuming we are ok with wrap around). another easy fix here is to just make it int64.
In reply to: 336570778 [](ancestors = 336570778,335779060)
} while (0) | ||
|
||
#define ORT_RETURN_IF_ERROR_SESSIONID_(expr) ORT_RETURN_IF_ERROR_SESSIONID(expr, session_id_) | ||
#define ORT_RETURN_IF_ERROR(expr) ORT_RETURN_IF_ERROR_SESSIONID(expr, 0) | ||
|
||
#define ORT_THROW_IF_ERROR(expr) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we plan to send events for cases when we just do ORT_THROW? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch. I first coded ORT_RETURN_IF and ORT_THROW_IF . it covers most all code paths, but not all. how about i file a tracking item in a second PR to catch all of the macros?
In reply to: 335302117 [](ancestors = 335302117)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that's fine. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
void LogProcessInfo() const override; | ||
|
||
void LogSessionCreation(uint32_t session_id, int64_t ir_version, const std::string& model_producer_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to use a struct to pass so many parameters to the function? This way if we add one more thing to the event only the struct needs to change, not the function declarations and definitions across several files. #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i like it. want me to do that in this PR, or would as second change be ok ? another thing to keep in mind is that changing this is a big deal. it's a "schema change" . so this is not somethin we do all the time, or need to necessarily optimize for. in our experience we change it maybe once a year. as an example winML is coming on 4 releases now (2 years) and we haven't change the schema yet.
In reply to: 335472938 [](ancestors = 335472938)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's going to take us some iterations before we stabilize and it would be nice to avoid the pain of changing across multiple files in the meanwhile. If you're running out of time, it's ok to do this in a follow up PR. #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need to add a link to the telemetry info on the Nuget and PyPi package descriptions. I suggest creating an aka.ms shortlink and using that in the description. That shortlink should route to the privacy policy page in this repo.
docs/C_API.md
Outdated
# Telemetry | ||
This project collects usage data and sends it to Microsoft to help improve our products and services. Note however that no data collection is performed by default when using your private builds | ||
|
||
Telemetry is turned on by default when using the Official Builds. This is implemented via 'Platform Telemetry' per vendor platform providers (see telemetry.h). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as mentioned above, the spec/documentation should reflect the current state for accuracy. that said, the top-level page (main readme) can be general enough to not require updating once default state changes. The intent of that page is to inform about the presence of telemetry-collecting code. All further details on default states can live on the Privacy info page.
The info on how to turn on/off should also be on that Privacy info page, I think. #Resolved
README.md
Outdated
@@ -174,3 +176,8 @@ or contact [[email protected]](mailto:[email protected]) with any addi | |||
*** | |||
# License | |||
[MIT License](LICENSE) | |||
|
|||
# Data/Telemetry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's put this above the License #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TraceLoggingString(model_metadata_string.c_str(), "modelMetaData"), | ||
TraceLoggingString(loadedFrom.c_str(), "loadedFrom"), | ||
TraceLoggingString(execution_provider_string.c_str(), "executionProviderIds")); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we going to get the adapter Luid? So we can get GPU information.
We used to log it in SessionCreation telemetry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now we wont have it. We have options :
1- add an event to the EP's to output what device was selected.
2- do nothing, this event logs which EP was chosen - so we know when GPU was chosen. and we can use existing telemetry to know which GPU was on the machine. it will have a gap if the machine has 2 GPU's (what percentage are those?) .
thoughts? I was learning towards #2 ... with the option to add #1 later as/if we see it's useful.
In reply to: 336233302 [](ancestors = 336233302)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For option 2, are we going to use the device id to find corresponding GPU?
I'm fine with adding option 1 later, since the GPU usage is not high.
* Rearrange content order * Reorganize page * Update telemetry text * Update C_API.md * Update Privacy.md
This PR adds a framework and an implementation for platform telemetry.
The basic idea is to add the concept of platform nuetral telemetry (a pure virtual class) and add the first PAL implementation of a windows class.
more platforms will come later.
This PR also adds docs and a privacy statement on how telemetry will work. namely: