Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding platform telemetry #2109

Merged
merged 19 commits into from
Oct 20, 2019
Merged

Adding platform telemetry #2109

merged 19 commits into from
Oct 20, 2019

Conversation

walrusmcd
Copy link
Contributor

@walrusmcd walrusmcd commented Oct 12, 2019

This PR adds a framework and an implementation for platform telemetry.

The basic idea is to add the concept of platform nuetral telemetry (a pure virtual class) and add the first PAL implementation of a windows class.

more platforms will come later.

This PR also adds docs and a privacy statement on how telemetry will work. namely:

  • clone and build from source will never tag telemetry events such that windows will listen to them
  • official builds will tag telemetry events such that windows can listen to them. this will follow the standard CEIP program and optin/optout instructions from the user and/or enterprise.
  • official build telemetry will be tagged ON by default.
  • we added a global on/off so that an app can choose to turn off all telemetry events for their app, regardless of if the desktop machine is enrolled in windows platform CEIP.

@walrusmcd walrusmcd requested a review from a team as a code owner October 12, 2019 00:10
@pranavsharma
Copy link
Contributor

Some preliminary comments -

  1. Please follow ORT style guide. No camel case naming :-)
  2. I guess the session option to turn this off is in progress?
  3. I didn't see a place where this session option would be used. The macro that calls LogRuntimeError would've no knowledge about this option, right?

@walrusmcd
Copy link
Contributor Author

  1. awesome, i think i have all the style guide fixes, no more camel casing :)
  2. i added the enable/disable. since some of these events span session state, i set it at the global Env . an app dev can enable and disable them globally, not per session.
  3. added the setting into the telemetry for enable/disable. with the toggle. and set it to false for the first PR . we will finish the end to end testing (all the way to the cloud) and then enable them by default. we also have to configure the build pipeline in order to run things on in the official builds (since clone and build from source will never send windows telemetry )

In reply to: 541264286 [](ancestors = 541264286)

README.md Outdated
@@ -204,3 +206,8 @@ or contact [[email protected]](mailto:[email protected]) with any addi
***
# License
[MIT License](LICENSE)

# Data/Telemetry
Copy link
Contributor

@pranavsharma pranavsharma Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We must call out that it is disabled by default. #ByDesign

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My idea was to have the DOCS clear that telemetry will be on by default. then make the code obey that as and when we are ready . What I want to avoid is have the docs say that they are off by default, and then when we later turn the code on it's not super clear that you need to go an read the docs again (aka breaking change). in this manner, the docs and "design" are all such that it is clearly messaged telemetry is ON by default. it's a "bug" that we don't have the code behaving per spec yet. In that fashion we can "fix the code" later and all be good. It isnt a breaking change . you like ?


In reply to: 335297148 [](ancestors = 335297148)

Copy link
Contributor

@pranavsharma pranavsharma Oct 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we decided to leave the session option for this to be disabled by default in 1.0 release. The session option config should match the docs, right? Or I'm missing something? #Resolved

Copy link
Contributor

@faxu faxu Oct 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs should reflect the current state. To start, it will be OFF by default. When it's out of preview and ready to fully turn on, we will update the documentation to reflect that state.

For the specific documentation, I suggest:

  • main readme page mentions that telemetry collection is implemented in this project and to refer to the detailed data collection page. We don't need to change this documentation when default state changes.
  • data collection page includes privacy policy as well as the documentation on how to enable/disable
  • C API page just links to data collection page. We can still have a section heading, but it will contain nothing but the link to the data collection page. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change the docs to denote it is OFF by default during BETA, and to expect it to ON by default once it graduates from BETA (for transparency)


In reply to: 335777476 [](ancestors = 335777476)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change like you suggested, just a heads up that now we have C API specific docs embedded in the privacy statement. If we ever have projections on these (like C# or WinRT) we might have to figure out where to put those docs. I had the turn/on and off in the C_API docs since it was specific to the C_API.

Let me know what you think with the new doc layout !


In reply to: 335829030 [](ancestors = 335829030)

docs/C_API.md Outdated
# Telemetry
This project collects usage data and sends it to Microsoft to help improve our products and services. Note however that no data collection is performed by default when using your private builds

Telemetry is turned on by default when using the Official Builds. This is implemented via 'Platform Telemetry' per vendor platform providers (see telemetry.h).
Copy link
Contributor

@pranavsharma pranavsharma Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turned "off" by default. Also, is it better to write the documentation in one place and link to it from everywhere else? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check out my comments on the other doc on why i like have the "spec" be ON by default, and the code will catch up later when we are ready (bug level fixes) . as to the docs spread around a bit, i went back and forth on it :) . pros/cons.

  1. other OSS components have multiple docs. having things like a "privacy statment" are nice.
  2. i like the main README being more of a TOC and link off. that way the root/readme doesn't just bloat and bloat with content.
  3. i like the idea of drill down for details. the top level docs have basic ideas. and you get to go down deeper (like the c_api doc) to learn how to use the actual C api in this fashion.

what ya think?


In reply to: 335297643 [](ancestors = 335297643)

Copy link
Contributor

@pranavsharma pranavsharma Oct 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok as long as we don't end up with inconsistent documentation. @faxu can probably review this and give some comments. #Resolved

Copy link
Contributor

@faxu faxu Oct 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as mentioned above, the spec/documentation should reflect the current state for accuracy. that said, the top-level page (main readme) can be general enough to not require updating once default state changes. The intent of that page is to inform about the presence of telemetry-collecting code. All further details on default states can live on the Privacy info page.

The info on how to turn on/off should also be on that Privacy info page, I think. #Resolved

include/onnxruntime/core/common/common.h Outdated Show resolved Hide resolved
@@ -266,8 +266,14 @@ class PosixEnv : public Env {
return filename;
}

// \brief returns a provider that will handle telemetry on the current platform
const Telemetry& GetTelemetryProvider() const override {
return telemetryProvider_;
Copy link
Contributor

@pranavsharma pranavsharma Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

telemetry_provider_ #Resolved

@@ -211,6 +212,11 @@ class WindowsEnv : public Env {
ORT_NOT_IMPLEMENTED(__FUNCTION__, " is not implemented");
}

// \brief returns a provider that will handle telemetry on the current platform
const Telemetry& GetTelemetryProvider() const override {
return telemetryProvider_;
Copy link
Contributor

@pranavsharma pranavsharma Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

telemetry_provider_ #Resolved

const std::unordered_map<std::string, std::string>& model_metadata,
const std::string& loadedFrom, const std::vector<std::string>& execution_provider_ids) const;

virtual void LogRuntimeError(uint32_t sessionId, const common::Status& status, const char* file,
Copy link
Contributor

@pranavsharma pranavsharma Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

session_id #Resolved

@@ -95,6 +95,8 @@ inline std::basic_string<T> GetCurrentTimeString() {

} // namespace

std::atomic<uint32_t> InferenceSession::global_session_id_{1};
Copy link
Contributor

@pranavsharma pranavsharma Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could 2 different users report the same session id in the events? How do we differentiate between them? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep. by 2 users we could have:

  • 2 apps on the same user desktop machine login instance
  • 2 apps running on different machines

the telemetry events have an "app session" global that is kept track of by the telemetry engines. it lets us know which machine/app combo is reporting in. so session_id is really the "ort session" differentiator inside a specific app instance (aka process ID).


In reply to: 335301237 [](ancestors = 335301237)

Copy link
Contributor

@pranavsharma pranavsharma Oct 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have 2 different sessions within the same app instance? If yes, can we still differentiate? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory a long lived process could have global_session_id_ wrap around such that every 4B sessions would restart the numbering, and would maintain the same session id. We actually have a state machine that lives in the cloud @ (xiangting) is helping us to lock down. This is how we keep track of state and we can use to answer if the wrap around is ok, and also whether or not we need/want a "SessionEnd" event. I'll file an open issue for this.


In reply to: 335779060 [](ancestors = 335779060)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure I fully answers the quesiton :) is that what you meant? Wrap around to 4 billion? people can totally have 2 OrtSession's in the same app instance, and then we would differentiate using that uint32, right? (assuming we are ok with wrap around). another easy fix here is to just make it int64.


In reply to: 336570778 [](ancestors = 336570778,335779060)

onnxruntime/core/session/inference_session.h Show resolved Hide resolved
} while (0)

#define ORT_RETURN_IF_ERROR_SESSIONID_(expr) ORT_RETURN_IF_ERROR_SESSIONID(expr, session_id_)
#define ORT_RETURN_IF_ERROR(expr) ORT_RETURN_IF_ERROR_SESSIONID(expr, 0)

#define ORT_THROW_IF_ERROR(expr) \
Copy link
Contributor

@pranavsharma pranavsharma Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we plan to send events for cases when we just do ORT_THROW? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. I first coded ORT_RETURN_IF and ORT_THROW_IF . it covers most all code paths, but not all. how about i file a tracking item in a second PR to catch all of the macros?


In reply to: 335302117 [](ancestors = 335302117)

Copy link
Contributor

@pranavsharma pranavsharma Oct 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that's fine. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added as a work item


In reply to: 335777787 [](ancestors = 335777787)


void LogProcessInfo() const override;

void LogSessionCreation(uint32_t session_id, int64_t ir_version, const std::string& model_producer_name,
Copy link
Contributor

@pranavsharma pranavsharma Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to use a struct to pass so many parameters to the function? This way if we add one more thing to the event only the struct needs to change, not the function declarations and definitions across several files. #Pending

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like it. want me to do that in this PR, or would as second change be ok ? another thing to keep in mind is that changing this is a big deal. it's a "schema change" . so this is not somethin we do all the time, or need to necessarily optimize for. in our experience we change it maybe once a year. as an example winML is coming on 4 releases now (2 years) and we haven't change the schema yet.


In reply to: 335472938 [](ancestors = 335472938)

Copy link
Contributor

@pranavsharma pranavsharma Oct 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's going to take us some iterations before we stabilize and it would be nice to avoid the pain of changing across multiple files in the meanwhile. If you're running out of time, it's ok to do this in a follow up PR. #Pending

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i added it as a work item. thx


In reply to: 335778701 [](ancestors = 335778701)

@pranavsharma pranavsharma requested a review from faxu October 17, 2019 01:51
Copy link
Contributor

@faxu faxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to add a link to the telemetry info on the Nuget and PyPi package descriptions. I suggest creating an aka.ms shortlink and using that in the description. That shortlink should route to the privacy policy page in this repo.

docs/C_API.md Outdated
# Telemetry
This project collects usage data and sends it to Microsoft to help improve our products and services. Note however that no data collection is performed by default when using your private builds

Telemetry is turned on by default when using the Official Builds. This is implemented via 'Platform Telemetry' per vendor platform providers (see telemetry.h).
Copy link
Contributor

@faxu faxu Oct 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as mentioned above, the spec/documentation should reflect the current state for accuracy. that said, the top-level page (main readme) can be general enough to not require updating once default state changes. The intent of that page is to inform about the presence of telemetry-collecting code. All further details on default states can live on the Privacy info page.

The info on how to turn on/off should also be on that Privacy info page, I think. #Resolved

README.md Outdated
@@ -174,3 +176,8 @@ or contact [[email protected]](mailto:[email protected]) with any addi
***
# License
[MIT License](LICENSE)

# Data/Telemetry
Copy link
Contributor

@faxu faxu Oct 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's put this above the License #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved it above license.


In reply to: 335830799 [](ancestors = 335830799)

TraceLoggingString(model_metadata_string.c_str(), "modelMetaData"),
TraceLoggingString(loadedFrom.c_str(), "loadedFrom"),
TraceLoggingString(execution_provider_string.c_str(), "executionProviderIds"));
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we going to get the adapter Luid? So we can get GPU information.
We used to log it in SessionCreation telemetry.

Copy link
Contributor Author

@walrusmcd walrusmcd Oct 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now we wont have it. We have options :
1- add an event to the EP's to output what device was selected.
2- do nothing, this event logs which EP was chosen - so we know when GPU was chosen. and we can use existing telemetry to know which GPU was on the machine. it will have a gap if the machine has 2 GPU's (what percentage are those?) .
thoughts? I was learning towards #2 ... with the option to add #1 later as/if we see it's useful.


In reply to: 336233302 [](ancestors = 336233302)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For option 2, are we going to use the device id to find corresponding GPU?
I'm fine with adding option 1 later, since the GPU usage is not high.

walrusmcd and others added 5 commits October 18, 2019 09:41
* Rearrange content order

* Reorganize page

* Update telemetry text

* Update C_API.md

* Update Privacy.md
@faxu faxu added the pending label Oct 18, 2019
@faxu faxu added this to the 1.0 milestone Oct 18, 2019
@faxu faxu removed the pending label Oct 18, 2019
@walrusmcd walrusmcd changed the title WIP: work in progress - adding platform telemetry Adding platform telemetry Oct 18, 2019
@snnn snnn merged commit d1159b7 into master Oct 20, 2019
@snnn snnn deleted the paulm/telemetry branch October 20, 2019 01:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants