
Revise timing metrics collection to avoid unaccounted-for time #254

Merged: robertknight merged 1 commit into main from revise-timing-metrics on Jun 29, 2024

Conversation

robertknight (Owner) commented on Jun 29, 2024

In the GPT-2 medium model, about 6% of the runtime for each step was reported as "[Other]", which is the difference between the total time for the run and the sum of individual operator timings. Upon analysis, this difference turned out to be a combination of:

  1. The overhead of getting the time with `Instant::now`
  2. Rounding and accumulation errors due to storing timings in various resolutions and using single-precision float operations to sum them (illustrated in the sketch after this list)
  3. Setup time spent in `Graph::run_plan` before the start of the main loop
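
As a rough illustration of point 2 (a hypothetical example, not code from this repository), the sketch below sums a large number of identical short operator timings once as single-precision milliseconds and once as `Duration` values. The float total picks up rounding error, while the integer-nanosecond total stays exact:

```rust
use std::time::Duration;

fn main() {
    // A single (hypothetical) operator timing of ~12.3µs.
    let op_time = Duration::from_nanos(12_345);
    let calls = 1_000_000u32;

    // Accumulate in single-precision milliseconds, in the spirit of the old
    // approach described above: rounding error creeps into the total.
    let mut total_ms_f32 = 0.0f32;
    for _ in 0..calls {
        total_ms_f32 += op_time.as_secs_f32() * 1000.0;
    }

    // Accumulate as `Duration` (integer nanoseconds): the total is exact.
    let total_exact: Duration = (0..calls).map(|_| op_time).sum();

    println!("f32 sum:      {:.3} ms", total_ms_f32);
    println!("Duration sum: {:.3} ms", total_exact.as_secs_f64() * 1000.0);
}
```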

The `Instant::now` overhead on macOS (Intel) is roughly 30% `rdtsc` plus fences and 70% call overhead, according to samply.
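
For context, one quick way to get a ballpark per-call figure for this overhead on a given machine (a hypothetical micro-benchmark, not part of this change) is to time a tight loop of clock reads:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    // Time a tight loop of clock reads and divide by the iteration count to
    // get a rough per-call cost for `Instant::now` on this machine.
    let iters = 1_000_000u32;
    let start = Instant::now();
    for _ in 0..iters {
        black_box(Instant::now());
    }
    println!("approx cost per Instant::now call: {:?}", start.elapsed() / iters);
}
```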

This commit addresses these as follows:

  • The number of Instant::now calls has been reduced from 2 to 1 per step, with each call capturing the total time for the whole step, not just the call to `Operator::run`. This means that output processing for a step is now accounted for in the time for that step (see the sketch below).
  • Operator timings are now represented and summed as `Duration` values. These are integer counts of nanoseconds, avoiding issues with rounding and float accumulation.
  • The run timing is now reported as the sum of operator timings, excluding setup before the main loop.

The `Timer` utility type was removed, as it is no longer used following these changes.
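
A minimal sketch of the resulting measurement pattern, using hypothetical stand-ins (`Step`, `run_operator`, `process_outputs`) rather than the real graph types, might look like this:

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-ins for the real graph and operator types; only the
// timing pattern described above is illustrated here.
struct Step;

fn run_operator(_step: &Step) { /* Operator::run would be called here */ }
fn process_outputs(_step: &Step) { /* per-step output processing */ }

fn run_plan(steps: &[Step]) -> Duration {
    let mut op_timings: Vec<Duration> = Vec::with_capacity(steps.len());

    // One `Instant::now` call per step: the timestamp taken at the end of one
    // step doubles as the start of the next, so output processing is charged
    // to the step that produced the outputs.
    let mut last = Instant::now();
    for step in steps {
        run_operator(step);
        process_outputs(step);
        let now = Instant::now();
        op_timings.push(now - last);
        last = now;
    }

    // The reported run time is the sum of the per-step `Duration`s (integer
    // nanosecond counts), excluding any setup that happened before this loop.
    op_timings.iter().sum()
}

fn main() {
    let steps = vec![Step, Step, Step];
    println!("total: {:?}", run_plan(&steps));
}
```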

robertknight force-pushed the revise-timing-metrics branch from 4cb16a1 to c3cd3d8 on June 29, 2024 at 06:38
robertknight merged commit d7a79f6 into main on Jun 29, 2024
2 checks passed
robertknight deleted the revise-timing-metrics branch on June 29, 2024 at 06:45