
feat: add flag enforce_max_duration #798

Merged
merged 17 commits into from
Mar 4, 2024
Conversation

anhappdev
Collaborator

@anhappdev anhappdev commented Oct 6, 2023

The logic is now:

  • if (result_min_duration_met && result_min_queries_met && early_stopping_met) => blue text
  • else if (result_min_duration_met && early_stopping_met) => purple text
  • else red text
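As a sketch, the branching above could look like this (C++ for illustration; the enum and function names are hypothetical, not the app's actual Dart code):

```cpp
// Sketch of the color rule described above. Names are illustrative only;
// the real implementation lives in the Flutter app.
enum class ResultColor { Blue, Purple, Red };

ResultColor resultColor(bool minDurationMet, bool minQueriesMet,
                        bool earlyStoppingMet) {
  if (minDurationMet && minQueriesMet && earlyStoppingMet)
    return ResultColor::Blue;    // fully valid result
  if (minDurationMet && earlyStoppingMet)
    return ResultColor::Purple;  // valid, but min queries skipped
  return ResultColor::Red;       // invalid
}
```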

The result screen will look like this:

@github-actions

github-actions bot commented Oct 6, 2023

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@sonarqubecloud

sonarqubecloud bot commented Oct 6, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (A)
Vulnerabilities: 0 (A)
Security Hotspots: 0 (A)
Code Smells: 0 (A)

Coverage: 0.0%
Duplication: 0.0%

@freedomtan
Contributor

freedomtan commented Oct 31, 2023

Let's test it: @freedomtan @AhmedTElthakeb @mohitmundhragithub

@anhappdev anhappdev marked this pull request as ready for review November 11, 2023 13:19
@anhappdev anhappdev requested a review from a team as a code owner November 11, 2023 13:19
@freedomtan
Contributor

I still saw invalid color for > 600 seconds. Is this expected?

@anhappdev
Collaborator Author

I still saw invalid color for > 600 seconds. Is this expected?

No, as per @pgmpablo157321

When you set enforce_max_duration = False, it won't fail when the maximum duration is reached.

@anhappdev
Collaborator Author

@pgmpablo157321 Can you check if the result is expected? It looks like the flag enforce_max_duration has no effect.

cd "/Users/anh/dev/mlcommons/mobile_app_open" && \
	bazel-bin/flutter/cpp/binary/main EXTERNAL super_resolution \
		--mode=PerformanceOnly \
		--output_dir="/Users/anh/dev/mlcommons/mobile_app_open/output" \
		--model_file="/Users/anh/dev/mlcommons/mobile_app_open/mobile_back_apple/dev-resources/edsr_final/converted/edsr_f32b5_fp32.tflite" \
		--lib_path="bazel-bin/mobile_back_tflite/cpp/backend_tflite/libtflitebackend.so" \
		--images_directory="/Users/anh/dev/mlcommons/mobile_app_open/mobile_back_apple/dev-resources/psnr/LR" \
		--ground_truth_directory="/Users/anh/dev/mlcommons/mobile_app_open/mobile_back_apple/dev-resources/psnr/HR" \
		--max_duration_ms=10000 \
		--min_duration_ms=100
================================================
MLPerf Results Summary
================================================
SUT name : TFLite
Scenario : SingleStream
Mode     : PerformanceOnly
90th percentile latency (ns) : 694327333
Result is : INVALID
  Min duration satisfied : Yes
  Min queries satisfied : Skipped
  Early stopping satisfied: NO
Recommendations:
 * The test exited early, before enough queries were issued.
   See the detailed log for why this may have occurred.
Early Stopping Result:
 * Only processed 17 queries.
 * Need to process at least 64 queries for early stopping.

================================================
Additional Stats
================================================
QPS w/ loadgen overhead         : 1.52
QPS w/o loadgen overhead        : 1.53

Min latency (ns)                : 584926042
Max latency (ns)                : 699675166
Mean latency (ns)               : 653122000
50.00 percentile latency (ns)   : 679391667
90.00 percentile latency (ns)   : 694327333
95.00 percentile latency (ns)   : 699675166
97.00 percentile latency (ns)   : 699675166
99.00 percentile latency (ns)   : 699675166
99.90 percentile latency (ns)   : 699675166

================================================
Test Parameters Used
================================================
samples_per_query : 1
target_qps : 1000
target_latency (ns): 0
max_async_queries : 1
min_duration (ms): 100
max_duration (ms): 10000
min_query_count : 100
max_query_count : 0
qsl_rng_seed : 148687905518835231
sample_index_rng_seed : 520418551913322573
schedule_rng_seed : 811580660758947900
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 80

No warnings encountered during test.

No errors encountered during test.


@freedomtan
Contributor

@pgmpablo157321 is it possible to add something like total running time/duration to the summary generated by the loadgen (so that we can check it's valid or not in the summary page)?

@AhmedTElthakeb
Contributor

Tested on an unsupported device; the run ends at 10 mins, but the result is still rendered as INVALID.

@freedomtan
Contributor

@freedomtan and @anhappdev to contact @pgmpablo157321 by e-mail

@pgmpablo157321

@anhappdev Sorry for the late reply, I don't usually check the issues in this repo

@pgmpablo157321 Can you check if the result is expected:
It looks like the flag enforce_max_duration has no effect.

I don't think that result is invalid because of the max duration. It seems it failed because of the early stopping requirements; that may sound related, but I don't think they are. Early stopping is a feature that was introduced a while ago; basically, it is a method to check that target_latency was reached, and looking at the log, it seems it was not. Could this be the problem? Are you able to test with a larger target_latency?

@freedomtan

@pgmpablo157321 is it possible to add something like total running time/duration to the summary generated by the loadgen (so that we can check it's valid or not in the summary page)?

Yes, we can report this value in the summary as well.

@freedomtan
Contributor

looks like it's not the target_latency.
@freedomtan to compile a debug version (-g) and use remote gdb to check the real reason why we got "invalid"

@freedomtan
Contributor

@pgmpablo157321 Can you check if the result is expected? It looks like the flag enforce_max_duration has no effect.

(command and loadgen summary quoted above)
I read the log carefully and did some tests. It turns out enforce_max_duration works, just not as we expected.
Before the enforce_max_duration flag, three conditions (min duration, min queries, and early stopping) needed to be satisfied to get a VALID result. With enforce_max_duration, the min queries check is skipped, but min duration and early stopping still need to be satisfied. Here, as shown in the log, 64 or more queries are needed for early stopping.

https://github.com/mlcommons/inference_policies/blob/master/inference_rules.adoc#appendix-early_stopping

Result is : INVALID
  Min duration satisfied : Yes
  Min queries satisfied : Skipped
  Early stopping satisfied: NO
Recommendations:
 * The test exited early, before enough queries were issued.
   See the detailed log for why this may have occurred.
Early Stopping Result:
 * Only processed 17 queries.
 * Need to process at least 64 queries for early stopping.
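In other words, the rule described above could be sketched like this (illustrative only, not loadgen's actual code; the function and parameter names are hypothetical):

```cpp
// Validity rule as described in this comment: when the min-query check is
// skipped, min duration and early stopping must still hold for a VALID result.
bool resultIsValid(bool minDurationMet, bool minQueriesMet,
                   bool earlyStoppingMet, bool skipMinQueries) {
  bool queriesOk = skipMinQueries || minQueriesMet;
  return minDurationMet && queriesOk && earlyStoppingMet;
}
```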

@pgmpablo157321

@freedomtan So is this behaviour correct? or what conditions do you expect to pass to have a VALID result?

@freedomtan
Contributor

@freedomtan So is this behaviour correct? or what conditions do you expect to pass to have a VALID result?

@pgmpablo157321 Personally, I think yes. We'll discuss it to see if the early stopping requirement is what we want.

@Mostelk and @mohitmundhragithub: what's your opinion?

@Mostelk

Mostelk commented Dec 7, 2023

@freedomtan So is this behaviour correct? or what conditions do you expect to pass to have a VALID result?

@pgmpablo157321 Personally, I think yes. We'll discuss it to see if the early stopping requirement is what we want.

@Mostelk and @mohitmundhragithub: what's your opinion?

The goal is to test the functionality of enforce_max_duration, so let us increase the duration to 45 minutes to satisfy the 64-query minimum for early stopping and see if we get a VALID result.

We can also discuss what a reasonable minimum query count for early stopping is in the policy meeting, rather than removing this condition.

@freedomtan
Contributor

@freedomtan So is this behaviour correct? or what conditions do you expect to pass to have a VALID result?

@pgmpablo157321 Personally, I think yes. We'll discuss it to see if the early stopping requirement is what we want.
@Mostelk and @mohitmundhragithub: what's your opinion?

The goal is to test the functionality of enforce_max_duration, so let us increase the duration to 45 minutes to satisfy the 64-query minimum for early stopping and see if we get a VALID result.

We can also discuss what a reasonable minimum query count for early stopping is in the policy meeting, rather than removing this condition.

Yes, I tested it before. If 64 queries are allowed by having a large enough max duration, then we'll get a VALID result.
Let's try to make all the benchmark items run at least 64 queries.

@pgmpablo157321: could you please merge the branch into the inference repo's main branch?

@anhappdev please rebase after that.

@freedomtan
Contributor

@pgmpablo157321: ping

@freedomtan
Contributor

@freedomtan to send email to check with @pgmpablo157321

@freedomtan freedomtan mentioned this pull request Jan 23, 2024
5 tasks
@freedomtan
Contributor

For @pgmpablo157321's comment about having a mobile-specific branch for loadgen:

  1. maybe we won't have such requirement(s) in the near future,
  2. we can maintain some local patches, as we do for TensorFlow, and upstream them later.

@freedomtan
Contributor

freedomtan commented Feb 14, 2024

summary("  Min queries satisfied : ", min_queries_met ? "Yes" : settings.enforce_max_duration? "NO" : "Skipped");

This one is problematic too. It says:

if (min_queries_met) {
  return "Yes";
} else {
  if (settings.enforce_max_duration) {
    return "NO";
  } else {
    return "Skipped";
  }
}

which means settings.enforce_max_duration must be false if we want to skip the min_queries_met check

So an easy fix is to rename enforce_max_duration to dont_skip_min_queries_if_max_duration_met, but that's a bit confusing :-)

@freedomtan
Contributor

@anhappdev I updated enforce_max_duration logic in another branch 7009512. The main fix is in inference's mobile_update branch, mlcommons/inference@ab284da

With that, we can have what @Mostelk proposed.

@Mostelk

Mostelk commented Feb 14, 2024

@anhappdev I updated enforce_max_duration logic in another branch 7009512. The main fix is in inference's mobile_update branch, mlcommons/inference@ab284da

With that, we can have what @Mostelk proposed.

Based on this fix, we should make enforce_max_duration true by default and not configurable.

@anhappdev
Collaborator Author

Based on this fix, we should make enforce_max_duration true by default and not configurable.

I assume this logic will not go into the main inference branch, and we will need a separate branch or a patch for this change.
If we hard-code the enforce_max_duration flag and don't want to change it, maybe it's better to remove the flag and update the code to always skip the min_queries_met condition (with a comment to explain why).
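A sketch of that alternative (hypothetical, not merged code): drop the flag entirely and treat the min-query count as satisfied whenever max_duration was reached:

```cpp
// Hypothetical sketch of the proposal above: no enforce_max_duration flag;
// the min-query requirement is waived once the run hit max_duration.
// Mobile runs are time-bounded, so reaching max_duration alone should not
// invalidate a result for issuing too few queries.
bool minQueriesSatisfiedOrWaived(bool minQueriesMet, bool maxDurationReached) {
  return minQueriesMet || maxDurationReached;
}
```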

@freedomtan
Contributor

Based on this fix, we should make enforce_max_duration true by default and not configurable.

I assume this logic will not go into the main inference branch, and we will need a separate branch or a patch for this change. If we hard-code the enforce_max_duration flag and don't want to change it, maybe it's better to remove the flag and update the code to always skip the min_queries_met condition (with a comment to explain why).

How about changing it back to false in the inference repo and setting it to true in our cpp code (and not making it configurable)? Would this increase the chance of merging back into the master branch of the inference repo?

@anhappdev
Collaborator Author

How about changing it back to false in the inference repo and setting it to true in our cpp code (and not making it configurable)? Would this increase the chance of merging back into the master branch of the inference repo?

I don't know. I think it's more a policy decision than a technical issue.

@freedomtan
Contributor

max_duration: max_duration should be long enough to allow the task to run 64 queries and get VALID results.

as commented in #701, let's see if we can get VALID for all the tasks on some lower-tier devices.
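A back-of-envelope way to pick max_duration, assuming single-stream queries issued back-to-back (64 is the early-stopping minimum quoted in the loadgen output above; the names below are illustrative):

```cpp
#include <cstdint>

// Minimum max_duration needed to fit the early-stopping query minimum,
// assuming queries run back-to-back at a given per-query latency.
constexpr int64_t kMinQueriesForEarlyStopping = 64;

constexpr int64_t minMaxDurationMs(int64_t perQueryLatencyMs) {
  return kMinQueriesForEarlyStopping * perQueryLatencyMs;
}
```

At roughly 700 ms per query (the super_resolution run above), 64 queries need about 45 s, so the 10 s max_duration used there cut the run off well short of 64 queries, while a 600 s (10 min) budget leaves ample headroom.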

@freedomtan
Contributor

I tested the tflite backend on a couple of devices. As we discussed in #701 before, 10 mins should be fine.
The following is from running the tflite backend on a Samsung Galaxy S22+ (Exynos 2200).

================================================
MLPerf Results Summary
================================================
SUT name : TFLite
Scenario : SingleStream
Mode     : PerformanceOnly
90th percentile latency (ns) : 867038750
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Skipped
  Early stopping satisfied: Yes
Early Stopping Result:
 * Processed at least 64 queries (745).
 * Would discard 54 highest latency queries.
 * Early stopping 90th percentile estimate: 867902383
 * Early stopping 99th percentile estimate: 894015430

================================================
Additional Stats
================================================
QPS w/ loadgen overhead         : 1.24
QPS w/o loadgen overhead        : 1.24

Min latency (ns)                : 533973125
Max latency (ns)                : 894015430
Mean latency (ns)               : 807125504
50.00 percentile latency (ns)   : 860323867
90.00 percentile latency (ns)   : 867038750
95.00 percentile latency (ns)   : 869068672
97.00 percentile latency (ns)   : 870320625
99.00 percentile latency (ns)   : 874908985
99.90 percentile latency (ns)   : 894015430

================================================
Test Parameters Used
================================================
samples_per_query : 1
target_qps : 1000
target_latency (ns): 0
max_async_queries : 1
min_duration (ms): 60000
max_duration (ms): 600000
min_query_count : 1024
max_query_count : 0
qsl_rng_seed : 148687905518835231
sample_index_rng_seed : 520418551913322573
schedule_rng_seed : 811580660758947900
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 321

No warnings encountered during test.

No errors encountered during test.

@freedomtan
Contributor

@anhappdev please help make the color of VALID + early-stopping results purple, as in:

Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Skipped
  Early stopping satisfied: Yes

@anhappdev
Collaborator Author

@anhappdev I updated enforce_max_duration logic in another branch 7009512. The main fix is in inference's mobile_update branch, mlcommons/inference@ab284da

The logic is updated for mlperf_log_summary.txt but not for mlperf_log_detail.txt, where we parse the result.
@freedomtan Would you fix this? Or should I do it?

@anhappdev
Collaborator Author

@anhappdev please help to make the color of VALID + early stopped results to be purple as in

The result screen will look like this:

@freedomtan
Contributor

@anhappdev I updated enforce_max_duration logic in another branch 7009512. The main fix is in inference's mobile_update branch, mlcommons/inference@ab284da

The logic is updated for mlperf_log_summary.txt but not for mlperf_log_detail.txt, where we parse the result. @freedomtan Would you fix this? Or should I do it?

yes, please help fix it. I thought changing the log to a warning (instead of an error) was enough :-)

@anhappdev
Collaborator Author

@freedomtan I don't have write access on the inference repo. Please merge this PR
mlcommons/inference#1654


sonarqubecloud bot commented Mar 1, 2024

@anhappdev anhappdev marked this pull request as draft March 1, 2024 08:53
@anhappdev anhappdev marked this pull request as ready for review March 1, 2024 09:17
@anhappdev anhappdev merged commit 5ef666d into master Mar 4, 2024
21 checks passed
@anhappdev anhappdev deleted the anh/max-duration branch March 4, 2024 07:49
@github-actions github-actions bot locked and limited conversation to collaborators Mar 4, 2024