[BUG] Imprecise duration logged by MasterService #7849

Closed
sandeshkr419 opened this issue May 31, 2023 · 13 comments · Fixed by #7902

Labels: bug (Something isn't working), distributed framework

@sandeshkr419 (Contributor) commented May 31, 2023

Describe the bug
Execution time computations in MasterService.java report 0s when the computed time is less than 200ms. This is because the time computations use threadPool.relativeTimeInMillis(), a cached time that is only useful when second-level precision suffices, whereas these cluster state computations and publications are of the order of milliseconds.
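
For illustration, a minimal self-contained sketch (not the actual ThreadPool code) of a clock refreshed every 200ms, in the spirit of the cached time: any operation shorter than the refresh interval usually measures as 0.

import java.util.concurrent.atomic.AtomicLong;

public class CachedClockDemo {
    // Cached "now", refreshed every 200ms by a background thread.
    private static final AtomicLong CACHED_MILLIS = new AtomicLong(System.nanoTime() / 1_000_000);

    public static void main(String[] args) throws InterruptedException {
        Thread refresher = new Thread(() -> {
            while (true) {
                CACHED_MILLIS.set(System.nanoTime() / 1_000_000);
                try {
                    Thread.sleep(200);
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        refresher.setDaemon(true);
        refresher.start();

        long start = CACHED_MILLIS.get();
        Thread.sleep(50); // a "fast" cluster state update (~50ms)
        long took = CACHED_MILLIS.get() - start;
        // Usually prints "took 0ms": both reads land on the same cached tick.
        System.out.println("took " + took + "ms");
    }
}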

To Reproduce
Steps to reproduce the behavior:

  1. Enable debug logs for MasterService.
  2. Trigger any operation which recomputes the ClusterState and publishes it.

Log lines like the following then appear:
[2023-05-31T11:38:34,142][DEBUG][o.o.c.s.MasterService    ] [master1] took [0s] to compute cluster state update for [index-aliases]
[2023-05-31T11:38:34,307][DEBUG][o.o.c.s.MasterService    ] [master1] took [0s] to notify listeners on successful publication of cluster state (version: 55109, uuid: Fc58ns6mRHCNuxcvADvfOg) for [index-aliases]
[2023-05-31T11:38:32,416][DEBUG][o.o.c.s.MasterService    ] [master1] took [0s] to compute cluster state update for [cluster_reroute(reroute after starting shards)]
[2023-05-31T11:38:32,416][DEBUG][o.o.c.s.MasterService    ] [master1] took [0s] to notify listeners on unchanged cluster state for [cluster_reroute(reroute after starting shards)]

The above times are not useful, since most of these operations execute in less than 200ms. A more granular time calculation would help fine-tune performance and offer more insight when debug logs are enabled.

OpenSearch Data Configuration:

  • 45k aliases, 5 indices (multiple indices to stress cluster state processing).
  • For some operations the actual cluster state time was logged when it was more than 200ms.

Suggested Solution
Use System.currentTimeMillis() instead of the cached time from ThreadPool. There may be other classes where this change is useful; we can start here and change them incrementally as we encounter them, since enabling debug logs and checking the results of every API would not be feasible.
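
A minimal sketch of the suggested direction (illustrative names, not the actual MasterService code; later comments on this thread weigh System.currentTimeMillis() against System.nanoTime()):

public final class DurationLoggingSketch {
    public static void main(String[] args) throws InterruptedException {
        long startMillis = System.currentTimeMillis(); // was: threadPool.relativeTimeInMillis()
        simulateClusterStateUpdate();
        long tookMillis = System.currentTimeMillis() - startMillis;
        // Mirrors the MasterService debug line, now with millisecond precision.
        System.out.printf("took [%dms] to compute cluster state update%n", tookMillis);
    }

    private static void simulateClusterStateUpdate() throws InterruptedException {
        Thread.sleep(50); // stand-in for computing a new ClusterState
    }
}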

Plugins
None enabled, not required.

Host/Environment (please complete the following information):

  • OS: MacOS (M1 Pro processor)
  • Version: 3.0 / main branch
  • Local Cluster with 1 master, 1 data node

Additional context
I was trying to optimize ClusterState creation activity (ref: #7002) when I encountered this.

sandeshkr419 added the bug (Something isn't working) and untriaged labels on May 31, 2023
andrross changed the title from "[BUG]" to "[BUG] Imprecise duration logged by MasterService" on May 31, 2023
@andrross (Member) commented May 31, 2023

actual cluster state time was logged when it was more than 200ms

I think that's not quite right. The time value will only have a precision of the thread_pool.estimated_time_interval value, which defaults to 200ms, right? I think this is by design to avoid the overhead of the call to System.nanoTime(). This optimization is important when computing durations in hot path code, such as search or index operations. The question here is whether it is appropriate to use the cached time here or just always use the system clock. @sandeshkr419 Can you make the case that using the cached time here is not necessary?

@andrross (Member)

See issue #2000 and the linked PR for a similar problem caused by cached time; the linked PR includes some performance data. There is also a microbenchmark for measuring the performance of System.nanoTime and System.currentTimeMillis.
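
For context, a minimal JMH harness in that spirit (an illustrative sketch, not the microbenchmark that lives in the repository) could look like:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

// Measures the average cost of a single call to each clock source.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ClockBenchmark {
    @Benchmark
    public long nanoTime() {
        return System.nanoTime();
    }

    @Benchmark
    public long currentTimeMillis() {
        return System.currentTimeMillis();
    }
}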

@sandeshkr419 (Contributor, Author) commented Jun 1, 2023

Thanks @andrross for sharing this issue. Nice discussion there around the different ways of capturing time.
It seems System.currentTimeMillis can give erroneous results, so we will have to think over whether we can leverage System.nanoTime without any performance regressions, since the existing debug logging for this particular scenario is not at all informative in its present state.

@andrross (Member) commented Jun 1, 2023

we will have to think over whether we can leverage System.nanoTime without any performance regressions

System.nanoTime is not dirt cheap, but it does only take on the order of tens of nanoseconds. For things that happen many thousands of times per second or more in hot path code, this can add up. For things that happen less frequently, ~30 nanoseconds is negligible. My intuition is that things like updating cluster state and notifying listeners are probably not super hot path code, and we can probably calculate an accurate duration by using the real hardware clock instead of the cached time.
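
As a back-of-the-envelope check using the ~30ns figure: timing 100,000 operations per second with two clock reads each (start and end) costs about 6ms of CPU per second, roughly 0.6% of one core, while a handful of cluster state updates per second would cost well under a microsecond in total.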

@shwetathareja Do you have any thoughts here?

@sandeshkr419 (Contributor, Author)

@andrross To add to your argument: cluster state updates and the associated listeners are not too frequent. I am not expecting many of these calls within a second, and an error on the order of 2-5ms would be acceptable here, which is far more than the expected inaccuracy.

@shwetathareja (Member)

@andrross / @sandeshkr419 I agree that during cluster state processing we need more accurate durations, using the system clock and not cached time. Cluster state changes are not frequent, and we need a more accurate breakdown of which listener is taking how much time, or whether computing the new state itself is taking longer. This would help identify bottlenecks in the cluster state update code path.
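
For illustration, a minimal runnable sketch of the kind of per-listener breakdown described here (plain Runnable stand-ins, not the actual cluster state listener interfaces):

import java.util.List;
import java.util.concurrent.TimeUnit;

public final class ListenerTimingSketch {
    public static void main(String[] args) {
        List<Runnable> listeners = List.of(
            () -> sleepQuietly(3),   // a fast listener
            () -> sleepQuietly(40)   // a slow listener worth surfacing in logs
        );
        for (int i = 0; i < listeners.size(); i++) {
            long startNanos = System.nanoTime();
            listeners.get(i).run();
            long tookMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
            // Per-listener duration with millisecond precision.
            System.out.printf("listener [%d] took [%dms]%n", i, tookMillis);
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}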

@Bukhtawar (Collaborator)

This would help identify bottlenecks in the cluster state update code path.

For identifying bottlenecks I would prefer JMH benchmarks or profiling over debug logs, if +/- 200ms changes the course of action.
My concern is that if we start following this model, it will get hard to ensure other parts of the code strike a fine balance between when to use cached time and when not to.

@shwetathareja (Member) commented Jun 7, 2023

@Bukhtawar I think we can have guidance on when to use the system clock vs. cached time. JMH benchmarks are good, but we don't know all the scenarios upfront. And having these logs helps debugging when an issue appears in production and cluster state processing and applying suddenly start taking really long. These logs were added with the same intention in the first place.

@andrross (Member) commented Jun 7, 2023

I think we can have guidance on when to use the system clock vs. cached time.

@shwetathareja @Bukhtawar I tend to agree. Also, I think the cat is already out of the bag here in that we have a mix of cached time vs system clock:

% git grep 'System.nanoTime' server/src/main/ | wc -l
     161

% git grep 'relativeTimeInMillis' server/src/main | wc -l
      64

@jainankitk (Collaborator)

@Bukhtawar / @shwetathareja / @andrross - I am wondering if there is any downside to reducing thread_pool.estimated_time_interval to a lower value like 5ms or even 1ms and using the cached time everywhere. If there is a single thread doing this every ms (ms precision should suffice for all our use cases), would this cause any performance impact?

That would also avoid unnecessary invocations of System.nanoTime like in the PR linked above.
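
For reference, the refresh interval in question is the node setting named above, so the experiment could be as simple as lowering it in opensearch.yml (1ms here is the hypothetical value from the question, not a recommendation):

thread_pool.estimated_time_interval: 1ms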

@andrross (Member)

@jainankitk I honestly don't know. See this comment for a test showing that calling System.nanoTime once per millisecond in a loop took about 6% of one CPU. Do you suspect that this commit is causing any sort of performance impact?

@sandeshkr419 (Contributor, Author)

@jainankitk I have a doubt regarding the cached time. Does setting the cached time interval to 1ms imply that the cached clock refreshes (i.e. invokes the system clock) every 1ms? If yes, can a cached clock with a 1ms refresh interval itself cause a regression? The code in scope of this issue is not aggressive in invocation frequency, so if there are other places that require a lower cached time interval we can modify this code to utilize the new 1ms cached time as well, but introducing cached time for this scope alone does not seem reasonable.

@shwetathareja (Member)

If there is a single thread doing this every ms (ms precision should suffice for all our use cases), would this cause any performance impact?

@jainankitk This would be one thread per threadpool, right? We can benchmark to see the impact.
