
Host instrumentation for available memory on Linux systems is less accurate than is tested for #425

Open
MrAlias opened this issue Oct 28, 2020 · 2 comments
Labels
area: instrumentation Related to an instrumentation package area: testing Related to package testing bug Something isn't working instrumentation: host

Comments

@MrAlias
Contributor

MrAlias commented Oct 28, 2020

Frequently the TestHostMemory test in the host instrumentation fails because the relative error is too high. E.g.

--- FAIL: TestHostMemory (1.05s)
    host_test.go:180: 
        	Error Trace:	host_test.go:180
        	Error:      	Relative error is too high: 0.05 (expected)
        	            	        < 0.1450000778137451 (actual)
        	Test:       	TestHostMemory
FAIL
FAIL	go.opentelemetry.io/contrib/instrumentation/host	3.076s
FAIL

require.InEpsilon(t, hostUsed/hostUsedUtil, hostAvailable/hostAvailableUtil, tolerance)

Or,

=== RUN   TestHostMemory
    host_test.go:183: 
        	Error Trace:	host_test.go:183
        	Error:      	Relative error is too high: 0.05 (expected)
        	            	        < 0.05530730006530893 (actual)
        	Test:       	TestHostMemory
--- FAIL: TestHostMemory (3.07s)

require.InEpsilon(t, 1.0, hostUsedUtil+hostAvailableUtil, tolerance)
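For illustration, the failure mode of the second assertion can be sketched with a local stand-in for testify's require.InEpsilon and hypothetical utilization values (both the inEpsilon helper and the numbers are assumptions for this sketch, not taken from the actual test run):

```go
package main

import (
	"fmt"
	"math"
)

// inEpsilon mirrors testify's require.InEpsilon check: the relative
// error |expected-actual| / |expected| must stay below epsilon.
func inEpsilon(expected, actual, epsilon float64) bool {
	relErr := math.Abs(expected-actual) / math.Abs(expected)
	return relErr < epsilon
}

func main() {
	// Hypothetical utilization values (illustrative only). Because the
	// kernel's MemAvailable is an independent estimate, used + available
	// utilization can drift well away from 1.0.
	hostUsedUtil := 0.53125      // (Total - Free - Buffers - Cached) / Total
	hostAvailableUtil := 0.53125 // MemAvailable / Total

	sum := hostUsedUtil + hostAvailableUtil
	fmt.Printf("sum=%.4f pass=%v\n", sum, inEpsilon(1.0, sum, 0.05))
}
```

With these numbers the relative error is 0.0625, which exceeds the 0.05 tolerance, so the assertion fails even though both inputs are individually sensible.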

This was run in the following environment.

$ uname -srvmo
Linux 5.9.1-arch1-1 #1 SMP PREEMPT Sat, 17 Oct 2020 13:30:37 +0000 x86_64 GNU/Linux

Based on this build, the package we use to measure host memory reads the available memory directly from /proc/meminfo and uses the following algorithm to calculate the used memory:

ret.Used = ret.Total - ret.Free - ret.Buffers - ret.Cached

The MemAvailable value read from /proc/meminfo is an estimate that does not necessarily equal the complement of the above computation:

+MemAvailable: An estimate of how much memory is available for starting new
+              applications, without swapping. Calculated from MemFree,
+              SReclaimable, the size of the file LRU lists, and the low
+              watermarks in each zone.
+              The estimate takes into account that the system needs some
+              page cache to function well, and that not all reclaimable
+              slab will be reclaimable, due to items being in use. The
+              impact of those factors will vary from system to system.
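A small sketch with hypothetical /proc/meminfo numbers (all values invented for illustration) shows why the kernel's MemAvailable estimate need not equal Total minus the computed Used:

```go
package main

import "fmt"

func main() {
	// Hypothetical /proc/meminfo values in kB (illustrative only).
	total := 16000000.0
	free := 4000000.0
	buffers := 500000.0
	cached := 2500000.0
	memAvailable := 9800000.0 // the kernel's own estimate, read as-is

	// Our dependency derives used memory as Total - Free - Buffers - Cached.
	used := total - free - buffers - cached

	// If MemAvailable were simply the complement of Used, these two values
	// would match. The kernel estimate instead folds in reclaimable slab
	// and page-cache heuristics, so in general it does not.
	fmt.Printf("Total-Used=%.0f kB, MemAvailable=%.0f kB\n", total-used, memAvailable)
}
```

Here Total - Used comes out to 7,000,000 kB while the kernel reports 9,800,000 kB available, so any test equating the two is comparing an arithmetic identity against a heuristic.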

This means that our test validating the relationship between the reported available and used memory is at best comparing two independent estimates, and is likely to fall outside the tolerance we are testing for.

Proposal

We should remove the tolerance test between used and available memory, at least in the interim while we decide how to move forward. Having a flaky test that fails randomly on certain systems is not ideal.

For the actual measured values, we could replace our utilization metric measurement:

hostMemoryUtilization.Observation(float64(vmStats.Used)/float64(vmStats.Total)),

with a scaled version of the UsedPercent value (i.e. UsedPercent/100.0) our dependency reports. For the available system memory we could report 1 - UsedPercent/100.0. This would not be an accurate guess as to how much memory a theoretical application starting up would actually have, but it would match the expectations of users of this monitoring code: they will expect used + available to equal 1.0.
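A minimal sketch of this proposal, using a local vmStat struct as a self-contained stand-in for the dependency's VirtualMemoryStat (the field names mirror the real library, but the struct and values here are illustrative):

```go
package main

import "fmt"

// vmStat mirrors the fields of gopsutil's VirtualMemoryStat that this
// proposal relies on (a local stand-in so the sketch is self-contained).
type vmStat struct {
	Used, Total uint64
	UsedPercent float64
}

func main() {
	// Hypothetical stats in kB (illustrative only).
	vmStats := vmStat{Used: 8500000, Total: 16000000, UsedPercent: 53.125}

	// Current measurement: Used / Total.
	current := float64(vmStats.Used) / float64(vmStats.Total)

	// Proposed: scale the dependency's UsedPercent and derive available
	// utilization as its complement, so the two always sum to 1.0.
	usedUtil := vmStats.UsedPercent / 100.0
	availableUtil := 1.0 - usedUtil

	fmt.Printf("current=%.5f used=%.5f available=%.5f sum=%.1f\n",
		current, usedUtil, availableUtil, usedUtil+availableUtil)
}
```

By construction the two reported utilizations are complements, which is what makes the used + available = 1.0 expectation hold for consumers of these metrics.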

@MrAlias MrAlias added area: instrumentation Related to an instrumentation package area: testing Related to package testing bug Something isn't working labels Oct 28, 2020
@matej-g
Contributor

matej-g commented Oct 28, 2020

I've been noticing the same error on my local environment and I came to the same conclusion regarding Total vs Available vs Used memory. I think the interim change to the test makes sense. Some further thoughts:

This would not be an accurate guess as to how much memory a theoretical application starting up would actually have, but it would match the users of this monitoring code. They will expect used + available to equal 1.0.

As a user, I would not be entirely sure what to expect in this case. I guess used and available memory are usually understood the way they are interpreted by the library that the host instrumentation depends on; that seems consistent with the free utility, which displays available and used memory in the same manner. This means, as you pointed out, that available + used < total in realistic scenarios, since some part of memory is not accounted for in either of the two states, so even the sum of the utilizations will not equal 1.

However, this by itself might not pose a problem: if my understanding of the spec conventions and the surrounding discussion is correct, it is not required that the fractions of the two usage values add up to 1.

@MrAlias
Contributor Author

MrAlias commented Nov 13, 2020

As a user, I would not be entirely sure what to expect in this case. I guess used and available memory are usually understood the way they are interpreted by the library that the host instrumentation depends on; that seems consistent with the free utility, which displays available and used memory in the same manner. This means, as you pointed out, that available + used < total in realistic scenarios, since some part of memory is not accounted for in either of the two states, so even the sum of the utilizations will not equal 1.

However, this by itself might not pose a problem: if my understanding of the spec conventions and the surrounding discussion is correct, it is not required that the fractions of the two usage values add up to 1.

This is all a good point. With the fix to the test in place I'm okay closing this if we want to preserve the current value measured. @jmacd what are your thoughts on this?
