
Host instrumentation for available memory on Linux systems is less accurate than is tested for #425

Open
MrAlias opened this issue Oct 28, 2020 · 2 comments
Labels
area: instrumentation Related to an instrumentation package area: testing Related to package testing bug Something isn't working instrumentation: host

Comments

@MrAlias
Contributor

MrAlias commented Oct 28, 2020

Frequently the TestHostMemory test in the host instrumentation fails because the relative error is too high. E.g.

--- FAIL: TestHostMemory (1.05s)
    host_test.go:180: 
        	Error Trace:	host_test.go:180
        	Error:      	Relative error is too high: 0.05 (expected)
        	            	        < 0.1450000778137451 (actual)
        	Test:       	TestHostMemory
FAIL
FAIL	go.opentelemetry.io/contrib/instrumentation/host	3.076s
FAIL

require.InEpsilon(t, hostUsed/hostUsedUtil, hostAvailable/hostAvailableUtil, tolerance)

Or,

=== RUN   TestHostMemory
    host_test.go:183: 
        	Error Trace:	host_test.go:183
        	Error:      	Relative error is too high: 0.05 (expected)
        	            	        < 0.05530730006530893 (actual)
        	Test:       	TestHostMemory
--- FAIL: TestHostMemory (3.07s)

require.InEpsilon(t, 1.0, hostUsedUtil+hostAvailableUtil, tolerance)
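For illustration, the failure mode of the second assertion can be sketched with a local stand-in for testify's require.InEpsilon and hypothetical utilization values (both the inEpsilon helper and the numbers are assumptions for this sketch, not taken from the actual test run):

```go
package main

import (
	"fmt"
	"math"
)

// inEpsilon mirrors testify's require.InEpsilon check: the relative
// error |expected-actual| / |expected| must stay below epsilon.
func inEpsilon(expected, actual, epsilon float64) bool {
	relErr := math.Abs(expected-actual) / math.Abs(expected)
	return relErr < epsilon
}

func main() {
	// Hypothetical utilization values (illustrative only). Because the
	// kernel's MemAvailable is an independent estimate, used + available
	// utilization can drift well away from 1.0.
	hostUsedUtil := 0.53125      // (Total - Free - Buffers - Cached) / Total
	hostAvailableUtil := 0.53125 // MemAvailable / Total

	sum := hostUsedUtil + hostAvailableUtil
	fmt.Printf("sum=%.4f pass=%v\n", sum, inEpsilon(1.0, sum, 0.05))
}
```

With these numbers the relative error is 0.0625, which exceeds the 0.05 tolerance, so the assertion fails even though both inputs are individually sensible.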

This was run in the following environment.

$ uname -srvmo
Linux 5.9.1-arch1-1 #1 SMP PREEMPT Sat, 17 Oct 2020 13:30:37 +0000 x86_64 GNU/Linux

Based on this build, the package we use to measure host memory reads the available memory directly from /proc/meminfo and uses the following algorithm to calculate the used memory:

ret.Used = ret.Total - ret.Free - ret.Buffers - ret.Cached

The MemAvailable value read from /proc/meminfo is an estimate that does not necessarily equal the complement of the above computation:

+MemAvailable: An estimate of how much memory is available for starting new
+              applications, without swapping. Calculated from MemFree,
+              SReclaimable, the size of the file LRU lists, and the low
+              watermarks in each zone.
+              The estimate takes into account that the system needs some
+              page cache to function well, and that not all reclaimable
+              slab will be reclaimable, due to items being in use. The
+              impact of those factors will vary from system to system.
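A small sketch with hypothetical /proc/meminfo numbers (all values invented for illustration) shows why the kernel's MemAvailable estimate need not equal Total minus the computed Used:

```go
package main

import "fmt"

func main() {
	// Hypothetical /proc/meminfo values in kB (illustrative only).
	total := 16000000.0
	free := 4000000.0
	buffers := 500000.0
	cached := 2500000.0
	memAvailable := 9800000.0 // the kernel's own estimate, read as-is

	// Our dependency derives used memory as Total - Free - Buffers - Cached.
	used := total - free - buffers - cached

	// If MemAvailable were simply the complement of Used, these two values
	// would match. The kernel estimate instead folds in reclaimable slab
	// and page-cache heuristics, so in general it does not.
	fmt.Printf("Total-Used=%.0f kB, MemAvailable=%.0f kB\n", total-used, memAvailable)
}
```

Here Total - Used comes out to 7,000,000 kB while the kernel reports 9,800,000 kB available, so any test equating the two is comparing an arithmetic identity against a heuristic.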

This means that our test validating the relationship between the reported available and used memory is at best comparing two independent estimates, and is likely to fall outside the tolerance we are testing for.

Proposal

We should remove the tolerance test between used and available memory, at least in the interim while we decide how to move forward. Having a flaky test that fails randomly on certain systems is not ideal.

For the actual measured values, we could replace our utilization metric measurement:

hostMemoryUtilization.Observation(float64(vmStats.Used)/float64(vmStats.Total)),

with a scaled version of the UsedPercent value (i.e. UsedPercent/100.0) our dependency reports. For the available system memory we could report 1 - UsedPercent/100.0. This would not be an accurate guess as to how much memory a theoretical application starting up would actually have, but it would match the expectations of users of this monitoring code: they will expect used + available to equal 1.0.
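A minimal sketch of this proposal, using a local vmStat struct as a self-contained stand-in for the dependency's VirtualMemoryStat (the field names mirror the real library, but the struct and values here are illustrative):

```go
package main

import "fmt"

// vmStat mirrors the fields of gopsutil's VirtualMemoryStat that this
// proposal relies on (a local stand-in so the sketch is self-contained).
type vmStat struct {
	Used, Total uint64
	UsedPercent float64
}

func main() {
	// Hypothetical stats in kB (illustrative only).
	vmStats := vmStat{Used: 8500000, Total: 16000000, UsedPercent: 53.125}

	// Current measurement: Used / Total.
	current := float64(vmStats.Used) / float64(vmStats.Total)

	// Proposed: scale the dependency's UsedPercent and derive available
	// utilization as its complement, so the two always sum to 1.0.
	usedUtil := vmStats.UsedPercent / 100.0
	availableUtil := 1.0 - usedUtil

	fmt.Printf("current=%.5f used=%.5f available=%.5f sum=%.1f\n",
		current, usedUtil, availableUtil, usedUtil+availableUtil)
}
```

By construction the two reported utilizations are complements, which is what makes the used + available = 1.0 expectation hold for consumers of these metrics.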

@MrAlias MrAlias added area: instrumentation Related to an instrumentation package area: testing Related to package testing bug Something isn't working labels Oct 28, 2020
@matej-g
Contributor

matej-g commented Oct 28, 2020

I've been noticing the same error on my local environment and I came to the same conclusion regarding Total vs Available vs Used memory. I think the interim change to the test makes sense. Some further thoughts:

This would not be an accurate guess as to how much memory a theoretical application starting up would actually have, but it would match the users of this monitoring code. They will expect used + available to equal 1.0.

As a user, I would not be entirely sure what to expect in this case. I guess used and available memory are usually understood the way they are interpreted by the library that the host instrumentation depends on; that seems consistent with the free utility, which displays available and used memory in the same manner. This means, as you pointed out, that available + used < total in realistic scenarios, since some part of memory is not accounted for in either of the two states, so even the sum of the utilizations will not equal 1.

However, this by itself might not pose a problem: if my understanding of the spec conventions and the surrounding discussion is correct, it is not required that the fractions of the two usage values add up to 1.

@MrAlias
Contributor Author

MrAlias commented Nov 13, 2020

As a user, I would not be entirely sure what to expect in this case. I guess used and available memory are usually understood the way they are interpreted by the library that the host instrumentation depends on; that seems consistent with the free utility, which displays available and used memory in the same manner. This means, as you pointed out, that available + used < total in realistic scenarios, since some part of memory is not accounted for in either of the two states, so even the sum of the utilizations will not equal 1.

However, this by itself might not pose a problem: if my understanding of the spec conventions and the surrounding discussion is correct, it is not required that the fractions of the two usage values add up to 1.

This is all a good point. With the fix to the test in place I'm okay closing this if we want to preserve the current value measured. @jmacd what are your thoughts on this?
