-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stats: envoy prometheus endpoint fails promlint #2597 Destructor failing while calling derived class function #2722 #2752
Conversation
Destructor failig while calling derived class function envoyproxy#2722 Description: The admin endpoint of envoy when queried as a prometheus endpoint fails promlint. Promlint is a sanity check tool offered by Prometheus to check the correctness of a prometheus endpoint. following issues were fixed -fixed issue with metric type being prtinted twice -added unit test as well -fixed issue in destructors of CounterImpl and GuageImpl -changed thread_local_store_test to remove check for count of free func calls Signed-off-by: Piyush Tiwari <[email protected]>
i will fix the failing asan mock tests once i get feedback on the code |
It looks like there's two unrelated issues being worked on in this PR. Can you please separate them into two PRs? |
@htuch, once this is separated, I can take the first pass at the resulting PRs. |
in seperate pr
@ggreenway FYI the non-Prom PR is goin to fix #2453 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working through the segfaults! I read through #2722 and this PR, and I'm still a bit confused. As I mentioned in my comment below, I'm skeptical that this is the right way to go. In the issue, you referred to a base class-derived class issue, but I'm having a little trouble understanding which base class references which derived class method in its destructor. Could you elaborate?
I did notice something in the code you included:
Stats::HeapRawStatDataAllocator alloc_;
Stats::RawStatData *data = alloc_.alloc(name);
Stats::CounterSharedPtr c1(new Stats::CounterImpl(*data, alloc_, std::string(name),std::move(tags)));
This may be a red herring, but just that bit of code looks dangerous because you need to be sure that the HeapRawStatDataAllocator
outlives the CounterSharedPtr
. This is because RawStatData
holds a reference to its allocator, and it's accessed when the counter destructs.
~HeapRawStatDataAllocator(); | ||
|
||
private: | ||
RawStatData* data_ = nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused about this fix, so feel free to correct me if I'm misunderstanding. This seems to couple a single allocator with a single allocated object.
I thought this allocator was a long-lived object that allocates many RawStatData
objects for use elsewhere. Is it your intention to change it to an allocator per RawStatData
since you're keeping a member variable that tracks a single RawStatData
in the allocator? If so, wouldn't you need to ensure that objects like IsolatedStoreImpl
create an allocator per stat that they initialize? Without this, you're likely leaking RawStatData
objects (which is likely where the ASAN failures are coming from). I'm skeptical that this is the right direction to go to fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrice32 FYI I think all that's needed here is just proper ref-counting per the other open ticket but I'm not sure. I don't know where the original crash was seen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrice32 Thanks for looking into pr
Original error was "calling pure virtual function ", this was coming from destructor of ~CounterImpl() since it was calling alloc_.free(data_).
I was trying to create object of counterimpl and passing derived class reference (HeapRawStatDataAllocator) as one of the arguments
Stats::HeapRawStatDataAllocator alloc_;
Stats::RawStatData *data = alloc_.alloc(name);
Stats::CounterSharedPtr c1(new Stats::CounterImpl(*data, alloc_, std::string(name),std::move(tags)));
since in RawStatDataAllocator free is a pure virtual function. It seems intention was to call free of the derived class (HeapRawStatDataAllocator) in ~counterImpl destructor but by the time we reach destructor derived class reference was already lost since derived class destructor was called before.
Also for your suggestion of ref counting. Currently expectation is ref_count will be always one so ref count is not being incremented. I got this understanding based on the existing comment in free function (// This allocator does not ever have concurrent access to the raw data. ASSERT(data.ref_count_ == 1);.
In sort simple question might be in the current code (without my changes) can we create CounterImpl object . If yes please explain.
Please let me know your input , i will change code accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks for the explanation!
Why would the HeapRawStatDataAllocator
class's methods be gone during the call to the destructor of CounterImpl
? CounterImpl
just has a reference to a RawStatDataAllocator
, but it doesn't appear to have any inheritance relationship with or ownership over RawStatDataAllocator
. IIUC, the reference to RawStatDataAllocator
should be valid (and not in the process of being destructed) in the CounterImpl
destructor as long as HeapRawStatDataAllocator
(wherever it's owned) has not been destructed already.
One thing that might help here (credit @dnoe) would be if you could write up a test case that causes the failure because it's difficult to understand exactly what we're trying to fix if we don't have visibility into where it occurs. If a test in #2757 triggers this, then you can just point me to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitiwari I will comment there. I think I may see the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the discussion and work in #2757, should this PR be closed? Or is there still an issue that needs to be fixed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My vote is to close this PR and create a new one to fix #2453. This one has history/comments for fixing 2 different issues, neither of which is the one that will now be fixed here.
Agreed let's close this and open a new PR to fix #2453. Closing, |
Description:
The admin endpoint of envoy when queried as a prometheus endpoint fails promlint.
Promlint is a sanity check tool offered by Prometheus to check the correctness
of a prometheus endpoint.
following issues were fixed
-fixed issue with metric type being printed twice
-added unit test as well
-fixed issue in destructors of CounterImpl and GuageImpl
-changed thread_local_store_test to remove check for count of free function calls
Signed-off-by: Piyush Tiwari [email protected]
Risk Level: Low
Testing:
bazel tests are passing (bazel test //test/... --test_env=ENVOY_IP_TEST_VERSIONS=v4only --define tcmalloc=disabled ). I had to use --define tcmalloc=disabled explictily if i dont use some tests were failing but fas per bazil config tcmalloc should be disabled.
some asan tests are failing due to mock tests failure. Apparently mock tests might be checking number of calls to alloc.free and which is now removed from the destructor since free is a pure virtual function and derived class reference is lost when we reach the base class destructor so apparently that was not the correct place for free. I had implemented destructor and free for derived class to delete the memory which derived class alloc created.
Deprecated
some existing mock tests will need to be depreciated