Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stats: envoy prometheus endpoint fails promlint #2597 Destructor failing while calling derived class function #2722 #2752

Closed
wants to merge 2 commits into from

Conversation

pitiwari
Copy link
Contributor

@pitiwari pitiwari commented Mar 7, 2018

Description:
The admin endpoint of envoy when queried as a prometheus endpoint fails promlint.
Promlint is a sanity check tool offered by Prometheus to check the correctness
of a prometheus endpoint.

following issues were fixed
-fixed issue with metric type being printed twice
-added unit test as well
-fixed issue in destructors of CounterImpl and GuageImpl
-changed thread_local_store_test to remove check for count of free function calls

Signed-off-by: Piyush Tiwari [email protected]

Risk Level: Low

Testing:

bazel tests are passing (bazel test //test/... --test_env=ENVOY_IP_TEST_VERSIONS=v4only --define tcmalloc=disabled ). I had to use --define tcmalloc=disabled explictily if i dont use some tests were failing but fas per bazil config tcmalloc should be disabled.

some asan tests are failing due to mock tests failure. Apparently mock tests might be checking number of calls to alloc.free and which is now removed from the destructor since free is a pure virtual function and derived class reference is lost when we reach the base class destructor so apparently that was not the correct place for free. I had implemented destructor and free for derived class to delete the memory which derived class alloc created.

Deprecated

some existing mock tests will need to be depreciated

       Destructor failig while calling derived class function envoyproxy#2722
Description:

The admin endpoint of envoy when queried as a prometheus endpoint fails promlint.
Promlint is a sanity check tool offered by Prometheus to check the correctness
of a prometheus endpoint.

following issues were fixed
-fixed issue with metric type being prtinted twice
-added unit test as well
-fixed issue in destructors of CounterImpl and GuageImpl
-changed thread_local_store_test to remove check for count of free func calls

Signed-off-by: Piyush Tiwari <[email protected]>
@pitiwari
Copy link
Contributor Author

pitiwari commented Mar 7, 2018

i will fix the failing asan mock tests once i get feedback on the code

@ggreenway
Copy link
Contributor

It looks like there's two unrelated issues being worked on in this PR. Can you please separate them into two PRs?

@mrice32
Copy link
Member

mrice32 commented Mar 7, 2018

@htuch, once this is separated, I can take the first pass at the resulting PRs.

@mattklein123
Copy link
Member

@ggreenway FYI the non-Prom PR is goin to fix #2453

@pitiwari
Copy link
Contributor Author

pitiwari commented Mar 7, 2018

I had removed changes for issue #2597 from this pr, creating new pr for it. This pr has fix for #2722 (destructor calling pure virtual function)

Copy link
Member

@mrice32 mrice32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through the segfaults! I read through #2722 and this PR, and I'm still a bit confused. As I mentioned in my comment below, I'm skeptical that this is the right way to go. In the issue, you referred to a base class-derived class issue, but I'm having a little trouble understanding which base class references which derived class method in its destructor. Could you elaborate?

I did notice something in the code you included:

Stats::HeapRawStatDataAllocator alloc_;
Stats::RawStatData *data = alloc_.alloc(name);
Stats::CounterSharedPtr c1(new Stats::CounterImpl(*data, alloc_, std::string(name),std::move(tags)));

This may be a red herring, but just that bit of code looks dangerous because you need to be sure that the HeapRawStatDataAllocator outlives the CounterSharedPtr. This is because RawStatData holds a reference to its allocator, and it's accessed when the counter destructs.

~HeapRawStatDataAllocator();

private:
RawStatData* data_ = nullptr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused about this fix, so feel free to correct me if I'm misunderstanding. This seems to couple a single allocator with a single allocated object.

I thought this allocator was a long-lived object that allocates many RawStatData objects for use elsewhere. Is it your intention to change it to an allocator per RawStatData since you're keeping a member variable that tracks a single RawStatData in the allocator? If so, wouldn't you need to ensure that objects like IsolatedStoreImpl create an allocator per stat that they initialize? Without this, you're likely leaking RawStatData objects (which is likely where the ASAN failures are coming from). I'm skeptical that this is the right direction to go to fix this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrice32 FYI I think all that's needed here is just proper ref-counting per the other open ticket but I'm not sure. I don't know where the original crash was seen.

Copy link
Contributor Author

@pitiwari pitiwari Mar 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrice32 Thanks for looking into pr

Original error was "calling pure virtual function ", this was coming from destructor of ~CounterImpl() since it was calling alloc_.free(data_).
I was trying to create object of counterimpl and passing derived class reference (HeapRawStatDataAllocator) as one of the arguments

Stats::HeapRawStatDataAllocator alloc_;
Stats::RawStatData *data = alloc_.alloc(name);
Stats::CounterSharedPtr c1(new Stats::CounterImpl(*data, alloc_, std::string(name),std::move(tags)));

since in RawStatDataAllocator free is a pure virtual function. It seems intention was to call free of the derived class (HeapRawStatDataAllocator) in ~counterImpl destructor but by the time we reach destructor derived class reference was already lost since derived class destructor was called before.

Also for your suggestion of ref counting. Currently expectation is ref_count will be always one so ref count is not being incremented. I got this understanding based on the existing comment in free function (// This allocator does not ever have concurrent access to the raw data. ASSERT(data.ref_count_ == 1);.

In sort simple question might be in the current code (without my changes) can we create CounterImpl object . If yes please explain.

Please let me know your input , i will change code accordingly.

Copy link
Member

@mrice32 mrice32 Mar 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for the explanation!

Why would the HeapRawStatDataAllocator class's methods be gone during the call to the destructor of CounterImpl? CounterImpl just has a reference to a RawStatDataAllocator, but it doesn't appear to have any inheritance relationship with or ownership over RawStatDataAllocator. IIUC, the reference to RawStatDataAllocator should be valid (and not in the process of being destructed) in the CounterImpl destructor as long as HeapRawStatDataAllocator (wherever it's owned) has not been destructed already.

One thing that might help here (credit @dnoe) would be if you could write up a test case that causes the failure because it's difficult to understand exactly what we're trying to fix if we don't have visibility into where it occurs. If a test in #2757 triggers this, then you can just point me to that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrice32 yes test in #2757 will recreate the issue, just cherry-pick the commit, since you need change in other file as well since that UT is testing that function :).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitiwari I will comment there. I think I may see the issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the discussion and work in #2757, should this PR be closed? Or is there still an issue that needs to be fixed here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @pitiwari was going to look into fixing #2453 here. However, maybe it makes sense to close this and open a new PR when the fix is ready since it may be a week or two. @pitiwari, what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My vote is to close this PR and create a new one to fix #2453. This one has history/comments for fixing 2 different issues, neither of which is the one that will now be fixed here.

@mattklein123
Copy link
Member

Agreed let's close this and open a new PR to fix #2453. Closing,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants