RFC: fixing the segment.nce metrics? #226

Closed · bmcfee opened this issue Oct 21, 2016 · 9 comments
bmcfee (Collaborator) commented Oct 21, 2016

[warning: opinions follow.]

Summary

I think the normalized conditional entropy (NCE) scores for segmentation are not properly defined, and we could fix them.

Background

The NCE scores are defined in terms of conditional entropy between the label distributions estimated from the reference and estimated segmentations. From Lukashevich's paper:

    S_o = 1 - H(E|A) / log(N_e)
    S_u = 1 - H(A|E) / log(N_a)

where:

  • H(E|A) is the entropy of the estimated label given the "annotated" (reference) label
  • N_e is the number of unique labels in the estimated segmentation

(and similarly for H(A|E) and N_a).

The intuition here is that log N_e is the maximum possible entropy of any distribution over N_e labels (i.e., the uniform distribution), so it's a natural way to normalize S_o.
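To make the current scheme concrete, here's a minimal sketch (not the mir_eval implementation; it assumes you've already built a frame-level joint distribution `joint` over label pairs, with rows indexed by reference labels and columns by estimated labels):

```python
import numpy as np

def conditional_entropy(joint):
    """H(columns | rows) in bits, given a joint distribution that sums to 1."""
    p_row = joint.sum(axis=1, keepdims=True)  # marginal over the conditioning variable
    with np.errstate(divide="ignore", invalid="ignore"):
        # p(a, e) * log2(p(a) / p(a, e)), summed over the support of the joint
        terms = np.where(joint > 0, joint * np.log2(p_row / joint), 0.0)
    return terms.sum()

def nce_uniform(joint):
    """Current NCE scores: normalize by the maximum entropy log2(N)."""
    n_a, n_e = joint.shape  # rows: reference labels, cols: estimated labels
    s_over = 1.0 - conditional_entropy(joint) / np.log2(n_e) if n_e > 1 else 1.0
    s_under = 1.0 - conditional_entropy(joint.T) / np.log2(n_a) if n_a > 1 else 1.0
    return s_over, s_under
```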

What's wrong

The uniform distribution has nothing to do with either the reference or the estimate, and the normalization is easily skewed when we have to pad additional segments onto an estimate so that it spans the same time extent as the reference. (This is covered in the mir_eval paper, where it explains the inflated deviation from the MIREX implementation.)

A better way, I think, is to normalize by the marginal entropy H(E):
    S_o = 1 - H(E|A) / H(E)

(and similarly S_u = 1 - H(A|E) / H(A)).

This would capture the actual decrease in uncertainty in predicting the estimated label when the reference label is provided. Since H(E|A) <= H(E) for any distribution A (conditioning never increases entropy), the ratio stays in [0, 1], so it still provides a valid normalization.

If we use the marginal entropy instead of uniform, then new labels due to tiny padding segments do not significantly change the normalization on the conditional entropy, so the results should be more stable.
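Under the same assumptions as the sketch above, the proposed change is just a different denominator:

```python
def entropy(p):
    """Shannon entropy in bits of a marginal distribution."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def nce_marginal(joint):
    """Proposed NCE scores: normalize by the marginal entropies H(E), H(A)."""
    h_e = entropy(joint.sum(axis=0))  # H(E), marginal over estimated labels
    h_a = entropy(joint.sum(axis=1))  # H(A), marginal over reference labels
    s_over = 1.0 - conditional_entropy(joint) / h_e if h_e > 0 else 1.0
    s_under = 1.0 - conditional_entropy(joint.T) / h_a if h_a > 0 else 1.0
    return s_over, s_under
```

A tiny padding segment adds a near-zero-mass label: it bumps log2(N_e) by a whole label's worth, but barely moves H(E).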

Proposed modification

Add a flag to the NCE metrics, false by default, which changes the normalization from uniform entropy to the marginal entropy.
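At the call site this would look something like the following (the flag name `marginal` is just a placeholder for illustration; `ref_intervals`/`ref_labels`/`est_intervals`/`est_labels` are the usual interval and label arrays):

```python
import mir_eval

# Current behavior (uniform normalization) stays the default:
s_over, s_under, s_f = mir_eval.segment.nce(ref_intervals, ref_labels,
                                            est_intervals, est_labels)

# Opt in to the marginal-entropy normalization:
s_over, s_under, s_f = mir_eval.segment.nce(ref_intervals, ref_labels,
                                            est_intervals, est_labels,
                                            marginal=True)
```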

Incidentally, this modification would render NCE equivalent to the V-measure, which was originally included in the segment module but removed, since it isn't used by anyone in MIR or MIREX.

craffel (Collaborator) commented Oct 21, 2016

As someone outside of segment eval, this seems reasonable to me. Do you have any intuition for how much this changes the score in practice?

> Add a flag to the NCE metrics, false by default, which changes the normalization from uniform entropy to the marginal entropy.

It seems to me that if your justification is correct/widely agreed upon, we should make a stronger move than this. Can you reach out to others who may have thought about this and encourage them to participate in this issue before we take action?

bmcfee (Collaborator, Author) commented Oct 21, 2016

> As someone outside of segment eval, this seems reasonable to me. Do you have any intuition for how much this changes the score in practice?

Sure:

  • H(E) <= H(Uniform) = log N_e (since uniform is maximum entropy)
  • -> H(E|A) / H(E) >= H(E|A) / H(Uniform) (since we're dividing by a smaller number)
  • -> 1 - H(E|A) / H(E) <= 1 - H(E|A) / H(Uniform) = S_o

so the over-segmentation score will be lower if we use marginal entropy instead of uniform.
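A toy example (made-up numbers) to see the gap:

```python
import numpy as np

# Made-up joint distribution P(A, E): rows = reference labels, cols = estimated labels
joint = np.array([[0.60, 0.05, 0.05],
                  [0.05, 0.20, 0.05]])

p_a = joint.sum(axis=1, keepdims=True)  # marginal P(A)
p_e = joint.sum(axis=0)                 # marginal P(E)

h_e_given_a = np.sum(joint * np.log2(p_a / joint))  # H(E|A) ~ 0.89 bits
h_e = -np.sum(p_e * np.log2(p_e))                   # H(E)   ~ 1.24 bits
h_uniform = np.log2(joint.shape[1])                 # log2(3) ~ 1.58 bits

print(1 - h_e_given_a / h_uniform)  # ~0.44: current S_o (uniform)
print(1 - h_e_given_a / h_e)        # ~0.28: proposed S_o (marginal), lower
```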

EDIT: sorry, I missed the word "much" in your question. I haven't implemented or tested this yet, but I don't expect a single answer across the board. The amount of change will depend on how many labels are in the estimate, and how they're distributed. But regardless of how much it changes, the scores will always be lower.

> It seems to me that if your justification is correct/widely agreed upon, we should make a stronger move than this. Can you reach out to others who may have thought about this and encourage them to participate in this issue before we take action?

Sure. I figured I'd start the conversation here and then kick it out to music-ir.

craffel (Collaborator) commented Oct 21, 2016

> EDIT: sorry, I missed the word "much" in your question. I haven't implemented or tested this yet, but I don't expect a single answer across the board. The amount of change will depend on how many labels are in the estimate, and how they're distributed. But regardless of how much it changes, the scores will always be lower.

Yeah, my question was basically whether you had a rough idea, for a typical song and a sane segmenter algorithm, whether this would change the score by a few percent, a few tens of percent, etc.

dpwe (Collaborator) commented Oct 25, 2016

I assume in most real examples, the distribution of E is far from uniform, so H(E) may be much smaller than log_2(N(E)). I would expect the values to be scaled quite differently and not recognizably comparable. But I like the proposal, since I think it gives a value which is more comparable between different tracks, where a small H(E) may drastically limit the largest possible value of H(E|A).

DAn.


bmcfee (Collaborator, Author) commented Nov 1, 2016

Here are the comparisons on the SALAMI data set for upper and lower annotations, over and under, using both the uniform (current) and marginal (proposed) normalization:

[figure: per-track score distributions on SALAMI, upper and lower annotations, over/under scores, uniform vs. marginal normalization]

Over and under are interchangeable here: they refer to which annotator is treated as reference and which is the estimate.

In terms of statistics, here's what we have:

| stat | lower_marginal_over | lower_marginal_under | lower_uniform_over | lower_uniform_under | upper_marginal_over | upper_marginal_under | upper_uniform_over | upper_uniform_under |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mean | 0.730 | 0.703 | 0.781 | 0.759 | 0.691 | 0.664 | 0.785 | 0.768 |
| std | 0.200 | 0.196 | 0.165 | 0.163 | 0.252 | 0.249 | 0.190 | 0.182 |
| min | 0.000 | 0.000 | 0.000 | 0.030 | 0.000 | 0.000 | 0.000 | 0.084 |
| 25% | 0.622 | 0.583 | 0.682 | 0.665 | 0.544 | 0.510 | 0.679 | 0.654 |
| 50% | 0.774 | 0.732 | 0.815 | 0.780 | 0.752 | 0.698 | 0.826 | 0.792 |
| 75% | 0.895 | 0.863 | 0.914 | 0.888 | 0.898 | 0.871 | 0.944 | 0.923 |
| max | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

So, as expected, there's a drop when we use marginal entropy. It appears to be minor on the lower segs and rather severe on the upper segs, which makes sense since the lower segs tend to be more consistent in duration. The ones that achieve a score of 0 with marginal but not uniform are those for which the reference annotation is significantly non-uniform, but the estimate has little-to-no mutual information with it.

pafoster commented Dec 5, 2016

Hi All,

To briefly add to this, having seen the recent [email protected] list discussion about naming the measure: 1 - H(E|A) / H(E) = 1 - (H(E) - I(E;A)) / H(E) = I(E;A) / H(E). So I suppose one could generically refer to it as "normalised mutual information", if the aim is to distinguish it from the previous measure? It's also worth noting that in terms of mutual information, the F1 score of the revised measures looks rather nice: 2 I(X;Y) / (H(X) + H(Y)).
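Spelling that out, with both revised scores written as mutual-information ratios, the harmonic mean telescopes:

```latex
S_o = \frac{I(E;A)}{H(E)}, \qquad S_u = \frac{I(E;A)}{H(A)}

F_1 = \frac{2\, S_o S_u}{S_o + S_u}
    = \frac{2\, I(E;A)^2 \big/ \bigl(H(E)\,H(A)\bigr)}
           {I(E;A)\,\bigl(H(E)+H(A)\bigr) \big/ \bigl(H(E)\,H(A)\bigr)}
    = \frac{2\, I(E;A)}{H(E)+H(A)}
```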

Peter Foster

bmcfee (Collaborator, Author) commented Feb 9, 2017

> So I suppose one could generically refer to it as "normalised mutual information", if the aim is to distinguish it from the previous measure?

We could; I worry that it's too similar to "normalized conditional entropy" though. I figure since V-measure is already defined in the literature, we may as well use it?

> It's also worth noting that in terms of mutual information, the F1 score of the revised measures looks rather nice: 2 I(X;Y) / (H(X) + H(Y)).

Ooooh, I hadn't noticed that before. Nice!

craffel (Collaborator) commented Mar 3, 2017

Was this resolved via #227?

bmcfee (Collaborator, Author) commented Mar 3, 2017

> Was this resolved via #227?

I believe so, yes.

craffel closed this as completed Mar 4, 2017