RFC: fixing the segment.nce metrics? #226

Closed · bmcfee opened this issue Oct 21, 2016 · 9 comments
bmcfee (Collaborator) commented Oct 21, 2016

[warning: opinions follow.]

Summary

I think the normalized conditional entropy (NCE) scores for segmentation are not properly defined, and we could fix them.

Background

The NCE scores are defined in terms of conditional entropy between the label distributions estimated from the reference and estimated segmentations. From Lukashevich's paper:

    S_o = 1 - H(E|A) / log(N_e)
    S_u = 1 - H(A|E) / log(N_a)

where:

  • H(E|A) is the entropy of the estimated label given the "annotated" (reference) label
  • N_e is the number of unique labels in the estimated segmentation

(and similarly for H(A|E) and N_a).

The intuition here is that log N_e is the maximum possible entropy of any distribution over N_e labels (i.e., the uniform distribution), so it's a natural way to normalize S_o.
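To make the current scheme concrete, here's a minimal sketch (not the mir_eval implementation; it assumes you've already built a frame-level joint distribution `joint` over label pairs, with rows indexed by reference labels and columns by estimated labels):

```python
import numpy as np

def conditional_entropy(joint):
    """H(columns | rows) in bits, given a joint distribution that sums to 1."""
    p_row = joint.sum(axis=1, keepdims=True)  # marginal over the conditioning variable
    with np.errstate(divide="ignore", invalid="ignore"):
        # p(a, e) * log2(p(a) / p(a, e)), summed over the support of the joint
        terms = np.where(joint > 0, joint * np.log2(p_row / joint), 0.0)
    return terms.sum()

def nce_uniform(joint):
    """Current NCE scores: normalize by the maximum entropy log2(N)."""
    n_a, n_e = joint.shape  # rows: reference labels, cols: estimated labels
    s_over = 1.0 - conditional_entropy(joint) / np.log2(n_e) if n_e > 1 else 1.0
    s_under = 1.0 - conditional_entropy(joint.T) / np.log2(n_a) if n_a > 1 else 1.0
    return s_over, s_under
```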

What's wrong

The uniform distribution has nothing to do with either the reference or the estimate, and the normalization is easily skewed when we have to pad additional segments onto an estimate so that it spans the same time extent as the reference. (This is covered in the mir_eval paper, where it explains the inflated deviation from the MIREX implementation.)

A better way, I think, is to normalize by the marginal entropy H(E):
    S_o = 1 - H(E|A) / H(E)

(and similarly S_u = 1 - H(A|E) / H(A)).

This would capture the actual decrease in uncertainty in predicting the estimated label when the reference label is provided. Since H(E|A) <= H(E) for any distribution A (conditioning never increases entropy), the ratio stays in [0, 1], so it still provides a valid normalization.

If we use the marginal entropy instead of uniform, then new labels due to tiny padding segments do not significantly change the normalization on the conditional entropy, so the results should be more stable.
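Under the same assumptions as the sketch above, the proposed change is just a different denominator:

```python
def entropy(p):
    """Shannon entropy in bits of a marginal distribution."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def nce_marginal(joint):
    """Proposed NCE scores: normalize by the marginal entropies H(E), H(A)."""
    h_e = entropy(joint.sum(axis=0))  # H(E), marginal over estimated labels
    h_a = entropy(joint.sum(axis=1))  # H(A), marginal over reference labels
    s_over = 1.0 - conditional_entropy(joint) / h_e if h_e > 0 else 1.0
    s_under = 1.0 - conditional_entropy(joint.T) / h_a if h_a > 0 else 1.0
    return s_over, s_under
```

A tiny padding segment adds a near-zero-mass label: it bumps log2(N_e) by a whole label's worth, but barely moves H(E).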

Proposed modification

Add a flag to the NCE metrics, false by default, which changes the normalization from uniform entropy to the marginal entropy.
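At the call site this would look something like the following (the flag name `marginal` is just a placeholder for illustration; `ref_intervals`/`ref_labels`/`est_intervals`/`est_labels` are the usual interval and label arrays):

```python
import mir_eval

# Current behavior (uniform normalization) stays the default:
s_over, s_under, s_f = mir_eval.segment.nce(ref_intervals, ref_labels,
                                            est_intervals, est_labels)

# Opt in to the marginal-entropy normalization:
s_over, s_under, s_f = mir_eval.segment.nce(ref_intervals, ref_labels,
                                            est_intervals, est_labels,
                                            marginal=True)
```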

Incidentally, this modification would render NCE equivalent to the V-measure, which was originally included in the segment module but removed, since it isn't used by anyone in MIR or MIREX.

craffel (Collaborator) commented Oct 21, 2016

As someone outside of segment eval, this seems reasonable to me. Do you have any intuition for how much this changes the score in practice?

> Add a flag to the NCE metrics, false by default, which changes the normalization from uniform entropy to the marginal entropy.

It seems to me that if your justification is correct/widely agreed upon, we should make a stronger move than this. Can you reach out to others who may have thought about this and encourage them to participate in this issue before we take action?

bmcfee (Collaborator, Author) commented Oct 21, 2016

> As someone outside of segment eval, this seems reasonable to me. Do you have any intuition for how much this changes the score in practice?

Sure:

  • H(E) <= H(Uniform) = log N_e (since uniform is maximum entropy)
  • -> H(E|A) / H(E) >= H(E|A) / H(Uniform) (since we're dividing by a smaller number)
  • -> 1 - H(E|A) / H(E) <= 1 - H(E|A) / H(Uniform) = S_o

so the over-segmentation score will be lower if we use marginal entropy instead of uniform.
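A toy example (made-up numbers) to see the gap:

```python
import numpy as np

# Made-up joint distribution P(A, E): rows = reference labels, cols = estimated labels
joint = np.array([[0.60, 0.05, 0.05],
                  [0.05, 0.20, 0.05]])

p_a = joint.sum(axis=1, keepdims=True)  # marginal P(A)
p_e = joint.sum(axis=0)                 # marginal P(E)

h_e_given_a = np.sum(joint * np.log2(p_a / joint))  # H(E|A) ~ 0.89 bits
h_e = -np.sum(p_e * np.log2(p_e))                   # H(E)   ~ 1.24 bits
h_uniform = np.log2(joint.shape[1])                 # log2(3) ~ 1.58 bits

print(1 - h_e_given_a / h_uniform)  # ~0.44: current S_o (uniform)
print(1 - h_e_given_a / h_e)        # ~0.28: proposed S_o (marginal), lower
```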

EDIT: sorry, I missed the word "much" in your question. I haven't implemented or tested this yet, but I don't expect a single answer across the board. The amount of change will depend on how many labels are in the estimate, and how they're distributed. But regardless of how much it changes, the scores will always be lower.

> It seems to me that if your justification is correct/widely agreed upon, we should make a stronger move than this. Can you reach out to others who may have thought about this and encourage them to participate in this issue before we take action?

Sure. I figured I'd start the conversation here and then kick it out to music-ir.

craffel (Collaborator) commented Oct 21, 2016

> EDIT: sorry, I missed the word "much" in your question. I haven't implemented or tested this yet, but I don't expect a single answer across the board. The amount of change will depend on how many labels are in the estimate, and how they're distributed. But regardless of how much it changes, the scores will always be lower.

Yeah, my question was basically whether you had a rough idea, for a typical song and a sane segmenter algorithm, whether this would change the score by a few percent, a few tens of percent, etc.

dpwe (Collaborator) commented Oct 25, 2016

I assume in most real examples, the distribution of E is far from uniform, so H(E) may be much smaller than log_2(N(E)). I would expect the values to be scaled quite differently and not recognizably comparable. But I like the proposal, since I think it gives a value which is more comparable between different tracks, where a small H(E) may drastically limit the largest possible value of H(E|A).

DAn.


bmcfee (Collaborator, Author) commented Nov 1, 2016

Here are the comparisons on the SALAMI data set for upper and lower annotations, over and under, using both the uniform (current) and marginal (proposed) normalization:

[figure: per-track score distributions on SALAMI, upper and lower annotations, over/under scores, uniform vs. marginal normalization]

Over and under are interchangeable here: they refer to which annotator is treated as reference and which is the estimate.

In terms of statistics, here's what we have:

| stat | lower_marginal_over | lower_marginal_under | lower_uniform_over | lower_uniform_under | upper_marginal_over | upper_marginal_under | upper_uniform_over | upper_uniform_under |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mean | 0.730 | 0.703 | 0.781 | 0.759 | 0.691 | 0.664 | 0.785 | 0.768 |
| std | 0.200 | 0.196 | 0.165 | 0.163 | 0.252 | 0.249 | 0.190 | 0.182 |
| min | 0.000 | 0.000 | 0.000 | 0.030 | 0.000 | 0.000 | 0.000 | 0.084 |
| 25% | 0.622 | 0.583 | 0.682 | 0.665 | 0.544 | 0.510 | 0.679 | 0.654 |
| 50% | 0.774 | 0.732 | 0.815 | 0.780 | 0.752 | 0.698 | 0.826 | 0.792 |
| 75% | 0.895 | 0.863 | 0.914 | 0.888 | 0.898 | 0.871 | 0.944 | 0.923 |
| max | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

So, as expected, there's a drop when we use marginal entropy. It appears to be minor on the lower segs and rather severe on the upper segs, which makes sense since the lower segs tend to be more consistent in duration. The ones that achieve a score of 0 with marginal but not uniform are those for which the reference annotation is significantly non-uniform, but the estimate has little-to-no mutual information with it.

pafoster commented Dec 5, 2016

Hi All,

To briefly add to this, having seen the recent [email protected] list discussion about naming the measure: 1 - H(E|A) / H(E) = 1 - (H(E) - I(E;A)) / H(E) = I(E;A) / H(E). So I suppose one could generically refer to it as "normalised mutual information", if the aim is to distinguish it from the previous measure? It's also worth noting that in terms of mutual information, the F1 score of the revised measures looks rather nice: 2 I(X;Y) / (H(X) + H(Y)).
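Spelling that out, with both revised scores written as mutual-information ratios, the harmonic mean telescopes:

```latex
S_o = \frac{I(E;A)}{H(E)}, \qquad S_u = \frac{I(E;A)}{H(A)}

F_1 = \frac{2\, S_o S_u}{S_o + S_u}
    = \frac{2\, I(E;A)^2 \big/ \bigl(H(E)\,H(A)\bigr)}
           {I(E;A)\,\bigl(H(E)+H(A)\bigr) \big/ \bigl(H(E)\,H(A)\bigr)}
    = \frac{2\, I(E;A)}{H(E)+H(A)}
```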

Peter Foster

bmcfee (Collaborator, Author) commented Feb 9, 2017

> So I suppose one could generically refer to it as "normalised mutual information", if the aim is to distinguish it from the previous measure?

We could; I worry that it's too similar to "normalized conditional entropy" though. I figure since V-measure is already defined in the literature, we may as well use it?

> It's also worth noting that in terms of mutual information, the F1 score of the revised measures looks rather nice: 2 I(X;Y) / (H(X) + H(Y)).

Ooooh, I hadn't noticed that before. Nice!

craffel (Collaborator) commented Mar 3, 2017

Was this resolved via #227?

bmcfee (Collaborator, Author) commented Mar 3, 2017

> Was this resolved via #227?

I believe so, yes.

craffel closed this as completed Mar 4, 2017