RFC: fixing the segment.nce metrics? #226
Comments
As someone outside of segment eval, this seems reasonable to me. Do you have any intuition for how much this changes the score in practice?
It seems to me that if your justification is correct/widely agreed upon, we should make a stronger move than this. Can you reach out to others who may have thought about this and encourage them to participate in this issue before we take action?
Sure:
Since H(E) <= log N_e, the over-segmentation score will be lower if we use marginal entropy instead of uniform. EDIT: sorry, I missed the word "much" in your question. I haven't implemented or tested this yet, but I don't expect a single answer across the board. The amount of change will depend on how many labels are in the estimate and how they're distributed. But regardless of how much it changes, the scores will always be lower.
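[Editor's note: the claim above is easy to check numerically. The following is a toy, stdlib-only sketch over frame-level label sequences with entropies in nats, not mir_eval's implementation; the label sequences are invented for illustration.]

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (nats) of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(est, ref):
    """H(est | ref) = H(est, ref) - H(ref), from paired frame labels."""
    return entropy(list(zip(est, ref))) - entropy(ref)

# Reference A and a deliberately skewed estimate E over 3 labels.
A = list("aaaabbbbcccc")
E = list("xxxxxxxxxyyz")  # highly non-uniform, so H(E) << log(3)

h_cond = conditional_entropy(E, A)
s_uniform = 1 - h_cond / math.log(len(set(E)))  # current normalization
s_marginal = 1 - h_cond / entropy(E)            # proposed normalization

print(s_uniform, s_marginal)  # the marginal-normalized score is lower
```

The more skewed the estimated label distribution, the bigger the gap between the two scores.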
Sure. I figured I'd start the conversation here and then kick it out to music-ir.
Yeah, my question was basically whether you had a rough idea, for a typical song and a sane segmenter algorithm, whether this would change the score by a few percent, a few tens of percent, etc.
I assume in most real examples, the distribution of E is far from uniform.

DAn

On Friday, October 21, 2016, Colin Raffel [email protected] wrote:
Here are the comparisons on the SALAMI dataset, for upper and lower annotations, over and under, using both the uniform (current) and marginal (proposed) normalization. ("Over" and "under" are interchangeable here: they refer to which annotator is treated as the reference and which as the estimate.) In terms of statistics, here's what we have:
So, as expected, there's a drop when we use marginal entropy. It appears to be minor on the lower segs and rather severe on the upper segs, which makes sense since the lower segs tend to be more consistent in duration. The ones that achieve a score of 0 with marginal but not uniform are those for which the reference annotation is significantly non-uniform, but the estimate has little-to-no mutual information with it.
Hi All,

To briefly add to this, having seen the recent [email protected] list discussion about naming the measure:

1 - H(E|A) / H(E) = 1 - (H(E) - I(E;A)) / H(E) = I(E;A) / H(E)

So I suppose one could generically refer to it as "normalised mutual information", if the aim is to distinguish it from the previous measure? It's also worth noting that in terms of mutual information, the F1 score of the revised measures looks rather nice: 2 I(X;Y) / (H(X) + H(Y)).

Peter Foster
We could; I worry that it's too similar to "normalized conditional entropy" though. I figure since V-measure is already defined in the literature, we may as well use it?
Ooooh, I hadn't noticed that before. Nice!
Was this resolved via #227?
I believe so, yes. |
[warning: opinions follow.]
Summary
I think the normalized conditional entropy (NCE) scores for segmentation are not properly defined, and we can fix them.
Background
The NCE scores are defined in terms of conditional entropy between the label distributions estimated from the reference and estimated segmentations. From Lukashevich's paper:
S_o = 1 - H(E|A) / log(N_e)

where:

- H(E|A) is the entropy of the estimated label given the "annotated" (reference) label
- N_e is the number of unique labels in the estimated segmentation

(and similarly for H(A|E) and N_a).

The intuition here is that log N_e is the maximum possible entropy of any distribution over N_e labels (i.e., the uniform distribution), so it's a good way to normalize the conditional entropy in S_o.

What's wrong
The uniform distribution has nothing to do with either the reference or the estimate, and the normalization can be easily skewed when we have to pad additional segments onto an estimate so that it spans the same time extent as the reference. (This is covered in the mir_eval paper, where it explains the inflated deviation from the MIREX implementation.)
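To illustrate the padding problem, here is a toy, stdlib-only sketch (frame-level labels, entropies in nats; not mir_eval's actual code): appending a single extra frame carrying a new label bumps N_e from 2 to 3, and the jump in log(N_e) inflates the score even though the estimate barely changed.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (nats) of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def cond_entropy(est, ref):
    """H(est | ref) = H(est, ref) - H(ref)."""
    return entropy(list(zip(est, ref))) - entropy(ref)

ref = ["A"] * 50 + ["B"] * 50
est = ["x"] * 50 + ["y"] * 25 + ["x"] * 25   # imperfect estimate
ref_p = ref + ["B"]
est_p = est + ["pad"]                        # one-frame pad with a new label

s_before = 1 - cond_entropy(est, ref) / math.log(len(set(est)))
s_after = 1 - cond_entropy(est_p, ref_p) / math.log(len(set(est_p)))

print(s_before, s_after)  # score jumps, driven mostly by the denominator
```

Here a 1-frame change in a 100-frame estimate moves the score by more than ten percentage points, purely because the denominator grew from log 2 to log 3.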
A better way, I think, is to normalize by the marginal entropy H(E):

S_o = 1 - H(E|A) / H(E)

This would capture the actual decrease in uncertainty in predicting the estimated label when the reference label is provided. Since H(E|A) <= H(E) for any distribution A, it still provides a valid normalization.

If we use the marginal entropy instead of the uniform, then new labels due to tiny padding segments do not significantly change the normalization of the conditional entropy, so the results should be more stable.
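A toy check of this stability claim (stdlib only, frame-level labels, entropies in nats; not mir_eval's code): a one-frame pad carrying a new label moves the uniform-normalized score far more than the marginal-normalized one, because log(N_e) jumps from log 2 to log 3 while H(E) barely changes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (nats) of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def cond_entropy(est, ref):
    """H(est | ref) = H(est, ref) - H(ref)."""
    return entropy(list(zip(est, ref))) - entropy(ref)

def scores(est, ref):
    """(uniform-normalized, marginal-normalized) over-segmentation scores."""
    h = cond_entropy(est, ref)
    return 1 - h / math.log(len(set(est))), 1 - h / entropy(est)

ref = ["A"] * 50 + ["B"] * 50
est = ["x"] * 50 + ["y"] * 25 + ["x"] * 25
ref_p, est_p = ref + ["B"], est + ["pad"]    # one-frame pad, new label

(u0, m0), (u1, m1) = scores(est, ref), scores(est_p, ref_p)
print(abs(u1 - u0), abs(m1 - m0))
assert abs(m1 - m0) < abs(u1 - u0)  # marginal normalization is more stable
```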
Proposed modification
Add a flag to the NCE metrics, false by default, which changes the normalization from uniform entropy to the marginal entropy.
Incidentally, this modification would render NCE equivalent to the V-measure, which was originally included in the segment module but removed, since it isn't used by anyone in MIR or MIREX.