Skip to content

Non-deterministic test failure in cluster_algebra_quiver #22482

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
embray opened this issue Mar 1, 2017 · 30 comments
Open

Non-deterministic test failure in cluster_algebra_quiver #22482

embray opened this issue Mar 1, 2017 · 30 comments

Comments

@embray
Copy link
Contributor

embray commented Mar 1, 2017

When running sage -t -a -p 0 in the develop Docker container (off a current version of the develop branch) I'm getting:

sage -t /opt/sage/src/sage/combinat/cluster_algebra_quiver/mutation_class.py
    [54 tests, 0.34 s]
sage -t /opt/sage/src/sage/combinat/cluster_algebra_quiver/mutation_type.py
    [71 tests, 0.88 s]
sage -t /opt/sage/src/sage/combinat/cluster_algebra_quiver/quiver.py
**********************************************************************
File "/opt/sage/src/sage/combinat/cluster_algebra_quiver/quiver.py", line 880, in sage.combinat.cluster_algebra_quiver.quiver.ClusterQuiver.mutation_type
Failed example:
    Q.mutation_type()
Expected:
    ['G', 2]
Got:
    'undetermined finite mutation type'
**********************************************************************
1 item had failures:
   1 of  37 in sage.combinat.cluster_algebra_quiver.quiver.ClusterQuiver.mutation_type
    [272 tests, 1 failure, 5.57 s]

but only sometimes. Seems like maybe a test ordering issue.

CC: @tscrim @stumpc5 @Etn40ff @sagetrac-gmoose05

Component: combinatorics

Keywords: cluster

Author: Frédéric Chapoton

Issue created by migration from https://trac.sagemath.org/ticket/22482

@embray embray added this to the sage-7.6 milestone Mar 1, 2017
@embray

This comment has been minimized.

@embray
Copy link
Contributor Author

embray commented Mar 1, 2017

comment:2

Freudian slip?

@embray embray changed the title Non-deterministic test failure in cluster_algebray_quiver Non-deterministic test failure in cluster_algebra_quiver Mar 1, 2017
@embray
Copy link
Contributor Author

embray commented Mar 1, 2017

comment:3

I noticed that these mutation_class files get written to the user's actual DOT_SAGE directory while running the tests. Do the tests not use a temp dir for DOT_SAGE?

@embray
Copy link
Contributor Author

embray commented Mar 1, 2017

comment:4

It seems this probably happens if sage/combinat/cluster_algebra_quiver/quiver_mutation_type.py is not finished before sage/combinat/cluster_algebra_quiver/quiver.py starts. Having a hard time getting it to do that reliably but am working on a patch nonetheless.

@embray
Copy link
Contributor Author

embray commented Mar 1, 2017

comment:5

Replying to @embray:

It seems this probably happens if sage/combinat/cluster_algebra_quiver/quiver_mutation_type.py is not finished before sage/combinat/cluster_algebra_quiver/quiver.py starts. Having a hard time getting it to do that reliably but am working on a patch nonetheless.

No, that's not quite it either because testing justsage/combinat/cluster_algebra_quiver/quiver.py with a clean .sage would fail reliably otherwise.

@fchapoton
Copy link
Contributor

comment:6

For what it's worth, from knowledge of the mathematical context and code:

I would say that a priori this result cannot happen unless Q is something else that the value given in the previous line. How this can happen is unknown to me.

There is caching of the result in Q._mutation_type. So the problem may come from a cache confusion (objets sharing their hash ?) ?

@sagetrac-gmoose05
Copy link
Mannequin

sagetrac-gmoose05 mannequin commented Mar 1, 2017

comment:8

I think I know the issue. To avoid infinite loops when trying to figure out what the mutation type is, the first step in the command Q.mutation_type() for quivers is to run "Q.is_mutation_finite()" (unless the type of Q is already known from the start).

However, as is documented, this command "Uses a non-deterministic method by random mutations in various directions. Can result in a wrong answer."

I'm surprised that so many wrong answers are arising from this non-deterministic process, but it seems like the pseudo-random seeds are being consistently poorly chosen for this example.

Anyway, I agree that this is leading to potentially wrong answers but unless we change the "is_mutation_finite" command, this is not an entirely unexpected bug, but part of the way the code was written for computational efficiency sake.

@Etn40ff
Copy link
Contributor

Etn40ff commented Mar 1, 2017

comment:9

I agree with Gregg: this is, in general, an artifact that comes from the non
deterministic nature of mutation_type. Probably this method, and anything else calling is_mutation_finite, should mention
explicitly how it works in its documentation and have # random doctests.

On the other hand the specific error in the description of the ticket is yet
another instance of #22381. The quiver has only two non frozen vertices so it
should be recognized immediately. Coefficients mess things up.

@fchapoton
Copy link
Contributor

comment:10

I have made some micro-enhancements to the is_mutation_finite code.

This could help prevent the sporadic doctest failure, maybe.


New commits:

0f0b3c2trac 22482 small enhancements in is_mutation_finite

@fchapoton
Copy link
Contributor

Branch: u/chapoton/22482

@fchapoton
Copy link
Contributor

Commit: 0f0b3c2

@fchapoton
Copy link
Contributor

Author: Frédéric Chapoton

@embray
Copy link
Contributor Author

embray commented Mar 3, 2017

comment:11

Thanks for looking into this! For the tests, would it make sense to just run them with a hard-coded random seed? There should even be a context manager for that somewhere...

@fchapoton
Copy link
Contributor

comment:12

review, somebody ?

@embray
Copy link
Contributor Author

embray commented May 12, 2017

comment:14

For some reason the merge preview seems to have broken--trying to view it returns an internal server error.

Maybe try rebasing it, just for the heck of it?

@stumpc5
Copy link
Contributor

stumpc5 commented May 12, 2017

comment:15

same here, but you can by hand navigate to sagemath/sagetrac-mirror@0f0b3c2 .

@Frederic: I don't see how your new choice of the new direction is at all different from before...

Also, I did millions of automated tests when I wrote the code some years ago, and this non-deterministic way really always produced the correct answer. I cannot believe this can be the problem! The reason that it should work is very simple: a finite type is usually very finite (in the sense that one usually does not consider finite types with >> 10^6 many clusters), so by 1000 mutations in random directions, you either stay in a very finite set of clusters (than it's is finite), or you have hit a witness that this type is not finite by 100% - \epsilon.

@embray
Copy link
Contributor Author

embray commented May 12, 2017

comment:16

Always is a strong word though. To be clear, this was a very hard to reproduce issue, and when I run the tests for this code under normal circumstances it always passes. It just happens that if the tests are run under certain conditions there must be some state of the RNG under which it ends up failing.

I could try to add some further diagnostics and try to reproduce the failure again, if you point me in the right direction as to what to add.

@stumpc5
Copy link
Contributor

stumpc5 commented May 12, 2017

comment:17

Replying to @embray:

Okay, let's go through the problem step by step (and if below is correct, than this has nothing to do with this non-deterministic test, which I still ensure to be correctly working):

  1. you call Q.mutation_type().
  2. you call is_mutation_finite which correctly determines its finiteness, resulting in following the else clause in l912.
  3. your quiver is connected, so you call _mutation_type_from_data with optional argument compute_if_necessary=False on its unique component.
  4. (I expect that you have not built the data, so this does not result in any hit.) This results in mut_type_part == 'unknown'.
  5. you call _connected_mutation_type which can determine all finite mutation types from the infinite families (but not the exceptional types, including G2).
  6. --> I BET THIS CAUSES THE ISSUE <-- you again call _mutation_type_from_data with optional argument compute_if_necessary=True. This brute force computes all clusters of exceptional types (of correct rank) and writes the output into external files. These external files in relative_filename = 'cluster_algebra_quiver/mutation_classes_%s.dig6'%n are then finally checked.
  7. As the final result, we obtain the mutation type as in your above output mut_type_part = 'undetermined finite mutation type'.

I am not to all depth into the context of your issue. But if I must place a bet, your doctest environment has a problem with writing these external files in (6) correctly and this causes the buggy behaviour.

Cheers, Christian

@embray
Copy link
Contributor Author

embray commented May 12, 2017

comment:18

What would you mean by "has a problem writing these external files correctly" (or for that matter, by "external")? I can see that the files are being written. One thing I noticed is that they're written to my actual DOT_SAGE when it seems to me the test suite should not be writing there.

Other than that there's nothing "special" about the environment or with Docker with regard to writing files.

@stumpc5
Copy link
Contributor

stumpc5 commented May 12, 2017

comment:19

One possible problem could be that multiple tests do .mutation_type in parallel. Both find the file to not exist at the beginning, and start writing and reading from the same file which results in bad results.

Is it correct that you don't have a way to trigger the bug? How often did it happen? Once the file is there, the correct answer should come out of (3) above directly. Have you seen the bug after the file was correctly computed and placed?

@jdemeyer
Copy link
Contributor

comment:20

Replying to @stumpc5:

One possible problem could be that multiple tests do .mutation_type in parallel. Both find the file to not exist at the beginning, and start writing and reading from the same file which results in bad results.

If I'm reading the code correctly, the files are written using atomic_write which should be free of race conditions while writing the file (fixed in #15534).

To debug further, it would be very interesting to know if the problem occurs when testing just a single file or only when testing in parallel.

@jdemeyer
Copy link
Contributor

comment:21

Replying to @embray:

Other than that there's nothing "special" about the environment or with Docker with regard to writing files.

Assuming of course that there are no bugs in Docker or the OS.

@jdemeyer
Copy link
Contributor

comment:22

Note: there has been a bug in Sage involving the pexpect interface (don't ask me the number, I think it was reported by Erik) which was only reproducible on Docker. The underlying bug was a genuine race condition involving a fork(). However, for some reason, the timing outside of Docker was always such that the bug wouldn't be triggered. In Docker, things ran in a different order, causing trouble.

So: if anything in this code path involves multiple processes or threads, that's certainly a place to look for bugs.

@embray
Copy link
Contributor Author

embray commented May 12, 2017

comment:23

I don't think there's a bug here in Docker or the OS beyond that fact that it might cause the tests to run in a slightly different order than usual. As I stated in the description this happens when running the tests in parallel, and only sometimes. It can't be reproduced normally.

@fchapoton
Copy link
Contributor

comment:24

@Frederic: I don't see how your new choice of the new direction is at all different from before...

My choice of direction never makes a useless choice, instead of rejecting backward moves.

@fchapoton
Copy link
Contributor

comment:25

Note that my branch also changes the behaviour for rank n=2, where we know that everybody is of finite-mutation-type. So G2 should be handled more gracefully.

EDIT:
Maybe my branch should be used in another independant ticket, as it does not claim to solve the issue raised here completely.

@fchapoton
Copy link
Contributor

comment:26

I have moved my branch to ticket #23082, that does not pretend to solve fully the present issue of the random failing doctest. Please review #23082, this should be easy !

@fchapoton
Copy link
Contributor

Changed commit from 0f0b3c2 to none

@fchapoton
Copy link
Contributor

Changed branch from u/chapoton/22482 to none

@fchapoton
Copy link
Contributor

Changed keywords from none to cluster

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants