Extend MixtureFinder to codon, binary, multistate, and amino acid data #11
Thanks for contributing. @HuaiyanRen, can you check? |
Thanks for contributing. I will check it soon once I have time. |
Thank you for your responses. I have realized that I specified inappropriate frequency types for some data types. Sorry for the confusion I may have caused. I am unable to work on it now, but I will fix it as soon as possible, referring to the frequency types tested by default by ModelFinder. |
I have fixed it now. It should work properly, probably? I have updated the tests (HS6986/iqtree3ForkTests) accordingly. |
Hello, I read your repository. I think your extension is correct for amino acid, binary and multistate data. I also never work on codon, I saw your log only consider +F1X4 and +F3X4 but no +FQ or +FO, which I'm not sure this is proper. But thank you very much again for your contribution! |
Thank you for your response. I have never dealt with codon data either, so I was fumbling about this. ModelFinder seems to test the frequency parameters I would deeply appreciate it if you could verify this. Thank you for your time and support. |
F can be applied with a mixture model. MixtureFinder considers +FQ and +FO by default for DNA; if you want +F, you need to specify it with -mset. ModelFinder considers +FQ and +F by default, but since I joined the team, we have developed mixture models (for DNA) with +FO. I didn't ask about the exact reason, but I think one reason could be the following: {F81+FO,F81+FO} is a genuine two-class mixture model because, although the exchangeabilities are the same in both classes, the frequencies can differ. However, {F81+F,F81+F} is a meaningless mixture model because +F is counted from the alignment, so +F is identical for every class. |
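The point about +F can be illustrated with a minimal Python sketch. It is not IQ-TREE code: the function name `empirical_frequencies` and the toy alignment are assumptions made up for illustration. Because empirical frequencies are a deterministic function of the alignment, every class that uses them receives the same vector.

```python
from collections import Counter


def empirical_frequencies(alignment, alphabet="ACGT"):
    """Return state frequencies counted directly from the alignment.

    Hypothetical helper for illustration; characters outside the
    alphabet (gaps, ambiguity codes) are simply ignored here.
    """
    counts = Counter(
        ch for seq in alignment for ch in seq.upper() if ch in alphabet
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("alignment contains no characters from the alphabet")
    return {state: counts[state] / total for state in alphabet}


# Toy alignment (made up for this example).
alignment = ["ACGTAC", "ACGTTC", "ACGAAC"]

# Two classes that both use counted frequencies get identical vectors,
# so they cannot differ in their frequencies.
class1_freqs = empirical_frequencies(alignment)
class2_freqs = empirical_frequencies(alignment)
print(class1_freqs == class2_freqs)
```

With optimized frequencies, by contrast, each class has its own free parameters, so the two classes can end up with different frequency vectors.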
Thank you for your response.
Oh, thank you for pointing it out. Indeed, I would be happy to continue discussing the frequency types for codon data in MixtureFinder. Thank you for your support. |
When user input For codon models, I may ask around in the team to find someone who can answer your question. |
Sorry, I made a mistake when typing before. I want to ask: in your log file |
Thank you for your responses.
Oh, this is good to know, thank you (please ignore my deleted statement about linking or unlinking frequencies, my misunderstanding)!
Thank you!
Sorry, I misread “I also never work on codon, I saw your log only consider +F1X4 and +F3X4 but no +FQ or +FO, which I'm not sure this is proper.” and thought that you were then talking about the frequency types specified for codon data in MixtureFinder, not about my MixtureFinder test for codon data. My apologies. Thank you for your time and support. |
Dear IQ-TREE Developers, I have tested the codon MixtureFinder on data whose genetic code was standard (HS6986/iqtree3ForkTests), thus activating the tests of empirical codon models. However, the analysis stopped with an error message Could you review this issue and fix it when you are available? Thank you for your time and support. P.S. This error seems to be due to the IQ-TREE's inability to handle mixture models with classes whose frequencies are |
With a very simple change, I was able to fix the aforementioned issue, namely that IQ-TREE does not accept |
Hi all, I've also recently been exploring IQ-TREE's functionality for multistate data.
And the multistate alignment ( So @HS6986, the question is: Sorry if I'm misleading here, I haven't dug deep into this topic yet. Best, |
I've just downloaded your code and made a couple of runs with your 4-state test alignment. I added some log messages to the model class constructors, and here is what I got: If I use ModelFinder instead of MixtureFinder, without specifying the sequence type, with the following command:
However, if I use MixtureFinder instead with the following command:
So the same alignment under the same conditions (i.e. neither seq type, nor model specified) is treated differently by ModelFinder and MixtureFinder! Kinda weird behaviour to me. This situation, in fact, can lead to the following confusion:
It is the alignments like from my previous comment (with states from an arbitrarily large alphabet designated with ints) that can be assigned the existing in the original code, but unused Maybe the good old Best, |
The bug with We didn't deal with this issue because this is not a common case that users specify |
Thank you for your feedback. The reason I created the new model class When analyzing morphological data with probabilistic methods, most empiricists partition data by the number of states in each character (see here) and use Mk models (models with equal rates and frequencies) with ascertainment bias corrections (Lewis, 2001) to model data. Also in IQ-TREE, ModelFinder only considers MK+FQ(+ASC+(rate heterogeneity across characters (e.g., +G))) for morphological data. Although some software programs (MrBayes and RevBayes) implement methods that model heterogeneity of state frequencies in morphological data with mixture models (Wright et al., 2016; here), as morphological data should be partitioned by the number of states and currently ascertainment bias corrections ( On the contrary, multistate data other than morphology, such as recoded amino acid data, can and often should be analyzed in models with unequal rates and/or frequencies (e.g., MK+FO, GTRX+FQ, and GTRX+FO). If we implemented MixtureFinder so that multistate data other than morphology would be handled by However, thinking about it again, it seems that there will be no problem if we replace I will work on this as soon as possible. Thank you very much for your thorough feedback, Stefan! |
Hi @HS6986, Thank you for the extensive explanation and useful links! I think I got your idea. You state that using only the MK model and FQ frequencies by default is problematic:
But I think it is quite the opposite: Maybe the original default behaviour (MK+FQ, thus no MixtureFinder) is fine, and if a user still has a good reason to run MixtureFinder for the |
Thank you for your reply. That makes sense! It seems to me that the best choice is to specify the default models and frequency types for I am going to remove all the pieces of code that have been added in this pull request in relation to Thank you very much for your support and time. |
Yes, good idea! I think you could also suppress the following warnings from the
For example, if the alignment has 6 states, a GTRX matrix requires estimating only (6*6 - 6)/2 - 1 = 14 parameters. All 14 corresponding transition pairs are likely to appear in the alignment many times, so there should be no concern about overfitting. However, if someone decided to use the GTRX model for true morphological data, which, as you mentioned before, can be divided into short partitions, estimating even 14 parameters would be a problem. Hence the check for the partition length. |
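The parameter count above generalizes to any number of states. The following Python sketch reproduces the arithmetic; the function name `gtrx_free_rates` is hypothetical and is not part of IQ-TREE.

```python
def gtrx_free_rates(num_states: int) -> int:
    """Number of free exchangeability parameters in a general
    time-reversible rate matrix with `num_states` states.

    A symmetric exchangeability matrix has n*(n-1)/2 off-diagonal
    pairs; one rate is fixed to set the scale, hence the "- 1".
    """
    if num_states < 2:
        raise ValueError("a substitution model needs at least 2 states")
    return num_states * (num_states - 1) // 2 - 1


# Same arithmetic as (6*6 - 6)/2 - 1 in the comment above.
print(gtrx_free_rates(6))
```

For 4 states the same formula gives the familiar 5 free rates of GTR for DNA.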
Dear All,
This pull request extends MixtureFinder (Ren et al., 2024), which currently works only on DNA data, to codon, binary, multistate, and amino acid data.
For multistate and amino acid data, the frequency parameters FQ and FO are tested by default, and for codon data, the frequency parameters FQ, F, F1X4, and F3X4 are tested by default. As I have never done phylogenetic analyses of codon or amino acid data in my actual research, I apologize if I am doing something wrong.
I have tested the modified MixtureFinder on DNA, codon, binary, multistate and amino acid data to see if it works properly, and it seems to work fine. The test data can be found at HS6986/iqtree3ForkTests.
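The defaults listed in this description can be written down as a small lookup table. The Python sketch below is illustrative only: the names `DEFAULT_FREQ_TYPES` and `default_freq_types` are hypothetical and do not correspond to identifiers in the IQ-TREE source, and the table records only the data types and frequency options named in this description.

```python
# Hypothetical table of the default frequency types stated in this
# PR description; it is not taken from the IQ-TREE source code.
DEFAULT_FREQ_TYPES = {
    "multistate": ("FQ", "FO"),
    "amino_acid": ("FQ", "FO"),
    "codon": ("FQ", "F", "F1X4", "F3X4"),
}


def default_freq_types(data_type: str) -> tuple[str, ...]:
    """Return the default frequency types for a data type, or raise
    ValueError if this description does not state them."""
    try:
        return DEFAULT_FREQ_TYPES[data_type]
    except KeyError:
        raise ValueError(f"no default stated for data type {data_type!r}") from None


print(default_freq_types("codon"))
```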
Since I am completely unfamiliar with C++ and have little understanding of the IQ-TREE implementation, I believe that this PR might contain bugs and there are many improvements that could be made. Thus, it almost certainly needs extensive code review by the IQ-TREE developers. If you would like to get access to my repository for editing, please feel free to ask.
This is almost my first PR, so please feel free to let me know if I am doing something wrong.
Thank you very much for your time and support.