Skip to content

Extend MixtureFinder to codon, binary, multistate, and amino acid data #11

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

HS6986
Copy link

@HS6986 HS6986 commented Apr 6, 2025

Dear All,

This pull request extends MixtureFinder (Ren et al., 2024), which currently works only on DNA data, to codon, binary, multistate, and amino acid data.

For multistate and amino acid data, the frequency parameters FQ and FO are tested by default, and for codon data, the frequency parameters FQ, F, F1X4, and F3X4 are tested by default. As I have never done phylogenetic analyses of codon or amino acid data in my actual research, I apologize if I am doing something wrong.

I have tested the modified MixtureFinder on DNA, codon, binary, multistate and amino acid data to see if it works properly, and it seems to work fine. The test data can be found at HS6986/iqtree3ForkTests.

Since I am completely unfamiliar with C++ and have little understanding of the IQ-TREE implementation, I believe that this PR might contain bugs and there are many improvements that could be made. Thus, it almost certainly needs extensive code review by the IQ-TREE developers. If you would like to get access to my repository for editing, please feel free to ask.

This is almost my first PR, so please feel free to let me know if I am doing something wrong.

Thank you very much for your time and support.

@bqminh
Copy link
Member

bqminh commented Apr 6, 2025

thanks for contributing. @HuaiyanRen can you check?

@HuaiyanRen
Copy link
Collaborator

Thanks for contributing. I will check it soon once I have time.

@HS6986
Copy link
Author

HS6986 commented Apr 8, 2025

Thank you for your responses.

I have realized that I specified inappropriate frequency types for some data types. Sorry for the confusion I may have caused. I am unable to work on it now, but I will fix it as soon as possible, referring to the frequency types tested by default by ModelFinder.

@HS6986
Copy link
Author

HS6986 commented Apr 8, 2025

I have fixed it now. It should work properly, probably? I have updated the tests (HS6986/iqtree3ForkTests) accordingly.

@HuaiyanRen
Copy link
Collaborator

Hello, I read your repository. I think your extension is correct for amino acid, binary and multistate data. I also never work on codon, I saw your log only consider +F1X4 and +F3X4 but no +FQ or +FO, which I'm not sure this is proper.

But thank you very much again for your contribution!

@HS6986
Copy link
Author

HS6986 commented Apr 11, 2025

Thank you for your response.

I have never dealt with codon data either, so I was fumbling about this.

ModelFinder seems to test the frequency parameters F3X4, F1X4, F, and for codon data by default. With reference to this, I wanted to apply F3X4, F1X4, FO, and for codon data. However, according to the manual (Substitution models), FO cannot be applied to codon data, and it was confirmed in my test analysis. As such, I applied F3X4, F1X4, and for codon data. Am I correct in understanding that F cannot be applied to a class in a mixture model? I am not sure whether FQ should be applied.

I would deeply appreciate it if you could verify this.

Thank you for your time and support.

@HuaiyanRen
Copy link
Collaborator

HuaiyanRen commented Apr 11, 2025

F can be applied with a mixture model. MixtureFinder consider +FQ and +FO as default for DNA, if you want +F, you need to specify it by -mset.

ModelFinder consider +FQ and +F as default, but since I joined in our team, we developed mixture models (for DNA) with +FO. I didn't ask about the exact reason, by I think one reason could be, for example:

{F81+FO,F81+FO} are actually two-class mixture model, although the exchangeabilities are the same in different classes, by the frequencies could be different. However, {F81+F, F81+F} is a meaningless mixture model, because +F is counted based on the alignment, so for each class the +F is the same thing.

@HS6986
Copy link
Author

HS6986 commented Apr 11, 2025

Thank you for your response.

F can be applied with a mixture model.

Oh, thank you for pointing it out. Indeed, F can be applied in a mixture model, but that should only be the case if F is applied to all the classes. I think otherwise the mixture model should not work properly and probably causes errors. Is this correct? It seems to me that F can be specified in MixtureFinder only when only F is specified for frequency types and only models with unequal frequencies are specified. As such, F should not be tested in MixtureFinder by default.

I would be happy to continue discussing the frequency types for codon data in MixtureFinder.

Thank you for your support.

@HuaiyanRen
Copy link
Collaborator

When user input -mset JC,GTR -mfreq F, FO with MixtureFinder, the model below will be considered: JC+FQ, GTR+F, GTR+FO. No errors will be report. So you don't need to worry about the conflict of combination between the exchangeability matrix types and the F types.

For codon models, I may ask around in the team to find someone who can answer your question.

@HuaiyanRen
Copy link
Collaborator

Sorry, I made a mistake when typing before.

I want to ask: in your log file Chen2024.COI.Aligned.NT.Replaced.Subsampled.fas.log, it seems +F and are not estimated (ignore the +FO thing, my bad), but you mentioned that you appied , which I couldn't see in the log file.

@HS6986
Copy link
Author

HS6986 commented Apr 11, 2025

Thank you for your responses.

No errors will be report.

Oh, this is good to know, thank you (please ignore my deleted statement about linking or unlinking frequencies, my misunderstanding)!

For codon models, I may ask around in the team to find someone who can answer your question.

Thank you!

but you mentioned that you appied , which I couldn't see in the log file.

Sorry, I misread “I also never work on codon, I saw your log only consider +F1X4 and +F3X4 but no +FQ or +FO, which I'm not sure this is proper.” and thought that you were then talking about the frequency types specified for codon data in MixtureFinder, not about my MixtureFinder test for codon data. My apologies. was specified as a frequency type for codon data in MixtureFinder, but it was automatically not tested in my test analysis because empirical codon models were not tested as the genetic code of the data was not The Standard Code but The Invertebrate Mitochondrial Code (I just learned that for empirical codon models The Standard Code must be used). I will test the codon MixtureFinder again on data whose code is standard.

Thank you for your time and support.

@HS6986
Copy link
Author

HS6986 commented Apr 11, 2025

Dear IQ-TREE Developers,

I have tested the codon MixtureFinder on data whose genetic code was standard (HS6986/iqtree3ForkTests), thus activating the tests of empirical codon models. However, the analysis stopped with an error message ERROR: Frequency mixture name not found U. I suspect that this is due to the original IQ-TREE implementation rather than my implementation.

Could you review this issue and fix it when you are available?

Thank you for your time and support.

P.S. This error seems to be due to the IQ-TREE's inability to handle mixture models with classes whose frequencies are FU. This error also occurs when analyzing DNA data using MixtureFinder in IQ-TREE v3.0.0 with -mfreq FQ,F,FO. Here is the log file of the analysis.

@HS6986
Copy link
Author

HS6986 commented Apr 15, 2025

With a very simple change, I was able to fix the aforementioned issue, namely that IQ-TREE does not accept +FU in mixture models. I have updated the tests (HS6986/iqtree3ForkTests) accordingly.

@StefanFlaumberg
Copy link

Hi all,

I've also tried to explore the IQ-Tree functionality towards using multistate data recently.
As far as I could get, it seems that any kind of multistate data can readily be handled by the ModelMorphology class just after a few simple code modifications not altering the model itself (like turning off the restriction for 32 max states, etc. -- I can expand on the details in another reply if needed). For example, I could successfully run an analysis for the following test alignment (in .phy format) using states 0-399:

4       5
A       1 25 399 6 7
B       1 25 10 6 7
C       1 35 1 6 7
D       1 25 1 6 17

And the multistate alignment (Khalturin2022.WoGaps.recSR4.Subsampled.fasta) used for testing this pull request, which is a usual 4-state alignment, can be handled by the ModelMorphology class without any modification of the program at all. Thus, maybe not a MixtureFinder analysis, but a plain tree reconstruction under a specified model can be run fine for this alignment using just the original IQ-Tree code.

So @HS6986, the question is:
Do we really need to implement a whole new model class (ModelMultistate), when we already have the functionality of the ModelMorphology class?
To me, any multistate data can always be treated as morphology data, and we already have a model class for it.

Sorry if I'm misleading here, I haven't dug deep into this topic yet.

Best,
Stefan

@StefanFlaumberg
Copy link

StefanFlaumberg commented Apr 16, 2025

@HS6986,

I've just downloaded your code and made a couple of runs with your 4-state test alignment. I added some log messages to model class constructors, and here is what I've got:

If I don't use MixtureFinder, but use ModelFinder not specifying the sequence type as with the following command:
iqtree3 -s Khalturin2022.WoGaps.recSR4.Subsampled.fasta --tree-fix -pre newtest
the sequence type gets correctly recognized as SEQ_MORPH and the ModelMorphology class is initialized:

Creating fast initial parsimony tree by random order stepwise addition...
0.009 seconds, parsimony score: 170 (based on 40 sites)

!!! SeqType is MORPH !!!
!!! Initialize ModelMorphology class !!!

Perform fast likelihood tree search using MK+I+G model...

However, if I use MixtureFinder instead with the following command:
iqtree3 -s Khalturin2022.WoGaps.recSR4.Subsampled.fasta -m MIX+MF -pre newtest
the sequnce type ends up to be SEQ_MULTISTATE and the new ModelMultistate class is initialized:

MixtureFinder will also test models with unequal rates and/or frequencies

Creating fast initial parsimony tree by random order stepwise addition...
0.007 seconds, parsimony score: 171 (based on 40 sites)

!!! SeqType is MULTI !!!
!!! Initialize ModelMultistate class !!!

WARNING: GTRX multistate model will estimate 5 substitution rates that might be overfitting!
WARNING: Please only use GTRX with very large data and always test for model fit!
Perform fast likelihood tree search using GTRX+I+G model...

So the same alignment under the same conditions (i.e. neither seq type, nor model specified) is treated differently by ModelFinder and MixtureFinder! Kinda weird behaviour to me.

This situation, in fact, can lead to the following confusion:
The alignments of the following type (with states from the [0-9A-V] alphabet) should probably always be assigned the SEQ_MORPH SeqType, and this SeqType always leads to initializing the ModelMorphology class:

4 5
A 01123
B 01223
C 01123
D 11123

It is the alignments like from my previous comment (with states from an arbitrarily large alphabet designated with ints) that can be assigned the existing in the original code, but unused SEQ_MULTISTATE SeqType, and they are not actually supported in IQ-Tree currently.
Please see the Alignment::buildStateMap function to get a better idea of SEQ_MORPH vs SEQ_MULTISTATE difference.

Maybe the good old ModelMorphology, usually used by IQ-Tree for the data as in your example, could be used by MixtureFinder as well? :)

Best,
Stefan

@HuaiyanRen
Copy link
Collaborator

@HS6986

The bug with +FU has been existing in MixtureFinder for long time. Thank you for point it out and fix it!

We didn't deal with this issue because this is not a common case that users specify +FQ,FO... then GTR+FQ, which is equivalent to SYM, is selected.

@HS6986
Copy link
Author

HS6986 commented Apr 17, 2025

@StefanFlaumberg

Thank you for your feedback. The reason I created the new model class ModelMultistate was that models that can be (or are usually) applied to morphological data and multistate data (this may have been a bad expression, I have used “multistate data” here as a term for data represented by [0-9,A-Z] other than morphology), respectively, are different.

When analyzing morphological data with probabilistic methods, most empiricists partition data by the number of states in each character (see here) and use Mk models (models with equal rates and frequencies) with ascertainment bias corrections (Lewis, 2001) to model data. Also in IQ-TREE, ModelFinder only considers MK+FQ(+ASC+(rate heterogeneity across characters (e.g., +G))) for morphological data. Although some software programs (MrBayes and RevBayes) implement methods that model heterogeneity of state frequencies in morphological data with mixture models (Wright et al., 2016; here), as morphological data should be partitioned by the number of states and currently ascertainment bias corrections (+ASC) cannot be applied in mixture models in IQ-TREE, morphological data cannot be analyzed in MixtureFinder, at least currently.

On the contrary, multistate data other than morphology, such as recoded amino acid data, can and often should be analyzed in models with unequal rates and/or frequencies (e.g., MK+FO, GTRX+FQ, and GTRX+FO).

If we implemented MixtureFinder so that multistate data other than morphology would be handled by ModelMorphology, only MK models would be called by getModelSubst() in phylotesting.cpp by default, which would be highly problematic.

However, thinking about it again, it seems that there will be no problem if we replace getModelSubst() for multistate data passed to MixtureFinder with code that calls MK and GTRX as the models to test. By doing this, we can avoid the complicated and frankly messy way of creating the new class ModelMultistate.

I will work on this as soon as possible. Thank you very much for your thorough feedback, Stefan!

@StefanFlaumberg
Copy link

Hi @HS6986,

Thank you for the extensive explanation and useful links! I think I got your idea.
Here is what I think now:

You state that using only the MK model and FQ frequencies by default is problematic:

If we implemented MixtureFinder so that multistate data other than morphology would be handled by ModelMorphology, only MK models would be called by getModelSubst() by default, which would be highly problematic.

But I think it is quite the opposite:
Assume we are analysing some multistate data with 32 possible states (the current maximum allowed). If the default behaviour is to consider both MK and GTRX models and all three FQ, F and FO frequencies, than ModelFinder or MixtureFinder will have to estimate a 32*32 GTRX matrix from a single alignment -- a behaviour just as wrong as unnecessary! This will take much time and lead only to overfitting the model parameters.

Maybe the original default behaviour (MK+FQ, thus no MixtureFinder) is fine, and if a user still has a good reason to run MixtureFinder for the SEQ_MORPH seqtype data (as in the case with recoded AA data), they could just add something like -mset "+GTRX" -mfreq "FQ,F,FO" to the command?
This should spare us from having to estimate the too many parameters of the GTRX model when unnecessary, yet allow using this model if we are sure we want it. One can implement this based on the ModelMorphology class alone, and the code allowing user to provide additional models and freqs to test (with the -mset and -mfreq options) is already there.

@HS6986
Copy link
Author

HS6986 commented Apr 18, 2025

@StefanFlaumberg

Thank you for your reply.

That makes sense!

It seems to me that the best choice is to specify the default models and frequency types for SEQ_MORPH data in MixtureFinder as MK and FQ, respectively, and to make IQ-TREE generate an error (e.g., Running MixtureFinder only with the MK model and the FQ frequency is completely meaningless. Please provide additional models and/or frequencies, such as GTRX, +F, and +FO, using -mset and/or -mfreq, if you really want to use MixtureFinder for your data.) when users call MixtureFinder for multistate data without supplying either other models or frequencies. What do you think?

I am going to remove all the pieces of code that have been added in this pull request in relation to SeqMultistate.

Thank you very much for your support and time.

@StefanFlaumberg
Copy link

@HS6986,

Yes, good idea!

I think you could also suppress the following warnings from the ModelMorphology class if the number of states is <=6 and the alignment/partition length is >=100:

WARNING: GTRX multistate model will estimate 5 substitution rates that might be overfitting!
WARNING: Please only use GTRX with very large data and always test for model fit!

For example, if the alignment have 6 states, for a GTRX matrix we have to estimate only (6*6 - 6)/2 - 1 = 14 parameters. All the corresponding 14 transition pairs are likely to appear in the alignment numerous times, so there should be no concern of overfitting. However, if someone decided to use the GTRX model for true morphological data, which, just as you mentioned before, can be divided into short partitions, estimating even 14 parameters would be a problem. Hence the check for the partition length.
The best way to check partition length is probably through the number of distinct patterns:
phylo_tree->aln->getNPattern() >= 100

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants