-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathSimulating_Anc-Desc_Temporal_Discrepancy_07-31-19.Rmd
1384 lines (986 loc) · 73.6 KB
/
Simulating_Anc-Desc_Temporal_Discrepancy_07-31-19.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: Using Simulations To Examine Temporal Discrepancy Between the First Appearance
Times of Ancestors and Descendants
author: "David Bapst"
date: "July 16, 2019"
output: pdf_document
---
```{r include = FALSE, echo = FALSE}
library(paleotree)
setwd("~/AD_discrepancy")
```
# Du and Alemseged 2019: Temporal Evidence and Ancestor-Descendant Relationships
In an incomplete fossil record, the order of appearance in the fossil record may be a poor reflection of phylogeny. Additional complications arise when morphologically-delimited taxa are persistently found over geological durations, including when they 'persist through' a speciation event that produces a differentiated descendant morphotaxon. In such cases, the 'ancestral' morphotaxa will be extant at the same point in time (but perhaps not the same location) as their 'descendant' morphotaxa, and thus the incomplete record may preserve traces of the 'descendant' before its own 'ancestor'. But how often should we expect to see this in the real fossil record? How unlikely is a an age 'discrepancy' (as I will term it) of a given size?
Du and Alemseged (2019, hereafter D&A) attempt to answer this question, with specific reference to whether the *A. sediba* morphotaxon could possibly be the ancestor to *Homo*, whose first supposed fossil occurrences appear to post-date its own occurrence by a considerable duration (for the hominin fossil record). Their finding is that they overall find the observed ancestor-descendant discrepancy to be of an unlikely duration. This document is an investigation of that claim. I'm not an anthropologist, and I hold no opinion whatsoever to the actual data, so I am going to largely accept their statements about dates and taxon identity, and mainly interrogate their methods.
D&A present two lines of arguments to evaluate the supposed ancestor-descendant hypothesis, given the temporal discrepancy:
a) A model describing the probability of a discrepancy of a given magnitude, under a certain set of assumptions. (Described more below.) This model finds a very low probability of a discrepancy as large as the observed discrepancy, and consistently finds such under a range of conditions.
b) An empirical dataset of the geologic age difference between the 'first found' fossils for 28 previously-hypothesized ancestor-descendant pairs from the paleoanthropology literature. In other words, if each species was only ever a single find - the first reported collection of that find with no subsequent collections- what would the age difference be between those two species? Only one A-D pair has a discrepancy (with the descendant's first-find being geologically dated older than the first-find of the descendant), and that discrepancy is very small relative to the gap between *Homo* and *A. sediba*.
I will critique these arguments in reverse order.
# The Metadata of Published Reports of Ancestor-Descendant Pairs
My initial concern upon reading the paper was that ancestor-descendant pairs as reported from the literature are often a limited subset of what might be true. As D&A describe in their discussion, this might particularly impact their conclusion because temporal information is (nearly always) taken into account when someone decides to suggest something is ancestral to something else. Thus, they use the first-finds of each taxon instead, based on the idea that the hypothesized ancestor-descendant relationships might be based on later, roughly independent temporal information.
I rather like the attempt using first-finds to avoid the issue of a taxon's whole temporal range being non-independent of it being placed as an ancestor or descendant, but I'm not sure its enough.
First, does first-finds remove the bias? I am not entirely convinced, because when people make arguments for ancestor-descendant pairs, I think that's more likely to take into account temporal information from their earliest collections, not their latest collections. I mean, in the invert paleo world, newly published finds often change a taxon's temporal range - but that information might take decades for someone else to notice and take that into account in their interpretations. Thus, the first find and the original geochronological interpretation of that find is most influential. I would probably like to know (a) what's the geologic age difference if we restricted each pair to looking at the most recent collections from each species, and/or (b) what's the distribution of age difference look like if we took all the collections from each taxon in these ancestor-descendant pairs, and used a Monte Carlo approach to sample a single collection at random from each taxon, and look at the distribution in difference in age for all these artificial 'one-collection' taxa. (This is similar to Alroy's 1996 analysis of Cope's Rule, where he sampled subsampled species pairs from genera to examine ancestor-descendant trends in body size change.) A full examination would compare these distributions of randomized age differences, to the age differences in the actual, currently-accepted first appearance times for those taxa.
Second, while I absolutely believe that ancestor-descendant relationships is something we should talk about and consider, but I'm not convinced of the utility of previously published pairs as a baseline for what we should expect of ancestor-descendant pairs at large. For example, in my 2017 paper with Melanie Hopkins on the finely resolved record of late Cambrian pterocephaliid trilobites, we found support for up-to 16 ancestor-descendant pairs, when really only 9 pairs had been proposed previously in the literature (including several pairs where which taxon was the ancestor, and which was the descendant, was uncertain, due to uncertainty in the first appearance dates). Now, no single phylogenetic hypothesis (the method used produces samples containing hundreds of possible trees) contained all 16 pairs, but a single tree often had higher numbers of pairs than had been previously proposed. So, my expectation is that for most groups, the inclination to label specific ancestor-descendant pairs has probably been overly conservative (one exception may be ammonites or planktonic foraminifera, where systematics is still largely very traditional rather than based on quantitative analysis of character matrices, and large numbers of taxa are placed as stages along long ancestor-descendant sequences of anagenesis). Maybe hominids are like those groups, and workers have been less conservative than they should have been, but maybe not, and my expectation is that thus the ancestor-descendant pairs reported for any given group is an extremely conservative subset of those, probably over-representing those that are the most consistent with being ancestor-descendant pairs in terms of their morphological and stratigraphic relationships. Overall, this adds up to an overwhelming bias against proposing ancestors that appear later than their descendants, a bias that probably leads to us underestimating the true occurrence of ancestors occurring latter than their descendants.
Third, I would still have problems with this comparison because this is so unlike the ancestor-descendant pair we are most interested in - *A. sediba* and *Homo*. The first collection of *Homo* would be, **presumably**, us recognizing that we are extant human beings (first fossil collection would be - well, I really have no idea, but its probably pretty geologically recent). Thus, that pair has no discrepancy as evaluated by the first-find approach. One taxon is also a species known from a single very time-constrained set of collections, and the other is a genus, that becomes relatively diverse (encompassing multiple species or sub-species or proto-species or whatever you want to call them...). Under a standard birth-death-sampling model, we would expect that such increased lineage diversity makes sampling *Homo* later much more likely later than earlier, even if there is a long 'fuse' interval with a very incomplete record. (Note though that this dynamic, admittedly, strengthens their case for their probabilistic model.) I think really we would need to identify a specific *Homo* species, and talk about that species-pair specifically rather than the genus itself.
# Du & Alemseged's Model of Range Overlap and Ancestor-Descendant Temporal Discrepancy
In my data analysis courses, I tell students that models are just sets of assumptions about the world. We may describe those assumptions as mathematical relationships, but really any model is just a bundle of assumptions, and models derive all their explanatory power from the strength and specificity of those assumptions. So what are D&A assumptions? Three of these are explicitly laid out:
1) The two taxa in question are assumed to have true ranges (the time between the true time of origination and extinction) of equal duration. In this case, the two taxa are both assumed to have 0.97 Ma (mega-annum) ranges. That's an average taken from Robinson et al - its obviously not true for *Homo*, which obviously has a longer duration than 1 Ma. It isn't clear from D&A what the choice of 0.97 Ma or inferring equal length ranges has on their findings.
2) The fossil record is assumed to have uniform sampling, across space and time, and across both the ancestor and descendant. As I already indicated above, the fact *Homo* encompasses multiple species would effectively violate this assumption, but the if that effect was taken into account, it would make it even more unlikely for the record to preserve *Homo* earlier in geologic time, and thus the bias works in agreement with D&A's findings. This is a common assumption of sampling models for the fossil record, and while it is often critized (see Steven Holland's work in particular), D&A far and away exceed the typical defense against this criticism by presenting evidence that hominin.
3) Sampling events (collections, occurrences, fossil horizons - whatever you want to call them!) are independent in time and across lineages. This is a common assumption of sampling models for the fossil record, and there aren't many cases where I would argue for the opposite.
Others are not so explicit:
4) This is a direct ancestor-descendant relationship. There are no additional, 'unknown unknown' taxa with similar sampling rates. Often, when we find ancestor-descendant pairs in the fossil record, we are probably not finding direct ancestor-descendant pairs (this morphotaxon is *directly* descended from this other morphotaxon) but rather we might be missing a few completely un-sampled, un-observed morphotaxa in between. As Foote (1996) recognized, ancestor-descendant relationships are probably relatively common in the fossil record, especially indirect ancestor-descendant relationships. Now, recognizing the possibility for 'unseen' lineages would decrease the probability of finding larger A-D discrepancies, so this bias works against D&A's test (and thus strengthens their findings).
From this, they derive the probability that any one occurrence of a descendant would occur before any one occurrence of an ancestor. Interestingly, and surprisingly to me, they find that sampling rate doesn't actually impact this probability - a well-sampled fossil record and poorly-sampled fossil record are expected to have very similar probabilities for a given ancestor-descendant discrepancy, for two taxa that have identical durations, and share the same uniform sampling rate (whatever that rate happens to be).
Some properties of this model are unclear in the draft, such as 'what if taxon range is varied?'. So let's consider scenarios under this model ourselves. In doing this, I will avoid relying on or refering to the supplemental R script provided by D&A, so that I am sure I understand all the relevant details.
This model mainly has three major variables related to intervals of various length: the duration of the range for each taxon (`T_R`, which D&A set to 0.97 Ma), the observed minimum stratigraphic overlap between the two taxa, which is a fixed variable of interest and is essentially the A-D discrepancy, or difference in age between the descendant and the ancestor (`T_d`, which D&A set to 0.8 Ma, the apparent discrepancy between *A. sediba* and *Homo*), and the true, total amount of overlap between the ancestor and the descendant (both of which, note well, have equally long ranges), and is treated as a random variable in D&A's model (`T_o`).
Several additional variables are worth noting. With respect to D&A's general notation and figures, the age of the ancestor's horizon is `H_A`, the age of the descendant's horizon is `H_D`, and the offset between an descendant's true time of origination and its time of sampling is `X_D`, while `X_A` is designated as the offset between the descendant's and ancestor's horizons, minus the minimum discrepancy (`T_d`). `X_A` and `X_D` are thus treated as random variables in this model, with minimums at 0. For the descendant's horizon to post-date the ancestor's horizon by a gap as large as `T_d`, and for all other assumptions to hold, `T_R > T_o > T_d`, and thus putting bounds on `T_R` and `T_d` defines the range of possible values for `T_o`. Additionally, `X_A + X_D = T_o - T_d`, `H_A - H_D > T_o` and `X_D <= X_A` (D&A state greater than, but I do not comprehend any reason for why they cannot be equal and for all other conditions to still hold).
```{r echo = FALSE, dev='png'}
# variables
# duration of the entire range
# D&A treat this as a fixed variable based on avg species duration
# for now let's say 1
T_R <- 1
# MINIMUM amount of time that an
# ancestor's FAD (H_A) post-dates a descendant's FAD (H_D)
# this is a fixed variable of interest in D&A
# a lower bound of observed discrepancy
# e.g. 0.1
T_d <- 0.1
```
```{r echo = FALSE, dev='png'}
# amount of range overlap between anc and desc
# this is a MINIMUM and represents range extention
# in either direction (ancestor sampled later or
# descendant originated earlier)
#
# the time between the true origin of the ancestor
# and the first observed find of the descendant (H_D)
#
# D&A treat as identical to the time between the
# true origin of the descendant and the
# first observed find of the ancestor (H_A)
# see (Fig 2)
#
# this is a random variable in D&A
# for now let's say 0.5
T_o <- 0.5
```
```{r echo = FALSE, dev='png'}
# for desc FAD (H_D) to post-date anc FAD (H_A)
# T_R > T_o > T_d
# Thus, T_R places a maximum limit on T_o values
# and, T_d places a minimum limit on T_o values
# The difference between T_d and T_o is X_A and X_D
```
```{r echo = FALSE, dev='png'}
# X_A and X_D
# for T_o to be greater than T_d, the ancestor
# must be sampled some time (H_A) *after*
# the first sampled occurrence of descendant (H_D) + T_d
# this is X_A
# OR
# for T_o to be greater than T_d, the descendant
# must be sampled some time (H_D) *before*
# the first sampled occurrence of ancestor (H_A) - T_d
# this is X_A
# X_A and X_D are thus random variables, with minimums at 0
# for overlap to occur at all,
# X_A > X_D
# And
# X_A + X_D = T_o - T_d
# (sort of - remember that T_o is a random variable with a minimum)
```
```{r echo = FALSE, dev='png'}
# equation 1
# probability that you'd sample the descendant before the ancestor
# given the total ranges and the amount of overlap
# P_endD <- (T_o - T_d) / T_R
# and because we assume that the
# ancestor and descendant have identical durations
# we assume that the potential amount of time between
# T_o and T_d for H_A and H_D is distributed similarly
# P_endA <- P_endD
```
```{r echo = FALSE, dev='png'}
# equ 2
# joint probabilities for P_endD and P_endA
# P_endA_endD <- ((T_o - T_d)^2) / (T_R^2)
```
```{r echo = FALSE, dev='png'}
# equation 3g
# prob (X_A > XD) = lambda / 2*lambda = 1/2
# equation 5
# prob (H_A - H_D > T_D, given P_endA_endD) is 0.5
# equation 5b
# prob (H_A - H_D > T_D) = prob (H_A - H_D > T_D, given P_endA_endD) * P_endA_endD
# prob (H_A - H_D > T_D) = 1/2 * ((T_o - T_d)^2) / (T_R^2)
```
Equation 1 describes the probability of one fossil being found in a horizon within the region of overlap defined by `T_o - T_d`, and equation 2 describes the joint probability for two fossils (one for the ancestor, one for the descendant) from this limited region (referred to as `Pr(endA_endD)`). This is used in equation 3 (which is derived through a number of steps, during which the sampling rate drops out) to find the probability of `X_A > X_D`. By alternatively setting `X_D` or `X_A` equal to a random variable `t` and iterating over that variable, D&A determne that the probability of `X_A > X_D` is 1/2, regardless of whether `X_A` or `X_D`is set equal to `t`. This leads them to state in equation 4 that 1/2 is also the probability of `H_A - H_D > T_d`, when conditioned on only sampling horizons from the region of 'excess' overlap (D&A continually refer to this 'excess overlap' interval, or pair of intervals, as 'the black region(s)', in reference to their figure 2). This conditioned probability as `Pr(H_A - H_D > T_d | endA_endD)`.
In equation 5, they use the joint probability of `endA_endD`, from equation 2, to remove the conditioning from `Pr(H_A - H_D > T_d | endA_endD)`, and thus obtain the probability of `H_A - H_D > T_d`. This probability is discontinuous, with the probability always being zero when `T_d > T_o`.
```{r echo = FALSE, dev='png'}
# let's use notation:
# P_H_diff_greater = prob (H_A - H_D > T_D)
# `Pr(H_A - H_D > T_D
P_H_diff_greater <- ((T_o - T_d)^2) / (2*(T_R^2)) # 5b
# prob (H_A - H_D > T_D) must be zero when T_d > T_o
# thus 5b only holds when T_d > T_o
# D&A then used
T_R <- 0.97
T_d <- 0.8
# and tried many values of T_o
# presumably from T_o -> T_R
# fig 3 actually shows lots of values with T_d > T_o, and Prob at 0
# why even show that
# obviously the presumed overlap must be greater than the
# age discrepancy between FADs of Anc & Desc
```
For example, when `{r} T_R = 0.97` and `{r} T_d = 0.8` (as in D&A's analyses), and (to take a specific example), `{r} T_o = 0.9`, then the probability of `H_A - H_D > T_d` is:
```{r}
((T_o - T_d)^2) / (2*(T_R^2))
```
Or, if `{r} T_R = 1`, `{r} T_d = 0.1`, and `{r} T_o = 0.5`, then the probability of `H_A - H_D > T_d` is:
```{r}
((T_o - T_d)^2) / (2*(T_R^2))
```
We can also use this equation to recreate their figure 3:
```{r echo = FALSE, dev='png'}
# let's try to recreate their figure 3
# note they put both an absolute and relative axes for range
# overlap on this plot
#Relative overlap = total overlap (T_o) / total taxon range (T_R)
# but since T_R is so close to 1, they aren't visually distinct
# equation 5b
# P_H_diff_greater <- ((T_o - T_d)^2) / (2*(T_R^2))
T_R <- 0.97
T_d <- 0.8
T_o <- seq(T_d, T_R,
length.out = 1000)
P_H_diff_greater <- sapply(T_o,
function(T_o_i) ((T_o_i - T_d)^2) / (2*(T_R^2))
)
# add in the 0 to T_d portion with 0 probability
T_o <- c(
seq(0,T_d, length.out=100),
T_o
)
P_H_diff_greater <- c(
rep(0,100),
P_H_diff_greater
)
plot(T_o, P_H_diff_greater,
xlab = "Range Overlap (Possible value of T_o)",
ylab = "Prob. of ((Desc's FAD - Anc's FAD) > T_d)",
main = ""
)
# okay, that was almost too easy to recreate...
```
That was almost too easy. But note that most of this graph (just like D&A's Figure 3 - look at their horizontal axes) is at 0 probability, as the values of `T_o` include the range of 0 Ma to 0.8 Ma, even though `T_d = 0.8` and thus its impossible for an ancestor to occur after its descendant. Its curious why this non-informative interval with *a priori* zero probability is even shown - perhaps because they want to show the interval of range overlap from 0 Ma to 0.97 Ma because they integrate over this entire range for `T_o` when they generate their 'treating all values of overlap as equally likely' *p*-value using equation 6, which they report as 0.0009.
Regardless, for this examination, let's look at just the informative region that encompasses overlap values with non-zero probabilities for a gap of 0.8 Ma.
```{r echo = FALSE, dev='png'}
# let's try to recreate their figure 3
# note they put both an absolute and relative axes for range
# overlap on this plot
#Relative overlap = total overlap (T_o) / total taxon range (T_R)
# but since T_R is so close to 1, they aren't visually distinct
# equation 5b
# P_H_diff_greater <- ((T_o - T_d)^2) / (2*(T_R^2))
T_R <- 0.97
T_d <- 0.8
T_o <- seq(T_d, T_R,
length.out = 1000)
P_H_diff_greater <- sapply(T_o,
function(T_o_i) ((T_o_i - T_d)^2) / (2*(T_R^2))
)
plot(T_o, P_H_diff_greater,
xlab = "Range Overlap (Possible value of T_o)",
ylab = "Prob. of ((Desc's FAD - Anc's FAD) > T_d)",
main = ""
)
# okay, that was almost too easy to recreate...
```
Now we can do things like allowing `T_R` to vary, instead of `T_o` - what if we assume the true overlap between *Homo* and *A. sediba* is just slightly more than the observed minimum overlap (`T_o = 1`, which would mean that *Homo* originated from *A. sediba* at most 20 Ka before *Homo* was first sampled in the fossil record), but we don't know how long the individual ranges of each taxon is, but otherwise put a hard cap on their maximum ranges at 3 million years?
```{r echo = FALSE, dev='png'}
# what if we constrain T_o to 1 (Homo originated at most 100kyr before sediba)
# and asked how Prob of Discrepancy varied with T_R
T_d <- 0.8
T_o <- 1
# let's say that Homo and Sediba might be at most 3 million years old
max_T_R <- 3
T_R <- seq(T_o, max_T_R,
length.out = 1000)
P_H_diff_greater <- ((T_o - T_d)^2) / (2*(T_R^2))
plot(T_R, P_H_diff_greater,
xlab = "Total Taxon Range (T_R)",
ylab = "Prob of ((Desc's FAD - Anc's FAD) > T_d)",
main = ""
)
# very low probabilities??
# worse if T_o is larger
```
We can see that giving larger ranges relative to the overlap doesn't at all help the case for descendants to occur before their ancestors, because it means more geologic time during which only the ancestor existed, but wasn't sampled.
So, in figure 3 (or its recreation above), where only `T_o` is varied, we can see that the probability never gets better than 0.015 (1.5%), which leads us to wonder - what set of parameters would produce a high probability for the descendant to predate the ancestor? Presumably the hard cap on this probability is 0.5; this is theoretically true (in what scenario of uniform sampling could an ancestor be less likely to be found first than its descendant?), but also mathematically true as `Pr(H_A - H_D > T_d | endA_endD)` is 1/2, and thus the actual determining part of the model is the probability of fossil finds being in the 'end-caps'excess overlap' of the taxon ranges, given by equation 2 (`Pr(endA_endD)`).
So when is `Pr(endA_endD)` closest to 1? Let's hold `T_d = 0.8` constant, but vary `T_o` and `T_R`, and look at the values of `Pr(endA_endD)` we obtain. Because `T_R > T_o > T_d`, we will vary `T_o` from the value of `T_d` to 3 Ma (extremely unrealistic for our case study, as it probably would require *A. sediba* to still be extant, but this is just an exploration - calm down!), and vary the amount of time in the fossil taxon ranges that is not part of their overlap (`T_R_no`, where `T_R - T_o = T_R_no`), with `T_R_no` ranging from 0 to (again) a maximum 3 Ma (which isn't just implausible, but patently impossible in our case as the fossil record in quest isn't from six million years ago...).
```{r echo = FALSE, dev='png'}
T_d <- 0.8
nsamp <- 1000
# set T_o
max_T_o <- 3
T_o <- seq(T_d, max_T_o, length.out = nsamp)
# set T_R_no
T_R_no <- seq(0, 3, length.out = nsamp)
P_endA_endD <- matrix(NA, nrow = nsamp, ncol = nsamp)
rownames(P_endA_endD) <- T_o
colnames(P_endA_endD) <- T_R_no
for(i in 1:nsamp){
T_o_i <- T_o[i]
for(j in 1:nsamp){
T_R_j <- T_o_i + T_R_no[j]
P_endA_endD[i,j] <- ((T_o_i - T_d)^2) / (T_R_j^2)
}
}
filled.contour(
z = P_endA_endD,
x = T_o, y = T_R_no,
zlim = c(0,1),
color.palette = terrain.colors,
xlab = "Duration of Overlap (T_o)",
ylab="Range Duration In Excess of Overlap (T_R - T_o)",
main="Probability Surface for A-D Pair \n With Finds in Excess Overlap"
)
mtext(side=4, line=-5.2, text="Probability")
```
And now we can see that the probability of sampling from those regions of excess overlap is only going to be relatively high when (a) the ranges of the ancestor and the descendant are nearly perfectly overlapping (`T_R_no ~ 0`, or `T_o ~ T_R`), and (b) when the ranges are long relative to the discrepancy between the descendant and the ancestor. And, let's expand those maxima to 80 million years just to illustrate the point (`T_d * 5oh0 = 40`) and look at a very similar plot but for the probability of observing a descendant predating its ancestor by a discrepancy as long as `T_d`, `Pr(H_A - H_D > T_d)`. We'll rescale the color scale (probability) to a range of 0 to 0.5, as we know that 0.5 is the maximum limit in this case.
```{r echo = FALSE, dev='png'}
T_d <- 0.8
nsamp <- 2000
max_T <- 40
# set T_o
T_o <- seq(T_d, max_T, length.out = nsamp)
# set T_R_no
T_R_no <- seq(0, max_T, length.out = nsamp)
P_H_diff_greater <- matrix(NA, nrow = nsamp, ncol = nsamp)
rownames(P_H_diff_greater) <- T_o
colnames(P_H_diff_greater) <- T_R_no
for(i in 1:nsamp){
T_o_i <- T_o[i]
for(j in 1:nsamp){
T_R_j <- T_o_i + T_R_no[j]
# `Pr(H_A - H_D > T_d)
P_H_diff_greater[i,j] <- ((T_o_i - T_d)^2) / (2 * (T_R_j^2))
}
}
filled.contour(
z = P_H_diff_greater,
x = T_o, y = T_R_no,
zlim = c(0,0.5),
color.palette = terrain.colors,
xlab = "Duration of Overlap (T_o)",
ylab="Range Duration In Excess of Overlap (T_R - T_o)",
main="Probability Surface for an Ancestor \nFound 0.8 Ma After its Descendant "
)
mtext(side=4, line=-5.2, text="Probability")
```
Thus, under the model presented by D&A and its set of assumptions, we can see that ancestors post-dating their descendants by even what might be small geological intervals (e.g. less than a million years) is only *likely* when geologic durations are long relative to the length of the discrepancy in the ages of appearance (at least 3-5 times T_d, the discrepancy), and the descendant's range overlaps for about a quarter of its range with its ancestor.
Now, there are hundreds, or possibly thousands, of putative ancestor-descendant relationships that exist in the paleontological literature. This agrees with our theoretical understanding of diversification and the incompleteness of the fossil record, which suggests that there should sampled ancestors, and that their number may not be great but should not be negligible (Foote, 1996). However, D&A's model implies we should rarely expect ancestors to occur after their descendants. But, are these reasonable assumptions? What does a realistic birth-death-sampling simulation tell us?
# Simulation Tests
## Simulation Structure, Parameters, Conditions
Here, I will attempt to compare D&A's model to a birth-death-sampling simulation, obtain 10,000 simulated fossil records that meet certain conditions (i.e. having more than one taxon), and parameterize these simulation to closely as possible match the parameters D&A worked with in their study. A birth-death-sampling model is simply a model of lineages, with events occurring as if a series of independent Poisson processes, with morphotaxa 'budding' off from each other at some rate (the rate of origination, or the *birth rate*), going extinct at some rate (the rate of extinction, or *death rate*), and being sampled at particular instances in time (the *sampling rate*). These processes are presumed to be constant with uniform rates, that do not vary over time. That's almost certainly false when humanity involved, but let's just ignore that for the sake of doing the simulations at all. A birth-death-sampling model is also known as a BDS model, a birth-death-serial-sampling model, or (most commonly) a fossilized-birth-death model (Heath et al., 2014), and is exactly the sort of model now becoming increasingly and widely applied to paleontological datasets using tip-dating approaches in Bayesian phylogenetics.
In these simulations, we'll focus on the differences in first appearance times, even though D&A's model is explicitly for taxa known from single horizons. While *A. sediba* might be known from a single horizon, that certainly isn't true for *Homo*. In my view, the question that D&A pose is about the discrepancy in first appearance times between an ancestor and its descendants. As touched briefly when discussing the single-horizon assumptions of D&A's model, though, comparing data with multiple occurrences to D&A's model will likely only favor their conclusions (because taxa with multiple sampling events are less likely to have mismatch in the order of appearances of ancestor-descendant pairs).
Handily, the extinction rate is widely known to be the inverse of the mean *true* species duration. Note the 'true' - the observed species durations will always be smaller than the true ranges, due to incomplete sampling at both ends (and thus in better sampled records, more sampling means ranges that look more like the 'true' ranges.) However, we'll ignore that effect of sampling and assume extinction rates is `1/0.97`, the inverse of the mean species duration used by D&A. (This literally means that if we had a hundred species at time A, we'd expect about half of them to be extinct 0.97 Ma later - so this is a **very** high extinction rate! The average for marine invertebrates is 0.1 per lineage, per Ma - implying an average extinction rate of 10 Ma.) Even more handily, we know that for most groups in the fossil record, the long-term origination and extinction rates are generally nearly equal (Stanley, 1976; Sepkoski, 1998; Marshall, 2016). We don't really understand why that is (its partly a mathematical artifact, but also partly *not*, and doesn't agree with estimates of speciation and extinction rates from molecular phylogenies), but we can probably safely assume that the origination rate for hominins has, in the long term, been pretty similar to the rate of extinction, given that only ones species has managed to survive to today. A per-lineage sampling rate isn't really touched upon by D&A, as the parameter does not matter to their calculations, and I would hesitate to try to estimate such myself, so we can set that also equal to the extinction rate for now.
Birth-death-sampling simulations are notorious for having wildly variable results (e.g. most that start with a single species go extinct immediately), and thus we need to set constraints on what simulation output we'll accept as a viable run or not (non-viable runs are thrown out, and the simulation is run again, until the required number of acceptable runs is produced). We will require at least 10 taxa sampled in each simulated fossil record (but no more than a 100), and at least 1 extant taxon in each output run. Most runs will likely have dozens of sampled species, even with these seemingly low limits.
We will also not treat simply surviving to time 0 as 'being sampled' - that way, when we look at differences in sampling times, its always fossil finds, modeled as under a uniform Poisson process - exactly as expected by D&A's own model.
```{r include = FALSE, echo = FALSE}
files <- list.files()
saveFileName <- "saved_AD_discrepancy_workspace.Rdata"
if(any(files == saveFileName)){
load(saveFileName)
}else{
# let's do a bunch and concatanate
nRun <- 10000
# invert avg species duration to get ext rate
extRate <- 1/0.97
# set ext rate equal to orig rate
origRate <- extRate
# set sampling rate maybe equal to ext rate (??)
sampRate <- extRate
# shouldn't matter as sampling rate falls out of D&A's equations...
# sample what from each run
sampleWhat <- "sampleOnce"
# set a seed
set.seed(444)
isExtant_all <- AD_overlap_all <- AD_diff_all <- list()
# iterate over nRun
for(run in 1:nRun){
# get a simulated fossil record
saveSeed <- .Random.seed
record <- simFossilRecord(
p = origRate,
q = extRate,
r = sampRate,
nruns = 1,
# do not consider extant taxa as sampled
modern.samp.prob = 0,
# controls
totalTime = c(0, 1000),
nTotalTaxa = c(1, 1000),
# basically only conditioned on at least 1 extant taxon
nExtant = c(1, 1000),
# and 10 to 100 sampled fossil taxa
nSamp = c(10,100),
plot = TRUE
)
# convert to taxa data
taxa <- fossilRecord2fossilTaxa(record)
# get ranges
ranges <- fossilRecord2fossilRanges(record)
# drop NAs
ranges <- ranges[!is.na(ranges[,1]),]
# get sampled taxa
sampledTaxaNames <- rownames(ranges)
# get last sampled ancestor for each taxon
sampAnc <- character()
for(i in 1:length(sampledTaxaNames)){
taxon <- sampledTaxaNames[i]
isSampled <- FALSE
while(!isSampled){
ancID <- record[[taxon]]$taxa.data["ancestor.id"]
anc <- names(record)[ancID]
if(is.na(anc)){
isSampled <- TRUE
}else{
isSampled <- any(anc == sampledTaxaNames)
taxon <- anc
}
}
sampAnc[i] <- anc
}
# filter to those with sampled ancestors
names(sampAnc) <- sampledTaxaNames
sampAnc <- sampAnc[!is.na(sampAnc)]
# are there any with sampled ancestors
if(length(sampAnc)<1){
isExtant <- AD_overlap <- AD_diff <- NA
}else{
#
if(sampleWhat == "sampleOnce"){
# need to filter so only a single ancestor-descendant pair
# picked from each simulation
singleSamp <- sample(x = length(sampAnc), size = 1)
sampAnc1 <- sampAnc[singleSamp]
names(sampAnc1) <- names(sampAnc)[singleSamp]
sampAnc <- sampAnc1
}
#
if(sampleWhat == "filterPairs"){
# need to filter so each sampled ancestor
# is only used once for independence concerns
resample <- function(x, ...) x[sample.int(length(x), ...)]
uniqDesc <- sapply(unique(sampAnc),function(x)
names(sampAnc)[
resample(which(sampAnc == x),1)
]
)
sampAnc <- unique(sampAnc)
names(sampAnc) <- uniqDesc
}
#
#
# whats the anc-desc FAD difference
ancFADs <- ranges[sampAnc,1]
descFADs <- ranges[names(sampAnc),1]
AD_diff <- ancFADs - descFADs
#
# get the true overlap between taxon ranges
# start - desc orig time
# end - and or desc ext, whichever comes first
desc_orig <- sapply(names(sampAnc),function(taxon)
record[[taxon]]$taxa.data["orig.time"]
)
desc_ext <- sapply(names(sampAnc),function(taxon)
record[[taxon]]$taxa.data["ext.time"]
)
anc_ext <- sapply(sampAnc,function(taxon)
record[[taxon]]$taxa.data["ext.time"]
)
end_date <- max(anc_ext,desc_ext)
AD_overlap <- desc_orig - end_date
#
# are either the anc or desc extant
desc_extant <- sapply(names(sampAnc),function(taxon)
record[[taxon]]$taxa.data["still.alive"] == 1
)
anc_extant <- sapply(sampAnc,function(taxon)
record[[taxon]]$taxa.data["still.alive"] == 1
)
isExtant <- desc_extant | anc_extant
}
AD_diff_all[[run]] <- AD_diff
AD_overlap_all[[run]] <- AD_overlap
isExtant_all[[run]] <- isExtant
}
save.image("saved_AD_discrepancy_workspace.Rdata")
}
## if simFossilRecord fails due to an error, can print random seed with dput
# dput(saveSeed)
## simply replace .Random.seed with saveSeed if you want to rerun exactly as before
```
## Variables of Interest from the Simulations
Once we run the simulations as described above, we can explore our 10,000 simulated fossil records. In our case, there are two central variables of interest that we can look at as summary statistics, and examine the relationships between them, to evaluate how well D&A's model holds up in these more realistic (yet still simulated) scenarios.
The first of these variables of interest is The difference in age between the first sampled appearance of the ancestor and the first sampled appearance of the descendant, for every sampled pair of ancestor-descendant taxa. For all those pairs in which descendants predate their ancestor, there will be a negative temporal offset, which is the temporal difference between descendant and ancestor labeled `T_d` by D&A. Note that this is not limited to only direct ancestor-descendants, but includes ancestor-descendants where there are unsampled intermediates as well. In order to properly avoid statistical artifacts due to accidentally including data from two seperate pairs involving the same ancestor, or artifacts brought on by ignoring taxa with more than one descendant, only a single such ancestor-descendant pair was sampled from each tree. This sampling was doing at random from all possible ancestor-descendant trees in a given simulated fossil record.
```{r echo = FALSE, dev='png'}
# unlist, remove all NAs in AD_diff_all
AD_diff_all <- unlist(AD_diff_all)
AD_overlap_all <- unlist(AD_overlap_all)
isExtant_all <- unlist(isExtant_all)
AD_overlap_all <- AD_overlap_all[!is.na(AD_diff_all)]
isExtant_all <- isExtant_all[!is.na(AD_diff_all)]
# and finally filter AD_diff_all
AD_diff_all <- AD_diff_all[!is.na(AD_diff_all)]
# First, note that not every simulation produced an ancestor-descendant relationship we could look at:
#length(AD_diff_all)
# 10,000
```
The distribution of age offsets looks, in same ways, just aswhat we would expect: dominated by cases of the ancestor preceding the descendant (positive age offset values):
```{r echo = FALSE, dev='png'}
# discrepancy
hist(AD_diff_all,
main = "Difference in Age Between Ancestor and Descendant",
xlab = "Ancestor FAD - Descendant FAD (Ma)")
#summary(AD_diff_all)
# isExtant
#sum(isExtant_all)
```
Note however that a significant portion are less than zero, and thus represent ancestor-descandant discrepancy - cases where the ancestor's first appearance is younger than the descendant. How many, in fact:
```{r}
sum(AD_diff_all>0)/length(AD_diff_all)
```
One thought that might occur is that the there might be different patterns for ancestor-descendant relationships where the ancestor or descendant are still extant. (The mathematical rules of entirely extinct taxa in birth-death-sampling models are actually usually a lot more tractable, as extant taxa forces us to consider the wonderfully complicated world of censored and truncated exponential distributions). We can record that information and look at the proportion of our returned offsets that came from those that either had extant members, as opposed to being entirely extinct:
```{r}
sum(isExtant_all)/length(AD_diff_all)
```
As we go forward, we will make sure to compare the entire simulation output with what we obtain from the all-extinct and still-extant sets, to make sure there is no divergence from the whole dataset.
The second variable of interest is the amount of true overlap between the complete durations of the ancestor and the descendant (`T_o` in D&A's model), for each of the randomly-selected observed ancestor-descendant pairs we obtained temporal offsets from above. This is quite easy to calculate - note that in many cases the amount of overlap is small (and in the case of indirect ancestor-descendant relationships, may be non-existant, leading to the seemingly negative overlap values, as the two sampled taxa never were co-extant).
```{r echo = FALSE, dev='png'}
# overlap
# Extinction Time for Whichever Goes Extinct First (Ancestor or Descendant)
hist(AD_overlap_all,
main = "True Duration of Range Overlap\n Between Ancestor and Descendant",
xlab = "Desc. Origination Time - Max(Extinction Times for Anc., Desc.)")
#summary(AD_overlap_all)
```
As with the temporal offset values, we may find that pairs with extant members or not might differ.
## The Relationship between Anc-Desc Overlap and Temporal Offset in Birth-Death-Fossil Simulations
As discussed above, one of the central expectations of D&A's model is that the magnitude of ancestor-descendant age discrepancy will depend greatly on the amount of true overlap between ancestors and descendants. In either case, we will need to filter the data. For example, we could filter on those pairs that have discrepancies (the pairs where the descendant is sampled first):
```{r echo = FALSE, dev='png'}
# but how to compare to D&A results?
# These simulations have variable T_R, T_o
# and T_R and T_o differ between anc and desc
# basically noncomparable
# relationship of true overlap to observed discrepancy
#plot(AD_diff_all, AD_overlap_all)
# negative values are due to taxa having no observed overlap
# discrepancy only
plot(
y = -AD_diff_all[AD_diff_all < 0],
x = AD_overlap_all[AD_diff_all < 0],
ylab = "Desc. FAD - Anc. FAD (Ma)",
xlab = "True Range Overlap Between Anc. & Desc. (Ma)",
main = "Temporal Offset For Pairs Where Desc. Appears First",
pch=16, col=rgb(red=0, green=0.5, blue=0.7, alpha=0.1)
)
abline(lm(
-AD_diff_all[AD_diff_all < 0] ~
AD_overlap_all[AD_diff_all < 0],
lty = 4
))
# OLD
# very small negative overlaps are probably due to having extant taxa
# actually no, it was a hard to catch bug in SFR...
```
The dashed line here represents the trend line of a linear regression (not that the relationship between these two variables is very linear). Note that for offset, we are here looking at the age difference between the descendant FAD and the ancestor FAD, so that the ancestor
Or, we could examine the temporal offset for the pairs (again, negated) where there is actual overlap between the ancestor and the descendant:
```{r echo = FALSE, dev='png'}
# overlap only
plot(
y = -AD_diff_all[AD_overlap_all > 0],
x = AD_overlap_all[AD_overlap_all > 0],
ylab = "Desc. FAD - Anc. FAD (Ma)",
xlab = "True Range Overlap Between Anc. & Desc. (Ma)",
main = "Temporal Offset For Pairs with Range Overlap",
pch=16, col=rgb(red=0, green=0.5, blue=0.7, alpha=0.1)
)
abline(lm(
-AD_diff_all[AD_overlap_all > 0] ~
AD_overlap_all[AD_overlap_all > 0]
))
```
In both cases, although the relationships are very noisy, we can see that increased overlap leads to larger discrepancies between the ancestor and the descendant. But we'd loosely always expect that relationship from first principles.
## Comparing Simulation Results to Du and Alemseged's Expected Probabilities
So, how much does this match what D&A's model tells us to expect? For example, remember D&A's figure 3 that we recreated, using their equation 5b:
```{r echo = FALSE, dev='png'}
# equation 5b
# P_H_diff_greater <- ((T_o - T_d)^2) / (2*(T_R^2))
T_R <- 0.97
T_d <- 0.8
T_o <- seq(T_d, T_R,
length.out = 1000)
P_H_diff_greater <- sapply(T_o,
function(T_o_i) ((T_o_i - T_d)^2) / (2*(T_R^2))
)
plot(T_o, P_H_diff_greater,
xlab = "True Range Overlap (Possible value of T_o)",
ylab = "Prob. of ((Desc's FAD - Anc's FAD) > T_d)",
main = ""
)
```
If we simply take the value of `T_d = 0.8 Ma` used to generate this figure, we can look at our simulated datasets, and for different values of ancestor-descendant range overlap (`T_o`), we can ask how many simulations have a discrepancy of 0.8 Ma or more.
(NOTE: The following probability is thus cumulative across range overlap, representing the joint probability that (a) a pair has an overlap this large or larger *and* (b) that the pair has a discrepancy greater than 0.8 Ma. That is my interpretation of what equation 5 in D&A represents. It occurs to me that it might instead by a density, given how D&A integrate over overlap in equation 6 - but then other aspects of their derivation don't make sense to me. Regardless, if I'm wrong about my interpretation of equation 5, just skip down to the section **So just how likely is a discrepancy of 0.8 Ma?** and ignore everything in between. None of this will impact my final assessment anyway!)
```{r echo = FALSE, dev='png'}
# cumulative distribution of offset more than T_d
# for different values of overlap
T_d <- 0.8
AD_overlap_uniq <- sort(unique(AD_overlap_all))
cumulAbove <- sapply(AD_overlap_uniq, function(x)
sum(-AD_diff_all[AD_overlap_all <= x] > T_d) / length(AD_diff_all)
)
plot(
AD_overlap_uniq,
cumulAbove,
xlab = "True Range Overlap Between Anc. & Desc. (Ma)",
ylab = "Cumulative Prop. of Runs with T_d >= 0.8 Ma",
main = "Empirical Cumul. Distribution for Discrepancy at Overlap",
xlim = c(0.8, max(AD_overlap_uniq))
)
```
The above is plotted such that we only see the range overlap values greater than `T_d` (and thus that range where a discrepancy of 0.8 Ma or greater is even possible.)
Note that the difference in the horizontal axes on this plot and the previous plot - as D&A's model only considers ranges of 0.97 Ma, the maximum overlap cannot exceed that value. D&A's model doesn't even take into account the possibility for overlaps as great as 2 Ma, or the discrepancies possible at that size of overlap. Thus, to compare the two, it may be more appropriate to restrict our observations to the range of overlaps between 0.8 Ma and 0.97 Ma, and then superimposing our empirically-derived cumulative distribution of observing a discrepancy greater than `T_d` with the probability estimates from D&A within that range.
```{r echo = FALSE, dev='png'}
# superimpose the model and the simulated results
T_d <- 0.8
T_R <- 0.97
xlims <- c(
T_d, T_R
)
ylims <- c(
0, 0.021
)
lineTypes <- c(4,1)
plot(
AD_overlap_uniq,
cumulAbove,
xlab = "True Range Overlap Between Anc. & Desc. (Ma)",
ylab = "Cumulative Probability of\n Anc-Desc Discrepancy >= 0.8 Ma",
main = "Comparing Cumul. Distributions\n for Discrepancy >= T_d (0.8 Ma)",
xlim = xlims,
ylim = ylims,
lty = lineTypes[1]
)
lines(T_o, P_H_diff_greater,
lty = lineTypes[2]
)
legend("topleft",
lty = lineTypes,
legend = c("Empirical Dist. From Simulations",
"Prob. Estimated from D & A 2019")
)
```
However, because discrepancies cannot be larger than overlaps, the simulated distribution at a particular overlap value (e.g. 0.9 Ma) may be underestimated relative to D&A's model expectations, as the cumulative distribution is the proportion of simulation runs showing a value equal to or greater than the value of interest, divided by the number of simulation runs. The numerator is unaffected within our restricted range of overlap values, because discrepancies cannot be larger than overlaps, the cumulative distribution estimated from simulations at a particular overlap value (e.g. 0.9 Ma) doesn't take into account discrepancies larger than 0.97 Ma within the plotted range. However, the denominator for that cumulative probability considers many scenarios where overlap is much greater than 0.97 Ma. Obviously, we have relaxed many assumptions from D&A's model in building this simulation comparison, and many of the simulation runs are not possible under D&A's very specific model assumptions. However, we can at least say that D&A's model is not considering scenarios with overlap that high, and thus remove that proportion from the denominator, and see how much that impacts our comparison.
```{r echo = FALSE, dev='png'}
T_d <- 0.8
T_R <- 0.97
# recalculate cumulative, ignoring overlap values greater than T_R
AD_overlap_uniq <- sort(unique(AD_overlap_all))
nSimBelowTR <- sum(AD_overlap_all < T_R)
cumulAbove <- sapply(AD_overlap_uniq, function(x)
sum(-AD_diff_all[AD_overlap_all <= x] > T_d) / nSimBelowTR
)
# superimpose the model and the simulated results
lineTypes <- c(4,1)
plot(
AD_overlap_uniq,
cumulAbove,
xlab = "True Range Overlap Between Anc. & Desc. (Ma)",
ylab = "Cumulative Probability of\n Anc-Desc Discrepancy >= 0.8 Ma",
main = "Comparing Cumul. Distributions\n for Discrepancy >= T_d (0.8 Ma)",
xlim = xlims,
ylim = ylims,
lty = lineTypes[1]
)
lines(T_o, P_H_diff_greater,
lty = lineTypes[2]
)
legend("topleft",
lty = lineTypes,
legend = c("Empirical Dist. From Simulations",
"Prob. Estimated from D & A 2019")
)
```
This did not really have a very sizeable effect, as the number of ancestor-descendant pairs with overlaps that large a small minority of our simulation results.
Overall, from the last two superimposed plots, we can say that while the two have *extremely* different shapes. This may owe to the fact we are considering both direct and indirect ancestor-descendant pairs, or possibly having something to do with our relaxation of D&A's strict assumption of descendant and ancestors having similar taxon durations in the much more realistic assumptions of our siulations. However, the values are remarkably similar. Overall, one has to assume that even if some of D&A's assumptions are strenuous, the probabilities they obtain are very close to what would be expected from this particular system.
### Does it differ if we condition on extant taxa, or on extinct-only taxa?
Repeating the last plot, but conditioning our data on whether one or both taxa in an AD pair were still extant...
```{r echo = FALSE, dev='png'}
T_d <- 0.8
T_R <- 0.97
AD_overlap_extant <- AD_overlap_all[isExtant_all]
AD_diff_extant <- AD_diff_all[isExtant_all]
# recalculate cumulative, ignoring overlap values greater than T_R
AD_overlap_uniq <- sort(unique(AD_overlap_extant))
nSimBelowTR <- sum(AD_overlap_extant < T_R)
cumulAbove <- sapply(AD_overlap_uniq, function(x)
sum(-AD_diff_extant[AD_overlap_extant <= x] > T_d) / nSimBelowTR
)
# superimpose the model and the simulated results
lineTypes <- c(4,1)
plot(
AD_overlap_uniq,
cumulAbove,
xlab = "True Range Overlap Between Anc. & Desc. (Ma)",
ylab = "Cumulative Probability of\n Anc-Desc Discrepancy >= 0.8 Ma",
main = "Comparing Cumul. Distributions\n for Discrepancy >= T_d (0.8 Ma)",
xlim = xlims,
ylim = ylims,
lty = lineTypes[1]
)
lines(T_o, P_H_diff_greater,
lty = lineTypes[2]
)
legend("topleft",
lty = lineTypes,
legend = c("Empirical Dist. From Simulations",
"Prob. Estimated from D & A 2019")
)
```
... or all-extinct.
```{r echo = FALSE, dev='png'}
T_d <- 0.8
T_R <- 0.97
AD_overlap_extinct <- AD_overlap_all[!isExtant_all]
AD_diff_extinct <- AD_diff_all[!isExtant_all]
# recalculate cumulative, ignoring overlap values greater than T_R
AD_overlap_uniq <- sort(unique(AD_overlap_extinct))
nSimBelowTR <- sum(AD_overlap_extinct < T_R)
cumulAbove <- sapply(AD_overlap_uniq, function(x)
sum(-AD_diff_extinct[AD_overlap_extinct <= x] > T_d) / nSimBelowTR
)
# superimpose the model and the simulated results
lineTypes <- c(4,1)
plot(
AD_overlap_uniq,
cumulAbove,
xlab = "True Range Overlap Between Anc. & Desc. (Ma)",
ylab = "Cumulative Probability of Anc-Desc Discrepancy >= 0.8 Ma",
main = "Comparing Cumul. Distributions\n for Discrepancy >= T_d (0.8 Ma)",
xlim = xlims,
ylim = ylims,
lty = lineTypes[1]
)
lines(T_o, P_H_diff_greater,
lty = lineTypes[2]
)
legend("topleft",
lty = lineTypes,
legend = c("Empirical Dist. From Simulations",
"Prob. Estimated from D & A 2019")
)
```
In both cases the lines get less smooth (less data to calculate the empirical cumulative distributions from), but both show fairly similar values and shapes to the plot using the full dataset.
## So just how likely is a discrepancy of 0.8 Ma?
An alternative way of looking at the results of these simulations might be to ask how often we would expect to see a discrepancy of 0.8 Ma or greater in a fossil record with this set of generating parameters, without tying the probability value to a specific overlap value. D&A do this in their paper with their equation 6c, which integrates over 'all possible' values of `T_o` (i.e. the range of 0 to 0.97 Ma, the maximum of which is`T_R`), to find a *summary* *p*-value of 0.0009.
But what would be the comparable probability from our BDS simulations? To do this, first we'll need the cumulative distribution for the temporal offset itself, rather than for achieving a particular discrepancy at different values of range overlap (as D&A approached the issue).
```{r echo = FALSE, dev='png'}
# probability distribution
# only discrepancy invert scale
AD_discrepancy <- -AD_diff_all[AD_diff_all < 0]
AD_discrepancy_uniq <- sort(unique(AD_discrepancy))
cumulAbove <- sapply(AD_discrepancy_uniq, function(x)
sum(AD_discrepancy >= x) / length(AD_diff_all)
)
plot(
AD_discrepancy_uniq,
cumulAbove,
xlab = "Desc. FAD - Anc. FAD (Ma)",
ylab = "Prob. of having X Discrepancy or Greater",
main = "Empirical Probability of Anc. - Desc. Age Discrepancy"
)
abline(v = 0.8, lty=5, col = "gray", lwd=2)
# what value is closest to 0.8
closeDiff_T_d <- which(AD_discrepancy_uniq > 0.8)[1]
```
Note that age offset is plotted so that the pairs with earlier appearing descendants have *positive* offsets, not negative. While this is more intuitive, this also means the curve is actually a reverse cumulative distribution - each point's vertical scale represents the proportion of simulations that have a discrepancy of the discrepancy duration shown on the horizontal axis, *or larger*, and thus the cumulative probability increases to the left. The vertical line is our discrepancy value of interest, 0.8 Ma.
One item to note is that discrepancies themselves are not rare in these simulated datasets, as we've already seen, but that the majority are quite small, and probably insignificant in duration in a real fossil record (we would never realistically even see such small discrepancies given chronostratigraphic uncertainty, correlation issues, etc).
The closest observed discrepancy value to 0.8 Ma, for which we have a cumulative probability from our empirical curve, is:
```{r}
# the value of offset closest to 0.8 Ma
AD_discrepancy_uniq[closeDiff_T_d]
```
Ah, that's pretty close. So what is its cumulative probability?