Rosenberg lab at Stanford University
Noah A Rosenberg
+1 650 721 2599 (office phone)
+1 650 724 5122 (lab phone)
+1 650 724 5114 (fax)
Lab: 339 Herrin Labs
Office: 339A Herrin Labs
Mailing address: Department of Biology, Stanford University, 371 Serra Mall, Stanford, CA 94305-5020, USA
Last modified 8-5-2017

Abstracts of Rosenberg lab publications

[2016] [2015] [2014] [2013] [2012] [2011] [2010] [2009] [2008] [2007] [2006] [2005 and earlier]

[149] OK Kamneva, J Syring, A Liston, NA Rosenberg (2017) Evaluating allopolyploid origins in strawberries (Fragaria) using haplotypes generated from target capture sequencing. BMC Evolutionary Biology 17: 180. [PDF] [File S1 Table S1] [File S2 Sequences] [File S3 Table S2] [File S4 Figures S1-S11] [File S5 Table S3] [File S6 R code]

Background. Hybridization is observed in many eukaryotic lineages and can lead to the formation of polyploid species. The study of hybridization and polyploidization faces challenges both in data generation and in accounting for population-level phenomena such as coalescence processes in phylogenetic analysis. Genus Fragaria is one example of a set of plant taxa in which a range of ploidy levels is observed across species, but phylogenetic origins are unknown. Results. Here, using 20 diploid and polyploid Fragaria species, we combine approaches from NGS data analysis and phylogenetics to infer evolutionary origins of polyploid strawberries, taking into account coalescence processes. We generate haplotype sequences for 257 low-copy nuclear markers assembled from Illumina target capture sequence data. We then identify putative hybridization events by analyzing gene tree topologies, and further test predicted hybridizations in a coalescence framework. This approach confirms the allopolyploid ancestry of F. chiloensis and F. virginiana, and provides new allopolyploid ancestry hypotheses for F. iturupensis, F. moschata, and F. orientalis. Evidence of gene flow between diploids F. bucharica and F. vesca is also detected, suggesting that it might be appropriate to consider these groups as conspecifics. Conclusions. This study is one of the first in which target capture sequencing followed by computational deconvolution of individual haplotypes is used for tracing origins of polyploid taxa. The study also provides new perspectives on the evolutionary history of Fragaria.

[148] N Alcala, NA Rosenberg (2017) Mathematical constraints on FST: biallelic markers in arbitrarily many populations. Genetics 206: 1581-1600. [PDF] [File S1] [File S2]

FST is one of the most widely used statistics in population genetics. Recent mathematical studies have identified constraints that challenge interpretations of FST as a measure with potential to range from 0 for genetically similar populations to 1 for divergent populations. We generalize results obtained for population pairs to arbitrarily many populations, characterizing the mathematical relationship between FST, the frequency M of the more frequent allele at a polymorphic biallelic marker, and the number of subpopulations K. We show that for fixed K, FST has a peculiar constraint as a function of M, with a maximum of 1 only if M = i/K, for integers i with ⌈K/2⌉ ≤ i ≤ K-1. For fixed M, as K grows large, the range of FST becomes the closed or half-open unit interval. For fixed K, however, some M < (K-1)/K always exists at which the upper bound on FST lies below 2√2 - 2 ≈ 0.8284. We use coalescent simulations to show that under weak migration, FST depends strongly on M when K is small, but not when K is large. Finally, examining data on human genetic variation, we use our results to explain the generally smaller FST values between pairs of continents relative to global FST values. We discuss implications for the interpretation and use of FST.
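The dependence of FST on M and K described above can be explored numerically. The following is a minimal sketch, not the authors' code. It assumes the standard definition FST = (HT - HS)/HT for a biallelic marker with equally weighted subpopulations, and the function names are hypothetical. The upper bound is derived here from that definition alone: for a fixed mean frequency M, FST is largest when as many subpopulations as possible are fixed for one allele and at most one has an intermediate frequency.

```python
from fractions import Fraction
from math import floor


def fst_biallelic(freqs):
    """FST = (HT - HS) / HT for one biallelic marker.

    freqs holds the frequency of one allele in each of K subpopulations,
    which are assumed to be weighted equally.
    """
    K = len(freqs)
    mean_p = sum(freqs) / K
    h_total = 2 * mean_p * (1 - mean_p)
    h_sub = sum(2 * p * (1 - p) for p in freqs) / K
    return (h_total - h_sub) / h_total


def fst_upper_bound(M, K):
    """Largest FST attainable when the mean allele frequency is M (0 < M < 1).

    Assumption: floor(K*M) subpopulations are fixed at frequency 1, one takes
    the fractional remainder, and the rest are fixed at frequency 0.
    """
    M = Fraction(M)
    KM = K * M
    i = floor(KM)
    frac = KM - i
    return (i + frac**2 - K * M**2) / (K * M * (1 - M))


if __name__ == "__main__":
    # Scan M for K = 3 to see where the bound reaches 1 and where it dips.
    for num in range(10, 20):
        M = Fraction(num, 20)
        print(float(M), float(fst_upper_bound(M, 3)))
```

Under these assumptions, the bound for K = 3 equals 1 at M = 2/3 and falls to 2/3 at M = 1/2, which is below 2√2 - 2.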

[147] MD Edge, BFB Algee-Hewitt, TJ Pemberton, JZ Li, NA Rosenberg (2017) Linkage disequilibrium matches forensic genetic records to disjoint genomic marker sets. Proceedings of the National Academy of Sciences USA 114: 5671-5676. [PDF] [Supplement]

Combining genotypes across datasets is central in facilitating advances in genetics. Data aggregation efforts often face the challenge of record matching — the identification of dataset entries that represent the same individual. We show that records can be matched across genotype datasets that have no shared markers based on linkage disequilibrium between loci appearing in different datasets. Using two datasets for the same 872 people — one with 642,563 genome-wide SNPs and the other with 13 short tandem repeats (STRs) used in forensic applications — we find that 90-98% of forensic STR records can be connected to corresponding SNP records and vice versa. Accuracy increases to 99-100% when ~30 STRs are used. Our method expands the potential of data aggregation, but it also suggests privacy risks intrinsic in maintenance of databases containing even small numbers of markers — including databases of forensic significance.

[146] A Goldberg, T Günther, NA Rosenberg, M Jakobsson (2017) Robust model-based inference of male-biased admixture during Bronze Age migration from the Pontic-Caspian Steppe. Proceedings of the National Academy of Sciences USA 114: E3875-E3877. [PDF]

(No abstract)

[145] A Goldberg, T Günther, NA Rosenberg, M Jakobsson (2017) Ancient X chromosomes reveal contrasting sex bias in Neolithic and Bronze Age migrations. Proceedings of the National Academy of Sciences USA 114: 2657-2662. [PDF] [Supplement]

Dramatic events in human prehistory, such as the spread of agriculture to Europe from Anatolia and the late Neolithic/Bronze Age migration from the Pontic-Caspian Steppe, can be investigated using patterns of genetic variation among the people who lived in those times. In particular, studies of differing female and male demographic histories on the basis of ancient genomes can provide information about complexities of social structures and cultural interactions in prehistoric populations. We use a mechanistic admixture model to compare the sex-specifically-inherited X chromosome with the autosomes in 20 early Neolithic and 16 late Neolithic/Bronze Age human remains. Contrary to previous hypotheses suggested by the patrilocality of many agricultural populations, we find no evidence of sex-biased admixture during the migration that spread farming across Europe during the early Neolithic. For later migrations from the Pontic Steppe during the late Neolithic/Bronze Age, however, we estimate a dramatic male bias, with approximately five to 14 migrating males for every migrating female. We find evidence of ongoing, primarily male, migration from the steppe to central Europe over a period of multiple generations, with a level of sex bias that excludes a pulse migration during a single generation. The contrasting patterns of sex-specific migration during these two migrations suggest a view of differing cultural histories in which the Neolithic transition was driven by mass migration of both males and females in roughly equal numbers, perhaps whole families, whereas the later Bronze Age migration and cultural shift were instead driven by male migration, potentially connected to new technology and conquest.

[144] OK Kamneva, NA Rosenberg (2017) Simulation-based evaluation of hybridization network reconstruction methods in the presence of incomplete lineage sorting. Evolutionary Bioinformatics 13: 1176934317691935. [PDF] [Supplement]

Hybridization events generate reticulate species relationships, giving rise to species networks rather than species trees. We report a comparative study of consensus, maximum parsimony, and maximum likelihood methods of species network reconstruction using gene trees simulated assuming a known species history. We evaluate the role of the divergence time between species involved in a hybridization event, the relative contributions of the hybridizing species, and the error in gene tree estimation. When gene tree discordance is mostly due to hybridization and not due to incomplete lineage sorting (ILS), most of the methods can detect even highly skewed hybridization events between highly divergent species. For recent divergences between hybridizing species, when the influence of ILS is sufficiently high, likelihood methods outperform parsimony and consensus methods, which erroneously identify extra hybridizations. The more sophisticated likelihood methods, however, are affected by gene tree errors to a greater extent than are consensus and parsimony.

[143] LH Uricchio, T Warnow, NA Rosenberg (2016) An analytical upper bound on the number of loci required for all splits of a species tree to appear in a set of gene trees. BMC Bioinformatics 17: 417. [PDF]

Background. Many methods for species tree inference require data from a sufficiently large sample of genomic loci in order to produce accurate estimates. However, few studies have attempted to use analytical theory to quantify "sufficiently large." Results. Using the multispecies coalescent model, we report a general analytical upper bound on the number of gene trees n required such that with probability q, each bipartition of a species tree is represented at least once in a set of n random gene trees. This bound employs a formula that is straightforward to compute, depends only on the minimum internal branch length of the species tree and the number of taxa, and applies irrespective of the species tree topology. Using simulations, we investigate numerical properties of the bound as well as its accuracy under the multispecies coalescent. Conclusions. Our results are helpful for conservatively bounding the number of gene trees required by the ASTRAL inference method, and the approach has potential to be extended to bound other properties of gene tree sets under the model.

[142] F Disanto, NA Rosenberg (2016) Asymptotic properties of the number of matching coalescent histories for caterpillar-like families of species trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics 13: 913-925. [PDF]

Coalescent histories provide lists of species tree branches on which gene tree coalescences can take place, and their enumerative properties assist in understanding the computational complexity of calculations central in the study of gene trees and species trees. Here, we solve an enumerative problem left open by Rosenberg (IEEE/ACM Transactions on Computational Biology and Bioinformatics 10: 1253-1262, 2013) concerning the number of coalescent histories for gene trees and species trees with a matching labeled topology that belongs to a generic caterpillar-like family. By bringing a generating function approach to the study of coalescent histories, we prove that for any caterpillar-like family with seed tree t, the sequence (hn)n ≥ 0 describing the number of matching coalescent histories of the nth tree of the family grows asymptotically as a constant multiple of the Catalan numbers. Thus, hn ~ βt cn, where the asymptotic constant βt depends on the shape of the seed tree t. The result extends a claim demonstrated only for seed trees with at most eight taxa to arbitrary seed trees, expanding the set of cases for which detailed enumerative properties of coalescent histories can be determined. We introduce a procedure that computes from t the constant βt as well as the algebraic expression for the generating function of the sequence (hn)n ≥ 0.
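To make the growth rate concrete, the Catalan numbers that set the scale of hn can be computed directly. This is a small illustrative sketch with hypothetical function names. It uses the closed form cn = C(2n, n)/(n + 1) and the standard asymptotic approximation 4^n/(√π n^(3/2)); it does not compute the constant βt, for which the abstract gives no formula.

```python
from math import comb, pi, sqrt


def catalan(n):
    """Exact nth Catalan number, c_n = C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def catalan_asymptotic(n):
    """Standard large-n approximation 4^n / (sqrt(pi) * n^(3/2)), for n >= 1."""
    return 4**n / (sqrt(pi) * n**1.5)


if __name__ == "__main__":
    # The ratio approaches 1 as n grows, so h_n ~ beta_t * c_n implies
    # that h_n grows like beta_t * 4^n / (sqrt(pi) * n^(3/2)).
    for n in (5, 20, 80, 200):
        print(n, catalan(n), catalan(n) / catalan_asymptotic(n))
```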

[141] RS Mehta, D Bryant, NA Rosenberg (2016) The probability of monophyly of a sample of gene lineages on a species tree. Proceedings of the National Academy of Sciences USA 113: 8002-8009. [PDF] [Supplement] [Software]

Monophyletic groups — groups that consist of all of the descendants of a most recent common ancestor — arise naturally as a consequence of descent processes that result in meaningful distinctions between organisms. Aspects of monophyly are therefore central to fields that examine and use genealogical descent. In particular, studies in conservation genetics, phylogeography, population genetics, species delimitation, and systematics can all make use of mathematical predictions under evolutionary models about features of monophyly. One important calculation, the probability that a set of gene lineages is monophyletic under a two-species neutral coalescent model, has been used in many studies. Here, we extend this calculation for a species tree model that contains arbitrarily many species. We study the effects of species tree topology and branch lengths on the monophyly probability. These analyses reveal new behavior, including the maintenance of nontrivial monophyly probabilities for gene lineage samples that span multiple species and even for lineages that do not derive from a monophyletic species group. We illustrate the mathematical results using an example application to data from maize and teosinte.

[140] T Stadler, JH Degnan, NA Rosenberg (2016) Does gene tree discordance explain the mismatch between macroevolutionary models and empirical patterns of tree shape and branching times? Systematic Biology 65: 628-639. [PDF] [Supplement]

Classic null models for speciation and extinction give rise to phylogenies that differ in distribution from empirical phylogenies. In particular, empirical phylogenies are less balanced and have branching times closer to the root compared to phylogenies predicted by common null models. This difference might be due to null models of the speciation and extinction process being too simplistic, or due to the empirical datasets not being representative of random phylogenies. A third possibility arises because phylogenetic reconstruction methods often infer gene trees rather than species trees, producing an incongruity between models that predict species tree patterns and empirical analyses that consider gene trees. We investigate the extent to which the difference between gene trees and species trees under a combined birth-death and multispecies coalescent model can explain the difference in empirical trees and birth-death species trees. We simulate gene trees embedded in simulated species trees and investigate their difference with respect to tree balance and branching times. We observe that the gene trees are less balanced and typically have branching times closer to the root than the species trees. Empirical trees from TreeBase are also less balanced than our simulated species trees, and model gene trees can explain an imbalance increase of up to 8% compared to species trees. However, we see a much larger imbalance increase in empirical trees, about 100%, meaning that additional features must also be causing imbalance in empirical trees. This simulation study highlights the necessity of revisiting the assumptions made in phylogenetic analyses, as these assumptions, such as equating the gene tree with the species tree, might lead to a biased conclusion.

[139] M DeGiorgio, NA Rosenberg (2016) Consistency and inconsistency of consensus methods for inferring species trees from gene trees in the presence of ancestral population structure. Theoretical Population Biology 110: 12-24. [PDF]

In the last few years, several statistically consistent consensus methods for species tree inference have been devised that are robust to the gene tree discordance caused by incomplete lineage sorting in unstructured ancestral populations. One source of gene tree discordance that has only recently been identified as a potential obstacle for phylogenetic inference is ancestral population structure. In this article, we describe a general model of ancestral population structure, and by relying on a single carefully constructed example scenario, we show that the consensus methods Democratic Vote, STEAC, STAR, R* Consensus, Rooted Triple Consensus, Minimize Deep Coalescences, and Majority-Rule Consensus are statistically inconsistent under the model. We find that among the consensus methods evaluated, the only method that is statistically consistent in the presence of ancestral population structure is GLASS/Maximum Tree. We use simulations to evaluate the behavior of the various consensus methods in a model with ancestral population structure, showing that as the number of gene trees increases, estimates on the basis of GLASS/Maximum Tree approach the true species tree topology irrespective of the level of population structure, whereas estimates based on the remaining methods only approach the true species tree topology if the level of structure is low. However, through simulations using species trees both with and without ancestral population structure, we show that GLASS/Maximum Tree performs unusually poorly on gene trees inferred from alignments with little information. This practical limitation of GLASS/Maximum Tree together with the inconsistency of other methods prompts the need for both further testing of additional existing methods and development of novel methods under conditions that incorporate ancestral population structure.

[138] BFB Algee-Hewitt*, MD Edge*, J Kim, JZ Li, NA Rosenberg (2016) Individual identifiability predicts population identifiability in forensic microsatellite markers. Current Biology 26: 935-942.

Highly polymorphic genetic markers with significant potential for distinguishing individual identity are used as a standard tool in forensic testing [1,2]. At the same time, population-genetic studies have suggested that genetically diverse markers with high individual identifiability also confer information about genetic ancestry [3-6]. The dual influence of polymorphism levels on ancestry inference and forensic desirability suggests that forensically useful marker sets with high levels of individual identifiability might also possess substantial ancestry information. We study a standard forensic marker set — the 13 CODIS loci used in the United States and elsewhere [2,7-9] — together with 779 additional microsatellites [10], using direct population structure inference to test whether markers with substantial individual identifiability also produce considerable information about ancestry. Despite having been selected for individual identification and not for ancestry inference [11], the CODIS markers generate nontrivial model-based clustering patterns similar to those of other sets of 13 tetranucleotide microsatellites. Although the CODIS markers have relatively low values of the FST divergence statistic, their high heterozygosities produce greater ancestry inference potential than is possessed by less heterozygous marker sets. More generally, we observe that marker sets with greater individual identifiability also tend toward greater population identifiability. We conclude that population identifiability regularly follows as a byproduct of the use of highly polymorphic forensic markers. Our findings have implications for the design of new forensic marker sets and for evaluations of the extent to which individual characteristics beyond identification might be predicted from current and future forensic data.

[137] NA Rosenberg (2016) Admixture models and the breeding systems of H. S. Jennings: a GENETICS connection. Genetics 202: 9-13. [PDF]

(No abstract)

[136] JTL Kang, P Zhang, S Zöllner, NA Rosenberg (2015) Choosing subsamples for sequencing studies by minimizing the average distance to the closest leaf. Genetics 201: 499-511. [PDF]

Imputation of genotypes in a study sample can make use of sequenced or densely genotyped external reference panels consisting of individuals that are not from the study sample. It also can employ internal reference panels, incorporating a subset of individuals from the study sample itself. Internal panels offer an advantage over external panels because they can reduce imputation errors arising from genetic dissimilarity between a population of interest and a second, distinct population from which the external reference panel has been constructed. As the cost of next-generation sequencing decreases, internal reference panel selection is becoming increasingly feasible. However, it is not clear how best to select individuals to include in such panels. We introduce a new method for selecting an internal panel — minimizing the average distance to the closest leaf (ADCL) — and compare its performance relative to an earlier algorithm: maximizing phylogenetic diversity (PD). Employing both simulated data and sequences from the 1000 Genomes Project, we show that ADCL provides a significant improvement in imputation accuracy, especially for imputation of sites with low-frequency alleles. This improvement in imputation accuracy is robust to changes in reference panel size, marker density, and length of the imputation target region.
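The ADCL criterion can be illustrated on a toy example. The sketch below rests on several assumptions that the abstract does not state: distances are supplied as a precomputed symmetric matrix rather than read from a tree, the average is taken over all leaves (selected leaves contribute 0), and the panel is found by exhaustive search, which is feasible only for very small inputs and is not the algorithm used in the paper. All names are hypothetical.

```python
from itertools import combinations


def adcl(dist, selected):
    """Average, over all leaves, of the distance to the nearest selected leaf."""
    n = len(dist)
    return sum(min(dist[i][j] for j in selected) for i in range(n)) / n


def best_panel(dist, k):
    """Try every k-subset of leaves and return the first one with minimal ADCL."""
    return min(combinations(range(len(dist)), k), key=lambda s: adcl(dist, s))


if __name__ == "__main__":
    # Toy distances: leaves 0 and 1 are close, leaves 2 and 3 are close,
    # and the two pairs are far apart.
    dist = [
        [0, 1, 10, 10],
        [1, 0, 10, 10],
        [10, 10, 0, 2],
        [10, 10, 2, 0],
    ]
    panel = best_panel(dist, 2)
    print(panel, adcl(dist, panel))
```

In this toy case the chosen panel takes one leaf from each close pair, so every unselected leaf has a nearby selected neighbor.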

[135] F Disanto, NA Rosenberg (2015) Coalescent histories for lodgepole species trees. Journal of Computational Biology 22: 918-929. [PDF]

Coalescent histories are combinatorial structures that describe for a given gene tree and species tree the possible lists of branches of the species tree on which the gene tree coalescences take place. Properties of the number of coalescent histories for gene trees and species trees affect a variety of probabilistic calculations in mathematical phylogenetics. Exact and asymptotic evaluations of the number of coalescent histories, however, are known only in a limited number of cases. Here we introduce a particular family of species trees, the lodgepole species trees (λn)n ≥ 0, in which tree λn has m=2n+1 taxa. We determine the number of coalescent histories for the lodgepole species trees, in the case that the gene tree matches the species tree, showing that this number grows with m!! in the number of taxa m. This computation demonstrates the existence of tree families in which the growth in the number of coalescent histories is faster than exponential. Further, it provides a substantial improvement on the lower bound for the ratio of the largest number of matching coalescent histories to the smallest number of matching coalescent histories for trees with m taxa, increasing a previous bound of $(\sqrt{\pi}/32)[(5m-12)/(4m-6)]\,m\sqrt{m}$ to $[\sqrt{m-1}/(4\sqrt{e})]^{m}$. We discuss the implications of our enumerative results for phylogenetic computations.
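The quantities named in this abstract are easy to evaluate numerically. The sketch below is illustrative only, with hypothetical function names. It assumes that the previous bound reads (√π/32)[(5m-12)/(4m-6)] m√m and that the trailing m in the new bound is an exponent, giving [√(m-1)/(4√e)]^m; both readings are reconstructed from the garbled expressions on this page rather than checked against the paper.

```python
from math import e, pi, sqrt


def double_factorial(m):
    """m!! = m * (m - 2) * (m - 4) * ... down to 1 or 2."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def previous_bound(m):
    """Assumed reading of the earlier bound: polynomial growth in m."""
    return (sqrt(pi) / 32) * ((5 * m - 12) / (4 * m - 6)) * m * sqrt(m)


def improved_bound(m):
    """Assumed reading of the new bound: faster than exponential growth in m."""
    return (sqrt(m - 1) / (4 * sqrt(e))) ** m


if __name__ == "__main__":
    # Lodgepole trees have m = 2n + 1 taxa, so only odd m are shown.
    for m in (9, 45, 101):
        print(m, double_factorial(m), previous_bound(m), improved_bound(m))
```

Under this reading, the second expression is below 1 whenever √(m-1) < 4√e, that is, for m ≤ 44, so the comparison is informative only for larger trees.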

[134] R Ronen*, G Tesler*, A Akbari*, S Zakov, NA Rosenberg, V Bafna (2015) Predicting carriers of ongoing selective sweeps without knowledge of the favored allele. PLoS Genetics 11: e1005527. [PDF] [Supplement]

Methods for detecting the genomic signatures of natural selection have been heavily studied, and they have been successful in identifying many selective sweeps. For most of these sweeps, the favored allele remains unknown, making it difficult to distinguish carriers of the sweep from non-carriers. In an ongoing selective sweep, carriers of the favored allele are likely to contain a future most recent common ancestor. Therefore, identifying them may prove useful in predicting the evolutionary trajectory — for example, in contexts involving drug-resistant pathogen strains or cancer subclones. The main contribution of this paper is the development and analysis of a new statistic, the Haplotype Allele Frequency (HAF) score. The HAF score, assigned to individual haplotypes in a sample, naturally captures many of the properties shared by haplotypes carrying a favored allele. We provide a theoretical framework for computing expected HAF scores under different evolutionary scenarios, and we validate the theoretical predictions with simulations. As an application of HAF score computations, we develop an algorithm (PreCIOSS: Predicting Carriers of Ongoing Selective Sweeps) to identify carriers of the favored allele in selective sweeps, and we demonstrate its power on simulations of both hard and soft sweeps, as well as on data from well-known sweeps in human populations.
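A haplotype-level score of this kind can be sketched in a few lines. The definition used here is an assumption, since the abstract does not state one: the score of a haplotype is taken to be the sum, over the derived alleles it carries, of the number of sampled haplotypes carrying the same derived allele. The function name and the 0/1 encoding (1 for derived) are hypothetical, and this is not the PreCIOSS algorithm.

```python
def haf_scores(haplotypes):
    """Return one score per haplotype.

    haplotypes is a list of equal-length lists of 0/1 values, where 1 marks
    a derived allele. Each derived allele contributes its count in the sample
    to the score of every haplotype that carries it.
    """
    n_sites = len(haplotypes[0])
    derived_counts = [sum(h[j] for h in haplotypes) for j in range(n_sites)]
    return [
        sum(count for allele, count in zip(h, derived_counts) if allele == 1)
        for h in haplotypes
    ]


if __name__ == "__main__":
    sample = [
        [1, 1, 0],
        [1, 0, 0],
        [0, 0, 1],
    ]
    print(haf_scores(sample))
```

Haplotypes that share many common derived alleles receive high scores under this definition, which is the intuition behind using such a score to flag likely carriers of a sweeping allele.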

[133] A Goldberg, NA Rosenberg (2015) Beyond 2/3 and 1/3: the complex signatures of sex-biased admixture on the X chromosome. Genetics 201: 263-279. [PDF]

[132] MD Edge, NA Rosenberg (2015) A general model of the relationship between the apportionment of human genetic diversity and the apportionment of human phenotypic diversity. Human Biology 87: 313-337. [PDF]

Models that examine genetic differences between populations alongside a genotype-phenotype map can provide insight about phenotypic variation among groups. We generalize a simple model of a completely heritable, additive, selectively neutral quantitative trait to examine the relationship between single-locus genetic differentiation and phenotypic differentiation on quantitative traits. In agreement with similar efforts using different models, we show that the expected degree to which two groups differ on a neutral quantitative trait is not strongly affected by the number of genetic loci that influence the trait: neutral trait differences are expected to have a magnitude comparable to the genetic differences at a single neutral locus. We discuss this result with respect to population differences in disease phenotypes, arguing that although neutral genetic differences between populations can contribute to specific differences between populations in health outcomes, systematic patterns of difference that run in the same direction for many genetically independent health conditions are unlikely to be explained by neutral genetic differentiation.

[131] NA Rosenberg, JTL Kang (2015) Genetic diversity and societally important disparities. Genetics 201: 1-12. [PDF] [Supplement]

The magnitude of genetic diversity within human populations varies in a way that reflects the sequence of migrations by which people spread throughout the world. Beyond its use in human evolutionary genetics, worldwide variation in genetic diversity sometimes can interact with social processes to produce differences among populations in their relationship to modern societal problems. We review the consequences of genetic diversity differences in the settings of familial identification in forensic genetic testing, match probabilities in bone marrow transplantation, and representation in genome-wide association studies of disease. In each of these three cases, the contribution of genetic diversity to social differences follows from population-genetic principles. For a fourth setting that is not similarly grounded, we reanalyze with expanded genetic data a report that genetic diversity differences influence global patterns of human economic development, finding no support for the claim. The four examples describe a limit to the importance of genetic diversity for explaining societal differences while illustrating a distinction that certain biologically based scenarios do require consideration of genetic diversity for solving problems to which populations have been differentially predisposed by the unique history of human migrations.

[130] NM Kopelman, J Mayzel, M Jakobsson, NA Rosenberg, I Mayrose (2015) CLUMPAK: a program for identifying clustering modes and packaging population structure inferences across K. Molecular Ecology Resources 15: 1179-1191. [PDF] [Supplement] [Software]

The identification of the genetic structure of populations from multilocus genotype data has become a central component of modern population-genetic data analysis. Application of model-based clustering programs often entails a number of steps, in which the user considers different modelling assumptions, compares results across different predetermined values of the number of assumed clusters (a parameter typically denoted K), examines multiple independent runs for each fixed value of K, and distinguishes among runs belonging to substantially distinct clustering solutions. Here, we present CLUMPAK (Cluster Markov Packager Across K), a method that automates the postprocessing of results of model-based population structure analyses. For analysing multiple independent runs at a single K value, CLUMPAK identifies sets of highly similar runs, separating distinct groups of runs that represent distinct modes in the space of possible solutions. This procedure, which generates a consensus solution for each distinct mode, is performed by the use of a Markov clustering algorithm that relies on a similarity matrix between replicate runs, as computed by the software CLUMPP. Next, CLUMPAK identifies an optimal alignment of inferred clusters across different values of K, extending a similar approach implemented for a fixed K in CLUMPP and simplifying the comparison of clustering results across different K values. CLUMPAK incorporates additional features, such as implementations of methods for choosing K and comparing solutions obtained by different programs, models, or data subsets. CLUMPAK, available at http://clumpak.tau.ac.il, simplifies the use of model-based analyses of population structure in population genetics and molecular ecology.

[129] MD Edge, NA Rosenberg (2015) Implications of the apportionment of human genetic diversity for the apportionment of human phenotypic diversity. Studies in History and Philosophy of Biological and Biomedical Sciences 52: 32-45. [PDF]

Researchers in many fields have considered the meaning of two results about genetic variation for concepts of "race." First, at most genetic loci, apportionments of human genetic diversity find that worldwide populations are genetically similar. Second, when multiple genetic loci are examined, it is possible to distinguish people with ancestry from different geographical regions. These two results raise an important question about human phenotypic diversity: To what extent do populations typically differ on phenotypes determined by multiple genetic loci? It might be expected that such phenotypes follow the pattern of similarity observed at individual loci. Alternatively, because they have a multilocus genetic architecture, they might follow the pattern of greater differentiation suggested by multilocus ancestry inference. To address the question, we extend a well-known classification model of Edwards (2003) by adding a selectively neutral quantitative trait. Using the extended model, we show, in line with previous work in quantitative genetics, that regardless of how many genetic loci influence the trait, one neutral trait is approximately as informative about ancestry as a single genetic locus. The results support the relevance of single-locus genetic-diversity partitioning for predictions about phenotypic diversity.

[128] L Lehmann, NA Rosenberg (2015) Hamilton's rule: game theory meets coalescent theory. Theoretical Population Biology 103: 1. [PDF]

(No abstract)

[127] NR Garud, NA Rosenberg (2015) Enhancing the mathematical properties of new haplotype homozygosity statistics for the detection of selective sweeps. Theoretical Population Biology 102: 94-101. [PDF]

Soft selective sweeps represent an important form of adaptation in which multiple haplotypes bearing adaptive alleles rise to high frequency. Most statistical methods for detecting selective sweeps from genetic polymorphism data, however, have focused on identifying hard selective sweeps in which a favored allele appears on a single haplotypic background; these methods might be underpowered to detect soft sweeps. Among exceptions is the set of haplotype homozygosity statistics introduced for the detection of soft sweeps by Garud et al. (2015). These statistics, examining frequencies of multiple haplotypes in relation to each other, include H12, a statistic designed to identify both hard and soft selective sweeps, and H2/H1, a statistic that, conditional on high H12 values, seeks to distinguish between hard and soft sweeps. A challenge in the use of H2/H1 is that its range depends on the associated value of H12, so that equal H2/H1 values might provide different levels of support for a soft sweep model at different values of H12. Here, we enhance the H12 and H2/H1 haplotype homozygosity statistics for selective sweep detection by deriving the upper bound on H2/H1 as a function of H12, thereby generating a statistic that normalizes H2/H1 to lie between 0 and 1. Through a reanalysis of resequencing data from inbred lines of Drosophila, we show that the enhanced statistic both strengthens interpretations obtained with the unnormalized statistic and leads to empirical insights that are less readily apparent without the normalization.
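The unnormalized statistics discussed above can be computed from a list of haplotype frequencies. This sketch assumes the definitions commonly attributed to Garud et al. (2015), which the abstract itself does not spell out: with frequencies sorted so that p1 ≥ p2 ≥ ..., H1 is the sum of squared frequencies, H12 treats the two most frequent haplotypes as one, and H2 omits the most frequent haplotype. The function name is hypothetical, and the upper bound used for the normalization is not reproduced because the abstract does not give it.

```python
def haplotype_homozygosity(freqs):
    """Return (H1, H12, H2/H1) for at least two haplotype frequencies summing to 1."""
    p = sorted(freqs, reverse=True)
    h1 = sum(x * x for x in p)                            # H1: sum of p_i^2
    h12 = (p[0] + p[1]) ** 2 + sum(x * x for x in p[2:])  # top two pooled
    h2 = h1 - p[0] ** 2                                   # H2: H1 without p_1^2
    return h1, h12, h2 / h1


if __name__ == "__main__":
    # One dominant haplotype (hard-sweep-like) versus two common haplotypes
    # (soft-sweep-like): H12 is high in both, but H2/H1 differs.
    print(haplotype_homozygosity([0.9, 0.05, 0.05]))
    print(haplotype_homozygosity([0.45, 0.45, 0.1]))
```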

[126] NA Rosenberg (2015) Theory in population biology, or biologically inspired mathematics? Theoretical Population Biology 102: 1-2. [PDF]

(No abstract)

[125] N Creanza, M Ruhlen, T Pemberton, NA Rosenberg, MW Feldman, S Ramachandran (2015) Comparison of worldwide phonemic and genetic variation in human populations. Proceedings of the National Academy of Sciences USA 112: 1265-1272. [PDF] [Supplementary Appendix] [Supplementary Data S1] [Supplementary Data S2] [Supplementary Data S3]

Worldwide patterns of genetic variation are driven by human demographic history. Here, we test whether this demographic history has left similar signatures on phonemes (sound units that distinguish meaning between words in languages) to those it has left on genes. We analyze, jointly and in parallel, phoneme inventories from 2,082 worldwide languages and microsatellite polymorphisms from 246 worldwide populations. On a global scale, both genetic distance and phonemic distance between populations are significantly correlated with geographic distance. Geographically close language pairs share significantly more phonemes than distant language pairs, whether or not the languages are closely related. The regional geographic axes of greatest phonemic differentiation correspond to axes of genetic differentiation, suggesting that there is a relationship between human dispersal and linguistic variation. However, the geographic distribution of phoneme inventory sizes does not follow the predictions of a serial founder effect during human expansion out of Africa. Furthermore, although geographically isolated populations lose genetic diversity via genetic drift, phonemes are not subject to drift in the same way: within a given geographic radius, languages that are relatively isolated exhibit more variance in number of phonemes than languages with many neighbors. This finding suggests that relatively isolated languages are more susceptible to phonemic change than languages with many neighbors. Within a language family, phoneme evolution along genetic, geographic, or cognate-based linguistic trees predicts similar ancestral phoneme states to those predicted from ancient sources. More genetic sampling could further elucidate the relative roles of vertical and horizontal transmission in phoneme evolution.

[124] EO Buzbas, NA Rosenberg (2015) AABC: approximate approximate Bayesian computation for inference in population-genetic models. Theoretical Population Biology 99: 31-42. [PDF]

Approximate Bayesian computation (ABC) methods perform inference on model-specific parameters of mechanistically motivated parametric models when evaluating likelihoods is difficult. Central to the success of ABC methods, which have been used frequently in biology, is computationally inexpensive simulation of data sets from the parametric model of interest. However, when simulating data sets from a model is so computationally expensive that the posterior distribution of parameters cannot be adequately sampled by ABC, inference is not straightforward. We present "approximate approximate Bayesian computation" (AABC), a class of computationally fast inference methods that extends ABC to models in which simulating data is expensive. In AABC, we first simulate a number of data sets small enough to be computationally feasible to simulate from the parametric model. Conditional on these data sets, we use a statistical model that approximates the correct parametric model and enables efficient simulation of a large number of data sets. We show that under mild assumptions, the posterior distribution obtained by AABC converges to the posterior distribution obtained by ABC, as the number of data sets simulated from the parametric model and the sample size of the observed data set increase. We demonstrate the performance of AABC on a population-genetic model of natural selection, as well as on a model of the admixture history of hybrid populations. The latter example illustrates how, in population genetics, AABC is of particular utility in scenarios that rely on conceptually straightforward but potentially slow forward-in-time simulations.
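
For readers unfamiliar with the ABC step that AABC builds on, the following is a minimal sketch of plain ABC rejection sampling on a toy binomial model. It is not the AABC method of the paper: it omits the approximating statistical model entirely, and the function name, prior, tolerance, and toy model are all assumptions chosen for illustration.

```python
import random


def abc_rejection(observed_successes, n_trials, n_draws=20000, tolerance=0, seed=0):
    """Plain ABC rejection sampling for a binomial success probability.

    Toy assumptions: uniform prior on [0, 1], the summary statistic is
    the number of successes, and a draw is accepted when the simulated
    count is within `tolerance` of the observed count.
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.random()  # draw a parameter value from the prior
        # Simulate one data set from the model given theta.
        simulated = sum(rng.random() < theta for _ in range(n_trials))
        if abs(simulated - observed_successes) <= tolerance:
            accepted.append(theta)  # keep parameters that reproduce the data
    return accepted


# Approximate posterior sample after observing 7 successes in 10 trials.
sample = abc_rejection(observed_successes=7, n_trials=10)
posterior_mean = sum(sample) / len(sample)
```

Each accepted draw requires a full simulation from the model, which is why the cost of simulation limits how well the posterior can be sampled; this is the bottleneck the abstract describes AABC as addressing.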

[123] F Disanto, NA Rosenberg (2014) On the number of ranked species trees producing anomalous ranked gene trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics 11: 1229-1238. [PDF]

Analysis of probability distributions conditional on species trees has demonstrated the existence of anomalous ranked gene trees (ARGTs), ranked gene trees that are more probable than the ranked gene tree that accords with the ranked species tree. Here, to improve the characterization of ARGTs, we study enumerative and probabilistic properties of two classes of ranked labeled species trees, focusing on the presence or avoidance of certain subtree patterns associated with the production of ARGTs. We provide exact enumerations and asymptotic estimates for cardinalities of these sets of trees, showing that as the number of species increases without bound, the fraction of all ranked labeled species trees that are ARGT-producing approaches 1. This result extends beyond earlier existence results to provide a probabilistic claim about the frequency of ARGTs.

[122] A Goldberg, P Verdu, NA Rosenberg (2014) Autosomal admixture levels are informative about sex bias in admixed populations. Genetics 198: 1209-1229. [PDF]

[121] MD Edge, NA Rosenberg (2014) Upper bounds on FST in terms of the frequency of the most frequent allele and total homozygosity: the case of a specified number of alleles. Theoretical Population Biology 97: 20-34. [PDF]

FST is one of the most frequently used indices of genetic differentiation among groups. Though FST takes values between 0 and 1, authors going back to Wright have noted that under many circumstances, FST is constrained to be less than 1. Recently, we showed that at a genetic locus with an unspecified number of alleles, FST for two subpopulations is strictly bounded from above by functions of both the frequency of the most frequent allele (M) and the homozygosity of the total population (HT). In the two-subpopulation case, FST can equal one only when the frequency of the most frequent allele and the total homozygosity are 1/2. Here, we extend this work by deriving strict bounds on FST for two subpopulations when the number of alleles at the locus is specified to be I. We show that restricting to I alleles produces the same upper bound on FST over much of the allowable domain for M and HT, and we derive more restrictive bounds in the windows M ∈ [1/I,1/(I-1)) and HT ∈ [1/I,I/(I^2-1)). These results extend our understanding of the behavior of FST in relation to other population-genetic statistics.
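
The statement that FST can equal one only when M and HT are both 1/2 can be checked on a small example. The sketch below assumes a common homozygosity-based definition, FST = (HS - HT)/(1 - HT), with two equally weighted subpopulations, where HS is the mean within-subpopulation homozygosity and HT is the homozygosity of the pooled population; this definition and the function name are assumptions, not quoted from the abstract.

```python
from fractions import Fraction


def fst_two_subpops(freqs1, freqs2):
    """Return (FST, M, HT) for allele frequencies in two subpopulations.

    Assumes equal subpopulation weights and FST = (HS - HT) / (1 - HT),
    with HS and HT measured as homozygosities (see lead-in).
    """
    p1 = [Fraction(x) for x in freqs1]
    p2 = [Fraction(x) for x in freqs2]
    if len(p1) != len(p2):
        raise ValueError("both subpopulations need the same list of alleles")
    if sum(p1) != 1 or sum(p2) != 1:
        raise ValueError("allele frequencies must sum to 1 in each subpopulation")
    total = [(a + b) / 2 for a, b in zip(p1, p2)]   # pooled allele frequencies
    hom_s = (sum(a * a for a in p1) + sum(b * b for b in p2)) / 2
    hom_t = sum(t * t for t in total)
    if hom_t == 1:
        raise ValueError("FST is undefined when the locus is monomorphic")
    m = max(total)                                  # frequency of the most frequent allele
    return (hom_s - hom_t) / (1 - hom_t), m, hom_t


# Subpopulations fixed for different alleles: FST = 1 with M = HT = 1/2.
fixed = fst_two_subpops([1, 0], [0, 1])
# Partial differentiation at the same pooled frequencies: FST = 1/4.
partial = fst_two_subpops(["3/4", "1/4"], ["1/4", "3/4"])
```

Exact fractions are used so that the boundary case FST = 1 is not blurred by floating-point rounding.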

[120] P Verdu, TJ Pemberton, R Laurent, BM Kemp, A Gonzalez-Oliver, C Gorodesky, CE Hughes, MR Shattuck, B Petzelt, J Mitchell, H Harry, T William, R Worl, JS Cybulski, NA Rosenberg, RS Malhi (2014) Patterns of admixture and population structure in native populations of northwest North America. PLoS Genetics 10: e1004530. [PDF] [Supplement]

[119] TJ Pemberton, NA Rosenberg (2014) Population-genetic influences on genomic estimates of the inbreeding coefficient: a global perspective. Human Heredity 77: 37-48. [PDF] [Supplementary Figure 1] [Supplementary Figure 2] [Supplementary Table 1] [Supplementary Table 2] [Supplementary Table 3]

Background/Aims: Culturally driven marital practices provide a key instance of an interaction between social and genetic processes in shaping patterns of human genetic variation, producing, for example, increased identity by descent through consanguineous marriage. A commonly used measure to quantify identity by descent in an individual is the inbreeding coefficient, a quantity that reflects not only consanguinity, but also other aspects of kinship in the population to which the individual belongs. Here, in populations worldwide, we examine the relationship between genomic estimates of the inbreeding coefficient and population patterns in genetic variation. Methods: Using genotypes at 645 microsatellites, we compare inbreeding coefficients from 5,043 individuals representing 237 populations worldwide to demographic consanguinity frequency estimates available for 26 populations as well as to other quantities that can illuminate population-genetic influences on inbreeding coefficients. Results: We observe higher inbreeding coefficient estimates in populations and geographic regions with known high levels of consanguinity or genetic isolation and in populations with an increased effect of genetic drift and decreased genetic diversity with increasing distance from Africa. For the small number of populations with specific consanguinity estimates, we find a correlation between inbreeding coefficients and consanguinity frequency (r=0.349, p=0.040). Conclusions: The results emphasize the importance of both consanguinity and population-genetic factors in influencing variation in inbreeding coefficients, and they provide insight into factors useful for assessing the effect of consanguinity on genomic patterns in different populations.

[118] CV Than, NA Rosenberg (2014) Mean deep coalescence cost under exchangeable probability distributions. Discrete Applied Mathematics 174: 11-26. [PDF]

We derive formulas for mean deep coalescence cost, for either a fixed species tree or a fixed gene tree, under probability distributions that satisfy the exchangeability property. We then apply the formulas to study mean deep coalescence cost under two commonly used exchangeable models - the uniform and Yule models. We find that mean deep coalescence cost, for either a fixed species tree or a fixed gene tree, tends to be larger for unbalanced trees than for balanced trees. These results provide a better understanding of the deep coalescence cost, as well as allow for the development of new species tree inference criteria.

[117] M DeGiorgio, J Syring, AJ Eckert, AI Liston, R Cronn, DB Neale, NA Rosenberg (2014) An empirical evaluation of two-stage species tree inference strategies using a multilocus dataset from North American pines. BMC Evolutionary Biology 14: 67. [PDF] [Supplementary File 1 (.xlsx, accession numbers)] [Supplementary File 2 (.pdf, supplementary analyses)] [Supplementary File 3 (.zip, data)]

BACKGROUND. As it becomes increasingly possible to obtain DNA sequences of orthologous genes from diverse sets of taxa, species trees are frequently being inferred from multilocus data. However, the behavior of many methods for performing this inference has remained largely unexplored. Some methods have been proven to be consistent given certain evolutionary models, whereas others rely on criteria that, although appropriate for many parameter values, have peculiar zones of the parameter space in which they fail to converge on the correct estimate as data sets increase in size.
RESULTS. Here, using North American pines, we empirically evaluate the behavior of 24 strategies for species tree inference using three alternative outgroups (72 strategies total). The data consist of 120 individuals sampled in eight ingroup species from subsection Strobus and three outgroup species from subsection Gerardianae, spanning ~47 kilobases of sequence at 121 loci. Each "strategy" for inferring species trees consists of three features: a species tree construction method, a gene tree inference method, and a choice of outgroup. We use multivariate analysis techniques such as principal components analysis and hierarchical clustering to identify tree characteristics that are robustly observed across strategies, as well as to identify groups of strategies that produce trees with similar features. We find that strategies that construct species trees using only topological information cluster together and that strategies that use additional non-topological information (e.g., branch lengths) also cluster together. Strategies that utilize more than one individual within a species to infer gene trees tend to produce estimates of species trees that contain clades present in trees estimated by other strategies. Strategies that use the minimize-deep-coalescences criterion to construct species trees tend to produce species tree estimates that contain clades that are not present in trees estimated by the Concatenation, RTC, SMRT, STAR, and STEAC methods, and that in general are more balanced than those inferred by these other strategies.
CONCLUSIONS. When constructing a species tree from a multilocus set of sequences, our observations provide a basis for interpreting differences in species tree estimates obtained via different approaches that have a two-stage structure in common, one step for gene tree estimation and a second step for species tree estimation. The methods explored here employ a number of distinct features of the data, and our analysis suggests that recovery of the same results from multiple methods that tend to differ in their patterns of inference can be a valuable tool for obtaining reliable estimates.

[116] EM Jewett, NA Rosenberg (2014) Theory and applications of a deterministic approximation to the coalescent model. Theoretical Population Biology 93: 14-29. [PDF]

Under the coalescent model, the random number nt of lineages ancestral to a sample is nearly deterministic as a function of time when nt<