These files describe the datasets used for the paper "An empirical evaluation
of phylogenetic strategies using a multilocus dataset from North American
pines" by M DeGiorgio et al. (BMC Evol Biol XX:XX-XX [2014]).

Michael DeGiorgio
January 16, 2014
mxd60@psu.edu

--------------------------------------------------------------------------------

Legend:

Short code    Long code         Species name
----------    --------------    ---------------
PIAL          P_albicaulis      P. albicaulis
PIAY          P_ayacahuite      P. ayacahuite
PIFL          P_flexilis        P. flexilis
PILA          P_lambertiana     P. lambertiana
PIMN          P_monticola       P. monticola
PISB          P_strobiformis    P. strobiformis
PIST          P_strobus         P. strobus
PICH          P_chiapensis      P. chiapensis
PIGE          P_geradiana       P. gerardiana
PIBU          P_bungeana        P. bungeana
PISQ          P_squamata        P. squamata

Each individual among the 120 individuals considered by DeGiorgio et al. is
given the same name in all files in which the individual appears.

A total of 121 loci were in the full dataset D_p. The dataset D_s also contains
121 loci, but with one sequence per species. Datasets D_{p,0} and D_{s,0}
enforce a sequence difference constraint for inclusion of a locus, and
therefore have fewer than 121 loci.

--------------------------------------------------------------------------------

1. Directory Dataset_Dp

This folder includes the full dataset used by DeGiorgio et al. (2014).
Each folder contains a list of files, where the file name is the name of a
particular sequenced genetic locus. Each file is in FASTA format. Each sequence
is represented by a pair of consecutive lines, with the first line being the
sequence name and the second line being the sequence. The species to which an
individual sequence belongs is denoted by a substring chosen from among the
short codes above.

--------------------------------------------------------------------------------

2. Directory Dataset_Dp0

This folder includes the D_{p,0} dataset used by DeGiorgio et al. (2014).
Each folder contains a list of files, where the file name is the name of a
particular sequenced genetic locus. Each file is in FASTA format. Each sequence
is represented by a pair of consecutive lines, with the first line being the
sequence name and the second line being the sequence. The species to which an
individual sequence belongs is denoted by a substring chosen from among the
short codes above.

--------------------------------------------------------------------------------

3. Directory Dataset_Ds

This folder includes the D_s dataset used by DeGiorgio et al. (2014).
Each folder contains a list of files, where the file name is the name of a
particular sequenced genetic locus. Each file contains 11 lines, where each
line corresponds to one of the 11 species (eight ingroup and three outgroup)
considered in the study. Each line is formatted with the species' long code
followed by a tab and then the sequence for that species.

--------------------------------------------------------------------------------

4. Directory Dataset_Ds0

This folder includes the D_{s,0} dataset used by DeGiorgio et al. (2014).
Each folder contains a list of files, where the file name is the name of a
particular sequenced genetic locus. Each file contains 11 lines, where each
line corresponds to one of the 11 species (eight ingroup and three outgroup)
considered in the study. Each line is formatted with the species' long code
followed by a tab and then the sequence for that species.

--------------------------------------------------------------------------------
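
Example: reading the two file layouts

The two layouts described above (name/sequence line pairs in Dataset_Dp and
Dataset_Dp0, and tab-separated "long code, sequence" lines in Dataset_Ds and
Dataset_Ds0) could be read with a short Python sketch such as the one below.
This is an illustration only, not code distributed with the dataset. The
function names, the sample sequence names (PIAL_ind01, PIST_ind02), and the
sample sequences are hypothetical, and the sketch assumes that name lines may
carry the usual FASTA ">" prefix, which it removes if present.

```python
# Short codes taken from the legend of this README.
SHORT_CODES = ("PIAL", "PIAY", "PIFL", "PILA", "PIMN", "PISB",
               "PIST", "PICH", "PIGE", "PIBU", "PISQ")


def parse_two_line_fasta(text):
    """Parse text in which each record is a name line followed by one sequence line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected an even number of non-empty lines")
    records = []
    for name_line, sequence in zip(lines[0::2], lines[1::2]):
        # Assumption: a leading ">" marks the name line; drop it if present.
        records.append((name_line.lstrip(">").strip(), sequence))
    return records


def species_short_code(sequence_name):
    """Return the one short code that occurs as a substring of the sequence name."""
    matches = [code for code in SHORT_CODES if code in sequence_name]
    if len(matches) != 1:
        raise ValueError(f"cannot assign a species to {sequence_name!r}")
    return matches[0]


def parse_species_table(text):
    """Parse 'long code<TAB>sequence' lines into a dict keyed by long code."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        long_code, sequence = line.split("\t", 1)
        table[long_code.strip()] = sequence.strip()
    return table


# Hypothetical file contents, used only to exercise the functions.
fasta_example = ">PIAL_ind01\nACGTAC\n>PIST_ind02\nACGTTC\n"
table_example = "P_albicaulis\tACGTAC\nP_strobus\tACGTTC\n"

records = parse_two_line_fasta(fasta_example)
species = [species_short_code(name) for name, _ in records]
by_species = parse_species_table(table_example)
```

To apply the functions to a real locus file, read the file into a string first
(for example with pathlib.Path(...).read_text()) and pass that string in.
Raising an error when a name matches zero or several short codes is a design
choice of this sketch, so that an unexpected name is reported rather than
silently assigned to a species.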