The following files describe the exact data used in the Science article "Genetic structure of human populations." As described in the article and its supplementary information, there are some minor differences between the set of individuals we studied, the set in the HGDP-CEPH Human Genome Diversity Cell Line Panel, and the set genotyped by the Mammalian Genotyping Service. We therefore provide the data in the form in which we used them for the analysis.

With questions about these files, please contact me.

Noah Rosenberg
November 17, 2002
(NEXUS file added December 28, 2002)
(Allele frequencies added June 1, 2003)

-------------------------------------------------------------------

1. diversitydata.stru

This file includes the exact data used in the paper. The format is that used by the structure program. The first line gives the list of loci. After the first line, each individual is listed on two consecutive lines. The first five columns include the following information:

(1) Individual code number assigned by CEPH.
(2) Population code number assigned by us.
(3) Population name.
(4) Geographic information about the population.
(5) Pre-defined region, as was used in the article.

The next columns contain genotypes (measured in base pairs). The left-to-right order of the genotypes corresponds to the left-to-right order of the locus names on the first line of the file. The placement of genotypes on the first versus second line for an individual is arbitrary. Missing data is denoted by "-9".

-------------------------------------------------------------------

2. diversitydata.nex

This file includes the exact data used in the paper. The format is that used by the GDA program. This format, the NEXUS format, is further described on the GDA website. Briefly, each locus is listed on its own line. Each individual is then listed on a single line. Individuals are coded using their population names and their code numbers as assigned by CEPH.
From left to right, diploid genotypes (measured in base pairs) follow the top-to-bottom order of the loci. Missing data is denoted by "?". At the bottom of the file are three "hierarchies," which correspond to different groupings of populations into regions for analysis of molecular variance.

-------------------------------------------------------------------

3. diversitycodes.txt

This file contains code numbers that have been assigned to the populations. The columns include the following information:

(1) Population code number.
(2) Population name.
(3) Geographic information about the population.
(4) Pre-defined region, as was used in the article.

-------------------------------------------------------------------

4. diversityloci.txt

This file contains a list of loci, with two names given to each locus. The first column is the "locus name" and the second column is the "marker name," as described on the web page of the Marshfield Screening Sets. Some loci were not given "locus names" on the Marshfield web page; for the purposes of the study only, we have assigned these loci names that begin with "NA". The columns of this file include the following information:

(1) Locus name.
(2) Marker name.
(3) Size of the repeated unit (2, 3, or 4 base pairs).

-------------------------------------------------------------------

5. diversitydata.freqs

This file contains the count estimates of allele frequencies based on the individual data. Each line gives the frequency of an allele in a population. Locus names are the same as in diversityloci.txt. The columns of this file include the following information:

(1) Locus name.
(2) Allele (measured in base pairs).
(3) Population.
(4) Estimated frequency.
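
-------------------------------------------------------------------

Example: reading diversitydata.stru

The layout described in section 1 (a header line of locus names, then two lines per individual with five leading columns and "-9" for missing data) can be read with a short Python script. The sketch below is illustrative only and is not part of the original data release. It assumes that columns are separated by whitespace, that none of the five leading columns contains embedded spaces, and that both lines for an individual repeat the same five leading columns. The function names and the toy data (locus names LOC1 and LOC2, population PopA) are made up for the example. The frequency function simply counts non-missing alleles per locus and population; it is one plausible reading of "count estimates" and may not reproduce diversitydata.freqs exactly.

```python
from collections import Counter, defaultdict

MISSING = -9  # missing-data code stated for diversitydata.stru
N_META = 5    # individual code, population code, name, geography, region

# Made-up toy data in the described layout (not taken from the real file).
SAMPLE = """\
LOC1 LOC2
1 10 PopA Place REGION 100 200
1 10 PopA Place REGION 102 -9
2 10 PopA Place REGION 100 200
2 10 PopA Place REGION 100 204
"""


def parse_stru(lines):
    """Return (loci, individuals) from structure-format text lines."""
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        raise ValueError("no data found")
    loci, body = rows[0], rows[1:]
    if len(body) % 2:
        raise ValueError("expected two lines per individual")
    individuals = []
    # Pair up consecutive lines: (line 1, line 2), (line 3, line 4), ...
    for first, second in zip(body[0::2], body[1::2]):
        if first[:N_META] != second[:N_META]:
            raise ValueError(f"mismatched line pair: {first[:N_META]}")
        a = [int(x) for x in first[N_META:]]
        b = [int(x) for x in second[N_META:]]
        if len(a) != len(loci) or len(b) != len(loci):
            raise ValueError("genotype count does not match locus count")
        # One (allele, allele) tuple per locus; None marks missing data.
        genotypes = {
            locus: tuple(None if v == MISSING else v for v in pair)
            for locus, pair in zip(loci, zip(a, b))
        }
        code, pop_code, pop_name, geography, region = first[:N_META]
        individuals.append({
            "code": code,
            "population_code": pop_code,
            "population": pop_name,
            "geography": geography,
            "region": region,
            "genotypes": genotypes,
        })
    return loci, individuals


def allele_frequencies(individuals):
    """Count-based allele frequencies keyed by (locus, population)."""
    counts = defaultdict(Counter)
    for ind in individuals:
        for locus, pair in ind["genotypes"].items():
            for allele in pair:
                if allele is not None:  # skip missing data
                    counts[(locus, ind["population"])][allele] += 1
    freqs = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        freqs[key] = {allele: n / total for allele, n in counter.items()}
    return freqs


loci, individuals = parse_stru(SAMPLE.splitlines())
frequencies = allele_frequencies(individuals)
```

To apply the same functions to the real file, pass an open file object instead of the toy string, for example parse_stru(open("diversitydata.stru")), after checking that the whitespace assumption above holds for the copy you have.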