Basic population genomic analyses

Note: This tutorial is written to teach some basic population genomic analyses for NRES 721. We’ll do these exercises entirely in R, so folks should be able to follow along regardless of their operating system. Students, you can download all of the data we’ll use by using the svn checkout https://github.com/twpierson/nres721_popgen_phylo/trunk/data command in your working directory, from the /data directory in the Github repository here, or from Dropbox by clicking here.

Introduction

In the last tutorial, we assembled a RADseq-style (RADcap) dataset from the patch-nosed salamander (Urspelerpes brucei) in ipyrad. We ended up with a total of 884 loci (and 1110 SNPs) in our de novo assembly and 761 loci (and 1755 SNPs) in our reference-based assembly. As you’ve discussed earlier in the semester, there are many inferences you can make from these population genetic data (and a multitude of methods for doing so!). Today, we’ll apply some of the methods you’ve already learned and discuss how large datasets—and in 2019, several hundred loci barely qualifies as “large”!—may present unique challenges. To do this, we’ll work entirely in R.

First, let’s load the packages you need. If you don’t already have these installed, do that first.

library(phangorn)
library(adegenet)
library(hierfstat)
library(scales)

Then, set your working directory. This should be the folder in which you have the \data directory we downloaded earlier.

setwd("[your working directory]")

Neighbor-joining tree

For an easy, preliminary visualization of our data, let’s build a neighbor-joining tree. This will (very quickly) estimate a topology from a distance matrix, which we’ll calculate from our SNP data. There are many good reasons not to ultimately read too much into this topology, but it can provide a good jumping-off point to check our data.

First, we’ll read in one of the output files from the ipyrad de novo assembly. The file extension usnps tells us that this is a PHYLIP file consisting of one random SNP per locus.

ubrucei_phydat <- read.phyDat("data/ubrucei_denovo.usnps", format="phy", type="DNA")

Discussion: Why might we only include one SNP per locus?

Let’s examine this object we’ve just created.

ubrucei_phydat

## 12 sequences with 396 character and 385 different site patterns.
## The states are a c g t

Discussion: This suggests that we have 396 “character patterns”. What does that mean? Why is this number smaller than the number of loci we have?

From these data, we’ll first create a distance matrix.

ubrucei_dm <- dist.ml(ubrucei_phydat, exclude="pairwise")

From this distance matrix, we’ll create a neighbor-joining tree.

ubrucei_NJtree <- NJ(ubrucei_dm)

And we’ll now plot that as an unrooted tree, coloring the tips to match the collection locality of each sample.

plot.phylo(ubrucei_NJtree, cex=0.9, type='u',
     tip.col = c(rep("green",3), rep("red",3), rep("cornflowerblue",3), rep("purple",3)),
     font = 2, x.lim = c(-0.1, 0.4))

Discussion: Just examining this visually, what do you notice? Does this serve as a good “sanity check” for our data?

DAPC

This neighbor-joining tree provides a hint that our collection localities may be genetically differentiated, but it’s a very qualitative evaluation. Next, we’ll try a method to estimate how many “populations” or “clusters” actually exist among these data. (Of course, understanding the biological meaning of a “population” or “cluster” here might also involve some critical thinking!)

There are a variety of programs—with a diversity of underlying philosophies and statistical methods—to conduct this kind of analysis. For example, you used Structure in an earlier class. Today, we’ll try a discriminant analysis of principal components (DAPC). This method estimates the number of “genetic clusters” in the data and then estimates probabilistic individual assignments to each cluster. For a much more detailed description of these analyses and a tutorial of how to conduct them, see here.

For this analysis, let’s read in a different output file from ipyrad. The file extension (.str) tells us that this is a Structure-formatted output file, and it contains data for all variable sites. We need to provide the read.structure() function a few more pieces of information (e.g., number of individuals, number of loci), which we can glean from the summary statistics file we viewed in the last tutorial.

ubrucei_strdat <- read.structure(file="data/ubrucei_denovo.str",
                         n.ind = 12, n.loc = 1110,
                         onerowperind = FALSE, col.lab=1,
                         col.pop=0, NA.char="-9",col.other=0,
                         row.marknames=0)

The first step of the DAPC is to estimate the number of clusters. We’ll use the find.clusters() function to do this, and it’ll display interactive plots to help use choose the numbers of PCs to retain and the number of clusters. We’ll also set the seed to make sure this example is reproducible among all of us.

set.seed(1234)
ubrucei_clust <- find.clusters(ubrucei_strdat, max.n.clust = 5)

This plot shows the cumulative variance explained by including progressively more principal components. It looks like there’s really no plateau of cumulative variance, so we can retain all 11 PCs. Enter “11” into the prompt.

This plot helps us evaluate how many “clusters” best represent our data; here, we’re looking for when the BIC value is minimized (or stops decreasing dramatically). The BIC value continues to drop through k = 4, but then levels off. Go ahead and enter “4” into the prompt. (Note: This can be a rather subjective evaluation, and in practice, it’s useful to evaluate more than one value of k. Here, our judgment is also biased by the fact that we know these samples came from four different collection localites!)

That allowed us to interactively choose these parameters, but now that we know which we want to use, we could run the find.clusters() function again with these parameter values included.

set.seed(1234)
ubrucei_clust <- find.clusters(ubrucei_strdat, n.pca = 11, n.clust = 4)

Now, we’ll conduct the discriminant analysis. Because we have a small number of PCs and clusters, we’ll retain all of them here (i.e., enter “11” for n.pca and “3” for n.da).

set.seed(1234)
ubrucei_dapc <- dapc(ubrucei_strdat, ubrucei_clust$grp, n.pca = 11, n.da = 3)

We can glean lots of interesting information from this analysis. Perhaps the most important result for us is in the “posterior group assignments”. We can plot these results in a Structure-like barplot.

par(xpd = TRUE)
compoplot(ubrucei_dapc, xlab="individuals", 
          col = c("red","cornflowerblue","green","purple"), legend = FALSE,
          show.lab = TRUE)
legend(-0.3, 1.1, legend=paste("Cluster", 1:4), fill=c("red","cornflowerblue","green","purple"), ncol = 4, cex = 0.7)

Discussion: What do these results suggest to you? Are they consistent with the neighbor-joining tree?

Diversity, differentiation, and more

This analysis provides some evidence of differentiation between populations of Urspelerpes at these different collection localities. But how different are they? And how genetically diverse are they? Note: because we have such small sample sizes here, we should interpret all of these results with caution!

Let’s calculate some common summary statistics to characterize genetic diversity within and among these clusters. First, we can calculate pairwise F_st between each cluster. To do that, we need to assign “populations” to our samples; we can use the cluster assignments from our DAPC. Remember that these numbers aren’t in the same order as the collection localities (e.g., here, the first collection locality is “cluster 3”).

ubrucei_strdat$pop <- ubrucei_dapc$grp
(matFst <- pairwise.fst(ubrucei_strdat,res.type="matrix"))

          1         2         3         4
1 0.0000000 0.3515160 0.2374987 0.2352650
2 0.3515160 0.0000000 0.2641719 0.2756070
3 0.2374987 0.2641719 0.0000000 0.1454361
4 0.2352650 0.2756070 0.1454361 0.0000000

Now, let’s calculate some basic measures of genetic diversity. Let’s use the within-population gene diversity (i.e., H_s) and observed heterozygosity (H_o) to get a glimpse of this. We can use the basic.stats() function in the hierfstat package.

ubrucei_stats <- basic.stats(ubrucei_strdat, diploid=TRUE, digits=3)

We can plot H_s for each population.

par(mfrow=c(1,4))
hist(ubrucei_stats$Hs[,1], col = "red", main = "cluster 1", 
     ylab = "# SNPs", xlab = "gene diversity", ylim = c(0,1000))
text(x = 0.6, y = 600, labels = paste("mean =", round(mean(ubrucei_stats$Hs[,1], na.rm = TRUE), 2)))
hist(ubrucei_stats$Hs[,2], col = "cornflowerblue", main = "cluster 2", 
     ylab = "# SNPs", xlab = "gene diversity", ylim = c(0,1000))
text(x = 0.6, y = 600, labels = paste("mean =", round(mean(ubrucei_stats$Hs[,2], na.rm = TRUE), 2)))
hist(ubrucei_stats$Hs[,3], col = "green", main = "cluster 3", 
     ylab = "# SNPs", xlab = "gene diversity", ylim = c(0,1000))
text(x = 0.6, y = 600, labels = paste("mean =", round(mean(ubrucei_stats$Hs[,3], na.rm = TRUE), 2)))
hist(ubrucei_stats$Hs[,4], col = "purple", main = "cluster 4", 
     ylab = "# SNPs", xlab = "gene diversity", ylim = c(0,1000))
text(x = 0.6, y = 600, labels = paste("mean =", round(mean(ubrucei_stats$Hs[,4], na.rm = TRUE), 2)))

… and H_o for each population.

par(mfrow=c(1,4))
hist(ubrucei_stats$Ho[,1], col = "red", main = "cluster 1", 
     ylab = "# SNPs", xlab = "observed heterozygosity", ylim = c(0,1000))
text(x = 0.6, y = 600, labels = paste("mean =", round(mean(ubrucei_stats$Ho[,1], na.rm = TRUE), 2)))
hist(ubrucei_stats$Ho[,2], col = "cornflowerblue", main = "cluster 2", 
     ylab = "# SNPs", xlab = "observed heterozygosity", ylim = c(0,1000))
text(x = 0.6, y = 600, labels = paste("mean =", round(mean(ubrucei_stats$Ho[,2], na.rm = TRUE), 2)))
hist(ubrucei_stats$Ho[,3], col = "green", main = "cluster 3", 
     ylab = "# SNPs", xlab = "observed heterozygosity", ylim = c(0,1000))
text(x = 0.6, y = 600, labels = paste("mean =", round(mean(ubrucei_stats$Ho[,3], na.rm = TRUE), 2)))
hist(ubrucei_stats$Ho[,4], col = "purple", main = "cluster 4", 
     ylab = "# SNPs", xlab = "observed heterozygosity", ylim = c(0,1000))
text(x = 0.6, y = 600, labels = paste("mean =", round(mean(ubrucei_stats$Ho[,4], na.rm = TRUE), 2)))

Discussion: What patterns do you detect here? Are they consistent?

Isolation-by-distance

We may wonder whether the apparent differention between sampling localities can be explained by geographic distance. We can test for isolation-by-distance using a Mantel test. Because this is a rare, sensitive species, we won’t use real coordinates, but I’ll (generally) preserve the spatial relationship between sites and center them over Sanford Stadium (go Dawgs!). Below are our artificial coordinates; note that for all samples within a site, the coordinates are the same.

coords <- rbind(c(33.949966,-83.374063), c(33.949966,-83.374063), c(33.949966,-83.374063),
                c(33.944970, -83.373114), c(33.944970, -83.373114), c(33.944970, -83.373114),
                c(33.950171, -83.373929), c(33.950171, -83.373929), c(33.950171, -83.373929),
                c(33.949900, -83.372304), c(33.949900, -83.372304), c(33.949900, -83.372304))

Here’s a quick visualization of their positions to orient ourselves:

Let’s use the function mantel.randtest to conduct a Mantel test. Note: for speed and simplicity, we’re going to use simple euclidean distances calculated with the dist() function, but we could instead use F_st, etc. We can examine the results of the test through the returned object (mtest) or visually by plotting them.

(mtest <- mantel.randtest(dist(ubrucei_strdat), dist(coords)))

## Monte-Carlo test
## Call: mantel.randtest(m1 = dist(ubrucei_strdat), m2 = dist(coords))
## 
## Observation: 0.7359404 
## 
## Based on 999 replicates
## Simulated p-value: 0.001 
## Alternative hypothesis: greater 
## 
##     Std.Obs Expectation    Variance 
## 5.973736708 0.002443181 0.015076648

par(mfrow=c(1,1))
plot(mtest, nclass = 50)

Discussion: What do these results mean? Does this make sense?

We can visualize the relationship between geographic distance and genetic distance by plotting these two variables.

plot(dist(coords),dist(ubrucei_strdat), pch = 19, col = alpha("black",0.3),
     xlab = "geographic distance", ylab = "genetic distance")

Optional exercise

Now that we’ve completed these analyses using our de novo assembly, repeat them using the reference-based assembly. Hint: the .str file includes the reference, so you’ll need to modify a few functions (and later subset the data) to remove this.

Discussion: Do these results differ meaningfully?

Next week, we’ll examine what additional information we can learn when we have a reference genome available.