Utility of full-genome assemblies

Note: This brief tutorial is written for UNR’s NRES 721 as a basic introduction to the utility of full-genome data. Students, if you wish to follow along, you can download all of the data we’ll use by typing the svn checkout https://github.com/twpierson/nres721_genome/trunk/data command in your working directory, from the /data directory in the Github repository here, or from Dropbox by clicking on the link here.

Introduction

In this tutorial, we’re going to introduce full-genome assemblies and briefly discuss how they can be useful in population genetic and phylogenetic studies. For practical reasons, we have to move away from salamanders (☹️), because the only chromosome-level reference genome is 32 Gb! Nobody wants to deal with that for a tutorial.

Instead, we’ll use some real data from the pygmy rabbit (Brachylagus idahoensis). This is a small, endangered species found in the western United States.

We generated 3RAD data for several individuals of this species, and we assembled them similar to how we earlier assembled Urspelerpes data for this course. Rather than start from these raw data, we’re going to map the final loci to the reference genome. First, let’s talk about reference genomes.

What’s in an assembled genome?

In our previous tutorials, we’ve “assembled” RADseq data into loci. However, we don’t know where those loci are physically located within a genome. When assembling full genomes, our primary goal is to take many sequence reads, overlap and map them, and end up with a full blueprint of the position of every nucleotide in the genome. In practice, this means aligning reads into contigs (i.e., contiguous sequences), assembling contigs into scaffolds (i.e., series of contigs, sometimes separated by gaps of known length), and eventually assembling those contigs into chromosomes, which are the natural physical units of DNA arrangement. This is often easier said than done, and many “genome assemblies” stop at the scaffold level.

There are now myriad methods for generating sequence reads to be assembled into a genome, and researchers often use several methods to get a diversity of data. For example, the inclusion of some extremely long reads can be important for covering gaps among contigs and assembling larger scaffolds. You’ve discussed some of these methods already on Monday!

Thus, simply put, a genome assembly is a map of the physical position of nucleotides on contigs, scaffolds, and chromosomes. Many times, a genome assembly you can download will be “annotated”, which means that additional notes are included to indicate the location and putative function of genes.

Getting a genome

The National Center for Biotechnology Information (NCBI) is a tremendous online resource for all kinds of genetic and genomic data—including full-genome assemblies. You can easily browse these resources through the NCBI website. Using the dropdown menu near the top, you can select “Genome” and then search for an organism of interest.

For example, search for “rabbit”. It should then pull up the full-genome assembly for the European rabbit (Oryctolagus cuniculus), which is indeed the closest relative to our pygmy rabbits with such an assembly. One nifty tool for seeing how closely related these two species are is TimeTree. If we enter Oryctolagus and Brachylagus into the two slots on the website and hit enter, we can see an estimate generated from published studies.

Moving back to the NCBI page, we can already learn a few things about this assembly. For example, the “Assembly level” indicates that this is a chromosome-level assembly. We can click on the link below that (i.e., the assembly name) to open a new page and learn more. On this page, we can see much more information about the assembly, including its coverage, the sequencing technology used to generate it, and some summary statistics.

Discussion: Hunt down the size of this full genome. How does this compare to the size of the human genome?

Scrolling even further, we can look at the assembly information for each chromosome. Notice the “RefSeq sequence” names; these will come in handy later when we map reads to the genome.

Discussion: Note that this view doesn’t show the mitochondrial genome. Where would we find that?

Finally, we’re ready to download the genome! We can do this by clicking the big blue “Download Assembly” button in the upper right hand corner. We want to keep the file type as “Genomic FASTA”. Note: You don’t actually need to do this during class. This file is > 800 Mb, and it’ll take a bit of time to download.

Mapping loci in `bwa`

As we learned two weeks ago, we could assemble our raw 3RAD reads against this genome in ipyrad. Today, we’re going to do something even simpler. We’ll take the final loci we assembled de novo and just map those against the genome assembly.

Discussion: What is a potential downside of assembling our data against the European rabbit genome?

Install `bwa` and `samtools`

There exist various tools for mapping reads or loci against a reference genome. One popular method is the Burrows-Wheeler Aligner (bwa). You can download bwa from here, and installation instructions are available in various places online. We’ll also use the program samtools to modify our output files; you can find installation instructions for this program here. For the sake of time, I’m going to demonstrate how to align loci to the genome today, and you can just watch on my screen. However, I’ve included all of the data necessary to do this yourself in the Github repository, so feel free to experiment on your own time.

Prepare reference genome

Our first task is to “index” the reference genome. In brief, this is a process in which your computer organizes the reference genome in a fashion that optimizes the computational efficiency of the mapping. Assuming that we’ve placed our reference genome in a directory called \reference within our current working directoy, we can index by using the following command (note: again, this takes some time, so don’t do this in class today):

bwa index reference/GCF_000003625.3_OryCun2.0_genomic.fna

If we were to peek into the reference directory, we can now see that there are many more files than just the original .fna file.

(cd reference/* && ls -1 *)

GCF_000003625.3_OryCun2.0_genomic.fna
GCF_000003625.3_OryCun2.0_genomic.fna.amb
GCF_000003625.3_OryCun2.0_genomic.fna.ann
GCF_000003625.3_OryCun2.0_genomic.fna.bwt
GCF_000003625.3_OryCun2.0_genomic.fna.pac
GCF_000003625.3_OryCun2.0_genomic.fna.sa

Map loci and export data

Before mapping our loci, let’s first take a look at how I’ve formatted our files.

head -n 10 data/Brid_R1.fasta

>120753/1
TGTCAGGGATGACAGAAGAAGNTAAATCACAAAGTGAAGAATTGGCACCAGCTGCCTCACCACTGCTCTAGCCACCTGCGAGGAGACAGGATATTTTCCTTGCCCGTGACTGTCTTTCCCTTACCA
>120749/1
TCACTCAAGGGACTTAATATACCATCTGATATTAGCTGTCTCACCTAGCAACTTATGAGACATCATTTCTAGTGGAGCTAATTGGAAACTGGTTTTACTCTGTAACTCACGTCTTTCTAAAGTTG
>120748/1
CTCCTTGTACAAACTGCAATTTAAAAACCCAGCCTGTCAGACTCCTGCTGGCCTGGCAGCTCCCCACAGGCAGGGACAATGTGTGTCTCTGGGCCACTAGCCCCTCTGGGCAGTGCAGGGCTGCC
>120742/1
CTCTACCCCATCCACCCTTCCCCGTCAYACATTCTGCGCACCCCTTGGCCTGCGATGCCCTTCGGAGACATCCTCCTTAGGGGATGCCTTTGTTCATCTTAACCWGATTTTCTGGACAAACGTTTG
>120740/1
TTCGAGTCCTGGCTGCTCCACTTCCGATCCAGCTCTCTGCTACGGMCTGGGAAAGCAGCAGAAGATGGCCCAMGTCCTTGGGCTCCTGCACCAGTGTGGGAGACGAGGARGAAGCTSCTGTGCAG

So, this is a FASTA with one line for each locus. In its paired file (Brid_R2.fasta), we have the same data for the R2 side of the loci. Now, we’re ready to map these loci against the reference genome, and we can do this with the bwa mem command. Below, we use this command and use samtools to modify the output and export it as a BAM file.

bwa mem reference/GCF_000003625.3_OryCun2.0_genomic.fna data/Brid_R1.fasta data/Brid_R2.fasta | samtools view -bS - | samtools sort - data/brid_R1_R2_mapped

Now, we’ll again use samtools to export our BAM file to a readable .txt file.

samtools view data/brid_R1_R2_mapped.bam > data/brid_R1_R2_mapped.txt

We can take a peek at this file by using the head command:

head -n 10 data/brid_R1R2_mapped.txt

82006   179 NC_013669.1 15534   52  75M12D56M   =   15721   179 CGATCCTAAGCTTTTTGAAAATTATTCTGGAATTTTTCCCTCTTAGCACATCTTATGTGAGCTATCCCAGAGAGCGGCTGGGCTGNAAGGGGCCTNAGATCCCTAACAAGTGTTTCTGTTGACTCCGCAAG *   NM:i:25 MD:Z:0G54C0A6G11^TGTACCGAGAGT5A4G5A3A0G8G10A3T4T5   AS:i:59 XS:i:20
82006   115 NC_013669.1 15721   60  73M1D16M8D36M   =   15534   -179    GGACAGATAGTCTACCACAACGCTTTTAGCTCGGCATCCTAATGTTGCACAACGGGAGGCTGAGGCGCAGAGTGGGGGNGGAACCACTGGCTGAGAGGAGGCAGGGCTGGGCGAGAGCGCAGGTT   *   NM:i:17 MD:Z:12C17C0T17G16C6^G5C10^ACGGATAT5C23T6   AS:i:67 XS:i:20
19809   67  NC_013669.1 22185   60  18S108M =   22367   183 AGAAAAGGCCAGAAAATGTTAGCAATGATTCTCTTTGGCAGATGGGATTATGGGTGAGTTCTTTTGTAAAACAACTGTGAAGCGCGCTCTGTTTCACAAGAGTGTGAATCCACTTAACGGTACTGA  *   NM:i:7  MD:Z:16G29C17T1T14C4C5T15   AS:i:73 XS:i:0
19809   131 NC_013669.1 22367   60  130M    =   22185   -183    CTCAGCAGTTAACCTACTGCTTGAGATAGCCATATCCCATTTGGGAGTGCCTGGGTTTAAGTCCTGACTGCCCCCTCTGGTCCTGTGTCCTACTAGTGAACACCCTGGGATGCAACAAATGCTTGGATCG  *   NM:i:10 MD:Z:4A7G10G4C49A4C21C8G6G7C0   AS:i:84 XS:i:26
85643   179 NC_013669.1 55595   60  130M    =   55808   209 CGATCCCATCACGTTCTCTGAGCAGAGGGACAGACGAGAGCTGGGGGAGCCCCAGGACAGAGCACTCCCCTCTGCACAAGTGNCCCTGCTTCNAGTTANGTGCACCTCATCCCCACTGCATAGATGAACA  *   NM:i:8  MD:Z:0G56G2C21C9C5C4G19C6   AS:i:103    XS:i:20
85643   115 NC_013669.1 55808   60  125M    =   55595   -209    TGGNTAATTCACCACGCAAGTAAGGAAAGACCCAGGTCTGACAGCCTTTGGATAGGGGAGGACACACTTCACCCTAACAGCTGGTGTCACTCTCAGATGCTCTAAGCTTGGGACAGTGCCTTCCT   *   NM:i:5  MD:Z:3T27T51T16G12T11   AS:i:103    XS:i:20
48623   67  NC_013669.1 63034   60  125M    =   63265   232 GACCCTATCTGGCTAAGACTGGAGGATACGAATTAGCTATGTAAAAACTCTGGAATAAAAACCTTCCTGTGTGTTCCCAAAGAAAGGAAAAGCAGCATCGGATTTCAGATGTCTTTCAGCCAACA   *   NM:i:5  MD:Z:13C0T11C11C1A84    AS:i:100    XS:i:20
48623   131 NC_013669.1 63265   60  66S62M2S    =   63034   -232    GAAATCTATCCCAGGTAGGAAACCCCACTAGGAGGAGACGCCACTTCCTTGGTAATAGCGCCAGACACAGCCTCACTTTCCTCTCCCCTGTGGCTCTGTGCTAAGAACCAACAGGCTGCCCAGAGGATCG  *   NM:i:2  MD:Z:10C13G37   AS:i:52 XS:i:19
33119   67  NC_013669.1 64116   60  33M2D92M    =   64261   146 CAGTTCTGCCAAACAGAGTGCGTACTGGGAACGCTCTTACCTCGCAGAGTGTGCAGGAGGTTACAGTCAGGAACAGACCAGAGCTTGCAAAGCCCACTCCTATCAAACACAAAGGAAAGCATCTC   *   NM:i:5  MD:Z:21A11^CT20A20G50   AS:i:102    XS:i:22
33119   131 NC_013669.1 64261   18  49M1D37M5I30M9S =   64116   -146    AAGGAAAAGGATCTTGGTAGCATCACTAACGGACAGTCTGTTCCCGCCCCAGCAGAGAAGGAGAAGGCTGTACCGCAAAGGGCTACAGAAGAGAAGGGCTAGCGCTGGCTCTTCGGCAACTGGGGGATCG  *   NM:i:19 MD:Z:19A9T9C5A3^G2A7A2A6A2A0T2G21A9A7   AS:i:33 XS:i:30

Discussion: Each line is a mapped location on the genome. What do you think the various columns represent? Hint: an answer to this question on Biostars is helpful.

Interpreting results

Filter loci

Next, we’ll move to R to view, filter, and plot some of these results. Students: you can follow along here!

First, let’s set working directory and read in this text file that we created from our mapped loci. We’ll do this using the readLines function, which is just in the default base package.

setwd("[your working directory]")

Brid_R1R2_BAM <- readLines("data/brid_R1R2_mapped.txt")

Next, we’re going to parse this object by tabs and keep only the bits of information we care about for today, which are the name of our locus and which chromosome it mapped to.

Brid_dat <- matrix(nrow=length(Brid_R1R2_BAM),ncol = 2)
for(i in 1:length(Brid_R1R2_BAM)){
    temp <- strsplit(Brid_R1R2_BAM[i],"\t")[[1]]
    Brid_dat[i,] <- c(temp[c(1,3)])
}

Let’s preview this object to see what it looks like.

Brid_dat[1:30,]

      [,1]    [,2]         
 [1,] "82006" "NC_013669.1"
 [2,] "82006" "NC_013669.1"
 [3,] "19809" "NC_013669.1"
 [4,] "19809" "NC_013669.1"
 [5,] "85643" "NC_013669.1"
 [6,] "85643" "NC_013669.1"
 [7,] "48623" "NC_013669.1"
 [8,] "48623" "NC_013669.1"
 [9,] "33119" "NC_013669.1"
[10,] "33119" "NC_013669.1"
[11,] "72550" "NC_013669.1"
[12,] "72550" "NC_013669.1"
[13,] "92502" "NC_013669.1"
[14,] "92502" "NC_013669.1"
[15,] "54211" "NC_013669.1"
[16,] "10007" "NC_013669.1"
[17,] "10007" "NC_013669.1"
[18,] "4137"  "NC_013669.1"
[19,] "4137"  "NC_013669.1"
[20,] "4137"  "NC_013669.1"
[21,] "11760" "NC_013669.1"
[22,] "11760" "NC_013669.1"
[23,] "61443" "NC_013669.1"
[24,] "61443" "NC_013669.1"
[25,] "71851" "NC_013669.1"
[26,] "71851" "NC_013669.1"
[27,] "11903" "NC_013669.1"
[28,] "81007" "NC_013669.1"
[29,] "27264" "NC_013669.1"
[30,] "27264" "NC_013669.1"

What if we’re only interested in loci that mapped to the genome? We can remove other loci by filtering out those with an asterisk in the second column.

Brid_mapped <- Brid_dat[Brid_dat[,2]!="*",]

We can now check what proportion of our dataset was removed.

1-(dim(Brid_mapped)[1]/dim(Brid_dat)[1])

[1] 0.02914602

We might be tempted to interpret this as meaning that ~97% of our loci mapped to the genome. However, we probably have some entries from where a single locus mapped to more than one place! In reality, that happened relative few times in this dataset.

Discussion: What might it mean if a locus mapped to more than one place in the genome?

However, we do have two rows for each locus—one each for R1 and R2. Let’s simplify our data and keep just one entry for each locus (note: this will also remove times when a locus has mapped to more than one place in the genome!).

Brid_mapped_once <- Brid_mapped[!duplicated(Brid_mapped[,1]),]

Discussion: What proportion of our data did we remove at this step?

Plot mapped loci.

One simple question we might have is the distribution of our loci across the genome. Let’s plot the number of loci that map to each of our chromosomes or scaffolds.

barplot(table(as.character(Brid_mapped_once[,2])),
        col = "orange", xlab = "", ylab = "",
        main = "3RAD loci assembled against genome", xaxt = 'n', space = 0)
axis (side = 1, labels = c("",""),
      las = 2, at = 0.5+c(0,c(length(unique(Brid_mapped_once[,2][grep("NC_", Brid_mapped_once[,2])]))-1)),
      cex.axis = 0.5, tick = TRUE)
axis (side = 1, labels = c("chrom."),
      las = 1, at = mean(c(0,c(length(unique(Brid_mapped_once[,2][grep("NC_", Brid_mapped_once[,2])]))-1))),
      cex.axis = 1, tick = FALSE)
mtext(side = 1, text = "Scaffold", line = 2)
mtext(side = 2, text = "Number of Loci", line = 2.5)

Discussion: What do you notice about the distribution of our loci?

We can zoom and and look at only those loci that mapped to chromosomes.

barplot(table(as.character(Brid_mapped_once[,2][grep("NC_", Brid_mapped_once[,2])])),
        col = "orange", xlab = "", ylab = "",
        main = "3RAD loci assembled against genome", xaxt = 'n', space = 0)
axis (side = 1, labels = unique(Brid_mapped_once[,2][grep("NC_", Brid_mapped_once[,2])]),
      las = 2, at = 0.5+c(0:c(length(unique(Brid_mapped_once[,2][grep("NC_", Brid_mapped_once[,2])]))-1)),
      cex.axis = 0.5)
mtext(side = 1, text = "Chromosome", line = 3.6)
mtext(side = 2, text = "Number of Loci", line = 2.5)

Discussion: Does this distribution match what we might expect? How would you form your expectations?

Using Pronghorn

As you’ve likely noticed throughout these tutorials, some of the methods we use in phylogenomics and population genomics are quite computationally intensive! Our example datasets are relatively small, but with even more data, the demands of these programs might far outpace what is available on your local computer.

Fortunately, many research institutions support this kind of research through high-performance computing clusters, like the University of Nevada Reno’s “Pronghorn”. These are large and powerful groups of computers capable of conducting these analyses, and approved users can login remotely to do their work.

Once you have an account, you can use the ssh command to login to a Pronghorn node from anywhere! An example of this command is below, after which you’d be prompted to enter your password.

ssh [your username]@pronghorn.rc.unr.edu

In most cases, users need to run their programs by submitting a “job”. This is special file that contains the command you want to run along with other important information (e.g., the kind of computing power you need, how long you’ll need it, etc.). Pronghorn will then place your job in a queue and will run it once it’s turn has come.

Pronghorn uses a system called “Slurm” to manage these jobs, and we enter a job into the queue be submitting a Slurm script. Below is an example of one such script:

cat data/example_script.sl

#!/bin/bash
#SBATCH --mem-per-cpu 3000
#SBATCH --job-name example_job
#SBATCH --output ubrucei_denovo_output.txt
#SBATCH -n 24
#SBATCH -t 2-00:00

ipyrad -p params-ubrucei_denovo.txt -s 1234567 -c 24 -d -f

Each of the lines at the top indicate an important feature of the job. Here, we have told Pronghorn:

--mem-per-cpu : the amount of RAM dedicated to each CPU
--job-name : the name of our job
--output : where to print the output of our job (e.g., its status in ipyrad)
-n 24 : the number of cores we are requesting; here, we’ve requested 24 cores
-t : the maximum allowable time for our job; here, we’ve requested two full days

We would then submit the job by entering the following command into our Terminal:

sbatch data/example_script.sl

We can check on the status of all jobs we’ve submitted with the following command:

sacct -u [your username]

Utility of full-genome assemblies

Todd W. Pierson

30 October 2019

Introduction

What’s in an assembled genome?

Getting a genome

Mapping loci in `bwa`

Install `bwa` and `samtools`

Prepare reference genome

Map loci and export data

Interpreting results

Filter loci

Plot mapped loci.

Using Pronghorn

Utility of full-genome assemblies

Todd W. Pierson

30 October 2019

Introduction

What’s in an assembled genome?

Getting a genome

Mapping loci in bwa

Install bwa and samtools

Prepare reference genome

Map loci and export data

Interpreting results

Filter loci

Plot mapped loci.

Using Pronghorn

Mapping loci in `bwa`

Install `bwa` and `samtools`