RADseq data assembly in ipyrad

Note: This tutorial is written to be a basic introduction to assembling RADseq-style data in ipyrad for an in-class demonstration in UNR’s NRES 721, not to be a comprehensive guide to doing this for research purposes.

Students: if you wish to follow along, please install the latest version of ipyrad (follow the conda directions here); I have written this tutorial using v0.9.14. This program is written for Linux or Mac operating systems; if you are running Windows, you can try to run ipyrad through your favorite emulator or virtual machine, or you can just follow along as we move through the tutorial. We’ll return to R for downstream analyses!

Introduction

Reduced-representation genomic sequencing (e.g., RADseq) is a popular group of methods for generating large-scale datasets for population genomic and phylogenomic studies, especially for non-model organisms without reference genomes. The sequencing reads generated from these methods come from thousands to millions of different portions of the genome, and there are numerous methods for “assembling” these data. Typically, this means clustering sequences (either de novo or against a reference genome), aligning sequences, and callling SNPs.

In this tutorial, we’ll practice assembling RADseq-style data. Our objectives are to:

understand the general concepts re: how ipyrad assembles data
compare results from a de novo and reference-based assembly
produce output files for use in downstream analyses

To do this, we’ll use data from the patch-nosed salamander (Urspelerpes brucei). This is a tiny (~25 mm) lungless salamander endemic to just ~ 20 km² in Georgia and South Carolina. Aren’t they beautiful?

Let’s jump right in and start assembling data. If you haven’t already, navigate to a local directory you’d like to work from:

cd [your working directory]

Next, we want to download the data we’ll use. If you’re in your working directory, the following command will download our data into a \data directory within it:

svn checkout https://github.com/twpierson/nres721_assembly/trunk/data

These data are a small subset of a larger, unpublished RADcap (Hoffberg et al. 2016) dataset generated from individuals sampled across the small distribution of this Urspelerpes. This subset contains data from twelve individuals: three each from four collection localities. We can see these files in our \data directory:

ls ./data/*

./data/U03_R1_.fq.gz
./data/U03_R2_.fq.gz
./data/U04_R1_.fq.gz
./data/U04_R2_.fq.gz
./data/U05_R1_.fq.gz
./data/U05_R2_.fq.gz
./data/U19_R1_.fq.gz
./data/U19_R2_.fq.gz
./data/U20_R1_.fq.gz
./data/U20_R2_.fq.gz
./data/U21_R1_.fq.gz
./data/U21_R2_.fq.gz
./data/U23_R1_.fq.gz
./data/U23_R2_.fq.gz
./data/U24_R1_.fq.gz
./data/U24_R2_.fq.gz
./data/U25_R1_.fq.gz
./data/U25_R2_.fq.gz
./data/U41_R1_.fq.gz
./data/U41_R2_.fq.gz
./data/U42_R1_.fq.gz
./data/U42_R2_.fq.gz
./data/U43_R1_.fq.gz
./data/U43_R2_.fq.gz

There are a total of 24 files, consisting of a R1 and R2 file for each of the nine samples. Samples U03–U05 come from one collection locality, U19–U21 from a second, U23–U25 from a third, and U41–U43 from a fourth. These data have already been demultiplexed and “decloned” (i.e., PCR duplicates removed); the data for each sample has been subsetted down to 37,500 reads for consistently and manageability in this tutorial.

De novo assembly

Note: Today, we are going to use ipyrad to “call” genotypes. Some studies (e.g., Gompert and Buerkle 2011; Fumagalli et al. 2013) have demonstrated the pitfalls of definitive genotype calling and instead strongly encourage the use of “genotype likelihoods”, which allow error in estimating genotypes to propogate through downstream analyses. For the sake of simplicity today, we will be calling (and later, using) called genotypes.

There are two general ways in which we can assemble these data in ipyrad: de novo and using a reference genome. In a de novo assembly, reads are clustered and aligned against other reads, and no reference genome is required. This is really great for non-model organisms for which reference genomes don’t exist! For example, in our case, there does not exist a full reference genome for any close relative of Urspelerpes; in fact, because salamanders have very large genomes, only one chromosome-level genome assembly exists (the 32 Gb axolotl genome). For our purposes today, we’re going to start by assembling these data de novo (but see the final tutorial for an example using an assembled reference genome!).

Discussion: What might the advantages be of assembling against a reference genome?

From within our working directory, we’ll initiate a new assembly called ubrucei_denovo (and thus, create a parameters file) by entering:

ipyrad -n ubrucei_denovo

Let’s now look at this template of an ipyrad parameters file:

cat params-ubrucei_denovo.txt

------- ipyrad params file (v.0.9.14)-------------------------------------------
ubrucei_denovo                 ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
[your directory]               ## [1] [project_dir]: Project dir (made in curdir if not present)
                               ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
                               ## [3] [barcodes_path]: Location of barcodes file
                               ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo                         ## [5] [assembly_method]: Assembly method (denovo, reference)
                               ## [6] [reference_sequence]: Location of reference sequence file
rad                            ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TGCAG,                         ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
6                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples
0.85                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
0                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
0.05                           ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
0.05                           ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
4                              ## [21] [min_samples_locus]: Min # samples per locus for output
0.2                            ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8                              ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.5                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus
0, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
p, s, l                        ## [27] [output_formats]: Output formats (see docs)
                               ## [28] [pop_assign_file]: Path to population assignment file
                               ## [29] [reference_as_filter]: Reads mapped to this reference are removed in step 3

This is the file in which you can tweak a number of assembly parameters. See here for a thorough description of what each of these means. For our purposes, we only need to change a few. In your favorite text editor, open params-ubrucei_denovo.txt and change the following lines:

[4] [sorted_fastq_path] : change this parameter to data/*fq.gz
[7] [datatype] : change this parameter to pair3rad
[8] [restriction_overhang] : change this parameter to CTAGAT,AATTC,CTAGC
[16] [filter_adapters] : change this parameter to 2
[21] [min_samples_locus] : change this parameter to 5
[25] [trim_reads] : change this parameter to 10, 120, 10, 120 ; We’re trimming off the first ten bases to deal with mapping issues caused by the inclusion/exclusion of restriction sites and trimming to a maximum length of 120 to normalize read length. This is relatively project-specific and not something you need to worry too much about when trying to understand these general concepts!
[27] [output_formats] : change this parameter to *

Then, save the file with these updates. Next, we’ll assemble the data. This whole process isn’t too computationally-intensive (i.e., it took ~40 minutes on my 16 GB RAM, 4-core Macbook), but we might not want to wait quite that long. Don’t worry—there is an opportunity to assemble data in real-time later in this tutorial. Below, we’ll break this down step-by-step. Students in class: don’t actually run any of these commands until we get to the “Reference-based assembly” section.

steps 1 & 2

We’ll do this first assembly step-by-step and talk about the output. First, we’ll run step 1. Note: the -c 2 argument tells my computer to run this using two cores. You can adjust this as necsesary depending upon what resources you have available. By default, ipyrad will use all cores available.

ipyrad -p params-ubrucei_denovo.txt -s 1 -c 2

This step simply reads in our data files, and we can review the results in the ubrucei_denovo_s1_demultiplex_stats.txt file.

cat ubrucei_denovo_s1_demultiplex_stats.txt

     reads_raw
U03      37500
U04      37500
U05      37500
U19      37500
U20      37500
U21      37500
U23      37500
U24      37500
U25      37500
U41      37500
U42      37500
U43      37500

Next, we’ll run step 2.

ipyrad -p params-ubrucei_denovo.txt -s 2

This step filters and edits reads, using thresholds set in the parameters file. This means that we (probably) don’t need to filter data before entering ipyrad! We can view the results of this step in the ubrucei_denovo_edits/s2_rawedit_stats.txt file:

cat ubrucei_denovo_edits/s2_rawedit_stats.txt

     reads_raw  trim_adapter_bp_read1  trim_adapter_bp_read2  trim_quality_bp_read1  trim_quality_bp_read2  reads_filtered_by_Ns  reads_filtered_by_minlen  reads_passed_filter
U03      37500                   1515                    851                   6826                 105668                     0                       427                37073
U04      37500                   1649                    889                   8357                 115236                     0                       460                37040
U05      37500                   1419                    831                   6975                 102394                     0                       399                37101
U19      37500                   1596                    843                   8995                  87001                     0                       324                37176
U20      37500                   1723                    828                   8798                  93216                     0                       358                37142
U21      37500                   1366                    836                   8556                  87858                     0                       315                37185
U23      37500                   1612                    835                   6983                  97489                     0                       352                37148
U24      37500                   1729                    885                   9037                 101776                     0                       358                37142
U25      37500                   1451                    855                   8632                 117176                     0                       450                37050
U41      37500                   2744                   1014                   8205                  63413                     0                       207                37293
U42      37500                   2674                    932                   7453                  69472                     0                       241                37259
U43      37500                   2413                    938                   8966                  78060                     0                       288                37212

step 3

Next, we’ll run step 3. This is the within-sample clustering step, in which the reads are compared against each other, clustered based on similarity into “loci”, and then aligned. This is a relatively computationally intensive step.

ipyrad -p params-ubrucei_denovo.txt -s 3

Once completed, we can view the results of this step in the ubrucei_denovo_clust_0.85/s3_cluster_stats.txt file:

cat ubrucei_denovo_clust_0.85/s3_cluster_stats.txt

    clusters_total  hidepth_min clusters_hidepth avg_depth_total avg_depth_mj avg_depth_stat sd_depth_total sd_depth_mj sd_depth_stat
U03          11268            6             1098            2.75        14.37          14.37          16.88       52.58         52.58
U04          12063            6             1050            2.60        14.41          14.41          15.38       50.54         50.54
U05          11417            6             1102            2.59        12.92          12.92           4.58        9.54          9.54
U19          11082            6             1078            2.68        13.47          13.47           5.40       12.74         12.74
U20          11171            6             1044            2.82        15.33          15.33          17.40       55.30         55.30
U21          11050            6             1104            2.70        13.33          13.33           5.10       11.24         11.24
U23          11658            6             1066            2.70        14.94          14.94          15.93       51.00         51.00
U24          11988            6             1054            2.60        14.13          14.13          14.87       48.59         48.59
U25          11520            6             1066            2.57        13.10          13.10           4.80       10.86         10.86
U41          10677            6             1082            2.80        14.27          14.27          13.37       40.13         40.13
U42          10742            6             1053            2.78        14.53          14.53          15.74       48.65         48.65
U43          10646            6             1044            2.85        15.11          15.11          16.41       50.70         50.70

Discussion: What can you glean from this output? What is the distribution of our sequencing coverage? What do you think might contribute to heterogeneity in sequencing coverage among and within samples?

steps 4 & 5

Next, we’ll run steps 4 and 5 together. These steps estimate sequencing error rates, estimate genome-wide heterozygosity, and call consensus sequences for each individual for use in across-sample clustering.

ipyrad -p params-ubrucei_denovo.txt -s 45

step 6

We’ll skip over the results from steps 4 & 5 and proceed immediately to step 6. This is another “clustering” step, but instead of clustering reads within individuals, we’re clustering loci across individuals. With many samples, this step can be computationally intensive, but it should run quite quickly for us.

ipyrad -p params-ubrucei_denovo.txt -s 6

step 7

Finally, we’ll run step 7, in which loci are catalogued, and data are output into a variety of formats (e.g., VCF, PHYLIP, Structure, etc.) for downstream analysis.

ipyrad -p params-ubrucei_denovo.txt -s 7

This also produces a useful file (ubrucei_denovo_outfiles/ubrucei_denovo_stats.txt) with summary statistics, which looks like this:


## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters

                            total_filters  applied_order  retained_loci
total_prefiltered_loci                  0              0           1433
filtered_by_rm_duplicates               9              9           1424
filtered_by_max_indels                  6              6           1418
filtered_by_max_SNPs                    0              0           1418
filtered_by_max_shared_het             45             45           1373
filtered_by_min_sample                491            489            884
total_filtered_loci                   551            549            884


## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

     sample_coverage
U03              738
U04              716
U05              749
U19              705
U20              704
U21              727
U23              723
U24              712
U25              715
U41              635
U42              656
U43              685


## The number of loci for which N taxa have data.
## ipyrad API location: [assembly].stats_dfs.s7_loci

    locus_coverage  sum_coverage
1                0             0
2                0             0
3                0             0
4                0             0
5               72            72
6               80           152
7               62           214
8               76           290
9               73           363
10             101           464
11             124           588
12             296           884


The distribution of SNPs (var and pis) per locus.
## var = Number of loci with n variable sites (pis + autapomorphies)
## pis = Number of loci with n parsimony informative site (minor allele in >1 sample)
## ipyrad API location: [assembly].stats_dfs.s7_snps
## The "reference" sample is included if present unless 'exclude_reference=True'

    var  sum_var  pis  sum_pis
0   488        0  607        0
1   184      184  131      131
2    79      342   62      255
3    42      468   29      342
4    17      536   16      406
5    29      681   15      481
6    13      759    8      529
7    13      850    7      578
8     3      874    3      602
9     2      892    1      611
10    3      922    1      621
11    3      955    2      643
12    0      955    0      643
13    3      994    0      643
14    0      994    1      657
15    2     1024    0      657
16    0     1024    0      657
17    0     1024    0      657
18    1     1042    0      657
19    0     1042    0      657
20    0     1042    0      657
21    0     1042    0      657
22    0     1042    0      657
23    0     1042    0      657
24    0     1042    1      681
25    0     1042    0      681
26    0     1042    0      681
27    0     1042    0      681
28    1     1070    0      681
29    0     1070    0      681
30    0     1070    0      681
31    0     1070    0      681
32    0     1070    0      681
33    0     1070    0      681
34    0     1070    0      681
35    0     1070    0      681
36    0     1070    0      681
37    0     1070    0      681
38    0     1070    0      681
39    0     1070    0      681
40    1     1110    0      681


## Final Sample stats summary
     state  reads_raw  reads_passed_filter  refseq_mapped_reads  refseq_unmapped_reads  clusters_total  clusters_hidepth  hetero_est  error_est  reads_consens  loci_in_assembly
U03      7      37500                37073                30953                   6120           11268              1098    0.013550   0.005020            941               738
U04      7      37500                37040                31400                   5640           12063              1050    0.013419   0.005670            896               716
U05      7      37500                37101                29571                   7530           11417              1102    0.015482   0.005835            938               749
U19      7      37500                37176                29671                   7505           11082              1078    0.014641   0.005850            913               705
U20      7      37500                37142                31531                   5611           11171              1044    0.014357   0.006062            895               704
U21      7      37500                37185                29850                   7335           11050              1104    0.013657   0.005759            949               727
U23      7      37500                37148                31524                   5624           11658              1066    0.014485   0.005682            913               723
U24      7      37500                37142                31127                   6015           11988              1054    0.013287   0.005194            914               712
U25      7      37500                37050                29628                   7422           11520              1066    0.012454   0.005587            910               715
U41      7      37500                37293                29936                   7357           10677              1082    0.016725   0.006773            878               635
U42      7      37500                37259                29867                   7392           10742              1053    0.016200   0.006346            889               656
U43      7      37500                37212                30303                   6909           10646              1044    0.015841   0.007015            879               685


## Alignment matrix statistics:
snps matrix size: (12, 1110), 23.05% missing sites.
sequence matrix size: (12, 211231), 20.15% missing sites.

Discussion: Examine these summary statistics. At what steps are loci being filtered out? Is this indicative of any problems with our analysis or data?

Reference-based assembly

Although we don’t want to use a full genome assembly as a reference genome, these RADcap data provide us with another opportunity for assembly: a “pseudo-reference” genome. To understand this, it’s worth thinking a bit about what the RADcap data are. We created 3RAD libraries—each of which probably contained hundreds of thousands of loci—for each individual, but then we conducted a “sequence capture” reaction to enrich them for a pre-selected subset of 1000 loci. So ideally, these reads are mostly from those 1000 loci. Rather than assemble these de novo, if we have a “reference” sequence for each those 1000 loci, we can just assemble against those!

Discussion: How might you predict that assembling against a pseudo-reference genome will change the results of our assembly?

I’ve prepared one such pseudo-reference, and we can download the \reference directory in which it’s housed by entering the following command:

svn checkout https://github.com/twpierson/nres721_assembly/trunk/reference

We can peek at this pseudo-reference—formatted as a FASTA file—here:

head -n 10 reference/ubrucei_pseudoreference.fasta

>locus1
TATGTAAAACCCCCTCAGTGTTGTTGTCAATGTTGTCCCAAATTACATAGTGATGTCCAGCAAATGGATTAGCTACTATTTTGATCTTGTAAGAGGATTTGTTGTAGGCTTTACATCGAGCACGTGAGAAAAAGCNNNNNNNNNNGGATCAGAGTATCATGTAGGCTGGGGGCTCTAGTAACACTGAAGGACATTGACAGAGGTACTAACCTATAGTAGCTAATATCGGGTCAGTACGCAGAATAGGCCAATGTTTAGAGATGCAGCTCCTGAATATATTTCtCATC
>locus10
ATTGATAGTATTTTGCGATGGTAAATACACGAGCAAAAAAGAAAATGGCCAGATCTCGAACCCAGCCGAATAGTACACCTAGGTCTGCTTTACAGCTCTCATTATTCAGTTTTGGCATGCAAAAAGGGGAGCTGNNNNNNNNNNAACAATCGGAGCTACTTTACATAATGATGAGACGACTCCCTCCATTGATCTCCCTGAGCAAACTGGGTGCCCTGCTAATGTTGATTTATCTGTGACAGAAAGTGAAAGTACCAACGATGACCTAAATAATGTTCCACCACC
>locus100
TGATATATGACGAGACTGTCTCGTCATTAGCACTTAAGGGGTTAAAGAAAAAATATTGTATGGTTGCTAATTATGTTCATAAAAATCAAACGCCCCACAGAACCTGTAGAAACCATGTAAAAACAGTTGTTTATNNNNNNNNNNGTGGTAGATGGTGAGGGATTAGGGGAAGAAAGTCAAAAACCTAGGGGTGTCTTGTATTTTAGACGTTCAATGGTTCATTGTCTTTTAACATTGCCTACCACCGAAACTGGTATGTTGAgatGGTACGCTAATGCTTGTAGcCtCATC
>locus1000
AAGTTGCAGATTGCTGGCTTGTGGAATGGGGACCCTGCCCTACCTGTGTCCTGTAATTTGGGGTATGCCATAATAGCTTCCATGTCTGCAGAAGCCCATGTGGGCCAAACAACTAAATCAGCTTATTACCTCTGGNNNNNNNNNNGGGTTTGAAAAAGTATGGCCACTTGTCCAAGTACAGAATCATTAGGTTGGCTCCCTTATTGAAATCAgattGATTGGTTGGTTGTGCGTTGAAAGGGCTCCAAAAGGTTGGGCCCAACTTTATTCGAGAATGGCTTCAGCCACC
>locus10000
ATGGTGAGAGGGGACAGGGTCACTCCGGCCTCTGGCAGGGACACTGTCGCAATGGGGGCTGATTCTGACGAGTCACTACAGGGAAATAGCAGCACCATCAATGTATAAGCACCACGCTGGGGGGGGGGGTTTATCCGNNNNNNNNNNAGGACCCGCTGGGCCTCAACAGTTTTGGCAGACAGGTACAGAGCTTCAGAGGGGCCGTGGGGAAGAGGGTATGGTGGGGCTTCTTGAAGGCCAGGTTTCAGCTGTTTACCAGGTCACCTTTATTGATTAGTTGATGCCACC

Discussion: How is this formatted? Where did R1 and R2 data end up, and what does this tell you about the size of the loci?

Now, let’s assemble our data again—this time against this pseudo-reference! Since most of our parameters will stay the same, we can create a new parameters file here by copying our de novo parameters file.

cp params-ubrucei_denovo.txt params-ubrucei_reference.txt

We should then alter the following parameters:

[0] [assembly_name] : change this parameter to ubrucei_reference
[5] [assembly_method] : change this parameter to reference
[6] [reference_sequence] : change this parameter to reference/ubrucei_pseudoreference.fasta

Now, we’ll run this whole assembly, rather than break it down step-by-step like before. This should run relatively quickly, if you want to follow along.

ipyrad -p params-ubrucei_reference.txt -s 1234567

Once it finishes, let’s look at the same summary statistics file.

cat ubrucei_reference_outfiles/ubrucei_reference_stats.txt


## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters

                            total_filters  applied_order  retained_loci
total_prefiltered_loci                  0              0            905
filtered_by_rm_duplicates               0              0            905
filtered_by_max_indels                  0              0            905
filtered_by_max_SNPs                    0              0            905
filtered_by_max_shared_het             23             23            882
filtered_by_min_sample                121            121            761
total_filtered_loci                   144            144            761


## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

           sample_coverage
reference              761
U03                    643
U04                    640
U05                    653
U19                    629
U20                    613
U21                    647
U23                    638
U24                    623
U25                    628
U41                    560
U42                    575
U43                    602


## The number of loci for which N taxa have data.
## ipyrad API location: [assembly].stats_dfs.s7_loci

    locus_coverage  sum_coverage
1                0             0
2                0             0
3                0             0
4                0             0
5               32            32
6               36            68
7               47           115
8               50           165
9               52           217
10              52           269
11              79           348
12             119           467
13             294           761


The distribution of SNPs (var and pis) per locus.
## var = Number of loci with n variable sites (pis + autapomorphies)
## pis = Number of loci with n parsimony informative site (minor allele in >1 sample)
## ipyrad API location: [assembly].stats_dfs.s7_snps
## The "reference" sample is included if present unless 'exclude_reference=True'

    var  sum_var  pis  sum_pis
0   229        0  490        0
1   177      177  134      134
2   108      393   49      232
3    71      606   35      337
4    47      794   16      401
5    38      984   15      476
6    26     1140    6      512
7    19     1273    8      568
8    13     1377    2      584
9     6     1431    3      611
10    8     1511    1      621
11    6     1577    0      621
12    2     1601    1      633
13    6     1679    1      646
14    1     1693    0      646
15    3     1738    0      646
16    0     1738    0      646
17    1     1755    0      646


## Final Sample stats summary
     state  reads_raw  reads_passed_filter  refseq_mapped_reads  refseq_unmapped_reads  clusters_total  clusters_hidepth  hetero_est  error_est  reads_consens  loci_in_assembly
U03      7      37500                37073                14157                  22916             931               698    0.003323   0.001832            687               643
U04      7      37500                37040                13587                  23453             948               691    0.003585   0.001900            681               640
U05      7      37500                37101                14086                  23015             938               707    0.002505   0.001809            700               653
U19      7      37500                37176                14217                  22959             929               685    0.004192   0.001778            672               629
U20      7      37500                37142                14081                  23061             922               654    0.003445   0.001852            643               613
U21      7      37500                37185                14479                  22706             922               701    0.003514   0.001924            690               647
U23      7      37500                37148                13947                  23201             933               694    0.003301   0.001803            679               638
U24      7      37500                37142                13320                  23822             936               680    0.003417   0.001822            669               623
U25      7      37500                37050                13758                  23292             941               674    0.003358   0.001779            662               628
U41      7      37500                37293                12321                  24972             914               610    0.004018   0.002619            598               560
U42      7      37500                37259                13012                  24247             928               627    0.003828   0.002814            620               575
U43      7      37500                37212                13309                  23903             937               649    0.004455   0.002187            638               602


## Alignment matrix statistics:
snps matrix size: (13, 1755), 22.69% missing sites.
sequence matrix size: (13, 181974), 17.49% missing sites.

Discussion: Compare and contrast these results to those from the de novo assembly. What can explain the differences? It might be worth reflecting back upon what exactly RADcap data are.

We’ve finished our assemblies! Next week, we’ll conduct some basic population genomic and phylogenomic analyses to see what we can learn from these data.

RADseq data assembly in `ipyrad`

Todd W. Pierson

16 October 2019