ipyrad
Note: This tutorial is written to be a basic introduction to assembling RADseq-style data in ipyrad
for an in-class demonstration in UNR’s NRES 721, not to be a comprehensive guide to doing this for research purposes.
Students: if you wish to follow along, please install the latest version of ipyrad
(follow the conda
directions here); I have written this tutorial using v0.9.14
. This program is written for Linux or Mac operating systems; if you are running Windows, you can try to run ipyrad
through your favorite emulator or virtual machine, or you can just follow along as we move through the tutorial. We’ll return to R
for downstream analyses!
Reduced-representation genomic sequencing (e.g., RADseq) is a popular group of methods for generating large-scale datasets for population genomic and phylogenomic studies, especially for non-model organisms without reference genomes. The sequencing reads generated from these methods come from thousands to millions of different portions of the genome, and there are numerous methods for “assembling” these data. Typically, this means clustering sequences (either de novo or against a reference genome), aligning sequences, and callling SNPs.
In this tutorial, we’ll practice assembling RADseq-style data. Our objectives are to:
ipyrad
assembles dataTo do this, we’ll use data from the patch-nosed salamander (Urspelerpes brucei). This is a tiny (~25 mm) lungless salamander endemic to just ~ 20 km2 in Georgia and South Carolina. Aren’t they beautiful?
Let’s jump right in and start assembling data. If you haven’t already, navigate to a local directory you’d like to work from:
cd [your working directory]
Next, we want to download the data we’ll use. If you’re in your working directory, the following command will download our data into a \data
directory within it:
svn checkout https://github.com/twpierson/nres721_assembly/trunk/data
These data are a small subset of a larger, unpublished RADcap (Hoffberg et al. 2016) dataset generated from individuals sampled across the small distribution of this Urspelerpes. This subset contains data from twelve individuals: three each from four collection localities. We can see these files in our \data
directory:
ls ./data/*
./data/U03_R1_.fq.gz
./data/U03_R2_.fq.gz
./data/U04_R1_.fq.gz
./data/U04_R2_.fq.gz
./data/U05_R1_.fq.gz
./data/U05_R2_.fq.gz
./data/U19_R1_.fq.gz
./data/U19_R2_.fq.gz
./data/U20_R1_.fq.gz
./data/U20_R2_.fq.gz
./data/U21_R1_.fq.gz
./data/U21_R2_.fq.gz
./data/U23_R1_.fq.gz
./data/U23_R2_.fq.gz
./data/U24_R1_.fq.gz
./data/U24_R2_.fq.gz
./data/U25_R1_.fq.gz
./data/U25_R2_.fq.gz
./data/U41_R1_.fq.gz
./data/U41_R2_.fq.gz
./data/U42_R1_.fq.gz
./data/U42_R2_.fq.gz
./data/U43_R1_.fq.gz
./data/U43_R2_.fq.gz
There are a total of 24 files, consisting of a R1 and R2 file for each of the nine samples. Samples U03–U05 come from one collection locality, U19–U21 from a second, U23–U25 from a third, and U41–U43 from a fourth. These data have already been demultiplexed and “decloned” (i.e., PCR duplicates removed); the data for each sample has been subsetted down to 37,500 reads for consistently and manageability in this tutorial.
Note: Today, we are going to use ipyrad
to “call” genotypes. Some studies (e.g., Gompert and Buerkle 2011; Fumagalli et al. 2013) have demonstrated the pitfalls of definitive genotype calling and instead strongly encourage the use of “genotype likelihoods”, which allow error in estimating genotypes to propogate through downstream analyses. For the sake of simplicity today, we will be calling (and later, using) called genotypes.
There are two general ways in which we can assemble these data in ipyrad
: de novo and using a reference genome. In a de novo assembly, reads are clustered and aligned against other reads, and no reference genome is required. This is really great for non-model organisms for which reference genomes don’t exist! For example, in our case, there does not exist a full reference genome for any close relative of Urspelerpes; in fact, because salamanders have very large genomes, only one chromosome-level genome assembly exists (the 32 Gb axolotl genome). For our purposes today, we’re going to start by assembling these data de novo (but see the final tutorial for an example using an assembled reference genome!).
Discussion: What might the advantages be of assembling against a reference genome?
From within our working directory, we’ll initiate a new assembly called ubrucei_denovo
(and thus, create a parameters file) by entering:
ipyrad -n ubrucei_denovo
Let’s now look at this template of an ipyrad
parameters file:
cat params-ubrucei_denovo.txt
------- ipyrad params file (v.0.9.14)-------------------------------------------
ubrucei_denovo ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
[your directory] ## [1] [project_dir]: Project dir (made in curdir if not present)
## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
## [3] [barcodes_path]: Location of barcodes file
## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo ## [5] [assembly_method]: Assembly method (denovo, reference)
## [6] [reference_sequence]: Location of reference sequence file
rad ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TGCAG, ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5 ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33 ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6 ## [11] [mindepth_statistical]: Min depth for statistical base calling
6 ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000 ## [13] [maxdepth]: Max cluster depth within samples
0.85 ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0 ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
0 ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35 ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2 ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
0.05 ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
0.05 ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
4 ## [21] [min_samples_locus]: Min # samples per locus for output
0.2 ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8 ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.5 ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus
0, 0, 0, 0 ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0 ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
p, s, l ## [27] [output_formats]: Output formats (see docs)
## [28] [pop_assign_file]: Path to population assignment file
## [29] [reference_as_filter]: Reads mapped to this reference are removed in step 3
This is the file in which you can tweak a number of assembly parameters. See here for a thorough description of what each of these means. For our purposes, we only need to change a few. In your favorite text editor, open params-ubrucei_denovo.txt
and change the following lines:
[4] [sorted_fastq_path]
: change this parameter to data/*fq.gz
[7] [datatype]
: change this parameter to pair3rad
[8] [restriction_overhang]
: change this parameter to CTAGAT,AATTC,CTAGC
[16] [filter_adapters]
: change this parameter to 2
[21] [min_samples_locus]
: change this parameter to 5
[25] [trim_reads]
: change this parameter to 10, 120, 10, 120
; We’re trimming off the first ten bases to deal with mapping issues caused by the inclusion/exclusion of restriction sites and trimming to a maximum length of 120 to normalize read length. This is relatively project-specific and not something you need to worry too much about when trying to understand these general concepts![27] [output_formats]
: change this parameter to *
Then, save the file with these updates. Next, we’ll assemble the data. This whole process isn’t too computationally-intensive (i.e., it took ~40 minutes on my 16 GB RAM, 4-core Macbook), but we might not want to wait quite that long. Don’t worry—there is an opportunity to assemble data in real-time later in this tutorial. Below, we’ll break this down step-by-step. Students in class: don’t actually run any of these commands until we get to the “Reference-based assembly” section.
We’ll do this first assembly step-by-step and talk about the output. First, we’ll run step 1. Note: the -c 2
argument tells my computer to run this using two cores. You can adjust this as necsesary depending upon what resources you have available. By default, ipyrad
will use all cores available.
ipyrad -p params-ubrucei_denovo.txt -s 1 -c 2
This step simply reads in our data files, and we can review the results in the ubrucei_denovo_s1_demultiplex_stats.txt
file.
cat ubrucei_denovo_s1_demultiplex_stats.txt
reads_raw
U03 37500
U04 37500
U05 37500
U19 37500
U20 37500
U21 37500
U23 37500
U24 37500
U25 37500
U41 37500
U42 37500
U43 37500
Next, we’ll run step 2.
ipyrad -p params-ubrucei_denovo.txt -s 2
This step filters and edits reads, using thresholds set in the parameters file. This means that we (probably) don’t need to filter data before entering ipyrad
! We can view the results of this step in the ubrucei_denovo_edits/s2_rawedit_stats.txt
file:
cat ubrucei_denovo_edits/s2_rawedit_stats.txt
reads_raw trim_adapter_bp_read1 trim_adapter_bp_read2 trim_quality_bp_read1 trim_quality_bp_read2 reads_filtered_by_Ns reads_filtered_by_minlen reads_passed_filter
U03 37500 1515 851 6826 105668 0 427 37073
U04 37500 1649 889 8357 115236 0 460 37040
U05 37500 1419 831 6975 102394 0 399 37101
U19 37500 1596 843 8995 87001 0 324 37176
U20 37500 1723 828 8798 93216 0 358 37142
U21 37500 1366 836 8556 87858 0 315 37185
U23 37500 1612 835 6983 97489 0 352 37148
U24 37500 1729 885 9037 101776 0 358 37142
U25 37500 1451 855 8632 117176 0 450 37050
U41 37500 2744 1014 8205 63413 0 207 37293
U42 37500 2674 932 7453 69472 0 241 37259
U43 37500 2413 938 8966 78060 0 288 37212
Next, we’ll run step 3. This is the within-sample clustering step, in which the reads are compared against each other, clustered based on similarity into “loci”, and then aligned. This is a relatively computationally intensive step.
ipyrad -p params-ubrucei_denovo.txt -s 3
Once completed, we can view the results of this step in the ubrucei_denovo_clust_0.85/s3_cluster_stats.txt
file:
cat ubrucei_denovo_clust_0.85/s3_cluster_stats.txt
clusters_total hidepth_min clusters_hidepth avg_depth_total avg_depth_mj avg_depth_stat sd_depth_total sd_depth_mj sd_depth_stat
U03 11268 6 1098 2.75 14.37 14.37 16.88 52.58 52.58
U04 12063 6 1050 2.60 14.41 14.41 15.38 50.54 50.54
U05 11417 6 1102 2.59 12.92 12.92 4.58 9.54 9.54
U19 11082 6 1078 2.68 13.47 13.47 5.40 12.74 12.74
U20 11171 6 1044 2.82 15.33 15.33 17.40 55.30 55.30
U21 11050 6 1104 2.70 13.33 13.33 5.10 11.24 11.24
U23 11658 6 1066 2.70 14.94 14.94 15.93 51.00 51.00
U24 11988 6 1054 2.60 14.13 14.13 14.87 48.59 48.59
U25 11520 6 1066 2.57 13.10 13.10 4.80 10.86 10.86
U41 10677 6 1082 2.80 14.27 14.27 13.37 40.13 40.13
U42 10742 6 1053 2.78 14.53 14.53 15.74 48.65 48.65
U43 10646 6 1044 2.85 15.11 15.11 16.41 50.70 50.70
Discussion: What can you glean from this output? What is the distribution of our sequencing coverage? What do you think might contribute to heterogeneity in sequencing coverage among and within samples?
Next, we’ll run steps 4 and 5 together. These steps estimate sequencing error rates, estimate genome-wide heterozygosity, and call consensus sequences for each individual for use in across-sample clustering.
ipyrad -p params-ubrucei_denovo.txt -s 45
We’ll skip over the results from steps 4 & 5 and proceed immediately to step 6. This is another “clustering” step, but instead of clustering reads within individuals, we’re clustering loci across individuals. With many samples, this step can be computationally intensive, but it should run quite quickly for us.
ipyrad -p params-ubrucei_denovo.txt -s 6
Finally, we’ll run step 7, in which loci are catalogued, and data are output into a variety of formats (e.g., VCF, PHYLIP, Structure, etc.) for downstream analysis.
ipyrad -p params-ubrucei_denovo.txt -s 7
This also produces a useful file (ubrucei_denovo_outfiles/ubrucei_denovo_stats.txt
) with summary statistics, which looks like this:
## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters
total_filters applied_order retained_loci
total_prefiltered_loci 0 0 1433
filtered_by_rm_duplicates 9 9 1424
filtered_by_max_indels 6 6 1418
filtered_by_max_SNPs 0 0 1418
filtered_by_max_shared_het 45 45 1373
filtered_by_min_sample 491 489 884
total_filtered_loci 551 549 884
## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples
sample_coverage
U03 738
U04 716
U05 749
U19 705
U20 704
U21 727
U23 723
U24 712
U25 715
U41 635
U42 656
U43 685
## The number of loci for which N taxa have data.
## ipyrad API location: [assembly].stats_dfs.s7_loci
locus_coverage sum_coverage
1 0 0
2 0 0
3 0 0
4 0 0
5 72 72
6 80 152
7 62 214
8 76 290
9 73 363
10 101 464
11 124 588
12 296 884
The distribution of SNPs (var and pis) per locus.
## var = Number of loci with n variable sites (pis + autapomorphies)
## pis = Number of loci with n parsimony informative site (minor allele in >1 sample)
## ipyrad API location: [assembly].stats_dfs.s7_snps
## The "reference" sample is included if present unless 'exclude_reference=True'
var sum_var pis sum_pis
0 488 0 607 0
1 184 184 131 131
2 79 342 62 255
3 42 468 29 342
4 17 536 16 406
5 29 681 15 481
6 13 759 8 529
7 13 850 7 578
8 3 874 3 602
9 2 892 1 611
10 3 922 1 621
11 3 955 2 643
12 0 955 0 643
13 3 994 0 643
14 0 994 1 657
15 2 1024 0 657
16 0 1024 0 657
17 0 1024 0 657
18 1 1042 0 657
19 0 1042 0 657
20 0 1042 0 657
21 0 1042 0 657
22 0 1042 0 657
23 0 1042 0 657
24 0 1042 1 681
25 0 1042 0 681
26 0 1042 0 681
27 0 1042 0 681
28 1 1070 0 681
29 0 1070 0 681
30 0 1070 0 681
31 0 1070 0 681
32 0 1070 0 681
33 0 1070 0 681
34 0 1070 0 681
35 0 1070 0 681
36 0 1070 0 681
37 0 1070 0 681
38 0 1070 0 681
39 0 1070 0 681
40 1 1110 0 681
## Final Sample stats summary
state reads_raw reads_passed_filter refseq_mapped_reads refseq_unmapped_reads clusters_total clusters_hidepth hetero_est error_est reads_consens loci_in_assembly
U03 7 37500 37073 30953 6120 11268 1098 0.013550 0.005020 941 738
U04 7 37500 37040 31400 5640 12063 1050 0.013419 0.005670 896 716
U05 7 37500 37101 29571 7530 11417 1102 0.015482 0.005835 938 749
U19 7 37500 37176 29671 7505 11082 1078 0.014641 0.005850 913 705
U20 7 37500 37142 31531 5611 11171 1044 0.014357 0.006062 895 704
U21 7 37500 37185 29850 7335 11050 1104 0.013657 0.005759 949 727
U23 7 37500 37148 31524 5624 11658 1066 0.014485 0.005682 913 723
U24 7 37500 37142 31127 6015 11988 1054 0.013287 0.005194 914 712
U25 7 37500 37050 29628 7422 11520 1066 0.012454 0.005587 910 715
U41 7 37500 37293 29936 7357 10677 1082 0.016725 0.006773 878 635
U42 7 37500 37259 29867 7392 10742 1053 0.016200 0.006346 889 656
U43 7 37500 37212 30303 6909 10646 1044 0.015841 0.007015 879 685
## Alignment matrix statistics:
snps matrix size: (12, 1110), 23.05% missing sites.
sequence matrix size: (12, 211231), 20.15% missing sites.
Discussion: Examine these summary statistics. At what steps are loci being filtered out? Is this indicative of any problems with our analysis or data?
Although we don’t want to use a full genome assembly as a reference genome, these RADcap data provide us with another opportunity for assembly: a “pseudo-reference” genome. To understand this, it’s worth thinking a bit about what the RADcap data are. We created 3RAD libraries—each of which probably contained hundreds of thousands of loci—for each individual, but then we conducted a “sequence capture” reaction to enrich them for a pre-selected subset of 1000 loci. So ideally, these reads are mostly from those 1000 loci. Rather than assemble these de novo, if we have a “reference” sequence for each those 1000 loci, we can just assemble against those!
Discussion: How might you predict that assembling against a pseudo-reference genome will change the results of our assembly?
I’ve prepared one such pseudo-reference, and we can download the \reference
directory in which it’s housed by entering the following command:
svn checkout https://github.com/twpierson/nres721_assembly/trunk/reference
We can peek at this pseudo-reference—formatted as a FASTA file—here:
head -n 10 reference/ubrucei_pseudoreference.fasta
>locus1
TATGTAAAACCCCCTCAGTGTTGTTGTCAATGTTGTCCCAAATTACATAGTGATGTCCAGCAAATGGATTAGCTACTATTTTGATCTTGTAAGAGGATTTGTTGTAGGCTTTACATCGAGCACGTGAGAAAAAGCNNNNNNNNNNGGATCAGAGTATCATGTAGGCTGGGGGCTCTAGTAACACTGAAGGACATTGACAGAGGTACTAACCTATAGTAGCTAATATCGGGTCAGTACGCAGAATAGGCCAATGTTTAGAGATGCAGCTCCTGAATATATTTCtCATC
>locus10
ATTGATAGTATTTTGCGATGGTAAATACACGAGCAAAAAAGAAAATGGCCAGATCTCGAACCCAGCCGAATAGTACACCTAGGTCTGCTTTACAGCTCTCATTATTCAGTTTTGGCATGCAAAAAGGGGAGCTGNNNNNNNNNNAACAATCGGAGCTACTTTACATAATGATGAGACGACTCCCTCCATTGATCTCCCTGAGCAAACTGGGTGCCCTGCTAATGTTGATTTATCTGTGACAGAAAGTGAAAGTACCAACGATGACCTAAATAATGTTCCACCACC
>locus100
TGATATATGACGAGACTGTCTCGTCATTAGCACTTAAGGGGTTAAAGAAAAAATATTGTATGGTTGCTAATTATGTTCATAAAAATCAAACGCCCCACAGAACCTGTAGAAACCATGTAAAAACAGTTGTTTATNNNNNNNNNNGTGGTAGATGGTGAGGGATTAGGGGAAGAAAGTCAAAAACCTAGGGGTGTCTTGTATTTTAGACGTTCAATGGTTCATTGTCTTTTAACATTGCCTACCACCGAAACTGGTATGTTGAgatGGTACGCTAATGCTTGTAGcCtCATC
>locus1000
AAGTTGCAGATTGCTGGCTTGTGGAATGGGGACCCTGCCCTACCTGTGTCCTGTAATTTGGGGTATGCCATAATAGCTTCCATGTCTGCAGAAGCCCATGTGGGCCAAACAACTAAATCAGCTTATTACCTCTGGNNNNNNNNNNGGGTTTGAAAAAGTATGGCCACTTGTCCAAGTACAGAATCATTAGGTTGGCTCCCTTATTGAAATCAgattGATTGGTTGGTTGTGCGTTGAAAGGGCTCCAAAAGGTTGGGCCCAACTTTATTCGAGAATGGCTTCAGCCACC
>locus10000
ATGGTGAGAGGGGACAGGGTCACTCCGGCCTCTGGCAGGGACACTGTCGCAATGGGGGCTGATTCTGACGAGTCACTACAGGGAAATAGCAGCACCATCAATGTATAAGCACCACGCTGGGGGGGGGGGTTTATCCGNNNNNNNNNNAGGACCCGCTGGGCCTCAACAGTTTTGGCAGACAGGTACAGAGCTTCAGAGGGGCCGTGGGGAAGAGGGTATGGTGGGGCTTCTTGAAGGCCAGGTTTCAGCTGTTTACCAGGTCACCTTTATTGATTAGTTGATGCCACC
Discussion: How is this formatted? Where did R1 and R2 data end up, and what does this tell you about the size of the loci?
Now, let’s assemble our data again—this time against this pseudo-reference! Since most of our parameters will stay the same, we can create a new parameters file here by copying our de novo parameters file.
cp params-ubrucei_denovo.txt params-ubrucei_reference.txt
We should then alter the following parameters:
[0] [assembly_name]
: change this parameter to ubrucei_reference
[5] [assembly_method]
: change this parameter to reference
[6] [reference_sequence]
: change this parameter to reference/ubrucei_pseudoreference.fasta
Now, we’ll run this whole assembly, rather than break it down step-by-step like before. This should run relatively quickly, if you want to follow along.
ipyrad -p params-ubrucei_reference.txt -s 1234567
Once it finishes, let’s look at the same summary statistics file.
cat ubrucei_reference_outfiles/ubrucei_reference_stats.txt
## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters
total_filters applied_order retained_loci
total_prefiltered_loci 0 0 905
filtered_by_rm_duplicates 0 0 905
filtered_by_max_indels 0 0 905
filtered_by_max_SNPs 0 0 905
filtered_by_max_shared_het 23 23 882
filtered_by_min_sample 121 121 761
total_filtered_loci 144 144 761
## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples
sample_coverage
reference 761
U03 643
U04 640
U05 653
U19 629
U20 613
U21 647
U23 638
U24 623
U25 628
U41 560
U42 575
U43 602
## The number of loci for which N taxa have data.
## ipyrad API location: [assembly].stats_dfs.s7_loci
locus_coverage sum_coverage
1 0 0
2 0 0
3 0 0
4 0 0
5 32 32
6 36 68
7 47 115
8 50 165
9 52 217
10 52 269
11 79 348
12 119 467
13 294 761
The distribution of SNPs (var and pis) per locus.
## var = Number of loci with n variable sites (pis + autapomorphies)
## pis = Number of loci with n parsimony informative site (minor allele in >1 sample)
## ipyrad API location: [assembly].stats_dfs.s7_snps
## The "reference" sample is included if present unless 'exclude_reference=True'
var sum_var pis sum_pis
0 229 0 490 0
1 177 177 134 134
2 108 393 49 232
3 71 606 35 337
4 47 794 16 401
5 38 984 15 476
6 26 1140 6 512
7 19 1273 8 568
8 13 1377 2 584
9 6 1431 3 611
10 8 1511 1 621
11 6 1577 0 621
12 2 1601 1 633
13 6 1679 1 646
14 1 1693 0 646
15 3 1738 0 646
16 0 1738 0 646
17 1 1755 0 646
## Final Sample stats summary
state reads_raw reads_passed_filter refseq_mapped_reads refseq_unmapped_reads clusters_total clusters_hidepth hetero_est error_est reads_consens loci_in_assembly
U03 7 37500 37073 14157 22916 931 698 0.003323 0.001832 687 643
U04 7 37500 37040 13587 23453 948 691 0.003585 0.001900 681 640
U05 7 37500 37101 14086 23015 938 707 0.002505 0.001809 700 653
U19 7 37500 37176 14217 22959 929 685 0.004192 0.001778 672 629
U20 7 37500 37142 14081 23061 922 654 0.003445 0.001852 643 613
U21 7 37500 37185 14479 22706 922 701 0.003514 0.001924 690 647
U23 7 37500 37148 13947 23201 933 694 0.003301 0.001803 679 638
U24 7 37500 37142 13320 23822 936 680 0.003417 0.001822 669 623
U25 7 37500 37050 13758 23292 941 674 0.003358 0.001779 662 628
U41 7 37500 37293 12321 24972 914 610 0.004018 0.002619 598 560
U42 7 37500 37259 13012 24247 928 627 0.003828 0.002814 620 575
U43 7 37500 37212 13309 23903 937 649 0.004455 0.002187 638 602
## Alignment matrix statistics:
snps matrix size: (13, 1755), 22.69% missing sites.
sequence matrix size: (13, 181974), 17.49% missing sites.
Discussion: Compare and contrast these results to those from the de novo assembly. What can explain the differences? It might be worth reflecting back upon what exactly RADcap data are.
We’ve finished our assemblies! Next week, we’ll conduct some basic population genomic and phylogenomic analyses to see what we can learn from these data.