A technical resource for researchers interested in analyzing Phylos’s publicly available cannabis genotype data.
Foreword: Please note that this is a technical blog post. As more researchers are exploring cannabis genomes, and as we’ve worked to support researchers in understanding how to review our datasets, we developed this blog post as a technical resource. Learn more about our most recent data upload here.
We created the Phylos Galaxy, a publicly accessible data visualization, to showcase the diversity and relationships between cannabis varieties. With over 3000 varieties from more than 80 countries, the Galaxy tells the story of the development of cannabis varieties, and provides the industry with greater supply chain transparency, clarity, and confidence. To support continued research and understanding of the plant, we make the raw data available for download through the NCBI’s Sequence Read Archive (SRA).
In two data upload events (2016 and 2018) Phylos has shared data from over 2200 samples with the NCBI for use by researchers. Learn more about why we upload data here.
The raw data we have available on the SRA is in FastQ file format. To learn more about how to download the data, you can see this post by Kenichi Nakamura, our VP of Engineering. Regarding the raw sequence data, the set that was uploaded in 2016 should have two files per sample - one for the forward read and one for the reverse. The second large set (2018 data upload) is also paired end, but each sample has a single file.
Many other groups, including Google Cloud, have already used this data in various ways.
The genotype data that is available through the NCBI was generated by extracting DNA from stem tissue and preparing samples for sequencing using TruSeq Custom Amplicon Library Kits. Paired-end libraries are sequenced using an Illumina NextSeq500. Sequence data are then analyzed using a custom automated pipeline to generate genotype reports and for display on the Phylos Galaxy.
Processing raw image data to variant calls
Image files are converted to raw readable data using bcl2fastq. Using barcodes, we multiplex up to 192 samples per run. During the disambiguation step that uses the barcodes to assign reads to each sample, we allow for a single mismatch in the barcode sequence. This setting gives a high level of specificity in assigning reads to samples without sacrificing read depth which is valuable for variant calling.
We use BWA for alignment of sequence reads to the set of reference amplicons. Reference amplicons are based on the Cannatonic genome. Our TSCA amplicons are roughly 120 basepairs (bp) long. Each amplicon that we utilize, which includes the probe region, is unique in the reference sequence. Probe regions (~20-50 bp) are maintained during the alignment phase, but focal variants are not located in those probe regions. For analysis, there is a single SNP within each amplicon that we utilize; the majority of which are located in the middle of the amplicon. As we have learned more about the structure of the cannabis genome, we have filtered out amplicons if they are found in multiple copies in the genome.
Reference-based variant calling
During the development of the Phylos Genotype Test, we compared the quality of variant callers, as have others (e.g., Hwang et al., 2015). Each variant caller has different biases and limitations for variant calling. Our analysis pipeline uses Freebayes and we input a reference SNP site location during variant calling, which increases the accuracy of calls. As mentioned, TSCA uses a single amplicon per focal region and genetic variation (SNPs or indels) could impact downstream variant calling and comparisons between samples. Frequent occurrence of missing calls for a site could indicate a common indel or SNPs in a probe region. Our large data set of more than 3000 samples shows when there are problematic variants. Over time, we have removed a few lower quality SNP sites. The vast majority have high call rates; a few low frequency SNPs or indels in probe regions will not impact population genetic conclusions. For each sample, we require greater than or equal to 80% call rate over all high quality SNPs.
The stringency of our variant calling pipeline
Much of the filtering for quality is done during the variant calling step. We input a VCF with focal sites and only assess input alleles. We use only reads with mapping quality greater than or equal to 30, and in order to call a variant, we require sum of qualities supporting an allele greater than or equal to 100, mismatch read limit less than or equal to five (5), and each site covered by greater than or equal to 20 reads. Our analyses showed that varying other parameters, such as theta, minimum alternate fraction, and use of binomial priors has minimal impact on variant calls.
Clone groups represented by exemplar
Clonal propagation is very common in cannabis. Thus, many of the samples are genetically identical. In fact, approximately one third of all samples belong to a “clone group” and some groups contain 40 or more identical samples. Since this can bias population genetic analyses, we choose an exemplar from each "clone group" for conducting various analysis (e.g., PCA, IBD). One way to do this is to use the “--rel-cutoff” flag in Plink to reduce to a single exemplar sample per group. Our pipeline first identifies the set of genetically identical samples that forms each clone group, then within each group, uses the sample with the highest variant call rate as the exemplar.
We developed the Galaxy because cannabis has been in the dark for too long. We see the Galaxy and the public accessibility of that visualization as well as the raw data as a way to engage with the community of enthusiasts and scientists and expand access to general cannabis knowledge worldwide.
Email OpenData@phylos.bio to share a project you’ve worked on or to ask additional questions.