Viewing the Open Data from Phylos

One of the things that brought me to Phylos is our commitment to providing as much open data as possible. Phylos is a next-generation agricultural science company focusing on Cannabis and Hemp. Using our Plant Genotype Test product, customers can get a report about their plant, and see how it relates to others in our Galaxy dataset.

We’ve talked about how the customer retains ownership of this data, and how publishing it to the NCBI can help protect against overreaching patents in the future. Now we want to show how a customer, or any interested party, can download and look at this raw data for themselves.

Note: the Open Cannabis Project has great documentation and links to raw data from Phylos as well as other sources; this post takes a more walkthrough approach.

Phylos uploads the data to the Sequence Read Archive (SRA) of the NCBI, so we’re going to use a couple of programs from the SRA Toolkit to do the work. Install it using whichever method suits your system: on a Debian-based GNU/Linux system, the package is in apt; there is also a Homebrew formula for macOS users.
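
As a rough guide, installation and a quick sanity check might look like the sketch below. The package names (sra-toolkit for apt, sratoolkit for Homebrew) are assumptions based on the usual naming and may differ on your system; check_tools is a hypothetical helper, not part of the SRA Toolkit.

```shell
# Install hints (package names are assumptions; check your package manager):
#   Debian/Ubuntu:  sudo apt install sra-toolkit
#   macOS:          brew install sratoolkit

# check_tools: hypothetical helper that reports which of the given commands
# are missing from PATH. Returns non-zero if any are missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# The two tools this walkthrough relies on.
check_tools prefetch fastq-dump || echo "Install the SRA Toolkit before continuing."
```

If nothing is printed, both tools were found and you are ready to continue.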

Now that we have the two CLI tools we need from this kit, “prefetch” and “fastq-dump”, we can figure out which sequence read to download. By navigating to the NCBI page showing all biosamples from Phylos, we can find a variety that we’re interested in. For this example, I’ve chosen an OG Kush from early on in our testing business.

Grab the SRA identifier from the line at the top:

[Screen capture: NCBI page with the SRA identifier highlighted]

Then start downloading the sequence with:

$ prefetch SRS4167515
prefetch.2.9.3: 1) Downloading 'SRS4167515'...
prefetch.2.9.3:  Downloading via https...
prefetch.2.9.3: 1) 'SRS4167515' was downloaded successfully
prefetch.2.9.3: 'SRS4167515' has 0 unresolved dependencies

This command will build a tree like the following in your home directory:

ncbi/
└── public
    └── sra
        └── SRS4167515.sra

At this point, working from the directory that contains the .sra file, we can use “fastq-dump” to get the raw data out of this binary format:

$ fastq-dump --split-files SRS4167515.sra 
Read 2840859 spots for SRS4167515.sra
Written 2840859 spots for SRS4167515.sra

Voilà! We can now open the .fastq files with whatever tool we like; the reads are there in plain text:

$ head SRS4167515_?.fastq
==> SRS4167515_1.fastq <==
@SRS4167515.1 1 length=76
AGTCTTGTCTCCCAGCAACAATCCGAAACCACCATCTCAATGGAAGAGGCTTCTTTCCCAGAACCTCCTCCACTTG
+SRS4167515.1 1 length=76
AAAAAAE///EAE/E</EAEA/A/EAEEEE/EAEE<EEE<E//EEEA//EEAEEA///E//EAEE/E/EAAEEEE/
@SRS4167515.2 2 length=76
GTCTTGTCTCCCAAGCAACAAGCCGAAACCACCATCTCAATGGAAGAGGCTTCTTTCCCAGACCCTCCTCCACTTG
+SRS4167515.2 2 length=76
AAAAAEEEEEEEEEEEEEEE6EEEEEEEEE/EE/EEEEE6EEEEEEEEEEEAEEAEEEAEE66EEEAEEA<AAAEE
@SRS4167515.3 3 length=76
AGTCTTGTCTCCAAGCAACAAGCCGAAACCACCATCTCAATGGAAGAGGCTTCTTTCCCAGAACCTCCTCCACTTG

==> SRS4167515_2.fastq <==
@SRS4167515.1 1 length=76
AAGATTTCGGATCTGTCCTCGGAACAAGTGGAGGAGGTTCTGGGAAAGAAGCCTCTTCCATTGAGATGGTGGTTTC
+SRS4167515.1 1 length=76
6/A/A/EA//EAAEAAAA//EAEEAEEE/<<AE/EA<///EE//EEAEEEEE/EA<6/<//EEE/E//E/EEAE//
@SRS4167515.2 2 length=76
GAGATTTCGGATCTGTCCTCGGAACAAGTGGAGGAGGTTCTGGGAAAGAAGCCTCTTCCATTGAGATGGTGGTTTC
+SRS4167515.2 2 length=76
AAA6AEEEEEEEEEEEEEEEEEEEEEEEEEE6EAEEEEAEEEEEAEEE/AEEAEEEEEEEEEEA6AEEAEEEAAE/
@SRS4167515.3 3 length=76
AAGATTTCGGATCTGTCCTCGGAACAAGTGGAGGAGGTTCTGGGAAAGAAGCCTCTTCCATTGAGATGGTGGTTTC
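
Each read in these files occupies four lines: an @ header, the bases, a + separator line, and one quality character per base. As a quick sanity check, the sketch below counts the reads in a file and averages their length with awk. fastq_stats is a hypothetical helper name, and it assumes the simple four-line layout shown above, with no sequences wrapped across lines.

```shell
# fastq_stats FILE
# Hypothetical helper: counts reads and reports their mean length.
# Assumes exactly four lines per record, as in the output above.
fastq_stats() {
    awk '
        # Line 2 of every 4-line record holds the bases.
        NR % 4 == 2 { n++; total += length($0) }
        END {
            if (n > 0) {
                printf "reads=%d mean_length=%.1f\n", n, total / n
            } else {
                print "reads=0"
            }
        }
    ' "$1"
}

# Usage, assuming the files written by fastq-dump are in the current directory:
#   fastq_stats SRS4167515_1.fastq
#   fastq_stats SRS4167515_2.fastq
```

Assuming every spot is paired, one would expect each file's read count to match the 2840859 spots reported by fastq-dump.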

We’ve been keeping an eye on the Bionode project as well. For the JavaScript / Node.js fans out there, you can do the same thing as above using two of its packages: bionode-ncbi and bionode-sra. However, bionode-sra depends on the SRA Toolkit, so you would need that installed anyway.

To use these tools, make sure you have Node.js installed properly on your system, then use “npm install -g bionode-ncbi bionode-sra” to fetch and install the packages.

To make sense of this raw data, one needs to align it to a reference genome. Phylos published our assembly of the Cannatonic genome in 2016; stay tuned for more news on this front.

Bionode-ncbi makes it very easy to fetch this assembly. The search tool returns JSON, so we’re using “jq” to parse the UID out of the response:

$ bionode-ncbi search assembly phylos | jq .uid
"874841"

$ bionode-ncbi download assembly 874841

$ cd 874841/

$ gunzip GCA_001865755.1_ASM186575v1_genomic.fna.gz

$ head GCA_001865755.1_ASM186575v1_genomic.fna 
>MNPR01000001.1 Cannabis sativa cultivar Cannatonic Cannabis.v1_scf1_q, whole genome shotgun sequence
GTGACAAGAGTAACCCTAACACCAGAGAAGGTGAATCTAGACCTGTTACGTGAAAAACTCAAAAGAATACATTCCCAGAA
ACTTCAAataccaaacccagaaaaacaaacagataAGTAAAGCATGCCATAAACACGAACAGTAAAGCAAGGGATGCGAG
AAACTTACAGTGAAGTGTTGAAACTGGGGGTTCTGTTGTCGATCAAACAAAGGAAATATAGGCTGACAGTGGTGAATGAA
GGATAACAGGGAAATTTGTAGAGTGTTCGAAGGTTTTTTCTTGGTCTGGGGATTTTGCTCTGAAAAACTCGAacaaagtt
ggcaagaaatggaaagtaaccaaatgaaggaaagatggtggcttttataggccaaagtgcatgaggaacaggcaccatca
cctaccaatcgaacggctGGGGgggcgcacgattcgaattcccaagaATCAACAGTTGGATTGATTCgtaattaaggcgg
tggaaacgcttgagtatccgtctgacaccattaatgcaccgtatcaatcaatggacatgaaacgaatcgacatctgcaaa
aagtgaatcgctgcattagAGGCGATCATTAAATGCACCTTCACGCGCTCAAAGCTCGTGCCAGGAAACTGAGGGGTCAA
TCGGAggcacttgaacaagccttgttttctttttcaaataaacaaggcttggggggtaaatgctgcccctgattttgccc

With a reference genome in hand, we can use tools like the Burrows-Wheeler Aligner (BWA) to match reads to contigs, then generate variant calls with a tool like Freebayes. Once we have variant calls, the real analysis can begin!
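
A minimal sketch of that workflow follows. It assumes bwa, samtools, and freebayes are installed and on the PATH; samtools is not mentioned above and is brought in here only to sort and index the alignments. The align_and_call function and its output file names are hypothetical, and default parameters are used throughout, which is not necessarily how Phylos runs its own pipeline.

```shell
# align_and_call REF R1 R2 PREFIX
# Hypothetical helper: aligns paired reads to a reference and calls variants.
# Writes PREFIX.bam, PREFIX.bam.bai, and PREFIX.vcf.
align_and_call() {
    ref=$1; r1=$2; r2=$3; prefix=$4

    # Build the BWA index and the FASTA index for the reference.
    bwa index "$ref" || return 1
    samtools faidx "$ref" || return 1

    # Align the paired reads and sort the alignments by coordinate.
    # A read group is attached because variant callers commonly expect one.
    bwa mem -R "@RG\tID:$prefix\tSM:$prefix" "$ref" "$r1" "$r2" \
        | samtools sort -o "$prefix.bam" - || return 1
    samtools index "$prefix.bam" || return 1

    # Call variants against the reference; freebayes writes VCF to stdout.
    freebayes -f "$ref" "$prefix.bam" > "$prefix.vcf"
}

# Example, using the file names from the steps above:
#   align_and_call GCA_001865755.1_ASM186575v1_genomic.fna \
#       SRS4167515_1.fastq SRS4167515_2.fastq SRS4167515
```

Indexing a full plant genome and aligning millions of reads can take a while, so it may be worth trying this on a small subset of reads first.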

Phylos is constantly evaluating and upgrading our toolchain. Look for more posts in the coming months about new technologies and techniques we’re pioneering in the industry.

Huge kudos and thanks go out to Alisha Holloway, PhD and Kayla Hardwick, PhD for organizing and uploading these datasets, proofreading this post, and teaching those around them.
