table of contents

Integration with Dat (WIP section)

Also check try-dat.com

This takes eukaryota data and run a gene sequence alignment on the data.

Here's the pipeline, split up into 4 parts:

{
  "import-data": [
    "bionode-ncbi search genome eukaryota",
    "dat import -d eukaryota --key=uid"
  ],
  "search-ncbi": [
    "dat export -d eukaryota",
    "grep Guillardia",
    "tool-stream extractProperty assemblyid",
    "bionode-ncbi download assembly -",
    "tool-stream collectMatch status completed",
    "tool-stream extractProperty uid",
    "bionode-ncbi link assembly bioproject -",
    "tool-stream extractProperty destUID",
    "bionode-ncbi link bioproject sra -",
    "tool-stream extractProperty destUID",
    "grep 35526",
    "bionode-ncbi download sra -",
    "tool-stream collectMatch status completed",
    "tee > metadata.json"
  ],
  "index-and-align": [
    "cat metadata.json",
    "bionode-sra fastq-dump -",
    "tool-stream extractProperty destFile",
    "bionode-bwa mem **/*fna.gz"
  ],
  "convert-to-bam": [
    "bionode-sam 35526/SRR070675.sam"
  ]
}

First, make sure your terminal is inside the eukaryota directory that you cloned earlier.

Create a new empty file called gasket.json and copy/paste the above pipeline into the file. Verify that the copy/paste worked by typing gasket ls and verify that it prints out the 4 different named pipelines:

$ gasket ls
import-data
search-ncbi
index-and-align
convert-to-bam

You can run a named pipeline by running gasket run NAME, where NAME is the name of your pipeline.

We can skip the import-data pipeline since we have already cloned the eukaryota data into a local dat database.

The second pipeline, search-ncbi, starts by doing dat export and then takes the data from dat and uses it to download additional datasets from NCBI (National Center for Biotechnology Information), a server where lots of bioinformatics datasets are hosted.

Run this pipeline:

gasket run search-ncbi

More output

By default there is no output while the pipeline is running, but if you want to see what's happening under the hood you can run the pipeline again with the DEBUG environment variable set to * (to show all possible debug output):

DEBUG=* gasket run search-ncbi

This pipeline should create a couple of folders and download some files into them.

Sequence alignment

The next pipeline, index-and-align, uses the downloaded genetic data from the search-ncbi pipeline and runs process called a DNA sequence alignment.

Try running the next pipeline, this one will take a few minutes to complete:

DEBUG=* gasket run index-and-align

When this finishes, it should have created even more files in the folders from the previous step.

The final pipeline, convert-to-bam, converts the output of the alignment into a different file format. This pipeline should be pretty fast.

DEBUG=* gasket run convert-to-bam

Congratulations, you just ran a DNA sequence alignment!