Blast
Summary: This example shows how to perform a BLAST search with BioFSharp.
BioFSharp's BlastWrapper is a tool for performing different tasks with the NCBI BLAST console applications (version 2.2.31+). It can create BLAST databases and perform blastN or blastP queries, and it lets you set output parameters to create a custom output format. Official documentation for all BLAST applications is available from NCBI.
For the purpose of this tutorial, we will build a protein database using a .fastA file containing chloroplast proteins of Chlamydomonas reinhardtii included in BioFSharp/docs/content/data.
Our query protein for the subsequent BLAST search will be the photosystem II protein D1 from Arabidopsis thaliana chloroplast.
Creation of a BLAST database
We will use the minimal number of parameters needed to create a BLAST database from an input file. The created database files will have the same name as the input file and will be located in the same folder. There are many more parameters you can use to specify your database; please refer to the NCBI user manual for more information.
First, let's specify the path of our input and the type of our database. Use a string for the input path and the provided MakeDbParams type for every other parameter.
open BioFSharp
open BioFSharp.IO
open BlastNCBI
open Parameters
///path and name of the input file/output database.
let inputFile = (__SOURCE_DIRECTORY__ + "/data/Chlamy_Cp.fastA")
///defines database type (in this case: a protein database)
let typeOfDatabase = Parameters.MakeDbParams.DbType Parameters.Protein
The wrapper needs to know the path of the NCBI applications.
///the path of the /bin folder where the BLAST applications are located
let ncbiPath = (__SOURCE_DIRECTORY__ + "/../../lib/ncbi-blast/bin")
We now pass the wrapper our NCBI path, the input path and a sequence of parameters (containing just one parameter in this case, the database type).
BlastWrapper(ncbiPath).makeblastdb inputFile ([typeOfDatabase;] |> seq<Parameters.MakeDbParams>)
Your console output will look like this:
[Figure: console output of makeblastdb]
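For orientation, the wrapper call above roughly corresponds to running makeblastdb by hand with the -in and -dbtype options. The following is a minimal sketch that only assembles such an argument string; the helper name makeDbArguments is hypothetical and not part of BioFSharp, and how the wrapper builds its arguments internally is an assumption.

```fsharp
/// Hypothetical helper (not part of BioFSharp): builds the argument string
/// one would pass to the makeblastdb executable by hand.
let makeDbArguments (inputPath: string) (isProtein: bool) =
    // makeblastdb expects "prot" or "nucl" as the value of -dbtype
    let dbType = if isProtein then "prot" else "nucl"
    sprintf "-in \"%s\" -dbtype %s" inputPath dbType

// Example: arguments for the protein database of this tutorial
let exampleArgs = makeDbArguments "data/Chlamy_Cp.fastA" true
```

Quoting the input path keeps the command line intact if the path contains spaces.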
The makeblastdb call creates 3 new files in our directory: Chlamy_Cp.fastA.phr, Chlamy_Cp.fastA.pin and Chlamy_Cp.fastA.psq.
We have successfully created our search database.
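If you want to confirm the result programmatically, a small check along these lines may help. It is a sketch under the assumption that the database consists of exactly the .phr, .pin and .psq files named in this tutorial (other BLAST versions may write additional files); expectedDbFiles and missingDbFiles are hypothetical helper names.

```fsharp
open System.IO

/// Hypothetical helper: the three files expected for a protein database
/// built from `inputPath` (same name and folder as the input file).
let expectedDbFiles (inputPath: string) =
    [ ".phr"; ".pin"; ".psq" ] |> List.map (fun ext -> inputPath + ext)

/// Returns those expected database files that do not exist on disk.
let missingDbFiles (inputPath: string) =
    expectedDbFiles inputPath
    |> List.filter (fun path -> not (File.Exists path))
```

Calling missingDbFiles inputFile after the database creation should then return an empty list.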
Creating a FASTA file from an amino acid string
Note: this step is not necessary if you want to use an already existing file as query. If this is the case, skip to the section "Performing the BLAST search".
First, let's specify a string with our amino acid sequence and convert it to a BioSeq. For more information about BioSeq, please refer to its documentation.
///Raw string of the aminoacid sequence of our query protein
let aminoacidString = "MTAILERRESESLWGRFCNWITSTENRLYIGWFGVLMIPTLLTATSVFIIAFIAAPPVDIDGIREPVSGS
LLYGNNIISGAIIPTSAAIGLHFYPIWEAASVDEWLYNGGPYELIVLHFLLGVACYMGREWELSFRLGMR
PWIAVAYSAPVAAATAVFLIYPIGQGSFSDGMPLGISGTFNFMIVFQAEHNILMHPFHMLGVAGVFGGSL
FSAMHGSLVTSSLIRETTENESANEGYRFGQEEETYNIVAAHGYFGRLIFQYASFNNSRSLHFFLAAWPV
VGIWFTALGISTMAFNLNGFNFNQSVVDSQGRVINTWADIINRANLGMEVMHERNAHNFPLDLAAVEAPS
TNG"
///header for the .fastA file
let header = ">gi|7525013|ref|NP_051039.1| photosystem II protein D1 (chloroplast) [Arabidopsis thaliana]"
///Query sequence represented as a sequence of `AminoAcid`, one of BioFSharp's `BioItems`
let querySequence = BioSeq.ofAminoAcidString aminoacidString
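Because the string literal above spans several lines, it contains line-break characters. Whether BioSeq.ofAminoAcidString skips them is not stated here, so a cautious sketch is to strip whitespace before the conversion. The helper name stripWhitespace is hypothetical and not part of BioFSharp.

```fsharp
/// Hypothetical helper: removes spaces, tabs and line breaks
/// from a raw sequence string.
let stripWhitespace (raw: string) =
    raw |> String.filter (fun c -> not (System.Char.IsWhiteSpace c))

// Example with a short two-line fragment of the sequence
let cleaned = stripWhitespace "MTAILERRES\nESLWGRF"
```

In the tutorial, this would be applied as BioSeq.ofAminoAcidString (stripWhitespace aminoacidString).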
We will now use BioFSharp's FastA library to create a FASTA item and write it to a file.
///path and name of the query file
let queryFastaPath = __SOURCE_DIRECTORY__ + "/data/testQuery.fastA"
///FastaItem containing header string and query sequence
let queryFastaItem = FastA.createFastaItem header querySequence
To create our .fastA file, we need to use the BioItem.symbol converter, which converts the three-letter code of the amino acids in our biosequence to the one-letter symbol (e.g. Met -> M).
FastA.write BioItem.symbol queryFastaPath [queryFastaItem;]
Performing the BLAST search
We have created our search database and the query we want to search for. Before we can perform the actual search, we need to define the BLAST parameters.
Note: custom output formats can only be specified for the output types CSV, tabular and tabular with comments. For more information, check the options for the command-line applications.
First, let's specify the overall output type. This will define the outline of our output. We want our output to be in tabular form, with added information in the form of comments.
Note: when not specified otherwise, the output type will be pairwise.
///overall outline of the output
let outputType = OutputType.TabularWithComments
We have a large selection of parameters that we can include in the output.
///a sequence of custom output format parameters
let outputFormat =
    [
        OutputCustom.Query_SeqId;
        OutputCustom.Subject_SeqId;
        OutputCustom.Query_Length;
        OutputCustom.Subject_Length;
        OutputCustom.AlignmentLength;
        OutputCustom.MismatchCount;
        OutputCustom.IdentityCount;
        OutputCustom.PositiveScoringMatchCount;
        OutputCustom.Evalue;
        OutputCustom.Bitscore;
    ]
    |> List.toSeq
Finally, we create a BlastParam of the type OutputTypeCustom from a tuple of outputType and outputFormat.
Note: no tuple is required if you want to use the default output format. If this is the case, just create a BlastParam of type OutputType.
///The final output format
let customOutputFormat = OutputTypeCustom(outputType , outputFormat)
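On the BLAST command line, an output type and a field selection like the one above are expressed as a single -outfmt value: the numeric type code followed by space-separated format specifiers. The sketch below builds that value. The mapping from the OutputCustom cases to the specifiers is an assumption based on their names, and outfmtValue is a hypothetical helper, not part of BioFSharp.

```fsharp
/// Assumed BLAST format specifiers for the ten fields chosen above, in order:
/// query id, subject id, query length, subject length, alignment length,
/// mismatches, identical matches, positive-scoring matches, e-value, bit score.
let formatSpecifiers =
    [ "qseqid"; "sseqid"; "qlen"; "slen"; "length"
      "mismatch"; "nident"; "positive"; "evalue"; "bitscore" ]

/// Hypothetical helper: builds the value of BLAST's -outfmt option.
/// Type code 7 stands for "tabular with comment lines".
let outfmtValue (typeCode: int) (specifiers: string list) =
    sprintf "%d %s" typeCode (String.concat " " specifiers)

let exampleOutfmt = outfmtValue 7 formatSpecifiers
// "7 qseqid sseqid qlen slen length mismatch nident positive evalue bitscore"
```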
We now have everything set up to perform the BLAST search. As we are talking about proteins, we will use blastP. The parameters needed for the Wrapper function are:
- path of the ncbi/bin folder
- path and name of the search database
- path and name of the query
- path and name of our output file
- a sequence of BLAST parameters, containing any parameters additional to the above (like our customOutputFormat)
Note: in this case we can use the string inputFile that we used above for creating our database, as we did not specify another path or name for our database. Adjust accordingly if you did.
///output file of the BLAST search
let outputPath = (__SOURCE_DIRECTORY__ + "/data/Output.txt")
BlastWrapper(ncbiPath).blastP inputFile queryFastaPath outputPath ([customOutputFormat;] |> seq<BlastParams>)
As you can see in the result file, the format is tab-separated and contains the fields we specified in our customOutputFormat.
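To work with the hits in F#, the result file can be read back line by line. The following is a minimal sketch, assuming the ten fields appear in the order given in outputFormat and that comment lines start with '#'. The BlastHit record and tryParseHit are hypothetical names, and the sample line is made up for illustration rather than taken from a real search.

```fsharp
/// One hit with the ten fields chosen in this tutorial, in the same order.
type BlastHit =
    { QueryId: string
      SubjectId: string
      QueryLength: int
      SubjectLength: int
      AlignmentLength: int
      Mismatches: int
      Identities: int
      Positives: int
      Evalue: float
      Bitscore: float }

/// Parses one tab-separated result line. Returns None for comment lines,
/// blank lines and lines without exactly ten fields.
/// Throws if a numeric field cannot be parsed.
let tryParseHit (line: string) =
    if System.String.IsNullOrWhiteSpace line || line.StartsWith("#") then
        None
    else
        match line.Split('\t') with
        | [| q; s; ql; sl; al; mm; ident; pos; ev; bs |] ->
            Some { QueryId = q
                   SubjectId = s
                   QueryLength = int ql
                   SubjectLength = int sl
                   AlignmentLength = int al
                   Mismatches = int mm
                   Identities = int ident
                   Positives = int pos
                   Evalue = float ev
                   Bitscore = float bs }
        | _ -> None

// Made-up sample line, for illustration only
let sampleHit = tryParseHit "query1\tsubject1\t353\t352\t352\t25\t327\t340\t0.0\t650"
```

Applied to the file from this tutorial, System.IO.File.ReadLines outputPath |> Seq.choose tryParseHit would yield the hits and skip the comment lines.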