Python code and example to parse BLAST output
BLAST (Basic Local Alignment Search Tool) is a widely used algorithm in bioinformatics for finding similarities between sequences. It is primarily used to search a query sequence against a large database of sequences to find similar sequences. BLAST is particularly useful in identifying homologs (evolutionarily related sequences) in large databases, such as nucleotide or protein sequences. The algorithm returns a list of sequence matches, ranked by similarity and a score indicating the strength of the similarity. BLAST is commonly used for many purposes, including gene discovery, genome annotation, protein function prediction, and evolutionary analysis.
Following is a sample blast output file:
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>Query_1</Iteration_query-ID>
<Iteration_query-def>Example Query</Iteration_query-def>
<Iteration_query-len>18</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gi|1234|gb|A2345678|ExampleHit</Hit_id>
<Hit_def>Example Hit Definition</Hit_def>
<Hit_accession>A2345678</Hit_accession>
<Hit_len>200</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>150.0</Hsp_bit-score>
<Hsp_score>50</Hsp_score>
<Hsp_evalue>1e-05</Hsp_evalue>
<Hsp_query-from>1</Hsp_query-from>
<Hsp_query-to>18</Hsp_query-to>
<Hsp_hit-from>100</Hsp_hit-from>
<Hsp_hit-to>117</Hsp_hit-to>
<Hsp_query-frame>0</Hsp_query-frame>
<Hsp_hit-frame>0</Hsp_hit-frame>
<Hsp_identity>18</Hsp_identity>
<Hsp_positive>18</Hsp_positive>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>18</Hsp_align-len>
<Hsp_qseq>AGCTAGCTAGCTAGCT</Hsp_qseq>
<Hsp_hseq>AGCTAGCTAGCTAGCT</Hsp_hseq>
<Hsp_midline>||||||||||||||</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>
Following is a script to parse this file:
from Bio.Blast import NCBIXML
def parse_blast_output(file):
result_handle = open(file)
blast_records = NCBIXML.parse(result_handle)
for blast_record in blast_records:
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
# Do something with the high-scoring segment pair (HSP)
print("****Alignment****")
print("sequence:", alignment.title)
print("length:", alignment.length)
print("e value:", hsp.expect)
print("score:", hsp.score)
print("query sequence:", hsp.query)
print("match sequence:", hsp.sbjct)
# Parse the BLAST output file
parse_blast_output("blast_output.xml")
In this example, the NCBIXML.parse function is used to parse the BLAST output in XML format and create a blast_records iterator. The script then loops through the blast_record objects, which represent individual BLAST searches, and accesses the data in each record using the Bio.Blast module’s classes and methods.
You can modify the script to extract the data you’re interested in and perform any other operations you need, such as saving the data to a file or a database, or analyzing it further.
Following is output of this file using the blast file above.
****Alignment****
sequence: gi|1234|gb|A2345678|ExampleHit Example Hit Definition
length: 200
e value: 1e-05
score: 50.0
query sequence: AGCTAGCTAGCTAGCT
match sequence: AGCTAGCTAGCTAGCT