Python code to fetch a protein sequence from UniProt
UniProt is a comprehensive, freely accessible database of protein sequences and their annotations. It integrates data from several sources, including Swiss-Prot, TrEMBL, and PIR-PSD, and provides a central resource for protein sequence and functional information. UniProt also cross-references many other databases, including GenBank (NCBI), PDB, and Ensembl. The database is maintained by the UniProt Consortium, which consists of the European Bioinformatics Institute (EMBL-EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR). UniProt is widely used in life science research for protein analysis, annotation, and discovery.
Here’s a Python code snippet that uses the requests library to fetch a protein sequence from UniProt:
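import requests

def fetch_sequence(uniprot_id):
    """Return the FASTA record for a UniProt ID, or None if the request fails."""
    # Base URL of the UniProt REST API (www.uniprot.org/uniprot/ is the older address)
    base_url = "https://rest.uniprot.org/uniprotkb/"
    # Fetch the entry in FASTA format; the timeout keeps a stalled request from hanging
    response = requests.get(f"{base_url}{uniprot_id}.fasta", timeout=30)
    # Return the FASTA text (description line plus sequence) as a string
    if response.status_code == 200:
        return response.text
    # Any other status (for example 404 for an unknown ID) yields None
    return None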
You can use this function by passing the UniProt ID of the protein you’re interested in, for example:
sequence = fetch_sequence("P12345")
print(sequence)
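Before sending a request, it can help to check that the ID at least has the shape of a UniProt accession. The sketch below is one way to do that. The helper name `looks_like_accession` is made up for this example, and the pattern follows the accession format described in the UniProt help pages, so treat it as an assumption and re-check it against the current documentation before relying on it.

```python
import re

# Assumed accession pattern (6 or 10 characters), taken from the UniProt
# help pages; verify it against the current documentation.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

def looks_like_accession(uniprot_id):
    """Return True if the string has the shape of a UniProt accession.

    This only checks the format; it does not tell you whether the
    entry actually exists in UniProt.
    """
    return ACCESSION_RE.fullmatch(uniprot_id.strip()) is not None
```

A caller could run this check first and skip the network round trip for strings such as "12345" that cannot be valid accessions.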
On success, fetch_sequence returns the entry as FASTA text, a commonly used format for representing biological sequences; if the request fails, it returns None. A FASTA record consists of a single description line starting with a “>” symbol, followed by the sequence itself, which is usually wrapped across several lines.
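Because the sequence is wrapped across lines, you usually want to join it back together before working with it. Here is a minimal sketch of that step, assuming the text holds exactly one FASTA record, which is what a single-entry request returns. The function name `parse_fasta` is chosen for this example and is not part of any library.

```python
def parse_fasta(fasta_text):
    """Split single-record FASTA text into (description, sequence)."""
    # Drop blank lines and surrounding whitespace
    lines = [line.strip() for line in fasta_text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not FASTA: missing '>' description line")
    # The description is the first line without the leading '>'
    description = lines[0][1:]
    # The remaining lines are pieces of one sequence; join them
    sequence = "".join(lines[1:])
    return description, sequence
```

For example, parse_fasta(">demo\nMVH\nLTP\n") returns ("demo", "MVHLTP"). For files with many records, a dedicated parser such as the one in Biopython is a better fit than this sketch.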
Here’s an example of the FASTA text returned for a hypothetical protein:
>sp|P12345|PROTEIN_NAME Protein sequence
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTF
AAB
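In this example, sp marks a reviewed Swiss-Prot entry (unreviewed TrEMBL entries use tr instead), P12345 is the UniProt ID (accession) of the protein, and PROTEIN_NAME stands in for its entry name. The text after the first space is a free-text description of the protein.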
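The fields of the description line can be pulled apart with a few string operations. The sketch below assumes the three-part `db|accession|entry_name` layout shown in the example above. The function name `parse_uniprot_header` and the dictionary keys are chosen for illustration only.

```python
def parse_uniprot_header(header):
    """Split a UniProt FASTA description line into its leading fields."""
    # Accept the line with or without the leading '>'
    header = header.lstrip(">").strip()
    # Everything up to the first space is the identifier block
    identifier, _, description = header.partition(" ")
    parts = identifier.split("|")
    if len(parts) != 3:
        raise ValueError(f"unexpected identifier: {identifier!r}")
    database, accession, entry_name = parts
    return {
        "database": database,        # "sp" or "tr"
        "accession": accession,      # e.g. "P12345"
        "entry_name": entry_name,
        "description": description,  # free text after the first space
    }
```

Called on the example header above, this returns "sp" as the database, "P12345" as the accession, "PROTEIN_NAME" as the entry name, and "Protein sequence" as the description.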