Welcome to ProjectM1S1Bioinfo’s documentation!
Quickstart Demo (GUIless)
Let’s import our two classes:
from blast_hitter import BlastHitter
from clusterizer import Clusterizer
Proteomes can be downloaded with the RefSeqScraper script, or saved manually to the data/genomes directory. A list of all paths should be initialised, so that we can use BlastHitter’s factory method (from_list) to create BlastHitter objects for all possible permutations of the proteomes.
proteomes = [
    "../data/genomes/Rickettsia_rickettsii_str._Arizona_strain=Arizona_protein.faa",
    "../data/genomes/Streptococcus_pneumoniae_R6_strain=R6_protein.faa",
    "../data/genomes/Streptococcus_pyogenes_strain=NCTC8232_protein.faa",
    "../data/genomes/Streptococcus_thermophilus_LMD-9_strain=LMD-9_protein.faa",
    "../data/genomes/Piscirickettsia_salmonis_strain=Psal-158_protein.faa",
]
bhitters = BlastHitter.from_list(proteomes)
We can blast them and accumulate the reciprocal best hits with a for loop:
for bh in bhitters:
    bh.blast_them()
    bh.rbh_them()
Now that our BlastHitter objects are populated with RBH files, let’s create a Clusterizer object:
clust = Clusterizer(bhitters, proteomes)
The final steps are to create the clusters, align each one, concatenate the alignments, and then launch the phylogenetic algorithm and draw the Newick tree:
clust.cluster_them()
clust.one_align_to_rule_them_all()
clust.draw_tree()
Documentation
Graphical User Interface
class interface.window
This is the class for the graphical user interface. It offers five main features:
- Make_a_Tree:
Builds a phylogenetic tree from a set of chosen proteomes using BLAST, MUSCLE and RAxML.
- Download:
Lets the user download proteomes from RefSeq.
- E-values:
Visualizes the distribution of e-values from a BLASTp results file.
- Statistics:
Displays some statistics for a chosen proteome.
- Exit:
Menu button that exits the graphical interface.
Download_proteome(dropdown_list)
This method is called once the user has selected the proteome to download. It downloads the selected proteome from the RefSeq database to disk.
- Parameters
dropdown_list (list) – A list of the names of all proteomes available in the RefSeq database that contain the pattern entered by the user.
Search_Download_proteome()
This method is called by the “Download” feature. It asks the user to type a few letters, which are then searched for in the list of all proteomes available in the RefSeq database.
- Returns
pattern – the few letters typed by the user.
- Return type
str
Select_Download_proteome(pattern)
This method is called when the user presses Enter. It displays a dropdown list of all proteomes available in the RefSeq database that match the pattern given by the user.
- Parameters
pattern (str) – the few letters typed by the user.
- Returns
dropdown_list – A list of the names of all proteomes available in the RefSeq database that contain the pattern entered by the user.
- Return type
list
createMenuBar()
This method creates a menu bar that contains several tabs. Each tab gives access to one feature of the GUI.
distributions_of_evalues()
This method is called by the “E-values” feature. It displays a list of all BLASTp results files that are available on disk.
- Returns
blast_presents – A list of all BLASTp files available on disk.
- Return type
list
proteomes_in_disk()
This method is called when the user selects the “Make_a_Tree” feature. It displays a list of all proteomes that are available on disk.
- Returns
p_presents – A list of the names of all proteomes available on disk.
- Return type
list
reset()
This method resets all widgets except the menu bar. It is particularly useful when the user switches tabs or clicks on the current tab again.
stats()
This method is called by the “Statistics” feature. It displays a list of all proteomes that are available on disk.
- Returns
prot_presents – A list of all proteomes available on disk.
- Return type
list
validate_blast_selection(blast_presents)
This method is called once the user has selected the BLASTp file whose e-value distribution should be visualized. It displays a new window with a histogram of that distribution.
- Parameters
blast_presents (list) – A list of all BLASTp files available on disk.
- Returns
Opens a new window with a histogram of the distribution of e-values of the chosen BLASTp file.
- Return type
None
validate_prot_select(prot_presents)
This method is called once the user has selected the proteome to get statistics about. It displays a new window with the statistics of the selected proteome.
- Parameters
prot_presents (list) – A list of all proteomes available on disk.
- Returns
Opens a new window with some statistics about the proteome that the user has chosen.
- Return type
None
validate_proteome_selection(p_presents)
This method is called when the user validates the proteome selection in order to obtain a phylogenetic tree.
- Parameters
p_presents (list) – A list of the names of all proteomes available on disk.
- Returns
Opens a new window with the phylogenetic tree.
- Return type
None
RefSeqScraper
class refseq_scraper.RefSeqScraper
This class scans the RefSeq assembly summary dataframe and keeps only the latest and complete genomes. It then uses pandas to create a new column, ‘readable’, that combines the organism name and infraspecific name columns, which makes it easier for the user to choose a genome. It is based on RefSeq’s assembly summary file: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
data
A pandas dataframe that contains all RefSeq genomes.
- Type
pandas.DataFrame
cart
A list of the names of the species whose genomes are to be downloaded.
- Type
list
add_to_cart(species)
This function adds a species to the object’s cart.
- Parameters
species (str) – the name of the species to be downloaded, should match records in the dataframe.
- Returns
A sentence that confirms the addition of a given species to the cart.
- Return type
str
download_genome()
Downloads all genomes from the FTP paths in the cart and saves them to the data/genomes directory.
mine_ftps()
This function searches the dataframe and extracts the FTP paths that correspond to the species in the cart.
- Returns
A list of FTP paths.
- Return type
list
mine_species()
Asks the user for a few letters and finds the records in the dataframe that match the input pattern. If the user wants to add one of the records shown, the method adds it.
BlastHitter
class blast_hitter.BlastHitter(query, subject)
This is a class for BLASTing two genomes against each other and for deducing the reciprocal best hits that are common to the two blastp results files.
query_path
the relative path for the query genome file on the disk.
- Type
str
query_name
the name of the query genome.
- Type
str
subject_path
the relative path for the subject genome file on the disk.
- Type
str
subject_name
the name of the subject genome.
- Type
str
results_dir
the relative path of the results repository for saving output files.
- Type
str
genomes_dir
the relative path of the genomes repository.
- Type
str
first_blastp
the path of the first .blastp file (query vs. subject blast).
- Type
str
second_blastp
the path of the second .blastp file (subject vs. query blast).
- Type
str
rbh
the path of the reciprocal best hits file (.blastp tabulated format).
- Type
str
static best_hits_from_blast(blastp_file)
Returns the best hit for every protein based on the blast output file. The algorithm is simple: the first appearance of a protein is taken as its best hit, because BLAST reports the hits in descending order of quality.
- Parameters
blastp_file (str) – The blast output file with a tabulated format (outfmt = 6).
- Returns
besthits_dict – A dictionary that covers all protein queries of a given genome, with each query as a key and its best hit as the value.
- Return type
dict
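The first-appearance rule described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function name `best_hits` is hypothetical, and it assumes tab-separated outfmt 6 lines in which the first column is the query ID and the second the subject ID.

```python
import io


def best_hits(lines):
    """Map each query ID to the subject ID of its first (best) hit.

    Hypothetical helper: `lines` is any iterable of outfmt 6 lines.
    """
    hits = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        query, subject = fields[0], fields[1]
        # setdefault keeps the first subject seen for each query
        # and ignores all later (weaker) hits.
        hits.setdefault(query, subject)
    return hits


# Made-up records, truncated to three columns for readability.
sample = io.StringIO(
    "protA\tprotX\t98.5\n"
    "protA\tprotY\t60.1\n"
    "protB\tprotZ\t75.0\n"
)
print(best_hits(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly.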
static bidir_best_hits(blastp1, blastp2, out)
This function determines the reciprocal best hits between two blast files. It calls the class’s best_hits_from_blast static method to determine the best hits for each blast file, then creates two sets of tuples: (query, best hit) for the first blast and (best hit, query) for the second blast, which inverts the second blast file. The intersection of the two sets gives the reciprocal best hits (RBH).
- Parameters
blastp1 (str) – The first blast file path.
blastp2 (str) – The second blast file path.
out (str) – The reciprocal best hits output file path, tabulated (outfmt = 6).
- Returns
Writes the RBH output file, based on the first blast file, after calculating the reciprocal best hits.
- Return type
None.
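The set intersection at the heart of this method might look like the sketch below. It works on two best-hit dictionaries instead of files, and the name `reciprocal_best_hits` is an assumption made for illustration only.

```python
def reciprocal_best_hits(best1, best2):
    """Return the (query, hit) pairs that are best hits in both directions.

    best1 maps proteins of genome 1 to their best hit in genome 2;
    best2 maps proteins of genome 2 to their best hit in genome 1.
    """
    forward = set(best1.items())
    # Invert the second dictionary so both sets share the same orientation.
    backward = {(hit, query) for query, hit in best2.items()}
    return forward & backward


# Made-up identifiers: a1/b1 are reciprocal, a2/b2 are not.
first = {"a1": "b1", "a2": "b2"}
second = {"b1": "a1", "b2": "a3"}
print(reciprocal_best_hits(first, second))
```

Only pairs found in both orientations survive the intersection, which is what makes a hit reciprocal.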
blast_them()
This method launches two BLASTs, running our two genomes against each other. It uses the class’s static methods.
- Returns
str – The first blast output file path.
str – The second blast output file path.
static evalue_dist(blastp)
Generates an e-value distribution plot from a given blastp file.
- Parameters
blastp (str) – The blast file path to be analyzed.
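As a rough idea of the data behind such a plot, the sketch below counts e-values by order of magnitude without drawing anything. The name `evalue_exponents` is hypothetical. It assumes the standard 12-column outfmt 6 layout, in which the e-value is the 11th column.

```python
import math
from collections import Counter


def evalue_exponents(lines):
    """Count hits per e-value order of magnitude (floor of log10)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a full outfmt 6 record
        evalue = float(fields[10])
        # An e-value of 0.0 has no logarithm, so it gets its own bucket.
        key = "0" if evalue == 0 else math.floor(math.log10(evalue))
        counts[key] += 1
    return counts


# One made-up record in the default 12-column layout.
record = "q1\ts1\t90.0\t100\t5\t0\t1\t100\t1\t100\t2e-50\t200\n"
print(evalue_exponents([record]))
```

The resulting counter could be handed to any plotting library to draw the histogram.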
classmethod from_list(prots_list)
This class method instantiates BlastHitter objects from a list of proteomes/genomes.
- Parameters
prots_list (list) – a list of genome/proteome file paths.
- Returns
a list of BlastHitter objects.
- Return type
list
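How the pairs are enumerated is not specified here. One plausible reading, given that each BlastHitter already blasts in both directions, is one object per unordered pair of proteomes. The sketch below illustrates that assumption with `itertools.combinations`; the helper `pair_up` is hypothetical and returns plain tuples rather than BlastHitter objects.

```python
from itertools import combinations


def pair_up(paths):
    """Return every unordered pair of paths (an assumed pairing scheme)."""
    return list(combinations(paths, 2))


# Placeholder file names, not real proteome files.
for query, subject in pair_up(["a.faa", "b.faa", "c.faa"]):
    print(query, "vs", subject)
```

If ordered pairs were intended instead, `itertools.permutations(paths, 2)` would yield twice as many pairs.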
getRbh()
The getter of the RBH file path for a certain blasthitter object.
- Returns
RBH file path.
- Return type
str
static parse_fasta(proteome)
This function parses a genome/proteome file.
- Parameters
proteome (str) – the path of the genome/proteome file.
- Returns
seqdic – a dictionary with protein accessions as keys and the corresponding fasta sequences as values.
- Return type
dict
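A dictionary of this shape can be built with a few lines of plain Python. The sketch below is illustrative only: `parse_fasta_lines` is a hypothetical name, and it assumes that the accession is the first whitespace-delimited token of each header line.

```python
def parse_fasta_lines(lines):
    """Return {accession: sequence} from an iterable of FASTA lines."""
    chunks = {}
    accession = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            # Assumption: the accession is the first token of the header.
            accession = tokens[0] if tokens else ""
            chunks[accession] = []
        elif accession is not None:
            chunks[accession].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {acc: "".join(parts) for acc, parts in chunks.items()}


# Made-up two-record FASTA text.
fasta = [">P1 first protein [Species one]", "MKV", "LLA", ">P2 second", "MST"]
print(parse_fasta_lines(fasta))
```

Passing an open file handle instead of a list works the same way, since the function only iterates over lines.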
rbh_them()
This method calculates the reciprocal best hits between our two genomes after BLASTing them and creates the RBH file.
- Returns
The RBH output file path.
- Return type
str
static seqkit_stat(proteome)
This function prints important information about the proteome that will be analyzed. It parses the proteome first, then prints the number of sequences, the cumulative length, and the minimum, average and maximum sequence lengths. It mimics the output of seqkit stats.
- Parameters
proteome (str) – the path of the proteome file.
- Returns
Prints some statistics from a chosen proteome file.
- Return type
None.
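The statistics listed above are simple to derive once the proteome is available as a dictionary of sequences. The following sketch uses a hypothetical `length_stats` helper and borrows the column names of seqkit stats for its keys; it returns the values rather than printing them.

```python
def length_stats(seqs):
    """Summarise sequence lengths for a {accession: sequence} dictionary."""
    lengths = [len(seq) for seq in seqs.values()]
    if not lengths:
        # Avoid min()/max() errors and division by zero on empty input.
        return {"num_seqs": 0, "sum_len": 0, "min_len": 0,
                "avg_len": 0.0, "max_len": 0}
    return {
        "num_seqs": len(lengths),
        "sum_len": sum(lengths),
        "min_len": min(lengths),
        "avg_len": sum(lengths) / len(lengths),
        "max_len": max(lengths),
    }


# Made-up sequences of lengths 3, 5 and 4.
print(length_stats({"P1": "MKV", "P2": "MKVLL", "P3": "MSTA"}))
```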
static universal_blast(query, subject, out, outfmt=6, typ='p')
Returns the blast command to be executed in the terminal.
- Parameters
query (str) – query proteome file path.
subject (str) – subject proteome file path.
out (str) – blast output file path.
outfmt (int, optional) – blast results format. The default is 6 (tabular, without headers).
typ (str, optional) – type of blast to run. The default is 'p' (blastp).
- Returns
The blast command to be executed.
- Return type
str
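A command string of this kind could be assembled as in the sketch below. It is an assumption that `typ` is simply appended to "blast" to pick the program; the `-query`, `-subject`, `-out` and `-outfmt` options are standard BLAST+ flags. The name `blast_command` is hypothetical.

```python
import shlex


def blast_command(query, subject, out, outfmt=6, typ="p"):
    """Build a BLAST+ command line as a single shell-safe string."""
    parts = [
        f"blast{typ}",  # assumed naming scheme: 'p' -> blastp, 'n' -> blastn
        "-query", shlex.quote(query),
        "-subject", shlex.quote(subject),
        "-out", shlex.quote(out),
        "-outfmt", str(outfmt),
    ]
    return " ".join(parts)


# Placeholder file names.
print(blast_command("q.faa", "s.faa", "q_vs_s.blastp"))
```

Quoting each path with `shlex.quote` keeps the command valid even when a file name contains spaces or shell metacharacters.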
Clusterizer
class clusterizer.Clusterizer(blasthitters, proteomes)
This is a class for clustering groups of orthologous proteins (OGs), extracting their respective sequences from the corresponding proteomes, aligning each cluster individually and then constructing the super-alignment.
rbh_files
A list of reciprocal best hits file paths, one generated by each BlastHitter object.
- Type
list
proteomes
A list of all proteome paths present in a given analysis.
- Type
list
working_cluster
A collection of clusters (protein accession numbers) found among the given RBH files, filtered to keep at most one protein per species inside a cluster (no paralogues). Dictionary keys correspond to cluster IDs, which are auto-incremented int values.
- Type
dict
corr_species_cluster
The species cluster that corresponds to the working cluster dictionary. It has the same data structure as the working cluster, except that protein accession numbers are replaced with the name of the species of origin.
- Type
dict
super_alignement
The file path of the calculated super-alignment in aligned fasta format (.afa).
- Type
str
static all_pairs_rbh(files)
This function takes multiple RBH files and returns one long list of all RBH couples present in those files. It relies on the class’s static method for parsing a single RBH file (pair_rbh).
- Parameters
files (list) – a list of all RBH file paths to be analyzed.
- Returns
total – a list of tuples of all RBH couples in all specified files.
- Return type
list
static cluster_from_proteome(cluster, proteomes, out)
Extracts a fasta file for a given cluster. It first parses all proteome files to generate one big dictionary, then filters it to keep only the headers and sequences that correspond to the accession numbers found in the cluster.
- Parameters
cluster (tuple) – A tuple of protein accession numbers found in a cluster.
proteomes (list) – A list of all proteome file paths.
out (str) – The output file path.
cluster_them()
This method runs the whole clustering workflow. First, it builds clusters from the RBH files and writes them out to a text file. Second, it generates the species clusters. Third, it filters the dictionaries to keep only one protein per organism inside a cluster. Lastly, it writes the filtered clusters to a file and sets the two dictionaries as class properties so they can be used elsewhere.
static clustering(rbh_files)
The main clustering algorithm, which groups reciprocal best hits found across multiple RBH blast files. It uses the class’s all_pairs_rbh static method as a starting point. After getting the list of all RBH couples in all RBH files (a list of tuples), it traverses that list and adds each couple to an existing cluster (a set inside a list) if one of its two members is already present in that set. If the loop exits without a break, it creates a new cluster from the two RBHs. Time complexity is O(n²).
- Parameters
rbh_files (list) – A list of RBH blast file paths.
- Returns
A dictionary of all clusters present among all RBH files. Keys are auto-incremented integers that correspond to cluster IDs; values are the NCBI accession numbers of the proteins present in a given cluster.
- Return type
dict
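The traversal described above maps naturally onto Python's for/else construct. The sketch below is one possible reading of it, not the project's code: `cluster_pairs` is a hypothetical name, it works on a list of pairs rather than on files, and numbering the clusters from 0 is an arbitrary choice.

```python
def cluster_pairs(pairs):
    """Group RBH couples into clusters with a single greedy pass."""
    clusters = []
    for first, second in pairs:
        for cluster in clusters:
            if first in cluster or second in cluster:
                # One member is already known: extend that cluster.
                cluster.update((first, second))
                break
        else:
            # The inner loop ended without a break: start a new cluster.
            clusters.append({first, second})
    # Auto-incremented integer keys (starting at 0 here) serve as cluster IDs.
    return dict(enumerate(clusters))


# Made-up accessions: two couples share "b", one couple stands alone.
print(cluster_pairs([("a", "b"), ("b", "c"), ("x", "y")]))
```

Note that a single greedy pass like this one does not merge two clusters that only become connected by a later pair. Whether the actual implementation handles that case is not stated here.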
static clusters_to_txt(cluster_dict, out)
Writes a cluster dictionary to a text file. Each line corresponds to one cluster of orthologous proteins.
- Parameters
cluster_dict (dict) – A dictionary of cluster ids and cluster accessions.
out (str) – The output text file path.
- Returns
- Return type
None.
draw_tree(iters=10)
This method generates an interactive phylogenetic tree from a super-alignment file. It calculates a Newick tree with RAxML and then visualizes it using the ETE Toolkit.
- Parameters
iters (int, optional) – Number of repetitions for bootstrap. The default is 10.
static max_one_species_per_cluster(cluster_species, cluster_dict)
Filters cluster dictionaries to only one protein per species. It scans the cluster species dictionary and returns the subset of it in which every species appears only once per cluster, then it filters the cluster accession dictionary with the IDs of the filtered one.
- Parameters
cluster_species (dict) – The cluster dictionary of species names.
cluster_dict (dict) – The cluster dictionary (with accessions).
- Returns
dict – The filtered cluster accession dictionary.
filtered (dict) – the filtered cluster species dictionary.
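A filter of this kind can be expressed with two dictionary comprehensions. In the sketch below, the name `one_protein_per_species` is hypothetical, and the values of `cluster_species` are assumed to be lists or tuples of species names with one entry per protein, so that a repeated name signals a paralogue.

```python
def one_protein_per_species(cluster_species, cluster_dict):
    """Keep only the clusters in which no species appears twice."""
    filtered = {
        cid: species
        for cid, species in cluster_species.items()
        # A cluster passes when all of its species names are distinct.
        if len(species) == len(set(species))
    }
    # Keep the accession clusters whose IDs survived the species filter.
    kept = {cid: cluster_dict[cid] for cid in filtered}
    return kept, filtered


# Made-up data: cluster 1 holds two proteins from the same species.
species = {0: ["sp1", "sp2"], 1: ["sp1", "sp1", "sp3"]}
accessions = {0: ["P1", "P2"], 1: ["P3", "P4", "P5"]}
print(one_protein_per_species(species, accessions))
```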
static muscle(cluster_dict, proteomes)
Creates a fasta file for each cluster, then aligns each of them using MUSCLE.
- Parameters
cluster_dict (dict) – The cluster dictionary (with accessions).
proteomes (list) – The list of all proteome file paths for a cluster.
- Returns
afasta_files – A list of all the generated aligned fasta file paths (.afa).
- Return type
list
one_align_to_rule_them_all()
Generates the super-alignment file for the analyzed species of interest. It first generates MSAs for all clusters using MUSCLE, then concatenates them. All results are saved to their respective directories. It also sets the super-alignment file path as a class property.
static pair_rbh(file)
This function parses a reciprocal best hits file and returns a list of all RBH couples. It extracts the first and second columns.
- Parameters
file (str) – The RBH file path.
- Returns
a list of tuples of all RBH couples.
- Return type
list
static species_cluster(cluster_dict, proteomes)
This function generates a species cluster from an accession cluster. Keys are cluster IDs (auto-incremented int values) and values are clusters of species names, one for every protein in the cluster. First, it parses all proteome files and every header in each proteome. Afterwards, it scans the already generated cluster dictionary to build a replica of it in which each accession number is replaced by its species of origin.
- Parameters
cluster_dict (dict) – The cluster dictionary (with accessions).
proteomes (list) – A list of proteomes file paths.
- Returns
The corresponding species dictionary to a given cluster accession dictionary.
- Return type
dict
static super_alignement(cluster_dict, cluster_species, maligns, out)
This function concatenates all the multiple alignment files into one super-alignment fasta file. Each header is the name of a studied species, and its sequence is the concatenation of all the clusters. It first creates a seed dictionary from the first cluster, i.e. with species names as keys and aligned sequences as values. Afterwards, for each cluster, it appends the aligned sequence to the species’ entry if the cluster has a protein from that species; otherwise it fills the gap with a run of gap characters of the same length as the multiple alignment of that cluster.
- Parameters
cluster_species (dict) – The cluster dictionary of species names.
cluster_dict (dict) – The dictionary of cluster accession numbers.
maligns (list) – A list of all multiple alignment file paths.
out (str) – The super-alignement output file path.
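The gap-filling concatenation can be illustrated on in-memory data. The sketch below is an assumption-laden simplification: `concatenate_alignments` is a hypothetical name, each cluster alignment is given as a dictionary from species name to aligned sequence, and no files are read or written.

```python
def concatenate_alignments(species, alignments):
    """Concatenate per-cluster alignments into one sequence per species.

    species: list of all species names in the analysis.
    alignments: list of {species_name: aligned_sequence}, one per cluster.
    """
    blocks = {name: [] for name in species}
    for alignment in alignments:
        if not alignment:
            continue  # nothing to append for an empty cluster
        # All rows of one alignment share the same width.
        width = len(next(iter(alignment.values())))
        for name in species:
            # Species missing from this cluster get a run of gaps instead.
            blocks[name].append(alignment.get(name, "-" * width))
    return {name: "".join(parts) for name, parts in blocks.items()}


# Made-up alignments: species B has no protein in the second cluster.
result = concatenate_alignments(
    ["A", "B"],
    [{"A": "MK-", "B": "MKL"}, {"A": "GG"}],
)
print(result)
```

Padding with gaps keeps every row of the super-alignment the same length, which tree-building tools generally require.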
static tree_generator(super_alignement, bootstrap)
Generates a Newick tree from a given super-alignment. It uses RAxML’s maximum likelihood algorithm.
- Parameters
super_alignement (str) – The super-alignment file path.
bootstrap (int) – Number of repetitions for bootstrap.
- Returns
tree_name – The file path of the Newick tree to be visualized.
- Return type
str