Welcome to ProjectM1S1Bioinfo’s documentation!

Quickstart Demo (GUIless)

Let’s import our two classes:

```
from blast_hitter import BlastHitter
from clusterizer import Clusterizer
```

Proteomes can be downloaded with the RefSeqScraper script or saved manually to the data/genomes directory. Initialise a list of all proteome paths so that BlastHitter’s factory method can create BlastHitter objects for all possible permutations of them.

```
proteomes = ["../data/genomes/Rickettsia_rickettsii_str._Arizona_strain=Arizona_protein.faa",
             "../data/genomes/Streptococcus_pneumoniae_R6_strain=R6_protein.faa",
             "../data/genomes/Streptococcus_pyogenes_strain=NCTC8232_protein.faa",
             "../data/genomes/Streptococcus_thermophilus_LMD-9_strain=LMD-9_protein.faa",
             "../data/genomes/Piscirickettsia_salmonis_strain=Psal-158_protein.faa"]

bhitters = BlastHitter.from_list(proteomes)
```

We can blast them and accumulate the reciprocal best hits with a for loop:

```
for bh in bhitters:
    bh.blast_them()
    bh.rbh_them()
```

Now that our BlastHitter objects have their RBH files, let’s create a Clusterizer object:
```
clust = Clusterizer(bhitters, proteomes)
```
The final steps are to create the clusters, align each one, concatenate the alignments, and then launch the phylogenetic algorithm and draw the Newick tree:
```
clust.cluster_them()
clust.one_align_to_rule_them_all()
clust.draw_tree()
```

Documentation

Graphical User Interface

class interface.window

This is a class for the graphical user interface which offers different functionalities.

It has five main functions:

Make_a_Tree:
Builds a phylogenetic tree from a set of chosen proteomes using BLAST/MUSCLE/RAxML.

Download:
Lets the user download proteomes from RefSeq.

E-values:
Visualizes the distribution of e-values from a BLASTp results file.

Statistics:
Shows some statistics of a chosen proteome.

Exit:
A menu button that exits the graphical interface.

Download_proteome(dropdown_list)

This method is called once the user has selected the proteome to download. It downloads the selected proteome from the RefSeq database to the disk.

Parameters: dropdown_list (list) – A list of the names of all proteomes available in the RefSeq database that contain the pattern entered by the user.

Search_Download_proteome()

This method is called by the “Download” feature. It asks the user to type a few letters, which are then searched for in the list of all proteomes available in the RefSeq database.

Returns: pattern – The few letters typed by the user.
Return type: str

Select_Download_proteome(pattern)

This method is called when the user presses Enter. It displays a dropdown list of all proteomes available in the RefSeq database that match the pattern given by the user.

Parameters: pattern (str) – The few letters typed by the user.
Returns: dropdown_list – A list of the names of all proteomes available in the RefSeq database that contain the pattern entered by the user.
Return type: list

createMenuBar()

This method creates a menu bar containing several tabs, each of which gives access to one feature of the GUI.

distributions_of_evalues()

This method is called by the “E-values” feature. It displays a list of all BLASTp results files that are available on the disk.

Returns: blast_presents – A list of all BLASTp files available on the disk.
Return type: list

proteomes_in_disk()

This method is called when the user selects the “Make_a_Tree” feature. It displays a list of all proteomes that are available on the disk.

Returns: p_presents – A list of the names of all proteomes available on the disk.
Return type: list

reset()

This method resets all widgets except the menu bar. It is particularly useful when the user switches tabs or clicks the current tab again.

stats()

This method is called by the “Statistics” feature. It displays a list of all proteomes that are available on the disk.

Returns: prot_presents – A list of all proteomes available on the disk.
Return type: list

validate_blast_selection(blast_presents)

This method is called once the user has selected the BLASTp file whose e-value distribution they want to visualize. It opens a new window with a histogram of that distribution.

Parameters: blast_presents (list) – A list of all BLASTp files available on the disk.
Returns: Opens a new window with a histogram of the distribution of e-values of the chosen BLASTp file.
Return type: None

validate_prot_select(prot_presents)

This method is called once the user has selected the proteome they want statistics about. It opens a new window with the statistics of the selected proteome.

Parameters: prot_presents (list) – A list of all proteomes available on the disk.
Returns: Opens a new window with some statistics about the proteome that the user has chosen.
Return type: None

validate_proteome_selection(p_presents)

This method is called when the user validates the proteome selection in order to obtain a phylogenetic tree.

Parameters: p_presents (list) – A list of the names of all proteomes available on the disk.
Returns: Opens a new window with the phylogenetic tree.
Return type: None

RefSeqScraper

class refseq_scraper.RefSeqScraper

This class scans the RefSeq assembly summary dataframe, keeps only the latest and complete genomes, and uses pandas to create a new column, “readable”, that combines the organism name and infraspecific name columns so that the user can choose a genome more easily. It is based on RefSeq’s assembly summary file: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt

data

A pandas dataframe that contains all RefSeq genomes.

Type: pandas.DataFrame

cart

A list of the names of the species whose genomes are to be downloaded.

Type: list

add_to_cart(species)

This function adds a species to the object’s cart.

Parameters: species (str) – The name of the species to be downloaded; it should match records in the dataframe.
Returns: A sentence that confirms the addition of a given species to the cart.
Return type: str

download_genome()

Downloads all genomes from the FTP paths in the cart and saves them to the data/genomes directory.

mine_ftps()

This function searches the dataframe and extracts the FTP paths that correspond to the species in the cart.

Returns: A list of FTP paths.
Return type: list

mine_species()

Asks the user for a few letters and finds the records in the dataframe that match this pattern. If the user chooses to add one of the displayed records, the method adds it.

BlastHitter

class blast_hitter.BlastHitter(query, subject)

This is a class for BLASTing two genomes against each other and for deducing reciprocal best hits that are common between the two blastp results files.

query_path

The relative path of the query genome file on the disk.

Type: str

query_name

The name of the query genome.

Type: str

subject_path

The relative path of the subject genome file on the disk.

Type: str

subject_name

The name of the subject genome.

Type: str

results_dir

The relative path of the results directory where output files are saved.

Type: str

genomes_dir

The relative path of the genomes directory.

Type: str

first_blastp

The path of the first .blastp file (query vs. subject blast).

Type: str

second_blastp

The path of the second .blastp file (subject vs. query blast).

Type: str

rbh

The path of the reciprocal best hits file (.blastp tabulated format).

Type: str

static best_hits_from_blast(blastp_file)

Returns the best hit for every protein in a BLAST output file. The algorithm is simple: the first appearance of a protein is taken as its best hit, because BLAST lists the hits in descending order of quality.

Parameters: blastp_file (str) – The blast output file in tabulated format (outfmt = 6).
Returns: besthits_dict – A dictionary covering all query proteins of a given genome, with the query as key and the corresponding best hit as value.
Return type: dict
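
The first-appearance rule described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `best_hits_sketch` and the sample rows are hypothetical, and the sketch assumes that the first two tab-separated columns of an outfmt 6 file are the query and subject identifiers.

```python
import io


def best_hits_sketch(lines):
    """Map each query ID to the subject ID of its first listed hit."""
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        query, subject = fields[0], fields[1]
        # setdefault keeps the first hit seen for a query and ignores later ones
        best.setdefault(query, subject)
    return best


# Hypothetical excerpt of a tabular BLAST result (only three columns shown)
example = io.StringIO(
    "protA\tprotX\t98.0\n"
    "protA\tprotY\t45.0\n"
    "protB\tprotZ\t77.5\n"
)
hits = best_hits_sketch(example)
```

Here `hits` maps `protA` to `protX` and `protB` to `protZ`; the second row for `protA` is ignored because it appears later in the file.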

static bidir_best_hits(blastp1, blastp2, out)

This function determines the reciprocal best hits between two BLAST files. It calls the class’s best_hits_from_blast static method to determine the best hits for each BLAST file, then creates two sets of tuples: (query, best hit) for the first BLAST and (best hit, query) for the second, which inverts the second BLAST file. The intersection of the two sets gives the reciprocal best hits (RBH).

Parameters

blastp1 (str) – The first blast file path.
blastp2 (str) – The second blast file path.
out (str) – The reciprocal best hits output file path, tabulated (outfmt = 6).

Returns

Creates the out RBH file, based on the first blast file, after calculating the reciprocal best hits.

Return type

None.
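
The set intersection at the heart of this method can be illustrated with two small best-hit dictionaries. This is only a sketch under the description above: `reciprocal_best_hits_sketch` and the identifiers in the example are hypothetical, and writing the result to the output file is left out.

```python
def reciprocal_best_hits_sketch(best_forward, best_reverse):
    """Return the (query, hit) pairs that are best hits in both directions."""
    forward_pairs = set(best_forward.items())
    # Flip the second BLAST's (query, hit) pairs so both sets share one orientation
    reverse_pairs = {(hit, query) for query, hit in best_reverse.items()}
    return forward_pairs & reverse_pairs


# Hypothetical best hits: genome A vs. genome B, then genome B vs. genome A
forward = {"a1": "b1", "a2": "b2", "a3": "b9"}
reverse = {"b1": "a1", "b2": "a7", "b9": "a3"}
rbh = reciprocal_best_hits_sketch(forward, reverse)
```

In this example `a2` is dropped, because the best hit of `b2` in the reverse direction is `a7` rather than `a2`.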

blast_them()

This method launches two BLASTs, running our two genomes against each other. It uses the class’s static methods.

Returns

str – The first blast output file path.
str – The second blast output file path.

static evalue_dist(blastp)

Generates an e-value distribution plot from a given blastp file.

Parameters: blastp (str) – The blast file path to be analyzed.

classmethod from_list(prots_list)

This class method instantiates BlastHitter objects from a list of proteomes/genomes.

Parameters: prots_list (list) – A list of genome/proteome file paths.
Returns: A list of BlastHitter objects.
Return type: list
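
The documentation does not state whether the factory builds one object per ordered pair or per unordered pair of proteomes. As a purely illustrative sketch (the file names are hypothetical, and neither variant is claimed to be what from_list does), the standard library can enumerate both:

```python
from itertools import combinations, permutations

# Hypothetical proteome file names
paths = ["A.faa", "B.faa", "C.faa"]

# Ordered pairs: (A, B) and (B, A) are distinct
ordered_pairs = list(permutations(paths, 2))

# Unordered pairs: each couple of proteomes appears once
unordered_pairs = list(combinations(paths, 2))
```

For three proteomes this yields six ordered pairs and three unordered ones. Since a single BlastHitter already blasts its two genomes in both directions, unordered pairs would be enough to cover every comparison.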

getRbh()

The getter of the RBH file path for a certain blasthitter object.

Returns: RBH file path.
Return type: str

static parse_fasta(proteome)

This function parses a genome/proteome file.

Parameters: proteome (str) – The path of the genome/proteome file.
Returns: seqdic – A dictionary with protein accessions as keys and the corresponding fasta sequences as values.
Return type: dict
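
A minimal FASTA parser that returns this kind of dictionary could look like the sketch below. It is not the project's code: `parse_fasta_sketch` is a hypothetical name, it reads from any iterable of lines instead of a file path, and it assumes the accession is the first whitespace-delimited token of each header.

```python
def parse_fasta_sketch(lines):
    """Return {accession: sequence} from the lines of a FASTA file."""
    sequences = {}
    accession = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the accession is the first token after '>'
            accession = line[1:].split()[0]
            sequences[accession] = ""
        elif accession is not None:
            # Sequence lines are joined into one string per record
            sequences[accession] += line
    return sequences


# Hypothetical two-record FASTA content
example = [
    ">WP_000001.1 hypothetical protein [Some species]",
    "MKT",
    "LLV",
    ">WP_000002.1 another protein [Some species]",
    "MA",
]
seqdic = parse_fasta_sketch(example)
```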

rbh_them()

This method calculates the reciprocal best hits between our two genomes after BLASTing them, and creates the RBH file.

Returns: The RBH output file path.
Return type: str

static seqkit_stat(proteome)

This function prints key information about the proteome to be analyzed. It first parses the proteome, then prints the number of sequences, the cumulative length, and the minimum, average and maximum sequence lengths. It mimics the output of seqkit stats.

Parameters: proteome (str) – The path of the proteome file.
Returns: Prints some statistics of the chosen proteome file.
Return type: None.
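
The statistics listed above are straightforward to compute from a parsed proteome. The sketch below is an assumption-laden illustration: `length_stats_sketch` is a hypothetical helper that returns the values instead of printing them, and the key names simply follow the column names used by seqkit stats.

```python
def length_stats_sketch(sequences):
    """Compute seqkit-like length statistics from {accession: sequence}."""
    lengths = [len(seq) for seq in sequences.values()]
    if not lengths:
        raise ValueError("no sequences to summarize")
    return {
        "num_seqs": len(lengths),
        "sum_len": sum(lengths),
        "min_len": min(lengths),
        "avg_len": sum(lengths) / len(lengths),
        "max_len": max(lengths),
    }


# Hypothetical parsed proteome with three short sequences
stats = length_stats_sketch({"p1": "MKTLLV", "p2": "MA", "p3": "MKQL"})
```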

static universal_blast(query, subject, out, outfmt=6, typ='p')

Returns the blast command to be executed in the terminal.

Parameters

query (str) – Query proteome file path.
subject (str) – Subject proteome file path.
out (str) – Blast output file path.
outfmt (int, optional) – Blast results format. The default is 6 (tabulated, without headers).
typ (str, optional) – Type of blast to run. The default is 'p' (blastp).

Returns

The blast command to be executed.

Return type

str
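
One way such a command string could be assembled is sketched below. This is a guess at the general shape, not the project's code: `blast_command_sketch` is hypothetical, and it assumes that the program name is "blast" followed by typ and that the subject is passed as a FASTA file through BLAST+'s -subject option rather than as a formatted database.

```python
import shlex


def blast_command_sketch(query, subject, out, outfmt=6, typ="p"):
    """Build a BLAST+ command line as a single shell-safe string."""
    program = "blast" + typ  # assumption: 'p' gives blastp, 'n' gives blastn
    args = [
        program,
        "-query", query,
        "-subject", subject,
        "-out", out,
        "-outfmt", str(outfmt),
    ]
    # Quote each argument so paths containing spaces survive the shell
    return " ".join(shlex.quote(arg) for arg in args)


cmd = blast_command_sketch("a.faa", "b.faa", "a_vs_b.blastp")
```

Quoting each argument matters here because proteome file names such as those in the quickstart contain characters like `=` and may contain spaces.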

Clusterizer

class clusterizer.Clusterizer(blasthitters, proteomes)

This is a class for clustering groups of orthologous proteins (OG), extracting their sequences from the corresponding proteomes, aligning each cluster individually and then constructing the super-alignment.

rbh_files

A list of the reciprocal best hits file paths generated by the BlastHitter objects.

Type: list

proteomes

A list of all proteome paths present in a given analysis.

Type: list

working_cluster

A collection of clusters (protein accession numbers) found among the given RBH files, filtered to keep at most one protein per species in each cluster (no paralogues). Dictionary keys are cluster IDs, which are auto-incremented int values.

Type: dict

corr_species_cluster

The species cluster that corresponds to the working cluster dictionary. It has the same data structure as the working cluster, except that protein accession numbers are replaced with the name of the species of origin.

Type: dict

super_alignement

The file path of the calculated super-alignment in aligned fasta format (.afa).

Type: str

static all_pairs_rbh(files)

This function takes multiple RBH files and returns one long list of all RBH couples present in those files. It calls the class’s pair_rbh static method.

Parameters: files (list) – A list of all RBH file paths to be analyzed.
Returns: total – A list of tuples of all RBH couples in all specified files.
Return type: list

static cluster_from_proteome(cluster, proteomes, out)

Extracts a fasta file for a given cluster. It first parses all proteome files to generate one big dictionary, then filters it to keep only the headers and sequences that correspond to the accession numbers found in the cluster.

Parameters

cluster (tuple) – A tuple of protein accession numbers found in a cluster.
proteomes (list) – A list of all proteome file paths.
out (str) – The output file path.

cluster_them()

This method runs the whole clustering workflow. First, it builds clusters from the RBH files and writes them to a text file. Second, it generates the species clusters. Third, it filters both dictionaries to at most one protein per organism in each cluster. Lastly, it writes the filtered clusters to a file and sets the two dictionaries as class properties so they can be used elsewhere.

static clustering(rbh_files)

The main clustering algorithm, which groups reciprocal best hits across multiple RBH blast files. It uses the class’s all_pairs_rbh static method as a starting point. After getting the list of all RBH couples from all RBH files (a list of tuples), it traverses that list and adds each couple to an existing cluster (the clusters are sets kept inside a list) if either member is already present in that set. If the loop exits without a break, it creates a new cluster from the two RBHs. Time complexity: O(n²).

Parameters: rbh_files (list) – A list of RBH blast file paths.
Returns: A dictionary of all clusters present among all RBH files. Keys are auto-incremented integers that correspond to cluster IDs; values are the NCBI accession numbers of the proteins present in a given cluster.
Return type: dict
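
The traversal described above maps naturally onto Python's for/else construct, as in the following sketch. It works on an in-memory list of couples rather than on files, `cluster_pairs_sketch` is a hypothetical name, and numbering the clusters from 1 and storing them as sorted tuples are assumptions made for the example.

```python
def cluster_pairs_sketch(pairs):
    """Greedily group RBH couples into clusters of accessions."""
    clusters = []  # list of sets, one set per cluster
    for first, second in pairs:
        for cluster in clusters:
            if first in cluster or second in cluster:
                # One member is already known: both join that cluster
                cluster.update((first, second))
                break
        else:
            # The inner loop ended without a break: start a new cluster
            clusters.append({first, second})
    # Assumption: cluster IDs are integers counted from 1
    return {i: tuple(sorted(c)) for i, c in enumerate(clusters, start=1)}


# Hypothetical RBH couples gathered from several RBH files
pairs = [("a1", "b1"), ("b1", "c1"), ("a2", "b2"), ("c1", "d1")]
clusters = cluster_pairs_sketch(pairs)
```

Each couple is compared against every existing cluster, which is where the quadratic cost comes from. Note that in this sketch a couple that bridges two existing clusters only joins the first one found; the clusters themselves are not merged.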

static clusters_to_txt(cluster_dict, out)

Writes a cluster dictionary to a text file; each line corresponds to a cluster of orthologous proteins.

Parameters

cluster_dict (dict) – A dictionary of cluster ids and cluster accessions.
out (str) – The output text file path.

Returns

Return type

None.

draw_tree(iters=10)

This method generates an interactive phylogenetic tree from a super-alignment file. It computes a Newick tree with RAxML and then visualizes it using the ETE Toolkit.

Parameters: iters (int, optional) – Number of repetitions for bootstrap. The default is 10.

static max_one_species_per_cluster(cluster_species, cluster_dict)

Filters the cluster dictionaries to keep only one protein per species. It scans the cluster species dictionary and returns the subset of clusters in which every species has only one protein, then filters the cluster accession dictionary with the IDs of the filtered one.

Parameters

cluster_species (dict) – The cluster dictionary of species names.
cluster_dict (dict) – The cluster dictionary (with accessions).

Returns

dict – The filtered cluster accession dictionary.
filtered (dict) – The filtered cluster species dictionary.
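
The filtering step can be sketched with two dictionary comprehensions. This is an illustration only: `one_protein_per_species_sketch` and the sample data are hypothetical, and clusters are assumed to be stored as tuples under shared integer IDs.

```python
def one_protein_per_species_sketch(cluster_species, cluster_dict):
    """Keep only clusters in which no species appears more than once."""
    filtered_species = {
        cid: species
        for cid, species in cluster_species.items()
        # A cluster with a repeated species name has fewer unique names than members
        if len(species) == len(set(species))
    }
    # Keep the accession clusters that share an ID with a surviving species cluster
    filtered_accessions = {cid: cluster_dict[cid] for cid in filtered_species}
    return filtered_accessions, filtered_species


# Hypothetical clusters: cluster 2 holds two proteins from the same species
species = {1: ("E. coli", "B. subtilis"), 2: ("E. coli", "E. coli", "B. subtilis")}
accessions = {1: ("p1", "p2"), 2: ("p3", "p4", "p5")}
kept_accessions, kept_species = one_protein_per_species_sketch(species, accessions)
```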

static muscle(cluster_dict, proteomes)

Creates a fasta file for each cluster, then aligns each of them using MUSCLE.

Parameters

cluster_dict (dict) – The cluster dictionary (with accessions).
proteomes (list) – The list of all proteome file paths.

Returns

afasta_files – A list of all the generated aligned fasta file paths (.afa).

Return type

list

one_align_to_rule_them_all()

Generates the super-alignment file for the analyzed species of interest. It first generates MSAs for all clusters using MUSCLE, then concatenates them. All results are saved to their respective directories. It also sets the super-alignment file path as a class property.

static pair_rbh(file)

This function parses a reciprocal best hits file and returns a list of all RBH couples. It extracts the first and second columns.

Parameters: file (str) – The RBH file path.
Returns: A list of tuples of all RBH couples.
Return type: list

static species_cluster(cluster_dict, proteomes)

This function generates a species cluster from an accession cluster. Keys are cluster IDs (auto-incremented int values) and values are clusters of species names, one for every protein in the cluster. It first parses all proteome files and every header in each proteome. Afterwards, it scans the already generated cluster dictionary to build a replica of it in which each accession number is swapped for its species of origin.

Parameters

cluster_dict (dict) – The cluster dictionary (with accessions).
proteomes (list) – A list of proteome file paths.

Returns

The corresponding species dictionary to a given cluster accession dictionary.

Return type

dict
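
Once an accession-to-species lookup is available, building the replica is a single comprehension. The sketch below assumes that this lookup has already been produced from the proteome headers; `species_cluster_sketch` and the sample accessions are hypothetical.

```python
def species_cluster_sketch(cluster_dict, accession_to_species):
    """Replace every accession in every cluster with its species of origin."""
    return {
        cid: tuple(accession_to_species[acc] for acc in accessions)
        for cid, accessions in cluster_dict.items()
    }


# Hypothetical lookup, as it might be built while parsing the proteomes
lookup = {"p1": "E. coli", "p2": "B. subtilis", "p3": "E. coli"}
clusters = {1: ("p1", "p2"), 2: ("p3",)}
species_clusters = species_cluster_sketch(clusters, lookup)
```

Keeping the same keys and the same ordering inside each tuple means that position i of a species cluster always refers to position i of the matching accession cluster.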

static super_alignement(cluster_dict, cluster_species, maligns, out)

This function concatenates all the multiple alignment files into one super-alignment fasta file. Each header is the name of a studied species, and its sequence is the concatenation of that species’ aligned sequences across all clusters. It first creates a seed dictionary from the first cluster, with species names as keys and the aligned sequences as values. Then, for each cluster, it appends the aligned sequence to the species’ entry if the cluster has a protein from that species; otherwise it appends gaps of the same length as the multiple alignment of that cluster.

Parameters

cluster_species (dict) – The cluster dictionary of species names.
cluster_dict (dict) – The dictionary of cluster accession numbers.
maligns (list) – A list of all multiple alignment file paths.
out (str) – The super-alignment output file path.
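
The gap-filling concatenation can be sketched on in-memory alignments. This is a simplified illustration rather than the project's code: `concatenate_alignments_sketch` is hypothetical, each alignment is assumed to be already loaded as a dictionary from species name to aligned sequence, and writing the fasta output is omitted.

```python
def concatenate_alignments_sketch(alignments, all_species):
    """Join per-cluster alignments into one sequence per species."""
    concatenated = {species: "" for species in all_species}
    for aligned in alignments:
        if not aligned:
            continue  # nothing to append for an empty alignment
        # All sequences of one alignment share the same length
        width = len(next(iter(aligned.values())))
        for species in all_species:
            # Species missing from this cluster get a run of gaps instead
            concatenated[species] += aligned.get(species, "-" * width)
    return concatenated


# Hypothetical alignments for two clusters; sp2 is absent from the second one
alignments = [
    {"sp1": "MK-L", "sp2": "MKAL", "sp3": "MKAL"},
    {"sp1": "AC", "sp3": "A-"},
]
super_aln = concatenate_alignments_sketch(alignments, ["sp1", "sp2", "sp3"])
```

Padding with gaps keeps every concatenated sequence the same length, which tree-building programs generally require of their input alignment.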

static tree_generator(super_alignement, bootstrap)

Generates a Newick tree from a given super-alignment. It uses RAxML’s maximum likelihood algorithm.

Parameters

super_alignement (str) – The super-alignement file path.
bootstrap (int) – Number of repetitions for bootstrap.

Returns

tree_name – The newick tree to be visualized (the file path).

Return type

str
