Welcome to ProjectM1S1Bioinfo’s documentation!

Quickstart Demo (GUIless)

  • Let’s import our two classes

    from blast_hitter import BlastHitter
    from clusterizer import Clusterizer
    
  • Proteomes can be downloaded with the RefSeqScraper script or saved manually to the data/genomes directory. Initialise a list of all proteome paths so that BlastHitter’s factory method can create BlastHitter objects for all possible permutations.

    proteomes = ["../data/genomes/Rickettsia_rickettsii_str._Arizona_strain=Arizona_protein.faa",
    "../data/genomes/Streptococcus_pneumoniae_R6_strain=R6_protein.faa",
    "../data/genomes/Streptococcus_pyogenes_strain=NCTC8232_protein.faa",
    "../data/genomes/Streptococcus_thermophilus_LMD-9_strain=LMD-9_protein.faa",
    "../data/genomes/Piscirickettsia_salmonis_strain=Psal-158_protein.faa"]
    
    bhitters = BlastHitter.from_list(proteomes)
    
  • We can BLAST them and collect the reciprocal best hits (RBH) with a for loop:

    for bh in bhitters:
        bh.blast_them()
        bh.rbh_them()
    
  • Once every BlastHitter has produced its RBH file, let’s create a Clusterizer object:

    clust = Clusterizer(bhitters, proteomes)
    
  • The final steps are to build the clusters, align each one and concatenate the alignments, then launch the phylogenetic algorithm and draw the Newick tree:

    clust.cluster_them()
    clust.one_align_to_rule_them_all()
    clust.draw_tree()
    

Documentation

Graphical User Interface

class interface.window

This is a class for the graphical user interface which offers different functionalities.

It has five main features:

Make_a_Tree :

A feature that builds a phylogenetic tree from a selection of proteomes using BLAST/MUSCLE/RAxML.

Download :

A feature that lets the user download proteomes from RefSeq.

E-values :

Visualizes the distribution of e-values from a BLASTp results file.

Statistics :

A feature that displays some statistics for a chosen proteome.

Exit :

Button in the menu to exit the graphical interface.

Download_proteome(dropdown_list)

This method is called once the user has selected the proteome to download. It downloads the selected proteome from the RefSeq database to the disk.

Parameters

dropdown_list (list) – A list of the names of all proteomes available in the RefSeq database that contain the pattern written by the user.

Search_Download_proteome()

This method is called by the “Download” feature. It asks the user to type a few letters, which are then searched for in the list of all proteomes available in the RefSeq database.

Returns

pattern – the few letters typed by the user.

Return type

str

Select_Download_proteome(pattern)

This method is called when the user presses Enter. It displays a dropdown list of all proteomes available in the RefSeq database that match the pattern given by the user.

Parameters

pattern (str) – the few letters typed by the user.

Returns

dropdown_list – A list of the names of all proteomes available in the RefSeq database that contain the pattern written by the user.

Return type

list

createMenuBar()

This method creates a menu bar that contains several tabs. Each tab gives access to a feature of the GUI.

distributions_of_evalues()

This method is called by the “E-values” feature. It displays a list of all BLASTp results files that are available on the disk.

Returns

blast_presents – A list of all BLASTp files available on the disk.

Return type

list

proteomes_in_disk()

This method is called when the user selects the “Make_a_Tree” feature. It displays a list of all proteomes that are available on the disk.

Returns

p_presents – A list of the names of all proteomes available on the disk.

Return type

list

reset()

This method resets all widgets except the menu bar. It is particularly useful when the user changes tabs or clicks on the current tab again.

stats()

This method is called by the “Statistics” feature. It displays a list of all proteomes that are available on the disk.

Returns

prot_presents – A list of all proteomes available on the disk.

Return type

list

validate_blast_selection(blast_presents)

This method is called when the user has selected the BLASTp file to visualize. It opens a new window with a histogram of the distribution of e-values.

Parameters

blast_presents (list) – A list of all BLASTp files available on the disk.

Returns

Opens a new window with a histogram of the distribution of e-values of a chosen BLASTp file.

Return type

None

validate_prot_select(prot_presents)

This method is called when the user has selected the proteome to inspect. It opens a new window with the statistics of the selected proteome.

Parameters

prot_presents (list) – A list of all proteomes available on the disk.

Returns

Opens a new window with some statistics about the proteome that the user has chosen.

Return type

None

validate_proteome_selection(p_presents)

This method is called when the user validates the proteome selection in order to obtain a phylogenetic tree.

Parameters

p_presents (list) – A list of the names of all proteomes available on the disk.

Returns

Opens a new window with the phylogenetic tree.

Return type

None

RefSeqScraper

class refseq_scraper.RefSeqScraper

This class scans the RefSeq assembly summary dataframe, keeps only the latest and complete genomes, and uses pandas to create a new column ‘readable’ that combines the organism and infraspecific name columns, making it easier for the user to choose a genome. It is based on RefSeq’s assembly summary file: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt

data

A pandas dataframe that contains all RefSeq genomes.

Type

pandas.DataFrame

cart

A list of the names of the species whose genomes will be downloaded.

Type

list

add_to_cart(species)

This function adds a species to the object’s cart.

Parameters

species (str) – the name of the species to be downloaded, should match records in the dataframe.

Returns

A sentence that confirms the addition of a given species to the cart.

Return type

str

download_genome()

Downloads all genomes from the FTP paths in the cart and saves them to the data/genomes directory.

mine_ftps()

This function searches the dataframe and extracts the FTP paths that correspond to the species in the cart.

Returns

A list of ftp paths.

Return type

list
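The filtering and lookup described for this class can be sketched without pandas, as a rough illustration only. The column names (version_status, assembly_level, organism_name, infraspecific_name, ftp_path) are those of the RefSeq assembly summary file; the helper names latest_complete and ftp_paths_for, the space used to join the ‘readable’ label, and the toy rows are assumptions made for this example.

```python
def latest_complete(rows):
    """Keep latest, complete assemblies and add a 'readable' label to each."""
    kept = []
    for row in rows:
        if row["version_status"] != "latest":
            continue
        if row["assembly_level"] != "Complete Genome":
            continue
        # Joining with a space is an assumption; the real class may format it differently.
        readable = f"{row['organism_name']} {row['infraspecific_name']}".strip()
        kept.append({**row, "readable": readable})
    return kept


def ftp_paths_for(cart, rows):
    """Return the ftp_path of every row whose 'readable' label is in the cart."""
    wanted = set(cart)
    return [row["ftp_path"] for row in rows if row["readable"] in wanted]


# Toy rows with placeholder values, not real RefSeq records.
toy_rows = [
    {"organism_name": "Species one", "infraspecific_name": "strain=A",
     "version_status": "latest", "assembly_level": "Complete Genome",
     "ftp_path": "ftp://example.invalid/one"},
    {"organism_name": "Species two", "infraspecific_name": "strain=B",
     "version_status": "latest", "assembly_level": "Scaffold",
     "ftp_path": "ftp://example.invalid/two"},
    {"organism_name": "Species three", "infraspecific_name": "",
     "version_status": "replaced", "assembly_level": "Complete Genome",
     "ftp_path": "ftp://example.invalid/three"},
]
records = latest_complete(toy_rows)
paths = ftp_paths_for(["Species one strain=A"], records)
```

Only the first toy row survives the filter, so the cart lookup returns a single placeholder path.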

mine_species()

Asks the user for a few letters and lists the records in the dataframe that match the input pattern. If the user wants one of the listed records, it is added to the cart.

BlastHitter

class blast_hitter.BlastHitter(query, subject)

This is a class for BLASTing two genomes against each other and for deducing the reciprocal best hits shared by the two blastp results files.

query_path

the relative path for the query genome file on the disk.

Type

str

query_name

the name of the query genome.

Type

str

subject_path

the relative path for the subject genome file on the disk.

Type

str

subject_name

the name of the subject genome.

Type

str

results_dir

the relative path of the results directory for saving output files.

Type

str

genomes_dir

the relative path of the genomes directory.

Type

str

first_blastp

the path of the first .blastp file (query vs. subject blast).

Type

str

second_blastp

the path of the second .blastp file (subject vs. query blast).

Type

str

rbh

the path of the reciprocal best hits file (.blastp tabulated format).

Type

str

static best_hits_from_blast(blastp_file)

Returns the best hit for every protein in the blast output file. The algorithm is simple: the first appearance of a protein is taken as its best hit, because BLAST lists the hits for each query in descending order of quality.

Parameters

blastp_file (str) – The blast output file with a tabulated format (outfmt = 6).

Returns

besthits_dict – A dictionary covering all query proteins of a given genome, with the query as key and the corresponding best hit as value.

Return type

dict
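A minimal sketch of the "first appearance wins" rule, assuming tab-separated lines with the query ID in the first column and the subject ID in the second. The function name best_hits_sketch and the toy lines are illustrative and not taken from the class.

```python
def best_hits_sketch(lines):
    """Keep the first subject seen for every query in BLAST tabular lines."""
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        query, subject = fields[0], fields[1]
        # setdefault keeps the first hit and ignores later ones for the same query
        best.setdefault(query, subject)
    return best


# Toy lines: P1 has two hits, only the first one should be kept.
toy_lines = [
    "P1\tQ7\t98.0\n",
    "P1\tQ9\t45.0\n",
    "P2\tQ3\t77.5\n",
]
hits = best_hits_sketch(toy_lines)
```

Because the function only needs an iterable of lines, an open file handle can be passed in the same way as the toy list.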

static bidir_best_hits(blastp1, blastp2, out)

This function determines the reciprocal best hits between two blast files. It calls the class’s best_hits_from_blast static method to determine the best hits for each blast file, then builds two sets of tuples: (query, best hit) for the first blast and (best hit, query) for the second, which inverts the second blast file. The intersection of the two sets gives the reciprocal best hits (RBH).

Parameters
  • blastp1 (str) – The first blast file path.

  • blastp2 (str) – The second blast file path.

  • out (str) – The reciprocal best hits output file path, tabulated (outfmt = 6).

Returns

Writes the RBH output file, based on the first blast file, after computing the reciprocal best hits.

Return type

None.
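The set intersection at the heart of this method can be illustrated as follows. This is a sketch that works on two best-hit dictionaries rather than on files; reciprocal_pairs and the toy accessions are hypothetical names chosen for the example.

```python
def reciprocal_pairs(best_forward, best_reverse):
    """Return the (query, hit) pairs that are best hits in both directions."""
    forward = set(best_forward.items())
    # Invert the second search so both sets hold (genome A protein, genome B protein)
    reverse = {(hit, query) for query, hit in best_reverse.items()}
    return forward & reverse


# A1 <-> B1 and A3 <-> B9 are reciprocal; A2 -> B2 is not, since B2 -> A7.
best_forward = {"A1": "B1", "A2": "B2", "A3": "B9"}
best_reverse = {"B1": "A1", "B2": "A7", "B9": "A3"}
rbh = reciprocal_pairs(best_forward, best_reverse)
```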

blast_them()

This method launches two BLASTs, one in each direction between our two genomes. It uses the class’s static methods defined earlier.

Returns

  • str – The first blast output file path.

  • str – The second blast output file path.

static evalue_dist(blastp)

Generates an evalue distribution plot from a given blastp file.

Parameters

blastp (str) – The blast file path to be analyzed.
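Before plotting, the e-values have to be extracted and binned. The sketch below shows one way to do that with the standard library only, assuming the default 12-column tabular layout in which the e-value is the 11th column. Binning by the integer part of log10 and counting exact zeros separately are choices made for this example; evalue_bins and tab_line are hypothetical helpers, and the real method may bin and plot differently.

```python
import math
from collections import Counter


def evalue_bins(lines):
    """Count e-values per power-of-ten bin; exact zeros go to the 'zero' bin."""
    bins = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular record
        evalue = float(fields[10])
        if evalue == 0.0:
            bins["zero"] += 1  # log10(0) is undefined
        else:
            bins[math.floor(math.log10(evalue))] += 1
    return bins


def tab_line(evalue):
    """Build a toy 12-column record; only the e-value column matters here."""
    cols = ["q", "s", "90.0", "100", "0", "0", "1", "100", "1", "100", evalue, "200"]
    return "\t".join(cols) + "\n"


toy = [tab_line("2e-50"), tab_line("3e-50"), tab_line("0.0"), tab_line("2e-5")]
counts = evalue_bins(toy)
```

The resulting counter can then be handed to any plotting library as bar heights.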

classmethod from_list(prots_list)

This class method instantiates BlastHitter objects from a list of proteomes/genomes.

Parameters

prots_list (list) – a list of genome/proteome file paths.

Returns

a list of BlastHitter objects.

Return type

list
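One plausible way to enumerate the pairs handed to the constructor is sketched below. It assumes one object per unordered pair, on the grounds that blast_them() already BLASTs in both directions; the actual factory may enumerate ordered pairs instead. proteome_pairs is a hypothetical helper, not part of the class.

```python
from itertools import combinations


def proteome_pairs(paths):
    """Return every unordered (query, subject) pair of proteome paths."""
    return list(combinations(paths, 2))


# Three placeholder paths give three pairs; five proteomes would give ten.
toy_paths = ["a.faa", "b.faa", "c.faa"]
pairs = proteome_pairs(toy_paths)
```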

getRbh()

The getter of the RBH file path for a certain blasthitter object.

Returns

RBH file path.

Return type

str

static parse_fasta(proteome)

This function parses a genome/proteome file.

Parameters

proteome (str) – the path of the genome/proteome file.

Returns

seqdic – dictionary with protein accessions as keys and the corresponding fasta sequences as values.

Return type

dict
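A minimal FASTA parser along these lines might look like the sketch below. It assumes the accession is the first whitespace-delimited token of each header line, which is how RefSeq protein headers are usually laid out; parse_fasta_sketch and the toy records are illustrative only.

```python
def parse_fasta_sketch(lines):
    """Map each accession to its sequence, joining multi-line sequences."""
    chunks = {}
    accession = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Assumption: the accession is the first token of the header
            accession = header[0] if header else ""
            chunks[accession] = []
        elif accession is not None:
            chunks[accession].append(line)
    return {acc: "".join(parts) for acc, parts in chunks.items()}


# Toy records, not real proteins.
fasta_text = """>WP_000001.1 hypothetical protein [Toy organism]
MKTLLA
GGS
>WP_000002.1 another protein [Toy organism]
MSS
"""
seqdic = parse_fasta_sketch(fasta_text.splitlines())
```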

rbh_them()

This method computes the reciprocal best hits between our two genomes after BLASTing them and creates the RBH file.

Returns

The RBH output file path.

Return type

str

static seqkit_stat(proteome)

This function prints important information about the proteome that will be analyzed. It parses the proteome first, then prints the number of sequences, the total length, and the minimum, average and maximum sequence lengths. It mimics the output of seqkit stats.

Parameters

proteome (str) – the path of the proteome file.

Returns

Prints some statistics from a chosen proteome file

Return type

None.
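The figures listed above can be computed from a parsed proteome in a few lines. The sketch below returns them as a dictionary instead of printing them; the key names follow the column names of seqkit stats, while length_stats and the toy sequences are assumptions made for this example.

```python
def length_stats(seqdic):
    """Summarise sequence lengths for an accession -> sequence dictionary."""
    lengths = [len(seq) for seq in seqdic.values()]
    if not lengths:
        return {"num_seqs": 0, "sum_len": 0, "min_len": 0, "avg_len": 0.0, "max_len": 0}
    return {
        "num_seqs": len(lengths),
        "sum_len": sum(lengths),
        "min_len": min(lengths),
        "avg_len": sum(lengths) / len(lengths),
        "max_len": max(lengths),
    }


# Toy proteome with sequence lengths 3, 5 and 1.
toy_proteome = {"p1": "MKT", "p2": "MKTLL", "p3": "M"}
stats = length_stats(toy_proteome)
```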

static universal_blast(query, subject, out, outfmt=6, typ='p')

Returns the blast command to be executed in the terminal.

Parameters
  • query (str) – query proteome file path.

  • subject (str) – subject proteome file path.

  • out (str) – blast output file path.

  • outfmt (int, optional) – blast results format. The default is 6 (tabular without headers).

  • typ (str, optional) – type of blast to run. The default is 'p' (blastp).

Returns

The blast command to be executed.

Return type

str
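As an illustration of what such a command string could look like, the sketch below assembles one from the documented parameters. The options -query, -subject, -out and -outfmt are standard BLAST+ options, but whether the project uses -subject rather than a prebuilt database, or adds further options, is not stated here, so treat the exact command as an assumption. blast_command_sketch is a hypothetical name.

```python
import shlex


def blast_command_sketch(query, subject, out, outfmt=6, typ="p"):
    """Build a BLAST+ command line as a single shell-safe string."""
    program = "blast" + typ  # 'p' -> blastp, 'n' -> blastn
    args = [
        program,
        "-query", query,
        "-subject", subject,
        "-out", out,
        "-outfmt", str(outfmt),
    ]
    # shlex.quote protects paths that contain spaces or shell metacharacters
    return " ".join(shlex.quote(arg) for arg in args)


cmd = blast_command_sketch("a.faa", "b.faa", "a_vs_b.blastp")
```

Quoting each argument matters here because proteome file names in this project contain characters such as "=" and may contain spaces.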

Clusterizer

class clusterizer.Clusterizer(blasthitters, proteomes)

This is a class for clustering groups of orthologous proteins (OG), extracting their sequences from the corresponding proteomes, aligning each cluster individually and then constructing the super-alignment.

rbh_files

A list of reciprocal best hits file paths, one generated by each BlastHitter object.

Type

list

proteomes

A list of all proteome paths present in a given analysis.

Type

list

working_cluster

A collection of clusters (protein accession numbers) found among the given RBH files, filtered to keep at most one protein per species inside a cluster (no paralogues). Dictionary keys are cluster IDs, which are auto-incremented int values.

Type

dict

corr_species_cluster

The species cluster corresponding to the working cluster dictionary. It has the same data structure as the working cluster, except that protein accession numbers are replaced with the name of the species of origin.

Type

dict

super_alignement

the file path of the computed super-alignment in aligned fasta format (.afa).

Type

str

static all_pairs_rbh(files)

This function takes multiple RBH files and returns one long list of all RBH couples present in those files. It calls the class’s pair_rbh static method.

Parameters

files (list) – a list of all RBH file paths to be analyzed.

Returns

total – a list of tuples of all RBH couples in all specified files.

Return type

list

static cluster_from_proteome(cluster, proteomes, out)

Extracts a fasta file for a given cluster. It first parses all proteome files to generate one big dictionary, then filters it to keep only the headers and sequences that correspond to accession numbers found in the cluster.

Parameters
  • cluster (tuple) – A tuple of protein accession numbers found in a cluster.

  • proteomes (list) – A list of all proteome file paths.

  • out (str) – The output file path.

cluster_them()

This method runs the whole clustering workflow. First, it builds clusters from the RBH files and writes them out to a text file. Second, it generates the species cluster. Third, it filters the dictionaries to only one protein per organism inside a cluster. Lastly, it writes the filtered clusters to a file and sets the two dictionaries as class properties so they can be used elsewhere.

static clustering(rbh_files)

The main clustering algorithm, which groups reciprocal best hits across multiple RBH blast files. It uses the class’s all_pairs_rbh static method as a starting point. After getting the list of all RBH couples in all RBH files (a list of tuples), it traverses the list and, for each couple, adds both members to the first existing cluster (a set inside a list) that already contains one of them. If the loop over the clusters exits without a break, it creates a new cluster from the two RBHs. Time complexity: O(n²).

Parameters

rbh_files (list) – A list of RBH blast file paths.

Returns

A dictionary of all clusters present among all RBH files, keys are auto incremented integers which correspond to cluster IDs, values are NCBI accession numbers of the proteins present in a given cluster.

Return type

dict
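The loop described above maps naturally onto Python’s for/else construct. The following is a sketch under assumptions: cluster_pairs is a hypothetical name, cluster IDs are numbered from 0, and the toy accessions are made up. Like the description, it makes a single pass and does not merge two existing clusters that a later pair happens to link.

```python
def cluster_pairs(pairs):
    """Group RBH couples into clusters with a single greedy pass."""
    clusters = []  # list of sets of accessions
    for first, second in pairs:
        for cluster in clusters:
            if first in cluster or second in cluster:
                cluster.update((first, second))
                break
        else:
            # No existing cluster contains either protein: start a new one
            clusters.append({first, second})
    # Assumption: cluster IDs are auto-incremented integers starting at 0
    return dict(enumerate(clusters))


toy_pairs = [("A1", "B1"), ("B1", "C1"), ("A2", "B2"), ("C1", "D1")]
clusters = cluster_pairs(toy_pairs)
```

Each pair is compared against every existing cluster, which is where the quadratic worst case comes from.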

static clusters_to_txt(cluster_dict, out)

Writes a cluster dictionary to a text file, each line corresponds to a cluster of orthologue proteins.

Parameters
  • cluster_dict (dict) – A dictionary of cluster ids and cluster accessions.

  • out (str) – The output text file path.

Returns

Return type

None.

draw_tree(iters=10)

This method generates an interactive phylogenetic tree from a super-alignment file. It computes a Newick tree with RAxML and then visualizes it using the ETE Toolkit.

Parameters

iters (int, optional) – Number of repetitions for bootstrap. The default is 10.

static max_one_species_per_cluster(cluster_species, cluster_dict)

Filters cluster dictionaries to only one protein per species. It scans the cluster species dictionary and returns the subset in which every species appears only once per cluster, then filters the cluster accession dictionary with the IDs of the filtered one.

Parameters
  • cluster_species (dict) – The cluster dictionary of species names.

  • cluster_dict (dict) – The cluster dictionary (with accessions).

Returns

  • dict – The filtered cluster accession dictionary.

  • filtered (dict) – the filtered cluster species dictionary.
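The filter can be sketched as two dictionary comprehensions. This example assumes each value of the species dictionary is a list of species names, one per protein in the cluster; one_protein_per_species and the toy data are hypothetical.

```python
def one_protein_per_species(cluster_species, cluster_dict):
    """Keep only the clusters in which no species appears more than once."""
    filtered = {
        cid: species
        for cid, species in cluster_species.items()
        if len(species) == len(set(species))  # no duplicated species name
    }
    # Reuse the surviving cluster IDs to filter the accession dictionary
    kept = {cid: cluster_dict[cid] for cid in filtered}
    return kept, filtered


# Cluster 1 holds two proteins from sp1 (a likely paralogue) and is dropped.
toy_species = {0: ["sp1", "sp2", "sp3"], 1: ["sp1", "sp1", "sp2"]}
toy_accessions = {0: ["A1", "B1", "C1"], 1: ["A2", "A3", "B2"]}
kept, filtered = one_protein_per_species(toy_species, toy_accessions)
```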

static muscle(cluster_dict, proteomes)

Creates fasta files for each cluster, and then aligns each of them using MUSCLE.

Parameters
  • cluster_dict (dict) – The cluster dictionary (with accessions).

  • proteomes (list) – The list of all proteome file paths.

Returns

afasta_files – A list of all the generated aligned fasta file paths (.afa).

Return type

list

one_align_to_rule_them_all()

Generates the super-alignment file for the analyzed species of interest. It first generates MSAs for all clusters using MUSCLE, then concatenates them. All results are saved to their respective directories. It also sets the super-alignment file path as a class property.

static pair_rbh(file)

This function parses the reciprocal best hits file and returns a list of all RBH couples. It extracts the first and the second columns.

Parameters

file (str) – The RBH file path.

Returns

a list of tuples of all RBH couples.

Return type

list

static species_cluster(cluster_dict, proteomes)

This function generates a species cluster from an accession cluster. Keys are cluster IDs (auto-incremented int values) and values are clusters of species names, one for every protein in the cluster. First, it parses all proteome files and every header in each proteome. Afterwards, it scans the already generated cluster dictionary to build a replica of it in which each accession number is swapped for the species of origin.

Parameters
  • cluster_dict (dict) – The cluster dictionary (with accessions).

  • proteomes (list) – A list of proteome file paths.

Returns

The corresponding species dictionary to a given cluster accession dictionary.

Return type

dict

static super_alignement(cluster_dict, cluster_species, maligns, out)

This function concatenates all the multiple alignment files into one super-alignment fasta file. Each header is the name of a studied species and the sequence is the concatenation of all the clusters. It first creates a seed dictionary from the first cluster, with species names as keys and the aligned sequences as values. Then, for each cluster, it appends the aligned sequence to the species entry if the cluster has a protein from that species; otherwise it fills the entry with gaps of the same length as the multiple alignment of that cluster.

Parameters
  • cluster_species (dict) – The cluster dictionary of species names.

  • cluster_dict (dict) – The dictionary of cluster accession numbers.

  • maligns (list) – A list of all multiple alignment file paths.

  • out (str) – The super-alignment output file path.
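The concatenation with gap filling can be illustrated on in-memory alignments. In this sketch each cluster alignment is a dictionary mapping species name to aligned sequence; reading the .afa files and writing the output file are left out. concatenate_alignments and the toy alignments are assumptions made for the example.

```python
def concatenate_alignments(alignments, species_names):
    """Concatenate per-cluster alignments, padding absent species with gaps."""
    parts = {name: [] for name in species_names}
    for alignment in alignments:
        if not alignment:
            continue  # nothing to append for an empty cluster
        # Every sequence in one alignment has the same width
        width = len(next(iter(alignment.values())))
        for name in species_names:
            parts[name].append(alignment.get(name, "-" * width))
    return {name: "".join(chunks) for name, chunks in parts.items()}


# sp3 is absent from the first cluster and sp2 from the second.
toy_alignments = [
    {"sp1": "MK-T", "sp2": "MKLT"},
    {"sp1": "AA", "sp3": "AG"},
]
superaln = concatenate_alignments(toy_alignments, ["sp1", "sp2", "sp3"])
```

Padding with gaps keeps every concatenated sequence the same length, which tree-building tools require of an alignment.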

static tree_generator(super_alignement, bootstrap)

Generates a Newick tree from a given super-alignment. It uses RAxML’s maximum likelihood algorithm.

Parameters
  • super_alignement (str) – The super-alignment file path.

  • bootstrap (int) – Number of repetitions for bootstrap.

Returns

tree_name – The file path of the Newick tree to be visualized.

Return type

str
