Welcome to dnazip’s documentation!

Installation

  • You can install the package either from pip or from the source code hosted on github.

With pip

pip install dnazip-bioinfo

From source

git clone https://github.com/dabane-ghassan/dnazip.git
cd dnazip
sudo python3 setup.py install

Getting started

GUI

  • After installing the package from source or using pip, the interface can be launched simply from the command line:

    dnazip
    
  • If problems occur with the installation, an interface instance can be imported and launched:

    from dnazip.interface import Interface
    gui = Interface()
    gui.main()
    

Using the library

Generating a random DNA sequence

  • A random DNA sequence with an alphabet of ATCGN can be generated for any given length specified in the parameter.

    from dnazip.sequence import Sequence
    randseq = Sequence.generate(length=5000)
    Sequence('/path/to/new/seq').write(randseq)
    

Encoding a DNA sequence with Burros-Wheeler + Huffman Coding

  • To encode a DNA sequence using BWT and Huffman coding, you can use a FullEncoder object that will save two files to the same directory as the sequence, the Burros-Wheeler transform and the UTF-8 zipped format of the sequence:

    from dnazip.encoder import FullEncoder
    encode = FullEncoder('/path/to/seq')
    encode.full_zip()
    
  • The attributes of the object can be accessed to see intermediary results:

    encode.bw_encoder.rotations # a matrix of string rotations from a sequence
    encode.bw_encoder.bwm # The Burros-Wheeler Matrix
    encode.bw_encoder.bwt # The Burros-Wheeler Transform
    
    encode.huff_encoder.header # The header of the zip file that contains Huffman codes for each character as well as the sequence binary padding
    encode.huff_encoder.binary # The binary sequence of the BW transform
    encode.huff_encoder.unicode # 8-bits encoded binary sequence
    
  • A random sequence of size 1kB was compressed efficiently to 549 bytes.

Decoding a DNA sequence with Huffman decoding + Reversing Burros-Wheeler transform

  • To decode a zipped DNA sequence using Huffman decoding and the inverse of BWT, you can use a FullDecoder object that will work in the same manner as the FullEncoder object:

    from dnazip.decoder import FullDecoder
    decode = FullDecoder('path/to/seq')
    decode.full_unzip()
    
  • The attributes can also be accessed to see intermediary results:

    decode.huff_decoder.header # The header of the zip file that contains Huffman codes for each character as well as the sequence binary padding that where saved when the Huffman tree was created
    decode.huff_decoder.unicode # 8-bits encoded sequence
    decode.huff_decoder.binary # The binary sequence
    
    decode.bw_decoder.bwm # The Burros-Wheeler reconstructed matrix
    decode.bw_decoder.original # The original sequence
    

Building the Burros-Wheeler transform using the advanced algorithm

  • The BWT can be constructed using a Suffix Array (SA) based algorithm that has a better time and space complexities:

    from dnazip.sequence import Sequence
    from dnazip.burros_wheeler import BurrosWheeler
    
    seq = Sequence('/path/to/seq').read()
    BurrosWheeler.bwt_advanced(seq)
    

Documentation

Graphical User Interface

View architecture of the main application, i.e; a GUI.

class interface.Interface[source]

View class of the application using a Tkinter interface.

BW_output(controller: dnazip.encoder.BWEncoder) → Iterator[str][source]

This method is used to collect all output for the BW encoding.

Parameters

controller (BWEncoder) – The given controller.

Yields

Iterator[str] – The content to be shown in the interface.

DeBW_output(controller: dnazip.decoder.BWDecoder) → Iterator[str][source]

This method is used to collect all output for the BW decoding.

Parameters

controller (BWDecoder) – The given controller.

Yields

Iterator[str] – The content to be shown in the interface.

Huff_output(controller: dnazip.encoder.HuffEncoder) → Iterator[str][source]

This method is used to collect all output for Huffman encoding.

Parameters

controller (HuffEncoder) – The given controller.

Yields

Iterator[str] – The content to be shown in the interface.

about() → None[source]

This method is used to generate the about menu page information.

bwt_window() → None[source]

This method creates a Tkinter Toplevel window for the step-by-step BWT protocol, The output file of the protocol will be shown at the beginning.

create_buttons() → None[source]

This method creates buttons for the interface.

create_main() → None[source]

This method is used to create the main menu interface of the GUI.

create_menu() → None[source]

Creates the menu for the interface.

deHuff_output(controller: dnazip.decoder.HuffDecoder) → Iterator[str][source]

This method is used to collect all output for the Huff decoding.

Parameters

controller (HuffDecoder) – The given controller.

Yields

Iterator[str] – The content to be shown in the interface.

debwt_window() → None[source]

This method creates a Tkinter Toplevel window for the step-by-step inverse BWT protocol, The output file of the protocol will be shown at the beginning.

final_btn(controller: Iterator[str]) → str[source]

This method is used to create the output of a universal final step button for all protocols of the program, it will be passed to a given label that changes its content.

Parameters

controller (Iterator[str]) – The controller of a given protocol that contains all steps to show.

Returns

The final step to be shown.

Return type

str

fullunzip_output(controller: dnazip.decoder.FullDecoder) → Iterator[str][source]

This method is used to collect all output for a full decompression protocol.

Parameters

controller (FullDecoder) – The given controller.

Yields

Iterator[str] – The content to be shown in the interface.

fullunzip_window() → None[source]

This method creates a Tkinter Toplevel window for the step-by-step Full unzipping protocol (HUffman decoding + inverse BWT), The output file of the protocol will be shown at the beginning.

fullzip_output(controller: dnazip.encoder.FullEncoder) → Iterator[str][source]

This method is used to collect all output for a full protocol encoding (BWT + Huffman compression).

Parameters

controller (FullEncoder) – The given controller.

Yields

Iterator[str] – The content to be shown in the interface.

fullzip_window() → None[source]

This method creates a Tkinter Toplevel window for the step-by-step Fullzip protocol (BWT+Huffman compression), The output file of the protocol will be shown at the beginning.

generate_random() → None[source]

This method is used to generate random sequences of DNA of length 50 and show them inside the random text box tkinter entry.

huffcode_window() → None[source]

This method creates a Tkinter Toplevel window for the step-by-step Huffman coding protocol, The output file of the protocol will be shown at the beginning.

huffdecode_window() → None[source]

This method creates a Tkinter Toplevel window for the step-by-step Huffman decoding, The output file of the protocol will be shown at the beginning.

main() → None[source]

Launches the mainloop of the interface.

next_btn(controller: Iterator[str]) → str[source]

This method is used to create the output of a universal next button for all protocols of the program, it will be passed to a given label that changes its content.

Parameters

controller (Iterator[str]) – The controller of a given protocol that contains all steps to show.

Returns

Every step to be shown, one at a time.

Return type

str

no_file_error() → None[source]

Shows an error when no file is selected.

open_file() → None[source]

This method is used to create a file chooser for the interface.

program_output(window: tkinter.Tk, path: str) → None[source]

Shows the output of a given protocol.

Parameters
  • window (Tk) – The given window.

  • path (str) – The output file path to be shown in the messagebox.

quit() → None[source]

Quits the current interface.

save_random() → None[source]

This method is used to save a randomly generated DNA sequence to a file.

step_by_step(window: tkinter.Tk, protocol: Iterator[str], names: Generator) → None[source]

This method creates a universal step by step advancing interface for a given chosen protocole.

Parameters
  • window (Tk) – The given window.

  • protocol (Iterator[str]) – The output protocol to be displayed.

  • names (Generator) – The names of the steps to be displayed.

update_text(widget: tkinter.Text, new_text: str) → None[source]

This method updates the Text tkinter widget with every step in every protocol. it deletes its contents and then replaces it with new_text parameter.

Parameters
  • widget (Text) – Tkinter’s Text widget.

  • new_text (str) – The new text to be shown in the widget.

Returns

Replaces text in a given Text widget.

Return type

None

Encoder Classes

Encoder classes, Controller architecture in a MVC layout.

class encoder.BWEncoder(path: str)[source]

An encoder class for Burros-Wheeler transform, it is used as a controller in a MVC architecture.

path

The path of the file to be transformed with Burros-Wheeler.

Type

str

seq

The sequence that was extracted from the file; a Sequence object.

Type

Sequence

bwt_output

The output file path for BWT.

Type

str

rotations

A matrix of rotations from the original sequence.

Type

List[str]

bwm

The Burros-Wheeler Matrix.

Type

List[str]

bwt

The Burros-Wheeler transform of the sequence.

Type

str

encode() → None[source]

The main encoding method of the controller.

Returns

Fills all the properties of an object and writes out the transformed sequence to a file.

Return type

None

class encoder.FullEncoder(path: str)[source]

An encoder class for both the Burros-Wheeler transform and Huffman compression, controller architecture.

path

The path of the file to be compressed with BWT + Huffman compression.

Type

str

bw_encoder

A BWEncoder object to do the Burros-Wheeler transform on a sequence.

Type

BWEncoder

huff_encoder

A HuffEncoder object to do the Huffman compression on the BW transform.

Type

HuffEncoder

full_zip() → None[source]

The main encoding method of the controller, it first encodes the sequence with BWT, then passes the BWT to Huffman compression.

Returns

Fills all the properties of an object and writes out the compressed sequence to a file.

Return type

None

class encoder.HuffEncoder(path: str)[source]

An encoder class for Huffman compression, it is used as a controller in a MVC architecture.

path

The path of the file to be compressed with Huffman compression.

Type

str

seq

The sequence that was extracted from the file; a Sequence object.

Type

Sequence

huff_output

The output file path for Huffman compression.

Type

str

binary

The binary sequence that was translated from the original sequence using Huffman tree and codes.

Type

str

header

The header of the compressed file; contains Huffman codes and paths as well as padding that were generated when compressing the sequence.

Type

str

unicode

The compressed format of the sequence.

Type

str

compressed

The compressed sequence to be written to a file.

Type

str

encode() → None[source]

The main encoding method of the controller.

Returns

Fills all the properties of an object and writes out the compressed sequence to a file.

Return type

None

Decoder Classes

Decoder classes, Controller architecture in a MVC layout.

class decoder.BWDecoder(path: str)[source]

A decoder class for inversing the Burros-Wheeler transform, it is used as a controller in a MVC architecture.

path

The path of the file to be transformed with inverse of Burros-Wheeler.

Type

str

seq

The sequence that was extracted from the file; a Sequence object.

Type

Sequence

debwt_output

The output file path for the inverse of BWT.

Type

str

bwm

The reconstructed Burros-Wheeler Matrix.

Type

List[str]

original

The original sequence from the inverse BWT.

Type

str

decode() → None[source]

The main decoding method of the controller.

Returns

Fills all the properties of an object and writes out the original sequence to a file.

Return type

None

class decoder.FullDecoder(path: str)[source]

A decoder class for both Huffman decompression and the inverse of the Burros-Wheeler transform, controller architecture.

path

The path of the file to be compressed with BWT + Huffman compression.

Type

str

huff_decoder

A HuffDecoder object to do the Huffman decompression on a compressed sequence.

Type

HuffDecoder

bw_decoder

A BWDecoder object to do the inverse of Burros-Wheeler transform on the output of Huffman decompression.

Type

BWDecoder

full_unzip() → None[source]

The main decoding method of the controller, it first decodes the sequence with Huffman decompression, then passes the uncompressed sequence to do the Inverse of Burros-Wheeler transform and obtain the original sequence.

Returns

Fills all the properties of an object and writes out the original sequence to a file.

Return type

None

class decoder.HuffDecoder(path: str)[source]

A decoder class for Huffman decompression, it is used as a controller in a MVC architecture.

path

The path of the file to be decompressed with Huffman decompression.

Type

str

seq

The sequence that was extracted from the file; a Sequence object.

Type

Sequence

dehuffman_output

The output file path for Huffman decompression.

Type

str

binary

The binary sequence that was translated from unicode using Huffman codes.

Type

str

header

The header of the compressed file; contains Huffman codes and paths as well as padding that were generated when compressing the sequence.

Type

str

unicode

The compressed format of the sequence.

Type

str

decompressed

The decompressed sequence; normally the Burros-Wheeler transform of an original sequence.

Type

str

decode() → None[source]

The main decoding method of the controller.

Returns

Fills all the properties of an object and writes out the decompressed sequence to a file.

Return type

None

Sequence

A class to read and write DNA sequences (or a list of characters) to a file. In a MVC structure, this represents the Model of the data.

class sequence.Sequence(path: str)[source]

A class to represent a sequence, sequences can be read from files or writtes to files.

path

The path of the file to be read.

Type

str

static generate(length: int) → str[source]

This method is used to generate a random DNA sequence using the random python library.

Returns

The random DNA sequence.

Return type

str

read() → str[source]

This method is used to read the contents of the file into a python string.

Returns

The contents of the file, i.e. the sequence.

Return type

str

read_bytes() → str[source]

This method is used to read a file that has been written in bytes, it will be useful when reading files that has been compressed with Huffman coding.

Returns

The contents of the file, i.e. the sequence.

Return type

str

write(content: str) → None[source]

This method is used to write out to a file.

Parameters

content (str) – The contents of the file.

Returns

Writes out to a new file.

Return type

None

write_bytes(content: str) → None[source]

This method is used to write out to a new file into a bytes format, UTF-8 encoding, useful when we want to write out the contents of the Huffman compression to a file.

Parameters

content (str) – The contents of the file.

Returns

Writes out to a new file in a bytes format.

Return type

None

Burrows-Wheeler Algorithm

Burrows-Wheeler Algorithm Class, contains naive and advanced construction methods, static methods were written as explicit as possible and were factorized to facilitate algorithmic readability. The yielding in some functions doesn’t respect Space complexity, but it was done in this manner to facilitate the data flow into the View class (the GUI).

class burros_wheeler.BurrosWheeler[source]

A class to represent the Burrows-Wheeler algorithm, all methods are static for ease of reading and outside usability.

static bwt_advanced(sequence: str) → str[source]

Generates a Burrows-Wheeler Transfrom from a suffix array, advanced construction of BWT. Better algorithmic complexity.

Parameters

sequence (str) – The sequence to ve transformed.

Returns

The Burrows-Wheeler Transform.

Return type

str

static construct_bwm(rotations: List[str]) → List[str][source]

This method constructs the Burrows-Wheeler Matrix from a list of string rotations.

Parameters

rotations (List[str]) – A list of strings, i.e; a BW matrix like object.

Returns

A list of strings or a Burrows-Wheeler Matrix.

Return type

List[str]

static decode_bwt(matrix: List[str]) → str[source]

This method returns the original sequence from a given Burrows-Wheeler Matrix, the original sequence is the line that ends with the character ‘$’.

Parameters

matrix (List[str]) – A Burrows-Wheeler Matrix.

Returns

The original sequence.

Return type

str

static encode_bwt(matrix: List[str]) → str[source]

Returns the Burrows-Wheeler Transform from a given Burros-Wheeler Matrix. the Burros-Wheeler Transform corresponds to the last column of the matrix.

Parameters

matrix (List[str]) – A Burrows-Wheeler Matrix.

Returns

The Burrows-Wheeler Transform.

Return type

str

static pprint(mat: List[str]) → None[source]

Pretty print, this method prints a burrows wheeler matrix beautifully (without lists and strings).

Parameters

mat (List[str]) – The Burros-Wheeler matrix, i.e; a list of strings.

Returns

Prints the matrix.

Return type

None

static reconstruct_bwm(bwt: str) → List[str][source]

This method reconstructs the Burrows-Wheeler Matrix given the corresponding Burros-Wheeler Transform. The naive algorithm for constructing the matrix given the transform is going to iteratively add the transform as a left column, then sorts lexicographically the columns.

Parameters

bwt (str) – The Burrows-Wheeler Transform.

Returns

A Burrows-Wheeler Matrix.

Return type

List[str]

static string_rotations(seq: str) → List[str][source]

Returns all string rotations of a sequence.

Parameters

seq (str) – he sequence to be rotated.

Returns

Returns a list of strings, i.e; a BW matrix like object.

Return type

List[str]

static suffix_array(sequence: str) → List[Tuple[str, int]][source]

Builds a suffix-array from a given sequence of characters. - Complexity of the algorithm O(n^2log(n)) - Sorting is O(nlogn) and Finding is O(n)

Parameters

sequence (str) – The given sequence of characters.

Returns

The suffix array of the sequence; a list of tuples.

Return type

List[Tuple[str, int]]

Huffman Coding Algorithm

Huffman coding algorithm classes, contains node and tree separate classes.

class huffman.HuffmanNode(char: str, freq: int, left: object = None, right: object = None)[source]

A class to represent heap nodes of a huffman coding tree.

char

The character to be saved as a node value.

Type

str

freq

The frequency of the char.

Type

int

right_child

The right child of the heap node.

Type

HuffmanNode

left_child

The left child of the heap node.

Type

HuffmanNode

dir

The path of the current node in the Huffman tree, usually a string of 0s and 1s.

Type

str

property left_child

The getter of the left child.

Returns

Return type

object

property right_child

The getter of the right child.

Returns

Return type

object

class huffman.HuffmanTree(sequence: str)[source]

A class to represent a huffman tree.

sequence

The sequence to be encoded, i.e; the Burros-Wheeler transform of the sequence.

Type

str

frequency

A dictionary to store frequency values for each character of the sequence, the character as a key and its frequency as a value.

Type

Dict[str, int]

root

The root node of the Huffman tree.

Type

HuffmanNode

codes

The Huffman codes for a given sequence, The character as a key and its tree path as a value(a string).

Type

Dict[str, str]

pad

The padding of the sequence, i.e; the number of zeroes that was added to the end of sequence until it was divisible by 8 (coding in 8 bits).

Type

int

static binstr_to_seq(bin_str: str, codes: Dict[str, str]) → str[source]

Transforms a binary string to a sequence given a codes dictionary of paths.

Parameters
  • bin_str (str) – The binary string to be transformed.

  • codes (Dict[str, str]) – A dictionary of characters alongside their paths.

Returns

The original sequence.

Return type

str

static binstr_to_unicode(bin_str: str) → str[source]

This method codes a binary sequence in 8-bits in UTF-8.

Parameters

bin_str (str) – The binary sequence to be coded.

Returns

The unicode string that corrsponds to the binary sequence.

Return type

str

codes_to_header() → str[source]

This method transforms the codes of a given Huffman tree object into a header to be saved in the compressed file (useful for when decompressing later).

Returns

The header of the compressed file.

Return type

str

create_tree()huffman.HuffmanNode[source]

The main algorithm for creating The Huffman tree.

Returns

The root node of the tree.

Return type

HuffmanNode

static freq_dict(sequence: str) → Dict[str, int][source]

This method returns the frequence dictionary of a given sequence.

Parameters

sequence (str) – The sequence of interest.

Returns

A dictionary of character frequencies.

Return type

Dict[str, int]

get_codings(node: huffman.HuffmanNode, val: str = '') → None[source]

A depth-first search recursive algorithm to find the paths of each character in the tree.

Parameters
  • node (HuffmanNode) – The starting node, the root of the tree.

  • val (str, optional) – The path of the current recursion in a tree. The default is ‘’.

Returns

Fills the codes dictionary property of the Huffman Tree.

Return type

None

static header_to_codes(header: str) → Dict[str, str][source]

This method takes the header of a compressed file and transforms it into a codes dictionary that’s useful when decompressing a sequence.

Parameters

header (str) – The header of the compressed file.

Returns

A codes dictionary which maps from characters to thier given paths.

Return type

Dict[str, str]

static remove_padding(bin_str: str, pad: int) → str[source]

This function removes the padding from a given binary sequence, the padding is normally a variable number of zeroes added when coding in 8-bits.

Parameters
  • bin_str (str) – The binary string to be stripped from a sequence of zeroes.

  • pad (int) – The padding, the number of zeroes at the end of the binary sequence.

Returns

The no-padded binary string.

Return type

str

seq_to_binstr() → str[source]

This method transforms the current sequence of the Huffman tree to a binary sequence using the codes (paths of every character), it also adds a padding to the end of the binary sequence (a variable number of zeroes, when sequence is divisible by 8 then add 8 zeroes by default). The binary sequence will be coded in 8-bits later. It saves the padding in the codes property.

Returns

The binary sequence.

Return type

str

static unicode_to_binstr(unicode: str) → str[source]

Transforms a unicode sequence to a binary sequence.

Parameters

unicode (str) – The unicode sequence to be transformed.

Returns

The corrsponding binary string.

Return type

str

Indices and tables