Welcome to dnazip’s documentation!¶
Installation¶
You can install the package either from pip or from the source code hosted on github.
With pip¶
pip install dnazip-bioinfo
From source¶
git clone https://github.com/dabane-ghassan/dnazip.git
cd dnazip
sudo python3 setup.py install
Getting started¶
GUI¶
After installing the package from source or using pip, the interface can be launched simply from the command line:
dnazip
If problems occur with the installation, an interface instance can be imported and launched:
from dnazip.interface import Interface gui = Interface() gui.main()
Using the library¶
Generating a random DNA sequence¶
A random DNA sequence with an alphabet of ATCGN can be generated for any given length specified in the parameter.
from dnazip.sequence import Sequence randseq = Sequence.generate(length=5000) Sequence('/path/to/new/seq').write(randseq)
Encoding a DNA sequence with Burros-Wheeler + Huffman Coding¶
To encode a DNA sequence using BWT and Huffman coding, you can use a FullEncoder object that will save two files to the same directory as the sequence, the Burros-Wheeler transform and the UTF-8 zipped format of the sequence:
from dnazip.encoder import FullEncoder encode = FullEncoder('/path/to/seq') encode.full_zip()
The attributes of the object can be accessed to see intermediary results:
encode.bw_encoder.rotations # a matrix of string rotations from a sequence encode.bw_encoder.bwm # The Burros-Wheeler Matrix encode.bw_encoder.bwt # The Burros-Wheeler Transform encode.huff_encoder.header # The header of the zip file that contains Huffman codes for each character as well as the sequence binary padding encode.huff_encoder.binary # The binary sequence of the BW transform encode.huff_encoder.unicode # 8-bits encoded binary sequence
A random sequence of size 1kB was compressed efficiently to 549 bytes.
Decoding a DNA sequence with Huffman decoding + Reversing Burros-Wheeler transform¶
To decode a zipped DNA sequence using Huffman decoding and the inverse of BWT, you can use a FullDecoder object that will work in the same manner as the FullEncoder object:
from dnazip.decoder import FullDecoder decode = FullDecoder('path/to/seq') decode.full_unzip()
The attributes can also be accessed to see intermediary results:
decode.huff_decoder.header # The header of the zip file that contains Huffman codes for each character as well as the sequence binary padding that where saved when the Huffman tree was created decode.huff_decoder.unicode # 8-bits encoded sequence decode.huff_decoder.binary # The binary sequence decode.bw_decoder.bwm # The Burros-Wheeler reconstructed matrix decode.bw_decoder.original # The original sequence
Building the Burros-Wheeler transform using the advanced algorithm¶
The BWT can be constructed using a Suffix Array (SA) based algorithm that has a better time and space complexities:
from dnazip.sequence import Sequence from dnazip.burros_wheeler import BurrosWheeler seq = Sequence('/path/to/seq').read() BurrosWheeler.bwt_advanced(seq)
Documentation¶
Graphical User Interface¶
View architecture of the main application, i.e; a GUI.
-
class
interface.
Interface
[source]¶ View class of the application using a Tkinter interface.
-
BW_output
(controller: dnazip.encoder.BWEncoder) → Iterator[str][source]¶ This method is used to collect all output for the BW encoding.
- Parameters
controller (BWEncoder) – The given controller.
- Yields
Iterator[str] – The content to be shown in the interface.
-
DeBW_output
(controller: dnazip.decoder.BWDecoder) → Iterator[str][source]¶ This method is used to collect all output for the BW decoding.
- Parameters
controller (BWDecoder) – The given controller.
- Yields
Iterator[str] – The content to be shown in the interface.
-
Huff_output
(controller: dnazip.encoder.HuffEncoder) → Iterator[str][source]¶ This method is used to collect all output for Huffman encoding.
- Parameters
controller (HuffEncoder) – The given controller.
- Yields
Iterator[str] – The content to be shown in the interface.
-
bwt_window
() → None[source]¶ This method creates a Tkinter Toplevel window for the step-by-step BWT protocol, The output file of the protocol will be shown at the beginning.
This method creates buttons for the interface.
Creates the menu for the interface.
-
deHuff_output
(controller: dnazip.decoder.HuffDecoder) → Iterator[str][source]¶ This method is used to collect all output for the Huff decoding.
- Parameters
controller (HuffDecoder) – The given controller.
- Yields
Iterator[str] – The content to be shown in the interface.
-
debwt_window
() → None[source]¶ This method creates a Tkinter Toplevel window for the step-by-step inverse BWT protocol, The output file of the protocol will be shown at the beginning.
-
final_btn
(controller: Iterator[str]) → str[source]¶ This method is used to create the output of a universal final step button for all protocols of the program, it will be passed to a given label that changes its content.
- Parameters
controller (Iterator[str]) – The controller of a given protocol that contains all steps to show.
- Returns
The final step to be shown.
- Return type
str
-
fullunzip_output
(controller: dnazip.decoder.FullDecoder) → Iterator[str][source]¶ This method is used to collect all output for a full decompression protocol.
- Parameters
controller (FullDecoder) – The given controller.
- Yields
Iterator[str] – The content to be shown in the interface.
-
fullunzip_window
() → None[source]¶ This method creates a Tkinter Toplevel window for the step-by-step Full unzipping protocol (HUffman decoding + inverse BWT), The output file of the protocol will be shown at the beginning.
-
fullzip_output
(controller: dnazip.encoder.FullEncoder) → Iterator[str][source]¶ This method is used to collect all output for a full protocol encoding (BWT + Huffman compression).
- Parameters
controller (FullEncoder) – The given controller.
- Yields
Iterator[str] – The content to be shown in the interface.
-
fullzip_window
() → None[source]¶ This method creates a Tkinter Toplevel window for the step-by-step Fullzip protocol (BWT+Huffman compression), The output file of the protocol will be shown at the beginning.
-
generate_random
() → None[source]¶ This method is used to generate random sequences of DNA of length 50 and show them inside the random text box tkinter entry.
-
huffcode_window
() → None[source]¶ This method creates a Tkinter Toplevel window for the step-by-step Huffman coding protocol, The output file of the protocol will be shown at the beginning.
-
huffdecode_window
() → None[source]¶ This method creates a Tkinter Toplevel window for the step-by-step Huffman decoding, The output file of the protocol will be shown at the beginning.
-
next_btn
(controller: Iterator[str]) → str[source]¶ This method is used to create the output of a universal next button for all protocols of the program, it will be passed to a given label that changes its content.
- Parameters
controller (Iterator[str]) – The controller of a given protocol that contains all steps to show.
- Returns
Every step to be shown, one at a time.
- Return type
str
-
program_output
(window: tkinter.Tk, path: str) → None[source]¶ Shows the output of a given protocol.
- Parameters
window (Tk) – The given window.
path (str) – The output file path to be shown in the messagebox.
-
save_random
() → None[source]¶ This method is used to save a randomly generated DNA sequence to a file.
-
step_by_step
(window: tkinter.Tk, protocol: Iterator[str], names: Generator) → None[source]¶ This method creates a universal step by step advancing interface for a given chosen protocole.
- Parameters
window (Tk) – The given window.
protocol (Iterator[str]) – The output protocol to be displayed.
names (Generator) – The names of the steps to be displayed.
-
update_text
(widget: tkinter.Text, new_text: str) → None[source]¶ This method updates the Text tkinter widget with every step in every protocol. it deletes its contents and then replaces it with new_text parameter.
- Parameters
widget (Text) – Tkinter’s Text widget.
new_text (str) – The new text to be shown in the widget.
- Returns
Replaces text in a given Text widget.
- Return type
None
-
Encoder Classes¶
Encoder classes, Controller architecture in a MVC layout.
-
class
encoder.
BWEncoder
(path: str)[source]¶ An encoder class for Burros-Wheeler transform, it is used as a controller in a MVC architecture.
-
path
¶ The path of the file to be transformed with Burros-Wheeler.
- Type
str
-
bwt_output
¶ The output file path for BWT.
- Type
str
-
rotations
¶ A matrix of rotations from the original sequence.
- Type
List[str]
-
bwm
¶ The Burros-Wheeler Matrix.
- Type
List[str]
-
bwt
¶ The Burros-Wheeler transform of the sequence.
- Type
str
-
-
class
encoder.
FullEncoder
(path: str)[source]¶ An encoder class for both the Burros-Wheeler transform and Huffman compression, controller architecture.
-
path
¶ The path of the file to be compressed with BWT + Huffman compression.
- Type
str
-
huff_encoder
¶ A HuffEncoder object to do the Huffman compression on the BW transform.
- Type
-
-
class
encoder.
HuffEncoder
(path: str)[source]¶ An encoder class for Huffman compression, it is used as a controller in a MVC architecture.
-
path
¶ The path of the file to be compressed with Huffman compression.
- Type
str
-
huff_output
¶ The output file path for Huffman compression.
- Type
str
-
binary
¶ The binary sequence that was translated from the original sequence using Huffman tree and codes.
- Type
str
-
header
¶ The header of the compressed file; contains Huffman codes and paths as well as padding that were generated when compressing the sequence.
- Type
str
-
unicode
¶ The compressed format of the sequence.
- Type
str
-
compressed
¶ The compressed sequence to be written to a file.
- Type
str
-
Decoder Classes¶
Decoder classes, Controller architecture in a MVC layout.
-
class
decoder.
BWDecoder
(path: str)[source]¶ A decoder class for inversing the Burros-Wheeler transform, it is used as a controller in a MVC architecture.
-
path
¶ The path of the file to be transformed with inverse of Burros-Wheeler.
- Type
str
-
debwt_output
¶ The output file path for the inverse of BWT.
- Type
str
-
bwm
¶ The reconstructed Burros-Wheeler Matrix.
- Type
List[str]
-
original
¶ The original sequence from the inverse BWT.
- Type
str
-
-
class
decoder.
FullDecoder
(path: str)[source]¶ A decoder class for both Huffman decompression and the inverse of the Burros-Wheeler transform, controller architecture.
-
path
¶ The path of the file to be compressed with BWT + Huffman compression.
- Type
str
-
huff_decoder
¶ A HuffDecoder object to do the Huffman decompression on a compressed sequence.
- Type
-
bw_decoder
¶ A BWDecoder object to do the inverse of Burros-Wheeler transform on the output of Huffman decompression.
- Type
-
full_unzip
() → None[source]¶ The main decoding method of the controller, it first decodes the sequence with Huffman decompression, then passes the uncompressed sequence to do the Inverse of Burros-Wheeler transform and obtain the original sequence.
- Returns
Fills all the properties of an object and writes out the original sequence to a file.
- Return type
None
-
-
class
decoder.
HuffDecoder
(path: str)[source]¶ A decoder class for Huffman decompression, it is used as a controller in a MVC architecture.
-
path
¶ The path of the file to be decompressed with Huffman decompression.
- Type
str
-
dehuffman_output
¶ The output file path for Huffman decompression.
- Type
str
-
binary
¶ The binary sequence that was translated from unicode using Huffman codes.
- Type
str
-
header
¶ The header of the compressed file; contains Huffman codes and paths as well as padding that were generated when compressing the sequence.
- Type
str
-
unicode
¶ The compressed format of the sequence.
- Type
str
-
decompressed
¶ The decompressed sequence; normally the Burros-Wheeler transform of an original sequence.
- Type
str
-
Sequence¶
A class to read and write DNA sequences (or a list of characters) to a file. In a MVC structure, this represents the Model of the data.
-
class
sequence.
Sequence
(path: str)[source]¶ A class to represent a sequence, sequences can be read from files or writtes to files.
-
path
¶ The path of the file to be read.
- Type
str
-
static
generate
(length: int) → str[source]¶ This method is used to generate a random DNA sequence using the random python library.
- Returns
The random DNA sequence.
- Return type
str
-
read
() → str[source]¶ This method is used to read the contents of the file into a python string.
- Returns
The contents of the file, i.e. the sequence.
- Return type
str
-
read_bytes
() → str[source]¶ This method is used to read a file that has been written in bytes, it will be useful when reading files that has been compressed with Huffman coding.
- Returns
The contents of the file, i.e. the sequence.
- Return type
str
-
write
(content: str) → None[source]¶ This method is used to write out to a file.
- Parameters
content (str) – The contents of the file.
- Returns
Writes out to a new file.
- Return type
None
-
write_bytes
(content: str) → None[source]¶ This method is used to write out to a new file into a bytes format, UTF-8 encoding, useful when we want to write out the contents of the Huffman compression to a file.
- Parameters
content (str) – The contents of the file.
- Returns
Writes out to a new file in a bytes format.
- Return type
None
-
Burrows-Wheeler Algorithm¶
Burrows-Wheeler Algorithm Class, contains naive and advanced construction methods, static methods were written as explicit as possible and were factorized to facilitate algorithmic readability. The yielding in some functions doesn’t respect Space complexity, but it was done in this manner to facilitate the data flow into the View class (the GUI).
-
class
burros_wheeler.
BurrosWheeler
[source]¶ A class to represent the Burrows-Wheeler algorithm, all methods are static for ease of reading and outside usability.
-
static
bwt_advanced
(sequence: str) → str[source]¶ Generates a Burrows-Wheeler Transfrom from a suffix array, advanced construction of BWT. Better algorithmic complexity.
- Parameters
sequence (str) – The sequence to ve transformed.
- Returns
The Burrows-Wheeler Transform.
- Return type
str
-
static
construct_bwm
(rotations: List[str]) → List[str][source]¶ This method constructs the Burrows-Wheeler Matrix from a list of string rotations.
- Parameters
rotations (List[str]) – A list of strings, i.e; a BW matrix like object.
- Returns
A list of strings or a Burrows-Wheeler Matrix.
- Return type
List[str]
-
static
decode_bwt
(matrix: List[str]) → str[source]¶ This method returns the original sequence from a given Burrows-Wheeler Matrix, the original sequence is the line that ends with the character ‘$’.
- Parameters
matrix (List[str]) – A Burrows-Wheeler Matrix.
- Returns
The original sequence.
- Return type
str
-
static
encode_bwt
(matrix: List[str]) → str[source]¶ Returns the Burrows-Wheeler Transform from a given Burros-Wheeler Matrix. the Burros-Wheeler Transform corresponds to the last column of the matrix.
- Parameters
matrix (List[str]) – A Burrows-Wheeler Matrix.
- Returns
The Burrows-Wheeler Transform.
- Return type
str
-
static
pprint
(mat: List[str]) → None[source]¶ Pretty print, this method prints a burrows wheeler matrix beautifully (without lists and strings).
- Parameters
mat (List[str]) – The Burros-Wheeler matrix, i.e; a list of strings.
- Returns
Prints the matrix.
- Return type
None
-
static
reconstruct_bwm
(bwt: str) → List[str][source]¶ This method reconstructs the Burrows-Wheeler Matrix given the corresponding Burros-Wheeler Transform. The naive algorithm for constructing the matrix given the transform is going to iteratively add the transform as a left column, then sorts lexicographically the columns.
- Parameters
bwt (str) – The Burrows-Wheeler Transform.
- Returns
A Burrows-Wheeler Matrix.
- Return type
List[str]
-
static
string_rotations
(seq: str) → List[str][source]¶ Returns all string rotations of a sequence.
- Parameters
seq (str) – he sequence to be rotated.
- Returns
Returns a list of strings, i.e; a BW matrix like object.
- Return type
List[str]
-
static
suffix_array
(sequence: str) → List[Tuple[str, int]][source]¶ Builds a suffix-array from a given sequence of characters. - Complexity of the algorithm O(n^2log(n)) - Sorting is O(nlogn) and Finding is O(n)
- Parameters
sequence (str) – The given sequence of characters.
- Returns
The suffix array of the sequence; a list of tuples.
- Return type
List[Tuple[str, int]]
-
static
Huffman Coding Algorithm¶
Huffman coding algorithm classes, contains node and tree separate classes.
-
class
huffman.
HuffmanNode
(char: str, freq: int, left: object = None, right: object = None)[source]¶ A class to represent heap nodes of a huffman coding tree.
-
char
¶ The character to be saved as a node value.
- Type
str
-
freq
¶ The frequency of the char.
- Type
int
-
right_child
¶ The right child of the heap node.
- Type
-
left_child
¶ The left child of the heap node.
- Type
-
dir
¶ The path of the current node in the Huffman tree, usually a string of 0s and 1s.
- Type
str
-
property
left_child
¶ The getter of the left child.
- Returns
- Return type
object
-
property
right_child
¶ The getter of the right child.
- Returns
- Return type
object
-
-
class
huffman.
HuffmanTree
(sequence: str)[source]¶ A class to represent a huffman tree.
-
sequence
¶ The sequence to be encoded, i.e; the Burros-Wheeler transform of the sequence.
- Type
str
-
frequency
¶ A dictionary to store frequency values for each character of the sequence, the character as a key and its frequency as a value.
- Type
Dict[str, int]
-
root
¶ The root node of the Huffman tree.
- Type
-
codes
¶ The Huffman codes for a given sequence, The character as a key and its tree path as a value(a string).
- Type
Dict[str, str]
-
pad
¶ The padding of the sequence, i.e; the number of zeroes that was added to the end of sequence until it was divisible by 8 (coding in 8 bits).
- Type
int
-
static
binstr_to_seq
(bin_str: str, codes: Dict[str, str]) → str[source]¶ Transforms a binary string to a sequence given a codes dictionary of paths.
- Parameters
bin_str (str) – The binary string to be transformed.
codes (Dict[str, str]) – A dictionary of characters alongside their paths.
- Returns
The original sequence.
- Return type
str
-
static
binstr_to_unicode
(bin_str: str) → str[source]¶ This method codes a binary sequence in 8-bits in UTF-8.
- Parameters
bin_str (str) – The binary sequence to be coded.
- Returns
The unicode string that corrsponds to the binary sequence.
- Return type
str
-
codes_to_header
() → str[source]¶ This method transforms the codes of a given Huffman tree object into a header to be saved in the compressed file (useful for when decompressing later).
- Returns
The header of the compressed file.
- Return type
str
-
create_tree
() → huffman.HuffmanNode[source]¶ The main algorithm for creating The Huffman tree.
- Returns
The root node of the tree.
- Return type
-
static
freq_dict
(sequence: str) → Dict[str, int][source]¶ This method returns the frequence dictionary of a given sequence.
- Parameters
sequence (str) – The sequence of interest.
- Returns
A dictionary of character frequencies.
- Return type
Dict[str, int]
-
get_codings
(node: huffman.HuffmanNode, val: str = '') → None[source]¶ A depth-first search recursive algorithm to find the paths of each character in the tree.
- Parameters
node (HuffmanNode) – The starting node, the root of the tree.
val (str, optional) – The path of the current recursion in a tree. The default is ‘’.
- Returns
Fills the codes dictionary property of the Huffman Tree.
- Return type
None
-
static
header_to_codes
(header: str) → Dict[str, str][source]¶ This method takes the header of a compressed file and transforms it into a codes dictionary that’s useful when decompressing a sequence.
- Parameters
header (str) – The header of the compressed file.
- Returns
A codes dictionary which maps from characters to thier given paths.
- Return type
Dict[str, str]
-
static
remove_padding
(bin_str: str, pad: int) → str[source]¶ This function removes the padding from a given binary sequence, the padding is normally a variable number of zeroes added when coding in 8-bits.
- Parameters
bin_str (str) – The binary string to be stripped from a sequence of zeroes.
pad (int) – The padding, the number of zeroes at the end of the binary sequence.
- Returns
The no-padded binary string.
- Return type
str
-
seq_to_binstr
() → str[source]¶ This method transforms the current sequence of the Huffman tree to a binary sequence using the codes (paths of every character), it also adds a padding to the end of the binary sequence (a variable number of zeroes, when sequence is divisible by 8 then add 8 zeroes by default). The binary sequence will be coded in 8-bits later. It saves the padding in the codes property.
- Returns
The binary sequence.
- Return type
str
-