Linguagens e Ambientes de Programação (2014/2015) [Eng.Inf. - DI/FCT/UNL]

Enunciado do 2º Trabalho Prático (C)

Fernando Birra


Datas

  • 24/Abr (22:00) - Divulgação deste enunciado
  • 10/Mai (22:00) - Data e hora limite de entrega do 2º projeto (foi adiado 25 horas).
  • 13/Mai (22:00) - Data e hora limite de entrega com atraso. Um valor de penalização por cada dia de atraso


    Changelog


    Regras de submissão

    Super Zipper


    Introduction

    In today's world, the great appetite for bandwidth in communications and the desire to store large volumes of data makes data compression an important technology. This second LAP project is about data compression.

    The goal of this project is to develop a module in C, called "Zipper", exporting a collection of types and functions that are used in the implementation of a particular compressing/decompressing algorithm.


    Basic compressing technique

    When the Morse alphabet was envisioned, the choice of the sequences of dots and dashes was not made at random. One obvious goal was to minimize the length of the messages and therefore a variable-length encoding was used, associating shorter sequences of dots and dashes to the most commonly used characters. This idea of using binary strings of variable size to represent different symbols persists in many modern compression algorithms. The sequence of bits that is used to represent an input symbol is called the code of that symbol. In out algorithm, that we will call basic technique, the input file is traversed twice:

    An example

    Suppose that the input file contains only these five distinct symbols, A, B, C, D, E, with the number of occurrences 4, 2, 3, 2, 11, respectively. In total there are 22 symbols. This information is stored in a frequency table.

    The compression tree corresponding to the previous frequency table is this:

    Note that the symbols appear in tree leaves accompanied by their respective frequencies. As for the internal nodes, they record the sum of the frequencies of the descendant symbols. According to these rules, the root always records the total number of symbols contained in the file, in this case 22.

    There is a simple rule to obtain the code for each symbol from the tree. Starting from the root and making the journey down towards the symbol of interest, we consider that each left descend represents the bit 0 and each right descend represents the bit 1. From the compression tree we build an translation dictionary for our five symbols as follows:

    Now we are ready to perform the actual compression.

    Non-prefix property

    Note that in our dictionary, none of the codes is prefix of any other code. For example , there is a "000" but there are no more codes beginning with "000" . There is also a "1" but there are no more codes beginning "1". This easily results from the fact that all symbols are located at the leaves of the compression tree.

    This property, called non-prefix property, allows the unambiguous encoding of the data.

    Construction of the compression tree

    Here is a method for creating the compression tree from the frequency table:

    Properties of a compression tree

    Every compression tree has the following three interesting properties:
    1. An internal node always has two children.
    2. The frequency of an internal node is always strictly greater than the frequency of any of his two sons.
    3. Performing a scanning of the tree by levels, from bottom to top and from left to right, the sequence of frequencies of the nodes is never decreasing.

    Decompressing a compressed file

    Decompressing a compressed file requires the compression tree that was used to perform the initial compression. Just read the compressed file, bit by bit, and use the bits to guide you along a downward path of the tree. When you reach a leaf, a symbol has been recognized. Proceed to recognize the next symbol.

    The end of file

    Returning to our example, the 22 input file symbols are compressed into 44 bits (44=3*4+3*2+3*3+3*2+1*11). But this number is not a multiple of 8, which means that the last bit of the compressed data does not end at the border of a byte. So this question arises: how can the decoder determine where the data really ends in the encoded file?

    We solve this problem by having the encoder writing the total number of symbols in the header of the encoded file (more on this later). For now, we just need to realize that this information is already known, since our root node of the compression tree stores it.

    However, there is yet another problem during the enconding, now concerning the last symbol of the input file. Since the compressor processes the input file as a sequence of symbols of length N, the original file may not end at a symbol boundary. So, the last symbol is required to be padded (filled on the right) with zeros before encoding. Thus, the decoder will later face the problem of not knowing how many zeros were padded during compression. To solve this problem, we have the encoder writing, in the header of the compressed file, the real length (in bits) of the last symbol.

    Since the operating system does not allow us to have files whose size is not an integral number of bytes, in case our encoded file does not end at a byte boundary, we fill the missing bits with zeros on the right (padding)

    Criticism

    The compression technique used in this project has several problems:

    Nevertheless, it already exhibits some important ingredients of real-life compressing tools.


    Programming the system

    We now move to the details regarding the implementation of our system. You will need to program all the required operations in a single file. We provide you with an initial skeleton of the required zipper.c. The header file zipper.h is also provided and you cannot modify its contents. We also include an example main.c for your convenience.

    Reading and writing bits

    As you probably guessed, we will need to implement operations that will allow us to read and write arbitrary sequences of bits from/into files. We call these sequences Chunk and its data type is internally defined as a 32-bit unsigned value: For convenience we also declare the following additional basic data types: Our chunks store their bit sequences right aligned. So, if a chunk represents a sequence of 13 bits, for example, these bits are supposed to be least significant ones of the 32 used internally. Important: In all chunk operations, their bits are read/written to the file from left to right.

    There is a subtle detail regarding this way of representing sequences of bits: we can't tell their length just by looking at their value! For instance, the sequences 101, 0101 and 00101 will all be represented in the same way. We solved this problem (not in an ideal way) by carrying along, in most Chunk operations, a value representing the length of the chunk.

    The ChunksFile data type will allow the operations listed below:

    The top level functions

    The top level functions used to compress and decompress files are listed below: One thing that you may find intriguing is the last argument of the zip operation. In fact, since we have bit access to the files we are not restricted to handle the input (uncompressed) files as sequences of bytes. We may treat the the file to be compressed as a sequence of symbols of fixed length, but not necessarily with the value 8. Thus, we can compress files treating them as sequences of symbols with length 4, 7, 11, etc.

    Also, by looking at the unzip function's arguments, we observe that, for decompression, we don't need to pass the symbol length as an argument. Not that we will not need it, but because it would be extremely frustrating to have to memorize which length was used during the compression to be able to decompress a file. A better way is to store that information in clear, at the start of the encoded (compressed) file.

    Encoding symbols and decoding codes...

    The two operations that will do the job of encoding symbols into codes and decoding codes back to symbols are listed below: These functions operate directly from the files that are being processed. After an encoding or decoding operation, we have consumed a little bit more of the input file. In these functions, we need to return two different values: the chunk and its length. The return value is the number of bits of the Chunk representing the encoded symbol or the decoded code. The chunk is returned via an output parameter.

    Central to these operations is the ZipperInfo structure, defined in zipper.h. This structure holds all the information needed for the encode and decode operations. Unsurprisingly, this structure holds both the compression tree and the translation dictionary, both described in the Basic Compression Technique.

    Binary format of the compressed file

    The compressed file is organized in 3 sections:

    Storing the compression tree in a file...

    As mentioned before, the compression tree needs to be included in the compressed file. Since the data type for the actual tree is an opaque data type that you will have to choose and implement, we can only describe how the tree needs to be stored in the compressed file.

    The algorithm to dump the compression tree is quite simple:

    Removing ambiguity...

    During the tree construction process, we repeatedly take two trees from the forest and merge them as subtrees under a new parent node. Two questions arise if we want a completely unambiguous process:

    1. Which two trees should we pick?
    2. Which will be the left and the right subtrees?

    Firstly, we need to define a total order "<" over the trees in our forest:

    Now, the answers to the two questions:
    1. We always pick the two lesser trees from the forest.
    2. The left tree will be the lesser tree amoung the two picked trees.
    Note that these rules were used at each and every step, when building the tree of our example.

    Where to start?

    We strongly advise you to start with the ChunksFile file access operations. Next we recommend you to implement the operations of the FreqTable datatype and count the occurrences of the symbols in an input file. Alternatively, in an even easier starting approach, you can treat the files to be compressed as sequences of bytes (8 bits) and start right away with FreqTable operations.


    Regras principais


    Regras de entrega


    Outras regras


    Avaliação

    O docente responsável pela gestão e pela avaliação deste trabalho é o Professor Fernando Birra.

    A nota do projeto será em grande parte determinada por meios automáticos, através do Mooshak. Portanto é essencial respeitar a especificação contida neste enunciado, em todos os seus detalhes.

    Mas, relativamente a programas que funcionem minimamente, também haverá uma apreciação mais subjetiva da qualidade, tendo em conta aspetos, tais como:

    Obviamente não é obrigatório fazer o trabalho todo para obter nota positiva. Mas, claro, vale a pena trabalhar para produzir uma solução bastante completa e com a melhor qualidade possível.


    Observações


    Final

    Bom trabalho! Esperamos que goste.