DNA sequencing is an important technique, pervasively used in modern medical research. This problem is about pushing the state-of-the-art even further, by means of an ultra-fast DNA scanner. This analyser works by beaming a DNA molecule multiple times, thus obtaining multiple readings of incomplete nucleotide sequences. Speed in these readings comes at a price, and the trade-off is accuracy, the readings aren’t always right. Nevertheless, the machine has a great repeatability property, it repeats the errors in a deterministic way, thus allowing them to be corrected in a post-processing phase. Additionally, it identifies that a given sequence, although unreadable, was already spotted in other places, or in other readings. Your mission is to program the core of this analyser.
You need to determine the precise representation of a sequence of DNA nucleotides, from a sequence of readings given as input. Each reading is a sequence of characters, comprising the four nucleotide letters (AGCT), and well identified unreadable chunks of the form X1234-100. In the representation of a chunk, 1234 stands for an integer identifier, and 100 stands for its size. For instance,
is a sequence of 4 + 5 + 3 + 5 + 5 = 22 nucleotides, where there are two chunks of unknown nucleotides, with size 5 (X1-5 and X2-5), and where the chunk with identifier 2 happens twice at the end of the sequence. Moreover, consider the following reading conditions:
that allow a post-processing correction, by adopting the most generic nucleotide. For instance, if a set of readings reads A, and a T for the same position, the outcome should be G.
The input consists in a line with a positive integer (W) that denotes the fixed width of the DNA strand being analysed, a line with a positive integer (N) that denotes the number of samples being analysed, a line with a positive integer (X) that denotes the maximum number of chunks used, numbered from 0 to X - 1, and a sequence of N lines containing a sequence of characters, each representing a reading.
• | 1 ≤ W ≤ 5001 | Width of the DNA strand | |
• | 1 ≤ N ≤ 5001 | Number of samples | |
• | 0 ≤ X ≤ 10000 | Maximum number of chunks |
The output is a sequence of nucleotide characters ACGT, and character ? representing nucleotides that cannot be determined.
22
3
3
ACGX0-5CAAAX1-5X1-5
AAGTACACCAAAX0-5X1-5
ACX2-3CACCAAAAAAAAAX2-3A
ACGCGCGCCAAACGCGCCGCGC
7
2
5
X3-1ACX0-3X4-1
X4-1X0-3GX1-1X3-1
?ACACA?