Reed–Solomon codes for coders/Additional information

This section presents additional information about the structure of QR codes. The information here is currently incomplete. The full specification is in ISO standard 18004; unfortunately, this document is not freely available. Readers in search of the full details might turn to examining the source code of one of the many open-source QR code decoders available.

Standard QR codes
A standard QR symbol has the following features:
 * Three bull's-eye locator patterns in the corners
 * Format information (red) placed around the locators
 * Alternating light and dark timing marks in one row and column connecting the locators
 * Smaller bull's-eye alignment patterns within the symbol (only in symbols that are 25&times;25 and larger)
 * Version information (blue) placed near the corners (45&times;45 and larger)
 * Quiet zone 4 modules wide around the symbol

Micro QR codes
Micro QR symbols have these features:
 * One locator pattern in the corner
 * Format information around the locator
 * Timing marks along two edges
 * Quiet zone 2 modules wide around the symbol

Model 1 QR codes
Model 1 QR symbols are an older version of the standard that is rarely seen anymore. They have the following features.


 * Three locator patterns in the corners
 * Format information placed around the locators
 * Timing marks in one row and column connecting the locators
 * Extension patterns along the right and lower edges
 * Quiet zone 4 modules wide around the symbol

Symbol size
There are 40 sizes of standard QR codes, and 4 sizes of micro QR codes. Confusingly, these are referred to as "versions" by the QR code specification document. The table below gives the size in modules and capacity in 8-bit codewords for each version. Some of the capacity is used to store data and some is used to store error correction information, depending on the error correction level chosen.

The capacities for the non-micro QR codes can be given by the following formula: $$ \left \lfloor \frac{1}{8}\left( 16\left(V^2+8V+4\right) -25\left(\left\{V \ge 2:1,0\right\} +\left \lfloor \frac{V}{7} \right \rfloor \right)^2 -\left\{V \ge 7:1,0\right\}\left(36+40\left \lfloor \frac{V}{7} \right \rfloor\right) \right) \right \rfloor $$
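The formula translates directly into code; a quick sketch (function name is illustrative):

```python
def qr_codeword_capacity(version):
    """Total codeword capacity of a standard (model 2) QR symbol,
    following the formula above; version is 1 to 40."""
    a = 1 if version >= 2 else 0          # {V >= 2 : 1, 0}
    b = version // 7                      # floor(V / 7)
    c = 1 if version >= 7 else 0          # {V >= 7 : 1, 0}
    return (16 * (version ** 2 + 8 * version + 4)
            - 25 * (a + b) ** 2
            - c * (36 + 40 * b)) // 8
```

For example, version 1 gives 26 codewords and version 40 gives 3706, matching the standard symbol sizes.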

Note that there are two slightly different codeword layouts for version 3 micro QR codes, depending on whether the error correction level is L or M.

Model 1 size
Model 1 symbols have the same size as the corresponding model 2 version, but the codeword capacity is slightly different.

Alignment pattern
Standard QR codes of version 2 or higher contain a grid of alignment markers. The markers are placed in rows and columns given by the table below. The numbering starts at zero and counts from the upper left corner. Three corners of the grid overlap the locator patterns, and those alignment markers are omitted.

Format information
All QR codes use 5 bits of format information, encoded with generator 10100110111 to produce 15 bits.

In standard QR codes, the first two bits give the error correction level, as shown below. The remaining three bits identify the mask pattern for the data area.

Model 2 codes (the current standard) use the format mask 101010000010010, while model 1 codes use 010100000100101 instead. Note that the first five bits are inverted. This means that, for example, L is represented by two dark modules in model 2, but by two light modules in model 1. The masks were chosen so that valid masked format information from one model is never valid for the other model.
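Putting the pieces above together (5 data bits, the BCH generator, the model 2 mask), the format encoding can be sketched as follows (function name is illustrative):

```python
GENERATOR = 0b10100110111        # BCH(15,5) generator polynomial
MODEL2_MASK = 0b101010000010010  # format mask for model 2 codes

def encode_format(bits5):
    """Append 10 BCH check bits to 5 format bits, then apply the mask."""
    rem = bits5 << 10
    for shift in range(4, -1, -1):            # long division over GF(2)
        if rem & (1 << (shift + 10)):
            rem ^= GENERATOR << shift
    return ((bits5 << 10) | rem) ^ MODEL2_MASK
```

Note that encoding five zero bits yields exactly the mask value, since the BCH remainder of zero is zero.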

In Micro QR codes, the format information has a different structure. The first three bits give the version and ECL, as shown below; only certain error correction levels are possible for each version. The other two bits select the mask pattern for the data area.

Micro QR format information is masked with 100010001000101.

Data masking
Data masking inverts certain modules in the data area based on the formulas below. The variables i and j are the zero-based row and column numbers of a module. The purpose of the masking is to break up large solid blocks of black or white, and to avoid patterns that look confusingly similar to the locator marks. There are 8 different mask patterns available for standard QR codes, and 4 for micro codes.
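For reference, the eight mask conditions for standard QR codes are commonly given as follows (zero-based row i and column j; a module is inverted when the condition for the selected mask is true):

```python
# The eight mask conditions for standard QR codes; a module at (i, j)
# is inverted when the condition of the chosen mask evaluates to True.
MASKS = [
    lambda i, j: (i + j) % 2 == 0,
    lambda i, j: i % 2 == 0,
    lambda i, j: j % 3 == 0,
    lambda i, j: (i + j) % 3 == 0,
    lambda i, j: (i // 2 + j // 3) % 2 == 0,
    lambda i, j: (i * j) % 2 + (i * j) % 3 == 0,
    lambda i, j: ((i * j) % 2 + (i * j) % 3) % 2 == 0,
    lambda i, j: ((i + j) % 2 + (i * j) % 3) % 2 == 0,
]
```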

RS block size
Because the Reed–Solomon coding uses 8-bit codewords, an individual code block cannot exceed 255 codewords in length. Since the larger QR symbols contain much more data than that, it is necessary to break the message up into multiple blocks. The QR specification does not use the largest possible block size, though; instead, it defines the block sizes so that no more than 30 error-correction symbols appear in each block. This means that at most 15 errors per block can be corrected, which limits the complexity of certain steps in the decoding algorithm.

The table below shows the data capacity and block sizes for each error correction level. The parenthesized pairs (n,k) give the total block length n and the number of data codewords in the block k. Capacity is in codewords, and does not take mode indicator overhead (see below) into account, so the number of bytes that can be stored is slightly less.

Note that version 1 and 3 micro QR codes only store 4 bits for the final data codeword. Four zero bits are appended before processing by the error correction algorithm.

The message is broken into blocks by simply putting the first k codewords in the first block, the next k in the second, and so on.

Interleaving
There may be localized damage to a QR symbol; for example, it may be partially obscured by a smudge. The symbol can still be decoded properly if the allowable number of errors is not exceeded in any of its error correction blocks. For this reason, the blocks are spread out by interleaving their codewords together when the symbol is created. The codeword sequence is built by taking the first codeword from each block, then the second from each, and so on. If one block is exhausted, skip over it and continue placing codewords from the other blocks. After all data codewords have been placed in sequence, the error correction codewords are placed similarly.

For example, a version 5 symbol stores 134 codewords. At error correction level Q, there are 62 data codewords and 72 error correction codewords. The message is broken into blocks by filling each row in the data section of the table below in turn. The first two blocks are one codeword shorter than the last two.

Next, error correction codes are calculated for each row of the table. Finally, the table is read column-by-column from left to right, resulting in the following codeword sequence:


 * D1 D16 D31 D47 D2 D17 D32 D48 ... D15 D30 D45 D61 D46 D62 E1 E19 E37 E55 ... E18 E36 E54 E72
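The interleaving rule described above is short to implement; a sketch:

```python
def interleave(blocks):
    """Interleave codeword blocks: take the first codeword of each block,
    then the second of each, and so on, skipping exhausted blocks."""
    out = []
    for i in range(max(len(b) for b in blocks)):
        for b in blocks:
            if i < len(b):
                out.append(b[i])
    return out
```

Applied to the four data blocks of the version 5-Q example (sizes 15, 15, 16, 16), this produces exactly the D1 D16 D31 D47 ... D46 D62 sequence shown above.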

Model 1 block size
Model 1 symbols use different block sizes, as shown below. Only 4 bits are stored for the first codeword, and 4 zero bits are prepended before processing with the error-correction algorithms. Multiple block sizes are not used within a symbol, so there are sometimes leftover codewords.

Model 1 QR symbols do not use interleaving. The data codewords for each block are simply concatenated in order, followed by the error correction codewords.

Encoding modes
Standard QR codes use a 4-bit encoding mode indicator followed by a length field. The length field has a variable number of bits, which depends on the mode and the symbol version.

After the length, the encoded data bits appear, followed by another mode indicator. This allows different encoding modes to be used for different portions of the message. The special indicator 0000 signals the end of the message, but it can be omitted if there are not enough data bits left in the symbol to store it.

Here is a list of special indicator values. They do not initiate an encoding mode, but are used for other purposes.

Micro QR codes use a different indicator system. Version 1 micro symbols do not use an indicator; only numeric encoding is possible. Version 2 uses a 1-bit indicator, and only the numeric and alphanumeric encodings can be used. Versions 3 and 4 can use all four encoding modes.

End-of-message can be signaled with a numeric field of length zero.

Numeric encoding
In numeric encoding, each group of three digits is encoded with ten bits. If the number of digits is not a multiple of three, the last group is shortened. If there is one leftover digit, it is encoded in four bits. If there are two digits in the last group, it is stored with seven bits.
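The grouping rule above can be sketched as follows (function name is illustrative; it returns the bits as a string for readability):

```python
def encode_numeric(digits):
    """Encode a digit string: 10 bits per group of 3 digits, 7 bits for a
    trailing pair, 4 bits for a single trailing digit."""
    bits = ""
    for k in range(0, len(digits), 3):
        group = digits[k:k + 3]
        nbits = {3: 10, 2: 7, 1: 4}[len(group)]
        bits += format(int(group), "0{}b".format(nbits))
    return bits
```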

Alphanumeric encoding
The alphanumeric encoding stores 45 different characters, including numbers, upper-case letters, space, and a few symbols, as shown in the table below.

Two characters A and B are combined with the following formula, and the result is stored in an 11-bit value.


 * 45 &times; A + B

If there are an odd number of characters to be encoded, the last one is stored by itself in a 6-bit field.
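The pairing formula above can be sketched as follows, using the standard 45-character table order (digits, upper-case letters, then space and the eight symbols):

```python
ALPHANUMERIC = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"

def encode_alphanumeric(text):
    """Encode pairs of characters as 45*A + B in 11 bits; a trailing
    single character is stored alone in 6 bits."""
    bits = ""
    for k in range(0, len(text), 2):
        pair = text[k:k + 2]
        if len(pair) == 2:
            value = 45 * ALPHANUMERIC.index(pair[0]) + ALPHANUMERIC.index(pair[1])
            bits += format(value, "011b")
        else:
            bits += format(ALPHANUMERIC.index(pair), "06b")
    return bits
```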

Byte encoding
This encoding simply stores a sequence of 8-bit bytes. According to the QR 2005 standard, the ISO Latin-1 encoding is used by default, but in previous standards the default was JIS-8 (the 8-bit subset of Shift JIS). ECI markers are used to select a different encoding; unfortunately, many QR code readers do not understand ECI markers, so encoders frequently do not include them. In practice, QR codes that store characters outside the basic ASCII set often use UTF-8 or various language-specific encodings, and the decoder must guess which encoding is being used.

ECI
After the ECI indicator are one, two, or three bytes that specify the Extended Channel Interpretation to be used. Examine the high bits of the first data byte to determine the number of data bytes present.
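The length rule can be sketched as follows: the count of leading one bits in the first byte (none, one, or two) selects a 1-, 2-, or 3-byte designator (function name is illustrative):

```python
def eci_value(data):
    """Decode an ECI designator from a list of byte values.
    Returns (eci_number, bytes_consumed)."""
    first = data[0]
    if first < 0x80:                    # 0xxxxxxx: 7-bit value, 1 byte
        return first, 1
    if first < 0xC0:                    # 10xxxxxx: 14-bit value, 2 bytes
        return ((first & 0x3F) << 8) | data[1], 2
    # 110xxxxx: 21-bit value, 3 bytes
    return ((first & 0x1F) << 16) | (data[1] << 8) | data[2], 3
```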

Here is a partial list of defined ECI codes.

This list is based on information from the ZXing project.

Kanji encoding
QR codes were developed in Japan and have special support for storing Japanese text. The Kanji encoding uses 13 bits for each kanji character stored. It is based on the Shift JIS encoding, which uses two bytes per kanji character. The first Shift JIS byte is in the range 0x81 to 0x9F or 0xE0 to 0xEB, and the second is 0x40 to 0xFC. The encoded value in the QR code is as follows.


 * (A – 0x81) &times; 0xC0 + (B – 0x40) if 0x81 &le; A &le; 0x9F
 * (A – 0xC1) &times; 0xC0 + (B – 0x40) if 0xE0 &le; A &le; 0xEB

Note that characters higher than 0xEBBF cannot be encoded with this method. This means that the JIS X 0208 character set can be encoded, but not the JIS X 0213 extension.
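Following the Shift JIS ranges used by the Kanji mode (first byte 0x81–0x9F or 0xE0–0xEB), each character packs into 13 bits; a sketch (function name is illustrative):

```python
def encode_kanji_char(a, b):
    """Pack one Shift JIS kanji (first byte a, second byte b) into a
    13-bit string, per the two range formulas."""
    if 0x81 <= a <= 0x9F:
        value = (a - 0x81) * 0xC0 + (b - 0x40)
    elif 0xE0 <= a <= 0xEB:
        value = (a - 0xC1) * 0xC0 + (b - 0x40)
    else:
        raise ValueError("first byte outside the encodable Shift JIS range")
    return format(value, "013b")
```

For example, the character at Shift JIS 0x935F encodes to the value 0x0D9F.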

Structured append
Structured append mode allows a long message to be split between several separate QR symbols. The indicator is followed by a 4-bit field that gives the number of this symbol within the set, counting from 0. Next is another 4-bit field which gives the number of the last symbol in the set (equal to the number of symbols minus 1). Finally, there is an 8-bit exclusive-or checksum of the bytes in the complete message. (For kanji data, the bytes of the ShiftJIS encoding are checksummed.) Each symbol in the message contains the same checksum value, and this helps ensure that parts from different messages are not mixed up when decoding.
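The checksum described above is a plain byte-wise XOR; a sketch (function name is illustrative):

```python
def structured_append_parity(data):
    """8-bit XOR parity over all bytes of the complete message; every
    symbol in a structured-append set carries this same value."""
    parity = 0
    for byte in data:
        parity ^= byte
    return parity
```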

Universal Reed-Solomon Codec
In the main article, a simple Reed–Solomon codec was described; for the sake of simplicity, some non-essential parameters were hidden. These parameters do not affect the capabilities of the decoder: they merely shift the internal values, while the structure remains totally equivalent, so the resulting repair capability is the same (i.e., the Singleton bound still applies).

A universal Reed–Solomon codec is an encoder/decoder that lets you set those hidden parameters. This makes it possible to decode the RS codes generated by (almost) any other Reed–Solomon codec, however it was implemented.

Here is the list of these non-essential, shifting parameters:


 * fcr: the first consecutive root, i.e., the power at which the sequence of roots of the generator polynomial starts. Usually set to 0, but it can be any value between 0 and the Galois field's cardinality (e.g., 255 for GF(2^8)).
 * generator: the integer that replaces alpha in the standard academic nomenclature. It must generate the field's multiplicative group; in practice a small prime such as 2, 3, or 5 is used, with 2 the usual default. This is the number from which the log/antilog tables are built: the tables are generated by computing the successive powers generator^0, generator^1, ..., reduced by the primitive polynomial. Once the tables are built, almost all computations can be done in log space; the generator itself is only needed by the few functions that convert values into or out of that space (the first stage of encoding, and the first and last stages of decoding).
 * prim: the primitive polynomial used to reduce products in the field. Warning: you cannot use just any number! With a wrong prim, computations will produce wrong results or loop endlessly. Prim is always an integer between the field's cardinality and its double (e.g., between 256 and 512 for GF(2^8)), and it must be irreducible, which depends on the chosen generator. So prim depends on both the generator and c_exp.
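To make the role of the generator concrete, here is a sketch of how the log/antilog tables can be built for an arbitrary generator and primitive polynomial (names follow the parameters above; gf_mult_nolut is an illustrative helper doing carry-less multiply-and-reduce):

```python
def gf_mult_nolut(x, y, prim, field_charac_full):
    """Carry-less (XOR) multiplication of x and y, reducing by prim."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if x & field_charac_full:   # overflow past the field: reduce
            x ^= prim
    return r

def init_tables(prim=0x11D, generator=2, c_exp=8):
    """Build the log/antilog tables by enumerating the generator's powers."""
    field_charac = 2 ** c_exp - 1
    gf_exp = [0] * (field_charac * 2)   # doubled so lookups can skip a modulo
    gf_log = [0] * (field_charac + 1)
    x = 1
    for i in range(field_charac):
        gf_exp[i] = x                   # antilog: generator^i -> value
        gf_log[x] = i                   # log: value -> exponent
        x = gf_mult_nolut(x, generator, prim, field_charac + 1)
    for i in range(field_charac, field_charac * 2):
        gf_exp[i] = gf_exp[i - field_charac]
    return gf_log, gf_exp
```

With a correct prim, every non-zero field element appears exactly once in the first field_charac entries of the antilog table.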

To compute the list of possible primitive polynomials for the chosen generator and field's exponent, you can use the function:
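The referenced function is not reproduced here; a minimal sketch of such an exhaustive search follows (parameter names follow the text; the fast_primes filter and other refinements are omitted):

```python
def find_prime_polys(generator=2, c_exp=8, single=False):
    """Exhaustive search for prime (primitive) polynomials: a candidate
    is kept iff repeated multiplication by the generator visits every
    non-zero field element exactly once before cycling."""
    field_charac = 2 ** c_exp - 1
    full = field_charac + 1
    primes = []
    for prim in range(full + 1, full * 2, 2):   # odd constant term only
        seen, x, ok = set(), 1, True
        for _ in range(field_charac):
            # carry-less multiply x by generator, reducing by prim
            r, a, b = 0, x, generator
            while b:
                if b & 1:
                    r ^= a
                b >>= 1
                a <<= 1
                if a & full:
                    a ^= prim
            x = r
            if x in seen:               # duplicate: prim is not primitive
                ok = False
                break
            seen.add(x)
        if ok:
            if single:
                return prim
            primes.append(prim)
    return primes
```

For GF(2^8) with generator 2, this finds 16 polynomials, including 0x11D, the one used by QR codes.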

Beware that for c_exp > 14, the search may take practically forever! The exhaustive search always gives a correct result, but its cost grows exponentially with the field exponent. In practice, if you just need one primitive polynomial to make the RS codec work (rather than the list of all possible prim polynomials), you can set fast_primes=True and single=True; this returns the first primitive polynomial the function finds, usually within a few seconds.

This universal Reed–Solomon codec goes further by allowing work in any GF(2^p) field, via the following parameters:


 * c_exp: the Galois field's exponent (i.e., the p in GF(2^p)). For example, c_exp=16 means that you will work in GF(2^16) == GF(65536).
 * field_charac: an internal variable you don't have to modify. It holds the field's cardinality minus 1, e.g., field_charac = 65535 for GF(2^16).
 * field_charac_full: the same plus 1 (i.e., the field's full range, including 0).

Here is the code of the universal Reed-Solomon codec. You can compare it with a diff to the code presented in the main article to see the differences:

Autodetecting the Reed-Solomon parameters
If you intend to work with QR codes or RS codes generated by an RS codec other than your own, you may first have to determine which parameters it uses.

If you have a sample input message and its corresponding RS code, then you can use the following function to try to find the correct set of parameters to decode it.

This function is naive, and thus not fit for real-time computation. You should use it only once on a message (or a set of messages) to determine the parameters, and then hardcode them in your RS codec.

There are a few research articles about more efficient algorithms, but this is outside the scope of this article.

Probabilistic prime polynomial generator
The deterministic prime polynomial generator provided above is a good first approach to finding prime polynomials in Galois fields, as it is guaranteed to find all of them. However, this guarantee comes at a cost: algorithmic complexity. If you try to find a prime polynomial for a larger Galois field (a bigger c_exp), the process can take a practically unbounded amount of time or memory.

Indeed, the prime number theorem says that the probability that a number of magnitude n is prime is approximately $$\frac{1}{\ln n}$$; in other words, the bigger the Galois field, the sparser the prime polynomials, and the more time we will spend searching for one.

To avoid these issues, we can relax our guarantees by using a Monte-Carlo probabilistic generator. The generator below does not guarantee finding all prime polynomials, but it will find some, and we only need one to initialize the Reed–Solomon codec.

How it works:
 * We pick a candidate number at random between the field's cardinality and its double (e.g., between 256 and 512 for GF(2^8)).
 * We check whether it is prime with a probabilistic primality test called Miller–Rabin. Without going into the mathematical details, a primality test checks whether a number is prime. Ours is probabilistic, which means there is a chance of a false positive (but never a false negative): a composite number (i.e., a non-prime polynomial) can occasionally be declared prime. To guard against this, the test is run several times in a row; the more rounds, the stronger the guarantee, at the cost of a little more time (still insignificant compared to the deterministic approach).
 * If the number/polynomial appears prime, we build the whole Galois field from it and check that no value appears twice (the mapping from exponents to field elements must be injective).
 * If the generated field contains only unique values, then we have found a prime polynomial!

Below is Python code implementing a probabilistic prime polynomial generator.
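The full listing (with the optional Bloom filter) is long; here is a self-contained minimal sketch of the same idea, with illustrative names and without the Bloom filter:

```python
import random

def is_probable_prime(n, rounds=40):
    """Miller-Rabin probabilistic primality test: false positives are
    possible, false negatives are not; more rounds = stronger guarantee."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):      # quick trial division
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                # witness found: n is composite
    return True

def random_prime_poly(c_exp=8, generator=2, tries=10000):
    """Monte-Carlo search: draw a random candidate, filter with the
    primality test, then verify that powers of the generator enumerate
    every non-zero field element exactly once (no duplicates)."""
    field_charac = 2 ** c_exp - 1
    full = field_charac + 1
    for _ in range(tries):
        prim = random.randrange(full + 1, full * 2) | 1  # odd constant term
        if not is_probable_prime(prim):
            continue
        seen, x, ok = set(), 1, True
        for _ in range(field_charac):
            r, a, b = 0, x, generator   # carry-less multiply by generator
            while b:
                if b & 1:
                    r ^= a
                b >>= 1
                a <<= 1
                if a & full:
                    a ^= prim
            x = r
            if x in seen:
                ok = False
                break
            seen.add(x)
        if ok:
            return prim
    return None
```

Note the trade-off of the integer-primality filter: it discards some valid primitive polynomials whose integer value is composite, which is acceptable since only one polynomial is needed.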

Note that the Bloom filter is optional; it is provided only as a way to compute bigger fields without exhausting memory during the Galois field generator-polynomial test.

Optimizing performance
There are several ways to optimize the performance of a Reed–Solomon codec, but most fall into three categories:


 * Algorithms with lower algorithmic complexity, which improve the asymptotic time, not just the constant factor. This is the Holy Grail of error correction codes, and some O(n log n) algorithms exist for Reed–Solomon, such as the NTT (an FFT applied over Galois fields).
 * Mathematical tricks to reduce the complexity of operations, without changing the asymptotic time (or bending it a little).
 * Implementation tricks, such as inner-loop optimization, parallel calculation using SIMD, or double buffering (using a thread to write one codeword out while the next is being computed). Most of the implementation tricks for Reed–Solomon are in fact the same as those applied to matrix multiplication and factorization, the only difference being that Reed–Solomon requires Galois field arithmetic: XOR (carry-less addition) and reduction modulo a primitive polynomial.

The following sections will describe some of the techniques, without being exhaustive.

Note that there is also a fourth way that will not be covered here: using hardware specifically designed to maximize parallelization of operations (such as VLSI chips or FPGA boards).

Algorithms with lower complexity

 * Fast Fourier Transform: as we saw in the tutorial, Reed–Solomon encoding is done by polynomial division of the message by the generator polynomial. Mathematically, operations on polynomials in the spatial domain correspond (under some conditions) to simpler pointwise operations in the frequency domain; division becomes a deconvolution. Hence, with the Fast Fourier Transform (FFT), we can convert a polynomial into the frequency domain, apply the division there, and project the result back with an inverse FFT (IFFT). This runs asymptotically in O(n log n).
 * Number Theoretic Transform: an extension of the FFT technique where the transform is computed modulo an integer of choice instead of over the complex roots of unity $$e^{-2\pi i/N}$$. This has the great advantage that the results of the NTT can land directly in the target Galois field, instead of requiring a modular reduction after the FFT. This type of transform can potentially be optimized further than the FFT: since Galois field operations rely only on XOR, additional implementation tricks can be applied (a benchmark in Python comparing NTT with FFT is available here).

Mathematical tricks
Mathematical tricks are halfway between a change of algorithm and an implementation optimization: they change parts of the algorithm to fit computer architectures, but by changing mathematical operators rather than architectural details. One such example is Cauchy–Reed–Solomon.

In our implementation, the generator polynomial was in fact a very simple Vandermonde construction (in our case only a vector), but Cauchy–Reed–Solomon uses a Cauchy matrix instead. This provides nice mathematical properties, one being that only XOR operations are needed to encode, removing the overhead of Galois field multiplications, by representing the field's numbers not as bit vectors (polynomial form) but as bit matrices. Another difference is an additional packet-size parameter, which controls the number of symbols placed in each ECC block: you are no longer limited to 2^8 = 256 symbols but to 2^8 &times; packet size (the packet size must remain smaller than the field size). The scheme is also a lot faster and more resilient to burst errors than standard Reed–Solomon.

Implementation optimization

 * Bit manipulation: since Galois field operations all reduce to XORs and shifts at the low level, bit manipulation can greatly speed up calculations. One simple example is to use x << 1 to multiply by 2 instead of x * 2: shifting the variable x one bit to the left appends a zero bit and thus multiplies x by 2. More generally, x << n multiplies x by 2^n.
 * Reducing k and n: the number of ECC symbols can be reduced to decrease the calculation time of Reed–Solomon, since it is proportional to them. The total size of a codeword can also be reduced: this limits the message size, but can be faster. This is particularly useful when using Reed–Solomon over GF(2^16): instead of codewords of 65535 symbols, which would not fit well in memory, you can set the codeword size to 1024, or any other value that fits your CPU's L1 or L2 cache. This lets the CPU minimize transfers: for one codeword's calculation, only one access to the disk or RAM is required to fetch the values, and everything else is done in CPU cache.
 * SIMD and parallelization: by encoding multiple messages at once, a CPU supporting SIMD instructions (as most do nowadays) can broadcast the same operations across all messages, since they all use the same generator polynomial/matrix. This drastically reduces the calculation time. For some programming languages, optimized distributions exist that increase SIMD performance on specific architectures, such as the Intel Distribution for Python. In addition, working in parallel lets you load multiple messages at once from a file or stream, which divides the I/O workload by the number of parallel messages.
 * Double buffering: I/O is probably the biggest bottleneck after the heavy CPU calculation Reed–Solomon requires. To alleviate this, a separate thread can write encoded messages out while the calculation of the next batch of codewords starts immediately.
 * Using a CRC to enable erasure decoding, thus doubling the number of recoverable symbols compared to error decoding. The idea is to use a cyclic redundancy check, a fast error-detection code, on each chunk of the source message to detect whether it is corrupted; corrupted chunks can then be flagged as erasures for the Reed–Solomon decoder. A CRC is a lot faster to compute than Reed–Solomon, so for the same number of symbols you can reduce your calculation time drastically.
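As a sketch of the last idea (names illustrative): split the message into chunks, store a CRC per chunk, and report mismatching chunks as erasure positions for the decoder:

```python
import zlib

def locate_erasures(chunks, stored_crcs):
    """Flag chunks whose CRC32 no longer matches the stored value as
    erasure positions, which a Reed-Solomon decoder can repair at twice
    the rate of unlocated errors."""
    return [i for i, (chunk, crc) in enumerate(zip(chunks, stored_crcs))
            if zlib.crc32(chunk) & 0xFFFFFFFF != crc]
```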

Resources: ParPar implementation notes.