Data Structures and Algorithms/Sorting Data

The wish for order is old, and there are as many reasons to sort as there are things that can be sorted. The main reason, however, is that searching can be much faster when one can rely on an order: sorting is the main precondition for efficient searching. This is true for stores as well as for dictionaries.

In this chapter sorting is restricted to the sorting of data records. Since the times when tapes were used as data storage are long gone, we restrict ourselves to data with random access. It does not matter whether the records are stored in fast RAM or on disk; the functionality of the algorithms stays the same.

Additionally we require that all data records have the same length and structure and that they are stored as a list; hence this chapter is called "Sorting of Arrays". The structure of the data records is of no importance to us; we only look at the key, which is the part of the data record by which the records are sorted.

At a time when storage was very expensive, the memory needs of an algorithm were important, and algorithms that could work "in place" were favored (the array with the data is given to the sorting routine and the sorted data is returned in the same array). Nowadays it is more interesting to be able to choose the behavior depending on whether the data will be needed again later in their original order.

When checking the usability of an algorithm the following criteria are of importance:
 * is the algorithm able to handle the sorting key
 * is there sufficient memory available
 * is the algorithm faster than alternative ones
 * is it possible to implement the algorithm in the wanted programming language

In this chapter we will have a look at some selected sorting algorithms in order to explain the basic strategies; after studying this chapter one should be able to select a suitable algorithm for a given problem and to understand the principles and characteristics of algorithms not explained here.

Introduction
Put simply, sorting is based on these elementary operations:
 * selecting and inserting
 * interchange
 * spreading and collecting
 * distributing

Sorting in applications

In the majority of applications sorting is an important operation and is responsible for a large share of the processing time as soon as there are more than a few hundred data records. There are estimates that 20% to 30% of the runtime of professional programs is used for sorting. Sorting is found in very different fields:


 * words in a dictionary (alphabetical order)
 * statement of account (ordered by date)
 * student records (name, social security number, subject, ...)
 * addresses (zip-code, name of town/city, street, ....)

Definition of the Problem
When a child sorts its toys, all blue things go into one box, all yellow ones into another, all red ones into a third, and all others into box number four. We see that the criteria for sorting need not be numerical; they only have to be distinguishable.

Therefore we have to introduce another restriction: when comparing the elements of a list, it must be possible to tell whether element A is smaller than, equal to, or bigger than element B. When sorting colored toys we have to number the colors (making a numbered list), or we could use the wavelength of the colors.
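The numbered-list idea can be sketched in Python as follows. This is only an illustration: the ranking itself and the names `COLOR_RANK`, `rank` and `compare_toys` are assumptions, not part of the original text.

```python
# Hypothetical ranking: the numbers are arbitrary, they only fix an order.
COLOR_RANK = {"blue": 0, "yellow": 1, "red": 2}
OTHER = len(COLOR_RANK)  # every other color shares the last position


def rank(color):
    """Map a color name to its position in the numbered list."""
    return COLOR_RANK.get(color, OTHER)


def compare_toys(a, b):
    """Return -1, 0 or 1 if a is smaller than, equal to, or bigger than b."""
    ra, rb = rank(a), rank(b)
    return (ra > rb) - (ra < rb)


toys = ["red", "green", "blue", "yellow", "blue"]
sorted_toys = sorted(toys, key=rank)
```

Once every color has a number, the toys can be handled by any algorithm that only needs a smaller-, bigger-, same-relation.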

When comparing numerical values, we normally work with built-in functions; the compiler will automatically know how to treat

if(A > B)

depending on the definition of the variables. Somebody writing a program in assembler for an old 8-bit processor, however, will have to check how the bits of a 32-bit float are to be interpreted before a meaningful sorting routine can be programmed.
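To illustrate what such a bit interpretation involves, the following Python sketch maps a float to an unsigned integer whose order matches the numerical order. It assumes the IEEE 754 single-precision layout, which the text does not name, and it ignores NaN values; the function name `float32_sort_key` is hypothetical.

```python
import struct


def float32_sort_key(x):
    """Map a float to an unsigned 32-bit integer with the same ordering.

    Assumes IEEE 754 single precision and ignores NaN values.
    """
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    if bits & 0x80000000:
        # Negative number: invert all bits so larger magnitudes sort first.
        return bits ^ 0xFFFFFFFF
    # Non-negative number: set the sign bit so it sorts after all negatives.
    return bits | 0x80000000


values = [3.75, -2.5, 0.0, 100.0, -0.5, 1.0]
by_key = sorted(values, key=float32_sort_key)
```

With such a key, a routine that can only compare unsigned integers is able to sort floats as well.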

For alphabetical order we need a list with two columns. The first column holds the number of the row and the second column holds the interpretation. One of the most commonly used lists of this type is the ASCII table. What the computer does is not really an alphabetical sort; the sort is by position numbers.
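A short Python sketch shows the difference between sorting by position numbers and alphabetical order. The sample words are made up, and comparing lowercase copies is just one possible workaround, not something the text prescribes.

```python
words = ["banana", "Zebra", "apple"]

# Default string comparison uses position numbers: 'Z' is 90, 'a' is 97.
by_code = sorted(words)

# One possible way to get closer to alphabetical order:
# compare lowercase copies of the words.
by_alphabet = sorted(words, key=str.lower)
```

In `by_code` the word "Zebra" comes first because all uppercase letters have smaller position numbers than the lowercase ones.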

All the algorithms discussed in this chapter work with numerical representations where a smaller-, bigger-, same-relation is defined.

Indirect sort
The time necessary for swapping data records can be reduced drastically when it is possible to maintain a list with references to the records and just to swap these references.

After sorting, the array with the data records is still untouched, but the array with the references holds the record numbers in sorted order: rearranging the data records according to these numbers yields the sorted data. The time needed for this rearrangement grows linearly, $$\mathcal O(n)$$, with the number of data records.

An indirect sort is advisable when the cost of swapping is high (the data record is much longer than the 4 bytes of an index) and the additional memory can be provided without problems (since it is just 4 bytes per record, this should rarely be an issue).
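A minimal Python sketch of an indirect sort, under the assumption that each record is a dictionary with a `key` field and a long `payload` (both names are hypothetical):

```python
records = [
    {"key": 42, "payload": "x" * 100},
    {"key": 7, "payload": "y" * 100},
    {"key": 19, "payload": "z" * 100},
]

# The reference array starts as 0, 1, 2, ... and is the only thing sorted.
order = sorted(range(len(records)), key=lambda i: records[i]["key"])

# Optional final step: one linear pass builds the sorted sequence.
sorted_records = [records[i] for i in order]
```

The list `records` itself is never modified, so the original order remains available after the sort.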

Stability
In applications it happens frequently that data records have to be sorted by more than one criterion. For example, we have a list of addresses and want to sort it by second name and, within the second name, by first name.

A sorting algorithm is called stable when records with equal keys keep their relative order. If such a list is sorted first by first name and then by second name with a stable algorithm, every block with identical second name is still sorted by first name: a stable sorting algorithm does not destroy the results of previous sorts.

With algorithms that are not stable there is much more effort involved for getting the same result. First the list has to be sorted by second name and then all the blocks with identical second name have to be detected and sorted by first name.

For sorting an address list by zip-code, within zip-code by street name, within street name by second name, within second name by first name, the program code will become a bit complicated when using an unstable algorithm. With a stable algorithm the four sorts are done in reverse order and everything is done.
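The four sorts in reverse order can be sketched in Python, whose built-in `list.sort` is documented to be stable. The address tuples are made-up sample data.

```python
from operator import itemgetter

# (zip_code, street, second_name, first_name), hypothetical sample data
addresses = [
    ("20095", "Main St", "Smith", "Zoe"),
    ("10115", "Oak Ave", "Miller", "Ben"),
    ("20095", "Main St", "Smith", "Adam"),
    ("10115", "Oak Ave", "Jones", "Cara"),
    ("10115", "Elm Rd", "Miller", "Dan"),
]

result = list(addresses)
# Least significant key first, most significant key last.
for column in (3, 2, 1, 0):
    result.sort(key=itemgetter(column))  # each pass keeps earlier orderings
```

Because every pass preserves the order produced by the previous passes, no blocks of identical keys have to be detected and sorted separately.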

Runtime-Analysis
Sorting algorithms play an important role in computer science, and over the years a lot of algorithms have been developed. A classical way to group them is to distinguish whether they are comparison-based or not. Since most algorithms of the latter type were very limited in their use, this distinction effectively meant "usable" or "exotic".

We will show here that it is a much better idea to distinguish dividing and non-dividing algorithms in order to explain the different speeds. Dividing means making subgroups from groups. In this chapter we only give a very brief description of the algorithms; a detailed description can be found further down.

First we have a look at an algorithm of the non-dividing type.

SelectionSort
If there are only two data records, the algorithms of this type are reduced to the question whether the two records have to be swapped or not. In this special case these algorithms are faster than all the others.
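For reference, here is a minimal Python sketch of the textbook selection sort. It is not taken from the original chapter, and the function name `selection_sort` is an assumption.

```python
def selection_sort(keys):
    """Sort a list in place by repeatedly selecting the smallest remaining key."""
    n = len(keys)
    for i in range(n - 1):
        smallest = i
        # Search the unsorted rest of the list for the smallest key.
        for j in range(i + 1, n):
            if keys[j] < keys[smallest]:
                smallest = j
        # Move it to the front of the unsorted part, if it is not there yet.
        if smallest != i:
            keys[i], keys[smallest] = keys[smallest], keys[i]
    return keys
```

With only two records the loops perform a single comparison and at most one swap, which is the special case described above.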

Now we have a look at some algorithms which follow the motto "divide and conquer".

Automatic Division vs Fixed Division
It is striking that for QuickSort we work with the logarithm of N and for the other algorithms with N itself; we take a closer look at why.

There are infinitely many different real numbers, and it is possible to prove that there are even infinitely many real numbers in any given interval. QuickSort and similar algorithms work on the assumption that it is impossible to know beforehand how many times the domains have to be subdivided. Therefore the number of division levels is determined during the sort.

The other algorithms work on the assumption that on a computer the number of different values is limited and countable. Even when working with double-precision floats, the maximum number of different values is known beforehand: it is $$2^{64}$$. When the dividing factor stays constant (the same number of buckets or domains on every level), it is possible to tell the number of division levels that make sense; from there on only records with identical key values will be found in a bucket.

How to compare algorithms
If we have 4 data records, QuickSort has finished its job after two recursion levels (on every recursion level all the records have to be handled once). If working with RadixSort and five-digit numbers from 00000 up to 99999, all the data records have to be handled five times. We have a real advantage for QuickSort.

This changes instantly when we have to sort 200,000 data records. RadixSort still has to handle all the records only five times, but QuickSort now generates 18 recursion levels ($$2^{18} = 262\,144$$) and handles all the records on every level.
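The two cases can be checked with a few lines of Python. The sketch assumes that QuickSort halves its domains on every level, which is the favorable case the text works with; the names `quicksort_levels` and `RADIX_PASSES` are hypothetical.

```python
def quicksort_levels(n):
    """Recursion levels if every split halves the domain (assumption)."""
    # Smallest number of halvings L with 2**L >= n, computed without floats.
    return (n - 1).bit_length() if n > 1 else 0


RADIX_PASSES = 5  # five decimal digits, independent of the record count

levels = {n: quicksort_levels(n) for n in (4, 200_000)}
```

For 4 records QuickSort needs fewer levels than RadixSort needs passes; for 200,000 records the relation is reversed.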

When looking for the fastest algorithm for a task we can use the equations in order to estimate the relation between them. If one wants to know from which number of records onwards ExtraDix will be faster than QuickSort when sorting 64-bit integers, one gets
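The relation can be reconstructed as follows. This is a sketch under assumptions that are not spelled out here: QuickSort handles all $$N$$ records on about $$\log_2 N$$ levels at a cost of $$C$$ per record and level, and ExtraDix handles all records once per key byte at a cost of $$F$$ per record and pass, so that a 64-bit key needs 8 passes.

$$T_{QuickSort} = C \cdot N \cdot \log_2 N \qquad T_{ExtraDix} = F \cdot N \cdot 8$$

$$F \cdot N \cdot 8 < C \cdot N \cdot \log_2 N \iff \log_2 N > 8 \cdot \frac{F}{C} \iff N > 2^{8 \cdot F / C}$$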

Under the assumption that C and F have the same value, F / C becomes 1 and the result is that for more than 256 data records ExtraDix will be faster than QuickSort.

Dividing algorithms
The algorithms of this category follow the motto "divide and conquer".

You can learn here why the motto "divide and conquer" brings a benefit when sorting.

As already shown in the runtime-analysis these algorithms can be subdivided into those which calculate the number of recursion levels dynamically and those which work with a fixed number of repetitions.

Dynamic division
The algorithms of this type divide the data into sub-domains based on the number of data records. Most of these algorithms are not stable.

Fixed division
Before starting the sort it is known how many different key values are possible. If sorting by a 4-byte unsigned integer, the values can be between 0 and a bit more than 4 billion. When using RadixSort it is clear that the numbers in decimal notation can have up to ten digits, so we need ten sorting passes. When using ExtraDix four are enough.
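The pass counts can be reproduced with a plain least-significant-digit radix sort that takes the base as a parameter. This Python sketch is not ExtraDix itself; it only assumes that sorting byte by byte (base 256) is what reduces ten passes to four. The names `radix_sort` and `MAX_U32` are hypothetical.

```python
def radix_sort(keys, base, max_key):
    """LSD radix sort for non-negative integers.

    Returns the sorted list and the number of passes over the data.
    """
    passes = 0
    divisor = 1
    while divisor <= max_key:
        buckets = [[] for _ in range(base)]
        for k in keys:
            # Distribute by the current digit; appending keeps the order.
            buckets[(k // divisor) % base].append(k)
        # Collect the buckets in order of the digit value.
        keys = [k for bucket in buckets for k in bucket]
        divisor *= base
        passes += 1
    return keys, passes


MAX_U32 = 2**32 - 1
data = [4_000_000_000, 17, 65_536, 0, 255, 123_456_789]

by_digits, decimal_passes = radix_sort(data, 10, MAX_U32)
by_bytes, byte_passes = radix_sort(data, 256, MAX_U32)
```

The number of passes depends only on the base and on the largest possible key, not on the number of records.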