This is the second in a series about core data structures and algorithms.
Considering how important sorting is to computer science and programming, it’s actually a mystery why more programmers don’t appreciate it.
I think because it’s so fundamental, and it’s built-in so tightly to many of the most important systems in use today — databases being the key example — programmers rarely stop to think about it.
What can sorting do?
Before looking at the various sorting algorithms, let’s take a look at some of the problems sorting can solve.
- The most obvious application is that a sorted data set makes it really fast — as fast as is ever possible — to find the smallest or largest value. Data is often maintained in a sorted state specifically for this reason.
- If your data set is sorted, you can find duplicates in a linear time — simply iterate over the entire set, checking for duplicates as you go.
- A sorted data set means you don’t need a hash table to count duplicates. You can just emit the counts as you walk the data set.
- Of course — and often most importantly — a sorted data set is amenable to a binary search, which is the most important search algorithm there is.
How do we sort?
Sorting algorithms are one of the most vibrant parts of computer science. There isn’t a large number of search algorithms, but there are a large number of sorting algorithms — and the run times vary significantly too.
So if you need to sort a data set, which is a good algorithm to choose? You can start by asking yourself the following questions:
- Is memory a concern? If so an in-place sort like insertion sort, or quicksort, might be best.
- Are you concerned about worst-case runtime? Quicksort is a famous, effective sorting algorithm with average runtime of O(n log(n)), but presented with already sorted data set, its runtime increases to O(n2). In contrast mergesort always runs in O(n log(n)) — but it’s not an in-place sort!
- Do you believe your data is almost sorted? In that case insertion sort runs in nearly linear time, and is often the best choice for small data sets.
- Do you need the sort to be stable? Stability means that elements that are equal before the sort remain in the same order after the sort. Mergesort, insertion sort, and bubble sort (for example) are stable. Quicksort and heap sort are not.
- Do you need to implement it yourself? Perhaps a well-tested, mature library is not available to you. Quicksort can be relatively complex to implement correctly and efficiently, but an insertion sort is easy to implement.
- Do you want to minimize the amount of data moved? Some sorting algorithms are better at this than others.
Sorting is surprisingly dynamic. Adaptive sorting techniques may change the sorting algorithm during the sorting process as the sorted fraction of the data increases. An example of this is Timsort, which is the standard sorting algorithm in Python.
My favourite sorting algorithms
I have three favourites — mergesort, heap sort, and counting sort.
Mergesort is relatively easy to understand and easy to implement — recursive implementations work well. And its runtime is always O(n log(n)). Mergesort is a good approach if you have many already-sorted datasets, it can work very well. Memory usage can be concern however.
Heap sort is my second favourite, due to its elegance. In a heap sort you take one data structure — a min or max heap (depending on the order of sorting desired) and simply insert all your elements. Once all elements have been sorted, you simply remove each element from the heap sequentially and they come out in sorted order.
Granted you have to write the heap data structure, but its use to sort a data set is very appealing.
Both mergesort and heap sort run, on the average, in O(n log(n)), though heap sort has a best case of O(n).
Finally there is a counting sort. While a counting sort is not used much in practice, it can be very powerful if the range of values is known ahead of time.
You start by creating an array, with a slot for every possible value. You can then iterate over the dataset, adding each element into the correct entry. If there is an element, or elements, already in the entry, use a linked list to connect the elements. Finally you iterate over the entire array, emitting values as you go. It’s simple to understand and runs in linear time. A great application is sorting a set of people by age — you are probably safe with a range of 0 to 150!