Earlier posts (in Russian):
- Shannon entropy and data compression. Part 1 (theory)
- Shannon entropy and data compression. Part 2 (example)
- Kolmogorov-Levin structures (theory)
- The notion of reproducible code (theory)
- On specifications and cryptographic hashing. Part 1 (theory)
- On specification and cryptographic hashing. Part 2 (example)
Discussion
From Table 3 in the previous post the following can be seen:
- Machine code is algorithmically simple (it has a relatively low Kolmogorov complexity K), because the program length is fixed and independent of the data size (the image file supplied as input to the program can be very large). When estimating the length of the shortest program implementing the SHA-256 algorithm, the compiler adds a data-independent term C_comp, the overhead of the optimized code executed by the operating system. The Kolmogorov complexity of the program code is therefore low, even though the code itself, as a sequence of symbols, is statistically complex, as evidenced by its relatively high Shannon entropy.
- Note also that the entropy of machine code is higher than that of C code. This is expected, since machine code should have less structural redundancy than a human-readable programming language (a sketch of such a byte-level entropy measurement is given after this list).
- It is also worth noting that the entropy of program code, regardless of the implementation language, never reaches its maximum: a language invariably contains redundancy and ambiguity. Any linguist will tell you this. It was not our goal to prove this in general. Here we can only confirm it based on the results of our study.
- Even a relatively simple algorithm for computing a hash for a data file has both a relatively high Shannon entropy and a relatively low Kolmogorov complexity, and thus passes the design test.
- At the same time, our data isn't classified as designed. This is also expected, as the data in both files represents the results of random processes: paint drips on paper and black-and-white image noise.
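For readers who want to reproduce the kind of entropy figures quoted from Table 3 of the previous post, here is a minimal sketch of a byte-level Shannon entropy estimate (my own illustration, not the exact script used for Table 3; the file names are placeholders):

```python
from collections import Counter
from math import log2

def byte_entropy(data: bytes) -> float:
    """Empirical Shannon entropy of a byte string, in bits per byte (maximum 8)."""
    counts, n = Counter(data), len(data)
    return -sum(c / n * log2(c / n) for c in counts.values())

# Placeholder file names: a C source file and the corresponding compiled machine code
for name in ("sha256.c", "sha256.bin"):
    with open(name, "rb") as f:
        print(name, round(byte_entropy(f.read()), 3), "bits/byte")
```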
Randomness is constantly manifested in nature. Specificity (regularity) is also constantly manifested. However, in naturally generated configurations of matter, randomness and specificity represent the two extremes of a spectrum of the effects of unintelligent causation (see Table 1):
- Descriptions of random, weakly compressible configurations of matter are characterized by high complexity and low specificity;
- Regular structures are characterized by low complexity (they are highly compressible) and high specificity.
In contrast, the signature of specified complexity:
complexity (relatively high Shannon entropy) + specificity (relatively low Kolmogorov complexity)
is a reliable practical classification rule for design.
Both randomness and natural regularity are inert to pragmatics (function). Natural processes are undirected and do not have a pragmatic purpose. Natural selection, which our opponents often refer to as a counter-example to the design detection rule we are discussing here, operates on already existing functional phenotypes. Evolution does not select for a future function. From our study, it is clear that the Shannon information model is not an adequate means to highlight the specificity of biological functions. From the examples we have considered, it is clear that both natural regularity and randomness are killers of functional information. Indeed, it is impossible to encode a complex function using only random and regular strings: the former will exhibit high complexity and low specificity, while the latter are massively redundant, which corresponds to low complexity and high specificity. Functional strings, on the other hand, exhibit high Shannon entropy and high specificity (low Kolmogorov complexity) at the same time. Such structures in practice reliably point to design.
| Configurations of matter | Shannon entropy (description metric) | Kolmogorov complexity (description metric) | Complexity | Specificity | Class |
| --- | --- | --- | --- | --- | --- |
| Liquids, gases | High | High | High | Low | Not a design |
| Graphical, audio, text noise | High | High | High | Low | Not a design |
| Crystals | Low | Low | Low | High | Not a design |
| Interference patterns, convection patterns, regular strings | Low | Low | Low | High | Not a design |
| Literature, technical prose | Medium | Low | Medium | High | Design |
| Byte code (executable files) | High | Low | High | High | Design |
| Protein-coding parts of DNA molecules | High | Low | High | High | Design |
| Functional parts of primary protein structures | High | Low | High | High | Design |
Table 1. Characteristics of configurations of matter
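As a rough sketch of how the two axes of Table 1 might be operationalized (my illustration, not the method used in the study): Shannon entropy can be estimated from symbol frequencies, while Kolmogorov complexity, being uncomputable, can only be bounded from above, for example by the output size of a general-purpose compressor. Note the caveat in the comments: a compressor such as zlib captures only statistical redundancy, so it will not reveal the algorithmic simplicity of, say, the digits of π discussed below.

```python
import zlib
from collections import Counter
from math import log2

def shannon_entropy(data: bytes) -> float:
    """Empirical Shannon entropy, in bits per byte (maximum 8)."""
    counts, n = Counter(data), len(data)
    return -sum(c / n * log2(c / n) for c in counts.values())

def kolmogorov_upper_bound(data: bytes) -> int:
    """Upper bound on K(data), in bytes, obtained with a general-purpose compressor.
    Caveat: zlib exploits only statistical redundancy; it cannot detect deeper
    algorithmic regularity (the digits of pi look incompressible to it even
    though a short program generates them)."""
    return len(zlib.compress(data, 9))

regular = b"AB" * 16                        # a regular, highly compressible string
print(shannon_entropy(regular))             # low entropy: 1.0 bit/byte
print(kolmogorov_upper_bound(regular), "<=", len(regular))  # much smaller than the original
```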
Examples of configurations of matter with high Shannon entropy and low Kolmogorov complexity
1. Irrational numbers
Suppose we received a radio signal from space transmitting the value of π or e to a million decimal places. In this case, as prescribed by our heuristic, we happily infer an intelligent source of the signal, the only option supported by the available empirical evidence and our current knowledge. Indeed, on the one hand, we will not be able to compress the data without loss of information, because irrational numbers have a close-to-random distribution of digit values. On the other hand, the algorithmic complexity of π or e is low: there exist short algorithms that compute them. It is a clear case of design.
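To make the example concrete, here is a minimal sketch (my own illustration, using Gibbons' well-known unbounded spigot algorithm) of a program only a few lines long, i.e. of low Kolmogorov complexity, whose output digit stream has an empirical Shannon entropy close to the maximum of log₂10 ≈ 3.32 bits per digit:

```python
from collections import Counter
from math import log2

def pi_digits():
    """Gibbons' unbounded spigot algorithm: yields the decimal digits of pi one by one."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, t, k, n, l = 10 * q, 10 * (r - n * t), t, k, \
                               (10 * (3 * q + r)) // t - 10 * n, l
        else:
            q, r, t, k, n, l = q * k, (2 * q + r) * l, t * l, k + 1, \
                               (q * (7 * k + 2) + r * l) // (t * l), l + 2

# A very short program, yet its output is statistically close to random noise
digits = [d for _, d in zip(range(2000), pi_digits())]
counts, n = Counter(digits), len(digits)
entropy = -sum(c / n * log2(c / n) for c in counts.values())
print(f"{entropy:.4f} bits/digit (maximum is log2(10) = {log2(10):.4f})")
```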
2. Cellular automata
Cellular automata are known to be a mathematical model of organisms and biological evolution. There are known examples of cellular automata that produce patterns exhibiting both high entropy and low algorithmic complexity. These will be classified by our heuristic as designed.
Consider the two-dimensional pattern generated by the so-called Rule 30, which looks like this:
![a. The first 22 states of the automaton]()
![b. High-entropy pattern of Rule 30]()
![c. An organism with a pigmentation pattern similar to that of Rule 30]()
| i-th state | 111 | 110 | 101 | 100 | 011 | 010 | 001 | 000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (i+1)-th state of the center cell | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 |

d. State transition rules
Fig. 1. Cellular automaton Rule 30. Source: Wikipedia.
Figure 1a-b shows the states of the Rule 30 cellular automaton at successive discrete moments of time i = 0, 1, 2, ... from top to bottom. The states of the automaton are represented by horizontal rows of cells of two colors: 0 is white, 1 is black. The evolution of the automaton begins at time i = 0, when only one cell is black, the rest are white. The program executed by the automaton determines the color of a cell at time i + 1 depending on the color of the cell itself and its two neighbors to the right and left at time i. The rules for changing the color of cells are very simple. They are shown at the bottom of the image on the left, and also as a table in Figure 1d. These rules can be expressed briefly by the formula:
cell_color(i+1) = left_neighbor_cell_color(i) XOR (cell_color(i) OR right_neighbor_cell_color(i)).
The rule is so named because the new states of the center cell, read across the bottom row of the transition table, form the binary number 00011110₂ = 30. Although the program for computing the cell states is short, it eventually produces an image with a very high Shannon entropy (chaos).
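A minimal sketch of the program described above (my illustration, not code from the source): a few lines of Python reproduce the Rule 30 pattern of Fig. 1a-b, which is precisely the point about its low Kolmogorov complexity.

```python
WIDTH, STEPS = 79, 40                      # a small canvas; Fig. 1b uses a much larger one
row = [0] * WIDTH
row[WIDTH // 2] = 1                        # i = 0: a single black cell in the middle

for _ in range(STEPS):
    print("".join("#" if cell else "." for cell in row))
    # Rule 30: new center = left XOR (center OR right); cells beyond the edge count as white
    row = [(row[i - 1] if i > 0 else 0) ^ (row[i] | (row[i + 1] if i < WIDTH - 1 else 0))
           for i in range(WIDTH)]
```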
Biological structures modelled by very simple automata like Rule 30 will be classified as designed even though they may prove reachable by evolution. In these cases, our specified complexity heuristic will generate false positives. However, as we shall see later, the relative number of such errors will be low.
Improving the accuracy of design detection beyond what our heuristic can achieve is possible, for example, by using the algorithmic specified complexity (ASC) measure [W. Ewert, W. Dembski, R. Marks, "Algorithmic Specified Complexity in the Game of Life"]. ASC builds on the simpler information-theoretic ideas presented here.
Importantly, simple cellular automata like Rule 30 and other similar computational models may create a false impression that natural, unguided interactions of matter can generate arbitrarily functionally complex structures. The point is that any cellular automaton, even one implementing mathematically trivial rules, models real phenomena that occur in already established, information-rich contexts enabling data reading and processing, which ultimately points to design. More about this can be found here.
3. Biological evolution
The idea that undirected natural processes, such as biological evolution, can generate structures exhibiting arbitrarily complex functions has many supporters. In particular, it is argued that biological evolution may be biased toward functional phenotypes (such as protein clusters) that have low Kolmogorov complexity, since they may be easier for the evolutionary process to find. Cellular automata, as well as observations of similar patterns in nature (as in Fig. 1c), are often cited as evidence supporting such views. But to what extent do observations actually confirm the ability of evolution to produce complex function? In my view, such assumptions are largely speculative. Even proponents of an exclusively evolutionary origin of proteins acknowledge that nature is a tinkerer, not an inventor [F. Jacob, "Evolution and tinkering", Science 196 (4295): 1161–6]. Here is what we can say:
- As I have already noted, evolutionary selection operates not on the principle of future function, but on the principle of optimizing the reproductive advantage of existing functional phenotypes.
- Function is, in general, non-additive: it is impossible to create one complex function by simply putting together a number of less complex functions. Consequently, functional areas in phase spaces are necessarily isolated islands. Further, the more complex a function is, the smaller and sparser these islands become.
- There is currently no empirical evidence for the ability of non-intelligent, undirected processes to produce statistically significant quantities of functional information. In my opinion, there never will be, since limited probabilistic evolutionary resources only allow for the evolutionary generation of relatively simple functional structures. As far as primary protein structures are concerned, the most optimistic estimate I know of puts the maximum number of states reachable by evolution at 2^140, which is equivalent to at most 140 functional bits, or a functional primary protein structure of at most ⌈140/log₂20⌉ = 33 amino acids (AA); a short check of this arithmetic is given below. The average length of a protein domain (a functional unit of a primary protein structure) ranges, according to various estimates, from 100 to 150 AA, while domains shorter than 40 AA are considered short (short domains are primarily involved in regulatory functions). At the same time, the maximum functional complexity of a number of protein clusters scattered across the search space amounts to tens of kilobytes of functional information (Fig. 2).
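The arithmetic behind the 33-AA figure quoted above, assuming a 20-letter amino-acid alphabet:

```python
from math import ceil, log2

max_functional_bits = 140          # from the quoted optimistic bound of 2**140 reachable states
bits_per_residue = log2(20)        # each amino acid carries at most log2(20) ≈ 4.32 bits
print(ceil(max_functional_bits / bits_per_residue))   # -> 33 amino acids
```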

Fig. 2. Functional information in the primary structures of a group of 35 proteins studied in [K. Durston et al., "Measuring the functional sequence complexity of proteins"]. FSC: functional sequence complexity, the functional complexity of linear amino acid (AA) sequences; fit: functional bit. I have marked the boundary of the evolutionary capability to create functions, the so-called "edge of evolution", with red lines (≤ 140 functional bits, or functional strings of up to 33 AA).
No single classifier in practice is 100% accurate or optimally sensitive. Practical methods of solving detection problems invariably involve errors. Design detection is no exception:
- False positive, Type I error: something that is not a design is classified as a design;
- False negative, Type II error: a design is not detected.
Worse still, improving accuracy is a matter of finding an acceptable compromise:
- Often, as we try to reduce the number of false negatives, the number of false positives increases, and vice versa.
- According to Google Gemini, in popular protein databases such as CATH, ~99% of the functions are provided by domains of 40 AA or longer.
- There are estimates of the sparsity of function in the space of linear protein structures. For example, [D. Axe, "Estimating the prevalence of protein sequences adopting functional enzyme folds"] estimates that, on average, only 1 in every 10^77 sequences is functional.
As stated earlier, if we want to improve design detection accuracy, we have to employ a more sensitive and more involved metric. An example is presented in [W. Ewert, W. Dembski, R. Marks: Algorithmic Specified Complexity in the Game of Life].
In conclusion, it must be noted that design detection is not based on gaps in our knowledge that we are allegedly trying to close by introducing a divine agent. On the contrary, our inference to design from an observed combination of high complexity and high specificity (namely, function) is based on what we do know. The validity of the inference to design from high functional complexity is continually reinforced by empirical evidence, particularly in information technology: any sufficiently complex function designed by humans is generated top-down via intelligent agency, whereas, at this moment, there are no observations of complex function arising non-intelligently, without guidance, purely as a byproduct of chance and necessity. All this means that we can reliably infer design from observations of statistically significant levels of complex function.


