The present page explains some topics, terminology and techniques applied to the Voynich MS text analysis. Some of these terms are used also outside the text analysis section. The representation of the Voynich MS script was explained on the analysis home page.
The first thing any analyst of the Voynich MS is likely to do is to count and make frequency tables of single characters, pairs, etc. and to do the same with the (apparent) words. When doing this, it appears immediately that some statistical properties are quite different for different parts of the MS, and appear to be strongly page-dependent. This was already noticed by Th. Petersen in the 1930's and indicated in his hand transcription of the MS (1).
It was first analysed in some more detail by Prescott Currier in the 1970's. Currier indicated that the Voynich MS appears to have been written in two languages, which he called A and B. He was careful to point out that these are not necessarily different languages, but could be dialects, subject matter or different encryption, in case the MS has been written in a code or cipher. Since Currier also detected two handwriting styles (which he called 1 and 2) and found a perfect correlation: all pages in language A were in hand 1 and language B was in hand 2, he concluded that the MS had to be the work of at least two people. In fact he suggested further hands which he called 3, 4, X and Y, but while the languages A and B and hands 1 and 2 are easy to recognise, the other hands have not become as generally accepted.
Currier presented his findings in detail during a symposium about the Voynich MS which was held on 30 November 1976 and led by Mary D'Imperio. Currier's paper has been converted into electronic form. This HTML version only has the main text. The full paper, with all the long tables, is also available as separate files in PostScript format on a page prepared by J. Reeds and mirrored by J. Stolfi or, converted to PDF, via this link.
Currier found that the Herbal section is a mixture of A-language and B-language pages, and that the distribution of these pages strictly follows the bifolia, i.e. each bifolium is always written entirely in one language. He also saw two different handwriting styles which were fully correlated with the type of language. This has already been shown in a previous page.
The main properties which Currier gave for his two languages are:
The above shows that the Currier language is evident from criteria based on single characters, character groups and whole words.
The Currier languages have at least historical significance. Whether the division into 'A' and 'B' languages is a real dichotomy, or whether there is a smooth transition between the two, will be discussed later. Currier identified the pages in a fully supervised manner, i.e. it was a human decision for each page. Later, we will see methods to classify the Voynich MS text in an unsupervised manner, i.e. the decision is left entirely to a computer algorithm without specific a priori criteria.
The entropy of the language of the Voynich MS was first studied by Bennett (1976) (2). When his estimates turned up rather anomalous values for the Voynich MS text, compared with most European languages (old and new), this became an important topic for subsequent investigations. The meaning of entropy is therefore introduced in some detail here. Note that this is not a very formal mathematical introduction, but mainly one aimed at allowing the reader to understand the various analyses that use it. Bennett refers to Shannon's important paper on "The Mathematical Theory of Communication", and a useful summary for the interested reader may be found here.
Entropy is a quantity that can be interpreted as the amount of 'chaos' or unpredictability, in the sense that lower values of entropy are equivalent to higher amounts of order or predictability. If a string of characters is fully predictable, it carries no information. Once one knows the first character, one can predict all subsequent ones: one knows everything. The entropy of a piece of text is therefore also a measure of the amount of information it carries. The entropy values used in the study of the Voynich MS text are usually expressed in bits (of information).
This is best elaborated using a simple example.
Imagine someone were to create a string of numbers using a die. He would roll the die, write down the top face number, and repeat this process as long as he wanted to. The number that appears each time (at each event) is a piece of information. The amount of information increases as the probability of the event decreases: it is the logarithm of the inverse of the probability. If the die is a perfect 6-face die, then the probability (p) of throwing a '1' (or any other number) is 1/6. The number of bits of information (b) obtained at this event is the base-2 logarithm of 1/p, or:
b = ^{2}log(1/p) = - ^{2}log(p)
On average, the number of bits of information obtained at the appearance of each number (which is the entropy, denoted here by a capital H) is the weighted average for each possible outcome. (Note that in the following the * sign means multiplication.)
H = Sum { p * b } = - Sum { p * ^{2}log(p) }
In our example of a string of text (whether digits or characters) generated by throwing a perfect 6-face die, the entropy on a single-character basis is:
H(char) = - Sum { 1/6 * ^{2}log( 1/6 ) } = ^{2}log ( 6 )
Had the die not been perfect, but weighted, the six probabilities would not all have been equal to 1/6, and the resulting entropy value would have been lower than the above value. (A mathematical proof of this is straightforward but outside the scope of this page). The die is a bit more predictable (has a tendency towards the most probable number) and the entropy is lower.
Had the die not had 6 sides but any number N, the maximum entropy (in case all probabilities are equal) would have equalled
^{2}log ( N )
The main point resulting from the above is that the entropy is a value that can be computed for something which can assume a number of values, and the sum of the probabilities for each of the values is one. One could use the index i to denote each of the values, and p(i) the probability that the 'thing' has value i. The formula for the entropy of this is:
H = - Sum { p(i) * ^{2}log [p(i)] }
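The die example above can be checked numerically. The following is a minimal sketch of the general formula; the function name `entropy` and the weights chosen for the loaded die are assumptions made purely for illustration.

```python
import math

def entropy(probabilities):
    """Entropy in bits: H = -Sum{ p(i) * log2(p(i)) }.

    Outcomes with probability 0 are skipped, since they contribute nothing.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

fair_die = [1 / 6] * 6
# Assumed weights for a loaded die (they sum to 1); illustration only.
weighted_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]

print(round(entropy(fair_die), 3))      # 2.585, i.e. log2(6)
print(round(entropy(weighted_die), 3))  # 2.161, lower than the fair die
```

The fair die reaches the maximum value for six outcomes, while the loaded die, being more predictable, gives a lower value.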
For a piece of text, the single-character entropy can be computed using the probabilities of the occurrence of each of the characters of the alphabet. For a text written in an alphabet of 26 characters, this entropy will be less than ^{2}log(26) or 4.700, since not all 26 characters will appear equally frequently. Of course we do not know the real probabilities; we can only estimate them from the frequencies of occurrence, assuming that these are close to the probabilities. This is reasonable for sufficiently large samples, but not when a text is too short to capture rare characters or combinations of characters properly.
It is therefore important to keep in mind that the single-character entropy computed in the above manner for any piece of text in any language will only be an approximation of the true or inherent single-character entropy of that language. Apart from the above-mentioned approximation, there is also the issue that the entropy will depend on the subject matter and the writing style of the author.
The most important point is that the entropy does not depend on which character set or alphabet is used, but only on the probabilities or frequencies. This means that the entropy of a text does not change if it is subjected to a simple substitution encoding. This is the main reason why the entropy is of such interest for the Voynich MS. The entropy also does not change, by the way, if the text is written backwards.
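As a small illustration of these points, the sketch below estimates the single-character entropy from observed frequencies and compares a plain text with a substituted and a reversed version. The sample sentence, the shift-by-3 substitution and the name `char_entropy` are all assumptions chosen for the example; spaces are counted as characters here, which is only one of several possible conventions.

```python
import math
from collections import Counter

def char_entropy(text):
    """Estimate the single-character entropy (bits) from observed frequencies."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())

plain = "the quick brown fox jumps over the lazy dog"

# Hypothetical simple substitution: every letter is shifted by 3 places,
# spaces are left unchanged.
letters = "abcdefghijklmnopqrstuvwxyz"
table = {c: letters[(i + 3) % 26] for i, c in enumerate(letters)}
cipher = "".join(table.get(c, c) for c in plain)

print(math.isclose(char_entropy(plain), char_entropy(cipher)))       # True
print(math.isclose(char_entropy(plain), char_entropy(plain[::-1])))  # True
```

Both comparisons hold because the substitution and the reversal only relabel or reorder the characters; the set of frequencies stays the same.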
Entropy can also be computed for words rather than single characters. A text written in a vocabulary of 10,000 words will have a word entropy less than ^{2}log(10000) or less than 13.288, depending on the distribution of the word frequencies. It can furthermore be computed for character pairs. Going back to the example of throwing the die, there are 36 possibilities for a pair of throws. The entropy for this (assuming a perfect die) is
^{2}log(36) = 2 * ^{2}log(6)
It is evident that if the occurrence of the two consecutive events is independent, the 'pair' entropy equals twice the 'single event' entropy.
In natural language, however, the occurrence of a character in a text is not independent of the previous character. For example, in English the probability of encountering the character 'u' depends highly on whether the previous character was a 'q' (in which case the probability is essentially 1), another 'u' (in which case it would be very close to zero) or anything else. This introduces the concept of conditional entropy. It can be shown mathematically that the conditional single-character entropy (the entropy of the probability distribution of a single character, given that the preceding one is known) equals the difference between the character pair (=digraph) entropy and the single-character entropy. This conditional character entropy cannot exceed the 'normal' character entropy, and for natural language it is clearly lower.
A word on terminology: single-character entropy is sometimes called first-order entropy. Character-pair entropy is sometimes called second-order entropy, while the conditional single-character entropy is also sometimes called second-order entropy. The values given for these quantities should remove any doubt about what is meant, since the conditional second-order entropy is always less than the single-character entropy, while the character pair entropy is always greater than the single-character entropy.
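The relation between these three quantities can be sketched as follows. This is a minimal illustration, not the procedure of any particular study: the function names are assumptions, and the single-character entropy is deliberately estimated from the first character of each pair, so that the difference of the two estimates is a proper conditional entropy and never negative.

```python
import math
from collections import Counter

def entropy_of_counts(counts):
    """Entropy (bits) of the frequency distribution held in a Counter."""
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def entropies(text):
    """Return (single-character, character-pair, conditional) entropy in bits."""
    pairs = Counter(zip(text, text[1:]))  # all adjacent character pairs
    firsts = Counter(text[:-1])           # first character of every pair
    h1 = entropy_of_counts(firsts)
    h2 = entropy_of_counts(pairs)
    return h1, h2, h2 - h1                # conditional = pair - single

# A strictly alternating string: once a character is known, the next one
# is certain, so the conditional entropy is 0.
h1, h2, hc = entropies("abababab")
print(hc)  # 0.0
```

For real text the conditional value lies between 0 and the single-character value, and the pair value lies above the single-character value, which is how the two meanings of "second-order entropy" can be told apart.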
Zipf's law (strictly the first Zipf law) concerns the frequency of words in a piece of text. If one orders the words by decreasing frequency, i.e. labels the most frequent word as nr. 1, the second most frequent word as nr. 2, etc., and then plots the frequency of each word against its rank, the result should be a straight line with a slope of -1 if both scales are logarithmic.
This general statement requires some elaboration:
The straight line with slope -1 in a double-logarithmic scale means that the probability for the item ranked at nr. i equals:
p(i) = C / i
where C is a constant depending on the number of items N, and it is defined by the fact that the sum of all probabilities has to equal 1. Thus, if a quantity can assume a well-defined number of values, and it strictly obeys Zipf's law, its entropy can be predicted exactly.
Following is a table which illustrates this. The first column gives the number of possible values. The second gives the maximum entropy, if all probabilities are equal. The third column gives the entropy if the quantity exactly obeys Zipf's law. For example, the value 26 represents the number of letters in the (Latin) alphabet. If they are all equally frequent (which they are not), the character entropy would equal 4.700. If they exactly followed Zipf's law (which is also not true, but certainly closer to the truth), the character entropy would equal 3.929. The table has been set up for reasonable values of alphabet size, number of digraphs and number of words in a text.
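| Number | H(max) | H(Zipf) |
|--------|--------|---------|
| 16 | 4.000 | 3.403 |
| 17 | 4.087 | 3.470 |
| 18 | 4.170 | 3.532 |
| 19 | 4.248 | 3.591 |
| 20 | 4.322 | 3.647 |
| 21 | 4.392 | 3.700 |
| 22 | 4.459 | 3.750 |
| 23 | 4.524 | 3.798 |
| 24 | 4.585 | 3.844 |
| 25 | 4.644 | 3.887 |
| 26 | 4.700 | 3.929 |
| 27 | 4.755 | 3.969 |
| 28 | 4.807 | 4.008 |
| 29 | 4.858 | 4.045 |
| 30 | 4.907 | 4.081 |
| 31 | 4.954 | 4.116 |
| 32 | 5.000 | 4.149 |
| 33 | 5.044 | 4.181 |
| 34 | 5.087 | 4.213 |
| 35 | 5.129 | 4.243 |
| 36 | 5.170 | 4.273 |
| 37 | 5.209 | 4.301 |
| 100 | 6.644 | 5.310 |
| 200 | 7.644 | 5.986 |
| 500 | 8.966 | 6.851 |
| 1000 | 9.966 | 7.489 |
| 2000 | 10.966 | 8.115 |
| 5000 | 12.288 | 8.927 |
| 10,000 | 13.288 | 9.532 |
| 20,000 | 14.288 | 10.130 |
| 50,000 | 15.610 | 10.911 |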
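The entries of such a table follow directly from p(i) = C / i. The sketch below shows one way to compute them; the function name `zipf_entropy` is an assumption for this example, and C is obtained from the requirement that the probabilities sum to 1.

```python
import math

def zipf_entropy(n):
    """Entropy (bits) of n values that exactly obey p(i) = C / i."""
    # C follows from Sum{ C / i } = 1 over i = 1..n.
    c = 1.0 / sum(1.0 / i for i in range(1, n + 1))
    return -sum((c / i) * math.log2(c / i) for i in range(1, n + 1))

n = 26
print(f"{n} {math.log2(n):.3f} {zipf_entropy(n):.3f}")  # 26 4.700 3.929
```

The printed line corresponds to the row for 26 values; other rows can be obtained by changing `n`.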
Cluster analyses have been applied in order to find out more about the Currier languages. Typically, this method requires that for each page of the MS a quantitative attribute is found, which may consist of one number, but also a set of numbers. Next, it requires the definition of a 'distance' function which takes any pair of attributes and computes a distance value which should be low if the attributes are similar and high if they are dissimilar. Such quantitative values and their distances can be based on single characters, digraphs, words or any other property.
The complete set of distances then consists of a square matrix, where point (a,b) represents the distance between the properties of pages a and b. For usual and meaningful definitions of this distance, the value for (a,a) will be zero (since the properties are equal) and the value for (a,b) will be the same as for (b,a). The matrix will be symmetrical with respect to its main diagonal.
The final, and sometimes difficult task is then to decide, on the basis of the square matrix of distance values, which pages are similar (for example: written in the same language or on the same topic) and which are not; into how many groups or 'clusters' the whole set of pages can be subdivided. This process may be done in a supervised or unsupervised manner.
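The first two steps, an attribute per page and a distance between attributes, can be sketched as follows. Everything specific here is an assumption made for illustration: relative character frequencies as the page attribute, the Euclidean distance as the distance function, and three short made-up "pages" in a transliteration-like alphabet. It is not the method of any particular published analysis.

```python
import math
from collections import Counter

def char_profile(text, alphabet):
    """Attribute of one page: relative frequency of each character."""
    counts = Counter(c for c in text if c in alphabet)
    total = sum(counts.values()) or 1  # avoid division by zero on empty pages
    return [counts[c] / total for c in alphabet]

def distance(u, v):
    """Euclidean distance between two attribute vectors (assumed choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(pages):
    """Square matrix with the distance between every pair of pages."""
    alphabet = sorted(set("".join(pages)) - {" "})
    profiles = [char_profile(p, alphabet) for p in pages]
    return [[distance(p, q) for q in profiles] for p in profiles]

# Made-up sample pages, for illustration only.
pages = ["daiin daiin chol", "chol daiin chor", "qokeedy shedy qokedy"]
m = distance_matrix(pages)
print(m[0][0], m[0][1] == m[1][0], m[0][1] < m[0][2])  # 0.0 True True
```

The matrix has zeros on its main diagonal and is symmetrical, and the first two sample pages come out closer to each other than to the third. Deciding how many clusters such a matrix supports is the separate, harder step described above.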
Many more tools have been applied since the MS has been available in machine-readable format (i.e. transcribed). Some of the more general ones are briefly described in the following. Furthermore, many people have introduced their own techniques which are described in their papers.
Hidden Markov Modelling (HMM) has been applied to the Voynich MS text, among other things to autonomously subdivide the character set of a text into several groups that appear 'common', by analysing the transition probabilities between characters. It is typically capable of separating the characters of a text into vowels and consonants, but with the Voynich MS text the result is rather different. It is treated rather briefly in section 5: Sentences, paragraphs, sections. This method tends to run fully unsupervised.
The Russian mathematician Sukhotin devised several analysis techniques which have primarily been published in Russian. In the early 1990's they were translated by Jacques Guy. Best known is his technique to identify vowels and consonants in a text, which has also been applied by Jacques Guy. It is discussed in section 2: Character statistics.
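For orientation, the vowel identification technique can be sketched as below, following the way the algorithm is commonly described: it assumes that vowels tend to be adjacent to consonants rather than to other vowels. The function name and the sample words are assumptions for this example, and the details may differ from the implementation discussed in section 2.

```python
from collections import defaultdict

def sukhotin_vowels(words):
    """Return the characters classified as vowels, in order of selection."""
    # Symmetric adjacency counts between different characters; pairs of
    # identical characters are ignored (the diagonal stays zero).
    adj = defaultdict(lambda: defaultdict(int))
    for w in words:
        for a, b in zip(w, w[1:]):
            if a != b:
                adj[a][b] += 1
                adj[b][a] += 1
    # Every character starts as a consonant with the sum of its row.
    sums = {c: sum(adj[c].values()) for c in adj}
    vowels = []
    while sums:
        best = max(sums, key=sums.get)
        if sums[best] <= 0:
            break  # no remaining character looks like a vowel
        vowels.append(best)
        del sums[best]
        # Contacts with the new vowel now count against the others.
        for c in sums:
            sums[c] -= 2 * adj[c][best]
    return vowels

print(sukhotin_vowels(["banana", "panama", "canada"]))  # ['a']
```

With so little text the outcome is only a toy result; on short or unusual samples the classification can easily go wrong.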
This technique compares 'local' or 'regional' statistics over different ranges of the MS, and detects patterns typical of meaningful text. In several of these techniques, the text properties or statistics of the original text are compared with those of a scrambled version of the text (all words scrambled arbitrarily), and the difference is examined. While these techniques seem to work, precisely what they show, or why, may not always be completely clear. Some results are given in section 5: Sentences, paragraphs, sections.