This page investigates the amount of information contained in each Voynich 'word', and compares it with that of texts in other (known) languages. Despite the fact that the average Voynich word is shorter than the average Latin or English word, and that the predictability of single characters in Voynich words is higher than in normal languages (since 'Voynichese' has a lower unconditional and conditional single-character entropy), it appears that the Voynich vocabulary is as diverse as that of the investigated comparison texts. This means that Voynichese is much more economical in its use of characters, or that in fact Latin (the language mainly used in the comparisons) is more 'wasteful'.
No hard conclusions about the nature of the Voynich MS language can be drawn on the basis of the statistics presented in this article alone, but either supporting or opposing evidence for many of the prevalent hypotheses about the nature of 'Voynichese' may be found. For example, it will be shown that there is no particular reason to assume that the spaces in the Voynich MS text are anything other than normal word spaces, and the Voynich words appear to be normal words, not syllables. The comparison with a Chinese text in the Pinyin transliteration system shows a vast difference, far bigger than with normal Latin, while a comparison with an unfortunately short sample of text in Dalgarno's artificial language (designed in 1640, TBC) shows a surprisingly good match.
Both the single-character entropies and the digraph entropies or conditional single-character entropies of Voynichese are lower than those of, say, Latin or English (see e.g. a paper by Dennis Stallings). Also, the words in Voynichese tend to be relatively short (see Gabriel Landini's article about the application of Zipf's laws to the MS - reference to be added). Thus one would expect Voynichese words to be more restricted or less diverse than words in Latin. This word diversity can be measured either by counting the number of different words (tokens) for texts of various lengths, or by computing the single word entropy from the word frequency distribution. Both statistics have some shortcomings: the number of tokens is affected by spelling or transcription errors, and the single word entropy can only be estimated from very long texts. Both statistics will be computed for texts in Voynichese and in other languages, using samples of the same length (counted in number of words) to minimise these problems.
It was already reported in the past that a large transcribed section of Voynichese had a single word entropy of 10 bits, just like normal language. Here an apparent contradiction appears. If Voynichese uses fewer characters with restricted variability to form a sufficient number of words, then the other languages must be wasteful. This 'waste of information' will be investigated below.
The appearance of a particular token at a certain point in a text is an event with a probability 0 < p(tok) < 1. The amount of information gained if this token occurs (expressed in bits) equals
b(tok) = - 2log (p(tok))
Here "2log" denotes the logarithm of base 2.
Taking the average number of bits of information for all words (weighted average of all tokens) results in the standard formula for entropy. Using HW as the symbol for the single word entropy:
HW = SUM p(tok) * b(tok) = - SUM p(tok) 2log (p(tok))
Summation is over all tokens.
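The formula for HW can be sketched in a few lines of Python. This is only an illustration: the function name `word_entropy` and the toy sample are assumptions made here, and probabilities are estimated simply as relative word frequencies.

```python
import math
from collections import Counter

def word_entropy(words):
    """Estimate the single word entropy HW (in bits) of a list of words."""
    counts = Counter(words)
    total = sum(counts.values())
    # HW = - SUM p(tok) * 2log(p(tok)), summed over all distinct tokens,
    # with p(tok) estimated as the relative frequency of the token.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Toy sample (an assumption for illustration, not one of the texts studied).
sample = "the cat and the dog and the bird".split()
print("distinct tokens:", len(set(sample)))
print("HW estimate    :", round(word_entropy(sample), 4))
```

For a real text, `words` would be the full list of words of the sample; the estimate is only meaningful for long samples, as noted above.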
The bits of information can be considered to be distributed over all characters of the word. If we take the appearance of the English token 'the' at word i of a text, the probability p('the') can be (trivially) calculated as:
p(:the_) = p(:t) * p(:th)/p(:t) * p(:the)/p(:th) * p(:the_)/p(:the)
where the colon indicates a word start (which will be omitted in the future) and the underscore denotes a space (the word end).
The ratios in this product give the conditional probabilities counting from the start of the word.
p(the_) = p(t) * p(h|t) * p(e|th) * p(_|the)
Taking the negative 2log of the previous expression (so that each probability turns into its number of bits) gives:
b(the_) = b(t) + b(th)-b(t) + b(the)-b(th) + b(the_)-b(the)
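As a small numerical check of this decomposition, the sketch below estimates the prefix probabilities for 'the_' from a toy word list and confirms that the per-character contributions add up to b(the_). The word list and the helper names `prefix_prob` and `bits` are assumptions made for illustration only.

```python
import math

def prefix_prob(words, prefix):
    """Fraction of words whose form with a trailing '_' (word end) starts with prefix."""
    return sum((w + "_").startswith(prefix) for w in words) / len(words)

def bits(p):
    """Information in bits: b = -2log(p)."""
    return -math.log2(p)

# Toy word list (hypothetical, not taken from any of the texts studied).
words = "the then that this the a the there to the".split()
prefixes = ["t", "th", "the", "the_"]
b = [bits(prefix_prob(words, pre)) for pre in prefixes]

# Bits added by each successive character: b(t), b(th)-b(t), b(the)-b(th), ...
increments = [b[0]] + [b[i] - b[i - 1] for i in range(1, len(b))]
for pre, inc in zip(prefixes, increments):
    print(f"{pre!r:7} adds {inc:.3f} bits")
print("sum of increments:", round(sum(increments), 6))
print("b(the_)          :", round(b[-1], 6))
```

The two printed totals agree because the intermediate terms cancel in pairs, which is all the "trivial" identity says.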
It is more interesting to take the average of all words, and this is achieved by replacing the number of bits for one token by the corresponding entropy values.
Source texts in various languages will be analysed by computing entropy values for word-initial sequences of various lengths (I am restricting to 1-, 2-, 3- and 4-character sequences) and the full word entropy. These values will be denoted by Hw1, Hw2, Hw3, Hw4 and HW respectively. The conditional entropies are denoted as hw2, hw3, hw4 and hW, and each is computed as the difference between two successive absolute entropies (hw2 = Hw2 - Hw1, hw3 = Hw3 - Hw2, hw4 = Hw4 - Hw3, hW = HW - Hw4). Thus, the average number of bits per word and the bits per character satisfy the following, again trivial, relationship:
HW = Hw1 + hw2 + hw3 + hw4 + hW
Obviously, the meaning of hW is the combined information contained in all characters after the fourth.
One practical consideration is how to combine the statistics of words of different lengths. All words are considered to end with an infinite number of trailing spaces. The first space at the end of a word (e.g. the underscore used above, in 'the_') is a significant character. It tells the reader that the word will not be 'there' or 'their'. All remaining spaces contain no further information, as the probability of a space following a space equals one, and its contribution to the entropy zero.
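The padding convention and the entropy bookkeeping can be sketched as follows. This is a minimal illustration, not the program used for the results on this page: the function names and the toy sample are assumptions, and each word is padded with as many spaces as the longest prefix considered, which has the same effect as an infinite number of trailing spaces.

```python
import math
from collections import Counter

def entropy(items):
    """Entropy in bits of the frequency distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def word_entropy_profile(words, max_n=4):
    """Return ([Hw1, ..., Hw<max_n>, HW], [Hw1, hw2, ..., hw<max_n>, hW])."""
    # The first trailing space marks the word end; further spaces are
    # certain to follow and therefore add no information.
    padded = [w + " " * max_n for w in words]
    absolute = [entropy(p[:n] for p in padded) for n in range(1, max_n + 1)]
    absolute.append(entropy(words))  # full word entropy HW
    # Conditional entropies are differences of successive absolute entropies.
    conditional = [absolute[0]] + [
        absolute[i] - absolute[i - 1] for i in range(1, len(absolute))
    ]
    return absolute, conditional

# Toy sample (hypothetical); a real run would use thousands of words.
sample = "the there their then to a an and the the".split()
absolute, conditional = word_entropy_profile(sample)
print("Hw1..Hw4, HW      :", [round(x, 3) for x in absolute])
print("Hw1, hw2..hw4, hW :", [round(x, 3) for x in conditional])
```

By construction the conditional values sum to HW, and the absolute values never decrease, since a longer word-initial sequence always determines the shorter one.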
Tests were run for both short and long texts. The need for using long texts is clear, but only short texts are available for various sections of the Voynich MS.
Number of tokens and entropy values are plotted below, as a function of text length. As for full words, the number of tokens can also be computed for word-initial n-graphs. The following colour table applies for the short texts:
| Colour | Text sample |
|---|---|
| red | 24 pages of herbal-A, in Curva |
| blue | 24 pages of herbal-B, in Curva |
| pink | Same herbal-A sample, in EVA |
| cyan | Same herbal-B sample, in EVA |
| green | Genesis (Vulgate) |
| grey | De Bello Gallico (Latin) |
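Curves of the number of tokens against text length, for full words as well as for word-initial n-graphs, could be tallied along the following lines. This is a sketch under assumptions: the function name `distinct_counts`, its `step` parameter and the toy word list are hypothetical, and plotting is left out.

```python
def distinct_counts(words, step, n=None):
    """Count distinct items among the first k words, for k = step, 2*step, ...

    With n=None full words are counted; otherwise only the first n
    characters of each word, padded with spaces, are used.
    """
    seen = set()
    curve = []
    for i, w in enumerate(words, start=1):
        seen.add(w if n is None else (w + " " * n)[:n])
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

# Toy word list (hypothetical); real samples would be read from a transcription.
words = "the then to a an the and to then there".split()
print("full words   :", distinct_counts(words, step=5))
print("first 2 chars:", distinct_counts(words, step=5, n=2))
```

Running the same tally on samples cut to the same number of words is what makes the counts for different texts comparable.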
Some initial observations, before a more complete discussion in the section 'Conclusions' below:
It is necessary to confirm the above observations by using longer text samples. The longest consistent part in the Voynich MS is formed by the stars (recipes) section in Quire 20. It is compared (entropy only) with the following other texts:
| Colour | Text sample |
|---|---|
| green | Genesis chapters 1-25 (Vulgate) |
| grey | De Bello Gallico (Latin) |
| blue | Stars section of the VMs, in Curva |
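Finally, the following two tables give the numerical values from the above graphs for a text of 8000 words. The first lists the absolute entropies of the first 1 to 4 characters and of the full word; the second lists the corresponding conditional entropies per character position: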
| Entropy (bits) of the first n characters | n = 1 (Hw1) | n = 2 (Hw2) | n = 3 (Hw3) | n = 4 (Hw4) | All (HW) |
|---|---|---|---|---|---|
| De Bello Gallico | 4.0891 | 6.3754 | 8.0874 | 8.8913 | 10.1625 |
| Genesis | 3.9840 | 6.2321 | 7.8513 | 8.5048 | 9.3313 |
| VMs stars | 3.2321 | 5.2267 | 7.2669 | 8.8186 | 9.9265 |
| Conditional entropy (bits) per character | 1st (Hw1) | 2nd (hw2) | 3rd (hw3) | 4th (hw4) | Rest (hW) |
|---|---|---|---|---|---|
| De Bello Gallico | 4.0891 | 2.2863 | 1.7124 | 0.8039 | 1.2712 |
| Genesis | 3.9840 | 2.2481 | 1.6192 | 0.6535 | 0.8265 |
| VMs stars | 3.2321 | 1.9946 | 2.0402 | 1.5517 | 1.1079 |
The above shows that:
The last point above is a remarkable feat for a medieval text. William Friedman thought that the Voynich MS could be in some early form of constructed language, and the above observations are quite compatible with this hypothesis.
Also, due to the lack of success in comparing Voynichese with 'normal' European languages, there have been suggestions that the Voynich MS may be some alphabetic rendering of a syllabic language or writing system.
Both suggestions have chronological difficulties, but cannot be rejected out of hand. Sample texts for both are available, both relatively short. These are compared with Voynichese below, together with two other unusual texts which were obtained by modifying the Vulgate Genesis or De Bello Gallico using word games explained at Stolfi's Web Pages.
| Colour | Text sample |
|---|---|
| red | 24 pages of herbal-A, in Curva |
| blue | 24 pages of herbal-B, in Curva |
| yellow | Chinese sample in pinyin |
| cyan | The Lord's prayer in Dalgarno's language |
| green | Genesis modified with Gabriel Landini's dain daiin substitution |
| grey | De Bello Gallico modified with Rene Zandbergen's word scrambling substitution |
The Chinese sample has a value of Hw1 slightly above 4, which is the same as in Latin, and significantly higher than Voynichese. It has a value of HW which is significantly lower than that of Voynichese and Latin, showing that the language/script uses fewer tokens. Thus, the idea that Voynichese is a written version of a syllabic script is rejected under this simple assumption, and a more complicated method of converting the syllabic script to Voynichese will be required in order to maintain this hypothesis. The statistics for Caesar's Latin shown above indicate that it will in fact be easier to convert Latin to Voynichese than to do this with Chinese.
The artificial language is the only text sample that exhibits values of Hw1 and hw2 similar to Voynichese. After that, the language suffers from a shortage of tokens even more severe than the Pinyin sample. Admittedly, the contents of the sample are likely to be partly responsible for this feature, but it is to be questioned whether any invented language will have a sufficiently rich vocabulary, and the same question may be asked for the vocabulary of glossolalia.
The modified Latin texts were not offered as realistic potential explanations for the anomalies in Voynichese. The 'dain daiin' substitution makes the word entropy collapse to below the level of Hw2 of Latin. The word scrambling technique is closest to Voynichese overall, but its hw2 is still significantly higher.
No definite conclusions can be drawn, and if certain hypotheses about the nature of the Voynich MS language seem to be contradicted, it may be possible to find more elaborate ways to match Voynichese with syllabic writing, artificial languages, glossolalia or a word game.
Still, the following has been found:
Most of the text samples used in this study were prepared by Jorge Stolfi. The sample of Dalgarno's language was prepared by Adam McLean.
Copyright René Zandbergen, 2011