From digraph entropy to word entropy in the Voynich MS

Abstract

This page investigates the amount of information contained in each Voynich 'word', and compares it with that of texts in other (known) languages. Despite the fact that the average Voynich word is shorter than the average Latin or English word, and that the predictability of single characters in Voynich words is higher than in normal languages (since 'Voynichese' has a lower unconditional and conditional single-character entropy), it appears that the Voynich vocabulary is as diverse as that of the investigated comparison texts. This means that Voynichese is much more economical in its use of characters, or that in fact Latin (the language mainly used in the comparisons) is more 'wasteful'.

No hard conclusions about the nature of the Voynich MS language can be drawn on the basis of the statistics presented in this article alone, but either supporting or opposing evidence for many of the prevalent hypotheses about the nature of 'Voynichese' may be found. For example, it will be shown that there is no particular reason to assume that the spaces in the Voynich MS text are anything other than normal word spaces, and the Voynich words appear to be normal words, not syllables. The comparison with a Chinese text in the Pinyin transliteration system shows a vast difference, far bigger than with normal Latin, while a comparison with an unfortunately short sample of text in Dalgarno's artificial language (designed in 1640, TBC) shows a surprisingly good match.

Introduction

Both single character entropies and digraph entropies or conditional single-character entropies of Voynichese are lower than those of, say, Latin or English (see e.g. a paper by Dennis Stallings). Also, the words in Voynichese tend to be relatively short (see Gabriel Landini's article about the application of Zipf's laws to the MS - reference to be added). Thus one would expect Voynichese words to be more restricted or less diverse than words in Latin. This word diversity can be measured either by counting the number of different words (tokens) for texts of various lengths, or by computing the single word entropy from the word frequency distribution. Both statistics have some shortcomings: the number of tokens is affected by spelling or transcription errors, and the single word entropy can only be estimated from very long texts. Both statistics will be computed for texts in Voynichese and in other languages, using samples of the same length (counted in number of words) to minimise these problems.
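The first of these two statistics, the number of different words as a function of text length, can be sketched as follows. This is only a minimal illustration and not the program used for this page: the function name vocabulary_growth and the step parameter are assumptions, and the sample is the opening of the Vulgate Genesis in lower case.

```python
def vocabulary_growth(words, step):
    """Return (n, distinct) pairs: the number of different words among the first n words."""
    seen = set()
    growth = []
    for i, word in enumerate(words, start=1):
        seen.add(word)
        if i % step == 0:
            growth.append((i, len(seen)))
    return growth


# Opening of the Vulgate Genesis, lower-cased and split on spaces (illustrative sample only)
sample = ("in principio creavit deus caelum et terram "
          "terra autem erat inanis et vacua").split()

print(vocabulary_growth(sample, 4))  # -> [(4, 4), (8, 8), (12, 11)]
```

Running the same function on samples of equal length from different texts gives comparable curves. A spelling or transcription variant is counted as a new word, which is the shortcoming mentioned above.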

It was already reported in the past that a large transcribed section of Voynichese had a single word entropy of 10 bits, just like normal language. Here an apparent contradiction appears. If Voynichese uses fewer characters with restricted variability to form a sufficient number of words, then the other languages must be wasteful. This 'waste of information' will be investigated below.

Short description of the numerical analyses

The appearance of a particular token at a certain point in a text is an event with a probability 0 < p(tok) < 1. The amount of information gained if this token occurs (expressed in bits) equals:

b(tok) = - 2log (p(tok))

Here "2log" denotes the logarithm of base 2.

Taking the average number of bits of information for all words (weighted average of all tokens) results in the standard formula for entropy. Using HW as the symbol for the single word entropy:

HW = SUM p(tok) * b(tok) = - SUM p(tok) 2log (p(tok))

Summation is over all tokens.
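As a minimal sketch of this formula (the function name word_entropy is an assumption, and the probabilities are simply estimated as relative frequencies in the sample), the single word entropy can be computed like this:

```python
from collections import Counter
from math import log2


def word_entropy(words):
    """Single word entropy in bits, with p(tok) estimated as a relative frequency."""
    counts = Counter(words)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Four different, equally frequent words carry 2 bits each
print(word_entropy(["qokeedy", "daiin", "chedy", "ol"]))

# 1024 different, equally frequent words would carry 10 bits each
print(word_entropy([str(i) for i in range(1024)]))
```

The second call only illustrates the scale of the numbers: a word entropy of 10 bits corresponds to a choice among 2^10 = 1024 equally likely words. Real word frequencies are far from uniform, so a real text needs many more different words to reach the same value.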

The bits of information can be considered to be distributed over all characters of the word. If we take the appearance of the English token 'the' at word i of a text, the probability p('the') can be (trivially) calculated as:

p(:the_) = p(:t) * p(:th)/p(:t) * p(:the)/p(:th) * p(:the_)/p(:the)

where the colon indicates a word start (which will be omitted in the future) and the underscore denotes a space (the word end).

The ratios in this product give the conditional probabilities counting from the start of the word.

p(the_) = p(t) * p(h|t) * p(e|th) * p(_|the)

Taking the negative 2log of the previous expression, so that each probability becomes a number of bits, gives:

b(the_) = b(t) + b(th)-b(t) + b(the)-b(th) + b(the_)-b(the)
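The decomposition can be checked numerically on a toy word list. The sketch below is purely illustrative: the word list is invented, and the helper names prefix_prob and bits are assumptions.

```python
from math import log2, isclose


def prefix_prob(words, prefix):
    """Fraction of words that start with `prefix`; '_' marks the word end."""
    return sum((w + "_").startswith(prefix) for w in words) / len(words)


def bits(p):
    """Information in bits of an event with probability p."""
    return -log2(p)


# Invented toy sample, only for illustration
words = ["the", "the", "then", "this", "to", "a", "the", "there"]

prefixes = ["t", "th", "the", "the_"]
probs = [prefix_prob(words, p) for p in prefixes]

# p(t), p(h|t), p(e|th), p(_|the) as ratios of successive prefix probabilities
conditionals = [probs[0]] + [probs[i] / probs[i - 1] for i in range(1, len(probs))]

product = 1.0
for c in conditionals:
    product *= c

# Bits contributed by each character, counted from the start of the word
increments = [bits(c) for c in conditionals]

print(isclose(product, prefix_prob(words, "the_")))      # product of conditionals equals p(the_)
print(isclose(sum(increments), bits(probs[-1])))         # per-character bits add up to b(the_)
```

In this toy sample three of the eight words are 'the', so both checks come down to p(the_) = 3/8.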

It is more interesting to take the average of all words, and this is achieved by replacing the number of bits for one token by the corresponding entropy values.

Source texts in various languages will be analysed by computing entropy values for word-initial sequences of various lengths (restricted here to 1-, 2-, 3- and 4-character sequences) and the full word entropy. These values will be denoted by Hw1, Hw2, Hw3, Hw4 and HW respectively. The conditional entropies are denoted as hw2, hw3, hw4 and hW, and each is computed as the difference between two successive absolute entropies (for example hw2 = Hw2 - Hw1 and hW = HW - Hw4). Thus, the average numbers of bits per word and per character satisfy the following, again trivial, relationship:

HW = Hw1 + hw2 + hw3 + hw4 + hW

Obviously, the meaning of hW is the combined information contained in all characters after the fourth.

One practical consideration is how to combine the statistics of words of different lengths. All words are considered to end with an infinite number of trailing spaces. The first space at the end of a word (e.g. the underscore used above, in 'the_') is a significant character. It tells the reader that the word will not be 'there' or 'their'. All remaining spaces contain no further information, as the probability of a space following a space equals one, and its contribution to the entropy zero.
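The padding convention and the resulting entropy values can be sketched as below. This is a minimal illustration under stated assumptions, not the program used for this page: the function names are hypothetical, a plain space is used as the padding character, and probabilities are estimated as relative frequencies.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Entropy in bits of the relative frequency distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def word_initial_entropies(words, max_len=4):
    """Return ([Hw1..Hw4, HW], [Hw1, hw2, hw3, hw4, hW]) for a list of words."""
    # Pad every word with trailing spaces: the first space marks the word end,
    # any further spaces are certain and therefore add no information.
    absolute = [
        entropy([(w + " " * max_len)[:n] for w in words])
        for n in range(1, max_len + 1)
    ]
    cumulative = absolute + [entropy(words)]
    # Each conditional entropy is the difference between two successive absolute ones
    conditional = [cumulative[0]] + [
        b - a for a, b in zip(cumulative, cumulative[1:])
    ]
    return cumulative, conditional


cumulative, conditional = word_initial_entropies(["the", "then", "to", "a"])
print(cumulative)
print(conditional)
```

In this four-word example 'the' and 'then' are only told apart at the fourth position, where one has the word-end space and the other has 'n'. By construction the conditional values sum to HW, which is the relationship given above.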

Results

Tests were run for both short and long texts. The need for using long texts is clear, but only short texts are available for various sections of the Voynich MS.

The number of tokens and the entropy values are plotted below, as a function of text length. Just as for full words, the number of tokens can also be computed for word-initial n-graphs. The following colour table applies for the short texts:

Table 1: legend to the following set of Figures
Colour   Text
red      24 pages of herbal-A, in Curva
blue     24 pages of herbal-B, in Curva
pink     Same herbal-A sample, in EVA
cyan     Same herbal-B sample, in EVA
green    Genesis (Vulgate)
grey     De Bello Gallico (Latin)

[Figure: Counts (number of tokens) as a function of text length]

[Figure: Entropy as a function of text length]

Some initial observations, before a more complete discussion in the section 'Conclusions' below:

It is necessary to confirm the above observations by using longer text samples. The longest consistent part in the Voynich MS is formed by the stars (recipes) section in Quire 20. It is compared (entropy only) with the following other texts:

Table 2: legend to the following set of Figures
Colour   Text
green    Genesis chapters 1-25 (Vulgate)
grey     De Bello Gallico (Latin)
blue     Stars section of the VMs, in Curva

[Figure: Entropy for longer text samples]

Finally, two tables which give the numerical values from the above graphs for a text with 8000 words:

Table 3: Cumulative bits of information
First characters:    1        2        3        4        All
De Bello Gallico     4.0891   6.3754   8.0874   8.8913   10.1625
Genesis              3.9840   6.2321   7.8513   8.5048    9.3313
VMs stars            3.2321   5.2267   7.2669   8.8186    9.9265

Table 4: Bits of information per character
Character:           1st      2nd      3rd      4th      Rest
De Bello Gallico     4.0891   2.2863   1.7124   0.8039   1.2712
Genesis              3.9840   2.2481   1.6192   0.6535   0.8265
VMs stars            3.2321   1.9946   2.0402   1.5517   1.1079
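The two tables are related in the way described earlier: each per-character value is the difference between two successive cumulative values. The short sketch below copies the numbers from both tables and checks this. The variable names are assumptions made for the illustration.

```python
# Cumulative bits for the first 1, 2, 3, 4 and all characters (from Table 3)
cumulative = {
    "De Bello Gallico": [4.0891, 6.3754, 8.0874, 8.8913, 10.1625],
    "Genesis":          [3.9840, 6.2321, 7.8513, 8.5048, 9.3313],
    "VMs stars":        [3.2321, 5.2267, 7.2669, 8.8186, 9.9265],
}

# Bits for the 1st, 2nd, 3rd, 4th and remaining characters (from Table 4)
per_character = {
    "De Bello Gallico": [4.0891, 2.2863, 1.7124, 0.8039, 1.2712],
    "Genesis":          [3.9840, 2.2481, 1.6192, 0.6535, 0.8265],
    "VMs stars":        [3.2321, 1.9946, 2.0402, 1.5517, 1.1079],
}


def differences(row):
    """First value, followed by the successive differences of a cumulative row."""
    return [row[0]] + [b - a for a, b in zip(row, row[1:])]


# Largest disagreement between the differenced cumulative values and the per-character values
max_gap = max(
    abs(d - p)
    for name in cumulative
    for d, p in zip(differences(cumulative[name]), per_character[name])
)
print(max_gap < 1e-3)  # the tables agree to better than 0.001 bits
```

The same differencing makes the pattern easy to read off: the Voynich values are lower than both Latin texts for the first two characters and higher for the third and fourth.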

Interpretation

The above shows that:

The last point above is a remarkable feat for a medieval text. William Friedman thought that the Voynich MS could be in some early form of constructed language, and the above observations are quite compatible with this hypothesis.

Also, due to the lack of success in comparing Voynichese with 'normal' European languages, there have been suggestions that the Voynich MS may be some alphabetic rendering of a syllabic language or writing system.

Both suggestions have chronological difficulties, but cannot be rejected off-hand. Sample texts for both are available, both relatively short. These are compared with Voynichese below, together with two further unusual texts which were obtained by modifying the Vulgate Genesis or De Bello Gallico using word games explained at Stolfi's Web Pages.

Colour   Text
red      24 pages of herbal-A, in Curva
blue     24 pages of herbal-B, in Curva
yellow   Chinese sample in Pinyin
cyan     The Lord's Prayer in Dalgarno's language
green    Genesis modified with Gabriel Landini's dain daiin substitution
grey     De Bello Gallico modified with Rene Zandbergen's word scrambling substitution

The Chinese sample has a value of Hw1 slightly above 4, which is the same as in Latin, and significantly higher than Voynichese. It has a value of HW which is significantly lower than Voynichese and Latin, showing that the language/script uses fewer tokens. Thus, the idea that Voynichese is a written version of a syllabic script is rejected under this simple assumption, and a more complicated method of converting the syllabic script to Voynichese will be required in order to maintain this hypothesis. The statistics for Caesar's Latin shown above indicate that it will in fact be easier to convert Latin to Voynichese than to do this with Chinese.

The artificial language is the only text sample that exhibits values of Hw1 and hw2 similar to Voynichese. After that, the language suffers from a shortage of tokens even more severe than the Pinyin sample. Admittedly, the contents of the sample are likely to be partly responsible for this feature, but it is to be questioned whether any invented language will have a sufficiently rich vocabulary, and the same question may be asked for the vocabulary of glossolalia.

The modified Latin texts were not offered as realistic potential explanations for the anomalies in Voynichese. The 'dain daiin' substitution makes the word entropy collapse to below the level of Hw2 of Latin. The word scrambling technique is closest to Voynichese overall, but its hw2 is still significantly higher.

Conclusions

No definite conclusions can be drawn, and if certain hypotheses about the nature of the Voynich MS language seem to be contradicted, it may be possible to find more elaborate ways to match Voynichese with syllabic writing, artificial languages, glossolalia or a word game.

Still, the following has been found:

  1. The apparent words in the Voynich MS appear to be really words. They are as varied as the words in Latin texts of a similar length.
  2. The first and second character of Voynich words (using the Curva alphabet) have lower entropy than in Latin. The Voynich words contain more information from the third character onwards (in the conditional sense).
  3. The word-initial statistics of Voynichese are matched by one example of an artificial language (which postdates the Voynich MS by at least one and a half centuries).
  4. The statistics of Voynichese and a Mandarin text written in the Pinyin script (using a trailing numerical character to indicate tone) are very different.
  5. A word game to translate Latin to Voynichese must:
    1. Increase predictability of word starts
    2. Make words shorter
    3. Maintain the size of the vocabulary
    This seems at odds with what one would expect from the type of word games which have been looked at by Voynich MS list members (this does not refer specifically to the two examples used above).

Acknowledgments

Most of the text samples used in this study were prepared by Jorge Stolfi. The sample of Dalgarno's language was prepared by Adam McLean.


Copyright René Zandbergen, 2011
Comments, questions, suggestions? Your feedback is welcome.
Latest update: 2011/02/26