This page will first list a number of observations about the Voynich MS character statistics that may be found in the printed literature, and then concentrate on more quantitative analysis results.
(Note: Tiltman treats f as a variant form of k and p as a variant form of t. In the following, characters or sequences in parentheses represent such variant forms).
Currier's first observation has been noted independently by several people, and was taken up recently by Brian Cham, who developed the >>curve-line system out of it.
Oddly enough, there is no consolidated set of this most basic statistic, due to the use of different transcription alphabets and different transcription sources. Several examples may be found in different sources.
One example is found in D'Imperio (1978) (see note 4), Fig. 28 on p.106, from several sources but none covering the entire MS text.
Some graphics related to single character statistics of the entire MS may be found at the beginning of this page.
As a very short summary, the single character frequency distribution in the most important transcription alphabets is largely similar to that of texts in normal European languages, though the drop in frequency appears to be marginally steeper.
An algorithm for the detection of vowels and consonants was designed by B.V. Sukhotin, and Jacques Guy experimented with it in the 1990s. He published a first English summary of the algorithm in Cryptologia (see note 5). Results indicated that the characters that look like vowels (a, o, y) also appeared statistically like vowels, though the confidence of the result was not very high. There is also a recent >>internet blog entry related to running Sukhotin's algorithm on individual pages of the MS.
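For readers who want to experiment, the following is a minimal Python sketch of Sukhotin's algorithm as it is commonly described: letters that are most often adjacent to other letters are promoted to vowels one at a time, until no remaining letter has a positive adjacency sum. It is not the code used by Guy or by the blog author. The function name `sukhotin_vowels`, the input format (a list of words in some transcription) and the alphabetical tie-breaking are assumptions made for this illustration.

```python
from collections import defaultdict


def sukhotin_vowels(words):
    """Return the letters classified as vowels, in the order they are found."""
    # Symmetric adjacency counts between *different* neighbouring letters.
    adj = defaultdict(lambda: defaultdict(int))
    letters = set()
    for word in words:
        letters.update(word)
        for a, b in zip(word, word[1:]):
            if a != b:  # the diagonal of the matrix is kept at zero
                adj[a][b] += 1
                adj[b][a] += 1

    # Row sums: how often each letter stands next to a different letter.
    sums = {c: sum(adj[c].values()) for c in letters}

    vowels = []
    consonants = set(letters)  # every letter starts out as a consonant
    while consonants:
        # Sorting makes the choice deterministic when sums are tied (assumption).
        best = max(sorted(consonants), key=lambda c: sums[c])
        if sums[best] <= 0:
            break  # no remaining letter behaves like a vowel
        vowels.append(best)
        consonants.remove(best)
        # Contacts with the new vowel now count against being a vowel.
        for c in consonants:
            sums[c] -= 2 * adj[c][best]
    return vowels


# Toy input, not Voynich text.
print(sukhotin_vowels(["banana", "panama", "canada"]))  # -> ['a']
```

For a real test one would pass in the words of a complete transcription file. The outcome will depend on the transcription alphabet that is used.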
The concept of entropy has been explained in the introductory page and the reader should have read this introduction in order to properly appreciate the following.
The entropy of the Voynich MS text was first analysed in detail by the Yale professor William Ralph Bennett Jr. (6). He developed the concept in many easy steps and in more detail than in the above-mentioned introductory page. He first analysed texts in common European languages and then addressed the Voynich MS text, which he transcribed using his own transcription alphabet (7). He writes (8):
[...] the statistical properties of the Voynich Manuscript are quite remarkable. The writing exhibits fantastically low values of the entropy per character over that found in any normal text written in any of the possible source languages (see Table 5). The values of h1 [i.e. first order entropy - RZ] are comparable to those encountered earlier in this chapter with tables of numbers. Yet the ratio h1/h2 is much more representative of European languages than of a table of numbers alone.
His computed values are as follows (9):
|Entropy order||Normal languages||Voynich MS|
|First||3.91 - 4.14||3.66|
|Second||3.01 - 3.37||2.22|
|Third||2.12 - 2.62||1.86|
He finally identified one language with a set of similarly low entropy values, namely Hawaiian, but he also pointed out that this is not likely to be significant.
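To make the quantities in the table more concrete, here is a minimal Python sketch of how a first-order entropy and a conditional second-order entropy can be computed from a list of words. It is an illustration only, not Bennett's procedure. The names `entropy` and `char_entropies` are hypothetical, and two choices are assumptions: character pairs are counted inside words only, and the second-order value is taken as the pair entropy minus the entropy of the first character of each pair.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy, in bits, of a frequency table."""
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def char_entropies(words):
    """Return (h1, h2) for a list of words.

    h1: entropy of single characters.
    h2: conditional entropy of a character given the preceding one,
        using only pairs that lie inside a word (assumption).
    """
    singles = Counter(ch for w in words for ch in w)
    pairs = Counter(p for w in words for p in zip(w, w[1:]))

    # Frequency of each character in the first position of a pair.
    firsts = Counter()
    for (a, _b), n in pairs.items():
        firsts[a] += n

    h1 = entropy(singles)
    h2 = entropy(pairs) - entropy(firsts)
    return h1, h2


# Toy input, not a real corpus.
h1, h2 = char_entropies("the quick brown fox jumps over the lazy dog".split())
print(round(h1, 2), round(h2, 2))
```

A text in which every character is fully determined by its predecessor gives h2 = 0, while a text with no dependence between neighbouring characters gives h2 close to h1.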
More statistics related to entropy calculations may be found in an on-line paper by >>Dennis Stallings: understanding the second-order entropies of Voynich text. This basically confirms the results of Bennett. His descriptions will be useful for those who have no access to a copy of Bennett's book.
I have re-done the calculation for first- and second-order entropy for a larger number of languages, using the text of the Universal Declaration of Human Rights (10). This analysis will be described in more detail in a dedicated page, and for the moment I just show some of the results. In this analysis, the space character has not been interpreted as a character, but as a separator between words. The first plot below shows the (conditional) second-order entropy plotted against the first-order entropy, for a number of modern European languages, also including the results for the Voynich MS. The Voynich MS statistics are those computed by Bennett (left-most point) and those computed by Dennis Stallings for Herbal-A and Herbal-B using the Currier alphabet. The meaning of the legend is shown in a table below the figure. It is clear that none of the languages shows a similar behaviour to the Voynich MS text.
|ROM||Romance languages||Latin, French, Spanish, Italian, Portuguese, Catalan, Galician, Occitan (Auvergnan), Corsican, Friuli, Maltese|
|GER||Germanic languages||English, German, Dutch, Frisian, Afrikaans|
|SCA||Scandinavian languages||Swedish, Norwegian (Modern and Bokmål), Danish, Icelandic|
|SLA||Slavic languages||Russian, Polish, Czech, Slovak, Croatian, Bulgarian, Macedonian, Belorus, Georgian|
|GAE||Gaelic languages||Scottish Gaelic, Irish Gaelic, Breton, Manx|
The calculations have been repeated for a number of other languages from around the world. These are listed in the following table:
|EUO||European, other||Albanian, Basque, Finnish, Hungarian|
|AFR||African||Ethiopian (Amharic), Swahili, Hausa, Edo, Somali, Bari|
|IEU||Indo-European||Greek, Estonian, Latvian, Lithuanian, Farsi, Hindi, Nepali, Urdu|
|ASI||Asian||Turkish, Armenian, Turkmen, Kurdish, Hebrew, Arabic, Azerbaidjani, Bengali, Minjiang (a Chinese dialect, spoken vs. written), Tibetan, Mongolian, Japanese, Korean, Thai, Laotian, Burmese, Cambodian, Vietnamese, Indonesian, Tagalog, Cebuano, Hawaiian|
The result is shown in the following plot, where the points for the "European" languages have been repeated in grey.
Here we see a number of points among the group of Asian languages that lie in the relative vicinity of the Voynich MS text. The lowest and leftmost point is Tagalog (Philippines). The two points to the right of this are the spoken and written versions of Minjiang. These texts have been written in the Latin alphabet without indication of tones. Hawaiian, the language named by Bennett, is the lowest point directly above those for the Voynich MS.
An alternative method to compute entropy is the so-called 'commas' method, which has been used by Jim Reeds and later by Gabriel Landini. This will be included here at some future time.
Jorge Stolfi has set up a tool to visualise the number of bits of entropy per character in the following location: >>Jorge Stolfi: where are the bits?
Furthermore, I have addressed the question of how it is possible that the character and digraph entropy of the Voynich MS text is so much lower than that of, say, Latin, while the word entropy (about which something will be said here) is similar. This is addressed at this page: From digraph entropy to word entropy. I realise that this page is rather hard to understand in its present form, and I will re-do it. The short summary is that, counting from the start of each word, the entropy per character in the Voynich MS starts off lower for the first two characters, but is higher for the remainder of the characters, when compared to normal languages.
While the entropy values are single values derived from a frequency distribution, more can be learned by looking at the detail of these distributions, for which see here. This discussion, which is not yet fully completed, exemplifies even better how different the Voynich MS text is from 'regular languages'. The comparison with Asian languages in this form has not yet been made.
There is a critically important conclusion to be drawn from the first- and second-order entropy values reported by various authors. As already mentioned in the analysis section introduction, the entropy values do not change when one consistently replaces characters by others, i.e. in a simple substitution cipher. This tells us something about the possible plain text of the Voynich MS.
In general, and quite briefly, any attempt to translate the Voynich MS into something meaningful in Greek, Latin, English, etc. using a simple substitution must fail, because the resulting plain text would have the same anomalously low entropy values as the Voynich MS text, and texts in these languages do not. As this is the first thing most people will try, we can begin to understand how the MS has resisted all translation attempts.
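The invariance under simple substitution is easy to check numerically. The following Python sketch applies a random one-to-one replacement of characters to a short text and compares the entropy values before and after. The helper names (`entropy`, `h1_h2`, `random_substitution`) and the sample text are made up for this illustration, and the space is simply treated as one more character here.

```python
import math
import random
from collections import Counter


def entropy(items):
    """Shannon entropy, in bits, of the items in an iterable."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def h1_h2(text):
    """First-order entropy and conditional second-order entropy of a string."""
    pairs = list(zip(text, text[1:]))
    h1 = entropy(text)
    h2 = entropy(pairs) - entropy(a for a, _ in pairs)
    return h1, h2


def random_substitution(text, seed=0):
    """Replace every character consistently by another one (a bijection)."""
    alphabet = sorted(set(text))
    shuffled = alphabet[:]
    random.Random(seed).shuffle(shuffled)
    table = dict(zip(alphabet, shuffled))
    return "".join(table[ch] for ch in text)


plain = "in principio erat verbum et verbum erat apud deum"
cipher = random_substitution(plain, seed=1)

# The two pairs of values agree, although the texts look different.
print(h1_h2(plain))
print(h1_h2(cipher))
```

A substitution only relabels the characters, so all single and pair frequencies keep their values and only the labels attached to them change.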
However, there is much more to this, as we shall see in the following (11).
The following observations are paraphrased from Currier's papers (see note 3).
The observation of Currier that the line appears to be a functional unit was further analysed in 2012 by Elmar Vogt, for which >>see here. One of the most obvious features he shows is that, when using the Eva alphabet, the first word of a line tends to be on average one character longer than the second and following words.
Julian Bunn highlighted the positions of the gallows characters on each folio of the MS in >>a page at his blog, in colour-coded graphics. They show a peculiar vertical pattern, which may be related to the observations of Andreas Schinner in his 2007 Cryptologia paper (14), which is discussed in a later page.
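A statistic of this kind can be sketched in a few lines of Python. The function below computes the mean word length for the first, second, third, ... word of each line. It is not Vogt's code: the name `mean_length_by_position`, the `max_pos` parameter and the input format (one string per line of text, words separated by white space) are assumptions, and the sample lines are invented tokens, not a real transcription.

```python
from collections import defaultdict


def mean_length_by_position(lines, max_pos=5):
    """Mean word length for the 1st, 2nd, ... word of each line."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        # Only the first max_pos words of each line are considered.
        for pos, word in enumerate(line.split()[:max_pos], start=1):
            totals[pos] += len(word)
            counts[pos] += 1
    return {pos: totals[pos] / counts[pos] for pos in sorted(counts)}


# Invented example lines, for illustration only.
sample = [
    "abcdef abcd abc abcd",
    "abcde abc abcd",
]
for pos, mean in mean_length_by_position(sample).items():
    print(pos, mean)
```

With a real transcription file one would first strip locators and comments, and split the words on the separator used by that file.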
The following page by Sean V. Palmer gives a very visual representation of the feature that many characters have very preferential positions inside the words of the MS: >>Voynich MS glyph position stacks.
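The underlying counts for such a representation can be obtained with a short script. The Python sketch below tallies, for every character, how often it occurs at the start, in the middle or at the end of a word, or as a word on its own. It is only an illustration and is not related to the code behind the page mentioned above. The name `position_profile`, the four position labels and the sample words are assumptions.

```python
from collections import Counter, defaultdict


def position_profile(words):
    """Count, per character, its occurrences by position inside words."""
    profile = defaultdict(Counter)
    for word in words:
        last = len(word) - 1
        for i, ch in enumerate(word):
            if last == 0:
                slot = "alone"    # single-character word
            elif i == 0:
                slot = "initial"
            elif i == last:
                slot = "final"
            else:
                slot = "medial"
            profile[ch][slot] += 1
    return profile


# Invented example words, for illustration only.
profile = position_profile(["aba", "ab", "b", "abba"])
for ch in sorted(profile):
    print(ch, dict(profile[ch]))
```

A character with a strongly preferred position shows up as one count that dominates the others in its row.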