
This page is still incomplete or, to use a term from the earlier days of the WWW, 'under construction'.

Application of 2-state Hidden Markov Modelling


This page follows on from a preceding page that looked into the detailed character frequency distribution and the resulting anomalous entropy of the Voynich MS text. That page should be read before going through the present one. Figure 1 below is a copy of Figure 10 of the preceding page, showing the difference in character pair distribution between a known plain text in Latin and the FSG transcription of the Voynich MS.

Panels: Mattioli (Latin) and Voynich MS (FSG)

Figure 1: relative character pair distributions of Latin and Voynich text (FSG) compared
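For readers who want to reproduce this kind of plot, the following is a minimal Python sketch of how a relative character pair distribution could be computed. It is not the code used for the Figures. The function name pair_distribution, the ordering of characters by decreasing frequency, and the convention that rows hold the first character of each pair are assumptions made for the example.

```python
from collections import Counter


def pair_distribution(text):
    """Return (symbols, matrix) for the character pairs in `text`.

    matrix[i][j] is the relative frequency of symbols[i] being immediately
    followed by symbols[j]. Hypothetical helper, for illustration only.
    """
    counts = Counter(text)
    # Assumed ordering: decreasing character frequency, ties broken alphabetically.
    symbols = sorted(counts, key=lambda c: (-counts[c], c))
    index = {c: i for i, c in enumerate(symbols)}
    size = len(symbols)
    matrix = [[0.0] * size for _ in range(size)]
    pairs = Counter(zip(text, text[1:]))
    total = sum(pairs.values())
    for (first, second), n in pairs.items():
        matrix[index[first]][index[second]] = n / total
    return symbols, matrix
```

Whether spaces are counted as characters is left to the caller: the function simply uses every character in the string it is given.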

Vowels and consonants

The plots for the plain languages are dominated by the alternation of vowels and consonants, and the fact that such combinations are the ones that tend to have the highest frequencies. The Figure for the German text "Tristan" looked somewhat different from the other three, because there are more consonants among the highest-frequency characters. It is of interest to redo the plots by separating the vowels from the consonants. For the known languages this can be done easily, but for the text of the Voynich MS we do not know which characters are vowels and which are consonants, or indeed whether the characters in the Voynich MS alphabet can be separated in this way at all.

This problem can be approached by applying a two-state Hidden Markov Model (HMM) to all texts. This technique has been introduced already with a short dedicated description. For texts in most common languages this effectively classifies all characters into vowels and consonants, though sometimes with some minor 'surprises'. The most frequent character in the languages of the four plain texts is a vowel. Therefore, for the Voynich MS text, we will call the 'vowel' state the one that includes the single most frequent character. This is the character o.
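As a rough illustration of the idea, the following Python sketch fits a two-state HMM to a character string with the textbook Baum-Welch procedure, and then names the 'vowel' state as the one that holds the most frequent character. This is not the alternative implementation used for the results on this page. The function names, the near-uniform random start, the iteration count and the 0.5 threshold are all assumptions made for the example.

```python
import random
from collections import Counter


def two_state_hmm(text, iterations=200, seed=1):
    """Fit a 2-state HMM with Baum-Welch; return {symbol: P(state 0 | symbol)}.

    Hypothetical helper: textbook algorithm, not the tool described on this page.
    Assumes len(text) >= 2 and iterations >= 1.
    """
    symbols = sorted(set(text))
    index = {c: k for k, c in enumerate(symbols)}
    obs = [index[c] for c in text]
    T, M, N = len(obs), len(symbols), 2
    rng = random.Random(seed)

    def near_uniform(size):
        # Assumed start: uniform plus a small random perturbation to break symmetry.
        weights = [1.0 + 0.2 * rng.random() for _ in range(size)]
        total = sum(weights)
        return [w / total for w in weights]

    pi = near_uniform(N)
    A = [near_uniform(N) for _ in range(N)]   # state transition probabilities
    B = [near_uniform(M) for _ in range(N)]   # per-state character probabilities

    for _ in range(iterations):
        # Forward pass, rescaled at every step to avoid numerical underflow.
        alpha, scale = [], []
        for t in range(T):
            if t == 0:
                row = [pi[i] * B[i][obs[0]] for i in range(N)]
            else:
                row = [sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                       for j in range(N)]
            c = sum(row)
            scale.append(c)
            alpha.append([x / c for x in row])
        # Backward pass, using the same scale factors.
        beta = [[1.0] * N for _ in range(T)]
        for t in range(T - 2, -1, -1):
            for i in range(N):
                beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                                 for j in range(N)) / scale[t + 1]
        # Expected state occupancies (gamma) and transition counts (xi_sum).
        gamma = [[alpha[t][i] * beta[t][i] for i in range(N)] for t in range(T)]
        xi_sum = [[0.0] * N for _ in range(N)]
        for t in range(T - 1):
            for i in range(N):
                for j in range(N):
                    xi_sum[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                     * beta[t + 1][j] / scale[t + 1])
        # Re-estimate the model from the expected counts.
        pi = gamma[0][:]
        emit = [[0.0] * M for _ in range(N)]
        for t in range(T):
            for i in range(N):
                emit[i][obs[t]] += gamma[t][i]
        for i in range(N):
            out = max(sum(xi_sum[i]), 1e-300)   # guard against an unused state
            occ = max(sum(emit[i]), 1e-300)
            A[i] = [x / out for x in xi_sum[i]]
            B[i] = [e / occ for e in emit[i]]

    # emit[0][k] + emit[1][k] equals the number of occurrences of symbol k.
    return {c: emit[0][k] / (emit[0][k] + emit[1][k]) for c, k in index.items()}


def vowel_consonant_split(text, **kwargs):
    """Split the alphabet in two; the 'vowel' set holds the most frequent character."""
    posterior = two_state_hmm(text, **kwargs)
    most_frequent = Counter(text).most_common(1)[0][0]
    vowel_is_state0 = posterior[most_frequent] >= 0.5
    vowels = {c for c, p in posterior.items() if (p >= 0.5) == vowel_is_state0}
    return vowels, set(posterior) - vowels
```

A posterior value close to 0.5 for some character means that the model does not place it clearly in either state. Baum-Welch only finds a local optimum, so different values of seed can give different splits.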

I have done this analysis based on my own, alternative implementation of a two-state HMM (1). The results in the following are based on a previous version of this tool, which has in the meantime been updated significantly. As a consequence, the experiments will be re-run, and this page will be updated considerably.

For texts in known languages the tool converged fairly quickly and produced the same results as the standard implementation, for those cases in which it has been compared. For the Voynich MS, the results reported by Reddy and Knight (2011) (2) suggest that this method is not successful in identifying vowels and consonants in the text (3). After some experimentation, it turned out that it was helpful to treat spaces as a separate, third state, which is possible with my alternative method, but not with the standard HMM algorithm (4).

The results of this experimentation are shown in the following. The first plot shows the expected effect on the organisation and colour distributions in these plots.

Figure 2: expected patterns of vowel and consonant frequency distributions

The following Figures show the effect of the new sorting on the Mattioli text. The HMM-sorted text has all 'vowels' in decreasing frequency first, followed by all 'consonants' in decreasing frequency.
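The re-ordering itself is simple once the two classes are known. A minimal Python sketch follows, in which the function name hmm_sorted_order and the alphabetical tie-breaking are assumptions made for the example:

```python
from collections import Counter


def hmm_sorted_order(text, vowels):
    """Order the characters of `text` for an HMM-sorted plot.

    All characters in `vowels` come first, by decreasing frequency, followed
    by all remaining characters, also by decreasing frequency.
    """
    counts = Counter(text)
    # Assumed tie-break: alphabetical order among equally frequent characters.
    by_frequency = sorted(counts, key=lambda c: (-counts[c], c))
    first = [c for c in by_frequency if c in vowels]
    second = [c for c in by_frequency if c not in vowels]
    return first + second
```

For example, hmm_sorted_order("sassafras", {"a"}) puts a first, even though s is the more frequent character, and then lists s, f and r.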

Figure 3: Mattioli, by character frequency

Figure 4: Mattioli, HMM-sorted


Figure 5: effect of vowel/consonant separation (by HMM) on Mattioli text (Latin)

Note: in addition to the above-mentioned reason for redoing the analysis, all of the following Figures are out of date, as the rows and columns in the plots are swapped (the first character is on the horizontal scale), and the characters had not yet been added to the margins of the matrix plots.

The following is still old.

The following plots show the effect of the application of the same procedure on all texts that have been used before.

Figure 4: effect of HMM algorithm on all 8 sample texts

The following table shows the results of the separation of characters.

"Vowels"      i e a u o y
"Consonants"  t s r n m c l d p q b v f g h x z k

"Vowels"      e i a u o y w
"Consonants"  t s n r m c l d p q b v g f x h z k

"Vowels"      e a i o u
"Consonants"  n r l t s c d m p v g h f b q z x

"Vowels"      e i a h u o k y
"Consonants"  n s r t d l c g m b w f z v p j q x

"Vowels"      O 9 C A 0
              o y e a iiir
"Consonants"  8 S E F R 4 P Z M 2 N Q B X J V T W D 3 U I 6 Y K 7 G H L 5
              d ch l k r q t Sh iin s in cTh p cKh m f ir cPh n iiin iir i g cFh im j il iil iim iiim

"Vowels"      O C G A T Z 7
              o e y a ch (c h) j
"Consonants"  C 8 D E R H 4 M S 2 P K I N F 0 L Y
              e d k l r t q iin Sh s p m i in f n x

(ZL - Eva)
"Vowels"      o e y a c s n u
              o e y a C s n u
"Consonants"  h d i k l r t q p m f g x b j v z
              h d i k l r t q p m f g x b j v z

(ZL - Cuva)
"Vowels"      O Y A E U F
              o y a e ee f
"Consonants"  D S K L R T H Z M C N P J I G X B Q V
              d ch k l r t q Sh iin s in p m/z i g x b j v

For the plain texts, the most obvious 'unexpected' outcome is the listing of h as a vowel in the German text. In general, and depending on settings, either c or h tends to be classified as a vowel. Looking at the plot in detail, one can see that the h (fourth row and fourth column) has very different combinations to the left and to the right. The single dark red square in the fourth row corresponds with the combination ch.

For the Voynich texts, the first thing to note is that, for the FSG transcription, the character e appears both in the vowel and the consonant state. Indeed, the algorithm converged with a 50/50 probability on this character.

Apart from that, as already indicated, the HMM algorithm had trouble converging for the Voynich MS texts. While this will still be investigated further, it can be attributed tentatively to the asymmetry of the plots, which was already observed. Characters tend to make different combinations 'to the left' than 'to the right'.
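One possible way to put a number on this left/right asymmetry is sketched below in Python. For every character it compares the distribution of the characters that precede it with the distribution of those that follow it. The choice of measure (half the sum of absolute differences, which lies between 0 and 1) and the function name are assumptions made for the example. This measure has not been used for the results on this page.

```python
from collections import Counter, defaultdict


def left_right_asymmetry(text):
    """Return {character: asymmetry}, where 0 means identical neighbours on
    both sides and 1 means completely different neighbours.

    Hypothetical measure, for illustration only.
    """
    before = defaultdict(Counter)   # before[c][x]: x occurs immediately left of c
    after = defaultdict(Counter)    # after[c][x]: x occurs immediately right of c
    for first, second in zip(text, text[1:]):
        after[first][second] += 1
        before[second][first] += 1
    result = {}
    # Only characters that have at least one neighbour on each side.
    for c in set(before) & set(after):
        n_before = sum(before[c].values())
        n_after = sum(after[c].values())
        neighbours = set(before[c]) | set(after[c])
        result[c] = 0.5 * sum(
            abs(before[c][x] / n_before - after[c][x] / n_after)
            for x in neighbours
        )
    return result
```

In a string such as "abab" every character has the same neighbour on both sides and the value is 0. In "abcabc" each character is always preceded by one character and followed by a different one, and the value is 1.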

Despite these problems, the above plots show the difference between the Voynich MS text and that of a Latin plain text even more clearly.


A summary with some (tentative) conclusions will be added once the calculations have been re-done.


(1) This is a purely statistical method based on the character pair frequency distribution. Experimentation with a more 'classical' HMM implementation will still be done.
(2) Reddy and Knight (2011).
(3) Also confirmed by earlier discussions with Jim Reeds.
(4) At least not without a significant adaptation.



Copyright René Zandbergen, 2019
Comments, questions, suggestions? Your feedback is welcome.
Latest update: 07/02/2019