![]() |
![]() |
Or use the browser's BACK button
Various cluster analyses of the pages of the Voynich MS have been performed, and each of these was able to show the clear difference between the Currier languages A and B, and the various flavours of Currier language B. These analyses were based either on the single character distribution using the Currier or FSG alphabets, or the distribution of Voynich words, which in principle is independent of the chosen alphabet.
This page looks at the distribution of digraphs over the pages of the Voynich MS, which should give a very clear picture because:
To count digraphs, one should know what constitutes a single character and this is the first problem. The existing transcription alphabets are probably all wrong.
The EVA transcription alphabet is of analytical rather than
synthetic nature, which means that certain Voynich characters which
probably represent one character are written as a composite (such as
ch and iin). The Currier alphabet has the opposite problem.
By translating the Voynich MS text to a more suitable alphabet, this
problem may be reduced. The following mixture of EVA and Currier (we
may call it Curva) is used.
It should be noted that this alphabet is only intended for use in this particular study, and is not suggested as a good transcription alphabet. In particular the equation (EVA:) cth = tch, ckh = kch (etc) is not the best choice, but this was done to restrict the alphabet size to 26.
When all digraphs of the complete Voynich MS, transcribed in Curva, are counted, it turns out that 355 different digraphs exist. During this count, EVA special characters were not processed. Furthermore, digraphs spanning word breaks were not counted and uncertain spaces were discarded (i.e. treated as 'no space'). There were 232 pages with some text, the ones without text being f101r2 and f116v. (The text seen on f101r2 was counted on f101r1, as the lines of text span both pages).
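The counting rules above can be sketched as follows. This is only an illustration, not the program used for this study. It assumes the text has already been converted so that every character is a single symbol, that `.` marks a word break and that `,` marks an uncertain space; the sample line and the function name `count_digraphs` are hypothetical.

```python
from collections import Counter

def count_digraphs(words):
    """Count adjacent character pairs inside each word.

    Pairs are formed within one word only, so no digraph spans a word break.
    """
    counts = Counter()
    for word in words:
        for first, second in zip(word, word[1:]):
            counts[first + second] += 1
    return counts

# Hypothetical sample line (not real manuscript text).
# Assumption: '.' is a word break and ',' is an uncertain space.
line = "ABC.DE,FG"

# Uncertain spaces are discarded, i.e. treated as 'no space'.
words = line.replace(",", "").split(".")
digraphs = count_digraphs(words)
```

Here the words are `ABC` and `DEFG`, so `CD` is never counted, while `EF` is counted because the uncertain space between `DE` and `FG` was dropped.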
Next, the digraphs were counted on each page separately. The page length differs greatly: short pages, such as some Herbal-A pages, had as few as 200 digraphs, while long pages, such as those in the stars section, could have up to 3000 digraphs. One can arrange the 355 frequency values (the fractions) for each page in a vector of length 355, where the sum of all components of each vector equals one. Thus, 232 vectors arise, all contained in the region of space where no component is negative. The vectors all point to a small region in 355-dimensional space.
It is more interesting to subtract from each of the vectors the one vector which represents the average digraph distribution for the entire MS. Now the vector for each page shows the way in which this particular page differs from the 'average manuscript page'.
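A minimal sketch of building the page vectors and subtracting the average is given below. The page names and counts are made up. The sketch also assumes that the 'average' is the distribution of all digraphs of the MS pooled together; taking the unweighted mean of the page vectors would be an alternative reading.

```python
from collections import Counter

def to_fractions(counts, digraph_order):
    """Turn raw digraph counts into fractions that sum to one."""
    total = sum(counts.values())
    return [counts.get(d, 0) / total for d in digraph_order]

# Hypothetical per-page digraph counts (not real manuscript data).
page_counts = {
    "page1": Counter({"AB": 3, "BC": 1}),
    "page2": Counter({"AB": 1, "CD": 3}),
}

# Assumption: the average distribution is computed from the pooled counts.
manuscript_counts = sum(page_counts.values(), Counter())
digraph_order = sorted(manuscript_counts)  # fixes the meaning of each component
average = to_fractions(manuscript_counts, digraph_order)

# Each page vector minus the average vector.
deviations = {
    page: [f - a for f, a in zip(to_fractions(counts, digraph_order), average)]
    for page, counts in page_counts.items()
}
```

Since both the page vector and the average vector sum to one, the components of each difference vector sum to zero.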
Now, pages with similar properties will be close to each other in this hyperspace, so their Euclidean distance will be small. This distance can be computed for each pair of pages, leading to 232 * 232 values which can be arranged in a square matrix.
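The distance matrix can be sketched in a few lines. The three short vectors below are hypothetical stand-ins for the 232 page vectors of length 355, and `distance_matrix` is an illustrative name.

```python
import math

def distance_matrix(vectors):
    """Return the square matrix of pairwise Euclidean distances."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(vectors[i], vectors[j])
            # The distance is symmetric, so both halves get the same value.
            matrix[i][j] = matrix[j][i] = d
    return matrix

# Hypothetical vectors standing in for the page vectors.
vectors = [[0.0, 0.0], [3.0, 4.0], [3.0, 0.0]]
D = distance_matrix(vectors)
```

The matrix is symmetric and has zeros on the main diagonal, so only half of the values are independent.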
The following plot shows a colour-coded 232 * 232 square matrix, where the page number increases from 1 to 232 from left to right and from top to bottom. The main diagonal (top left to bottom right) has values 0. The colour scale is such that large distances are dark blue going through light blue, cyan, green, yellow, orange, red to magenta for the smallest distances.
The first thing which appears is the checkerboard pattern showing that certain groups of pages have similar properties. It reflects the known alternation of pages in Herbal-A, Herbal-B, Biological, Pharmaceutical and Stars sections. To show these properties better, the same matrix can be plotted with the pages grouped per known section.
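Grouping the pages per section amounts to applying the same permutation to the rows and the columns of the matrix. The sketch below uses a made-up 4 * 4 matrix and the placeholder labels `A` and `B`, which are not the section codes of this study.

```python
def group_by_section(matrix, sections):
    """Reorder rows and columns so that pages of one section are adjacent."""
    # sorted() is stable, so pages keep their original order within a section.
    order = sorted(range(len(sections)), key=lambda i: sections[i])
    grouped = [[matrix[i][j] for j in order] for i in order]
    return order, grouped

# Hypothetical section label per page, alternating as in a checkerboard.
sections = ["A", "B", "A", "B"]

# Hypothetical distances: small within a section, large between sections.
matrix = [
    [0, 9, 1, 8],
    [9, 0, 8, 2],
    [1, 8, 0, 9],
    [8, 2, 9, 0],
]

order, grouped = group_by_section(matrix, sections)
```

After grouping, the small distances sit in two blocks along the diagonal instead of alternating over the whole matrix.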
The pages have been reordered by the illustration contained in them (but also taking into account the split of the herbal section into Herbal A and Herbal B as defined by Currier):
It is possible to tentatively identify the following languages and dialects. The first character (capital) gives the language, and any other characters (lower case) a variation or dialect of this language.
To investigate further, it will be necessary to somehow quantify the similarities and differences of all of the languages. As the hypercloud of points cannot be easily visualised in all its dimensions, various projections onto two-dimensional space may be tried out to discover the relation between all languages and dialects.
For certain digraphs, the relative frequency per page will vary significantly, and for others much less so. The frequency of a strongly varying digraph (e.g. 'DY') will tell a lot about the language or dialect used: the size of the hypercloud of points along its axis is larger than along other axes. It is of interest to find the vector in 355-dimensional space along which the size of the hypercloud is maximal. Then, the next largest direction perpendicular to it may be searched, etc., to find a small number of base vectors which form an orthonormal system, and which well describe the most important dimensions of the cloud.
The following procedure will not necessarily find this maximum, but it will find something near to the maximum.
Now, the coefficient along this base vector may be computed for each page vector. The contribution of this base vector may then be subtracted from each of the page vectors, meaning that the hypercloud collapses to a space with dimension one less than before.
After this last step, the procedure may be repeated, and it will automatically find the next most important base vector, which will be perpendicular to the first. This whole procedure may be repeated several times.
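The loop of finding a base vector, computing the coefficients, subtracting the contribution and repeating can be sketched as below. Note that the search step here is power iteration, a standard technique swapped in for illustration; it is an assumption and not necessarily the procedure used for this study. The data and all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leading_direction(rows, iterations=200):
    """Approximate the unit vector along which the rows spread the most.

    Assumption: power iteration on X^T X, not the study's own procedure.
    """
    # Start from the longest row, which lies inside the cloud's own subspace.
    start = max(rows, key=lambda row: dot(row, row))
    norm = math.sqrt(dot(start, start))
    if norm == 0.0:
        raise ValueError("no spread left in the data")
    v = [x / norm for x in start]
    for _ in range(iterations):
        # w = X^T (X v), computed without forming the full square matrix.
        coeffs = [dot(row, v) for row in rows]
        w = [sum(c * row[k] for c, row in zip(coeffs, rows)) for k in range(len(v))]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v

def deflate(rows, v):
    """Subtract from each row its contribution along the base vector v."""
    return [[x - dot(row, v) * b for x, b in zip(row, v)] for row in rows]

def base_vectors(rows, count):
    """Find `count` mutually perpendicular base vectors, most important first."""
    basis = []
    for _ in range(count):
        v = leading_direction(rows)
        basis.append(v)
        rows = deflate(rows, v)
    return basis

# Hypothetical difference vectors: widest along axis 0, then along axis 1.
rows = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
basis = base_vectors(rows, 2)

# Coefficient of every row along the first base vector.
coefficients = [dot(row, basis[0]) for row in rows]
```

Starting from a row of the data rather than from an arbitrary vector matters here: the components of each difference vector sum to zero, so a start vector with all components equal would be perpendicular to every page vector.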
The plot below shows the decomposition of all page vectors along the four most important directions. Base vectors 2 to 4 are plotted against base vector 1.
Clearly visible is that base vector 1 points along the direction from Biological to Herbal-A pages, as should have been expected. The rms variation of these coefficients is about 0.051, which is significantly higher than the rms variation of any component of the page vectors (the maximum being 0.038).
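That the rms variation along a base vector can exceed that of every single component is easy to see in a toy case. The two-dimensional data below are hypothetical and chosen only so that both components vary together.

```python
import math

def rms(values):
    """Root mean square of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

# Hypothetical difference vectors whose two components vary together.
rows = [[1.0, 1.0], [-1.0, -1.0]]

# Unit vector along the diagonal, the direction of largest spread here.
direction = [1 / math.sqrt(2), 1 / math.sqrt(2)]

along_direction = rms([sum(a * b for a, b in zip(row, direction)) for row in rows])
per_component = [rms([row[k] for row in rows]) for k in range(2)]
```

In this example each component has an rms variation of 1, while the coefficients along the diagonal have an rms variation of the square root of 2, because the base vector combines components that vary together.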
One feature which is not immediately obvious from the above figure is how much the language usage depends on whether the bifolio on which it is written has a standard size or has additional folds. To demonstrate this surprising feature, the plot is made again twice, below, on the left using only text on foldout bifolios and on the right only the standard-sized bifolios.
This clearly shows that Currier, who did not use any text transcribed from the foldout pages (to be checked!!), was correct in identifying two separate 'languages'. Now, we know that the language of the missing pages actually 'bridges the gap'.
Further information may be gleaned from the following plots, which show for some frequent digraphs (in Curva) which fraction they form on each of the pages. The first graph has the pages (on the X-scale) ordered as they appear in the Voynich MS. The second has them in the order described above.
The colour code corresponds to the section of the MS as indicated by the two-character code introduced above (see between the two plots). The digraphs are indicated in Curva.
More explanations will follow here.