Analysis Section ( 4/5 ) - Word statistics

4.1 Introduction

The term 'word' requires an additional introduction. The MS clearly has groups of characters separated by spaces. It has become usual to call these groups of characters 'words', but it is not certain that these also represent words in the grammatical sense. When one looks a bit more closely, it becomes evident that not all word spaces are that clear, and occasionally, spaces between individual characters are as wide as some word spaces.

A second point concerns terminology as used by different authors. The question 'How many words are there in the Voynich MS?' could lead to two different answers, namely the total number of words, or the number of different words. Reddy and Knight (1) use in this context the terms word tokens and word types, which I shall follow throughout this web site. The meaning is most easily explained by noting that the phrase:

To Be Or Not To Be

has six word tokens and four word types. Whenever the short term 'word' is used (e.g. when quoting a source), the distinction is not relevant. Reddy and Knight report that the Voynich MS text contains 37,919 word tokens and 8,114 word types (2).
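The distinction can be illustrated with a few lines of Python. This is only a minimal sketch: the function name is my own, and it assumes that words are separated by white space and that upper and lower case are treated as equal. That fits the English example, but not necessarily a transliteration file.

```python
from collections import Counter

def count_tokens_and_types(text):
    """Return (word tokens, word types) for white-space separated text."""
    # Assumption: case is folded, so 'To' and 'to' are the same word type.
    tokens = text.lower().split()
    types = Counter(tokens)          # one entry per distinct word
    return len(tokens), len(types)

print(count_tokens_and_types("To Be Or Not To Be"))  # (6, 4)
```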

As before, this page will first list a number of observations about the Voynich MS words that may be found in the older literature, and then concentrate on word statistics analyses that have been performed later.

For various reasons, word statistics are likely to be less reliable than most other statistics. Firstly, there is the uncertainty in the identification of word spaces. Secondly, there are uncertainties in the transliteration, so that identical words may be represented differently, or different words may be transliterated the same. Finally, when considering word length, the answer will depend on the transliteration alphabet used, and on the uncertainty about what constitutes a single character.

4.2 Observations in printed literature at the word level

This is not yet complete.

4.2.1 D'Imperio (1978) (3)

There are very few single-letter words in the running text, primarily  and .

4.3 Are word spaces significant?

At first sight this seems an easy question, since the word spaces are clearly observable, and the word tokens delineated by them show a frequency distribution that is quite natural. Immediately, a number of frequently occurring word types like  or  may be recognised.

On the other hand, it can also be observed that there tend to be a few specific characters that occur most frequently immediately before or immediately after word spaces. It is as if this were similar to the rules in the Arabic script that certain characters cannot be connected to the next character. Such spaces would not be real word spaces. In addition, some words that appear standing alone can also appear connected together. For example beside   also  occurs.
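A first quantitative look at this effect could be sketched as follows. For each character, the function computes the fraction of its occurrences that are word-final. The function name and the sample words are made up for illustration (they are not manuscript data), and the sketch assumes a transliteration in which every character is written as exactly one symbol.

```python
from collections import Counter

def word_final_share(words):
    """For each character: fraction of its occurrences that end a word."""
    total = Counter(ch for w in words for ch in w)   # all occurrences
    final = Counter(w[-1] for w in words if w)       # word-final occurrences
    return {ch: final[ch] / total[ch] for ch in total}

# Hypothetical sample words, not taken from any transliteration file.
sample = "abc abd cbd ab".split()
print(word_final_share(sample))
```

A character with a share close to 1 occurs almost only before a space. The same count on the first character of each word gives the word-initial share.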

Then again, the labels provide an argument that the word spaces are real. The label words, which are found standing alone, also occur in the running text delimited by spaces, and only very rarely with a space inside them. This qualitative argument still needs to be confirmed quantitatively, though.

In conclusion, one cannot be absolutely certain about this question, but the evidence tends to favour the view that the visible spaces in the MS are meant to separate words (or other units of meaning).

4.4 Are words really words?

Even if the word spaces are significant, one may still doubt that each such word token really represents a complete word in plain text. It has been suggested that each written word token only represents a syllable, or even just one character. In the latter case, the MS text would either be highly verbose, or contain lots of nulls.

Even if the Voynich MS word tokens are words, there is the additional important question of whether the Voynich MS text can be converted to a meaningful plain text by a word-for-word substitution. In most proposed solutions this tends to be an unstated assumption.

4.5 Word frequencies

This must still be included. It is closely related to the analyses of Zipf's law.

The number of word types that appear only once in the MS (so-called hapax legomena) is rather high (about half of all word types). A detailed comparison with plain texts in other languages should be made.
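The hapax share mentioned above is easy to compute once a list of word tokens is available. The following is a minimal sketch with a function name of my own choosing. It assumes the tokens have already been split at the word spaces, with all the uncertainties this implies.

```python
from collections import Counter

def hapax_share(tokens):
    """Fraction of word types that occur exactly once in the token list."""
    counts = Counter(tokens)
    if not counts:
        raise ValueError("empty token list")
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return hapaxes / len(counts)

# English illustration: 'or' and 'not' are hapaxes, 'to' and 'be' are not.
print(hapax_share("to be or not to be".split()))  # 0.5
```

Running the same function on plain texts of comparable length in other languages would give the comparison asked for above.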

4.6 Word length distribution

Word length statistics are perhaps the least reliable of all, as explained in the introduction. One can still do the calculations, and should expect different results for the word token length distribution and the word type length distribution, since the more frequent word types tend to be shorter (this is true both for normal languages and for the Voynich MS text). In one analysis (4), Jorge Stolfi arrived at the surprising conclusion that, based on his assumptions about character and space definition, the word type length distribution is almost perfectly binomial. This is unusual for a natural language, and the observation is still not properly understood or explained. Stolfi has also presented one potential explanation.

This topic is also addressed by Reddy and Knight (see note 1), who suggest that this shape is very similar to that of devowelled English or Arabic, or in fact Chinese represented in Pinyin. They suggest that the Voynich MS text in this respect resembles an abjad. There is, however, the problem that devowelled languages will have entropy values that are much higher than those of the Voynich MS text.
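The difference between the two length distributions can be sketched as follows. The function names are my own, and word length is simply taken as the number of symbols in the transliterated word, which is exactly the assumption that makes these statistics uncertain. The binomial function is included only for comparison; its parameters n and p, and any offset between k and the word length, would have to be chosen by the user and are not Stolfi's values.

```python
from collections import Counter
from math import comb

def length_distribution(words):
    """Relative frequency of each word length in the given words."""
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {k: counts[k] / total for k in sorted(counts)}

def binomial_pmf(n, p):
    """Binomial probabilities for k = 0..n, for comparison only."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

tokens = "to be or not to be".split()
token_dist = length_distribution(tokens)       # every occurrence counts
type_dist = length_distribution(set(tokens))   # each distinct word once
print(token_dist)
print(type_dist)
```

In the English example the frequent short words weigh more heavily in the token distribution than in the type distribution, as described above.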

4.7 Word entropy

The word token entropy for the entire MS was once computed to be 11 bits, which would be normal for a text of this length. The source for this is, however, unknown to me, and this is another point that should be verified.

Reddy and Knight (see note 1) compute the value for the combined biological and recipes sections (all in Currier B language), which comprise 17,597 word tokens, as 9.666, which is again an expected value.

On the character statistics page I already referred to a paper comparing digraph entropy and word entropy. There, the word entropy for the first 8000 words in the recipes section is computed as 9.927, which is not fully consistent with the value above, but all values are of the right order of magnitude for a normal text.
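For reference, a word entropy of this kind can be computed as sketched below. I assume here the usual first-order Shannon entropy of the word token frequencies, in bits; it is not stated above that the quoted sources used exactly this definition. The function name is my own.

```python
from collections import Counter
from math import log2

def word_entropy(tokens):
    """First-order Shannon entropy (bits per word) of a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    # H = -sum(p * log2(p)) over all word types, with p = count / total
    return -sum((n / total) * log2(n / total) for n in counts.values())

print(round(word_entropy("to be or not to be".split()), 3))  # 1.918
```

With this definition the value cannot exceed the base-2 logarithm of the number of word tokens in the sample, so values computed from samples of different length are not directly comparable.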

4.8 Zipf's laws

This topic was analysed by Gabriel Landini, who wrote a paper titled "Zipf's law in the Voynich Manuscript". This is not available on-line and will be summarised here shortly.
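Independently of that paper, whose method is not described here, a basic check of Zipf's law can be sketched as follows. In its simplest form the law states that the frequency of a word is inversely proportional to its rank, so that a plot of log frequency against log rank is a straight line with a slope near -1. The function name and the least-squares fit over all ranks are my own assumptions.

```python
from collections import Counter
from math import log

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two word types")
    xs = [log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Artificial token list with frequencies 60, 30, 20, 15, 12 (i.e. 60 / rank).
artificial = [f"w{r}" for r in range(1, 6) for _ in range(60 // r)]
print(zipf_slope(artificial))
```

For real texts the many hapax legomena at the low-frequency end tend to distort such a simple fit, so the result should be read together with the plot itself.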

4.9 Other material

Line-initial/final and word initial/final properties, to be included.

Notes

1: See Reddy and Knight (2011).
2: Based on the transliteration file "voynich.now" in the Currier transliteration alphabet.
3: See D'Imperio (1978).
4: Stolfi uses the terms token for word tokens and word for word types.