Punctuation in the literature of major languages is intriguingly mathematical

Risk (hazard) functions represent the probability of using a punctuation mark as a function of the length of the sequence without such marks. In terms of punctuation, the most ‘interlinguistic’ language is German (green curve). Credit: IFJ PAN

A moment’s hesitation… Yes, a period here but shouldn’t there be a comma there? Or would a dash be better? Punctuation can be a pain; it is often simply overlooked. Wrong! The most recent statistical analyses paint a different picture: punctuation seems to “grow out” of foundations shared by all the (examined) languages, and its characteristics are anything but trivial.

Punctuation strikes many as a necessary evil, one to be happily ignored whenever possible. Recent analyses of literature written in the world’s current major languages force us to change this opinion. In fact, the same statistical features of punctuation patterns have been observed in several hundred works written in seven languages, mostly Western.

Punctuation, all ten representatives of which can be found in the introduction to this text, turns out to be a universal and indispensable complement to the mathematical perfection of every language studied. This remarkable conclusion about the role of mere commas, exclamation marks or full stops comes from an article by scientists from the Institute of Nuclear Physics of the Polish Academy of Sciences (IFJ PAN) in Kraków, published in the journal Chaos, Solitons & Fractals.

“The present analyses are an extension of our previous findings on the multifractal characteristics of sentence length variation in works of world literature. After all, what is sentence length? It is nothing more than the distance to the next specific punctuation mark. So now we have put all punctuation marks under a statistical magnifying glass, and also examined what happens to punctuation during translation,” says Prof. Stanislaw Drozdz (IFJ PAN, Kraków University of Technology).

Two sets of texts were studied. The main analyses of punctuation within each language were conducted on 240 best-selling works of literature written in seven major Western languages: English (44), German (34), French (32), Italian (32), Spanish (32), Polish (34) and Russian (32). This particular selection of languages was based on a two-part criterion: the researchers required that no fewer than 50 million people speak the language in question, and that works written in it should have received no fewer than five Nobel Prizes in Literature.

Furthermore, to ensure the statistical validity of the results, each book had to contain at least 1,500 sequences of words separated by punctuation marks. A separate collection was prepared to observe the stability of punctuation in translation. It contained 14 works, each of which was available in each of the languages studied (two of the 98 language versions, however, were omitted because they were unavailable).

In total, the authors in both collections included writers such as Conrad, Dickens, Doyle, Hemingway, Kipling, Orwell, Salinger, Woolf, Grass, Kafka, Mann, Nietzsche, Goethe, Lafayette, Dumas, Hugo, Proust, Verne, Eco, Cervantes, Sienkiewicz and Reymont.

The attention of the Cracow researchers was primarily attracted by the statistical distribution of the distance between consecutive punctuation marks. It soon became apparent that in all the languages studied it was best described by one of the precisely defined variants of the Weibull distribution.
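The quantity being measured can be illustrated with a minimal Python sketch that counts the words between consecutive punctuation marks. The function name `gap_lengths`, the set `MARKS` and the tokenizing pattern are assumptions made for illustration; the article does not describe how the researchers tokenized their texts or exactly which marks they counted.

```python
import re

# Assumed set of marks: . , ; : ! ? ( ) plus em dash, en dash and ellipsis.
MARKS = set(".,;:!?()\u2014\u2013\u2026")

def gap_lengths(text, marks=MARKS):
    """Return the number of words between consecutive punctuation marks."""
    # A token is either a word (inner apostrophes and hyphens allowed)
    # or a single non-space symbol.
    tokens = re.findall(r"\w+(?:['\u2019-]\w+)*|[^\w\s]", text)
    lengths, count = [], 0
    for tok in tokens:
        if tok in marks:
            if count > 0:          # skip marks that directly follow a mark
                lengths.append(count)
            count = 0
        elif re.match(r"\w", tok):  # symbols outside `marks` are ignored
            count += 1
    return lengths                  # trailing words with no closing mark are dropped

sample = "Yes, a period here but shouldn't there be a comma there? Or would a dash be better?"
print(gap_lengths(sample))  # [1, 10, 6]
```

A histogram of these lengths over a whole book is the empirical distribution to which a Weibull-type curve would then be fitted.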

A curve of this type has a characteristic shape: it first rises rapidly and then, after reaching a maximum, descends a little more slowly to a certain critical value, below which it tails off towards zero at a slow and steadily decreasing rate. The Weibull distribution is usually used to describe survival phenomena (e.g. population as a function of age), but also various physical processes, such as increasing material fatigue.
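For reference, the standard continuous two-parameter Weibull distribution can be written as below. The article does not say which variant the authors fitted, so this form, with the scale $\lambda$ and shape $\beta$ as assumed symbols, is only an illustration of the curve shape just described.

```latex
f(x;\lambda,\beta) = \frac{\beta}{\lambda}\left(\frac{x}{\lambda}\right)^{\beta-1}
                     \exp\!\left[-\left(\frac{x}{\lambda}\right)^{\beta}\right],
\qquad
S(x) = \Pr(X > x) = \exp\!\left[-\left(\frac{x}{\lambda}\right)^{\beta}\right],
\qquad x \ge 0,\ \lambda > 0,\ \beta > 0 .
```

For $\beta > 1$ the density rises to a single maximum at $x = \lambda\left((\beta-1)/\beta\right)^{1/\beta}$ and then decays, which matches the rise-then-fall shape; $S(x)$ is the survival function that gives the distribution its usual interpretation.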

“The agreement of the distribution of word sequence lengths between punctuation marks with the functional form of the Weibull distribution was better the more punctuation mark types we included in the analyses; for all punctuation marks, agreement was nearly complete. At the same time, some differences in distributions are evident between different languages, but these simply boil down to selecting slightly different values for the distribution parameters, specific to the language in question. Punctuation thus appears to be an integral part of all the languages studied,” notes Prof. Drozdz.

After a moment he adds with some amusement: “…and since Weibull’s distribution deals with such phenomena as survival, it can be said without too much irony that punctuation has in its nature a literally built-in struggle for survival.”

The next stage of the analyses was to determine the risk (hazard) function. In the case of punctuation, it describes how the conditional probability of success changes with sequence length, i.e. the probability that a punctuation mark comes next, given that no such mark has yet appeared in the analyzed sequence.
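This conditional probability can be estimated directly from a list of observed sequence lengths. The sketch below is a plain empirical estimate, not the authors' fitting procedure; the name `empirical_hazard` is hypothetical.

```python
from collections import Counter

def empirical_hazard(lengths):
    """Map each observed length k to the estimate of P(X = k | X >= k)."""
    counts = Counter(lengths)
    at_risk = len(lengths)   # sequences that have reached length k without a mark
    hazard = {}
    for k in sorted(counts):
        hazard[k] = counts[k] / at_risk   # share of those that end exactly at k
        at_risk -= counts[k]              # the rest continue to longer lengths
    return hazard

print(empirical_hazard([1, 1, 2, 3]))  # {1: 0.5, 2: 0.5, 3: 1.0}
```

The estimate becomes noisy at large k, where few sequences remain, which is one reason a parametric form such as the Weibull distribution is useful for comparing languages.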

The results here speak for themselves: the language with the least propensity to use punctuation is English, with Spanish not far behind; the Slavic languages turned out to be the most dependent on punctuation. The risk function curves for punctuation marks in six of the languages studied follow a similar pattern, differing mainly in vertical displacement.

German turned out to be the exception. Its risk function is the only one that intersects most of the curves constructed for the other languages. German punctuation thus seems to combine the punctuation characteristics of many languages, making it a kind of Esperanto of punctuation.

The above observation dovetails with the subsequent analysis, which examined whether the punctuation features of original literary works can be seen in their translations. As expected, the language that most faithfully transferred the punctuation of the source language into the target language turned out to be German.

In spoken communication, pauses can be justified by human physiology, such as the need to catch one’s breath or to take a moment to structure in one’s mind what needs to be said next. And in written communication?

“Creating a sentence by adding one word after another while making sure the message is clear and unambiguous is a bit like drawing the string of a bow: it’s easy at first, but it becomes more challenging with every moment. If there are no ordering elements in the text (and this is the role of punctuation), the difficulty of interpretation increases as the string of words gets longer. A bow drawn too tight can break, and a sentence that is too long can become incomprehensible, so the author is faced with the need to ‘release the arrow’, i.e. to close a passage of text with some sort of punctuation mark. This observation applies to all the languages analyzed, so it is what could be called a linguistic law,” says Dr. Tomasz Stanisz (IFJ PAN), first author of the article in question.

Finally, it is worth noting that the invention of punctuation is relatively recent: punctuation marks were not found in ancient texts at all. The emergence of optimal punctuation patterns in modern written languages can therefore be interpreted as a result of their evolutionary advancement. However, an excessive need for punctuation is not necessarily a sign of such refinement.

English and Spanish, which are also the most universal languages, appear, in light of the studies described above, to be less strict about how frequently punctuation is used. These languages are likely to be so formalized in terms of sentence construction that there is less room for ambiguity that would need to be resolved with punctuation marks.

More information:
Tomasz Stanisz et al, Universal Versus System Specific Features of Punctuation Usage Patterns in Major Western Languages, Chaos, Solitons & Fractals (2023). DOI: 10.1016/j.chaos.2023.113183

Provided by The Henryk Niewodniczanski Institute of Nuclear Physics Polish Academy of Sciences
