Lexical Complexity of Dante's Inferno

by Alexis Hope

A corpus-linguistics analysis of the original Italian text of the Inferno, the first canticle of the Divina Commedia, measuring vocabulary coverage thresholds and frequency distribution to quantify the reading challenge it presents to a learner of Italian.

Background

Dante Alighieri (1265–1321) was a Florentine poet whose Divina Commedia, written between approximately 1308 and 1320, the year before his death, is considered a cornerstone of world literature and a foundational text of the Italian language. The work comprises three canticles (Inferno, Purgatorio, and Paradiso) and traces an allegorical journey through the afterlife, guided at first by the Roman poet Virgil.

Its publication in vernacular Italian was itself a political act. Notable literature of the time was written in Latin, accessible only to the educated and affluent. Dante's choice of the Florentine dialect elevated it to a literary standard and contributed directly to the emergence of modern Italian as a national language.

For readers who come to the poem in translation, the difficulty is compounded by its formal constraints. The poem is written in terza rima: interlocking three-line stanzas with the rhyme scheme ABA BCB CDC. Italian's phonetic regularity and natural rhyme density make this structure achievable in a way that English cannot easily replicate. Translating for literal meaning loses the music; translating for the music loses precision. John Ciardi's widely read English translation navigates this by annotating each canto with historical and political context: the poem is dense with implied critiques of contemporary Florentine politics and church corruption that would have been legible to a reader of Dante's own time and are opaque to a modern one without scaffolding.

How difficult is it to read the Divina Commedia in its original Italian?

Methodology

The analysis uses Python and NLTK to perform frequency analysis on the original Italian text of the Inferno. The pipeline:

  1. Tokenization: the text is lowercased and split into tokens using nltk.word_tokenize() with the Italian punkt model, separating words from punctuation.
  2. Frequency distribution: nltk.FreqDist(tokens) produces a ranked count of every token in the corpus.
  3. Coverage analysis: for each vocabulary threshold N, the cumulative frequency of the top-N words is divided by the total token count to produce a comprehension percentage. This models how much of the text a reader could recognise if they knew exactly N words.
  4. Seven Sins keyword analysis: the Italian terms for the seven deadly sins (Superbia, Avarizia, Lussuria, Invidia, Gola, Ira, Accidia) are searched as stemmed patterns across the corpus to produce relative frequency distributions.
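
The first three steps above can be sketched in a few lines. This is a minimal illustration, not the code behind the figures in this article: it assumes a regular-expression tokenizer and collections.Counter as stand-ins for nltk.word_tokenize() and nltk.FreqDist (which is itself a Counter subclass), so that it runs without downloading the punkt model. The function names and the three-line sample are chosen for illustration only.

```python
import re
from collections import Counter

# Opening tercet of the Inferno, used here only as a tiny sample corpus.
SAMPLE = (
    "Nel mezzo del cammin di nostra vita\n"
    "mi ritrovai per una selva oscura,\n"
    "ché la diritta via era smarrita."
)

def tokenize(text):
    """Lowercase the text and return runs of letters, dropping punctuation.

    Assumption: a regex stand-in for
    nltk.word_tokenize(text, language="italian").
    """
    return re.findall(r"[^\W\d_]+", text.lower())

def coverage(freq, n):
    """Share of all tokens accounted for by the n most frequent types."""
    total = sum(freq.values())
    if total == 0:
        return 0.0
    top_n = sum(count for _, count in freq.most_common(n))
    return top_n / total

tokens = tokenize(SAMPLE)
# nltk.FreqDist(tokens) offers the same most_common() interface.
freq = Counter(tokens)

for n in (1, 5, 10):
    print(f"top {n:>2} types cover {coverage(freq, n):.1%} of tokens")
```

On a sample this small every word occurs once, so coverage grows linearly with N; the characteristic steep early rise only appears on a full-length text.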

The Inferno contains approximately 10,000 unique word forms, significantly more than is typical of contemporary Italian fiction, partly because medieval Italian orthography had not yet been standardised, so variant spellings of the same root are counted as distinct types. The truncated forms visible in the top-100 list (quest, quell, cant, vid) are elided forms common in Italian verse, where a final vowel is dropped before a word beginning with a vowel; they reflect a prosodic convention rather than distinct vocabulary items.
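
To see how elision inflates the type count, consider what a tokenizer does with apostrophes. The sketch below assumes a simple letters-only regular expression (an assumption for illustration, not necessarily the tokenizer behind the figures here). Under it, the elided and full forms of the same demonstrative are counted as separate types. The phrases are made up for the example.

```python
import re
from collections import Counter

def tokenize(text):
    # Letters only: apostrophes (straight or typographic) act as separators.
    return re.findall(r"[^\W\d_]+", text.lower())

# Hypothetical phrases showing one demonstrative in three surface forms.
phrases = "quello spirito, quell'ombra, quel giorno, quell'anima"

freq = Counter(tokenize(phrases))

# "quell" appears as its own type, separate from "quello" and "quel".
print(freq["quell"], freq["quello"], freq["quel"])
print(len(freq), "types from", sum(freq.values()), "tokens")
```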

The coverage curve follows the pattern predicted by Zipf's Law: a small number of high-frequency types account for a disproportionate share of tokens. Knowing 100 words gives approximately 50% comprehension by token coverage — comparable to Harry Potter in German, though with a harder path to full coverage given the larger unique type count. Nation's research on reading comprehension suggests 98% coverage is the threshold for unsupported reading; the charts below show how far Dante's vocabulary demands stretch that curve.
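
The relationship between vocabulary size and coverage can also be read in the other direction: given a target such as the 98% threshold, find the smallest N that reaches it. The following is a minimal sketch, assuming the counts are available as a Counter (or an nltk.FreqDist). The function name words_needed and the toy counts are hypothetical and are not taken from the Inferno.

```python
from collections import Counter
from itertools import accumulate

def words_needed(freq, target):
    """Smallest number of top-ranked types whose cumulative share
    of all tokens reaches `target` (a fraction between 0 and 1)."""
    total = sum(freq.values())
    counts = [count for _, count in freq.most_common()]
    for n, running in enumerate(accumulate(counts), start=1):
        if running / total >= target:
            return n
    return len(counts)

# Toy Zipf-like counts, made up for illustration only.
toy = Counter({
    "e": 50, "che": 25, "la": 12, "di": 6,
    "selva": 4, "oscura": 2, "smarrita": 1,
})

for target in (0.50, 0.90, 0.98):
    print(f"{target:.0%} coverage needs {words_needed(toy, target)} types")
```

Even in this toy distribution the pattern is visible: the first type alone reaches half of the tokens, while the last few percentage points require most of the remaining vocabulary.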

Comprehension relative to known words

Frequency Distribution of the Seven Sins
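
The keyword search behind this distribution can be sketched as a prefix match over lowercase tokens. This is an illustration rather than the original analysis code: the stems below are hand-picked assumptions, and a bare prefix match over-counts (the stem "gol" also matches "golfo", as the toy data shows), so real stems would need checking against the text.

```python
from collections import Counter

# Hypothetical stems for the seven sins, chosen by hand for illustration.
SIN_STEMS = {
    "Superbia": "superb",
    "Avarizia": "avar",
    "Lussuria": "lussuri",
    "Invidia": "invidi",
    "Gola": "gol",
    "Ira": "ira",
    "Accidia": "accidi",
}

def sin_counts(tokens):
    """Count lowercase tokens that start with each sin's stem."""
    counts = Counter({sin: 0 for sin in SIN_STEMS})
    for token in tokens:
        for sin, stem in SIN_STEMS.items():
            if token.startswith(stem):
                counts[sin] += 1
    return counts

def relative(counts):
    """Each sin's share of all sin-keyword matches."""
    total = sum(counts.values())
    return {sin: (n / total if total else 0.0) for sin, n in counts.items()}

# Made-up token list, not drawn from the poem.
tokens = ["superbo", "avaro", "avarizia", "ira",
          "invidia", "gola", "golfo", "selva"]

counts = sin_counts(tokens)
print(counts)
print(relative(counts))
```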

Key word density of top 1,000 words

Top 100 words in the Inferno

The distribution is dominated by function words (articles, prepositions, conjunctions), consistent with Zipf's Law: the 100 most frequent types account for roughly half of all tokens, and most of them are grammatical rather than lexical. A reader who knows these 100 words already recognises a large proportion of the text by occurrence, even without grasping the content words that carry the narrative.

Word        Count
e           4282
che         3786
la          2515
di          2104
a           1983
non         1477
per         1410
in          1153
si          1134
l           959
com         852
le          828
io          827
de          824
li          822
sì          805
mi          739
il          708
più         673
da          663
con         660
del         649
(missing)   642
lo          599
chi         578
se          561
quest       542
al          516
ma          510
quell       476
tu          476
tutt        463
ne          449
qual        382
suo         382
quel        380
nel         376
i           347
me          339
quand       332
son         329
tant        325
poi         320
o           313
fu          312
così        310
mio         308
lor         304
un          302
sua         301
diss        298
ti          286
là          278
ben         277
prim        270
già         266
sol         260
vid         251
noi         250
laltr       247
lui         246
ved         230
qui         230
dov         223
era         219
occhi       219
quant       218
cant        211
sé          210
perché      208
ché         197
mia         197
ancor       194
part        191
ciel        182
ver         181
pur         181
fa          180
tal         178
nostr       177
dal         175
né          174
gent        170
or          165
ed          164
ciò         163
però        163
giù         163
cui         161
fuor        159
sanz        157
lun         157
poc         154
ad          152
tra         150
esser       150
fatt        149
mond        149
cha         148
ven         145
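
The dominance of function words can be checked numerically. The sketch below takes a handful of rows from the top-100 list above and computes the share of their occurrences that belongs to function words. The function-word set is a hand-labelled assumption for these few rows; a fuller check would classify all 100 entries or use a complete Italian stopword list.

```python
# A few rows copied from the top-100 list above: (word, count).
rows = [
    ("e", 4282), ("che", 3786), ("la", 2515), ("di", 2104), ("a", 1983),
    ("occhi", 219), ("ciel", 182), ("gent", 170), ("mond", 149),
]

# Assumption: hand-labelled function words among the rows above.
FUNCTION_WORDS = {"e", "che", "la", "di", "a"}

def function_word_share(rows, function_words):
    """Fraction of the listed occurrences that belong to function words."""
    total = sum(count for _, count in rows)
    functional = sum(count for word, count in rows if word in function_words)
    return functional / total if total else 0.0

share = function_word_share(rows, FUNCTION_WORDS)
print(f"{share:.1%} of these occurrences are function words")
```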