Lexical Complexity of Dante's Inferno

A corpus linguistics analysis of the original Italian text of the Inferno, the first canticle of Dante's Divina Commedia, measuring vocabulary coverage thresholds and frequency distributions to quantify the reading challenge it presents to a learner of Italian.

Background

Dante Alighieri (1265–1321) was a Florentine poet whose Divina Commedia, written between approximately 1308 and 1320, the year before his death, is considered a cornerstone of world literature and a foundational text of the Italian language. The work comprises three canticles (Inferno, Purgatorio, and Paradiso) and traces an allegorical journey through the afterlife, guided first by the Roman poet Virgil and later by Beatrice.

Its publication in vernacular Italian was itself a political act. Notable literature of the time was written in Latin, accessible only to the educated and affluent. Dante's choice of the Florentine dialect elevated it to a literary standard and contributed directly to the emergence of modern Italian as a national language.

The challenge of translation is compounded by the work's formal constraints. The poem is written in terza rima: interlocking three-line stanzas with the rhyme scheme ABA BCB CDC. Italian's phonetic regularity and natural rhyme density made this structure achievable in a way that English cannot easily replicate. Translating for literal meaning loses the music; translating for the music loses precision. John Ciardi's widely read English translation navigates this by annotating each canto with historical and political context: the poem is dense with implied critiques of contemporary Florentine politics and church corruption that would have been legible to a reader of Dante's time and are opaque to a modern one without scaffolding.

How difficult is it to read the Divina Commedia in its original Italian?

Methodology

The analysis uses Python and NLTK to perform frequency analysis on the original Italian text of the Inferno. The pipeline:

Tokenization — the text is lowercased and split into tokens using nltk.word_tokenize() with the Italian punkt model, separating words from punctuation.
Frequency distribution — nltk.FreqDist(tokens) produces a ranked count of every token in the corpus.
Coverage analysis — for each vocabulary threshold N, the cumulative frequency of the top-N words is divided by the total token count to produce a comprehension percentage. This models how much of the text a reader could recognise if they knew exactly N words.
Seven Sins keyword analysis — the Italian terms for the seven deadly sins (Superbia, Avarizia, Lussuria, Invidia, Gola, Ira, Accidia) are searched as stemmed patterns across the corpus to produce relative frequency distributions.
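
The first three steps can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the analysis code itself: a regular expression stands in for nltk.word_tokenize() (and, unlike it, drops punctuation instead of emitting it as tokens), collections.Counter stands in for nltk.FreqDist (which is a Counter subclass), and SAMPLE, tokenize, and coverage are hypothetical names introduced here.

```python
import re
from collections import Counter

# Hypothetical sample: the opening tercet, standing in for the full text.
SAMPLE = (
    "Nel mezzo del cammin di nostra vita\n"
    "mi ritrovai per una selva oscura,\n"
    "ché la diritta via era smarrita."
)

def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters; punctuation is dropped."""
    return re.findall(r"[^\W\d_]+", text.lower())

def coverage(freq: Counter, n: int) -> float:
    """Share of all tokens accounted for by the n most frequent types."""
    total = sum(freq.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in freq.most_common(n))
    return top / total

tokens = tokenize(SAMPLE)
freq = Counter(tokens)  # same counting interface as nltk.FreqDist

for n in (1, 5, 10):
    print(n, round(coverage(freq, n), 3))
```

On a three-line sample every word occurs once, so coverage grows linearly with n; the characteristic steep early rise only appears once the whole text is counted.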

The Inferno contains approximately 10,000 unique word forms, significantly more than typical contemporary Italian fiction, partly because medieval Italian orthography had not yet standardised, so variant spellings of the same root are counted as distinct types. The truncated forms visible in the top-100 list (quest, quell, cant, vid) reflect elision and truncation, both common in Italian verse: final vowels are dropped before words beginning with a vowel, or cut to fit the metre. They are a prosodic convention rather than distinct vocabulary items.
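
How an elided form ends up as its own entry depends on what the tokenizer does with the apostrophe. The sketch below contrasts two plausible treatments; which one the original pipeline applied is not stated, and the function names and sample phrases are hypothetical illustrations, not quotations checked against an edition.

```python
import re

LETTERS = re.compile(r"[^\W\d_]+")

def split_on_apostrophe(text: str) -> list[str]:
    """Treat the apostrophe as a separator, leaving the elided stub as a token."""
    return LETTERS.findall(text.lower())

def delete_apostrophe(text: str) -> list[str]:
    """Remove apostrophes first, which fuses the elided word with its neighbour."""
    cleaned = text.lower().replace("'", "").replace("\u2019", "")
    return LETTERS.findall(cleaned)

print(split_on_apostrophe("quell'ombra"))  # ['quell', 'ombra']
print(delete_apostrophe("l'altro"))        # ['laltro']
```

Splitting yields stubs like quell and quest, while deleting the apostrophe yields fused tokens; either way, one underlying word can surface as several distinct types and inflate the unique-form count.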

The coverage curve follows the pattern predicted by Zipf's Law: a small number of high-frequency types account for a disproportionate share of tokens. Knowing the 100 most frequent words gives approximately 50% comprehension as measured by token coverage, comparable to Harry Potter in German, though with a harder path to full coverage given the larger number of unique types. Nation's research on reading comprehension suggests 98% coverage is the threshold for unsupported reading; the charts below show how far Dante's vocabulary demands stretch that curve.
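
The same curve can be read in the other direction: given a target coverage such as 98%, how many of the most frequent types does a reader need? A minimal sketch, assuming a frequency table shaped like a Counter; types_needed and the toy counts are hypothetical and are not figures from the Inferno.

```python
from collections import Counter
from itertools import accumulate

def types_needed(freq: Counter, target: float) -> int:
    """Smallest number of top-ranked types whose cumulative share reaches target."""
    total = sum(freq.values())
    if total == 0:
        raise ValueError("empty frequency distribution")
    counts = sorted(freq.values(), reverse=True)
    for rank, running in enumerate(accumulate(counts), start=1):
        if running / total >= target:
            return rank
    return len(counts)

# Toy distribution with a steep head and a thin tail (100 tokens in total).
toy = Counter({"e": 50, "che": 25, "la": 15, "selva": 6, "oscura": 4})

print(types_needed(toy, 0.50))  # 1
print(types_needed(toy, 0.80))  # 3
print(types_needed(toy, 0.98))  # 5
```

Even in this toy case the last few percentage points require every remaining type, which is the effect a long tail of rare word forms has on the path to 98%.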

Comprehension relative to known words

Frequency Distribution of the Seven Sins
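
The counts behind a chart like this could be produced by matching each token against a stem for every sin, as the keyword step in the methodology describes. The sketch below is an assumption about how that matching might look: the stems in SIN_STEMS are guesses chosen for illustration, and sin_counts and relative are hypothetical names.

```python
from collections import Counter

# Hypothetical stems: each is a guessed prefix meant to catch inflected forms.
# Short stems over-match (for example "gol" also matches "golfo"), so a real
# analysis would need a tighter pattern or a manual check of the hits.
SIN_STEMS = {
    "Superbia": "superb",
    "Avarizia": "avar",
    "Lussuria": "lussur",
    "Invidia": "invidi",
    "Gola": "gol",
    "Ira": "ira",
    "Accidia": "accidi",
}

def sin_counts(tokens: list[str]) -> Counter:
    """Count tokens that begin with each sin's stem."""
    counts = Counter({sin: 0 for sin in SIN_STEMS})
    for token in tokens:
        for sin, stem in SIN_STEMS.items():
            if token.startswith(stem):
                counts[sin] += 1
    return counts

def relative(counts: Counter) -> dict[str, float]:
    """Convert raw counts into shares of all sin-related matches."""
    total = sum(counts.values())
    return {sin: (n / total if total else 0.0) for sin, n in counts.items()}

demo = ["superbia", "superbo", "avaro", "ira", "gola", "selva"]
print(relative(sin_counts(demo)))
```

Prefix matching keeps the sketch short; swapping in a proper Italian stemmer would change which forms are caught and therefore the shape of the distribution.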

Key word density of top 1,000 words

Top 100 words in the Inferno

The distribution is dominated by function words (articles, prepositions, conjunctions), consistent with Zipf's Law: the 100 most frequent types account for roughly half of all tokens, and most of them are grammatical rather than lexical. A reader who knows these 100 words already recognises a large proportion of the text by occurrence, even without grasping the content words that carry the narrative.

e: 4282
che: 3786
la: 2515
di: 2104
a: 1983
non: 1477
per: 1410
in: 1153
si: 1134
l: 959
com: 852
le: 828
io: 827
de: 824
li: 822
sì: 805
mi: 739
il: 708
più: 673
da: 663
con: 660
del: 649
è: 642
lo: 599
chi: 578
se: 561
quest: 542
al: 516
ma: 510
quell: 476
tu: 476
tutt: 463
ne: 449
qual: 382
suo: 382
quel: 380
nel: 376
i: 347
me: 339
quand: 332
son: 329
tant: 325
poi: 320
o: 313
fu: 312
così: 310
mio: 308
lor: 304
un: 302
sua: 301
diss: 298
ti: 286
là: 278
ben: 277
prim: 270
già: 266
sol: 260
vid: 251
noi: 250
laltr: 247
lui: 246
ved: 230
qui: 230
dov: 223
era: 219
occhi: 219
quant: 218
cant: 211
sé: 210
perché: 208
ché: 197
mia: 197
ancor: 194
part: 191
ciel: 182
ver: 181
pur: 181
fa: 180
tal: 178
nostr: 177
dal: 175
né: 174
gent: 170
or: 165
ed: 164
ciò: 163
però: 163
giù: 163
cui: 161
fuor: 159
sanz: 157
lun: 157
poc: 154
ad: 152
tra: 150
esser: 150
fatt: 149
mond: 149
cha: 148
ven: 145