Lexical Complexity of Dante's Inferno
by Alexis Hope
A corpus linguistics analysis of the original Italian text of the Divina Commedia — measuring vocabulary coverage thresholds and frequency distribution to quantify the reading challenge it presents to a learner of Italian.
Background
Dante Alighieri (1265–1321) was a Florentine poet whose Divina Commedia, written between approximately 1308 and 1320, the year before his death, is considered a cornerstone of world literature and a foundational text of the Italian language. The work comprises three canticles (Inferno, Purgatorio, and Paradiso) and traces an allegorical journey through the afterlife, guided through Hell and most of Purgatory by the Roman poet Virgil.
Its publication in vernacular Italian was itself a political act. Notable literature of the time was written in Latin, accessible only to the educated and affluent. Dante's choice of the Florentine dialect elevated it to a literary standard and contributed directly to the emergence of modern Italian as a national language.
The translation challenge is compounded by the work's formal constraints. The poem is written in terza rima: interlocking three-line stanzas with the rhyme scheme ABA BCB CDC. Italian's phonetic regularity and natural rhyme density make this structure achievable in a way that English cannot easily replicate. Translating for literal meaning loses the music; translating for the music loses precision. John Ciardi's widely read English translation navigates this by annotating each canto with historical and political context: the poem is dense with implied critiques of Florentine politics and church corruption that would have been legible to a contemporary reader and are opaque to a modern one without scaffolding.
How difficult is it to read the Divina Commedia in its original Italian?
Methodology
The analysis uses Python and NLTK to perform frequency analysis on the original Italian text of the Inferno. The pipeline:
- Tokenization: the text is lowercased and split into tokens using nltk.word_tokenize() with the Italian punkt model, separating words from punctuation.
- Frequency distribution: nltk.FreqDist(tokens) produces a ranked count of every token in the corpus.
- Coverage analysis: for each vocabulary threshold N, the cumulative frequency of the top-N words is divided by the total token count to produce a comprehension percentage. This models how much of the text a reader could recognise if they knew exactly N words.
- Seven Sins keyword analysis: the Italian terms for the seven deadly sins (Superbia, Avarizia, Lussuria, Invidia, Gola, Ira, Accidia) are searched as stemmed patterns across the corpus to produce relative frequency distributions.
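The four steps above can be sketched in a few lines. This is a minimal illustration, not the original pipeline: a regular expression stands in for nltk.word_tokenize(), collections.Counter stands in for nltk.FreqDist (which is itself a Counter subclass), and the names tokenize, coverage, stem_counts, and SIN_STEMS, along with the stem spellings, are assumptions made for the example. The sample text is two short passages from the Inferno.

```python
import re
from collections import Counter

# Assumption: runs of letters are tokens; punctuation and apostrophes separate them.
TOKEN_RE = re.compile(r"[^\W\d_]+")

# Hypothetical stems for the seven sins; the original patterns are not given.
SIN_STEMS = {
    "superbia": "superb", "avarizia": "avar", "lussuria": "lussur",
    "invidia": "invid", "gola": "gol", "ira": "ira", "accidia": "accid",
}

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN_RE.findall(text.lower())

def coverage(freq, n):
    """Share of all tokens accounted for by the n most frequent types."""
    total = sum(freq.values())
    top_n = sum(count for _, count in freq.most_common(n))
    return top_n / total if total else 0.0

def stem_counts(freq, stems):
    """Count tokens that begin with each stem (a crude prefix match)."""
    return {
        name: sum(c for tok, c in freq.items() if tok.startswith(stem))
        for name, stem in stems.items()
    }

coverage_sample = (
    "Per me si va ne la città dolente,\n"
    "per me si va ne l'etterno dolore,\n"
    "per me si va tra la perduta gente."
)
sin_sample = (
    "superbia, invidia e avarizia sono\n"
    "le tre faville c'hanno i cuori accesi"
)

tokens = tokenize(coverage_sample)
freq = Counter(tokens)
print(freq.most_common(4))
print(f"coverage with top 4 types: {coverage(freq, 4):.0%}")

sins = stem_counts(Counter(tokenize(sin_sample)), SIN_STEMS)
print(sins)
```

Prefix matching is deliberately crude: a stem such as "gol" would also match unrelated words like "golfo", so a real analysis would need a reviewed pattern list.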
The Inferno contains approximately 10,000 unique word forms, significantly more than typical contemporary Italian fiction. This is partly because medieval Italian orthography had not yet standardised, so variant spellings of the same root are counted as distinct types. The truncated forms visible in the top-100 list (quest, quell, cant, vid) are elided forms common in Italian verse, where final vowels are dropped before words beginning with a vowel. They reflect a prosodic convention rather than distinct vocabulary items.
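How apostrophes are handled before tokenization changes which types are counted. Entries in the top-100 list such as laltr, lun, and cha look as though an apostrophe was dropped so that two words fused into one; that is an inference from the list, not something the pipeline description states. The sketch below compares two hypothetical treatments on a few illustrative fragments (not a quotation from the poem); both function names are made up for the example.

```python
import re

LETTERS = re.compile(r"[^\W\d_]+")
APOSTROPHES = "'\u2019"  # straight and typographic apostrophes

def strip_apostrophes(text):
    """Delete apostrophes first, so an elided word fuses with its neighbour."""
    cleaned = text.lower().translate({ord(ch): None for ch in APOSTROPHES})
    return LETTERS.findall(cleaned)

def split_on_apostrophes(text):
    """Treat apostrophes as separators, so each elided form stays a token."""
    return LETTERS.findall(text.lower())

fragments = "l'altro ch'a l'un dov'è"

fused = strip_apostrophes(fragments)
split = split_on_apostrophes(fragments)
print(fused)
print(split)
```

Under the first treatment every fused pair becomes a new type, which inflates the unique-word count; under the second, the elided article "l" is counted once as a single high-frequency type.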
The coverage curve follows the pattern predicted by Zipf's Law: a small number of high-frequency types account for a disproportionate share of tokens. Knowing the 100 most frequent words gives approximately 50% comprehension by token coverage. That is comparable to Harry Potter in German, though the path to full coverage is harder given the larger number of unique types. Nation's research on reading comprehension suggests 98% coverage is the threshold for unsupported reading; the charts below show how far Dante's vocabulary demands stretch that curve.
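The same coverage calculation can be inverted to ask how many words a reader needs for a given target, such as the 98% threshold. The sketch below is an illustration under stated assumptions: words_needed is a hypothetical helper, and the frequency table is synthetic, built so that the count at rank r is proportional to 1/r. It is not data from the poem.

```python
from collections import Counter
from itertools import accumulate

def words_needed(freq, target):
    """Smallest N such that the N most frequent types cover `target` of all tokens."""
    total = sum(freq.values())
    if total == 0:
        return 0
    counts = sorted(freq.values(), reverse=True)
    for n, running in enumerate(accumulate(counts), start=1):
        if running >= target * total:
            return n
    return len(counts)

# Synthetic Zipf-like vocabulary of 500 types (an assumption, not corpus data).
zipf_like = Counter({f"w{rank}": round(1000 / rank) for rank in range(1, 501)})

for target in (0.50, 0.80, 0.98):
    print(f"{target:.0%} coverage needs {words_needed(zipf_like, target)} types")
```

On any distribution with a long tail, the step from 80% to 98% costs far more vocabulary than the step from 0% to 50%, which is the shape the coverage curve describes.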
[Chart: Comprehension relative to known words]
[Chart: Frequency Distribution of the Seven Sins]
[Chart: Key word density of top 1,000 words]
Top 100 words in the Inferno
The distribution is dominated by function words (articles, prepositions, conjunctions), as Zipf's Law predicts: the 100 most frequent types account for roughly half of all tokens, and most of them are grammatical rather than lexical. A reader who knows these 100 words already recognises a large proportion of the text by occurrence, even without grasping the content words that carry the narrative.
- e: 4282
- che: 3786
- la: 2515
- di: 2104
- a: 1983
- non: 1477
- per: 1410
- in: 1153
- si: 1134
- l: 959
- com: 852
- le: 828
- io: 827
- de: 824
- li: 822
- sì: 805
- mi: 739
- il: 708
- più: 673
- da: 663
- con: 660
- del: 649
- è: 642
- lo: 599
- chi: 578
- se: 561
- quest: 542
- al: 516
- ma: 510
- quell: 476
- tu: 476
- tutt: 463
- ne: 449
- qual: 382
- suo: 382
- quel: 380
- nel: 376
- i: 347
- me: 339
- quand: 332
- son: 329
- tant: 325
- poi: 320
- o: 313
- fu: 312
- così: 310
- mio: 308
- lor: 304
- un: 302
- sua: 301
- diss: 298
- ti: 286
- là: 278
- ben: 277
- prim: 270
- già: 266
- sol: 260
- vid: 251
- noi: 250
- laltr: 247
- lui: 246
- ved: 230
- qui: 230
- dov: 223
- era: 219
- occhi: 219
- quant: 218
- cant: 211
- sé: 210
- perché: 208
- ché: 197
- mia: 197
- ancor: 194
- part: 191
- ciel: 182
- ver: 181
- pur: 181
- fa: 180
- tal: 178
- nostr: 177
- dal: 175
- né: 174
- gent: 170
- or: 165
- ed: 164
- ciò: 163
- però: 163
- giù: 163
- cui: 161
- fuor: 159
- sanz: 157
- lun: 157
- poc: 154
- ad: 152
- tra: 150
- esser: 150
- fatt: 149
- mond: 149
- cha: 148
- ven: 145
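The counts listed above are enough to see how steeply frequency falls off even inside the top 100. The sketch below copies those counts in rank order and reports what share of the listed occurrences the first k words account for. The helper name share_of_listed is hypothetical, and the shares are relative to the 100 listed words only, not to the whole corpus, whose total token count is not given here.

```python
# Counts copied from the top-100 list above, in rank order.
TOP_100_COUNTS = [
    4282, 3786, 2515, 2104, 1983, 1477, 1410, 1153, 1134, 959,
    852, 828, 827, 824, 822, 805, 739, 708, 673, 663,
    660, 649, 642, 599, 578, 561, 542, 516, 510, 476,
    476, 463, 449, 382, 382, 380, 376, 347, 339, 332,
    329, 325, 320, 313, 312, 310, 308, 304, 302, 301,
    298, 286, 278, 277, 270, 266, 260, 251, 250, 247,
    246, 230, 230, 223, 219, 219, 218, 211, 210, 208,
    197, 197, 194, 191, 182, 181, 181, 180, 178, 177,
    175, 174, 170, 165, 164, 163, 163, 163, 161, 159,
    157, 157, 154, 152, 150, 150, 149, 149, 148, 145,
]

def share_of_listed(counts, k):
    """Fraction of the listed occurrences taken by the first k entries."""
    return sum(counts[:k]) / sum(counts)

for k in (10, 25, 50):
    print(f"top {k:>2} of 100: {share_of_listed(TOP_100_COUNTS, k):.1%}")
```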