Analysis of Harry Potter in German

A frequency analysis of Harry Potter und der Stein der Weisen using Python and NLTK, exploring what corpus linguistics reveals about the vocabulary threshold for reading comprehension in a second language.

Methodology

The source text was processed with NLTK (Natural Language Toolkit) in Python. The pipeline: tokenize the raw text with nltk.word_tokenize, lowercase the tokens and strip punctuation, then build a frequency distribution with nltk.FreqDist. The truncated forms in the top-100 list at the end (hatt, konnt, üb) suggest that tokens were also stemmed before counting, so the counts refer to stems rather than surface forms. Stopwords were deliberately not removed: function words such as articles, pronouns, and conjunctions are exactly what a language learner needs to acquire, and stripping them would distort the comprehension model.
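The pipeline described above can be sketched in a few lines. This is a minimal illustration rather than the original script: the function name build_frequency_table and the sample sentence are made up, and the default tokenizer is a plain whitespace split so the snippet runs without NLTK data. With NLTK installed, a wrapper around nltk.word_tokenize(text, language="german") could be passed as the tokenize argument, and nltk.FreqDist (a Counter subclass) could replace Counter.

```python
import string
from collections import Counter

# Extra quotation marks common in German typesetting (assumed, adjust as needed).
EXTRA_PUNCTUATION = "»«„“”…"

def build_frequency_table(raw_text, tokenize=None):
    """Tokenize, lowercase, strip punctuation, and count word frequencies."""
    if tokenize is None:
        # Stand-in tokenizer (assumption): split on whitespace.
        tokenize = str.split
    tokens = []
    for token in tokenize(raw_text):
        # Lowercase, then trim punctuation from both ends of the token.
        cleaned = token.lower().strip(string.punctuation + EXTRA_PUNCTUATION)
        if cleaned:
            tokens.append(cleaned)
    # Stopwords are intentionally kept: function words are part of the model.
    return Counter(tokens)

# Made-up sample sentence, not a quotation from the book.
sample = "Harry sah den Stein. Der Stein war klein, und Harry sah ihn."
freq = build_frequency_table(sample)
print(freq.most_common(3))
```

Tokens that consist only of punctuation become empty after stripping and are dropped, which is why the same cleaning step works for tokenizers that split punctuation into separate tokens.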

The cumulative coverage curve was computed by sorting word types by descending frequency and calculating each type's running percentage of the total token count. The house name frequency distributions were extracted by filtering FreqDist results against a known list of proper nouns and plotting where each name occurs across the corpus.
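As a rough sketch of the coverage calculation, the snippet below walks the types from most to least frequent and records the running share of all tokens. The function name cumulative_coverage and the toy counts are illustrative assumptions, not the article's data.

```python
from collections import Counter

def cumulative_coverage(freq):
    """Return (rank, word, cumulative percent of tokens) by descending frequency."""
    total = sum(freq.values())
    running = 0
    curve = []
    for rank, (word, count) in enumerate(freq.most_common(), start=1):
        running += count
        curve.append((rank, word, 100.0 * running / total))
    return curve

# Toy counts for illustration only.
toy = Counter({"und": 5, "er": 3, "stein": 1, "eule": 1})
curve = cumulative_coverage(toy)
for rank, word, percent in curve:
    print(rank, word, round(percent, 1))
```

Plotting rank against the cumulative percentage gives the kind of curve discussed in the sections that follow.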

How soon can I start reading in a target language?

Sooner than you might expect. In Harry Potter und der Stein der Weisen, the 100 most frequent words account for over 50% of all tokens. The corpus contains 82,390 tokens, of which 7,364 are unique types. That is a type-token ratio of about 0.09, which reflects the natural redundancy of narrative prose.

This steep coverage curve is an expression of Zipf's law: in natural language corpora, a word's frequency is approximately inversely proportional to its frequency rank. In the idealised form, the most common word appears twice as often as the second most common, three times as often as the third, and so on. Real counts only approximate this (here und appears 2,604 times and ein 1,955), but the overall shape holds: a small high-frequency vocabulary covers a disproportionately large share of any text.
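Zipf's law is an idealisation, so a quick check against real counts is useful. The sketch below takes the first five counts from the top-100 list at the end of this article and compares them with the frequency a pure 1/rank law would predict from the top count. The helper name zipf_expected is an assumption made for this example.

```python
# First five entries of the top-100 list (counts of stems).
top_counts = [("und", 2604), ("ein", 1955), ("er", 1824), ("der", 1734), ("die", 1718)]

def zipf_expected(top_frequency, rank):
    """Frequency predicted for a rank under an idealised 1/rank law."""
    return top_frequency / rank

top_frequency = top_counts[0][1]
rows = []
for rank, (word, observed) in enumerate(top_counts, start=1):
    expected = zipf_expected(top_frequency, rank)
    # A ratio above 1 means the word is more frequent than the ideal law predicts.
    rows.append((rank, word, observed, expected, observed / expected))

for rank, word, observed, expected, ratio in rows:
    print(rank, word, observed, round(expected, 1), round(ratio, 2))
```

For these five entries the observed counts fall off more slowly than 1/rank, which is a reminder that the law describes the overall shape of the distribution better than it describes any individual rank.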

Knowing the 400 most frequent words covers 70% of the tokens; 2,000 words cover 90%. Beyond that threshold, each additional percentage point of coverage requires significantly more vocabulary, but at 90% coverage, context becomes a strong enough signal that unfamiliar words can often be inferred from the surrounding text.
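The question "how many words do I need for a given coverage?" can be answered directly from a frequency table. The following is a small sketch under the assumption that the learner acquires words strictly in frequency order. The function name words_needed and the toy counts are hypothetical.

```python
from collections import Counter

def words_needed(freq, target_percent):
    """Smallest number of top-ranked types whose tokens reach the target coverage."""
    total = sum(freq.values())
    running = 0
    for n, (_, count) in enumerate(freq.most_common(), start=1):
        running += count
        # Integer comparison avoids floating-point rounding at the boundary.
        if 100 * running >= target_percent * total:
            return n
    return len(freq)

# Toy counts summing to 100 tokens, for illustration only.
toy = Counter({"und": 50, "er": 20, "harry": 10, "stein": 10, "eule": 5, "besen": 5})
for target in (50, 70, 90, 100):
    print(target, words_needed(toy, target))
```

Run against the full frequency table, the same function would return the vocabulary sizes behind figures such as 70% and 90% coverage.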

Comprehension relative to known words

The curve above illustrates the diminishing returns clearly: coverage rises steeply through the first 1,000 words, then flattens. This is consistent with the frequency distributions typically observed in corpora of European languages.

100 words account for more than 50% of the content.
3,727 words appear only once, accounting for less than 5% of the content.

Those 3,727 single-occurrence words are hapax legomena, from the Greek for "said only once." In corpus linguistics, the hapax count is used as a measure of lexical richness; a high hapax-to-token ratio indicates a diverse vocabulary. In the German translation of Rowling's novel, hapaxes make up 50.6% of the unique vocabulary (3,727 of 7,364 types) but contribute minimally to overall coverage, which is why the coverage curve flattens so sharply past the 2,000-word mark.
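The two hapax ratios can be reproduced from the figures quoted in this article, and the same statistics can be computed from any frequency table. The sketch below assumes a Counter-like table. The function name hapax_stats is made up for this example, and nltk.FreqDist also offers a hapaxes() method that returns the single-occurrence words directly.

```python
from collections import Counter

def hapax_stats(freq):
    """Count hapaxes and express them as a share of types and of tokens."""
    hapax_count = sum(1 for count in freq.values() if count == 1)
    return {
        "hapaxes": hapax_count,
        "share_of_types": hapax_count / len(freq),
        "share_of_tokens": hapax_count / sum(freq.values()),
    }

# Figures reported in this article.
HAPAXES, TYPES, TOKENS = 3727, 7364, 82390
reported_share_of_types = HAPAXES / TYPES
reported_share_of_tokens = HAPAXES / TOKENS
print(round(100 * reported_share_of_types, 1))   # share of unique vocabulary
print(round(100 * reported_share_of_tokens, 1))  # share of all tokens

# Toy table for illustration only.
toy_stats = hapax_stats(Counter("a a b c".split()))
```

Half of the vocabulary accounting for under a twentieth of the running text is the long tail of the distribution in numerical form.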

The practical implication for language learners is significant: the entry cost to basic reading comprehension is low. Apart from a handful of character names, the top 100 words in this corpus are overwhelmingly function words: articles (der, die, das, ein), pronouns (er, sie, ich, es), prepositions, and conjunctions. These are acquired early in any structured language course and transfer immediately to reading comprehension.

Frequency Distribution of Hogwarts School Houses

Proper nouns are an underappreciated comprehension resource for learners approaching a familiar text. Character names and place names recognised from the English source material provide free anchor points: tokens whose meaning is immediately known regardless of German vocabulary level. In this corpus, named characters appear in the top 300 words and account for roughly 4% of total content; the nine character names in the top-100 list alone sum to 3,371 of the 82,390 tokens.
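As a cross-check on the share contributed by names, the character-name stems visible in the top-100 list at the end of this article can be summed directly. Which entries count as names is a judgment made for this example (titles such as professor and onkel are left out), so the result is a lower bound for the top 100 only, not a full count.

```python
TOTAL_TOKENS = 82390  # corpus size reported in this article

# Character-name stems and counts copied from the top-100 list.
name_counts = {
    "harry": 1422,
    "ron": 425,
    "hagrid": 366,
    "potter": 314,
    "hermin": 272,
    "snap": 175,
    "dumbledor": 161,
    "dudley": 119,
    "vernon": 117,
}

name_tokens = sum(name_counts.values())
name_share = 100.0 * name_tokens / TOTAL_TOKENS
print(name_tokens, round(name_share, 1))
```

These nine names alone sum to 3,371 tokens, roughly 4% of the corpus, all of it vocabulary a reader of the English original already knows.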

The house name distribution above shows when each house is first mentioned across the narrative arc. Slytherin and Hufflepuff appear earliest, in the scene at Madam Malkin's robe shop in Diagon Alley where Malfoy mentions them, before Gryffindor and Ravenclaw are established. For a learner, these named landmarks help frame the context around unfamiliar vocabulary, loosely analogous to the way named entity recognition gives an NLP pipeline fixed reference points in a text.
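A positional view of this kind needs the token sequence rather than just the counts. The sketch below records the token offsets of each house name and the relative position of its first mention. It assumes exact lowercase matches (stemmed or inflected forms would need extra handling), the function names are hypothetical, and the toy sentence is invented rather than quoted from the book.

```python
HOUSES = ("gryffindor", "hufflepuff", "ravenclaw", "slytherin")

def house_positions(tokens, houses=HOUSES):
    """Map each house name to the token offsets where it occurs."""
    positions = {house: [] for house in houses}
    for offset, token in enumerate(tokens):
        key = token.lower()
        if key in positions:
            positions[key].append(offset)
    return positions

def first_mentions(tokens, houses=HOUSES):
    """Relative position (0 to 1) of each house's first mention, or None."""
    total = len(tokens)
    return {
        house: (offsets[0] / total if offsets else None)
        for house, offsets in house_positions(tokens, houses).items()
    }

# Invented toy sequence for illustration only.
toy_tokens = "malfoy sagt slytherin und nicht hufflepuff dann gryffindor und ravenclaw".split()
positions = house_positions(toy_tokens)
firsts = first_mentions(toy_tokens)
order = sorted(firsts, key=firsts.get)
print(order)
```

The offsets returned by house_positions are the raw material for a dispersion-style plot, with token offset on the horizontal axis and one row per house.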

This contextual scaffolding is the deeper reason mass reading works as a language acquisition strategy. Comprehensible input, meaning content understood at roughly 95–98%, allows learners to infer meaning from context rather than requiring lookup. With a vocabulary of 2,000 words you reach about 90% coverage in this corpus, which is approaching that threshold. Flash cards and structured drills build the initial foundation efficiently, but reading is where vocabulary becomes embedded in semantic context, which is where durable acquisition happens.

Key word density of top 1,000 words

The bubble chart above maps the top 1,000 words by frequency. Circle area is proportional to occurrence count, and the colour grouping reflects loose semantic clustering. The dominance of function words at the centre reflects the Zipfian distribution — high frequency, low semantic density. The periphery, where content words live, represents the vocabulary that carries the actual meaning of the narrative.
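Making circle area proportional to count means the radius has to scale with the square root of the count, not with the count itself. The snippet below is a small sketch of that scaling, using three counts from the top-100 list. The function name bubble_radius and the maximum radius of 50 units are arbitrary choices for this example.

```python
import math

def bubble_radius(count, max_count, max_radius=50.0):
    """Radius that makes circle area proportional to the occurrence count."""
    return max_radius * math.sqrt(count / max_count)

# Counts taken from the top-100 list.
counts = {"und": 2604, "harry": 1422, "stein": 275}
max_count = max(counts.values())
radii = {word: bubble_radius(count, max_count) for word, count in counts.items()}

for word, radius in radii.items():
    area = math.pi * radius ** 2
    print(word, round(radius, 1), round(area))
```

Scaling the radius linearly instead would exaggerate the most frequent words, since area grows with the square of the radius. Plotting libraries differ in whether their size parameter means radius or area (matplotlib's scatter takes marker area via its s argument), so the convention is worth checking before plotting.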

Top 100 words in Harry Potter und der Stein der Weisen

und: 2604
ein: 1955
er: 1824
der: 1734
die: 1718
harry: 1422
sie: 1376
den: 1061
nicht: 1058
zu: 1027
in: 993
war: 922
sich: 905
ich: 870
es: 845
auf: 843
das: 840
sein: 730
sagt: 728
mit: 693
von: 685
hatt: 606
ihn: 581
ist: 522
du: 487
dem: 477
dass: 477
wie: 467
an: 449
als: 448
ihr: 430
ron: 425
was: 416
noch: 394
hab: 372
hagrid: 366
doch: 363
aus: 349
um: 348
so: 326
ihm: 320
all: 319
potter: 314
seit: 296
wir: 292
wenn: 290
im: 282
stein: 275
hat: 272
hermin: 272
dies: 253
vor: 241
konnt: 231
kein: 224
weis: 221
üb: 219
aber: 217
für: 214
nur: 209
nach: 209
sah: 201
dann: 201
schon: 190
etwas: 189
ganz: 187
professor: 179
auch: 176
snap: 175
würd: 173
and: 167
imm: 164
da: 164
mal: 161
dumbledor: 161
gross: 155
wied: 155
mein: 153
mir: 152
durch: 150
am: 149
nun: 148
gut: 147
ja: 146
mehr: 143
hätt: 143
uns: 143
mich: 142
hier: 141
dir: 127
jetzt: 124
zum: 123
dich: 121
onkel: 120
dudley: 119
sind: 119
klein: 118
aug: 118
vernon: 117
ging: 116
dein: 114