Analysis of Harry Potter in German

by Alexis Hope

A frequency analysis of Harry Potter und der Stein der Weisen using Python and NLTK, exploring what corpus linguistics reveals about the vocabulary threshold for reading comprehension in a second language.

Methodology

The source text was processed using NLTK (Natural Language Toolkit) in Python. The pipeline: tokenize the raw text with nltk.word_tokenize, lowercase and strip punctuation, then build a frequency distribution with nltk.FreqDist. Stopwords were deliberately not removed — function words like articles, pronouns, and conjunctions are exactly what a language learner needs to acquire, and stripping them would distort the comprehension model.
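The pipeline described above can be sketched without any downloads. This is a minimal stand-in, not the author's script: a regular expression replaces nltk.word_tokenize, and collections.Counter replaces nltk.FreqDist (which is itself a Counter subclass). The sample sentence is invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of letters (drops punctuation and digits)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def frequency_distribution(text):
    """Count every token; stopwords are kept on purpose."""
    return Counter(tokenize(text))


# Invented sample sentence, not a quotation from the book.
sample = "Harry sah den Stein. Der Stein war rot, und Harry sah ihn an."
freq = frequency_distribution(sample)
print(freq.most_common(3))
```

With NLTK installed and its tokenizer data downloaded, nltk.word_tokenize and nltk.FreqDist would slot into the same two functions.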

The cumulative coverage curve was computed by sorting the distinct word types by descending frequency and expressing the running sum of their counts as a percentage of the total token count. The house name frequency distributions were extracted by filtering FreqDist results against a known list of proper nouns and plotting their positional occurrence across the corpus.
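As a rough illustration of the coverage calculation, the sketch below sorts counts in descending order and accumulates them. The function names and the toy counts are assumptions made for this example, not taken from the original analysis.

```python
from collections import Counter
from itertools import accumulate


def coverage_curve(freq):
    """Cumulative share of all tokens (0..1) after each rank, most frequent first."""
    counts = sorted(freq.values(), reverse=True)
    total = sum(counts)
    return [running / total for running in accumulate(counts)]


def words_needed(curve, target):
    """Smallest number of top-ranked words whose coverage reaches target, else None."""
    for rank, covered in enumerate(curve, start=1):
        if covered >= target:
            return rank
    return None


# Toy counts for illustration only.
toy = Counter({"und": 5, "der": 3, "harry": 1, "stein": 1})
curve = coverage_curve(toy)
print(curve)                     # [0.5, 0.8, 0.9, 1.0]
print(words_needed(curve, 0.8))  # 2
```

Applied to the full frequency distribution, words_needed(curve, 0.9) would answer the question "how many words cover 90% of the text?" directly.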

How soon can I start reading in a target language?

Sooner than you might expect. In Harry Potter und der Stein der Weisen, the 100 most frequent words account for over 50% of all tokens. The corpus contains 82,390 total tokens, of which 7,364 are unique types. That gives a type-token ratio of about 0.09, which reflects the natural redundancy of narrative prose.

This steep coverage curve is a direct expression of Zipf's Law: in a natural language corpus, a word's frequency is approximately inversely proportional to its rank. In the idealised form, the most common word appears twice as often as the second most common, three times as often as the third, and so on; real counts, including those listed at the end of this article, follow this only loosely at the top ranks. The result is that a small high-frequency vocabulary covers a disproportionately large share of any text.
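To see how closely this corpus follows the idealised law, the sketch below compares the five highest counts from the top-100 list in this article with the prediction f(r) = f(1) / r. The helper name is an assumption for illustration. The comparison suggests that the head of a real distribution can fall off more slowly than the ideal curve.

```python
# Counts of the five most frequent entries, copied from the top-100 list in this article.
observed = [("und", 2604), ("ein", 1955), ("er", 1824), ("der", 1734), ("die", 1718)]

top_count = observed[0][1]


def zipf_prediction(rank, top=top_count):
    """Idealised Zipf frequency: the top count divided by the rank."""
    return top / rank


rows = []
for rank, (word, count) in enumerate(observed, start=1):
    predicted = zipf_prediction(rank)
    rows.append((rank, word, count, round(predicted)))
    print(f"{rank:>2} {word:<4} observed={count:>5} ideal={predicted:>7.1f}")
```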

Knowing 400 words covers 70% of the tokens; 2,000 words covers 90%. Beyond that threshold, each additional percentage point of coverage requires significantly more vocabulary. Token coverage is not the same thing as comprehension, but at 90% coverage context is often a strong enough signal that unfamiliar words can be inferred from the surrounding text.

Comprehension relative to known words

The curve above illustrates the diminishing returns clearly: comprehension rises steeply through the first 1,000 words then flattens. This is consistent with frequency distributions observed across most European language corpora.

100 words account for more than 50% of the content.
3,727 words appear only once, accounting for less than 5% of the content.

Those 3,727 single-occurrence words are hapax legomena, from the Greek for "said only once." In corpus linguistics, the hapax count is used as a measure of lexical richness; a high proportion of hapaxes indicates a diverse vocabulary. In the German translation of Rowling's novel, hapaxes make up 50.6% of the unique vocabulary but contribute minimally to overall coverage, which is why the comprehension curve flattens so sharply past the 2,000-word mark.
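The hapax figures can be checked with a few lines of arithmetic. The sketch below first shows, on an invented toy sentence, how hapaxes fall out of a frequency count, then recomputes the ratios from the totals reported in this article. The function and variable names are assumptions for illustration.

```python
from collections import Counter


def hapax_legomena(freq):
    """Words that occur exactly once in the frequency distribution."""
    return [word for word, count in freq.items() if count == 1]


# Invented toy sentence, for illustration only.
toy = Counter("der stein und der spiegel und der hut".split())
hapaxes = hapax_legomena(toy)
print(hapaxes)  # ['stein', 'spiegel', 'hut']

# Totals reported in this article.
TOKENS, TYPES, HAPAXES = 82_390, 7_364, 3_727
share_of_types = HAPAXES / TYPES     # about 0.506
share_of_tokens = HAPAXES / TOKENS   # about 0.045
type_token_ratio = TYPES / TOKENS    # about 0.089
print(f"{share_of_types:.1%} of types, {share_of_tokens:.1%} of tokens")
```

NLTK's FreqDist also offers a hapaxes() method that returns the same kind of list.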

The practical implication for language learners is significant: the entry cost to basic reading comprehension is low. The top 100 words in this corpus are overwhelmingly function words — articles (der, die, das, ein), pronouns (er, sie, ich, es), prepositions, and conjunctions. These are acquired early in any structured language course and transfer immediately to reading comprehension.

Frequency Distribution of Hogwarts School Houses

Proper nouns are an underappreciated comprehension resource for learners approaching a familiar text. Character names and place names recognised from the English source material provide free anchor points — tokens whose meaning is immediately known regardless of German vocabulary level. In this corpus, named characters appear in the top 300 words and account for approximately 3% of total content.

The house name distribution above shows when each house is first mentioned across the narrative arc. Slytherin and Hufflepuff appear earliest, in the scene at Madam Malkin's robe shop in Diagon Alley where Malfoy mentions them, before Gryffindor and Ravenclaw are established. For a learner, these named landmarks help frame context around surrounding unfamiliar vocabulary, loosely analogous to the way NLP systems use named entity recognition to anchor the interpretation of the text around an entity.
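A positional distribution like this can be approximated by recording the token offset of every house mention. The sketch below is one possible way to do it, under stated assumptions: the sample text is invented, the regex tokenizer stands in for NLTK, and inflected forms such as a genitive "Gryffindors" would not be matched.

```python
import re

HOUSES = ("gryffindor", "hufflepuff", "ravenclaw", "slytherin")


def house_offsets(text):
    """Map each house name to the token positions at which it occurs."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    offsets = {house: [] for house in HOUSES}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


# Invented sample text, not a quotation from the book.
sample = (
    "Ich komme nach Slytherin. Stell dir vor, du kommst nach Hufflepuff! "
    "Gryffindor ist besser."
)
offsets = house_offsets(sample)
first_seen = {house: hits[0] for house, hits in offsets.items() if hits}
print(first_seen)
```

With NLTK and matplotlib available, nltk.Text(tokens).dispersion_plot(list_of_words) draws a comparable picture from the same kind of token list.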

This contextual scaffolding is the deeper reason mass reading works as a language acquisition strategy. Comprehensible input, meaning content understood at roughly 95–98%, allows learners to infer meaning from context rather than requiring lookup. At 2,000 known words, which covers about 90% of this corpus, you are approaching that threshold but not yet at it. Flash cards and structured drills build the initial foundation efficiently, but reading is where vocabulary becomes embedded in semantic context, which is where durable acquisition happens.

Key word density of top 1,000 words

The bubble chart above maps the top 1,000 words by frequency. Circle area is proportional to occurrence count, and the colour grouping reflects loose semantic clustering. The dominance of function words at the centre reflects the Zipfian distribution — high frequency, low semantic density. The periphery, where content words live, represents the vocabulary that carries the actual meaning of the narrative.
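Making circle area proportional to count means the radius has to scale with the square root of the count, not the count itself. The sketch below shows that scaling; the function name and the max_radius value are assumptions for illustration, and the counts come from the top-100 list in this article.

```python
import math


def bubble_radius(count, max_count, max_radius=50.0):
    """Radius chosen so that circle area is proportional to count."""
    return max_radius * math.sqrt(count / max_count)


# Counts copied from the top-100 list in this article.
counts = {"und": 2604, "harry": 1422, "stein": 275}
largest = max(counts.values())
radii = {word: bubble_radius(count, largest) for word, count in counts.items()}
for word, radius in radii.items():
    print(f"{word:<6} radius={radius:5.1f}")
```

Scaling the radius linearly with the count would exaggerate the most frequent words, because area grows with the square of the radius.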

Top 100 words in Harry Potter und der Stein der Weisen

Rank  Word       Count
1     und        2604
2     ein        1955
3     er         1824
4     der        1734
5     die        1718
6     harry      1422
7     sie        1376
8     den        1061
9     nicht      1058
10    zu         1027
11    in         993
12    war        922
13    sich       905
14    ich        870
15    es         845
16    auf        843
17    das        840
18    sein       730
19    sagt       728
20    mit        693
21    von        685
22    hatt       606
23    ihn        581
24    ist        522
25    du         487
26    dem        477
27    dass       477
28    wie        467
29    an         449
30    als        448
31    ihr        430
32    ron        425
33    was        416
34    noch       394
35    hab        372
36    hagrid     366
37    doch       363
38    aus        349
39    um         348
40    so         326
41    ihm        320
42    all        319
43    potter     314
44    seit       296
45    wir        292
46    wenn       290
47    im         282
48    stein      275
49    hat        272
50    hermin     272
51    dies       253
52    vor        241
53    konnt      231
54    kein       224
55    weis       221
56    üb         219
57    aber       217
58    für        214
59    nur        209
60    nach       209
61    sah        201
62    dann       201
63    schon      190
64    etwas      189
65    ganz       187
66    professor  179
67    auch       176
68    snap       175
69    würd       173
70    and        167
71    imm        164
72    da         164
73    mal        161
74    dumbledor  161
75    gross      155
76    wied       155
77    mein       153
78    mir        152
79    durch      150
80    am         149
81    nun        148
82    gut        147
83    ja         146
84    mehr       143
85    hätt       143
86    uns        143
87    mich       142
88    hier       141
89    dir        127
90    jetzt      124
91    zum        123
92    dich       121
93    onkel      120
94    dudley     119
95    sind       119
96    klein      118
97    aug        118
98    vernon     117
99    ging       116
100   dein       114