Zipf's law is a neat, general fact about the distribution of word frequencies. G. K. Zipf observed that the frequency of the kth most frequent word is approximately proportional to 1/k, so that rank multiplied by frequency stays roughly constant (Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology (Reading, MA: Addison-Wesley, 1949), cited in Knuth, The Art of Computer Programming, vol. 3, Sorting and Searching (Reading, MA: Addison-Wesley, 1973), 397). The hundred most frequent words in this database adhere to the law quite well.
word  frequency (per million)  cumulative frequency  frequency rank  alphabetical rank
the 68351.63 68351.63 1 318525
of 33008.66 101360.29 2 212425
and 28651.11 130011.40 3 11331
to 27599.22 157610.62 4 322312
a 23160.48 180771.10 5 1
in 20670.81 201441.91 6 149032
is 10571.15 212013.06 7 156934
that 10549.02 222562.08 8 318470
was 9939.26 232501.34 9 356587
it 9882.90 242384.23 10 157771
for 9309.44 251693.67 11 114281
on 7636.66 259330.33 12 213645
with 7171.07 266501.39 13 361235
he 7167.84 273669.23 14 134413
be 7153.17 280822.40 15 27945
I 7036.88 287859.28 16 146205
by 5866.89 293726.17 17 44040
as 5793.35 299519.52 18 19178
at 5154.12 304673.64 19 20631
you 5043.27 309716.91 20 364651
are 5000.14 314717.05 21 17618
his 4963.47 319680.52 22 139433
had 4922.27 324602.79 23 131212
not 4899.77 329502.56 24 209444
this 4789.41 334291.97 25 319827
have 4685.82 338977.79 26 134106
from 4625.21 343603.01 27 117354
but 4616.26 348219.26 28 43732
which 4131.11 352350.37 29 358956
she 3991.77 356342.14 30 285912
they 3982.95 360325.09 31 319435
or 3975.58 364300.67 32 214838
an 3836.07 368136.73 33 10593
her 3692.13 371828.86 34 137067
were 3482.45 375311.31 35 358233
there 3025.87 378337.18 36 319027
we 2953.92 381291.10 37 357241
their 2929.78 384220.88 38 318680
been 2924.28 387145.16 39 28958
has 2873.74 390018.90 40 133676
will 2775.94 392794.84 41 360225
one 2764.69 395559.53 42 213720
all 2630.80 398190.33 43 7706
would 2617.11 400807.44 44 362548
can 2355.35 403162.80 45 46162
if 2247.43 405410.22 46 147000
who 2226.26 407636.48 47 359548
more 2195.16 409831.64 48 196881
when 2193.48 412025.12 49 358850
said 2149.41 414174.53 50 274265
do 2139.12 416313.65 51 88648
what 2053.98 418367.63 52 358673
about 1907.52 420275.15 53 652
its 1888.51 422163.66 54 157935
so 1844.57 424008.24 55 293328
up 1816.81 425825.05 56 347711
into 1803.28 427628.33 57 155127
no 1789.08 429417.41 58 205310
him 1787.13 431204.53 59 138999
some 1783.31 432987.85 60 294419
could 1753.24 434741.08 61 68666
them 1668.31 436409.39 62 318729
only 1646.85 438056.24 63 213824
time 1609.99 439666.22 64 321515
out 1547.86 441214.09 65 217118
my 1526.21 442740.30 66 200056
two 1514.46 444254.76 67 330909
other 1513.23 445767.98 68 216850
then 1475.27 447243.25 69 318748
may 1455.47 448698.73 70 184593
over 1443.56 450142.28 71 218315
also 1409.47 451551.75 72 8585
new 1404.41 452956.16 73 204064
like 1366.44 454322.60 74 173657
these 1328.58 455651.18 75 319382
me 1316.41 456967.59 76 185895
after 1302.93 458270.52 77 4998
first 1287.14 459557.66 78 111382
your 1285.88 460843.54 79 364711
did 1283.43 462126.98 80 84058
now 1281.59 463408.56 81 209859
any 1279.86 464688.42 82 15074
people 1215.83 465904.26 83 229078
than 1203.22 467107.47 84 318396
should 1172.27 468279.75 85 287398
very 1159.18 469438.93 86 352460
most 1112.14 470551.07 87 197488
see 1097.46 471648.52 88 281471
where 1096.15 472744.67 89 358869
just 1060.74 473805.41 90 160985
made 1050.69 474856.10 91 179480
between 1031.01 475887.12 92 31750
back 1022.58 476909.69 93 24006
way 984.89 477894.58 94 357170
many 981.20 478875.78 95 182122
years 981.16 479856.94 96 364108
being 973.72 480830.66 97 29466
our 970.28 481800.94 98 217097
how 969.81 482770.75 99 142630
work 956.09 483726.84 100 362239
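
The claim that these words follow Zipf's law can be checked roughly from the table itself. The sketch below is only an illustration, not the method used to build the database: it copies seven rows (ranks 1, 2, 5, 10, 20, 50 and 100) from the table, multiplies each frequency by its rank, and fits a straight line to log(frequency) against log(rank) by ordinary least squares. The names SAMPLE, rank_times_frequency and log_log_slope are invented for this example.

```python
import math

# (frequency rank, frequency per million), copied from rows of the table above.
SAMPLE = [
    (1, 68351.63),    # the
    (2, 33008.66),    # of
    (5, 23160.48),    # a
    (10, 9882.90),    # it
    (20, 5043.27),    # you
    (50, 2149.41),    # said
    (100, 956.09),    # work
]


def rank_times_frequency(rows):
    """Return k * f(k) for each row; Zipf's law predicts a roughly constant value."""
    return [k * f for k, f in rows]


def log_log_slope(rows):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(k) for k, _ in rows]
    ys = [math.log(f) for _, f in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


products = rank_times_frequency(SAMPLE)
slope = log_log_slope(SAMPLE)

for (k, f), p in zip(SAMPLE, products):
    print(f"rank {k:>3}  frequency {f:>9.2f}  rank*frequency {p:>10.2f}")
print(f"log-log slope: {slope:.3f}")
```

If the law held exactly, every product would be the same constant and the slope would be exactly -1. With only seven sampled rows this is a sanity check rather than a proper fit; using all hundred rows would give a fairer picture.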