Where do the words in /usr/share/dict/words come from?

/usr/share/dict/words contains lots of words. How is this list generated? Are its contents the same across different Unices? Is there any standard dictating what it must contain?

All I’ve been able to turn up so far is that on Ubuntu/Debian the list comes from the wordlist packages, but their descriptions offer no clue on how the lists were actually generated.

Asked By: Mark Amery

||

You’re asking multiple questions, but I think the main one is:

Is there any standard dictating what it must contain?

To my knowledge, no.

Given that, your related questions:

How is this list generated? Are its contents the same across different Unices?

are both answered by “it depends on the particular Unix”.

The convention of including a word list as part of the operating system comes from the spell(1) utility, which uses it for a primitive spell-checking procedure.

That spell-checking procedure is described in the academic paper “Development of a Spelling List”, by M. D. McIlroy of Bell Labs, 1982.
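
To make the idea of checking text against a word list concrete, here is a minimal sketch of a naive lookup. It is not the procedure from McIlroy's paper and not how spell(1) is implemented; it only reports words that have no exact, case-insensitive match in the list. The function name unknown_words is made up for this illustration, and the default path is assumed to exist on your system.

```shell
# Hypothetical helper: print the words of a text file that do not
# appear in a word list (naive exact match, ignoring case).
# Usage: unknown_words TEXT_FILE [WORD_LIST]
unknown_words() {
    text_file=$1
    word_list=${2:-/usr/share/dict/words}   # assumed default location

    # Turn every run of non-letters into a newline (one word per line),
    # lowercase the words, drop duplicates and empty lines, then keep
    # only the lines that match no entry of the word list.
    tr -cs '[:alpha:]' '\n' < "$text_file" \
        | tr '[:upper:]' '[:lower:]' \
        | sort -u \
        | grep -v '^$' \
        | grep -Fxiv -f "$word_list"
}
```

Running unknown_words notes.txt would list candidate misspellings in a file called notes.txt. Because the text is split on every non-letter, contractions such as "don't" are checked as two separate pieces, so expect some false positives.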

You should check your operating system’s package manager for where the spelling list comes from, how it is generated, and what alternatives are available.

On Debian GNU+Linux, for example:

  • The /usr/share/dict/words file is a symbolic link managed using the Debian “alternatives” system.
  • A common word list package providing that link is the wamerican package.
  • The package documentation for wamerican states its word list comes from the SCOWL (Spell Checker Oriented Word Lists) project.
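
The points above can be checked on your own machine. The following is a rough sketch, assuming GNU readlink (for the -f option) and, optionally, dpkg; the function name wordlist_info is invented for this example, and the output will differ from system to system.

```shell
# Hypothetical helper: show which real file a word list path resolves to,
# how many entries it has, and (on dpkg-based systems) which package owns it.
# Usage: wordlist_info [PATH]
wordlist_info() {
    path=${1:-/usr/share/dict/words}

    # Follow the whole chain of symbolic links to the final file.
    target=$(readlink -f "$path") || return 1
    if [ ! -f "$target" ]; then
        echo "no word list found at $path" >&2
        return 1
    fi

    entries=$(wc -l < "$target")
    echo "target: $target"
    echo "entries: $((entries))"

    # dpkg -S searches installed packages for the file; skip it elsewhere.
    if command -v dpkg >/dev/null 2>&1; then
        dpkg -S "$target" 2>/dev/null || echo "package: unknown"
    fi
}
```

Calling wordlist_info with no argument inspects /usr/share/dict/words. On a system without dpkg, only the resolved path and the entry count are printed.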

Many other word list packages can be installed; they each have the “Provides: wordlist” field:

$ aptitude search '?provides(wordlist)' | wc -l
34

On other Unices, you’ll need to consult the package system and its documentation to learn where the word list comes from and which alternatives are available.

Answered By: bignose