Does /usr/share/dict/words contain personal information?

I am considering including a copy of my /usr/share/dict/words file in a public GitHub repository for a project that requires dictionaries. Is this a bad idea, and if so, why?

I’m particularly interested in the privacy/security (or even legal?) aspects. Are there common programs that add words to this dictionary, for example if I choose "Add to Dictionary" in a spell checker? Is the file likely to contain any sensitive information, such as my username? (I checked that, and it doesn’t, but there could be similar things I didn’t think to check.) It’d be impractical to look through all 104,334 words. Perhaps it’s just the usr in the path making me unnecessarily concerned.

I’ve read over these questions about where the words come from. However, is it probable that any words have since been added or removed?

I suppose if nothing has changed, I could just get the source. But if some programs added helpful (non-personal) words, I’d want to keep those.

In case it’s important, I am running Ubuntu 23.10. But I’d prefer a slightly more general answer, if possible.

Note

I am fully aware that

  • it would be possible to point to the file path in code rather than "hard coding" it into the repo, and
  • this may not be the best free English word list.

However, I’m not interested in using a different list instead of this one (in such a case, I’d rather just use both). And if I use a list, it’s necessary that I can include the actual file.

Asked By: kviLL

||

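/usr/share/dict/words isn’t normally modifiable by non-root users, so spell checkers and other programs running under your account can’t change it. An "Add to Dictionary" action typically stores the word in a separate per-user dictionary instead.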
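
If you'd like to confirm this on your own machine, here is a minimal Python sketch. The helper name describe_word_list is made up for illustration, and it assumes a POSIX system. It resolves symlinks first, since the path may be a symlink to the actual list.

```python
import os
import stat


def describe_word_list(path="/usr/share/dict/words"):
    """Report who owns the word list and whether it can be modified."""
    real_path = os.path.realpath(path)  # follow symlinks to the actual file
    info = os.stat(real_path)
    return {
        "real_path": real_path,
        "owner_uid": info.st_uid,  # 0 means the file is owned by root
        "mode": stat.filemode(info.st_mode),
        "writable_by_me": os.access(real_path, os.W_OK),
        "world_writable": bool(info.st_mode & stat.S_IWOTH),
    }


if __name__ == "__main__":
    if os.path.exists("/usr/share/dict/words"):
        for key, value in describe_word_list().items():
            print(f"{key}: {value}")
```

If writable_by_me is False when you run it as your regular user, nothing running under your account could have added words to the file.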

In fact, since it’s in /usr, on most systems it “belongs” to the system and is only modified through updates to the system (in your case, Ubuntu). At least on Linux systems it’s usually one of the word lists provided by SCOWL (Spell Checker Oriented Word Lists).
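
If you want to be certain nothing has been added or removed, one option is to compare your file against a pristine copy, for example one extracted from the distribution's package. The following is only a sketch: load_words and diff_word_lists are hypothetical helper names, and it assumes both files are UTF-8 with one word per line.

```python
def load_words(path):
    """Read a one-word-per-line list into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def diff_word_lists(local_path, reference_path):
    """Return (added, removed) relative to the reference list.

    added:   words present only in the local copy
    removed: words present only in the reference copy
    """
    local = load_words(local_path)
    reference = load_words(reference_path)
    return sorted(local - reference), sorted(reference - local)
```

Two empty lists mean the local copy has the same set of words as the reference, so there is nothing extra to review by hand.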

As such, it’s safe to copy and distribute, as long as you follow the license terms (if your list comes from Ubuntu’s wamerican package, they are in /usr/share/doc/wamerican/copyright). Another approach is to rely on /usr/share/dict/words on your project’s users’ systems. You mention you don’t want to do this, but in many cases the file there will be the same as the one on your system. This can even cover CI: on typical Ubuntu-based CI environments, installing the wamerican package provides the default US /usr/share/dict/words file.
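
The two approaches can also be combined: look for the system file first and fall back to a copy shipped with the project. A small sketch, where find_word_list is a hypothetical helper and "data/words" is an assumed location for the bundled copy:

```python
from pathlib import Path


def find_word_list(candidates):
    """Return the first existing file among the candidate paths, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    # "data/words" is a hypothetical location for a copy bundled with the project.
    chosen = find_word_list(["/usr/share/dict/words", "data/words"])
    print(chosen if chosen else "no word list found")
```

Swapping the order of the candidates makes the bundled copy take precedence, which keeps results identical across machines.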

Answered By: Stephen Kitt