Can tr commands be chained to avoid multiple tr processes in a pipeline?

I have a bunch of txt files, and I'd like to output them lower-cased, alphabetic characters only, one word per line. I can do it with several tr commands in a pipeline like this:

tr -d '[:punct:]' <doyle_sherlock_holmes.txt | tr '[:upper:]' '[:lower:]' | tr ' ' '\n'
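For instance, a made-up sample line comes out like this:

printf "Dr. Watson's Here\n" | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | tr ' ' '\n'
dr
watsons
here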

Is it possible to do this in one scan? I could write a C program to do this, but I feel like there’s a way to do it using tr, sed, awk or perl.

Asked By: tlehman


You can combine multiple translations (excepting complex cases involving overlapping locale-dependent sets), but you can’t combine deletion with translation.

<doyle_sherlock_holmes.txt tr -d '[:punct:]' | tr '[:upper:] ' '[:lower:]\n'
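The made-up sample from the question comes out identically with the two-command version:

printf "Dr. Watson's Here\n" | tr -d '[:punct:]' | tr '[:upper:] ' '[:lower:]\n'
dr
watsons
here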

Two calls to tr are likely to be faster than a single call to more complex tools, but this is very dependent on the input size, on the proportions of different characters, on the implementation of tr and competing tools, on the operating system, on the number of cores, etc.
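If it matters for your data, a rough timing sketch settles it (big.txt is a placeholder for one of your files; the sed variant is the GNU sed approach shown further down):

time tr -d '[:punct:]' <big.txt | tr '[:upper:] ' '[:lower:]\n' >/dev/null
time sed 's/[^a-zA-Z ]\+//g;s/[a-zA-Z]\+/\L&/g;s/ \+/\n/g' <big.txt >/dev/null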

Here are a few approaches (there is a sample run after the list):

  • GNU grep and tr: find all words and make them lower case

    grep -Po '\w+' file | tr '[A-Z]' '[a-z]'
    
  • GNU grep and perl: as above but perl handles the conversion to lower case

    grep -Po '\w+' file | perl -lne 'print lc()'
    
  • perl: find all alphabetic characters and print them in lower case (thanks @steeldriver):

    perl -lne 'print lc for /[a-z]+/ig' file
    
  • sed: remove all characters that are not alphabetic or spaces, substitute all alphabetic characters with their lower case versions and replace all spaces with newlines. Note that this assumes that all whitespace is spaces, no tabs.

    sed 's/[^a-zA-Z ]\+//g;s/[a-zA-Z]\+/\L&/g; s/ \+/\n/g' file
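For instance, the perl approach on a made-up sample line:

printf 'Hello, World!\n' | perl -lne 'print lc for /[a-z]+/ig'
hello
world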
    
Answered By: terdon

Yes. You can do that w/ tr in an ASCII locale (which is, for a GNU tr anyway, kind of its only purview). You can use the POSIX classes, or you can reference the byte values of each character by octal number. You can split their transformations across ranges, as well.

LC_ALL=C tr '[:upper:]\0-\100\133-\140\173-\377' '[:lower:][\n*]' <input

The above command would transform all uppercase characters to lowercase, ignore lowercase chars entirely, and transform all other characters to newlines. Of course, then you wind up with a ton of blank lines. tr's -s (squeeze-repeats) switch could be useful in that case, but if you use it alongside the [:upper:] to [:lower:] transformation then you wind up squeezing repeated letters as well, because the squeeze applies to every character in the second set, lowercase letters included. In that way it still requires a second filter like…

LC... tr ... | tr -s '\n'

…or…

LC... tr ... | grep .

…and so it winds up being a lot less convenient than doing…

LC_ALL=C tr -sc '[:alpha:]' '\n' <input | tr '[:upper:]' '[:lower:]'

…which squeezes the complement (-c) of alphabetic characters, a sequence at a time, into a single newline apiece, then does the upper-to-lower transform on the other side of the pipe.
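For example (GNU tr, C locale), the single-pass command with -s eats the doubled O along with the blank lines, while the two-command pipeline keeps it:

printf 'BOOK AHEAD\n' | LC_ALL=C tr -s '[:upper:]\0-\100\133-\140\173-\377' '[:lower:][\n*]'
bok
ahead

printf 'BOOK AHEAD\n' | LC_ALL=C tr -sc '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]'
book
ahead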

That isn’t to say that ranges of that nature are not useful. Stuff like:

tr '\0-\377' '[1*25][2*25][3*25][4*25][5*25][6*25][7*25][8*25][9*25][0*]' </dev/random

…can be pretty handy, as it converts every input byte to a digit: each of 1 through 9 covers 25 of the 256 byte values, and the trailing [0*] absorbs the remaining 31, so the spread is nearly uniform. Waste not, want not, you know.
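Because that reads from the random device indefinitely, a bounded sample is handier in practice:

LC_ALL=C tr '\0-\377' '[1*25][2*25][3*25][4*25][5*25][6*25][7*25][8*25][9*25][0*]' </dev/urandom | head -c 32; echo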

Another way to do the transform could involve dd.

tr '\0-\377' '[A*64][B*64][C*64][D*64]' </dev/urandom |
dd bs=32 cbs=8 conv=unblock,lcase count=1

dadbbdbd
ddaaddab
ddbadbaa
bdbdcadd

Because dd can do both unblock and lcase conversions at the same time, it might even be possible to pass much of the work off to it. But that can only be really useful if you can accurately predict the number of bytes per word – or at least can pad each word with spaces beforehand to a predictable byte count, because unblock eats trailing spaces at the end of each block.
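A minimal sketch of that padding idea (the 8-byte records are an assumption; real words would need whatever fixed width fits the longest one):

printf '%-8s%-8s' HELLO WORLD | dd cbs=8 conv=unblock,lcase 2>/dev/null
hello
world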

Answered By: mikeserv