Replace string in a huge (70GB), one line, text file

I have a huge (70GB), one line, text file and I want to replace a string (token) in it.
I want to replace the token <unk> with another dummy token (glove issue).

I tried sed:

sed 's/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new

but the output file corpus.txt.new has zero bytes!

I also tried using perl:

perl -pe 's/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new

but I got an out of memory error.

For smaller files, both of the above commands work.

How can I replace a string in such a file?
This is a related question, but none of the answers worked for me.

Edit:
What about splitting the file into chunks of 10GB (or whatever) each, applying sed to each one of them, and then merging them with cat? Does that make sense? Is there a more elegant solution?

The usual text processing tools are not designed to handle lines that don’t fit in RAM. They tend to work by reading one record (one line), manipulating it, and outputting the result, then proceeding to the next record (line).

If there’s an ASCII character that appears frequently in the file and doesn’t appear in <unk> or <raw_unk>, then you can use that as the record separator. Since most tools don’t allow custom record separators, swap between that character and newlines. tr processes bytes, not lines, so it doesn’t care about any record size. Supposing that ; works:

<corpus.txt tr '\n;' ';\n' |
sed 's/<unk>/<raw_unk>/g' |
tr '\n;' ';\n' >corpus.txt.new

You could also anchor on the first character of the text you’re searching for, assuming that it isn’t repeated in the search text and it appears frequently enough. If the file may start with unk>, change the sed command to sed '2,$ s/… to avoid a spurious match.

<corpus.txt tr '\n<' '<\n' |
sed 's/^unk>/raw_unk>/g' |
tr '\n<' '<\n' >corpus.txt.new

Alternatively, use the last character.

<corpus.txt tr '\n>' '>\n' |
sed 's/<unk$/<raw_unk/g' |
tr '\n>' '>\n' >corpus.txt.new

Note that this technique assumes that sed operates seamlessly on a file that doesn’t end with a newline, i.e. that it processes the last partial line without truncating it and without appending a final newline. It works with GNU sed. If you can pick the last character of the file as the record separator, you’ll avoid any portability trouble.
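
If you are not sure which character to pick, a quick streaming count can tell you how often each candidate byte occurs without loading the file into memory. The sketch below is only an illustration; the candidate set and the script name are arbitrary.

#!/usr/bin/python3
# Count occurrences of a few candidate separator bytes, reading in chunks so
# memory use stays flat. Usage (name is arbitrary): ./sepcheck.py corpus.txt
import sys
from collections import Counter

candidates = b';:|#~'              # bytes you are considering as separators
counts = Counter()

with open(sys.argv[1], 'rb') as f:
    while True:
        chunk = f.read(1 << 20)    # 1 MiB at a time
        if not chunk:
            break
        for b in candidates:
            counts[chr(b)] += chunk.count(b)

for sep, n in counts.most_common():
    print(repr(sep), n)

A good candidate occurs frequently (so the temporary "lines" stay short) and never appears inside <unk> or <raw_unk>.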

Here’s a small Go program that performs the task (unk.go):

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
)

func main() {
    const (
        pattern     = "<unk>"
        replacement = "<raw_unk>"
    )
    var match int
    var char rune
    scanner := bufio.NewScanner(os.Stdin)
    scanner.Split(bufio.ScanRunes)
    for scanner.Scan() {
        char = rune(scanner.Text()[0])
        if char == []rune(pattern)[match] {
            match++
            if match == len(pattern) {
                fmt.Print(replacement)
                match = 0
            }
        } else {
            if match > 0 {
                fmt.Print(string(pattern[:match]))
                match = 0
            }
            if char == rune(pattern[0]) {
                match = 1
            } else {
                fmt.Print(string(char))
            }
        }
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}

Just build it with go build unk.go and run it as ./unk <input >output.

EDIT:

Sorry, I didn't read that everything is on one line, so the program now reads the file character by character.

EDIT II:

Applied same fix as to the C program.

Answered By: Patrick Bucher

GNU grep can show you the offset of matches in “binary” files, without having to read whole lines into memory. You can then use dd to read up to this offset, skip over the match, then continue copying from the file.

file=...
newfile=...
replace='<raw_unk>'
grep -o -b -a -F '<unk>' <"$file" |
(   pos=0
    while IFS=$IFS: read offset pattern
    do size=${#pattern}
       let skip=offset-pos
       let big=skip/1048576
       let skip=skip-big*1048576
       dd bs=1048576 count=$big <&3
       dd bs=1 count=$skip <&3
       dd bs=1 count=$size of=/dev/null <&3
       printf "%s" "$replace"
       let pos=offset+size
    done
    cat <&3
) 3<"$file" >"$newfile"

For speed, I’ve split the dd into a big read of blocksize 1048576 and a smaller read of 1 byte at a time, but this operation will still be a little slow on such a large file. The grep output is, for example, 13977:<unk>, and this is split on the colon by the read into variables offset and pattern. We have to keep track in pos of how many bytes have already been copied from the file.

Answered By: meuh

For such a big file, one possibility is Flex. Let unk.l be:

%%
\<unk\>     printf("<raw_unk>");
%%

Then compile and execute:

$ flex -o unk.c  unk.l
$ cc -o unk -O2 unk.c -lfl
$ unk < corpus.txt > corpus.txt.new
Answered By: JJoao

I think the C version might perform much better:

#include <stdio.h>
#include <string.h>

#define PAT_LEN 5

int main()
{
    /* note this is not a general solution. In particular the pattern
     * must not have a repeated sequence at the start, so <unk> is fine
     * but aardvark is not, because it starts with "a" repeated, and ababc
     * is not because it starts with "ab" repeated. */
    char pattern[] = "<unk>";          /* set PAT_LEN to length of this */
    char replacement[] = "<raw_unk>"; 
    int c;
    int i, j;

    for (i = 0; (c = getchar()) != EOF;) {
        if (c == pattern[i]) {
            i++;
            if (i == PAT_LEN) {
                printf("%s", replacement);
                i = 0;
            }
        } else {
            if (i > 0) {
                for (j = 0; j < i; j++) {
                    putchar(pattern[j]);
                }
                i = 0;
            }
            if (c == pattern[0]) {
                i = 1;
            } else {
                putchar(c);
            }
        }
    }
    /* TODO: fix up end of file if it ends with a part of pattern */
    return 0;
}

EDIT: Modified according to suggestions from the comments. Also fixed bug with the pattern <<unk>.

Answered By: Patrick Bucher

Using perl

Managing your own buffers

You can use IO::Handle's setvbuf to manage the default buffers, or you can manage your own buffers with sysread and syswrite. Check perldoc -f sysread and perldoc -f syswrite for more information; essentially, they skip buffered I/O.

Here we roll our own buffered I/O, but we do it manually and with an arbitrary chunk size (32 KiB below). We also open the file read-write, so we read and write through the same filehandle, modifying the file in place.

use strict;
use warnings;
use Fcntl qw(:flock O_RDWR);
use autodie;
use bytes;

use constant CHUNK_SIZE => 1024 * 32;

sysopen my $fh, 'file', O_RDWR;
flock($fh, LOCK_EX);

my $chunk = 1;
while ( sysread $fh, my $bytes, CHUNK_SIZE ) {
  if ( $bytes =~ s/<unk>/<raw_unk>/g ) {
    # overwrite this chunk in place, then return to where reading stopped
    sysseek( $fh, ($chunk - 1) * CHUNK_SIZE, 0 );
    syswrite( $fh, $bytes );
    sysseek( $fh, $chunk * CHUNK_SIZE, 0 );
  }
  $chunk++;
}

If you’re going to go this route

  1. Make sure <unk> and <raw_unk> are the same byte size.
  2. Make sure a match can't straddle a CHUNK_SIZE boundary; an occurrence of <unk> split across two chunks will not be seen, and therefore not replaced.
Answered By: Evan Carroll

With perl, you could work with fixed length records like:

perl -pe 'BEGIN{$/=\1e8}
          s/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new

And hope that there won’t be <unk>s spanning across two of those 100MB records.
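
To guard against that risk, here is a rough sketch of the same fixed-size-chunk idea in Python (the filenames, the chunk size and the boundary handling are illustrative assumptions, not part of the answer above): it carries a partial-token tail from one read to the next, so a <unk> that straddles a read boundary is still replaced. Like the C program above, it assumes the search string does not overlap itself.

#!/usr/bin/python3
# Sketch only: stream corpus.txt to corpus.txt.new in fixed-size chunks,
# carrying a partial-token tail between reads so a '<unk>' split across a
# read boundary is still found and replaced.
search = b'<unk>'
replace = b'<raw_unk>'
CHUNK = 100 * 1024 * 1024                  # 100MB reads, as in the one-liner

with open('corpus.txt', 'rb') as src, open('corpus.txt.new', 'wb') as dst:
    carry = b''
    while True:
        data = src.read(CHUNK)
        if not data:
            dst.write(carry)               # leftover partial prefix at EOF
            break
        buf = carry + data
        # hold back the longest tail of buf that could start a '<unk>'
        k = next((n for n in range(len(search) - 1, 0, -1)
                  if buf.endswith(search[:n])), 0)
        carry = buf[len(buf) - k:] if k else b''
        dst.write(buf[:len(buf) - k].replace(search, replace))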

Answered By: Stéphane Chazelas

EDITED in 2024: apparently the behavior of replace was changed and newlines are no longer allowed, so this answer is no longer valid.

There is a replace utility in the mariadb-server/mysql-server package. It replaces simple strings (not regular expressions) and, unlike grep/sed/awk, replace does not care about \n and \0. Memory consumption is constant with any input file (about 400 kB on my machine).

Of course you do not need to run a mysql server in order to use replace, it is only packaged that way in Fedora. Other distros/operating systems may have it packaged separately.

Answered By: legolegs

So you don’t have enough physical memory (RAM) to hold the whole file at once, but on a 64-bit system you have enough virtual address space to map the entire file. Virtual mappings can be useful as a simple hack in cases like this.

The necessary operations are all included in Python. There are several annoying subtleties, but it does avoid having to write C code. In particular, care is needed to avoid copying the file in memory, which would defeat the point entirely. On the plus side, you get error-reporting for free (python “exceptions”) :).

#!/usr/bin/python3
# This script takes input from stdin
# (but it must be a regular file, to support mapping it),
# and writes the result to stdout.

search = b'<unk>'
replace = b'<raw_unk>'


import sys
import os
import mmap

# sys.stdout requires str, but we want to write bytes
out_bytes = sys.stdout.buffer

mem = mmap.mmap(sys.stdin.fileno(), 0, access=mmap.ACCESS_READ)
i = mem.find(search)
if i < 0:
    sys.exit("Search string not found")

# mmap object subscripts to bytes (making a copy)
# memoryview object subscripts to a memoryview object
# (it implements the buffer protocol).
view = memoryview(mem)

out_bytes.write(view[:i])
out_bytes.write(replace)
out_bytes.write(view[i+len(search):])
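
If you save it as, say, replace_unk.py (the name is arbitrary), you can run it as python3 replace_unk.py < corpus.txt > corpus.txt.new; redirecting stdin from the regular file is what makes it mappable.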
Answered By: sourcejedi

Here is another single UNIX command line that might perform better than other options, because you can “hunt” for a “block size” that performs well. For this to be robust you need to know that you have at least one space in every X characters, where X is your arbitrary “block size”. In the example below I have chosen a “block size” of 1024 characters.

fold -w 1024 -s corpus.txt | sed 's/<unk>/<raw_unk>/g' | tr -d '\n'

Here, fold will grab up to 1024 bytes, but the -s makes sure it breaks on a space if there is at least one since the last break.

The sed command is yours and does what you expect.

Then the tr command will "unfold" the file, deleting the newlines that fold inserted.

You should consider trying larger block sizes to see if it performs faster. Instead of 1024, you might try 10240 and 102400 and 1048576 for the -w option of fold.

Here is an example, broken down step by step, that converts all the N's to lowercase:

[root@alpha ~]# cat mailtest.txt
test XJS C4JD QADN1 NSBN3 2IDNEN GTUBE STANDARD ANTI UBE-TEST EMAIL*C.34X test

[root@alpha ~]# fold -w 20 -s mailtest.txt
test XJS C4JD QADN1
NSBN3 2IDNEN GTUBE
STANDARD ANTI
UBE-TEST
EMAIL*C.34X test

[root@alpha ~]# fold -w 20 -s mailtest.txt | sed 's/N/n/g'
test XJS C4JD QADn1
nSBn3 2IDnEn GTUBE
STAnDARD AnTI
UBE-TEST
EMAIL*C.34X test

[root@alpha ~]# fold -w 20 -s mailtest.txt | sed 's/N/n/g' | tr -d '\n'
test XJS C4JD QADn1 nSBn3 2IDnEn GTUBE STAnDARD AnTI UBE-TEST EMAIL*C.34X test

You will need to add a newline back at the very end of the file if the original had one, because the tr command removes it.

Answered By: alfreema

You could try bbe (binary block editor), a “sed for binary files”.

I had good success using it on a 7GB text file with no EOL chars, replacing multiple occurrences of a string with one of different length. Without attempting any optimisation it gave an average processing throughput of > 50MB/s.

Answered By: ovirt

This may be overkill for a 70GB file and a simple search & replace, but the Hadoop MapReduce framework would solve your problem right now at no cost (choose the 'Single Node' option when setting it up to run it locally), and it can be scaled out to much larger workloads in the future without the need to modify your code.

The official tutorial at https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html uses (extremely simple) Java but you can find client libraries for Perl or whatever language you feel like using.

So if later on you find that you are doing more complex operations on 7000GB text files – and having to do this 100 times per day – you can distribute the workload across multiple nodes that you provision or that are automatically provisioned for you by a cloud-based Hadoop cluster.

Answered By: Sam Rahimi

If <unk> occurs often enough (as Zipf's law suggests it will), the text between consecutive occurrences stays small enough to hold in memory, and you can use it as the record separator:

awk -v RS="<unk>" -v ORS="<raw_unk>" 1
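
Note that a multi-character RS needs an awk that supports it, such as GNU awk (gawk), and the command runs as a filter, e.g. awk -v RS="<unk>" -v ORS="<raw_unk>" 1 < corpus.txt > corpus.txt.new. Also be aware that ORS is appended after every record, including the last one, so the output gains one extra <raw_unk> at the end unless the file happens to end with <unk>.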
Answered By: JJoao

All of the previous suggestions require reading the entire file and writing the entire file. This not only takes a long time but also requires 70GB of free space.

1) If I understand your specific case correctly, would it be acceptable to replace <unk> with some other string of the SAME length?

2a) Are there multiple occurrences?
2b) If so, do you know how many?

I’m sure you have solved this year-plus problem already and I’d like to know what solution you used.

I'd propose a solution (most likely in C) that reads the file in blocks, searching each block for the string and taking possible block crossings into account. Once found, replace the string with the same-length alternative and write back only that block, continuing for the known number of occurrences or until end of file. This would require as few writes as there are occurrences, and at most twice that (if every occurrence were split between two blocks). It would require NO additional space!
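
As a rough sketch of that idea, in Python rather than C (the same-length placeholder replacement '<xxx>', the block size and the filename are illustrative assumptions, not anything given in the question):

#!/usr/bin/python3
# Sketch of the same-length, in-place, block-wise approach. Each block is read
# with a small overlap so an occurrence split between two blocks is still seen,
# and only blocks that actually contain the pattern are written back.
import os

search = b'<unk>'
replace = b'<xxx>'                 # placeholder: must be exactly as long as search
assert len(search) == len(replace)

BLOCK = 64 * 1024 * 1024           # 64 MiB blocks, tune to taste
overlap = len(search) - 1

with open('corpus.txt', 'r+b') as f:
    size = os.fstat(f.fileno()).st_size
    pos = 0
    while pos < size:
        f.seek(pos)
        block = f.read(BLOCK + overlap)
        if search in block:
            f.seek(pos)
            f.write(block.replace(search, replace))   # same length: nothing shifts
        pos += BLOCK

Because the replacement never changes the file's length, no temporary copy and no extra disk space are needed; the trade-off is that the original file is modified in place, so keep a copy elsewhere if that matters.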

Answered By: DGerman