Find and replace words using sed command not working

I have a tab-separated text file: the first column holds the word to find and the second column holds its replacement. The file contains English and Arabic pairs. Once a word has been found and replaced, it should not be changed again.

For example:

adam    a +dam
a   b
ال  ال+

So for a given text file:

adam played with a ball ال

I expect:

a +dam played with b ball ال+

However, I get:

b +dbm plbyed with b bbll ال+

I am using the following sed command to find and replace:

sed -e 's/^/s%/' -e 's/\t/%/' -e 's/$/%g/' tab_sep_file.txt | sed -f - original_file.txt >replaced.txt

How can I fix this issue?

Answer

The basic problem with your approach is that text produced by an earlier substitution gets matched again by a later one: you don't want the a's in a +dam changed to b's. That makes sed a pretty poor choice here. It is fairly easy to build one regular expression that matches everything you want to replace, but picking which replacement to use for each match is the hard part.
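To make the failure concrete, here is a minimal Python sketch (not the sed script itself, only the same idea) that applies the example pairs one after another, the way the generated `s%...%...%g` commands do. The name `replace_sequentially` is made up for illustration.

```python
# Pairs from the question, applied in file order.
pairs = [("adam", "a +dam"), ("a", "b"), ("ال", "ال+")]

def replace_sequentially(text, pairs):
    """Apply each substitution to the whole text, one after another."""
    for old, new in pairs:
        # Every later pair also sees the output of the earlier ones,
        # and matches inside words because nothing anchors it.
        text = text.replace(old, new)
    return text

print(replace_sequentially("adam played with a ball ال", pairs))
# b +dbm plbyed with b bbll ال+
```

The second pair rewrites the a's that the first pair just produced, as well as the a's inside "played" and "ball", which is exactly the unwanted output shown in the question.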

A way using GNU awk:

gawk -F'\t' '
     FNR == NR { subs[$1] = $2; next } # populate the array of substitutions
     ENDFILE {
             if (FILENAME == ARGV[1]) {
                # Build a regular expression of things to substitute
                subre = "\\<("
                first=0
                for (s in subs)
                    subre = sprintf("%s%s%s", subre, first++ ? "|" : "", s)
                subre = sprintf("%s)\\>", subre)
             }
     }
     {
        # Do the substitution
        nwords = patsplit($0, words, subre, between)
        printf "%s", between[0]
        for (n = 1; n <= nwords; n++)
            printf "%s%s", subs[words[n]], between[n]
        printf "\n"
     }
' tab_sep_file.txt original_file.txt

which outputs

a +dam played with b ball

First it reads the TSV file and builds an array (subs) mapping each word to be replaced to its replacement text. After reading that file, it builds a regular expression that matches all of the words to be found: \<(a|adam)\> in this case. The \< and \> match only at the beginning and end of a word, respectively, so the a in ball won't match.

Then for the second file with the text you want to process, it uses patsplit() to split each line into an array of matched parts (words) and the bits between matches (between), and iterates over the length of the array, printing out the replacement text for each match. That way it avoids re-matching text that’s already been replaced.
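The same two steps can be sketched in Python, as a rough analogue of the gawk script rather than a translation of it. Here `re.split` with a capturing group stands in for patsplit(), and `\b` stands in for `\<` and `\>`. The names `build_pattern` and `replace_once` are hypothetical, and the words are used as raw regex fragments, as in the gawk version.

```python
import re

subs = {"adam": "a +dam", "a": "b", "ال": "ال+"}

def build_pattern(subs):
    """One alternation of all words, anchored at word boundaries."""
    # Longest words first, since Python tries alternatives left to right.
    words = sorted(subs, key=len, reverse=True)
    return re.compile(r"\b(" + "|".join(words) + r")\b")

def replace_once(line, subs, pattern):
    """Replace each matched word exactly once; never rescan the output."""
    # With one capturing group, split() alternates between text and matches:
    # even indexes hold the text between matches, odd indexes the matched words.
    parts = pattern.split(line)
    for i in range(1, len(parts), 2):
        parts[i] = subs[parts[i]]
    return "".join(parts)

pattern = build_pattern(subs)
print(replace_once("adam played with a ball ال", subs, pattern))
# a +dam played with b ball ال+
```

Because the replacement text is only ever written out, never searched, the a in "a +dam" is left alone.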


And a perl version that uses a similar approach (taking advantage of perl's ability to evaluate the replacement text in a s/// substitution):

perl -e '
     use strict;
     use warnings;
     # Set file/standard stream char encodings from locale
     use open ":locale"; 
     # Or for explicit UTF-8 text
     # use open ":encoding(UTF-8)", ":std";
     my %subs;
     open my $words, "<", shift or die $!;
     while (<$words>) {
        chomp;
        my ($word, $rep) = split "\t", $_, 2;
        $subs{$word} = $rep;
     }
     my $subre = "\\b(?:" . join("|", map { quotemeta } keys %subs) . ")\\b";
     while (<<>>) {
       print s/$subre/$subs{$&}/egr;
     }
' tab_sep_file.txt original_file.txt

(This one escapes regular expression metacharacters in the words to replace, which makes it more robust.)
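For comparison, a small Python sketch of the same callback-style substitution with escaping: `re.escape` plays the role of quotemeta, and a function passed to `re.sub` plays the role of the evaluated replacement. The function name `replace_words` and the sample word "a.b" are invented for illustration, and sorting the words longest first is an addition that the perl version does not make.

```python
import re

def replace_words(text, subs):
    """Single-pass whole-word replacement with metacharacters escaped."""
    # Escape each word so that "." or "+" in a word is matched literally.
    words = sorted(subs, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")
    # The callback looks up the replacement for whatever was matched.
    return pattern.sub(lambda m: subs[m.group(0)], text)

print(replace_words("adam played with a ball", {"adam": "a +dam", "a": "b"}))
# a +dam played with b ball

print(replace_words("axb a.b a", {"a.b": "X", "a": "b"}))
# axb X b
```

Without the escaping, the word "a.b" would act as a pattern and also match "axb", which then has no entry in the lookup table.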

User contributions licensed under: CC BY-SA