
sed/awk – remove a multiline block if it contains multiple patterns

In an XML file, a multiline block is delimited by <start> and </start>. I need to find and delete these blocks if they contain a given set of patterns, in any order (pattern1, pattern2, etc.).

For example, in the following:

<xml>
    ...
    <start>
        <x>pattern2</x>
        <y>pattern1<y>
    </start>
    <start>
        <x>pattern2</x>
        <y>string1<y>
    </start>
    <start>
        <y>string2<y>
        <x>pattern1</x>
    </start>
    <start>
        <x>pattern1</x>
        <y>pattern2<y>
    </start>
    <start>
        <x>string3</x>
        <y>string4<y>
    </start>
    ...
 </xml>

If searching for pattern1 only, blocks 1, 3, and 4 should be deleted:

<xml>
    ...
    <start>
        <x>pattern2</x>
        <y>string1<y>
    </start>
    <start>
        <x>string3</x>
        <y>string4<y>
    </start>
    ...
 </xml>

If searching for pattern2 only, blocks 1, 2, and 4 should be deleted:

<xml>
    ...
    <start>
        <y>string2<y>
        <x>pattern1</x>
    </start>
    <start>
        <x>string3</x>
        <y>string4<y>
    </start>
    ...
 </xml>

If searching for both pattern1 and pattern2, blocks 1 and 4 should be deleted:

<xml>
    ...
    <start>
        <x>pattern2</x>
        <y>string1<y>
    </start>
    <start>
        <y>string2<y>
        <x>pattern1</x>
    </start>
    <start>
        <x>string3</x>
        <y>string4<y>
    </start>
    ...
 </xml>

I managed to identify blocks using

sed -n "/<start>/,/<\/start>/p" file

How can I identify the blocks that match multiple patterns, regardless of their order?

Thanks for your help


Answer

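$ awk '$0~"<start>"{f=1;p=0;a=""}   # block opens: start buffering, reset the match flag
                  f{a=a RS $0}      # inside a block: append the line to the buffer
                 !f{print}          # outside a block: print the line as-is
      /pattern1/&&f{p=1}            # the block contains pattern1
      $0~"</start>"{if(!p) print a;f=0}   # block closes: print it unless it matched
      ' file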

<xml>
    ...

    <start>
        <x>pattern2</x>
        <y>string1<y>
    </start>

    <start>
        <x>string3</x>
        <y>string4<y>
    </start>
    ...
 </xml>

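You can generalize this to multiple patterns as well. Use one flag per pattern rather than a shared counter, so that a pattern occurring on two lines of the same block is not counted twice:

$ awk '$0~"<start>"{f=1;p1=p2=0;a=""}   # block opens: reset one flag per pattern
                  f{a=a RS $0}
                 !f{print}
      /pattern1/&&f{p1=1}
      /pattern2/&&f{p2=1}
      $0~"</start>"{if(!(p1&&p2)) print a;f=0}   # drop the block only if both matched
      ' file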
<xml>
    ...

    <start>
        <x>pattern2</x>
        <y>string1<y>
    </start>

    <start>
        <y>string2<y>
        <x>pattern1</x>
    </start>

    <start>
        <x>string3</x>
        <y>string4<y>
    </start>
    ...
 </xml>

To eliminate the extra blank lines, change f{a=a RS $0} to f{a=a?a RS $0:$0}, so that the separator is added only when the buffer is already non-empty.
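
The same idea can be extended to any number of patterns. The following is only a sketch, not part of the original answer: the function name drop_blocks and the awk variables (pat, seen, buf, sep, inblock) are made up for illustration. It assumes the patterns are awk regular expressions passed as extra arguments after the file name, and it joins buffered lines with a separator only between them, so no blank lines appear.

```shell
# drop_blocks FILE PATTERN...
# Print FILE without the <start>...</start> blocks that contain every PATTERN.
# (Hypothetical helper; each PATTERN is treated as an awk regular expression.)
drop_blocks() {
    file=$1
    shift
    awk '
        BEGIN {
            # ARGV[1] is the input file; the rest are patterns. Copy them
            # into pat[] and blank them so awk does not try to open them.
            for (i = 2; i < ARGC; i++) { pat[++n] = ARGV[i]; ARGV[i] = "" }
        }
        # Block opens: start a fresh buffer and clear the per-pattern flags.
        /<start>/ { inblock = 1; buf = ""; sep = ""; for (i = 1; i <= n; i++) seen[i] = 0 }
        # Outside a block: print the line unchanged.
        !inblock  { print; next }
        {
            # Inside a block: buffer the line and record which patterns it matches.
            buf = buf sep $0; sep = RS
            for (i = 1; i <= n; i++) if ($0 ~ pat[i]) seen[i] = 1
        }
        # Block closes: keep it unless every pattern was seen.
        /<\/start>/ {
            hits = 0
            for (i = 1; i <= n; i++) hits += seen[i]
            if (n == 0 || hits < n) print buf
            inblock = 0
        }
    ' "$file" "$@"
}
```

With the sample file from the question, drop_blocks file pattern1 pattern2 should remove blocks 1 and 4 only, and drop_blocks file pattern1 should remove blocks 1, 3, and 4. When no pattern is given, this sketch keeps every block.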

User contributions licensed under: CC BY-SA