In an XML file, a multiline block is delimited by <start> and </start>.
I need to find and delete these multiline blocks if they contain a given set of patterns, in any order (pattern1, pattern2, etc.).
For example, given the following:
<xml>
...
<start>
<x>pattern2</x>
<y>pattern1<y>
</start>
<start>
<x>pattern2</x>
<y>string1<y>
</start>
<start>
<y>string2<y>
<x>pattern1</x>
</start>
<start>
<x>pattern1</x>
<y>pattern2<y>
</start>
<start>
<x>string3</x>
<y>string4<y>
</start>
...
</xml>
If searching for pattern1 only, blocks 1, 3, and 4 should be deleted:
<xml>
...
<start>
<x>pattern2</x>
<y>string1<y>
</start>
<start>
<x>string3</x>
<y>string4<y>
</start>
...
</xml>
If searching for pattern2 only, blocks 1, 2, and 4 should be deleted:
<xml>
...
<start>
<y>string2<y>
<x>pattern1</x>
</start>
<start>
<x>string3</x>
<y>string4<y>
</start>
...
</xml>
If searching for both pattern1 and pattern2, blocks 1 and 4 should be deleted:
<xml>
...
<start>
<x>pattern2</x>
<y>string1<y>
</start>
<start>
<y>string2<y>
<x>pattern1</x>
</start>
<start>
<x>string3</x>
<y>string4<y>
</start>
...
</xml>
I managed to identify the blocks using
sed -n '/<start>/,/<\/start>/p' file
How can I identify only those blocks that match multiple patterns, in any order?
Thanks for your help
Answer
$ awk '$0~"<start>"{f=1;p=0;a=""}
f{a=a RS $0}
!f{print}
/pattern1/&&f{p=1}
$0~"</start>"{if(!p) print a;f=0}' file
<xml>
...
<start>
<x>pattern2</x>
<y>string1<y>
</start>
<start>
<x>string3</x>
<y>string4<y>
</start>
...
</xml>
You can generalize this to multiple patterns as well, deleting a block only when every pattern matches somewhere in it:
$ awk '$0~"<start>"{f=1;p1=0;p2=0;a=""}
f{a=a RS $0}
!f{print}
/pattern1/&&f{p1=1}
/pattern2/&&f{p2=1}
$0~"</start>"{if(!(p1&&p2)) print a;f=0}' file
<xml>
...
<start>
<x>pattern2</x>
<y>string1<y>
</start>
<start>
<y>string2<y>
<x>pattern1</x>
</start>
<start>
<x>string3</x>
<y>string4<y>
</start>
...
</xml>
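Because a starts out empty, a RS $0 puts a newline in front of the first line of every buffered block. To eliminate these extra blank lines, change f{a=a RS $0} to f{a=a?a RS $0:$0}.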
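The same idea can be stretched to an arbitrary list of patterns. What follows is only a sketch under a few assumptions: the variable name pats, the space-separated list format, and the trimmed sample file are made up for illustration and are not part of the original answer. It records each pattern separately in an array (seen), so a pattern that matches on two lines of the same block is not mistaken for two different patterns, and it builds the buffer without the leading newline.

```shell
# Hypothetical sample input, trimmed from the question's example.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<xml>
<start>
<x>pattern2</x>
<y>pattern1<y>
</start>
<start>
<x>pattern2</x>
<y>string1<y>
</start>
<start>
<x>pattern1</x>
<y>pattern1<y>
</start>
</xml>
EOF

# pats (assumed name): space-separated regexes that must ALL match
# somewhere inside a block for that block to be deleted.
out=$(awk -v pats='pattern1 pattern2' '
BEGIN       { n = split(pats, pat, " ") }
# Opening tag: start buffering and forget matches from the previous block.
/<start>/   { f = 1; a = ""; split("", seen) }
# Inside a block: append the line and note which patterns it matches.
f           { a = (a == "" ? $0 : a RS $0)
              for (i = 1; i <= n; i++) if ($0 ~ pat[i]) seen[i] = 1 }
# Outside any block: pass the line through unchanged.
!f          { print }
# Closing tag: print the buffer unless every pattern was seen.
f && /<\/start>/ { hits = 0
              for (i = 1; i <= n; i++) hits += (i in seen)
              if (hits < n) print a
              f = 0 }
' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

With this sample, the first block (both patterns) is dropped, while the second (pattern2 only) and the third (pattern1 on two lines, no pattern2) are kept. Passing a single regex in pats reproduces the one-pattern case.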