In an XML file, a multiline block is delimited by <start> and </start>.
I need to find and delete these multiline blocks if they contain a given set of patterns, in any order (pattern1, pattern2, etc.).
For example, in the following:
    <xml>
    ...
    <start>
    <x>pattern2</x>
    <y>pattern1</y>
    </start>
    <start>
    <x>pattern2</x>
    <y>string1</y>
    </start>
    <start>
    <y>string2</y>
    <x>pattern1</x>
    </start>
    <start>
    <x>pattern1</x>
    <y>pattern2</y>
    </start>
    <start>
    <x>string3</x>
    <y>string4</y>
    </start>
    ...
    </xml>
If searching for pattern1 only, blocks 1, 3 and 4 should be deleted:
    <xml>
    ...
    <start>
    <x>pattern2</x>
    <y>string1</y>
    </start>
    <start>
    <x>string3</x>
    <y>string4</y>
    </start>
    ...
    </xml>
If searching for pattern2 only, blocks 1, 2 and 4 should be deleted:
    <xml>
    ...
    <start>
    <y>string2</y>
    <x>pattern1</x>
    </start>
    <start>
    <x>string3</x>
    <y>string4</y>
    </start>
    ...
    </xml>
If searching for both pattern1 and pattern2, blocks 1 and 4 should be deleted:
    <xml>
    ...
    <start>
    <x>pattern2</x>
    <y>string1</y>
    </start>
    <start>
    <y>string2</y>
    <x>pattern1</x>
    </start>
    <start>
    <x>string3</x>
    <y>string4</y>
    </start>
    ...
    </xml>
I managed to identify the blocks using
    sed -n '/<start>/,/<\/start>/p' file
How can I identify the blocks that match multiple patterns in any order?
Thanks for your help
Answer
    $ awk '$0~"<start>"{f=1;p=0;a=""} f{a=a RS $0} !f{print} /pattern1/&&f{p=1} $0~"</start>"{if(!p) print a;f=0}' file
    <xml>
    ...

    <start>
    <x>pattern2</x>
    <y>string1</y>
    </start>

    <start>
    <x>string3</x>
    <y>string4</y>
    </start>
    ...
    </xml>
You can generalize this to multiple patterns as well:
    $ awk '$0~"<start>"{f=1;p=0;a=""} f{a=a RS $0} !f{print} /pattern1/&&f{p++} /pattern2/&&f{p++} $0~"</start>"{if(p!=2) print a;f=0}' file
    <xml>
    ...

    <start>
    <x>pattern2</x>
    <y>string1</y>
    </start>

    <start>
    <y>string2</y>
    <x>pattern1</x>
    </start>

    <start>
    <x>string3</x>
    <y>string4</y>
    </start>
    ...
    </xml>
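Note that the p++ counter counts matching lines, so a block with pattern1 on two lines and no pattern2 would also reach p==2 and be dropped. One possible way around that, and a way to accept any number of patterns, is to keep one flag per pattern. The following is only a sketch under assumptions: the function name drop_blocks and the awk variables pat, seen, buf and inblock are made up for illustration, and each pattern is treated as an awk regular expression.

```shell
#!/bin/sh
# drop_blocks FILE PATTERN...   (hypothetical helper name)
# Prints FILE, omitting every <start>...</start> block in which all
# PATTERNs occur, in any order and on any lines of the block.
drop_blocks() {
    file=$1
    shift
    awk '
        BEGIN {
            # Patterns arrive as extra command-line arguments. Copy them,
            # then blank them in ARGV so awk does not open them as files.
            for (i = 2; i < ARGC; i++) { pat[i - 1] = ARGV[i]; ARGV[i] = "" }
            n = ARGC - 2
        }
        # Block opens: start buffering and forget earlier hits.
        /<start>/ { inblock = 1; buf = ""; split("", seen) }
        # Outside a block: pass the line through unchanged.
        !inblock  { print; next }
        {
            # Inside a block: buffer the line and flag each pattern it matches.
            buf = (buf == "" ? $0 : buf RS $0)
            for (i = 1; i <= n; i++) if ($0 ~ pat[i]) seen[i] = 1
        }
        # Block closes: print it unless every pattern was seen.
        /<\/start>/ {
            hits = 0
            for (i = 1; i <= n; i++) hits += (i in seen)
            if (hits < n) print buf
            inblock = 0
        }
    ' "$file" "$@"
}
```

It would be called as drop_blocks file pattern1 pattern2. Because each pattern sets its own flag, repeated matches of the same pattern are counted once.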
To eliminate the extra blank lines, change f{a=a RS $0} to f{a=a?a RS $0:$0}
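With that change applied, the single-pattern command might look like the sketch below, spread over several lines with comments. The wrapper name drop_one and the -v variable pat are assumptions added here for convenience; the command above hard-codes /pattern1/ instead.

```shell
#!/bin/sh
# drop_one FILE PATTERN   (hypothetical wrapper name)
# Same logic as the one-pattern command, with the blank-line fix applied.
drop_one() {
    awk -v pat="$2" '
        $0 ~ "<start>"  { f = 1; p = 0; a = "" }    # block opens: reset state
        f               { a = a ? a RS $0 : $0 }    # buffer lines, no leading RS
        !f              { print }                   # outside blocks: pass through
        f && $0 ~ pat   { p = 1 }                   # remember a hit in this block
        $0 ~ "</start>" { if (!p) print a; f = 0 }  # block closes: print unless hit
    ' "$1"
}
```

For example, drop_one file pattern1 should print the file without the blocks that contain pattern1 and without the blank line before each kept block.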