Hi, I've been having trouble with the egrep command. Here is my question: let's say I am running a for loop over these words:
99aa88bb99aa88bb 9a9a 11bb11bb 11bb11dd 12aa12aa 33aa33bb33aa33bb
I only want to print a word if it has two identical digits followed by two identical letters, and the word repeats itself. For example, in this case the only words that should print are:
99aa88bb99aa88bb 11bb11bb 33aa33bb33aa33bb
because each word has one or more sets of two identical digits followed by two identical letters, and then it repeats itself.
Here is another example. I am going over these words in a loop:
aa99aa99 00aa00bb00aa00bb 44aa44aac 2222aaaa2222aaaa 11cc11cc11cc11cc
The only words that should print are:
00aa00bb00aa00bb 11cc11cc11cc11cc
for the reasons mentioned above. I am really struggling with how to do this. My current command, which isn't working, is:
egrep "^((([0-9])\3([a-z])\4)(([0-9])\6([a-z])\7))\1*$" tmp
It doesn't work because it also prints words like:
11bb11dd
which are not allowed.
Any help would be highly appreciated.
Answer
Try this:
$ cat ip.txt
99aa88bb99aa88bb 9a9a 11bb11bb 11bb11dd 12aa12aa 33aa33bb33aa33bb
aa99aa99 00aa00bb00aa00bb 44aa44aac 2222aaaa2222aaaa 11cc11cc11cc11cc

$ grep -owE '(([0-9])\2([a-z])\3([0-9])\4([a-z])\5)\1+|(([0-9])\7([a-z])\8)\6' ip.txt
99aa88bb99aa88bb
11bb11bb
33aa33bb33aa33bb
00aa00bb00aa00bb
11cc11cc11cc11cc
This has two cases:
1) (([0-9])\2([a-z])\3([0-9])\4([a-z])\5)\1+
which has an 8-character construct repeating at least once. That is the key: using \1* instead will falsely match 11bb11dd, because * also allows zero repeats.
2) (([0-9])\7([a-z])\8)\6
which has a 4-character construct repeating exactly once.
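The difference between \1* and \1+ in case 1 can be checked directly. This is only a minimal sketch, assuming a grep that accepts back-references with -E (GNU grep does); the variable names unit, star and plus are made up for illustration:

```shell
# 8-character unit: doubled digit, doubled letter, doubled digit, doubled letter
unit='(([0-9])\2([a-z])\3([0-9])\4([a-z])\5)'

# \1* allows zero repeats, so the unit on its own is accepted
star=$(printf '%s\n' 11bb11dd | grep -xE "$unit"'\1*')

# \1+ demands at least one more copy of the unit
plus=$(printf '%s\n' 11bb11dd 99aa88bb99aa88bb | grep -xE "$unit"'\1+')

printf 'accepted with star: %s\n' "$star"
printf 'accepted with plus: %s\n' "$plus"
```

With the star quantifier 11bb11dd gets through, while with the plus quantifier only the genuinely repeated word remains.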
If you have them on separate lines, this would do:
grep -xE '(([0-9])\2([a-z])\3([0-9])\4([a-z])\5)\1+|(([0-9])\7([a-z])\8)\6'
If 11bb11bb11bb has to be matched as well, use \6+ instead of \6.
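A quick way to try the \6+ variant on words that sit on separate lines, again only as a sketch assuming GNU grep; the three sample words and the variable names are chosen here for illustration:

```shell
# Second alternative uses \6+ so the 4-character unit may repeat more than once
pattern='(([0-9])\2([a-z])\3([0-9])\4([a-z])\5)\1+|(([0-9])\7([a-z])\8)\6+'

# One word per line; -x requires the whole line to match
matched=$(printf '%s\n' 11bb11bb 11bb11bb11bb 11bb11dd | grep -xE "$pattern")
printf '%s\n' "$matched"
```

Here 11bb11bb and 11bb11bb11bb should be kept, and 11bb11dd dropped.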
Or, use this very clever suggestion by Nahuel Fouilleul:
$ grep -owE '((([0-9])\3([a-z])\4)+)\1+' ip.txt
99aa88bb99aa88bb
11bb11bb
33aa33bb33aa33bb
00aa00bb00aa00bb
11cc11cc11cc11cc
(([0-9])\3([a-z])\4)+
forms the base: 4/8/12/16/etc. characters consisting of a repeated digit followed by a repeated letter. That is then captured in the outer group and repeated at least once.
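Since the question loops over the words one at a time, the compact pattern could be applied inside such a loop roughly like this. It is a sketch rather than a definitive solution: the word list is the second example from the question, and the accepted variable is a name introduced here.

```shell
# Compact pattern: one or more doubled-digit/doubled-letter pairs, then that whole run again
pattern='((([0-9])\3([a-z])\4)+)\1+'

accepted=''
for word in aa99aa99 00aa00bb00aa00bb 44aa44aac 2222aaaa2222aaaa 11cc11cc11cc11cc; do
    # -x anchors the match to the whole word, -q suppresses grep's own output
    if printf '%s\n' "$word" | grep -qxE "$pattern"; then
        accepted="${accepted:+$accepted }$word"
    fi
done
printf '%s\n' "$accepted"
```

Using -x per word plays the same role as -w does when several words share a line.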
Note that if the input is large and your grep has the PCRE -P option, use that instead of -E, as backreferences would be much faster, at least in the case of GNU grep.
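Because -P is only present when grep was built with PCRE support, a script might probe for it first and fall back to -E. A hedged sketch of that idea; the probe and the engine variable are illustrative and not part of the original answer:

```shell
pattern='((([0-9])\3([a-z])\4)+)\1+'

# Probe for PCRE support; fall back to -E when -P is rejected
if printf 'a\n' | grep -qP 'a' 2>/dev/null; then
    engine=-P
else
    engine=-E
fi

# The pattern itself is written the same way for both engines
result=$(printf '%s\n' 11bb11bb 11bb11dd 33aa33bb33aa33bb | grep -x "$engine" "$pattern")
printf '%s\n' "$result"
```

Either engine should keep 11bb11bb and 33aa33bb33aa33bb here; any speed difference would have to be measured on your own input.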