
egrep two identical numbers and two identical letters n times

Hi, I've been having trouble with the egrep command. Here is my question: let's say I am running a for loop over these words:

99aa88bb99aa88bb 9a9a 11bb11bb 11bb11dd 12aa12aa
33aa33bb33aa33bb 

I only want to print a word if it has two identical numbers and two identical letters and the word repeats itself. For example, in this case the only words that should print are:

99aa88bb99aa88bb
11bb11bb
33aa33bb33aa33bb

because each word has one or more sets of two identical numbers and two identical letters, and then it repeats itself.

Here is another example. I am going over these words in a loop:

 aa99aa99 00aa00bb00aa00bb 44aa44aac
 2222aaaa2222aaaa 11cc11cc11cc11cc 

The only words that should print are

 00aa00bb00aa00bb
 11cc11cc11cc11cc

because of what is mentioned above. I am really struggling with how to do this. My current command, which isn't working, is:

egrep "^((([0-9])\3([a-z])\4)(([0-9])\6([a-z])\7))\1*$" tmp

It is not working because it prints words like:

11bb11dd

which are not allowed.

Any help would be highly appreciated.


Answer

Try this:

$ cat ip.txt
99aa88bb99aa88bb 9a9a 11bb11bb 11bb11dd 12aa12aa
33aa33bb33aa33bb 
 aa99aa99 00aa00bb00aa00bb 44aa44aac
 2222aaaa2222aaaa 11cc11cc11cc11cc 

$ grep -owE '(([0-9])\2([a-z])\3([0-9])\4([a-z])\5)\1+|(([0-9])\7([a-z])\8)\6' ip.txt
99aa88bb99aa88bb
11bb11bb
33aa33bb33aa33bb
00aa00bb00aa00bb
11cc11cc11cc11cc

This has two cases:

1) (([0-9])\2([a-z])\3([0-9])\4([a-z])\5)\1+ is an 8-character construct that must repeat at least once. That is the key: using \1* allows zero repeats and will falsely match 11bb11dd

2) (([0-9])\7([a-z])\8)\6 is a 4-character construct that repeats exactly once
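
The effect of the quantifier after the backreference can be checked in isolation. This is a minimal sketch, assuming GNU grep; the variable names (block, star_out, plus_out) are made up for illustration, and the single input word is the false positive from the question.

```shell
# The 8-character construct from case 1, without the trailing quantifier.
block='(([0-9])\2([a-z])\3([0-9])\4([a-z])\5)'

# \1* allows zero repeats, so a lone 8-character block is accepted.
star_out=$(printf '%s\n' 11bb11dd | grep -xE "$block"'\1*' || true)

# \1+ requires at least one repeat of the whole block, so the same word is rejected.
plus_out=$(printf '%s\n' 11bb11dd | grep -xE "$block"'\1+' || true)

printf 'star: [%s]\n' "$star_out"
printf 'plus: [%s]\n' "$plus_out"
```

The `|| true` is only there so that grep's non-zero exit status on "no match" does not abort a script running under `set -e`.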


If you have them on separate lines, this would do:

grep -xE '(([0-9])\2([a-z])\3([0-9])\4([a-z])\5)\1+|(([0-9])\7([a-z])\8)\6'


If 11bb11bb11bb has to be matched as well, use \6+ instead of \6 in the second case
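
As a rough sketch of that variant (assuming GNU grep and one word per line; the variable names are illustrative only), the second alternative ends in \6+ so the 4-character construct may repeat more than once:

```shell
# Same two alternatives as above, but the 4-character case now ends in \6+.
pattern='(([0-9])\2([a-z])\3([0-9])\4([a-z])\5)\1+|(([0-9])\7([a-z])\8)\6+'

# -x makes the pattern match the whole line.
triple_out=$(printf '%s\n' 11bb11bb 11bb11bb11bb 11bb11dd | grep -xE "$pattern" || true)

printf '%s\n' "$triple_out"
```

Here 11bb11bb and 11bb11bb11bb should be printed, while 11bb11dd should still be rejected because neither alternative finds a repeated block in it.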


Or, use this very clever suggestion by Nahuel Fouilleul

$ grep -owE '((([0-9])\3([a-z])\4)+)\1+' ip.txt
99aa88bb99aa88bb
11bb11bb
33aa33bb33aa33bb
00aa00bb00aa00bb
11cc11cc11cc11cc
  • (([0-9])\3([a-z])\4)+ forms the base: 4/8/12/16/etc. characters consisting of a repeated digit followed by a repeated letter, one or more times
  • that base is then captured in the outer group, and \1+ requires it to repeat at least once
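
The two bullet points can be tried out one step at a time. The following is a sketch assuming GNU grep; the variable names and the short test words in step 1 are made up for illustration, while the input in step 2 is the second example from the question.

```shell
# Step 1: the base alone accepts any run of "doubled digit, doubled letter" units.
# Group numbers are one lower here because there is no outer capture group yet.
base='(([0-9])\2([a-z])\3)+'
base_out=$(printf '%s\n' 11bb 99aa88bb 12aa 2222aaaa | grep -xE "$base" || true)
printf 'base matches:\n%s\n' "$base_out"

# Step 2: capture the whole run in an outer group and require it to repeat with \1+.
full='((([0-9])\3([a-z])\4)+)\1+'
full_out=$(printf '%s\n' ' aa99aa99 00aa00bb00aa00bb 44aa44aac' \
                         ' 2222aaaa2222aaaa 11cc11cc11cc11cc ' \
           | grep -owE "$full" || true)
printf 'full matches:\n%s\n' "$full_out"
```

Step 1 should accept 11bb and 99aa88bb only; step 2 should print the same two words the answer lists for these lines, 00aa00bb00aa00bb and 11cc11cc11cc11cc.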


Note that if the input is large and your grep has the PCRE -P option, use that instead of -E, as backreferences would be much faster, at least in the case of GNU grep
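
Switching to -P only changes the option, since the backreference syntax is the same for this pattern. A minimal sketch, assuming GNU grep; the availability check and the fallback message are illustrative additions, because -P is not compiled into every grep build.

```shell
words='99aa88bb99aa88bb 9a9a 11bb11bb 11bb11dd 12aa12aa'
pattern='((([0-9])\3([a-z])\4)+)\1+'

# Probe whether this grep build supports -P before relying on it.
if printf 'a\n' | grep -qP 'a' 2>/dev/null; then
    pcre_out=$(printf '%s\n' "$words" | grep -owP "$pattern" || true)
else
    pcre_out='(grep -P not available)'
fi

printf '%s\n' "$pcre_out"
```

No timing is shown here; whether -P is actually faster depends on the grep version and the input, so it is worth measuring on your own data.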

User contributions licensed under: CC BY-SA