I’m having some problems with the {} operator. In the following examples, I’m trying to find the rows with 1, 2, and 2 or more occurrences of the word mint, but I get a response only if I search for 1 occurrence of mint, even though there are more than 1 per row.
The input I am processing is a listing like this, obtained with the ls -l command:
-rw-r--r-- 1 mint mint 26 Dec 20 21:11 example.txt
-rw-r--r-- 1 mint mint 26 Dec 20 21:11 another.example
-rw-r--r-- 1 mint mint 19 Dec 20 15:11 something.else
-rw-r--r-- 1 mint mint 1 Dec 20 01:23 filemint
-rw-r--r-- 1 mint mint 26 Dec 20 21:11 mint
With ls -l | grep -E 'mint{1}' I find all the rows above, and I expected to find nothing (should be all the rows with 1 occurrence of mint).
With ls -l | grep -E 'mint{2}' I find nothing, and I expected to find the first 3 rows above (should be all the rows with 2 occurrences of mint).
With ls -l | grep -E 'mint{2,}' I expected to find all the rows above, and again I found nothing (should be all the rows with at least 2 occurrences of mint).
Am I missing something on how {} works?
Answer
Firstly, a “quantifier” in a regular expression refers to the “token” immediately before it, which by default is a single character. So mint{2} is looking for the character t twice: it is equivalent to m{1}i{1}n{1}t{2}, or mintt.
To search for a sequence of characters a number of times, you need to group that sequence, using parentheses. So (mint){2} would search for the sequence mint twice in a row, as in mintmint.
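As a quick check of the difference between quantifying one character and quantifying a group, here is a minimal sketch. The sample strings are made up for illustration and are not taken from the question's listing.

```shell
# Hypothetical sample strings, one per line.
samples='mint
mintt
mintmint
mint mint'

# t{2}: only the final "t" is repeated, so this matches "mintt".
tt_matches=$(printf '%s\n' "$samples" | grep -E 'mint{2}')

# (mint){2}: the whole group is repeated, so this matches "mintmint".
group_matches=$(printf '%s\n' "$samples" | grep -E '(mint){2}')

printf 'mint{2}   -> %s\n' "$tt_matches"
printf '(mint){2} -> %s\n' "$group_matches"
```

Neither pattern matches the line "mint mint", because of the space between the two words.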
Secondly, in your input, there are additional characters in between the occurrences of mint; the regular expression needs to specify that those are allowed. The simplest way to do that is using the pattern .*, which means “anything, zero or more times”. That gives you (mint.*){2}, which will match “mint followed by anything, twice”.
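To see the grouped pattern at work, the sketch below counts matching lines in a small listing. The first two lines are copied from the question; the third, root-owned line is a hypothetical addition so that one line contains mint only once.

```shell
# Sample listing: the first two lines come from the question; the last,
# root-owned line is a hypothetical addition with a single "mint".
listing='-rw-r--r-- 1 mint mint 26 Dec 20 21:11 example.txt
-rw-r--r-- 1 mint mint 1 Dec 20 01:23 filemint
-rw-r--r-- 1 root root 5 Dec 20 01:23 mint'

# grep -c prints the number of matching lines.
at_least_two=$(printf '%s\n' "$listing" | grep -c -E '(mint.*){2}')
at_least_three=$(printf '%s\n' "$listing" | grep -c -E '(mint.*){3}')

echo "lines with at least 2 occurrences: $at_least_two"
echo "lines with at least 3 occurrences: $at_least_three"
```

Two lines contain mint at least twice, and only the filemint line contains it three times, since the file name adds a third occurrence.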
Finally, given the input “mint mint”, the pattern (mint.*){1} will match. It doesn’t care that some of the “extra” characters also spell “mint”; it just knows that the required parts are there. In fact, {1} is always redundant, and (mint.*){1} selects exactly the same lines as plain mint does. In general, regular expressions are good at asserting what is there, and not at asserting what is not there.
Some regular expression flavours have “lookahead assertions” which can process negative assertions like “not followed by mint”, but grep -E does not. What it does have is a switch, -v, which inverts the match: it shows all lines except the ones matched by the regular expression. A simple approach to say “no more than 1 instance of mint” is therefore to run grep twice, once normally and once with -v:
# At least once, but not twice -> exactly once
ls -l | grep -E 'mint' | grep -v -E '(mint.*){2}'

# At least twice, but not three times -> exactly twice
ls -l | grep -E '(mint.*){2}' | grep -v -E '(mint.*){3}'
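The two pipelines above follow the same shape, so they could be wrapped in a small helper. This is only a sketch: exactly_n is a hypothetical function name, the sample lines are invented, and the word is assumed to contain no regex metacharacters.

```shell
# exactly_n WORD N: hypothetical helper that reads lines on stdin and prints
# those containing WORD exactly N times. WORD is pasted into the regex
# unescaped, so it is assumed to contain no regex metacharacters.
exactly_n() {
  word=$1
  n=$2
  # Keep lines with at least N occurrences, then drop those with N+1 or more.
  grep -E "($word.*){$n}" | grep -v -E "($word.*){$((n + 1))}"
}

# Invented sample lines with 1, 2, 3 and 0 occurrences of "mint".
sample='a mint b
mint mint
mint mint mint
none'

once=$(printf '%s\n' "$sample" | exactly_n mint 1)
twice=$(printf '%s\n' "$sample" | exactly_n mint 2)
echo "exactly once:  $once"
echo "exactly twice: $twice"
```

With the real listing, the equivalent call would be ls -l | exactly_n mint 2.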