Regular expression operator {} in linux bash

I’m having some problems with the {} operator. In the following examples, I’m trying to find the rows with 1, 2, and 2 or more occurrences of the word mint, but I get a response only if I search for 1 occurrence of mint, even though there are more than 1 per row.

The input I am processing is a listing like this, obtained with the ls -l command:

-rw-r--r--  1 mint mint   26 Dec 20 21:11 example.txt
-rw-r--r--  1 mint mint   26 Dec 20 21:11 another.example
-rw-r--r--  1 mint mint   19 Dec 20 15:11 something.else
-rw-r--r--  1 mint mint    1 Dec 20 01:23 filemint
-rw-r--r--  1 mint mint   26 Dec 20 21:11 mint

With ls -l | grep -E 'mint{1}' I find all the rows above, and I expected to find nothing (should be all the rows with 1 occurrence of mint).

With ls -l | grep -E 'mint{2}' I find nothing, and I expected to find the first 3 rows above (should be all the rows with 2 occurrences of mint).

With ls -l | grep -E 'mint{2,}' I expected to find all the rows above, and again I found nothing (should be all the rows with at least 2 occurrences of mint).

Am I missing something on how {} works?

Answer

Firstly, a “quantifier” in a regular expression refers to the “token” immediately before it, which by default is a single character. So mint{2} is looking for the character t twice – it is equivalent to m{1}i{1}n{1}t{2}, or mintt.

To search for a sequence of characters a number of times, you need to group that sequence, using parentheses. So (mint){2} would search for the sequence mint twice in a row, as in mintmint.
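
As a quick check of the difference, here is a minimal sketch using a few made-up strings (not taken from your listing), fed to grep with printf. The variable names are only for illustration.

```shell
# Made-up sample strings, one per line (not from the question's listing).
samples=$(printf '%s\n' 'mintt' 'mintmint' 'mint mint')

# {2} binds to the single character t, so this only finds "mintt".
single_char=$(printf '%s\n' "$samples" | grep -E 'mint{2}')

# With the group, {2} applies to the whole sequence, so this finds "mintmint".
grouped=$(printf '%s\n' "$samples" | grep -E '(mint){2}')

printf 'mint{2}   -> %s\n' "$single_char"
printf '(mint){2} -> %s\n' "$grouped"
```

Neither pattern finds “mint mint”, because nothing in them allows for the space in between.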

Secondly, in your input, there are additional characters in between the occurrences of mint; the regular expression needs to specify that those are allowed.

The simplest way to do that is using the pattern .*, which means “anything, zero or more times”. That gives you (mint.*){2} which will match “mint followed by anything, twice”.
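
To try this on the listing from the question, here is a small sketch. The listing function is just a stand-in for ls -l so the example can be run anywhere, and grep -c is used to count the matching lines rather than print them.

```shell
# Stand-in for `ls -l`: the sample listing from the question.
listing() {
  cat <<'EOF'
-rw-r--r--  1 mint mint   26 Dec 20 21:11 example.txt
-rw-r--r--  1 mint mint   26 Dec 20 21:11 another.example
-rw-r--r--  1 mint mint   19 Dec 20 15:11 something.else
-rw-r--r--  1 mint mint    1 Dec 20 01:23 filemint
-rw-r--r--  1 mint mint   26 Dec 20 21:11 mint
EOF
}

# Count the lines with at least two occurrences of mint.
at_least_two=$(listing | grep -c -E '(mint.*){2}')

# Count the lines with at least three occurrences of mint.
at_least_three=$(listing | grep -c -E '(mint.*){3}')

printf 'at least 2: %s, at least 3: %s\n' "$at_least_two" "$at_least_three"
# prints: at least 2: 5, at least 3: 2
```

All five lines have mint as both owner and group, so all five have at least two occurrences; only the last two also have it in the file name.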

Finally, given the input “mint mint”, the pattern (mint.*){1} will match – it doesn’t care that some of the “extra” characters also spell “mint”, it just knows that the required parts are there. In fact, {1} is always redundant, and (mint.*){1} selects exactly the same lines as plain mint. In general, regular expressions are good at asserting what is there, and not at asserting what is not there.
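
A minimal sketch of that redundancy, again with made-up sample strings rather than real ls -l output:

```shell
# Made-up sample strings; the last one does not contain mint at all.
samples=$(printf '%s\n' 'mint' 'mint mint' 'no match here')

# Plain pattern: any line containing mint at least once.
plain=$(printf '%s\n' "$samples" | grep -E 'mint')

# Same search written with the redundant {1}.
with_one=$(printf '%s\n' "$samples" | grep -E '(mint.*){1}')

# Both searches select the same two lines, including "mint mint".
[ "$plain" = "$with_one" ] && echo 'same lines selected'
```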

Some regular expression flavours have “lookahead assertions” which can process negative assertions like “not followed by mint”, but grep -E does not. What it does have is a switch, -v, which inverts the match – it shows all lines except the ones matched by the regular expression. A simple approach to say “no more than 1 instance of mint” is therefore to run grep twice – once normally, and once with -v:

# At least once, but not twice -> exactly once
ls -l | grep -E 'mint' | grep -v -E '(mint.*){2}'

# At least twice, but not three times -> exactly twice
ls -l | grep -E '(mint.*){2}' | grep -v -E '(mint.*){3}'
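
If you need this for other counts, the same two-step idea can be wrapped in a small shell function. This is only a sketch: exactly is a made-up name, listing is a stand-in for ls -l using the sample from the question, and it assumes the word contains no characters that are special in a regular expression.

```shell
# Stand-in for `ls -l`: the sample listing from the question.
listing() {
  cat <<'EOF'
-rw-r--r--  1 mint mint   26 Dec 20 21:11 example.txt
-rw-r--r--  1 mint mint   26 Dec 20 21:11 another.example
-rw-r--r--  1 mint mint   19 Dec 20 15:11 something.else
-rw-r--r--  1 mint mint    1 Dec 20 01:23 filemint
-rw-r--r--  1 mint mint   26 Dec 20 21:11 mint
EOF
}

# Hypothetical helper, usage: exactly WORD N
# Prints the lines read from stdin that contain WORD exactly N times:
# at least N times, but not N+1 times.
# Assumes WORD has no regex metacharacters.
exactly() {
  grep -E "($1.*){$2}" | grep -v -E "($1.*){$(($2 + 1))}"
}

listing | exactly mint 2   # the three files without mint in their name
listing | exactly mint 3   # filemint and mint
```

With this listing, exactly mint 1 prints nothing, since every line already has mint as both owner and group.
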
User contributions licensed under: CC BY-SA