What is the fastest way to extract a substring of interest from input such as the following?
MsgTrace(65/26)noop:user=xxx=INBOX:cmd=534 ImapFetchComplete(56/39)user=xxxxxxxxxx
Desired output (i.e., the :
-terminated string following the string MsgTrace(65/26)
in this example):
noop
I tried the following, but without success:
egrep -i "[a-zA-Z]+(.*)[a-z]+:"
Advertisement
Answer
grep
by default returns the entire line when a match is found on a given input line.
While option -o
restricts the output to only that part of the line that the regex matched, that is still not enough in this case, because you want a substring of that match.
However, since you’re on Linux, you can use GNU grep
‘s -P
option (for support of PCREs, Perl-compatible regular expression), which allows extracting a submatch by way of features such as K
(drop everything matched so far) and (?=...)
(a look-ahead assertion that does not contribute to the match):
$ grep -Po "[a-zA-Z](.*)K[a-z]+(?=:)" <<'EOF' MsgTrace(65/26)noop:user=xxx=INBOX:cmd=534 ImapFetchComplete(56/39)user=xxxxxxxxxx EOF noop # output
Optional background information:
Ed Morton points out (in a since-deleted comment) that GNU grep
‘s man
page still calls the -P
option “highly experimental” that may “warn of unimplemented features”, but the option has been around for years, and in practice I have yet to see a warning or a performance problem – YMMV.
In the case at hand, the above command even outperforms sed
and awk
solutions – see NeronLeVelu’s helpful performance comparison.
The interesting article Ed points to discusses a potential performance problem that can surface with regex engines such as used by grep -P
(via the PCRE library), Perl itself, and many other widely used (and mature) regex engines, such as in Python, Ruby, and PHP:
- In short: the recursive backtracking algorithm employed by these engines can result in severe performance degradation with “pathological” regexes that string together long sequences of subexpressions with variable-length quantifiers, such as (a longer version of)
a?a?a?a?aaaa
to matchaaaa
. - The article argues that backtracking is only truly required when a regex contains backreferences, and that a different, much faster algorithm should be employed in their absence.