Skip to content
Advertisement

Fastest way to extract pattern

What is the fastest way to extract a substring of interest from input such as the following?

MsgTrace(65/26)noop:user=xxx=INBOX:cmd=534
ImapFetchComplete(56/39)user=xxxxxxxxxx

Desired output (i.e., the :-terminated string following the string MsgTrace(65/26) in this example):

noop

I tried the following, but without success:

egrep -i "[a-zA-Z]+(.*)[a-z]+:"

Advertisement

Answer

grep by default returns the entire line when a match is found on a given input line.

While option -o restricts the output to only that part of the line that the regex matched, that is still not enough in this case, because you want a substring of that match.

However, since you’re on Linux, you can use GNU grep‘s -P option (for support of PCREs, Perl-compatible regular expression), which allows extracting a submatch by way of features such as K (drop everything matched so far) and (?=...) (a look-ahead assertion that does not contribute to the match):

$ grep -Po  "[a-zA-Z](.*)K[a-z]+(?=:)" <<'EOF'
MsgTrace(65/26)noop:user=xxx=INBOX:cmd=534
ImapFetchComplete(56/39)user=xxxxxxxxxx
EOF
noop  # output

Optional background information:

Ed Morton points out (in a since-deleted comment) that GNU grep‘s man page still calls the -P option “highly experimental” that may “warn of unimplemented features”, but the option has been around for years, and in practice I have yet to see a warning or a performance problem – YMMV.

In the case at hand, the above command even outperforms sed and awk solutions – see NeronLeVelu’s helpful performance comparison.

The interesting article Ed points to discusses a potential performance problem that can surface with regex engines such as used by grep -P (via the PCRE library), Perl itself, and many other widely used (and mature) regex engines, such as in Python, Ruby, and PHP:

  • In short: the recursive backtracking algorithm employed by these engines can result in severe performance degradation with “pathological” regexes that string together long sequences of subexpressions with variable-length quantifiers, such as (a longer version of) a?a?a?a?aaaa to match aaaa.
  • The article argues that backtracking is only truly required when a regex contains backreferences, and that a different, much faster algorithm should be employed in their absence.
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement