I am trying to use awk to extract a portion of each line in my file

I have a large file of web server log lines (including user agent strings), and I want to extract one particular section of each request.

For input:

207.46.13.9 - - [22/Jan/2019:08:02:29 +0330] "GET /product/23474/%D9%84%DB%8C%D8%B2%D8%B1-%D8%A8%D8%AF%D9%86-%D8%AE%D8%A7%D9%86%DA%AF%DB%8C-%D8%B1%D9%85%DB%8C%D9%86%DA%AF%D8%AA%D9%88%D9%86-%D9%85%D8%AF%D9%84-Remington-Laser-Hair-Removal-IPL6250 HTTP/1.1" 200 41766 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"

I am trying to get output:

23474

from after /product/ in the sample above.

I’m trying to use Awk, but I can’t figure out the regular expression that’s required for this. I’m sure it’s simpler than I think, but I’m quite new to this!

The pattern is the following:

RANDOMSTUFF/GET /product/XXXXX/MORERANDOMSTUFF

and I’m trying to grab XXXXX. I don’t think I can use just the ‘/’ since there will be other slashes in the line.

I’ve tried

awk 'BEGIN{FS="[GET \/product\/]"}{print $2}'

to try to use GET /product/ as a field separator, and then grab the next item. But I’ve realized this won’t work (even if I got the regular expression right, which I didn’t), since there might not be whitespace after the product ID I want to grab.

Answer

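The square brackets you put around the FS value are incorrect here: they turn it into a bracket expression, which matches any single one of the listed characters rather than the literal string. Once you fix that, the remaining problem is that you simply have two fields, because setting FS overrides the splitting on whitespace which Awk normally does.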
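
To see what the bracketed FS actually does, here is a small sketch. The helper name count_fields and the shortened input string are made up for illustration, and the backslashes from the question are left out because they are not needed inside an Awk string.

```shell
# Hypothetical helper: split one line with the bracketed FS from the question
# and report the number of fields plus the contents of $2.
count_fields() {
  printf '%s\n' "$1" |
    awk 'BEGIN { FS = "[GET /product/]" } { printf "%d [%s]\n", NF, $2 }'
}

# Made-up, shortened input. Every single G, E, T, space, slash, p, r, o, d, u, c
# and t is treated as its own separator.
count_fields 'x1 GET /product/23474/z'
```

This should print 16 [], that is, sixteen fields with an empty $2, which is why print $2 produced nothing useful.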

Because the (horrible) date format always has exactly two slashes, I think you can actually do

awk -F / '/product/ { print $5 }' filename

Even though this splits the earlier part of the line into rather odd pieces, the path components after GET or PUT will always be $4, $5, etc.
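
If you want to check the field numbering yourself, something like the following should do. The show_fields helper is a hypothetical name, and the sample line is a shortened copy of the one in the question (the percent-encoded part of the URL and the trailing user agent are trimmed).

```shell
# Shortened copy of the sample log line from the question.
line='207.46.13.9 - - [22/Jan/2019:08:02:29 +0330] "GET /product/23474/Remington-Laser-Hair-Removal-IPL6250 HTTP/1.1" 200 41766'

# Hypothetical helper: print the first five slash-separated fields with their numbers.
show_fields() {
  printf '%s\n' "$1" |
    awk -F / '{ for (i = 1; i <= 5; i++) printf "$%d = <%s>\n", i, $i }'
}

show_fields "$line"
```

The two slashes in the date use up $1 to $3, so product lands in $4 and the ID in $5.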

If you wanted to keep your original idea, maybe try

awk 'BEGIN {FS="GET /product/"}
  NF==2{
    # second field is now everything after /product/ -- split on slash
    split($2, f, "/")
    print f[1] }' file

… or very simply, brutally remove everything except the text you want:

awk '/\/product\// { sub(".*/product/", ""); sub("/.*", ""); print }' file

which might be better expressed as a simple sed script:

sed -n 's%.*GET /product/\([^/]*\)/.*%\1%p' file
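
As a further variation, not one of the commands above, you could leave FS alone and use Awk's match() function, which sets RSTART and RLENGTH. This is only a sketch: product_id is a made-up helper name, and it assumes the ID ends at the next slash or space, so it should also cope with an ID that is not followed by a slash.

```shell
# Hypothetical helper: print the ID that follows "GET /product/" on each input line.
# Reads the files given as arguments, or standard input when there are none.
product_id() {
  awk 'match($0, "GET /product/[^/ ]+") {
    # "GET /product/" is 13 characters long; skip it and keep the rest of the match.
    print substr($0, RSTART + 13, RLENGTH - 13)
  }' "$@"
}

# Made-up sample lines: one with a path after the ID, one without.
printf '%s\n' \
  '1.2.3.4 - - [22/Jan/2019:08:02:29 +0330] "GET /product/23474/foo HTTP/1.1" 200 1' \
  '1.2.3.4 - - [22/Jan/2019:08:02:30 +0330] "GET /product/555 HTTP/1.1" 200 1' |
  product_id
```

Lines without GET /product/ are skipped, because match() returns 0 for them and the action never runs.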
User contributions licensed under: CC BY-SA