I have a large file of user agent strings, and I want to extract one particular section of each request.
For input:
207.46.13.9 - - [22/Jan/2019:08:02:29 +0330] "GET /product/23474/%D9%84%DB%8C%D8%B2%D8%B1-%D8%A8%D8%AF%D9%86-%D8%AE%D8%A7%D9%86%DA%AF%DB%8C-%D8%B1%D9%85%DB%8C%D9%86%DA%AF%D8%AA%D9%88%D9%86-%D9%85%D8%AF%D9%84-Remington-Laser-Hair-Removal-IPL6250 HTTP/1.1" 200 41766 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
I am trying to get output:
23474
This is the segment that comes right after /product/ in the sample above.
I’m trying to use Awk, but I can’t figure out the regular expression that’s required for this. I’m sure it’s simpler than I think, but I’m quite new to this!
The pattern is the following:
RANDOMSTUFF/GET /product/XXXXX/MORERANDOMSTUFF
and I’m trying to grab XXXXX. I don’t think I can use just the ‘/’ since there will be other slashes in the line.
I’ve tried
awk 'BEGIN{FS="[GET \/product\/]"}{print $2}'
to use GET /product/ as the field separator and then grab the next item. But I’ve realized this won’t work (even if I had got the regular expression right, which I didn’t), since there might not be whitespace after the product ID I want to grab.
Answer
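The square brackets you put around the FS value are incorrect here: they turn it into a bracket expression, which matches any single one of the listed characters rather than the whole text. Once you fix that, the remaining problem is that you simply have two fields, because you are overriding the whitespace splitting that Awk does by default.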
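To make the difference concrete, here is a small sketch run against a shortened copy of the sample line. The `some-name` URL tail is a placeholder I made up to keep the line readable, and the backslashes from the question are left out of the FS value because they are not needed inside an Awk string.

```shell
# Shortened copy of the sample log line; "some-name" is a placeholder for the long URL tail.
line='207.46.13.9 - - [22/Jan/2019:08:02:29 +0330] "GET /product/23474/some-name HTTP/1.1" 200 41766 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"'

# With brackets, FS is a character set: the line is cut at every single
# G, E, T, space, slash, p, r, o, d, u, c or t, so $2 is just the first "-".
bracketed=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "[GET /product/]" } { print $2 }')

# Without brackets, FS is the literal text "GET /product/", which occurs once
# in this line, so the line is cut into exactly two fields.
nfields=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "GET /product/" } { print NF }')

echo "bracketed \$2: $bracketed"
echo "fields without brackets: $nfields"
```

The second field in the unbracketed case still contains everything after the ID, which is why a further step is needed to isolate it.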
Because the (horrible) date format always has exactly two slashes, I think you can actually do
awk -F / '/product/ { print $5 }' filename
Even though it divides the earlier part of the line into quite weird parts, the things after GET or PUT will always be $4, $5, etc.
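If the field numbering seems surprising, this sketch prints the first five slash-separated fields of a shortened copy of the sample line (the `some-name` tail is a placeholder, not the real URL). The stricter `$4 == "product"` test at the end is an assumption added on top of the command above: it would skip lines where "product" only appears somewhere else, such as in a referrer.

```shell
# Shortened copy of the sample log line; "some-name" is a placeholder for the long URL tail.
line='207.46.13.9 - - [22/Jan/2019:08:02:29 +0330] "GET /product/23474/some-name HTTP/1.1" 200 41766 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"'

# Show how -F / numbers the pieces of the line.
printf '%s\n' "$line" | awk -F / '{ for (i = 1; i <= 5; i++) printf "$%d = <%s>\n", i, $i }'
# $1 = <207.46.13.9 - - [22>
# $2 = <Jan>
# $3 = <2019:08:02:29 +0330] "GET >
# $4 = <product>
# $5 = <23474>

# Stricter variant (an assumption, not the original command):
# only print when the fourth field is exactly "product".
id=$(printf '%s\n' "$line" | awk -F / '$4 == "product" { print $5 }')
echo "$id"
```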
If you wanted to keep your original idea, maybe try
awk 'BEGIN {FS="GET /product/"}
  NF==2{ # second field is now everything after /product/ -- split on slash
    split($2, f, "/")
    print f[1] }' file
… or very simply, brutally remove everything except the text you want;
awk '/\/product\// { sub(".*/product/", ""); sub("/.*", ""); print }' file
which might be better expressed as a simple sed script;
sed -n 's%.*GET /product/\([^/]*\)/.*%\1%p' file
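The sed command expects a slash after the ID. If some requests end the path right at the ID, as the question worries they might, one possible sketch is to let Awk's match() find the ID and stop at the next slash or space. Both sample lines below are made up for illustration, and this variant does not try to handle other terminators such as a query string.

```shell
# Two made-up request lines: one with more path after the ID, one where the ID ends the path.
lines='1.2.3.4 - - [22/Jan/2019:08:02:29 +0330] "GET /product/23474/some-name HTTP/1.1" 200 41766
1.2.3.4 - - [22/Jan/2019:08:02:30 +0330] "GET /product/99 HTTP/1.1" 200 512'

ids=$(printf '%s\n' "$lines" | awk '
  # Match "GET /product/" plus the run of characters up to the next slash or space.
  match($0, "GET /product/[^/ ]+") {
    prefix = length("GET /product/")   # number of characters to skip before the ID
    print substr($0, RSTART + prefix, RLENGTH - prefix)
  }')

echo "$ids"
```

match() sets RSTART and RLENGTH for the matched text, so the ID is whatever remains once the fixed prefix is skipped.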