I have multiple HTTP headers in one giant file, each group separated by one empty line:
Host
Connection
Accept
From
User-Agent
Accept-Encoding

Host
Connection
Accept
From
User-Agent
Accept-Encoding
X-Forwarded-For

cookie
Cache-Control
referer
x-fb-sim-hni
Host
Accept
user-agent
x-fb-net-sid
x-fb-net-hni
X-Purpose
accept-encoding
x-fb-http-engine
Connection

User-Agent
Host
Connection
Accept-Encoding
I have approximately 10,000,000 of these header blocks, each separated by an empty line. To discover trends, like header order, I want to aggregate each block into a one-liner (how can I join the lines of each block, up to the empty line, and do that separately for every block?):
Host,Connection,Accept,From,User-Agent,Accept-Encoding
and follow with uniq -c | sort -nk1 (the full pipeline is sketched after the example output below), so I could receive:
197897 Host,Connection,Accept,From,User-Agent,Accept-Encoding
8732233 User-Agent,Host,Connection,Accept-Encoding
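For illustration, the whole pipeline could look something like this (just a sketch; bigfile.txt is a placeholder name, and since uniq -c only counts adjacent duplicates, a plain sort has to come before it):

$ awk 'BEGIN { RS=""; FS="\n"; OFS="," }   # RS="" is paragraph mode: one record per blank-line-separated block
       { $1 = $1; print }                  # reassigning a field rebuilds $0 with OFS, joining the lines with commas
' bigfile.txt | sort | uniq -c | sort -nk1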
What would be the best and most effective approach to parse that massive file and get that data?
Thanks for any hints.
Answer
Using your 1.5milGETs.txt file (and converting the triple \n\n\n to \n\n to separate blocks) you can use ruby in paragraph mode:
$ ruby -F'\n' -lane 'BEGIN{h=Hash.new(0); $/=""    # $/="" turns on paragraph mode; -F"\n" splits each block into lines
def commafy(n)
  n.to_s.reverse.gsub(/...(?=.)/,"\\&,").reverse   # insert thousands separators into a count
end
}
h[$F.join(",")]+=1       # one comma-joined header sequence per block
# p $_                   # uncomment to inspect each block
END{ printf "Total blocks: %s\n", commafy(h.values.sum)
     h2=h.sort_by {|k,v| -v}
     h2[0..10].map {|k,v| printf "%10s %s\n", commafy(v), k}
}' 1.5milGETs.txt
That counts every block, sorts the counts from largest to smallest, and prints the top 11 entries (the slice h2[0..10]):
Total blocks: 1,262,522
71,639 Host,Accept,User-Agent,Pragma,Connection
70,975 Host,ros-SecurityFlags,ros-SessionTicket,ros-Challenge,ros-HeadersHmac,Scs-Ticket,If-Modified-Since,User-Agent
40,781 Host,Accept,User-Agent,Pragma,nnCoection,Connection,X-Forwarded-For
35,485 Accept,ros-SecurityFlags,ros-SessionTicket,ros-Challenge,ros-HeadersHmac,Scs-Ticket,If-Modified-Since,User-Agent,Accept-Language,UA-CPU,Accept-Encoding,Host,Connection
34,005 User-Agent,Host,Connection,Accept-Encoding
30,668 Host,User-Agent,Accept-Encoding,Connection
25,547 Host,Accept,Accept-Language,Connection,Accept-Encoding,User-Agent
22,581 Host,User-Agent,Accept,Accept-Encoding
19,311 Host,Connection,Accept,From,User-Agent,Accept-Encoding
14,694 Host,Connection,User-Agent,Accept,Referer,Accept-Encoding,Accept-Language,Cookie
12,290 Host,User-Agent,Accept-Encoding
That takes about 8 seconds on a 6-year-old Mac.
Awk will be 3x faster and entirely appropriate for this job.
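For comparison, here is a sketch (untested) of the same count done entirely in awk, again using paragraph mode:

$ awk 'BEGIN { RS=""; FS="\n"; OFS="," }        # one record per blank-line-separated block, one line per field
       { $1 = $1; count[$0]++ }                 # rebuild $0 with commas, tally each distinct header sequence
       END { for (k in count) print count[k], k }
' 1.5milGETs.txt | sort -rn | head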
Ruby will give you more output options and easier analysis of the data. You can create interactive HTML documents; output JSON, quoted CSV, or XML trivially; interact with a database; invert keys and values in a statement; filter; etc.
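For example, the JSON case might look like this (a sketch, doing the same counting as above; JSON.pretty_generate is in the stdlib):

$ ruby -rjson -ne 'BEGIN { $/=""; h = Hash.new(0) }   # paragraph mode; counts default to 0
h[$_.split("\n").join(",")] += 1                      # comma-joined block is the hash key
END { puts JSON.pretty_generate(h) }' 1.5milGETs.txt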