I have multiple HTTP headers in one giant file, separated with one empty line.
Host Connection Accept From User-Agent Accept-Encoding Host Connection Accept From User-Agent Accept-Encoding X-Forwarded-For cookie Cache-Control referer x-fb-sim-hni Host Accept user-agent x-fb-net-sid x-fb-net-hni X-Purpose accept-encoding x-fb-http-engine Connection User-Agent Host Connection Accept-Encoding
I have approximately 10,000,000 header blocks, each separated from the next by an empty line. To discover trends such as header order, I want to aggregate each block into a one-liner (how can I join the lines up to each empty line, and do that separately for every block?):
and follow with: uniq -c | sort -nk1
so I could receive:
197897 Host,Connection,Accept,From,User-Agent,Accept-Encoding
8732233 User-Agent,Host,Connection,Accept-Encoding
What would be the most efficient approach to parsing that massive file and getting that data?
Thanks for hints.
Using your 1.5milGETs.txt file (and converting the triple \n\n\n to \n\n to separate blocks), you can use ruby in paragraph mode:
$ ruby -F'\n' -lane 'BEGIN{h=Hash.new(0); $/=""
  def commafy(n) n.to_s.reverse.gsub(/...(?=.)/,"\\&,").reverse end
}
h[$F.join(",")]+=1
# p $_
END{
  printf "Total blocks: %s\n", commafy(h.values.sum)
  h2=h.sort_by {|k,v| -v}
  h2[0..10].map {|k,v| printf "%10s %s\n", commafy(v), k}
}' 1.5milGETs.txt
That prints the total number of blocks, sorts the header orders from most to least frequent, and prints the top 11 (h2[0..10] is an inclusive range).
Total blocks: 1,262,522
    71,639 Host,Accept,User-Agent,Pragma,Connection
    70,975 Host,ros-SecurityFlags,ros-SessionTicket,ros-Challenge,ros-HeadersHmac,Scs-Ticket,If-Modified-Since,User-Agent
    40,781 Host,Accept,User-Agent,Pragma,nnCoection,Connection,X-Forwarded-For
    35,485 Accept,ros-SecurityFlags,ros-SessionTicket,ros-Challenge,ros-HeadersHmac,Scs-Ticket,If-Modified-Since,User-Agent,Accept-Language,UA-CPU,Accept-Encoding,Host,Connection
    34,005 User-Agent,Host,Connection,Accept-Encoding
    30,668 Host,User-Agent,Accept-Encoding,Connection
    25,547 Host,Accept,Accept-Language,Connection,Accept-Encoding,User-Agent
    22,581 Host,User-Agent,Accept,Accept-Encoding
    19,311 Host,Connection,Accept,From,User-Agent,Accept-Encoding
    14,694 Host,Connection,User-Agent,Accept,Referer,Accept-Encoding,Accept-Language,Cookie
    12,290 Host,User-Agent,Accept-Encoding
That takes about 8 seconds on a 6-year-old Mac.
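If the one-liner is hard to read, the same paragraph-mode counting can be sketched as a small script. This is only an illustration: the helper names count_header_orders and top_orders are made up here, and the sample input is a tiny invented example built from the header orders in the question, not real data.

```ruby
require "stringio"

# Count how often each header order occurs.
# (count_header_orders is a hypothetical helper name.)
def count_header_orders(io)
  counts = Hash.new(0)
  # An empty-string separator puts each_line into paragraph mode:
  # it yields one blank-line-terminated block at a time.
  io.each_line("") do |block|
    names = block.split("\n").map(&:strip).reject(&:empty?)
    counts[names.join(",")] += 1 unless names.empty?
  end
  counts
end

# Return the n most frequent orders as [order, count] pairs.
# (top_orders is a hypothetical helper name.)
def top_orders(counts, n = 10)
  counts.sort_by { |_order, count| -count }.first(n)
end

# Tiny invented sample; for a real file, pass File.open(path) instead.
sample = StringIO.new(<<~HEADERS)
  Host
  Connection
  Accept
  From
  User-Agent
  Accept-Encoding

  User-Agent
  Host
  Connection
  Accept-Encoding

  User-Agent
  Host
  Connection
  Accept-Encoding
HEADERS

counts = count_header_orders(sample)
puts "Total blocks: #{counts.values.sum}"
top_orders(counts).each { |order, count| printf("%10d %s\n", count, order) }
```

Because each_line streams one block at a time, memory use depends on the number of distinct header orders rather than on the size of the file.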
Awk will be 3x faster and entirely appropriate for this job.
Ruby will give you more output options and easier analysis of the data. You can create interactive HTML documents; output JSON, quoted CSV, or XML trivially; interact with a database; invert keys and values in a single statement; filter; etc.
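As a rough sketch of a few of those options, assuming the counts are already in a hash: the three entries below are copied from the output above, and the helper names to_json_report and starting_with are invented for this example.

```ruby
require "json"

# Three entries copied from the sample output above; in practice this
# hash would come from the counting step.
counts = {
  "Host,Accept,User-Agent,Pragma,Connection" => 71_639,
  "User-Agent,Host,Connection,Accept-Encoding" => 34_005,
  "Host,User-Agent,Accept-Encoding" => 12_290
}

# JSON output: one object per header order, with the order as an array.
# (to_json_report is a hypothetical helper name.)
def to_json_report(counts)
  rows = counts.map do |order, count|
    { "headers" => order.split(","), "count" => count }
  end
  JSON.pretty_generate(rows)
end

# Filter: keep only the orders whose first header matches.
# (starting_with is a hypothetical helper name.)
def starting_with(counts, header)
  counts.select { |order, _count| order.split(",").first == header }
end

puts to_json_report(counts)

# Invert keys and values in one statement (count => order).
# Note that orders sharing the same count would collapse to a single entry.
p counts.invert

p starting_with(counts, "Host")
```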