I have multiple HTTP header blocks in one giant file, each separated by one empty line. For example:
Host
Connection
Accept
From
User-Agent
Accept-Encoding

Host
Connection
Accept
From
User-Agent
Accept-Encoding
X-Forwarded-For
cookie
Cache-Control
referer
x-fb-sim-hni

Host
Accept
user-agent
x-fb-net-sid
x-fb-net-hni
X-Purpose
accept-encoding
x-fb-http-engine
Connection

User-Agent
Host
Connection
Accept-Encoding
I have approximately 10,000,000 such headers, separated by empty lines. To discover trends, like header order, I'd like to aggregate each block into a one-liner (how can I aggregate the lines up to an empty line, doing that separately for every block?):
Host,Connection,Accept,From,User-Agent,Accept-Encoding
and then follow with: uniq -c | sort -nk1
so I could receive:
 197897 Host,Connection,Accept,From,User-Agent,Accept-Encoding
8732233 User-Agent,Host,Connection,Accept-Encoding
What would be the best and most efficient approach to parsing that massive file and getting this data?
Thanks for any hints.
Answer
Using your 1.5milGETs.txt file (and converting the triple \n\n\n to \n\n to separate blocks) you can use ruby in paragraph mode:
$ ruby -F'\n' -lane 'BEGIN{ h=Hash.new(0)  # counts per header order, default 0
    $/=""                                  # paragraph mode: each record is a blank-line separated block
    def commafy(n)                         # 1262522 -> "1,262,522"
        n.to_s.reverse.gsub(/...(?=.)/,"\\&,").reverse
    end
}
h[$F.join(",")]+=1                         # $F holds the lines of one block; join them into one key
# p $_
END{
    printf "Total blocks: %s\n", commafy(h.values.sum)
    h2=h.sort_by {|k,v| -v}                # sort large -> small
    h2[0..10].map {|k,v| printf "%10s %s\n", commafy(v), k}
}' 1.5milGETs.txt
That prints the total number of blocks, sorts them large->small, and prints the top 11 (the h2[0..10] slice).
Prints:
Total blocks: 1,262,522
    71,639 Host,Accept,User-Agent,Pragma,Connection
    70,975 Host,ros-SecurityFlags,ros-SessionTicket,ros-Challenge,ros-HeadersHmac,Scs-Ticket,If-Modified-Since,User-Agent
    40,781 Host,Accept,User-Agent,Pragma,nnCoection,Connection,X-Forwarded-For
    35,485 Accept,ros-SecurityFlags,ros-SessionTicket,ros-Challenge,ros-HeadersHmac,Scs-Ticket,If-Modified-Since,User-Agent,Accept-Language,UA-CPU,Accept-Encoding,Host,Connection
    34,005 User-Agent,Host,Connection,Accept-Encoding
    30,668 Host,User-Agent,Accept-Encoding,Connection
    25,547 Host,Accept,Accept-Language,Connection,Accept-Encoding,User-Agent
    22,581 Host,User-Agent,Accept,Accept-Encoding
    19,311 Host,Connection,Accept,From,User-Agent,Accept-Encoding
    14,694 Host,Connection,User-Agent,Accept,Referer,Accept-Encoding,Accept-Language,Cookie
    12,290 Host,User-Agent,Accept-Encoding
That takes about 8 seconds on a six-year-old Mac.
Awk will be roughly 3x faster and is entirely appropriate for this job.
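If you go that route, here is a minimal awk sketch of the same counting approach (my own, not timed against the file above), using POSIX paragraph mode (RS=""):

$ awk 'BEGIN{ RS=""; FS="\n"; OFS="," }  # paragraph mode: each record is one blank-line separated block
       { $1=$1; cnt[$0]++ }              # rebuild the record with commas between header names, count it
       END{ for (k in cnt) printf "%10d %s\n", cnt[k], k }' 1.5milGETs.txt | sort -rn | head

The $1=$1 assignment forces awk to rejoin the fields using OFS, turning each block into the comma-separated key; sort -rn | head then gives the most common orders.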
Ruby will give you more output options and easier analysis of the data. You can create interactive HTML documents; trivially output JSON, quoted CSV, or XML; interact with a database; invert keys and values in a single statement; filter; etc.
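For example, a sketch of the JSON case along the same lines (the same counting core as above; the first(10) cutoff is my own choice):

$ ruby -rjson -F'\n' -lane 'BEGIN{ h=Hash.new(0); $/="" }
h[$F.join(",")]+=1
END{ puts h.sort_by {|k,v| -v}.first(10).to_h.to_json }' 1.5milGETs.txt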