Running statistics on multiple lines in bash

I have many HTTP header blocks in one giant file, each block separated by a single empty line.

Host
Connection
Accept
From
User-Agent
Accept-Encoding

Host
Connection
Accept
From
User-Agent
Accept-Encoding
X-Forwarded-For

cookie
Cache-Control
referer
x-fb-sim-hni
Host
Accept
user-agent
x-fb-net-sid
x-fb-net-hni
X-Purpose
accept-encoding
x-fb-http-engine
Connection

User-Agent
Host
Connection
Accept-Encoding

I have approximately 10,000,000 header blocks, each separated by an empty line. To discover trends such as header order, I want to collapse each block into a single comma-separated line (how can I aggregate the lines up to each empty line, and do that separately for every block?), like this:

Host,Connection,Accept,From,User-Agent,Accept-Encoding

and then follow with uniq -c|sort -nk1, so I would get something like:

197897 Host,Connection,Accept,From,User-Agent,Accept-Encoding
8732233 User-Agent,Host,Connection,Accept-Encoding
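
To be concrete, something like this rough sketch is what I am after (headers.txt stands in for my real file; I am not sure it scales to 10,000,000 blocks, and I put a sort in front of uniq -c since uniq only collapses adjacent lines):

# RS='' reads blank-line-separated blocks; FS='\n' makes each header line a field;
# assigning $1 = $1 rebuilds the block with OFS (a comma) between the lines
awk -v RS='' -v FS='\n' -v OFS=',' '{ $1 = $1; print }' headers.txt |
  sort | uniq -c | sort -nk1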

What would be the most effective approach to parsing that massive file and getting that data?

Thanks for any hints.


Answer

Using your 1.5milGETs.txt file (and converting the triple \n\n\n to \n\n so the blocks are separated by single blank lines), you can use Ruby in paragraph mode:

$ ruby -F'\n' -lane 'BEGIN{ h = Hash.new(0)
                            $/ = ""                       # paragraph mode: a record is one blank-line-separated block
                            def commafy(n)                 # 1262522 -> "1,262,522"
                              n.to_s.reverse.gsub(/...(?=.)/, "\\&,").reverse
                            end }
                     h[$F.join(",")] += 1                  # comma-join the header names of this block and count it
                     # p $_
                     END{ printf "Total blocks: %s\n", commafy(h.values.sum)
                          h2 = h.sort_by {|k,v| -v}        # sort counts large -> small
                          h2[0..10].map {|k,v| printf "%10s %s\n", commafy(v), k} }' 1.5milGETs.txt

That prints the total number of blocks, sorts them from largest to smallest, and prints the top entries (h2[0..10]).

Prints:

Total blocks: 1,262,522
    71,639 Host,Accept,User-Agent,Pragma,Connection
    70,975 Host,ros-SecurityFlags,ros-SessionTicket,ros-Challenge,ros-HeadersHmac,Scs-Ticket,If-Modified-Since,User-Agent
    40,781 Host,Accept,User-Agent,Pragma,nnCoection,Connection,X-Forwarded-For
    35,485 Accept,ros-SecurityFlags,ros-SessionTicket,ros-Challenge,ros-HeadersHmac,Scs-Ticket,If-Modified-Since,User-Agent,Accept-Language,UA-CPU,Accept-Encoding,Host,Connection
    34,005 User-Agent,Host,Connection,Accept-Encoding
    30,668 Host,User-Agent,Accept-Encoding,Connection
    25,547 Host,Accept,Accept-Language,Connection,Accept-Encoding,User-Agent
    22,581 Host,User-Agent,Accept,Accept-Encoding
    19,311 Host,Connection,Accept,From,User-Agent,Accept-Encoding
    14,694 Host,Connection,User-Agent,Accept,Referer,Accept-Encoding,Accept-Language,Cookie
    12,290 Host,User-Agent,Accept-Encoding

That takes about 8 seconds on a 6-year-old Mac.

Awk will be roughly 3x faster and is entirely appropriate for this job.
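
For example, a minimal awk sketch of the same counting idea (an untested outline, assuming a POSIX awk with paragraph mode; the final ranking is handed off to sort and head):

awk -v RS='' -v FS='\n' -v OFS=',' '
  { $1 = $1; count[$0]++ }        # comma-join each blank-line-separated block and tally it
  END { for (sig in count) printf "%d %s\n", count[sig], sig }
' 1.5milGETs.txt | sort -rn | head -n 11

Counting in an associative array keeps only one entry per distinct header order in memory, so only the aggregated counts ever reach sort.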

Ruby will give you more output options and make analysis of the data easier. You can create interactive HTML documents; output JSON, quoted CSV, or XML trivially; interact with a database; invert keys and values in a single statement; filter; and so on.

User contributions licensed under: CC BY-SA