Running statistics on multiple lines in bash

I have multiple HTTP headers in one giant file, each set of headers separated by one empty line.

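For example, a single block looks something like this (the values here are made up):

    Host: example.com
    Connection: keep-alive
    Accept: */*
    From: user@example.com
    User-Agent: curl/7.68.0
    Accept-Encoding: gzip, deflate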

I have approximately 10,000,000 headers separated by empty lines. To discover trends, like header order, I want to aggregate each block of headers into a one-liner (how can I aggregate the lines of a block up to the empty line, and do that separately for every block?):

Host,Connection,Accept,From,User-Agent,Accept-Encoding

and then follow with uniq -c | sort -nk1, so I could receive:

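Something of this shape (the counts below are made up, just to show what I am after):

      9 Host,Accept,User-Agent,Connection
    134 Host,User-Agent,Accept,Accept-Encoding,Connection
    697 Host,Connection,Accept,From,User-Agent,Accept-Encoding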

What would be the best and most efficient approach to parse that massive file and get that data?

Thanks for any hints.

Answer

Using your 1.5milGETs.txt file (and converting the triple \n\n\n to \n\n to separate blocks) you can use Ruby in paragraph mode:

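A minimal sketch of that approach (-00 is Ruby's paragraph mode; it assumes a header name is the text before the first colon on each line):

    ruby -00 -ne 'BEGIN { counts = Hash.new(0) }
      # reduce each paragraph (one block) to its comma-joined header names
      counts[$_.scan(/^([^:\s]+):/).join(",")] += 1
      END {
        puts "#{counts.values.sum} blocks"
        counts.sort_by { |_, n| -n }.first(10).each { |order, n| puts "#{n} #{order}" }
      }' 1.5milGETs.txt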

That prints the total number of blocks, sorts them from largest to smallest, and prints the top 10.

That takes about 8 seconds on a 6-year-old Mac.

Awk will be 3x faster and entirely appropriate for this job.
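A comparable awk sketch (RS="" is awk's paragraph mode; same assumption about header names), with sort and head doing the ranking:

    awk 'BEGIN { RS = ""; FS = "\n" }
    {
      order = sep = ""
      for (i = 1; i <= NF; i++)
        if ((p = index($i, ":")) > 0) {   # header name = text before the first ":"
          order = order sep substr($i, 1, p - 1)
          sep = ","
        }
      count[order]++
    }
    END { for (k in count) print count[k], k }' 1.5milGETs.txt | sort -rn | head -10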

Ruby will give you more output options and easier analysis of the data. You can create interactive HTML documents; output JSON, quoted CSV, or XML trivially; interact with a database; invert keys and values in a single statement; filter; etc.
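For example, swapping the END block in the Ruby sketch above for something like this (a hypothetical variation) would emit the top 10 as JSON instead of plain text:

    END {
      require "json"
      # top 10 header orders, descending by count, as a JSON object
      puts counts.sort_by { |_, n| -n }.first(10).to_h.to_json
    }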
