Using sed/awk to extract multiple strings from each line

I have a file that contains 30 million lines (so a big file).

On each line I have this kind of data:

"title": "some title" (SOME RANDOM DATA) "rank": "1,292,064"

I need to extract both the title value and the rank value so:

some title:1,292,064

Little help? 🙂 I have tried my little heart out and nothing, can only extract one piece of data from each line

Answer

Unless the quoted values can contain escaped quotes or other tricky content like that, I would try this sed command to filter your big file:

sed 's/^"[^"]*": "\([^"]*\)".*"\(.*\)"$/\1:\2/'

Basically, you capture two groups, \1 and \2, containing the fields you want, and you print them separated by a colon.
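
To check the capture groups before running this on 30 million lines, here is a minimal sketch that feeds the sample line from the question through the command. The variable names line and out are only for illustration.

```shell
# Sample line in the format shown in the question.
line='"title": "some title" (SOME RANDOM DATA) "rank": "1,292,064"'

# \( ... \) capture the two fields; \1 and \2 reproduce them in the replacement.
out=$(printf '%s\n' "$line" | sed 's/^"[^"]*": "\([^"]*\)".*"\(.*\)"$/\1:\2/')

# Prints: some title:1,292,064
printf '%s\n' "$out"
```

The first .* is greedy, so the second group ends up holding the last quoted value on the line, which is the rank in this sample.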

If the string title appears literally, the regex passed as an argument to sed is less ugly:

sed 's/^"title": "\([^"]*\)".*"\(.*\)"$/\1:\2/'

Even safer, for avoiding side effects from the random data:

sed 's/^"title": "\([^"]*\)".*"rank": "\(.*\)"$/\1:\2/'
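
One thing to keep in mind: sed prints lines that do not match unchanged. If some lines in the file might not follow the expected format (an assumption, since the question only shows one sample), a possible variant uses -n together with the p flag so that only lines where the substitution succeeded are printed. The input below is made up for illustration.

```shell
# Two sample lines: one well-formed (with extra quoted noise in the middle),
# and one that does not follow the expected format.
input='"title": "some title" "foo": "bar" "rank": "1,292,064"
not a matching line'

# -n suppresses the default output; the trailing p prints a line only
# when the substitution actually happened on it.
out=$(printf '%s\n' "$input" |
  sed -n 's/^"title": "\([^"]*\)".*"rank": "\(.*\)"$/\1:\2/p')

# Prints: some title:1,292,064
printf '%s\n' "$out"
```

On the real file, the same command can take the file name as an argument and redirect to a new file, for example sed -n '...' big_file > extracted.txt, where both file names are placeholders.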