
Split/Slice large JSON using jq

I would like to slice a huge JSON file (~20 GB) into smaller chunks based on array size (10000, 50000, etc.).

Input:

{"recDt":"2021-01-05",
 "country":"US",
 "name":"ABC",
 "number":"9828",
 "add": [
     {"evnCd":"O","rngNum":"1","state":"TX","city":"ANDERSON","postal":"77830"},
     {"evnCd":"O","rngNum":"2","state":"TX","city":"ANDERSON","postal":"77832"},
     {"evnCd":"O","rngNum":"3","state":"TX","city":"ANDERSON","postal":"77831"},
     {"evnCd":"O","rngNum":"4","state":"TX","city":"ANDERSON","postal":"77834"}
 ]
}

I currently run a loop that increments the x/y values to get the desired output, but it is very slow: each iteration of the split takes 8-20 seconds, depending on the size of the file. I am using jq 1.6. Are there any alternatives for getting the result below?

Expected output for slices of 2 objects in the array:

{"recDt":"2021-01-05","country":"US","name":"ABC","number":"9828","add":[{"rngNum":"1","state":"TX","city":"ANDERSON","postal":"77830"},{"rngNum":"2","state":"TX","city":"ANDERSON","postal":"77832"}]}
{"recDt":"2021-01-05","country":"US","name":"ABC","number":"9828","add":[{"rngNum":"3","state":"TX","city":"ANDERSON","postal":"77831"},{"rngNum":"4","state":"TX","city":"ANDERSON","postal":"77834"}]}

Tried with

cat $inFile | jq -cn --stream 'fromstream(1|truncate_stream(inputs))' | jq --arg x $x --arg y $y -c '{recDt: .recDt, country: .country, name: .name, number: .number, add: .add[$x|tonumber:$y|tonumber]}' >> $outFile

cat $inFile | jq --arg x $x --arg y $y -c '{recDt: .recDt, country: .country, name: .name, number: .number, add: .add[$x|tonumber:$y|tonumber]}' >> $outFile  

Please share any alternatives that are available.


Answer

In this response, which calls jq just once, I’m going to assume your computer has enough memory to read the entire JSON. I’ll also assume you want to create separate files for each slice, and that you want the JSON to be pretty-printed in each file.

Assuming a chunk size of 2, and that the output files are to be named using the template part-N.json, you could write:

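< input.json jq -r --argjson size 2 '
  # Save everything except .add, then emit a tab line before each chunk.
  del(.add) as $object
  | (.add | _nwise($size) | ("\t", $object + {add: .}))
' | awk '
      # A line starting with a tab marks the start of the next output file.
      /^\t/ { fn++; next }
      { print >> ("part-" fn ".json") }'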

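The trick being used here is that no line of jq’s pretty-printed output can begin with a tab character: jq indents with spaces by default, and a tab inside a JSON string is always written as the escape \t. A line consisting of a single tab is therefore a safe separator between chunks.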
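
If one compact JSON object per line is wanted instead, as in the question's expected output, the awk step can probably be dropped altogether. The following is only a minimal sketch under the same assumption that the whole document fits in memory. The function name split_add is hypothetical, and _nwise is an undocumented jq builtin, so its behavior may differ between jq versions. Unlike the expected output shown in the question, this sketch keeps every key of the array elements, including evnCd.

```shell
# split_add FILE SIZE  (hypothetical helper name, not part of jq)
# Prints one compact JSON object per chunk of SIZE elements of .add,
# repeating all other top-level keys in every object.
split_add() {
  jq -c --argjson size "$2" '
    del(.add) as $object      # everything except the big array
    | .add
    | _nwise($size)           # undocumented builtin: arrays of at most $size items
    | $object + {add: .}      # re-attach one chunk per output object
  ' "$1"
}
```

Calling split_add input.json 2 >> "$outFile" would then replace the loop over x/y values with a single jq invocation.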

User contributions licensed under: CC BY-SA