I would like to slice a huge JSON file (~20 GB) into smaller chunks of data based on array size (10000, 50000, etc.).
Input:
{"recDt":"2021-01-05", "country":"US", "name":"ABC", "number":"9828", "add": [ {"evnCd":"O","rngNum":"1","state":"TX","city":"ANDERSON","postal":"77830"}, {"evnCd":"O","rngNum":"2","state":"TX","city":"ANDERSON","postal":"77832"}, {"evnCd":"O","rngNum":"3","state":"TX","city":"ANDERSON","postal":"77831"}, {"evnCd":"O","rngNum":"4","state":"TX","city":"ANDERSON","postal":"77834"} ] }
I am currently running jq in a loop, incrementing the x/y values to get the desired output, but performance is very slow: each iteration takes 8-20 seconds, depending on the size of the file. I am using jq 1.6. Is there an alternative way to get the result below?
Expected Output: for Slice of 2 objects in array
{"recDt":"2021-01-05","country":"US","name":"ABC","number":"9828","add":[{"rngNum":"1","state":"TX","city":"ANDERSON","postal":"77830"},{"rngNum":"2","state":"TX","city":"ANDERSON","postal":"77832"}]}
{"recDt":"2021-01-05","country":"US","name":"ABC","number":"9828","add":[{"rngNum":"3","state":"TX","city":"ANDERSON","postal":"77831"},{"rngNum":"4","state":"TX","city":"ANDERSON","postal":"77834"}]}
Tried with
cat $inFile | jq -cn --stream 'fromstream(1|truncate_stream(inputs))' | jq --arg x $x --arg y $y -c '{recDt: .recDt, country: .country, name: .name, number: .number, add: .add[$x|tonumber:$y|tonumber]}' >> $outFile
cat $inFile | jq --arg x $x --arg y $y -c '{recDt: .recDt, country: .country, name: .name, number: .number, add: .add[$x|tonumber:$y|tonumber]}' >> $outFile
Please share any alternatives that are available.
Answer
In this response, which calls jq just once, I’m going to assume your computer has enough memory to read the entire JSON. I’ll also assume you want to create separate files for each slice, and that you want the JSON to be pretty-printed in each file.
Assuming a chunk size of 2, and that the output files are to be named using the template part-N.json, you could write:
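< input.json jq -r --argjson size 2 '
  # keep everything except the big array
  del(.add) as $object
  # for each chunk of .add, emit a tab separator line, then the chunk object
  | (.add|_nwise($size) | ("\t", $object + {add:.} ))
' | awk '
  # a line starting with a tab marks the start of the next output file
  /^\t/ {fn++; next}
  { print >> ("part-" fn ".json") }'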
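The trick being used here is that jq's JSON output never contains a literal tab character (inside a JSON string a tab is always escaped as \t, and the default indentation uses spaces), so a line that begins with a tab can only be the separator emitted between chunks.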
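For readers who find the jq filter terse, the same slicing logic can be sketched in Python. This is only an illustration, not part of the jq answer: the function name write_slices and its parameters are hypothetical, and it makes the same assumption as above, namely that the whole JSON document fits in memory.

```python
import json
from pathlib import Path


def write_slices(record, size, out_dir=".", key="add", template="part-{}.json"):
    """Split record[key] into chunks of `size` and write one pretty-printed
    JSON file per chunk. Returns the list of paths written.

    Hypothetical helper for illustration; assumes `record` is already
    fully loaded in memory.
    """
    # Everything except the big array (the role of `del(.add) as $object`).
    base = {k: v for k, v in record.items() if k != key}
    items = record[key]
    paths = []
    # Walk the array in steps of `size` (the role of `_nwise($size)`).
    for n, start in enumerate(range(0, len(items), size), start=1):
        chunk = dict(base)
        chunk[key] = items[start:start + size]
        path = Path(out_dir) / template.format(n)
        # Note: this overwrites an existing file, whereas awk's >> appends.
        path.write_text(json.dumps(chunk, indent=2) + "\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Assumes the input is in input.json, as in the jq command above.
    with open("input.json") as fh:
        write_slices(json.load(fh), size=2)
```

With the sample input from the question and size=2, this writes part-1.json and part-2.json, each holding the top-level fields plus two entries of add.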