I am looking for a way to use grep on a Linux server to find duplicate JSON records. Is it possible to have a grep that searches for duplicate ids in the example below?
So the grep would return: 01
{ "book": [ { "id": "01", "language": "Java", "edition": "third", "author": "Herbert Schildt" }, { "id": "02", "language": "Java", "edition": "third", "author": "Herbert Schildt" }, { "id": "03", "language": "Java", "edition": "third", "author": "Herbert Schildt" }, { "id": "01", "language": "Java", "edition": "third", "author": "Herbert Schildt" }, { "id": "04", "language": "C++", "edition": "second", "author": "E.Balagurusamy" } ] }
Answer
OK, discarding any whitespace from the JSON strings, I can offer this if awk is acceptable, with hutch being the formatted chunk of JSON above in a file.
I use tr to remove any whitespace and use , as the field separator in awk; I iterate over the elements of the resulting single long line with a for loop, do some pattern matching in awk to isolate the id fields, and increment an array entry for each matched id. At the end of processing I iterate over the array and print the ids that have more than one match.
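Unrolled for readability, that logic looks like this. This is just a sketch of the one-liner shown further down, not a different method; note that gensub() is a GNU awk extension, so gawk is assumed:

$ tr -d '[:space:]' <hutch | awk -F, '
{
    # walk every comma-separated field of the single squashed line
    for (i = 1; i <= NF; i++) {
        # only fields carrying an "id" key are of interest
        if ($i ~ /"id":/) {
            # keep just the digits of the id and bump its counter
            a[gensub(/^.*"id":"([0-9]+)"$/, "\\1", "1", $i)]++
        }
    }
}
END {
    # print every id that was seen more than once
    for (i in a)
        if (a[i] > 1)
            print i
}'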
Here is your data:
$ cat hutch
{
  "book": [
    { "id": "01", "language": "Java", "edition": "third", "author": "Herbert Schildt" },
    { "id": "02", "language": "Java", "edition": "third", "author": "Herbert Schildt" },
    { "id": "03", "language": "Java", "edition": "third", "author": "Herbert Schildt" },
    { "id": "01", "language": "Java", "edition": "third", "author": "Herbert Schildt" },
    { "id": "04", "language": "C++", "edition": "second", "author": "E.Balagurusamy" }
  ]
}
And here is the finding of the dupes:
$ tr -d '[:space:]' <hutch | awk -F, '{for(i=1;i<=NF;i++){if($i~/"id":/){a[gensub(/^.*"id":"([0-9]+)"$/, "\\1","1",$i)]++}}}END{for(i in a){if(a[i]>1){print i}}}'
01
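And since the question explicitly asked about grep: a grep-based pipeline can surface the duplicate ids as well. This is a sketch that assumes grep with -o support (GNU grep has it) and that each "id": "NN" pair sits on a single line of the formatted file:

$ grep -o '"id": *"[0-9]*"' hutch | sort | uniq -d | cut -d'"' -f4
01

grep -o prints each matching "id": "NN" fragment on its own line, sort groups identical fragments together, uniq -d keeps only those that occur more than once, and cut peels the digits out from between the quotes.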