Grouping common elements using awk

Question

The following table illustrates a brief snapshot of the data that I wish to manipulate. I am looking for an awk script that will group similar elements into one group. For eg. if you look at the table below: Numbers (1,2,3,4,6) should all belong to one group. So row1 row2 row4 row8 will be group &#8220;1&#822…

Accepted Answer

This problem seemed almost same as one of my problems and i had used one column in your example to solve my problem 🙂 So&#8230;[[bash_prompt$]]$ cat log ; echo "########"; > cat test.sh ;echo "########";  awk -f test.sh logheading1        heading2         numberlist     groupname1           text             1,2,3name2           text             2name3           text             9name4           text             1,4name5           text             5,7name6           text             7name7           text             8name8           text             6,2########/^name/{  i=0; j=0;  split($3,a,",");  for(var in a) {    for(var1 in q) {      split(q[var1],r,",");      for(var2 in r) {        if(r[var2] == a[var]) {          i=1;          j=((var1+1));        }      }    }  }    if(i == 0) {      q[length(q)] = $3;      j=length(q);    }  print $1 "tt" $2 "   tt" $3 "tt" j;}########name1           text            1,2,3           1name2           text            2               1name3           text            9               2name4           text            1,4             1name5           text            5,7             3name6           text            7               3name7           text            8               4name8           text            6,2             1[[bash_prompt$]]$Update: split splits the first argument by the delimiter passed in third argument and puts it into an array pointed by the second argument. Here main array is q, which holds the group members of a group, it&#8217;s basically an array of arrays where the index of an element is the group id, and the element is collection all the members of the group. so q[0]="1,2,3" indicates 0th group is containing members 1,2 and 3. Now in awk, first one line is read which starts with name (/^name/). Then the 3rd field (1,2,3) is broken down into an array a. Now for each element in an array a, we go for each group stored into q (for(var1 in q)) , then inside each group, we split them into another temporary array r (split(q[var1],r,",")), i.e. &#8220;1,2,3&#8221; is split into an array r. Now each element in r is compared to the element in a. if a match found, the group&#8217;s index is the index of that row  (array index starts from 0, group&#8217;s from 1, so ((var1+1)) used. Now if not found, just add this as a new group in q and the last index + 1, i.e. length of the array is the index for the rowUpdate:/^name/{  j=0;  split($3,a,",");  for(var in a) {    if(q[a[var]] != 0) {      j=q[a[var]]; i=1;      break;    }  }  j = (j == 0) ? ++k : j;  for(var1 in a) {    if(q[a[var1]] == 0) {      q[a[var1]] = j;    }  }  print $1 "tt" $2 "   tt" $3 "tt" j;}Update: base is awk has associative array and each  element is accessed by a string key. Earlier approach was to store each group in an array where key is the index of the group. So when we were reading a column, we will read each group, split the group in individual element, compare each of the element with each element of the column. But instead of storing a group, if we store the elements in an array where key is the element themselves and value at key is the index of the group to which the element belongs. So when we read a column, we split the column in individual element (split($3,a,",");) then check element in array if there is a group index with the element as key in if(q[a[var]] != 0)( in awk, if the element is not there, by default an element with value 0 is initialized there, so the check q[a[var]] != 0 ). If any element is found, we take the element&#8217;s group index as the index of the column and break. else j will remain 0. if j remains 0, ++k gives the latest group index. Now we found the group index for the column elements. Need to carry that index to those elements which are not a part of any other group( there will be cases where multiple elements in same column belongs to different group, here we are taking the first come, first serve approach, but do not over write the group index of others already belonging to another group). So for each element in column (for(var1 in a)) , if it does not belong to a group  (if(q[a[var1]] == 0)) , give it a group index q[a[var1]] = j;. So here all accesses are linear because we are accessing using elements directly a key. Thus no breaking up a group again and again for every element and hence a shorter time. My first approach was based on one of my own problem ( i mentioned in first line ) which was more complex processing but shorter data set. But this one required a simpler straight forward logic.

Advertisement

Answer