Skip to content
Advertisement

Grouping common elements using awk

The following table illustrates a brief snapshot of the data that I wish to manipulate. I am looking for an awk script that will group similar elements into one group. For eg. if you look at the table below:

  1. Numbers (1,2,3,4,6) should all belong to one group. So row1 row2 row4 row8 will be group “1”
  2. Number 9 is unique and does not have any common elements. So it will reside alone in a separate group say group 2
  3. Similarly numbers 5,7 will reside in one group say group 3 and so on…

The file:

heading1        heading2         numberlist     group
name1           text             1,2,3          1
name2           text             2              1
name3           text             9              2
name4           text             1,4            1
name5           text             5,7            3
name6           text             7              3
name7           text             8              4
name8           text             6,2            1

I was searching for queries similar to mine and found this link. Grouping lists by common elements. But the solution is in C++ and not awk, which is my primary requirement.

Incidentally I also found this awk solution that is somewhat related to my query but it was devoid of handling of comma separated values. awk script grouping with array

Numberlist i.e. $3 is my only consideration for grouping.

Advertisement

Answer

This problem seemed almost same as one of my problems and i had used one column in your example to solve my problem 🙂 So…

[[bash_prompt$]]$ cat log ; echo "########"; 
> cat test.sh ;echo "########";  awk -f test.sh log
heading1        heading2         numberlist     group
name1           text             1,2,3
name2           text             2
name3           text             9
name4           text             1,4
name5           text             5,7
name6           text             7
name7           text             8
name8           text             6,2
########
/^name/{
  i=0; j=0;
  split($3,a,",");
  for(var in a) {
    for(var1 in q) {
      split(q[var1],r,",");
      for(var2 in r) {
        if(r[var2] == a[var]) {
          i=1;
          j=((var1+1));
        }
      }
    }
  }
    if(i == 0) {
      q[length(q)] = $3;
      j=length(q);
    }
  print $1 "tt" $2 "   tt" $3 "tt" j;
}
########
name1           text            1,2,3           1
name2           text            2               1
name3           text            9               2
name4           text            1,4             1
name5           text            5,7             3
name6           text            7               3
name7           text            8               4
name8           text            6,2             1
[[bash_prompt$]]$

Update:

split splits the first argument by the delimiter passed in third argument and puts it into an array pointed by the second argument. Here main array is q, which holds the group members of a group, it’s basically an array of arrays where the index of an element is the group id, and the element is collection all the members of the group. so q[0]="1,2,3" indicates 0th group is containing members 1,2 and 3. Now in awk, first one line is read which starts with name (/^name/). Then the 3rd field (1,2,3) is broken down into an array a. Now for each element in an array a, we go for each group stored into q (for(var1 in q)) , then inside each group, we split them into another temporary array r (split(q[var1],r,",")), i.e. “1,2,3” is split into an array r. Now each element in r is compared to the element in a. if a match found, the group’s index is the index of that row (array index starts from 0, group’s from 1, so ((var1+1)) used. Now if not found, just add this as a new group in q and the last index + 1, i.e. length of the array is the index for the row

Update:

/^name/{
  j=0;
  split($3,a,",");
  for(var in a) {
    if(q[a[var]] != 0) {
      j=q[a[var]]; i=1;
      break;
    }
  }
  j = (j == 0) ? ++k : j;
  for(var1 in a) {
    if(q[a[var1]] == 0) {
      q[a[var1]] = j;
    }
  }
  print $1 "tt" $2 "   tt" $3 "tt" j;
}

Update:

base is awk has associative array and each element is accessed by a string key. Earlier approach was to store each group in an array where key is the index of the group. So when we were reading a column, we will read each group, split the group in individual element, compare each of the element with each element of the column. But instead of storing a group, if we store the elements in an array where key is the element themselves and value at key is the index of the group to which the element belongs. So when we read a column, we split the column in individual element (split($3,a,",");) then check element in array if there is a group index with the element as key in if(q[a[var]] != 0)( in awk, if the element is not there, by default an element with value 0 is initialized there, so the check q[a[var]] != 0 ). If any element is found, we take the element’s group index as the index of the column and break. else j will remain 0. if j remains 0, ++k gives the latest group index. Now we found the group index for the column elements. Need to carry that index to those elements which are not a part of any other group( there will be cases where multiple elements in same column belongs to different group, here we are taking the first come, first serve approach, but do not over write the group index of others already belonging to another group). So for each element in column (for(var1 in a)) , if it does not belong to a group (if(q[a[var1]] == 0)) , give it a group index q[a[var1]] = j;. So here all accesses are linear because we are accessing using elements directly a key. Thus no breaking up a group again and again for every element and hence a shorter time. My first approach was based on one of my own problem ( i mentioned in first line ) which was more complex processing but shorter data set. But this one required a simpler straight forward logic.

User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement