Group By and Sum from .txt file in Linux

Following a suggestion from @tripleee, I’m posting another question for a coding issue I’m having trying to be more specific with sources and expected results.

My source .txt file is quite over populated with lines and using AWK, I:

extract only rows identified by a specific code.
parse the content of the line in order to get only certain values.
remove leading zeros from substring.

After these first steps, my saved output is like this (I can change the order of the columns as I wish).

1646         | 01         |       1602 |          4
1646         | 01         |       1604 |         19
1646         | 01         |       1605 |         35
1646         | 01         |       1606 |          7
1646         | 01         |       1607 |         37
1646         | 01         |       1609 |          9
1646         | 01         |       1610 |          3
1646         | 01         |       1611 |          2
1646         | 01         |       1612 |         90
1646         | 01         |       1613 |        107
1646         | 01         |       1614 |         12
1646         | 01         |       1615 |         12
1646         | 01         |       1616 |          4
1646         | 01         |       1617 |          1
1646         | 01         |       1618 |         39
3625         | 01         |       2311 |         14
3625         | 01         |       2312 |          9
3625         | 01         |       2313 |         11
3625         | 01         |       2314 |          3
3625         | 01         |       2315 |          1
3625         | 01         |       2316 |          1
3625         | 01         |       2317 |         28
3625         | 01         |       2318 |          9
3625         | 01         |       2320 |         12
3625         | 01         |       2321 |          5
3625         | 01         |       2322 |         29
3625         | 01         |       2324 |          3
1646         | 06         |       1600 |         12
1646         | 06         |       1603 |         19
1646         | 06         |       1608 |         11
3625         | 06         |       2319 |          6
3625         | 06         |       2323 |         16
3547         | AF         |       1525 |          1
3547         | AF         |       1526 |          1

What I would like to achieve is a group by on the second column, summing the fourth column, with a line for each value in the first column.

In other words, the expected output should be:

DEP         SUM
01
    1646    381
    3625    125
06  
    1646    42
    3625    22
AF  
    3547    2

Or with a better presentation:

DEP        | PDV        |            |        SUM
01         |            |            |          
           | 1646       |            |        381
           | 3625       |            |        125
06         |            |            |          
           | 1646       |            |         42
           | 3625       |            |         22
AF         |            |            |          
           | 3547       |            |          2

So far I’ve tried this:

awk '{sum[$2]+=$4}
END{for (date in sum) print date, sum[date], freq[date]}' Test_Awk_2.txt

But results in not what I expected:

AF 2
01 506
06 64

Answer

You’re only grouping your sum by one column, but you want to group it by 2 columns, so you need to use both of them as the key in the array.

You’re also printing a freq array, but you never assigned it. It also doesn’t exist in your desired output.

awk '{sum[$2" "$1]+=$4}
     END{for (date in sum) print date, sum[date]}' Test_Awk_2.txt | sort -n -k1,2

The sort command will put all the same values of $2 in consecutive rows.

Advertisement

Answer