Group By and Sum from .txt file in Linux

Following a suggestion from @tripleee, I’m posting this as a separate question so I can be more specific about my source data and expected results.

My source .txt file contains far more lines than I need, so using AWK I:

  • extract only the rows identified by a specific code;
  • parse each line to keep only certain values;
  • remove leading zeros from a substring.

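For context, those three steps might look like the sketch below; the raw file layout isn’t shown in the question, so the `|` separator, the field positions and the record code `REC` are all assumptions for illustration:

```shell
# Hypothetical pre-processing sketch: the "REC" code, the field order
# (CODE|GROUP|DEP|ITEM|QTY) and the "|" separator are assumptions.
awk -F'|' '$1 == "REC" {
    grp = $2; dep = $3            # keep the group and DEP columns as-is
    code = $4 + 0; qty = $5 + 0   # numeric coercion strips leading zeros
    printf "%-12s | %-10s | %10s | %10s\n", grp, dep, code, qty
}' source.txt > Test_Awk_2.txt
```

Adding 0 is a common awk idiom: it coerces the field to a number, which drops leading zeros (and surrounding spaces) in one step.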
After these first steps, my saved output looks like this (I can reorder the columns as I wish):

1646         | 01         |       1602 |          4
1646         | 01         |       1604 |         19
1646         | 01         |       1605 |         35
1646         | 01         |       1606 |          7
1646         | 01         |       1607 |         37
1646         | 01         |       1609 |          9
1646         | 01         |       1610 |          3
1646         | 01         |       1611 |          2
1646         | 01         |       1612 |         90
1646         | 01         |       1613 |        107
1646         | 01         |       1614 |         12
1646         | 01         |       1615 |         12
1646         | 01         |       1616 |          4
1646         | 01         |       1617 |          1
1646         | 01         |       1618 |         39
3625         | 01         |       2311 |         14
3625         | 01         |       2312 |          9
3625         | 01         |       2313 |         11
3625         | 01         |       2314 |          3
3625         | 01         |       2315 |          1
3625         | 01         |       2316 |          1
3625         | 01         |       2317 |         28
3625         | 01         |       2318 |          9
3625         | 01         |       2320 |         12
3625         | 01         |       2321 |          5
3625         | 01         |       2322 |         29
3625         | 01         |       2324 |          3
1646         | 06         |       1600 |         12
1646         | 06         |       1603 |         19
1646         | 06         |       1608 |         11
3625         | 06         |       2319 |          6
3625         | 06         |       2323 |         16
3547         | AF         |       1525 |          1
3547         | AF         |       1526 |          1

What I would like to achieve is a group by on the second column, summing the fourth column, with one line for each value of the first column inside each group.

In other words, the expected output should be:

DEP         SUM
01
    1646    381
    3625    125
06  
    1646    42
    3625    22
AF  
    3547    2

Or with a better presentation:

DEP        | PDV        |            |        SUM
01         |            |            |          
           | 1646       |            |        381
           | 3625       |            |        125
06         |            |            |          
           | 1646       |            |         42
           | 3625       |            |         22
AF         |            |            |          
           | 3547       |            |          2

So far I’ve tried this:

awk '{sum[$2]+=$4}
END{for (date in sum) print date, sum[date], freq[date]}' Test_Awk_2.txt

But the result is not what I expected:

AF 2
01 506
06 64

Answer

You’re only grouping your sum by one column, but you want to group it by two columns, so you need to use both of them as the key of the array.

You’re also printing a freq array, but you never assign it, and it doesn’t appear in your desired output anyway.

awk '{sum[$2" "$1]+=$4}
     END{for (date in sum) print date, sum[date]}' Test_Awk_2.txt | sort -n -k1,2

The sort command puts all rows with the same value of $2 on consecutive lines.
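To also get the grouped presentation from the question, you could pipe the sorted output through a second awk pass that prints each DEP value once as a header. This is a sketch assuming the file is whitespace-separated with columns GROUP DEP CODE QTY (i.e. the pipes in the sample are only cosmetic), which is what the sums in the question suggest:

```shell
awk '{sum[$2" "$1] += $4}
     END {for (key in sum) print key, sum[key]}' Test_Awk_2.txt |
sort -k1,1 -k2,2n |
awk '$1 != dep {dep = $1; print dep}       # new DEP value: print a header line
     {printf "    %s    %s\n", $2, $3}'    # indented GROUP / SUM detail line
```

Sorting lexically on the DEP key (-k1,1) and numerically on the group (-k2,2n) keeps AF after the numeric DEP values, matching the order in the expected output.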

User contributions licensed under: CC BY-SA