Skip to content
Advertisement

How to delete prefix, suffix in a string matching a pattern and split on a character using sed?

I have the following string, which is the output of a cassandra query in bash

col1|col2|col3+++++++++++A|1|a B|2|b C|3|c D|4|d  (3 rows)

I want to split this string so as to remove the string in the beginning till the last + symbol and then remove the tail end, which is (XYZ rows).

So, the string becomes A|1|a B|2|b C|3|c D|4|d. Now, I want to split this string into multiple arrays that look like this

A 1 a 
B 2 b 
C 3 c 
D 4 d

so that I can iterate over each row using a for loop to do some processing. The number of rows can vary.

How can I do this using sed or grep?

I tried this for the first pass but it didn’t work:

echo $string | sed 's/([0-9])rows//' | sed 's/[^+]//'

NOTE: the column strings can have multiple spaces in them ex: output of CQL query when written to file is

 topic   | partition | offset
---------+-----------+--------
 topic_2 |        31 |      4
 topic_2 |        30 |      4
 topic_2 |        29 |      4
 topic_2 |        28 |      4
 topic_2 |        27 |      4
 topic_2 |        26 |      4
 topic_2 |        25 |      4
 topic_2 |        24 |      4
 topic_2 |        23 |      4
 topic_2 |        22 |      4
 topic_2 |        21 |      4
 topic_2 |        20 |      4
 topic_2 |        19 |      4
 topic_2 |        18 |      4
 topic_2 |        17 |      4
 topic_2 |        16 |      4
 topic_2 |        15 |      4
 topic_2 |        14 |      4
 topic_2 |        13 |      4
 topic_2 |        12 |      4
 topic_2 |        11 |      4
 topic_2 |        10 |      4
 topic_2 |         9 |      4
 topic_2 |         8 |      4
 topic_2 |         7 |      4
 topic_2 |         6 |      4
 topic_2 |         5 |      4
 topic_2 |         4 |      4
 topic_2 |         3 |      4
 topic_2 |         2 |      4
 topic_2 |         1 |      4
 topic_2 |         0 |      4

(32 rows)

Advertisement

Answer

$ sed 's/[^+]*[+]*(.*[^ ]) *(.*)$/1/;y/ |/n /' <<< 'col1|col2|col3+++++++++++A|1|a B|2|b C|3|c D|4|d  (3 rows)'
A 1 a
B 2 b
C 3 c
D 4 d

The substitution does the following (hat tip to potong for pointing out how to get rid of one more substitution):

s/
    [^+]*      # Match non-plusses
    [+]*       # Followed by plusses
    (         # Capture the next group
        .*     # Any characters (greedily)
        [^ ]   # that end with a non-space
    )         # End of capture group
     *         # Spaces
    (.*)       # Followed by whatever in parentheses
$/1/          # Replace all that by the capture group

resulting in this intermediate stage:

$ sed 's/[^+]*[+]*(.*[^ ]) *(.*)$/1/' <<< 'col1|col2|col3+++++++++++A|1|a B|2|b C|3|c D|4|d  (3 rows)'
A|1|a B|2|b C|3|c D|4|d

The transformation (y///) turns all spaces into newlines and pipes into spaces.

Spaces other than the ones separating rows

If there are spaces within column and we assume that each entry has the format

[spaces]entry[spaces]

i.e., exactly two sets of spaces per entry, we have to replace the transformation y/// with another substitution,

s/([^ |])( +[^ |])/1n2/g

This looks for spaces following not a space or pipe and followed by not a space or pipe, and inserts a newline before those spaces. Result:

$ var='col1 | col2 | col3 +++++++++++ A | 1 | a B | 2 | b C | 3 | c D | 4 | d (3 rows)'
$ sed 's/[^+]*[+]*(.*[^ ]) *(.*)$/1/;s/([^ |])( +[^ |])/1n2/g' <<< "$var"
 A | 1 | a
 B | 2 | b
 C | 3 | c
 D | 4 | d
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement