Editor’s note: I’ve clarified the problem definition, because I think the problem is an interesting one, and this question deserves to be reopened.
I’ve got a text file containing key-value lines in the following format – note that the #
lines below are only there to show repeating blocks and are NOT part of the input:
```
Country:United Kingdom
Language:English
Capital city:London
#
Country:France
Language:French
Capital city:Paris
#
Country:Germany
Language:German
Capital city:Berlin
#
Country:Italy
Language:Italian
Capital city:Rome
#
Country:Russia
Language:Russian
Capital city:Moscow
```
Using shell commands and utilities, how can I transform such a file to CSV format, so it will look like this?
```
Country,Language,Capital city
United Kingdom,English,London
France,French,Paris
Germany,German,Berlin
Italy,Italian,Rome
Russia,Russian,Moscow
```
In other words:
- Make the key names the column names of the CSV header row.
- Make the values from each block into a separate data row.
[OP’s original] Edit: My idea would be to separate the entries, e.g. Country:France would become Country France, and then grep/sed the heading. However, I have no idea how to move the headings from a single column to several separate ones.
Answer
A simple solution with `cut`, `paste`, and `head` (assumes input file `file`, outputs to file `out.csv`):

```bash
#!/usr/bin/env bash

{
  cut -d':' -f1 file | head -n 3 | paste -d, - - -
  cut -d':' -f2- file | paste -d, - - -
} >out.csv
```
- `cut -d':' -f1 file | head -n 3` creates the header line:
  - `cut -d':' -f1 file` extracts the first `:`-based field from each input line, and `head -n 3` stops after 3 lines, given that the headers repeat every 3 lines.
  - `paste -d, - - -` takes 3 input lines from stdin (one for each `-`) and combines them into a single, comma-separated output line (`-d,`).
- `cut -d':' -f2- file | paste -d, - - -` creates the data lines:
  - `cut -d':' -f2- file` extracts everything after the `:` from each input line.
  - As above, `paste` then combines 3 values into a single, comma-separated output line (see the worked example after this list).
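The following sketch shows what the individual stages produce when run against the sample input above; the `# =>` comments indicate the expected output and are purely illustrative:

```bash
# Header fields, one per line, stopped after the first block of 3 lines:
cut -d':' -f1 file | head -n 3
# => Country
# => Language
# => Capital city

# paste then joins every 3 stdin lines into one comma-separated line:
cut -d':' -f1 file | head -n 3 | paste -d, - - -
# => Country,Language,Capital city

# The values (everything after the first ':'), again joined 3 at a time:
cut -d':' -f2- file | paste -d, - - -
# => United Kingdom,English,London
# => France,French,Paris
# => ...
```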
agc points out in a comment that the column count (`3`) and the `paste` operands (`- - -`) are hard-coded above. The following solution parameterizes the column count (set it via `n=...`):
```bash
{
  n=3
  pasteOperands=$(printf '%.s- ' $(seq $n))
  cut -d':' -f1 file | head -n $n | paste -d, $pasteOperands
  cut -d':' -f2- file | paste -d, $pasteOperands
} >out.csv
```
`printf '%.s- ' $(seq $n)` is a trick that produces a list of as many space-separated `-` characters as there are columns (`$n`).
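A quick, illustrative check of the trick in isolation (the `# =>` comment shows the expected expansion):

```bash
n=3
printf '%.s- ' $(seq $n)
# => - - -
# %.s consumes each argument produced by seq without printing it (precision 0),
# and the format string is reused once per argument, so '- ' is emitted n times.
```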
While the previous solution is now parameterized, it still assumes that the column count is known in advance; the following solution dynamically determines the column count (it requires Bash 4+ due to the use of `readarray`, but could be made to work with Bash 3.x):
```bash
# Determine the unique list of column headers and
# read them into a Bash array.
readarray -t columnHeaders < <(awk -F: 'seen[$1]++ { exit } { print $1 }' file)

# Output the header line.
(IFS=','; echo "${columnHeaders[*]}") >out.csv

# Append the data lines.
cut -d':' -f2- file | paste -d, $(printf '%.s- ' $(seq ${#columnHeaders[@]})) >>out.csv
```
- `awk -F: 'seen[$1]++ { exit } { print $1 }'` outputs each input line’s column name (the 1st `:`-separated field), remembers the column names in associative array `seen`, and stops at the first column name that is seen for the second time.
- `readarray -t columnHeaders` reads `awk`’s output line by line into array `columnHeaders`.
- `(IFS=','; echo "${columnHeaders[*]}") >out.csv` prints the array elements separated with a comma (the first character of `$IFS`); note the use of a subshell (`(...)`) so as to localize the effect of modifying `$IFS`, which would otherwise have global effects.
- The `cut ...` pipeline uses the same approach as before, with the operands for `paste` being created based on the count of elements in array `columnHeaders` (`${#columnHeaders[@]}`). A short check of the header-detection stage follows below.
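As an illustrative check of the header-detection stage against the sample input (`# =>` comments show the expected output):

```bash
# awk prints each line's key until it sees a key for the second time:
awk -F: 'seen[$1]++ { exit } { print $1 }' file
# => Country
# => Language
# => Capital city

# readarray then captures those lines as the elements of a Bash array:
readarray -t columnHeaders < <(awk -F: 'seen[$1]++ { exit } { print $1 }' file)
declare -p columnHeaders
# => declare -a columnHeaders=([0]="Country" [1]="Language" [2]="Capital city")
```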
To wrap the above up in a function that outputs to stdout and also works with Bash 3.x:
```bash
toCsv() {
  local file=$1 columnHeaders

  # Determine the unique list of column headers and
  # read them into a Bash array.
  IFS=$'\n' read -d '' -ra columnHeaders < <(awk -F: 'seen[$1]++ { exit } { print $1 }' "$file")

  # Output the header line.
  (IFS=','; echo "${columnHeaders[*]}")

  # Output the data lines.
  cut -d':' -f2- "$file" | paste -d, $(printf '%.s- ' $(seq ${#columnHeaders[@]}))
}

# Sample invocation
toCsv file > out.csv
```
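With the sample input from the question stored in `file`, calling the function directly should print the desired CSV (illustrative output):

```bash
toCsv file
# => Country,Language,Capital city
# => United Kingdom,English,London
# => France,French,Paris
# => Germany,German,Berlin
# => Italy,Italian,Rome
# => Russia,Russian,Moscow
```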