Editor’s note: I’ve clarified the problem definition, because I think the problem is an interesting one, and this question deserves to be reopened.
I’ve got a text file containing key-value lines in the following format – note that the # lines below are only there to show repeating blocks and are NOT part of the input:
Country:United Kingdom Language:English Capital city:London # Country:France Language:French Capital city:Paris # Country:Germany Language:German Capital city:Berlin # Country:Italy Language:Italian Capital city:Rome # Country:Russia Language:Russian Capital city:Moscow
Using shell commands and utilities, how can I transform such a file to CSV format, so it will look like this?
Country,Language,Capital city United Kingdom,English,London France,French,Paris Germany,German,Berlin Italy,Italian,Rome Russia,Russian,Moscow
In other words:
- Make the key names the column names of the CSV header row.
- Make the values from each block a data row each.
[OP’s original] Edit: My idea would be to separate the entries e.g. Country:France would become Country France, and then grep/sed the heading. However I have no idea how to move the headings from a single column to several separate ones.
Advertisement
Answer
A simple solution with cut, paste, and head (assumes input file file, outputs to file out.csv):
#!/usr/bin/env bash
{ cut -d':' -f1 file | head -n 3 | paste -d, - - -;
cut -d':' -f2- file | paste -d, - - -; } >out.csv
cut -d':' -f1 file | head -n 3creates the header line:cut -d':' -f1 fileextracts the first:-based field from each input line, andhead -n 3stops after 3 lines, given that the headers repeat every 3 lines.paste -d, - - -takes 3 input lines from stdin (one for each-) and combines them to a single, comma-separated output line (-d,)
cut -d':' -f2- file | paste -d, - - -creates the data lines:cut -d':' -f2- fileextracts everything after the:from each input line.As above,
pastethen combines 3 values to a single, comma-separated output line.
agc points out in a comment that the column count (3) and the paste operands (- - -) are hard-coded above.
The following solution parameterizes the column count (set it via n=...):
{ n=3; pasteOperands=$(printf '%.s- ' $(seq $n))
cut -d':' -f1 file | head -n $n | paste -d, $pasteOperands;
cut -d':' -f2- file | paste -d, $pasteOperands; } >out.csv
printf '%.s- ' $(seq $n)is a trick that produces a list of as many space-separated-chars. as there are columns ($n).
While the previous solution is now parameterized, it still assumes that the column count is known in advance; the following solution dynamically determines the column count (requires Bash 4+ due to use of readarray, but could be made to work with Bash 3.x):
# Determine the unique list of column headers and
# read them into a Bash array.
readarray -t columnHeaders < <(awk -F: 'seen[$1]++ { exit } { print $1 }' file)
# Output the header line.
(IFS=','; echo "${columnHeaders[*]}") >out.csv
# Append the data lines.
cut -d':' -f2- file | paste -d, $(printf '%.s- ' $(seq ${#columnHeaders[@]})) >>out.csv
awk -F: 'seen[$1]++ { exit } { print $1 }outputs each input line’s column name (the 1st:-separated field), remembers the column names in associative arrayseen, and stops at the first column name that is seen for the second time.readarray -t columnHeadersreadsawk‘s output line by line into arraycolumnHeaders(IFS=','; echo "${columnHeaders[*]}") >out.csvprints the array elements using a space as the separator (specified via$IFS); note the use of a subshell ((...)) so as to localize the effect of modifying$IFS, which would otherwise have global effects.The
cut ...pipeline uses the same approach as before, with the operands forpastebeing created based on the count of the elements of arraycolumnHeaders(${#columnHeaders[@]}).
To wrap the above up in a function that outputs to stdout and also works with Bash 3.x:
toCsv() {
local file=$1 columnHeaders
# Determine the unique list of column headers and
# read them into a Bash array.
IFS=$'n' read -d '' -ra columnHeaders < <(awk -F: 'seen[$1]++ { exit } { print $1 }' "$file")
# Output the header line.
(IFS=','; echo "${columnHeaders[*]}")
# Append the data lines.
cut -d':' -f2- "$file" | paste -d, $(printf '%.s- ' $(seq ${#columnHeaders[@]}))
}
# Sample invocation
toCsv file > out.csv