Bash: transform key-value lines to CSV format [closed]

Editor’s note: I’ve clarified the problem definition, because I think the problem is an interesting one, and this question deserves to be reopened.

I’ve got a text file containing key-value lines in the following format – note that the # lines below are only there to show repeating blocks and are NOT part of the input:

Country:United Kingdom
Language:English
Capital city:London
#
Country:France
Language:French
Capital city:Paris
#
Country:Germany
Language:German
Capital city:Berlin
#
Country:Italy
Language:Italian
Capital city:Rome
#
Country:Russia
Language:Russian
Capital city:Moscow

Using shell commands and utilities, how can I transform such a file to CSV format, so it will look like this?

Country,Language,Capital city
United Kingdom,English,London
France,French,Paris
Germany,German,Berlin
Italy,Italian,Rome
Russia,Russian,Moscow

In other words:

  • Make the key names the column names of the CSV header row.
  • Turn the values from each block into a data row.

[OP’s original] Edit: My idea would be to separate the entries, e.g. Country:France would become Country France, and then grep/sed the heading. However, I have no idea how to move the headings from a single column to several separate ones.


Answer

A simple solution with cut, paste, and head (assumes the input is in file file and writes the output to out.csv):

#!/usr/bin/env bash

{ cut -d':' -f1 file | head -n 3 | paste -d, - - -;
  cut -d':' -f2- file | paste -d, - - -; } >out.csv
  • cut -d':' -f1 file | head -n 3 | paste -d, - - - creates the header line:

    • cut -d':' -f1 file extracts the first :-separated field from each input line, and head -n 3 stops after 3 lines, given that the headers repeat every 3 lines.

    • paste -d, - - - takes 3 input lines from stdin (one for each -) and combines them into a single, comma-separated output line (-d,).

  • cut -d':' -f2- file | paste -d, - - - creates the data lines:

    • cut -d':' -f2- file extracts everything after the : from each input line.

    • As above, paste then combines 3 values into a single, comma-separated output line (a sample run of both stages is shown below).
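
To make each stage concrete, here is what the pipelines should produce, assuming the sample input above is saved as file:

$ cut -d':' -f1 file | head -n 3
Country
Language
Capital city

$ cut -d':' -f1 file | head -n 3 | paste -d, - - -
Country,Language,Capital city

$ cut -d':' -f2- file | paste -d, - - -
United Kingdom,English,London
France,French,Paris
Germany,German,Berlin
Italy,Italian,Rome
Russia,Russian,Moscow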


agc points out in a comment that the column count (3) and the paste operands (- - -) are hard-coded above.

The following solution parameterizes the column count (set it via n=...):

{ n=3; pasteOperands=$(printf '%.s- ' $(seq $n)) 
  cut -d':' -f1 file | head -n $n | paste -d, $pasteOperands;
  cut -d':' -f2- file | paste -d, $pasteOperands; } >out.csv
  • printf '%.s- ' $(seq $n) produces as many space-separated - characters as there are columns ($n), as demonstrated below.
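
The trick works because printf reapplies its format string until all arguments are consumed, and %.s formats each argument with precision 0, i.e. as the empty string, leaving only the literal '- '. A quick check:

$ printf '%.s- ' $(seq 3)
- - -
$ printf '%.s- ' $(seq 5)
- - - - -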

While the previous solution is now parameterized, it still assumes that the column count is known in advance; the following solution dynamically determines the column count (requires Bash 4+ due to use of readarray, but could be made to work with Bash 3.x):

# Determine the unique list of column headers and
# read them into a Bash array.
readarray -t columnHeaders < <(awk -F: 'seen[$1]++ { exit } { print $1 }' file)

# Output the header line.
(IFS=','; echo "${columnHeaders[*]}") >out.csv

# Append the data lines.
cut -d':' -f2- file | paste -d, $(printf '%.s- ' $(seq ${#columnHeaders[@]})) >>out.csv
  • awk -F: 'seen[$1]++ { exit } { print $1 }' outputs each input line’s column name (the 1st :-separated field), remembers the column names in associative array seen, and stops at the first column name that is seen for the second time (see the sample output below).

  • readarray -t columnHeaders reads awk’s output line by line into array columnHeaders.

  • (IFS=','; echo "${columnHeaders[*]}") >out.csv prints the array elements using a comma as the separator (specified via $IFS); note the use of a subshell ((...)) so as to localize the effect of modifying $IFS, which would otherwise have global effects.

  • The cut ... pipeline uses the same approach as before, with the operands for paste being created based on the count of the elements of array columnHeaders (${#columnHeaders[@]}).
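
Running the pieces individually against the sample input shows how the header line comes together (again assuming the input is in file):

$ awk -F: 'seen[$1]++ { exit } { print $1 }' file
Country
Language
Capital city

$ readarray -t columnHeaders < <(awk -F: 'seen[$1]++ { exit } { print $1 }' file)
$ (IFS=','; echo "${columnHeaders[*]}")
Country,Language,Capital city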


To wrap the above up in a function that outputs to stdout and also works with Bash 3.x:

toCsv() {

  local file=$1 columnHeaders

  # Determine the unique list of column headers and
  # read them into a Bash array.
  IFS=$'\n' read -d '' -ra columnHeaders < <(awk -F: 'seen[$1]++ { exit } { print $1 }' "$file")

  # Output the header line.
  (IFS=','; echo "${columnHeaders[*]}")

  # Append the data lines.
  cut -d':' -f2- "$file" | paste -d, $(printf '%.s- ' $(seq ${#columnHeaders[@]}))
}

# Sample invocation
toCsv file > out.csv
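
Given the sample input from the question, the function should print exactly the desired CSV:

$ toCsv file
Country,Language,Capital city
United Kingdom,English,London
France,French,Paris
Germany,German,Berlin
Italy,Italian,Rome
Russia,Russian,Moscow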