Interleave lines sorted by a column

Question

(Similar to "How to interleave lines from two text files" but for a single input. Also similar to "Sort lines by group and column" but interleaving or randomizing rather than sorting.) I have a set of systems and tasks in two columns, SYSTEM,TASK: I want to distribute the tasks to each system in a balanced way. The ideal case where each

Accepted Answer

We start by answering the underlying theoretical problem, which is not as simple as it seems. Feel free to implement a script based on this answer.

Theoretical Problem

Given a finite set of letters L with frequencies f : L→ℕ₀, find a sequence of letters such that every letter ℓ appears exactly f(ℓ) times and adjacent elements of the sequence are always different.

Example

L = {a,b,c} with f(a)=4, f(b)=2, f(c)=1

ababaca, acababa, and abacaba are all valid solutions.
aaaabbc is invalid: some adjacent elements are equal, for instance aa or bb.
ababac is invalid: the letter a appears 3 times, but its frequency is f(a)=4.
cababac is invalid: the letter c appears 2 times, but its frequency is f(c)=1.

Solution

The following approach produces a valid sequence if and only if there exists a solution.

1. Sort the letters by their frequencies. For ease of notation we assume, without loss of generality, that f(a) ≥ f(b) ≥ f(c) ≥ … ≥ 0.
   Note: There exists a solution if and only if f(a) ≤ 1 + ∑_{ℓ≠a} f(ℓ).
2. Write down a sequence s of f(a) many a.
3. Add the remaining letters into a FIFO working list, that is:
   - (Don't add any a)
   - First add f(b) many b
   - Then f(c) many c
   - and so on
4. Iterate from left to right over the sequence s and insert after each element a letter from the working list. Repeat this step until the working list is empty.

Example

L = {a,b,c,d} with f(a)=5, f(b)=5, f(c)=4, f(d)=2

The letters are already sorted by their frequencies.

s = aaaaa
workinglist = bbbbbccccdd. The leftmost entry is the first one.

We iterate from left to right.
The places where we insert letters from the working list are marked with an _ underscore.

s = a_a_a_a_a_       workinglist = bbbbbccccdd
s = aba_a_a_a_       workinglist = bbbbccccdd
s = ababa_a_a_       workinglist = bbbccccdd
…
s = ababababab       workinglist = ccccdd

⚠️ We reached the end of sequence s. We repeat step 4.

s = a_b_a_b_a_b_a_b_a_b_       workinglist = ccccdd
s = acb_a_b_a_b_a_b_a_b_       workinglist = cccdd
…
s = acbcacb_a_b_a_b_a_b_       workinglist = cdd
s = acbcacbca_b_a_b_a_b_       workinglist = dd
s = acbcacbcadb_a_b_a_b_       workinglist = d
s = acbcacbcadbda_b_a_b_       workinglist =

⚠️ The working list is empty. We stop.

The final sequence is acbcacbcadbdabab.

Implementation In Bash

Here is a bash implementation of the proposed approach that works with your input format. Instead of using a working list, each line is labeled with a binary floating point number specifying the position of that line in the final sequence. Then the lines are sorted by their labels. That way we don't have to use explicit loops. Intermediate results are stored in variables. No files are created.

    #! /bin/bash
    inputFile="$1" # replace $1 by your input file or call "./thisScript yourFile"
    # Sort the input so that lines of the same system are adjacent.
    inputBySys="$(sort "$inputFile")"
    # Count the lines per system, as "count,system".
    sysFreqBySys="$(cut -d, -f1 <<< "$inputBySys" | uniq -c | sed 's/^ *//;s/ /,/')"
    # Add each system's count to its lines ("system,count,task"), most frequent system first.
    inputBySysFreq="$(join -t, -1 2 -2 1 <(echo "$sysFreqBySys") <(echo "$inputBySys") | sort -t, -k2,2nr -k1,1)"
    maxFreq="$(head -n1 <<< "$inputBySysFreq" | cut -d, -f2)"
    lineCount="$(wc -l <<< "$inputBySysFreq")"
    # Step width of the labels: 2^floor(log2(maxFreq/lineCount)).
    increment="$(awk '{l=log($1/$2)/log(2); l=int(l)-(int(l)>l); print 2^l}' <<< "$maxFreq $lineCount")"
    # Binary labels as "integerPart,fraction,fractionLength", ordered by fraction length, one per input line.
    seq="$({ echo obase=2; seq 0 "$increment" "$maxFreq" | head -n-1; } | bc |
           awk -F. '{sub(/0*$/,"",$2); print 0+$1 "," $2 "," length($2)}' |
           sort -snt, -k3,3 -k2,2 | head -n "$lineCount")"
    # Attach the labels to the lines, sort by label, and print only system and task.
    paste -d, <(echo "$seq") <(echo "$inputBySysFreq") | sort -nt, -k1,1 -k2,2 | cut -d, -f4,6

This solution could fail for very long input files due to the limited precision of floating point numbers in seq and awk.
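For comparison, the working-list procedure from the Solution section can also be written with explicit loops. The following is only a minimal sketch of the theoretical algorithm, not the label-based script and not adapted to the SYSTEM,TASK input format. The function name interleave and its LETTER:COUNT argument format are hypothetical, chosen for illustration, and the sketch assumes that letters contain no whitespace, colons, or glob characters and that counts are non-negative integers.

```bash
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Usage: interleave LETTER:COUNT...
# Prints one valid sequence, or returns 1 if none exists.
interleave() {
    [ "$#" -gt 0 ] || return 1
    local sorted pair letter count i seq work next max total
    seq='' work='' max='' total=0

    # Step 1: sort the pairs by descending frequency (ties by letter).
    sorted=$(printf '%s\n' "$@" | sort -t: -k2,2nr -k1,1)

    # Steps 2 and 3: the most frequent letter forms the initial sequence;
    # all other letters go into the FIFO working list in sorted order.
    # Both lists are stored as strings of tokens, each followed by a space.
    for pair in $sorted; do
        letter=${pair%%:*}
        count=${pair##*:}
        total=$(( total + count ))
        i=0
        while [ "$i" -lt "$count" ]; do
            if [ -z "$max" ]; then
                seq="$seq$letter "
            else
                work="$work$letter "
            fi
            i=$(( i + 1 ))
        done
        [ -n "$max" ] || max=$count
    done

    # A solution exists iff f(a) <= 1 + sum of all other frequencies.
    [ "$max" -le $(( total - max + 1 )) ] || return 1

    # Step 4: insert one working-list letter after each sequence element;
    # repeat full passes until the working list is empty.
    while [ -n "$work" ]; do
        next=''
        for letter in $seq; do
            next="$next$letter "
            if [ -n "$work" ]; then
                next="$next${work%% *} "   # take the first queued letter
                work=${work#* }            # and remove it from the list
            fi
        done
        seq=$next
    done

    # Print the tokens without separators (word splitting is intended).
    printf '%s' $seq
    echo
}
```

Called as interleave a:5 b:5 c:4 d:2, this should print acbcacbcadbdabab, the final sequence of the worked example. For multi-character names such as system identifiers, replacing the final printf format with '%s\n' would print one token per line instead.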
