(Similar to How to interleave lines from two text files but for a single input. Also similar to Sort lines by group and column but interleaving or randomizing versus sorting.)
I have a set of systems and tasks in two columns, SYSTEM,TASK
:
alpha,90198500 alpha,93082105 alpha,30184438 beta,21700055 beta,33452909 beta,40850198 beta,82645731 gamma,64910850
I want to distribute the tasks to each system in a balanced way. The ideal case where each system has the same number of tasks would be round-robin, one alpha
then one beta
then one gamma
and repeat until finished.
- I get the whole list of tasks + systems at once, so I don’t need to keep any state
- The list of systems is not static, on the order of
N=100
- The total number of tasks is variable, on the order of
N=500
- The number of tasks for each system is not guaranteed to be equal
- Hard / absolute interleaving isn’t required, as long as there aren’t two of the same system twice in a row
- The same task may show up more than once, but not for the same system
- Input format / delimiter can be changed
I can solve this well enough with some fancy scripting to split the data into multiple files (grep ^alpha, input > alpha.txt
etc) and then recombine them with paste
or similar, but I’d like to use a single command or set of pipes to run it without intermediate files or a proper scripting language. Just using sort -R
gets me 95% of the way there, but I end up with 2 tasks for the same system in a row almost every time, and sometimes 3 or more depending on the initial distribution.
edit:
To clarify, any output should not have the same system
on two lines in a row. All system,task
pairs must be preserved, you can’t move a task from one system to another – that’d make this really easy!
One of several possible sample outputs:
beta,40850198 alpha,90198500 beta,82645731 alpha,93082105 gamma,64910850 beta,21700055 alpha,30184438 beta,33452909
Advertisement
Answer
We start with by answering the underlying theoretical problem. The problem is not as simple as it seems. Feel free to implement a script based on this answer.
The blocks formatted as quotes are not quotes. I just wanted to highlight them to improve navigation in this rather long answer.
Theoretical Problem
Given a finite set of letters L with frequencies f : L→ℕ0, find a sequence of letters such that every letter ℓ appears exactly f(ℓ) times and adjacent elements of the sequence are always different.
Example
L = {a,b,c} with f(a)=4, f(b)=2, f(c)=1
- ababaca, acababa, and abacaba are all valid solutions.
- aaaabbc is invalid – Some adjacent elements are equal, for instance aa or bb.
- ababac is invalid – The letter a appears 3 times, but its frequency is f(a)=4
- cababac is invalid – The letter c appears 2 times, but its frequency is f(c)=1
Solution
The following approach produces a valid sequence if and only if there exists a solution.
- Sort the letters by their frequencies.
For ease of notation we assume, without loss of generality, that f(a) ≥ f(b) ≥ f(c) ≥ … ≥ 0.
Note: There exists a solution if and only if f(a) ≤ 1 + ∑ℓ≠a f(ℓ).- Write down a sequence s of f(a) many a.
- Add the remaining letters into a FIFO working list, that is:
- (Don’t add any a)
- First add f(b) many b
- Then f(c) many c
- and so on
- Iterate from left to right over the sequence s and insert after each element a letter from the working list. Repeat this step until the working list is empty.
Example
L = {a,b,c,d} with f(a)=5, f(b)=5, f(c)=4, f(d)=2
- The letters are already sorted by their frequencies.
- s = aaaaa
- workinglist = bbbbbccccdd. The leftmost entry is the first one.
- We iterate from left to right. The places where we insert letters from the working list are marked with an _ underscore.
- s = a_a_a_a_a_ workinglist = bbbbbccccdd
s = aba_a_a_a_ workinglist = bbbbccccdd
s = ababa_a_a_ workinglist = bbbccccdd
…
s = ababababab workinglist = ccccdd
⚠️ We reached the end of sequence s. We repeat step 4. - s = a_b_a_b_a_b_a_b_a_b_ workinglist = ccccdd
s = acb_a_b_a_b_a_b_a_b_ workinglist = cccdd
…
s = acbcacb_a_b_a_b_a_b_ workinglist = cdd
s = acbcacbca_b_a_b_a_b_ workinglist = dd
s = acbcacbcadb_a_b_a_b_ workinglist = d
s = acbcacbcadbda_b_a_b_ workinglist =
⚠️ The working list is empty. We stop.
- s = a_a_a_a_a_ workinglist = bbbbbccccdd
- The final sequence is acbcacbcadbdabab.
Implementation In Bash
Here is a bash
implementation of the proposed approach that works with your input format. Instead of using a working list each line is labeled with a binary floating point number specifying the position of that line in the final sequence. Then the lines are sorted by their labels. That way we don’t have to use explicit loops. Intermediate results are stored in variables. No files are created.
#! /bin/bash inputFile="$1" # replace $1 by your input file or call "./thisScript yourFile" inputBySys="$(sort "$inputFile")" sysFreqBySys="$(cut -d, -f1 <<< "$inputBySys" | uniq -c | sed 's/^ *//;s/ /,/')" inputBySysFreq="$(join -t, -1 2 -2 1 <(echo "$sysFreqBySys") <(echo "$inputBySys") | sort -t, -k2,2nr -k1,1)" maxFreq="$(head -n1 <<< "$inputBySysFreq" | cut -d, -f2)" lineCount="$(wc -l <<< "$inputBySysFreq")" increment="$(awk '{l=log($1/$2)/log(2); l=int(l)-(int(l)>l); print 2^l}' <<< "$maxFreq $lineCount")" seq="$({ echo obase=2; seq 0 "$increment" "$maxFreq" | head -n-1; } | bc | awk -F. '{sub(/0*$/,"",$2); print 0+$1 "," $2 "," length($2)}' | sort -snt, -k3,3 -k2,2 | head -n "$lineCount")" paste -d, <(echo "$seq") <(echo "$inputBySysFreq") | sort -nt, -k1,1 -k2,2 | cut -d, -f4,6
This solution could fail for very long input files due to the limited precision of floating point numbers in seq
and awk
.