Skip to content
Advertisement

Interleave lines sorted by a column

(Similar to How to interleave lines from two text files but for a single input. Also similar to Sort lines by group and column but interleaving or randomizing versus sorting.)

I have a set of systems and tasks in two columns, SYSTEM,TASK:

alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
beta,33452909
beta,40850198
beta,82645731
gamma,64910850

I want to distribute the tasks to each system in a balanced way. The ideal case where each system has the same number of tasks would be round-robin, one alpha then one beta then one gamma and repeat until finished.

  • I get the whole list of tasks + systems at once, so I don’t need to keep any state
  • The list of systems is not static, on the order of N=100
  • The total number of tasks is variable, on the order of N=500
  • The number of tasks for each system is not guaranteed to be equal
  • Hard / absolute interleaving isn’t required, as long as there aren’t two of the same system twice in a row
  • The same task may show up more than once, but not for the same system
  • Input format / delimiter can be changed

I can solve this well enough with some fancy scripting to split the data into multiple files (grep ^alpha, input > alpha.txt etc) and then recombine them with paste or similar, but I’d like to use a single command or set of pipes to run it without intermediate files or a proper scripting language. Just using sort -R gets me 95% of the way there, but I end up with 2 tasks for the same system in a row almost every time, and sometimes 3 or more depending on the initial distribution.

edit: To clarify, any output should not have the same system on two lines in a row. All system,task pairs must be preserved, you can’t move a task from one system to another – that’d make this really easy!

One of several possible sample outputs:

beta,40850198
alpha,90198500
beta,82645731
alpha,93082105
gamma,64910850
beta,21700055
alpha,30184438
beta,33452909

Advertisement

Answer

We start with by answering the underlying theoretical problem. The problem is not as simple as it seems. Feel free to implement a script based on this answer.

The blocks formatted as quotes are not quotes. I just wanted to highlight them to improve navigation in this rather long answer.

Theoretical Problem

Given a finite set of letters L with frequencies f : L→ℕ0, find a sequence of letters such that every letter ℓ appears exactly f(ℓ) times and adjacent elements of the sequence are always different.

Example

L = {a,b,c} with f(a)=4, f(b)=2, f(c)=1

  • ababaca, acababa, and abacaba are all valid solutions.
  • aaaabbc is invalid – Some adjacent elements are equal, for instance aa or bb.
  • ababac is invalid – The letter a appears 3 times, but its frequency is f(a)=4
  • cababac is invalid – The letter c appears 2 times, but its frequency is f(c)=1

Solution

The following approach produces a valid sequence if and only if there exists a solution.

  1. Sort the letters by their frequencies.
    For ease of notation we assume, without loss of generality, that f(a) ≥ f(b) ≥ f(c) ≥ … ≥ 0.
    Note: There exists a solution if and only if f(a) ≤ 1 + ∑ℓ≠a f(ℓ).
  2. Write down a sequence s of f(a) many a.
  3. Add the remaining letters into a FIFO working list, that is:
    • (Don’t add any a)
    • First add f(b) many b
    • Then f(c) many c
    • and so on
  4. Iterate from left to right over the sequence s and insert after each element a letter from the working list. Repeat this step until the working list is empty.

Example

L = {a,b,c,d} with f(a)=5, f(b)=5, f(c)=4, f(d)=2

  1. The letters are already sorted by their frequencies.
  2. s = aaaaa
  3. workinglist = bbbbbccccdd. The leftmost entry is the first one.
  4. We iterate from left to right. The places where we insert letters from the working list are marked with an _ underscore.
    • s = a_a_a_a_a_       workinglist = bbbbbccccdd
      s = aba_a_a_a_       workinglist = bbbbccccdd
      s = ababa_a_a_       workinglist = bbbccccdd

      s = ababababab       workinglist = ccccdd
      ⚠️ We reached the end of sequence s. We repeat step 4.
    • s = a_b_a_b_a_b_a_b_a_b_       workinglist = ccccdd
      s = acb_a_b_a_b_a_b_a_b_       workinglist = cccdd

      s = acbcacb_a_b_a_b_a_b_       workinglist = cdd
      s = acbcacbca_b_a_b_a_b_       workinglist = dd
      s = acbcacbcadb_a_b_a_b_       workinglist = d
      s = acbcacbcadbda_b_a_b_       workinglist =
      ⚠️ The working list is empty. We stop.
  5. The final sequence is acbcacbcadbdabab.

Implementation In Bash

Here is a bash implementation of the proposed approach that works with your input format. Instead of using a working list each line is labeled with a binary floating point number specifying the position of that line in the final sequence. Then the lines are sorted by their labels. That way we don’t have to use explicit loops. Intermediate results are stored in variables. No files are created.

#! /bin/bash
inputFile="$1" # replace $1 by your input file or call "./thisScript yourFile"

inputBySys="$(sort "$inputFile")"
sysFreqBySys="$(cut -d, -f1 <<< "$inputBySys" | uniq -c | sed 's/^ *//;s/ /,/')"
inputBySysFreq="$(join -t, -1 2 -2 1 <(echo "$sysFreqBySys") <(echo "$inputBySys") | sort -t, -k2,2nr -k1,1)"

maxFreq="$(head -n1 <<< "$inputBySysFreq" | cut -d, -f2)"
lineCount="$(wc -l <<< "$inputBySysFreq")"
increment="$(awk '{l=log($1/$2)/log(2); l=int(l)-(int(l)>l); print 2^l}' <<< "$maxFreq $lineCount")"

seq="$({ echo obase=2; seq 0 "$increment" "$maxFreq" | head -n-1; } | bc |
        awk -F. '{sub(/0*$/,"",$2); print 0+$1 "," $2 "," length($2)}' |
        sort -snt, -k3,3 -k2,2 | head -n "$lineCount")"

paste -d, <(echo "$seq") <(echo "$inputBySysFreq") | sort -nt, -k1,1 -k2,2 | cut -d, -f4,6

This solution could fail for very long input files due to the limited precision of floating point numbers in seq and awk.

User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement