How to compare two files containing many long strings then extract lines with at least n consecutive identical chars?

I have two large files, each containing long strings separated by newlines. I need to find the similarities and differences between them, but the two files use different formats.

File a:

JavaScript

File b:

JavaScript

So now I want to extract the whole line containing NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE from File a to a new file and also delete this line in File a.

I tried to achieve this with meld and got as far as having it show me only the similarities. Say File a has 3000 lines and File b has 120 lines: I now want to find the lines with at least n consecutive identical chars and remove them from File a.

I found this and accordingly tried to use diff like this:

JavaScript

This didn’t do anything. I got no output whatsoever, so I guess it exited with 0 and didn’t find anything.

How can I make this work? I have Linux and Windows available.


Answer

Given the format of the files, the most efficient implementation would be something like this:

  1. Load all b strings into a [hashtable] or [HashSet[string]]
  2. Filter the contents of a by:
    • Extracting the substring from each line with String.Split(':') or similar
    • Checking whether it exists in the set from step 1
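The two steps above can be sketched in Python as follows. This is a minimal sketch rather than the answer's original code: it assumes each line of File a consists of ':'-separated fields, one of which is exactly a string from File b. The names partition_lines and extract_matches are hypothetical helpers chosen for illustration.

```python
from pathlib import Path
from typing import Iterable, List, Set, Tuple


def partition_lines(
    a_lines: Iterable[str], b_lines: Iterable[str], sep: str = ":"
) -> Tuple[List[str], List[str]]:
    """Split the lines of file a into (matched, remaining)."""
    # Step 1: load every non-empty string from file b into a set,
    # so each membership test is a constant-time hash lookup.
    known: Set[str] = {line.strip() for line in b_lines if line.strip()}

    matched: List[str] = []
    remaining: List[str] = []
    for line in a_lines:
        # Step 2: split the line into fields (assumed ':'-separated)
        # and check whether any field is one of the known strings.
        fields = line.rstrip("\r\n").split(sep)
        if any(field.strip() in known for field in fields):
            matched.append(line)
        else:
            remaining.append(line)
    return matched, remaining


def extract_matches(a_path: str, b_path: str, out_path: str, sep: str = ":") -> int:
    """Move matching lines from file a into out_path; return how many moved."""
    a_lines = Path(a_path).read_text(encoding="utf-8").splitlines(keepends=True)
    b_lines = Path(b_path).read_text(encoding="utf-8").splitlines()
    matched, remaining = partition_lines(a_lines, b_lines, sep)
    Path(out_path).write_text("".join(matched), encoding="utf-8")
    # Caution: this overwrites file a, keeping only the unmatched lines.
    Path(a_path).write_text("".join(remaining), encoding="utf-8")
    return len(matched)
```

Calling extract_matches("a.txt", "b.txt", "matched.txt") (placeholder file names) would write the matching lines to matched.txt and leave only the unmatched lines in a.txt. Note that this finds exact field matches only. It would not catch lines that merely share n consecutive characters with a string from File b.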