I have two large files, each containing long strings separated by newlines. I need to find similarities and differences between them. The problem is that the formats of the two files differ.
File a:
9217:NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE:dasda97sda9sdadfghgg789hfg87ghf8fgh87
File b:
NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE
So now I want to extract the whole line containing NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE
from File a to a new file and also delete this line in File a.
I have tried achieving this with meld and got to the point where it will at least show me only the similarities. Say File a has 3000 lines and File b has 120 lines. Now I want to find the lines with at least n consecutive identical chars and remove these from File a.
I found this and accordingly tried to use diff like this:
diff --unchanged-line-format='%L' --old-line-format='' --new-line-format='' a.txt b.txt
This didn't do anything. I got no output whatsoever, so I guess it exited with 0 and didn't find anything.
How can I make this work? I have Linux and Windows available.
Answer
Given the format of the files, the most efficient implementation would be something like this:
- Load all strings from b into a [hashtable] or [HashSet[string]]
- Filter the contents of a by:
  - Extracting the substring from each line with String.Split(':') or similar
  - Checking whether it exists in the set from step 1
$FilterStrings = [System.Collections.Generic.HashSet[string]]::new(
    [string[]]@(
        Get-Content .\path\to\b
    )
)

Get-Content .\path\to\a |Where-Object {
    # Split the line into the prefix, middle, and suffix;
    # Discard the prefix and suffix
    $null,$searchString,$null = $_.Split(":", 3)

    if($FilterStrings.Contains($searchString)){
        # we found a match, write it to the new file
        $searchString |Add-Content .\path\to\matchedStrings.txt

        # make sure it isn't passed through
        $false
    }
    else {
        # substring wasn't found to be in `b`, let's pass it through
        $true
    }
} |Set-Content .\path\to\filteredStrings.txt
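Since Linux is also available, the same set-lookup idea can be sketched with awk. This is only a rough equivalent, not a drop-in replacement: the function name `split_matches` and its argument order are made up for illustration, it assumes both files use LF line endings, and it assumes the token is always the second colon-separated field, as in the sample line from File a. Unlike the PowerShell version, it writes the whole matching line to the matched file, as the question asks, rather than just the token.

```shell
#!/bin/sh
# Hypothetical helper: split_matches A_FILE B_FILE MATCHED_OUT FILTERED_OUT
# Lines of A_FILE whose second ':'-separated field appears as a whole line
# in B_FILE go to MATCHED_OUT; all other lines go to FILTERED_OUT.
split_matches() {
    # Create/truncate the matched file so it exists even with zero matches
    : > "$3"
    awk -F: -v matched="$3" '
        # First file operand (B_FILE): remember every line as a set key
        FILENAME == ARGV[1] { tokens[$0]; next }
        # Second file operand (A_FILE): field 2 is the token to look up
        ($2 in tokens) { print > matched; next }
        # No match: pass the line through to standard output
        { print }
    ' "$2" "$1" > "$4"
}
```

A call such as `split_matches a.txt b.txt matchedStrings.txt filteredStrings.txt` leaves a.txt untouched. Once the result has been checked, filteredStrings.txt can replace a.txt.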