Skip to content
Advertisement

linux extract portion of the string that can be second most common pattern

I have several strings(or filenames in a directory) and i need to group them by second most common pattern, then i will iterate over them by each group and process them. in the example below i need 2 from ACCEPT and 2 from BASIC_REGIS, bascially from string beginning to one character after hyphen (-) and it could be any character and not just digit. The first most common pattern are ACCEPT and BASIC_REGIS. I am looking for second most common pattern using grep -Po (Perl and only-matching). AWK solution is working

INPUT

ACCEPT-zABC-0123
ACCEPT-zBAC-0231
ACCEPT-1ABC-0120
ACCEPT-1CBA-0321

BASIC_REGIS-2ABC-9043
BASIC_REGIS-2CBA-8132
BASIC_REGIS-PCCA-6532
BASIC_REGIS-PBBC-3023

OUTPUT

ACCEPT-z
ACCEPT-1

BASIC_REGIS-2
BASIC_REGIS-P

echo "ACCEPT-0ABC-0123"|grep -Po "K^A.*-"

Result : ACCEPT-0ABC-

but I need : ACCEPT-0

However awk solution is working

echo "ACCEPT-1ABC-0120"|awk '$0 ~ /^A/{print substr($0,1,index($0,"-")+1)}'

ACCEPT-1

Advertisement

Answer

You don’t need -P (PCRE) for that, just a plain, old BRE:

$ grep -o '^[^-]*-.' file | sort -u
ACCEPT-0
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-9

Or using GNU awk alone:

$ awk 'match($0,/^[^-]*-./,a) && !seen[a[0]]++{print a[0]}' file
ACCEPT-0
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-9

or any awk:

$ awk '!match($0,/^[^-]*-./){next} {$0=substr($0,1,RLENGTH)} !seen[$0]++' file
ACCEPT-0
ACCEPT-1
BASIC_REGIS-2
BASIC_REGIS-9
User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement