I am new to Linux and new to scripting. I am working in a Linux environment using bash. I need to do the following things:
1. read a txt file line by line
2. delete the first line
3. remove the middle part of each line after the first
4. copy the changes to a new txt file
Each line after the first has three sections, the first always ends in .pdf and the third always begins with R0 but the middle section has no consistency.
Example of 2 lines in the file:
R01234567_High Transcript_01234567.pdf High School Transcript R01234567
R01891023_Application_01891023127.pdf Application R01891023
Here is what I have so far. I’m just reading the file, printing it to screen and copying it to another file.
#! /bin/bash
cd /usr/local/bin;
#echo "list of files:";
#ls;
for index in *.txt;
do
    echo "file: ${index}";
    echo "reading..."
    exec<${index}
    value=0
    while read line
    do
        #value='expr ${value} +1';
        echo ${line};
    done
    echo "read done for ${index}";
    cp ${index} /usr/local/bin/test2;
    echo "file ${index} moved to test2";
done
So my question is, how can I delete the middle bit of each line, after .pdf but before the R0…?
Answer
Updated answer assuming tab delim
Since there is a tab delimiter, this is a cinch for awk. Borrowing from my own deleted answer and @geek1011's deleted answer:
awk -F"\t" '{print $1, $NF}' infile.txt
Here awk splits each record in your file by tab, then prints the first field $1 and the last field $NF, where NF is the built-in awk variable for the record’s Number of Fields; by prepending a dollar sign, it says “The value of the last field in the record”.
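To cover the other two requirements from the question (dropping the first line and writing the result to a new file), the same idea can be extended as in the sketch below. The header text, the temporary directory, and the file names are hypothetical and only there to make the example self-contained; the two data records are the ones from the question, assumed to be tab-separated.

```shell
#!/bin/bash
# Hypothetical sample input: one header line, then tab-separated records.
tab_dir=$(mktemp -d)
printf 'filename\tdescription\tid\n' > "$tab_dir/infile.txt"
printf 'R01234567_High Transcript_01234567.pdf\tHigh School Transcript\tR01234567\n' >> "$tab_dir/infile.txt"
printf 'R01891023_Application_01891023127.pdf\tApplication\tR01891023\n' >> "$tab_dir/infile.txt"

# -F'\t'      split each record on tabs
# -v OFS='\t' join the printed fields with a tab as well
# NR>1        skip the first line of the file
awk -F'\t' -v OFS='\t' 'NR>1 {print $1, $NF}' "$tab_dir/infile.txt" > "$tab_dir/outfile.txt"

# Show the result, then clean up the temporary files.
tab_out=$(cat "$tab_dir/outfile.txt")
echo "$tab_out"
rm -rf "$tab_dir"
```

Setting OFS is optional; without it awk joins the two printed fields with a single space.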
Original answer assuming space delimiter
Leaving this here in case someone has space delimited nonsense like I originally assumed.
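You can use awk instead of using bash to read through the file. Note that firstRec has to be cleared for each record and the loop bounded by NF; otherwise fields from earlier lines leak into later ones, and a line with no “pdf” field loops forever:
awk 'NR>1{firstRec=""; for(i=1; i<NF && $i!~/pdf/; ++i) firstRec=firstRec" "$i} NR>1{print firstRec,$i,$NF}' yourfile.txt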
awk reads files line by line and processes each record it comes across. Fields are delimited automatically by white space. The first field is $1, the second is $2 and so on. awk has built-in variables; here we use NF, which is the Number of Fields contained in the record, and NR, which is the record number currently being processed.
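A quick way to see these variables in action is to feed awk a couple of made-up records. The input here is purely illustrative and not taken from the question:

```shell
# Two hypothetical records: the first has 3 fields, the second has 2.
# For each record, print its number, its field count, its first and last field.
nr_nf_out=$(printf 'a b c\nd e\n' | awk '{print NR, NF, $1, $NF}')
echo "$nr_nf_out"
# prints:
# 1 3 a c
# 2 2 d e
```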
This script does the following:
- If the record number is greater than 1 (not the header), then:
  - Loop through each field (separated by white space here) until we find a field that has “pdf” in it ($i!~/pdf/). Store everything we find up until that field in a variable called firstRec, separated by a space (firstRec=firstRec" "$i).
  - Print out firstRec, then print out whatever field we stopped iterating on (the one that contains “pdf”), which is $i, and finally print out the last field in the record, which is $NF (print firstRec,$i,$NF).
You can redirect the output to another file:
awk 'NR>1{firstRec=""; for(i=1; i<NF && $i!~/pdf/; ++i) firstRec=firstRec" "$i} NR>1{print firstRec,$i,$NF}' yourfile.txt > outfile.txt
sed may be a cleaner way to go here: with the awk approach, if the pdf file name contains more than one space in a row, the extra spaces are collapsed into one.
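One possible sed version is sketched below. It is not from the original answer. It assumes space-delimited lines where the first section ends in ".pdf" followed by a space, the last section starts with R0 and contains no spaces, and the middle section never itself contains ".pdf" followed by a space. The header text and temporary file are hypothetical, and the first file name has been given a double space on purpose to show that it survives.

```shell
#!/bin/bash
# Hypothetical sample file: a header line, then two space-delimited records.
sed_in=$(mktemp)
printf '%s\n' \
  'header line to discard' \
  'R01234567_High  Transcript_01234567.pdf High School Transcript R01234567' \
  'R01891023_Application_01891023127.pdf Application R01891023' > "$sed_in"

# 1d           delete the first line
# s/.../\1 \2/ keep everything up to ".pdf" (group 1) and the trailing
#              R0... token (group 2); the middle section is dropped
sed_out=$(sed -E '1d; s/^(.*\.pdf) .* (R0[^ ]*)$/\1 \2/' "$sed_in")
echo "$sed_out"
rm -f "$sed_in"
```

Because the file name is captured verbatim rather than rebuilt field by field, its internal spacing is left untouched. To write the result to a new file, redirect the sed command with > outfile.txt instead of capturing it in a variable.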