
Bash: Read in file, edit line, output to new file

I am new to Linux and new to scripting. I am working in a Linux environment using bash. I need to do the following things:

  1. Read a txt file line by line
  2. Delete the first line
  3. Remove the middle part of each line after the first
  4. Copy the changes to a new txt file

Each line after the first has three sections: the first always ends in .pdf and the third always begins with R0, but the middle section has no consistent format.

Example of 2 lines in the file:

R01234567_High Transcript_01234567.pdf  High School Transcript  R01234567
R01891023_Application_01891023127.pdf   Application R01891023

Here is what I have so far. I’m just reading the file, printing it to screen and copying it to another file.

#! /bin/bash
cd /usr/local/bin;
#echo "list of files:";
#ls;
for index in *.txt;
do echo "file: ${index}";
echo "reading..."
exec<${index}
value=0
while read line
do
   #value='expr ${value} +1';
   echo ${line};
done
echo "read done for ${index}";
cp ${index} /usr/local/bin/test2;
echo "file ${index} moved to test2"; 
done 

So my question is, how can I delete the middle bit of each line, after .pdf but before the R0…?


Answer

Updated answer assuming tab delimiter

Since the fields are tab-delimited, this is a cinch for awk. Borrowing from my original, since-deleted answer and @geek1011's deleted answer:

awk -F"\t" '{print $1, $NF}' infile.txt

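Here awk splits each record in your file on tabs, then prints the first field, $1, and the last field, $NF. NF is awk's built-in variable for the record’s Number of Fields, so prefixing it with a dollar sign means “the value of the last field in the record”. The two fields are printed separated by a single space, which is awk's default output separator. To also drop the first line, as the question asks, put the condition NR>1 in front of the braces.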
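
As a minimal sketch of how the tab-delimited approach could cover the other steps in the question (skipping the first line and writing to a new file), something like the following may work. The function name strip_middle and the file names infile.txt and outfile.txt are placeholders, and the sketch assumes the sections really are separated by tab characters.

```shell
#!/bin/bash
# strip_middle is a hypothetical helper name, not something awk provides.
# Assumption: the input is tab-delimited and its first line should be dropped.
strip_middle() {
    # -F'\t'       : split each record on tab characters
    # -v OFS='\t'  : put a tab between the fields that are printed
    # NR > 1       : skip the first line of the file
    awk -F'\t' -v OFS='\t' 'NR > 1 { print $1, $NF }' "$1"
}

# Usage (placeholder file names):
#   strip_middle infile.txt > outfile.txt
```

Setting OFS keeps the output tab-delimited; leave it out if a single space between the two sections is preferred.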


Original answer assuming space delimiter

Leaving this here in case someone has space delimited nonsense like I originally assumed.

You can use awk instead of using bash to read through the file:

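awk 'NR>1{firstRec=""; for(i=1; i<=NF && $i!~/pdf/; ++i) firstRec=firstRec" "$i} NR>1{print firstRec,$i,$NF}' yourfile.txt

Note that firstRec is cleared at the start of each record and the loop stops at the last field (i<=NF), so text from one line does not carry over into the next and a line with no “pdf” field cannot loop forever.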

awk reads files line by line and processes each record it comes across. Fields are delimited automatically by white space. The first field is $1, the second is $2 and so on. awk has built in variables; here we use NF which is the Number of Fields contained in the record, and NR which is the record number currently being processed.
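
To see these built-in variables in action, a small sketch like the one below prints the record number, the field count, and the first and last fields of each line. The function name show_fields is made up for this illustration.

```shell
#!/bin/bash
# show_fields is a hypothetical name used only for this illustration.
show_fields() {
    # NR  = number of the record (line) currently being processed
    # NF  = number of whitespace-separated fields in that record
    # $1  = first field, $NF = last field
    awk '{ print NR, NF, $1, $NF }' "$@"
}

# Example with the second sample line from the question, read from stdin:
#   printf '%s\n' 'R01891023_Application_01891023127.pdf   Application R01891023' | show_fields
```

With default whitespace splitting, that sample line has three fields, while the first sample line has six because its file name and middle section both contain spaces.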

This script does the following:

  1. If the record number is greater than 1 (not the header) then
  2. Loop through each field (separated by white space here) until we find a field that has “pdf” in it ($i!~/pdf/). Store everything we find up until that field in a variable called firstRec separated by a space (firstRec=firstRec" "$i).
  3. print out the firstRec, then print out whatever field we stopped iterating on (the one that contains “pdf”) which is $i, and finally print out the last field in the record, which is $NF (print firstRec,$i,$NF)
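
The steps above can also be written as a multi-line awk program, which may be easier to follow. This is a sketch, not the exact command from the answer: the names keep_first_and_last, name, and found are made up here, the accumulator is cleared for every record, and passing through lines that contain no "pdf" field unchanged is an assumption about what is wanted.

```shell
#!/bin/bash
# keep_first_and_last is a hypothetical helper name.
# Assumption: fields are separated by whitespace and line 1 is a header.
keep_first_and_last() {
    awk '
    NR > 1 {
        name = ""      # rebuilt for every record so nothing carries over
        found = 0
        for (i = 1; i <= NF; ++i) {
            # Join fields with one space, without a leading space.
            name = (name == "" ? $i : name " " $i)
            if ($i ~ /pdf/) { found = 1; break }
        }
        if (found)
            print name, $NF    # file name section plus the last field
        else
            print $0           # assumption: leave other lines untouched
    }' "$1"
}

# Usage (placeholder file names):
#   keep_first_and_last yourfile.txt > outfile.txt
```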

You can direct this to another file:

awk 'NR>1{firstRec=""; for(i=1; i<=NF && $i!~/pdf/; ++i) firstRec=firstRec" "$i} NR>1{print firstRec,$i,$NF}' yourfile.txt > outfile.txt

sed may be a cleaner option here: if a pdf file name contains runs of more than one space, the awk approach above rebuilds the name with single spaces, so the extra spaces are lost.
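
A minimal sed sketch along those lines is shown below. It assumes every data line contains ".pdf" followed by whitespace and ends with a token starting with R0, as in the sample lines; the function name sed_strip_middle is hypothetical. The -E flag selects extended regular expressions in both GNU and BSD sed.

```shell
#!/bin/bash
# sed_strip_middle is a hypothetical helper name.
# Assumption: each data line has ".pdf", some middle text, then a final R0... token.
sed_strip_middle() {
    # 1d : delete the first line
    # s  : keep everything up to ".pdf" and the final R0... token,
    #      replacing whatever lies between them with a single space
    sed -E -e '1d' \
        -e 's/(\.pdf)[[:space:]].*[[:space:]](R0[^[:space:]]*)[[:space:]]*$/\1 \2/' "$1"
}

# Usage (placeholder file names):
#   sed_strip_middle yourfile.txt > outfile.txt
```

Because sed edits the line as text rather than splitting it into fields, any spacing inside the file name section is left exactly as it was.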

User contributions licensed under: CC BY-SA