I have a file ‘movie.html’:
<html>
<head><title>Index of /Data/Movies/Hollywood/2016_2017/</title></head>
<body bgcolor="white">
<h1>Index of /Data/Movies/Hollywood/2016_2017/</h1><hr><pre><a href="../">../</a>
<a href="1%20Buck%20%282017%29/">1 Buck (2017)/</a> 25-Nov-2019 10:25 -
<a href="1%20Mile%20to%20You%20%282017%29/">1 Mile to You (2017)/</a> 25-Nov-2019 10:26 -
<a href="1%20Night%20%282016%29/">1 Night (2016)/</a> 25-Nov-2019 10:27 -
</pre><hr></body>
</html>
I want to extract the title and link of each entry, pipe-delimited, like this:
title | link
1 Buck (2017) | 1%20Buck%20%282017%29/
1 Mile to You (2017) | 1%20Mile%20to%20You%20%282017%29/
1 Night (2016) | 1%20Night%20%282016%29/
I tried this code:
awk -F'[><]' 'BEGIN{ print "title","link" } /%29/ {print $3,$2}' movie.html > output.txt
But the output isn’t what I expected. Please help me; I am still a beginner.
Answer
Parsing HTML with regular expressions is not advised for several reasons (see https://stackoverflow.com/a/1732454/12957340), but here is one potential solution:
awk -F'[<>/"]' 'BEGIN { print "title | link" } /%29/ { print $6 " | " $3 "/" }' movie.html
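To see why fields 6 and 3 are the ones to print, it can help to dump every field of one entry line together with its index. The following is only a sketch, assuming a POSIX shell and awk and that each entry sits on its own line in movie.html; `show_fields` is a hypothetical helper name, and the sample line is copied from the file above.

```shell
# Hypothetical helper: print each awk field of one line with its index,
# using the same field separator as the answer above.
show_fields() {
  printf '%s\n' "$1" |
    awk -F'[<>/"]' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }'
}

# Sample entry line copied from movie.html
line='<a href="1%20Buck%20%282017%29/">1 Buck (2017)/</a> 25-Nov-2019 10:25 -'
show_fields "$line"
```

Field 3 holds the encoded link without its trailing slash, because `/` is one of the separators, and field 6 holds the title. The empty fields in between come from adjacent separators such as `/">`.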