Get multiple words after a specific word in HTML using Linux/Unix scripting

I have a file 'movie.html':

<html>
<head><title>Index of /Data/Movies/Hollywood/2016_2017/</title></head>
<body bgcolor="white">
<h1>Index of /Data/Movies/Hollywood/2016_2017/</h1><hr><pre><a href="../">../</a>
<a href="1%20Buck%20%282017%29/">1 Buck (2017)/</a>                                     25-Nov-2019 10:25       -
<a href="1%20Mile%20to%20You%20%282017%29/">1 Mile to You (2017)/</a>                              25-Nov-2019 10:26       -
<a href="1%20Night%20%282016%29/">1 Night (2016)/</a>                                    25-Nov-2019 10:27       -
</pre><hr></body>
</html>

I want to extract the title and link of each entry as pipe-delimited output, like this:

title | link
1 Buck (2017) | 1%20Buck%20%282017%29/
1 Mile to You (2017) | 1%20Mile%20to%20You%20%282017%29/
1 Night (2016) | 1%20Night%20%282016%29/

I tried this code:

awk -F'[><]' 'BEGIN{ print "title","link" } /%29/ {print $3,$2}' movie.html > output.txt

But the output does not match what I expect. Please help; I am still a beginner.

Answer

Parsing HTML with regular expressions is not advised for several reasons (see https://stackoverflow.com/a/1732454/12957340), but here is one potential solution:

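awk -F'[<>/"]' 'BEGIN { print "title | link" } /\(.*\)/ { print $6 " | " $3 "/" }' movie.html

The field separator splits each line on <, >, / and ", so $3 is the href value without its trailing slash (added back in the print) and $6 is the link text. The pattern /\(.*\)/ restricts the output to lines that contain a parenthesised year; without the backslashes, /(.*)/ would match every line.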
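If a title could itself contain a slash or a double quote, splitting on those characters would shift the field numbers. As a rough alternative sketch, assuming every entry sits on its own line in the form <a href="LINK">TITLE/</a> as in the sample file, sed can capture the two parts directly. The function name extract_links is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# extract_links (hypothetical helper): print "title | link" rows from an
# nginx-style directory index. Assumes one <a href="...">...</a> per line.
extract_links() {
    printf 'title | link\n'
    # First expression: drop any line holding the parent-directory link.
    # Second expression: \1 is the href value, \2 is the link text without
    # its trailing slash; only lines where the substitution succeeds print.
    sed -n \
        -e '/<a href="\.\.\/">/d' \
        -e 's#.*<a href="\([^"]*\)">\(.*\)/</a>.*#\2 | \1#p' "$@"
}
```

It can be called as: extract_links movie.html > output.txt. This is still regex-based, so the caveat about parsing HTML this way applies here too.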
