
Get multiple words after a specific word of HTML using Linux/Unix scripting

I have a file ‘movie.html’:

<html>
<head><title>Index of /Data/Movies/Hollywood/2016_2017/</title></head>
<body bgcolor="white">
<h1>Index of /Data/Movies/Hollywood/2016_2017/</h1><hr><pre><a href="../">../</a>
<a href="1%20Buck%20%282017%29/">1 Buck (2017)/</a>                                     25-Nov-2019 10:25       -
<a href="1%20Mile%20to%20You%20%282017%29/">1 Mile to You (2017)/</a>                              25-Nov-2019 10:26       -
<a href="1%20Night%20%282016%29/">1 Night (2016)/</a>                                    25-Nov-2019 10:27       -
</pre><hr></body>
</html>

I want to extract the title and link from each entry, pipe-delimited, like this:

title | link
1 Buck (2017) | 1%20Buck%20%282017%29/
1 Mile to You (2017) | 1%20Mile%20to%20You%20%282017%29/
1 Night (2016) | 1%20Night%20%282016%29/

I tried this code:

awk -F'[><]' 'BEGIN{ print "title","link" } /%29/ {print $3,$2}' movie.html > output.txt

but the output isn’t what I expected. Please help me; I am still a beginner.

Answer

Parsing HTML with regex is not advised for several reasons (see https://stackoverflow.com/a/1732454/12957340), but here is one potential solution:

awk -F'[<>/"]' 'BEGIN { print "title | link" } /\(.*\)/ { print $6 " | " $3 "/" }' movie.html
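A variant that splits on double quotes only may be easier to follow, since the href value (including its trailing slash) then arrives intact as the second field. The sketch below assumes the listing has one `<a href="...">` per line, as in the sample movie.html; the function name `extract_links` is hypothetical and not part of the original answer.

```shell
#!/bin/sh
# Hypothetical helper: print "title | link" for each entry of an
# autoindex-style listing. Assumes one <a href="...">...</a> per line.
extract_links() {
    awk -F'"' '
        BEGIN { print "title | link" }
        # Only lines that contain an anchor; skip the parent-directory link.
        /<a href="/ && $2 != "../" {
            link  = $2                     # text between the first pair of quotes
            title = $3                     # begins with ">" and runs past "</a>"
            sub(/^>/, "", title)           # drop the ">" closing the opening tag
            sub(/\/?<\/a>.*/, "", title)   # drop the optional "/" and all from "</a>" on
            print title " | " link
        }
    ' "$1"
}
```

Usage, assuming the file from the question is in the current directory: `extract_links movie.html > output.txt`. Like the one-liner, this is still pattern matching on HTML and will break if the markup changes shape.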
User contributions licensed under: CC BY-SA