I extract all the links from a specific webpage using lynx.
lynx -dump http://www.example.com/videos | awk '/http/{print $2}' >> links.txt
It gives the following output:
http://www.example.com/home/
http://www.example.com/contact/
http://www.example.com/videos/
..
..
http://www.example.com/video/1001/The-title-of-video
http://www.example.com/video/1002/The-title-of-video
http://www.example.com/video/1003/The-title-of-video
http://www.example.com/video/1004/The-title-of-video
.. and so on
I want to do the following:
- Output only the links that contain /video/
- Remove the title from the end of each link, so that http://www.example.com/video/1001/The-title-of-video is output as http://www.example.com/video/1001/
Answer
Use grep to filter the output and sed to remove the title:
lynx -dump http://www.example.com/videos | grep /video/ | sed 's=/[^/]*$=='
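To try the filter without fetching a page, here is a minimal sketch that runs the same steps on sample input. It assumes the dump lists links as numbered reference lines (for example "   2. http://..."), which is why the awk step from the question is kept. The function name video_links and the sample lines are illustrative only. Note that this variant uses `s=[^/]*$==`, which keeps the trailing slash shown in the question, whereas `s=/[^/]*$==` removes that slash as well.

```shell
# Hypothetical helper (the name is illustrative): reads numbered reference
# lines such as "   2. http://..." on stdin and prints trimmed video links.
video_links() {
    awk '/http/{print $2}' |   # keep the URL field, drop the "N." index
    grep '/video/' |           # keep only links containing /video/
    sed 's=[^/]*$=='           # strip the title after the last slash, keep the slash
}

# Sample input standing in for the real lynx output (assumed format).
printf '%s\n' \
    '   1. http://www.example.com/home/' \
    '   2. http://www.example.com/video/1001/The-title-of-video' \
    '   3. http://www.example.com/video/1002/The-title-of-video' |
    video_links
```

With real data, the printf command would be replaced by `lynx -dump http://www.example.com/videos`, and the result can be appended to links.txt as in the question.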