I have an RSS feed like this:
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>my feed</title>
  <link rel="self" href="http://myhomesite.com/articles/feed/"/>
  <updated>2019-11-04T12:45:00Z</updated>
  <id>http://myhomesite.com/articles/feed/?dt=2019-11-04T12:45:00Z</id>
  <entry>
    <id>id0</id>
    <link rel="alternate" type="text/html" href="https://yandex.ru/link123"/>
    <author>
      <name/>
    </author>
    <published>2019-11-04T12:45:00Z</published>
    <updated>2019-11-04T12:45:00Z</updated>
    <title type="html"><![CDATA[foo bar foo bar]]></title>
    <content type="html"><![CDATA[]]></content>
  </entry>
  <entry>
    <id>id2</id>
    <link rel="alternate" type="text/html" href="https://myhomesite.com"/>
    <author>
      <name/>
    </author>
    <published>2019-11-04T09:45:00Z</published>
    <updated>2019-11-04T09:45:00Z</updated>
    <title type="html"><![CDATA[foo bar foo bar]]></title>
    <content type="html"><![CDATA[]]></content>
  </entry>
  ....
I want to remove all nodes (/feed/entry) where the link href != http://myhomesite.com.
How do I remove an XML node whose value starts with specified characters using Bash?
Answer
Bash features by themselves are not very well suited to parsing XML.
This renowned Bash FAQ states the following:
Do not attempt [to extract data from an XML file] with sed, awk, grep, and so on (it leads to undesired results).
Consider utilizing an XML-specific command-line tool, such as XMLStarlet. See download info here if you don’t already have XMLStarlet installed.
Solution:
Using XMLStarlet you can run the following command to output the desired results to your terminal:
xml ed -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' /path/to/file.rss
Note: The /path/to/file.rss part at the end of the command shown above should be substituted with the real pathname to the actual .rss file.
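To try the command above without touching a real feed, the following is a minimal sketch that runs it against a throwaway copy of the sample feed, trimmed to the elements that matter here. The temporary file and the variable names are illustrative assumptions, not part of the original answer.

```shell
#!/usr/bin/env bash
# Sketch: run the delete command on a throwaway copy of the sample feed.
# The temp file and variable names are illustrative, not from the answer.

feed=$(mktemp)
trap 'rm -f "$feed"' EXIT   # remove the temp file when the script ends

cat > "$feed" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>my feed</title>
  <entry>
    <id>id0</id>
    <link rel="alternate" type="text/html" href="https://yandex.ru/link123"/>
  </entry>
  <entry>
    <id>id2</id>
    <link rel="alternate" type="text/html" href="https://myhomesite.com"/>
  </entry>
</feed>
EOF

# Use whichever executable name is installed (xmlstarlet or xml).
xml_bin=$(command -v xmlstarlet || command -v xml || true)

if [ -n "$xml_bin" ]; then
  # Prints the edited document to stdout; "$feed" itself is not modified.
  "$xml_bin" ed -N x="http://www.w3.org/2005/Atom" \
    -d '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' \
    "$feed"
else
  echo "XMLStarlet not found; install it to run this sketch" >&2
fi
```

The printed document should keep only the entry whose link href is exactly https://myhomesite.com, while the file on disk still holds both entries.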
Explanation:
The parts of the aforementioned command break down as follows:
xml – invoke the XMLStarlet command.
ed – edit/update the XML document.
-N x="http://www.w3.org/2005/Atom" – the -N option binds the namespace, i.e. http://www.w3.org/2005/Atom, to a prefix that we’ve arbitrarily named x.
-d – delete the node(s) that are matched.
'//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' – the XPath expression used to find/match the appropriate nodes as specified in your question, i.e. "all nodes (/feed/entry) where link href != http://myhomesite.com". As you can see, in the XPath expression we prepend the x prefix to the element node names, i.e. x:entry and x:link, to ensure we address the elements in the correct namespace.
/path/to/file.rss – a pathname to the source .rss file.
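Before deleting anything, it can help to see which entries the XPath expression matches. The sketch below uses XMLStarlet's sel (select) command with the same namespace binding and expression. The helper name preview_matches is hypothetical, and printing the id of each entry is just one convenient choice.

```shell
# Sketch: list the <id> of every entry the XPath expression matches, i.e. the
# entries that "ed -d" would delete. preview_matches is a hypothetical helper.
preview_matches() {
  local src=$1 xml_bin
  [ -r "$src" ] || { echo "cannot read $src" >&2; return 1; }
  # Use whichever executable name is installed (xmlstarlet or xml).
  xml_bin=$(command -v xmlstarlet || command -v xml) || {
    echo "XMLStarlet not found" >&2; return 1; }
  # -t starts a template, -m iterates over matches, -v prints a value,
  # -n prints a newline after each one.
  "$xml_bin" sel -N x="http://www.w3.org/2005/Atom" -t \
    -m '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' \
    -v 'x:id' -n "$src"
}

# Usage: preview_matches /path/to/file.rss
```

For the sample feed in the question this should list id0, the entry whose link points to yandex.ru.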
Saving the resultant XML (RSS)
To save the resultant XML you can either:
Add the --inplace option to the aforementioned command – this will overwrite the original .rss file with the desired result. For instance:
xml ed --inplace -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' /path/to/file.rss
Or, utilize the redirection operator (>) and specify a pathname to the location at which to save the output. For instance, the following command will save the results to a new file:
xml ed -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' /path/to/file.rss > /path/to/results.rss
Note: The /path/to/results.rss at the end of the aforementioned command should be substituted with a real pathname to where you want to save the new file.
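A more cautious variant of the redirection approach is to write to a temporary file first and only replace the destination once XMLStarlet has succeeded and the result is well-formed. The following is a sketch under those assumptions. The helper name save_filtered_feed is hypothetical, and the well-formedness check uses XMLStarlet's val command.

```shell
# Sketch: write the filtered feed to a temporary file first, and only move it
# over the destination if XMLStarlet succeeded and the result is well-formed.
# save_filtered_feed is a hypothetical helper name, not part of XMLStarlet.
save_filtered_feed() {
  local src=$1 dst=$2 xml_bin tmp
  [ -r "$src" ] || { echo "cannot read $src" >&2; return 1; }
  # Use whichever executable name is installed (xmlstarlet or xml).
  xml_bin=$(command -v xmlstarlet || command -v xml) || {
    echo "XMLStarlet not found" >&2; return 1; }
  tmp=$(mktemp) || return 1
  if "$xml_bin" ed -N x="http://www.w3.org/2005/Atom" \
       -d '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' \
       "$src" > "$tmp" \
     && "$xml_bin" val --well-formed --quiet "$tmp"; then
    mv -- "$tmp" "$dst"    # replace the destination only on success
  else
    rm -f -- "$tmp"        # leave the destination untouched on failure
    return 1
  fi
}

# Usage: save_filtered_feed /path/to/file.rss /path/to/results.rss
```

Note that the file created by mktemp carries restrictive permissions, so you may want to adjust them on the destination afterwards.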
XPath with local-name():
Given that your example source XML (RSS) does not include any prefixed element names, it’s also possible to utilize XPath’s local-name() function. This will negate the need to bind the namespace using XMLStarlet’s -N option. For example:
xml ed -d '//*[local-name() = "entry" and not(child::*[local-name() = "link"][@href="https://myhomesite.com"])]' /path/to/file.rss
IMPORTANT: You may need to substitute the leading xml part in all the example commands shown in this post with xmlstarlet instead. For example:
xmlstarlet ed -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[@href="https://myhomesite.com"])]' /path/to/file.rss
^^^^^^^^^^
Edit:
Given your example XML it’s also possible to utilize a simplified syntax for the default namespace, which is to use _: instead of x:. By using an underscore (_) you don’t need to utilize the -N option to bind the namespace to a prefix. Refer to the section titled 1.3. A More Convenient Solution in the XMLStarlet documentation for further information regarding this feature.
For instance:
xml ed -d '//_:entry[not(child::_:link[@href="https://myhomesite.com"])]' /path/to/file.rss
To further understand using XMLStarlet when your source XML uses namespaces I suggest also reading Namespaces and default namespace in the documentation.
Edit 2:
The author of the OP subsequently wrote the following in the comments:
One question more. Condition [not(child::_:link[@href="myhomesite.com"])] is strict. I wanna be something like start with myhomesite.com but URI not important i.e. myhomesite.com**anything**. It’s possible? [sic]
something like this..
xmlstarlet ed -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[matches(@href, '^https://myhomesite.com/' )]/@href)]' feed.rs
Consider utilizing XPath’s starts-with() function with any one of the previously given examples. The matches() function in the attempt above belongs to XPath 2.0, whereas XMLStarlet is built on libxml2, which implements XPath 1.0 only. The single quotes nested inside the single-quoted shell argument would also end the quoting early. For example:
Using the -N option and starts-with():
xml ed -N x="http://www.w3.org/2005/Atom" -d '//x:entry[not(child::x:link[starts-with(@href, "https://myhomesite.com")])]' file.rss
Using local-name() and starts-with():
xml ed -d '//*[local-name() = "entry" and not(child::*[local-name() = "link"][starts-with(@href, "https://myhomesite.com")])]' file.rss
Using the simplified syntax for the default namespace, i.e. an underscore, and starts-with():
xml ed -d '//_:entry[not(child::_:link[starts-with(@href, "https://myhomesite.com")])]' file.rss
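One caveat with a bare prefix test: starts-with(@href, "https://myhomesite.com") is also true for a host such as https://myhomesite.com.example.org. If that matters, a stricter variant is to accept either the exact base URL or anything below the base URL followed by a slash. The sketch below builds such an expression from a shell variable. The helper name entry_filter_xpath is hypothetical, and the stricter matching rule is an assumption about what is wanted, not something stated in the question.

```shell
# Sketch: build the delete expression from a base URL held in a shell variable.
# entry_filter_xpath is a hypothetical helper; the x prefix must be bound with
# -N x="http://www.w3.org/2005/Atom" when the expression is used.
entry_filter_xpath() {
  local base=${1%/}   # drop one trailing slash, if present
  # XPath 1.0 string literals have no escape syntax, so refuse embedded quotes.
  case $base in
    *\"*) echo "base URL must not contain a double quote" >&2; return 1 ;;
  esac
  # Match entries that have no link equal to the base URL or below "base/".
  printf '//x:entry[not(x:link[@href="%s" or starts-with(@href, "%s/")])]\n' \
    "$base" "$base"
}

# Usage (prints the result to the terminal, like the commands above):
# xml ed -N x="http://www.w3.org/2005/Atom" \
#   -d "$(entry_filter_xpath 'https://myhomesite.com')" file.rss
```

Building the expression in a function keeps the shell quoting in one place, which avoids the nested-quote problem from the attempt in the comment.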