I want to remove the newline before the </script> in my HTML file with a Linux command (sed, awk, …).
Sample input:
<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title>JavaScript Ders 2</title> <script type="text/javascript" src="script1.js" language="javascript"> </script> <script type="text/javascript" src="script2.js" language="javascript"> </script> <script> // script kodumuz buraya yazılacak </script> </head> <body> <script type="text/javascript" src="script3.js" language="javascript"> </script> </body> </html>
Sample output:
<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title>JavaScript Ders 2</title> <script type="text/javascript" src="script1.js" language="javascript"> </script> <script type="text/javascript" src="script2.js" language="javascript"> </script> <script> // script kodumuz buraya yazılacak</script> </head> <body> <script type="text/javascript" src="script3.js" language="javascript"> </script> </body> </html>
I tried several different commands, but none of them worked.
Answer
First of all, as mentioned in the comments: don't parse XML with regex! Never do it, never think about it. Make it a habit not to think about it! Sometimes it might look like a simple task that can be performed with sed, awk, or any other regex tool, but no …
What you can do, on the other hand, if you really want to use sed or awk, is process the file with xmlstarlet first and convert it into the PYX format.
The PYX format is a line-oriented representation of XML documents that is derived from the SGML ESIS format. (see ESIS – ISO 8879 Element Structure Information Set spec, ISO/IEC JTC1/SC18/WG8 N931 (ESIS))
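To make "line-oriented" concrete, here is a small sketch of how PYX lines are read. The PYX fragment is hand-written for illustration (it is not captured xmlstarlet output), and the `pyx` and `labels` variable names are made up for this example. The first character of each line gives its type: `(` opens an element, `A` is an attribute, `-` is text, and `)` closes an element.

```shell
# Hand-written PYX for: <title lang="tr">Ders 2</title>
# (illustrative only; not real xmlstarlet output)
pyx=$(printf '%s\n' '(title' 'Alang tr' '-Ders 2' ')title')

# Label each line according to its first character.
labels=$(printf '%s\n' "$pyx" | awk '
  /^\(/ { print "start-tag: " substr($0, 2); next }
  /^A/  { print "attribute: " substr($0, 2); next }
  /^-/  { print "text: "      substr($0, 2); next }
  /^\)/ { print "end-tag: "   substr($0, 2); next }
')

printf '%s\n' "$labels"
```

Because every node sits on its own line, line-based tools such as sed and awk can work on the document without having to understand nested tags.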
So what you really want to do is something like:
$ xmlstarlet pyx file.html | do_your_magic_here | xmlstarlet depyx > file.new.html
In your case this would be something like:
$ xmlstarlet pyx file.html | awk 'c~/^- *\\n *$/&&/^\)script$/{c=$0;next}{print c; c=$0}END{print c}' | xmlstarlet depyx
This will output:
<html> <head> <meta content="text/html; charset=utf-8" http-equiv="Content-Type"></meta> <title>JavaScript Ders 2</title> <script language="javascript" src="script1.js" type="text/javascript"></script> <script language="javascript" src="script2.js" type="text/javascript"></script> <script> // script kodumuz buraya yazılacak </script> </head> <body> <script language="javascript" src="script3.js" type="text/javascript"></script> </body> </html>
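To see what the awk filter above does without installing xmlstarlet, here is a minimal sketch that runs a variant of it on a hand-written PYX fragment. It rests on some assumptions: the fragment is my guess at what `xmlstarlet pyx` emits for a `<script src="script1.js">` element whose only content is a newline; the regex uses `\\n` so that it matches the literal backslash-n that PYX uses to encode a newline; and the `NR > 1` guard is an addition to avoid printing an empty first line. The names `pyx_input` and `result` are made up for this example.

```shell
# Hand-written PYX fragment (assumed shape, not captured xmlstarlet output).
# printf '%s\n' prints each argument literally, so '-\n' stays "-", "\", "n".
pyx_input=$(printf '%s\n' '(script' 'Asrc script1.js' '-\n' ')script')

result=$(printf '%s\n' "$pyx_input" | awk '
  # c holds the previous line. If it is a text node that contains only an
  # encoded newline (optionally padded with spaces) and the current line
  # closes a script element, drop the text node by overwriting it.
  c ~ /^- *\\n *$/ && /^\)script$/ { c = $0; next }
  NR > 1 { print c }   # emit the previous line (nothing to emit on line 1)
  { c = $0 }           # remember the current line for the next round
  END { print c }      # flush the last remembered line
')

printf '%s\n' "$result"
```

The whitespace-only text line is gone, so depyx would write `<script src="script1.js"></script>`. A text node that contains anything else, such as the comment in the inline script, does not match the pattern and is left as it is.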