We are migrating web servers, and it would be nice to have an automated way to check some of the basic site structure to see if the rendered pages are the same on the new server as the old server. I was just wondering if anyone knew of anything to assist in this task?
Answer
Get the formatted output of both sites (here we use w3m, but lynx can also work):
w3m -dump http://google.com 2>/dev/null > /tmp/1.html
w3m -dump http://google.de 2>/dev/null > /tmp/2.html
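For a server migration you will probably want to dump more than one page. Here is a rough sketch of how the step above could be looped over a list of paths. The function names (dump_name, dump_sites), the path list file, and the output directory layout are all made up for illustration; only the w3m -dump call comes from the answer itself.

```shell
# Turn a URL path such as /about/team into a flat file name (about_team.txt).
# Hypothetical helper; an empty path (the site root) becomes index.txt.
dump_name() {
    name=$(printf '%s' "$1" | sed 's|^/*||; s|/*$||; s|[^A-Za-z0-9._-]|_|g')
    printf '%s.txt\n' "${name:-index}"
}

# Dump every path listed in a file (one path per line) from both servers.
# Arguments: old base URL, new base URL, path list file, output directory.
dump_sites() {
    mkdir -p "$4/old" "$4/new"
    while IFS= read -r path; do
        [ -n "$path" ] || continue    # skip blank lines
        # </dev/null keeps w3m from consuming the path list on stdin
        w3m -dump "$1$path" </dev/null 2>/dev/null > "$4/old/$(dump_name "$path")"
        w3m -dump "$2$path" </dev/null 2>/dev/null > "$4/new/$(dump_name "$path")"
    done < "$3"
}
```

You would call it with something like dump_sites http://old.example.com http://new.example.com paths.txt /tmp/sitecheck, where the host names and paths.txt are placeholders for your own servers and page list.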
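Then use wdiff, which can tell you what percentage of the words the two texts have in common. The options used here are -n (do not extend the change markers across line breaks), -i (ignore case), and -s (print the statistics).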
wdiff -nis /tmp/1.html /tmp/2.html
It can also be easier to see the differences using colordiff:
wdiff -nis /tmp/1.html /tmp/2.html | colordiff
Excerpt of output:
Web Images Vidéos Maps [-Actualités-] Livres {+Traduction+} Gmail plus » [-iGoogle |-] Paramètres | Connexion Google [hp1] [hp2] [hp3] [-Français-] {+Deutschland+} [ ] Recherche avancéeOutils [Recherche Google][J'ai de la chance] linguistiques
/tmp/1.html: 43 words 39 90% common 3 6% deleted 1 2% changed
/tmp/2.html: 49 words 39 79% common 9 18% inserted 1 2% changed
(Google actually served google.com in French here… funny.)
The "common" percentages tell you how similar the two texts are. You can also see the differences word by word, rather than line by line, which can get cluttered.
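If you want a script to act on those percentages instead of reading them by eye, something like the following might work. It is only a sketch: common_pct and similarity are names invented here, and it assumes the statistics lines look like the ones in the excerpt (English wording, hence LC_ALL=C). The -1, -2 and -3 options are used to suppress the word-level diff so that mostly the statistics are left to parse.

```shell
# Extract the "common" percentage from each wdiff statistics line on stdin.
# Lines that do not look like statistics are silently dropped.
common_pct() {
    sed -n 's/.* \([0-9][0-9]*\)% common.*/\1/p'
}

# Print the lower of the two "common" percentages for a pair of dumps.
# -s prints statistics; -1 -2 -3 suppress deleted, inserted and common words.
# LC_ALL=C keeps the statistics in English so the pattern above matches.
similarity() {
    LC_ALL=C wdiff -s123 "$1" "$2" | common_pct | sort -n | head -n 1
}
```

For example, [ "$(similarity /tmp/1.html /tmp/2.html)" -ge 90 ] || echo "pages differ" would flag a pair of dumps that share less than 90% of their words. The 90 is an arbitrary threshold to tune for your own site.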