How can I download a copy of a website on Linux?
I have tried using
wget --recursive --level=inf https://example.com
but it also downloaded links from other domains.
Also, is there a way to download a copy of the website after its JavaScript has run and produced output on the page? For example, a weather website might use JavaScript to look up the current temperature in a database and then render the result. How can I capture the temperature, i.e. the final output?
Answer
PhantomJS?
http://phantomjs.org/quick-start.html
I think this will do what you want.
The best thing to do is install from here:
Basically, you run it by writing a JavaScript script and passing it as a command-line argument, e.g.
phantomjs.exe someScript.js
There are loads of examples. For instance, you can render a website as an image:
phantomjs.exe github.js
where github.js looks like this:
var page = require('webpage').create();
page.open('http://github.com/', function() {
    page.render('github.png');
    phantom.exit();
});
This demo is at http://phantomjs.org/screen-capture.html
You can also show the webpage content as text.
For example, let’s take a simple webpage, demo_page.html:
<html>
<head>
<script>
function setParagraphText() {
    document.getElementById("1").innerHTML = "42 is the answer.";
}
</script>
</head>
<body onload="setParagraphText();">
<p id="1">Static content</p>
</body>
</html>
And then create a test script, test.js:
var page = require('webpage').create();
page.open("demo_page.html", function(status) {
    console.log("Status: " + status);
    if (status === "success") {
        // plainText is the page's text after its scripts have run
        console.log('Page text: ' + page.plainText);
        console.log('All done');
    }
    phantom.exit();
});
Then in the console write:
> phantomjs.exe test.js
Status: success
Page text: 42 is the answer.
All done
You can also inspect the page DOM and even update it:
var page = require('webpage').create();
page.open("demo_page.html", function(status) {
    console.log("Status: " + status);
    if (status === "success") {
        // The function passed to evaluate() runs inside the page itself
        page.evaluate(function() {
            document.getElementById("1").innerHTML = "I updated the value myself";
        });
        console.log('Page text: ' + page.plainText);
        console.log('All done');
    }
    phantom.exit();
});
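To get back to the original question (saving a copy of the page after its JavaScript has run), something along these lines might work. This is only a rough sketch: the script name save_rendered.js, the helper snapshotName and the WAIT_MS delay are names and values I made up for illustration, and a fixed delay is just a guess at how long the page's scripts need. It relies on page.content, which holds the current DOM as HTML, and on the fs module's write function.

```javascript
// save_rendered.js: a sketch, run as `phantomjs save_rendered.js <url>`.
// snapshotName and WAIT_MS are hypothetical, chosen for illustration.

var WAIT_MS = 2000; // assumed delay that gives asynchronous scripts time to finish

// Turn a URL into a safe file name for the saved HTML snapshot.
function snapshotName(url) {
    return url
        .replace(/^[a-z]+:\/\//i, '')       // drop the scheme (http://, https://)
        .replace(/[^a-z0-9.-]+/gi, '_')     // replace unsafe characters with "_"
        .replace(/^_+|_+$/g, '') + '.html'; // trim leading/trailing "_"
}

// Only touch the PhantomJS modules when actually running inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var system = require('system');
    var fs = require('fs');
    var page = require('webpage').create();
    var url = system.args[1] || 'http://example.com/';

    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Failed to load ' + url);
            phantom.exit(1);
            return;
        }
        // Wait before reading the DOM so late-running scripts can update it.
        window.setTimeout(function () {
            var file = snapshotName(url);
            fs.write(file, page.content, 'w'); // rendered DOM, not the raw source
            console.log('Saved ' + file);
            phantom.exit();
        }, WAIT_MS);
    });
}
```

For a page like the weather example, the saved file should then contain the temperature as rendered, provided the lookup finishes within the delay. Note that this saves a single page, not the whole site.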