Downloading or cloning a full website in OS X and Linux with wget makes the site fully static, so you can deliver it from any CDN such as Rackspace Cloud Files. The method needs wget, and on Mac OS X 10.8.x you need to fix the command line first. We will show this guide on OS X 10.8.x with iTerm2 and OhMyZSH as the shell, not the default Bash. Windows users can use wget too, although it is better to move to Ubuntu, OpenSUSE, Debian or CentOS.
It is not unusual to pick up a virus or malware while copying a site. To deter copying, some site owners keep Windows malware in certain folders on their servers and block those folders with robots.txt. Downloading or cloning a full website in OS X and Linux with wget is fully legal unless you cause a DDoS, and that is only possible if you are a UNIX expert with a few hundred servers.
Downloading or Cloning a Full Website in OS X and Linux with wget : Purposes
But what good is served by downloading or cloning a full website in OS X and Linux with wget? Here are the reasons:
- You want a backup in HTML format of PHP and MySQL based web software such as WordPress. It will work fine as a temporary working copy in case you get hacked.
- You want to make your website static because you use it less or post less. Using cache plugins in WordPress wastes compute cycles.
- Also, there is no good CDN plugin for any PHP and MySQL platform, unlike Ruby or Python based platforms.
- There is basically no point in keeping five-year-old posts dynamic. You will hardly need to execute more than a few PHP loops, such as those for comments or the sidebar (recent posts), and these can be added in batch. This significantly decreases the load on MySQL.
- Nothing compares with the speed of a site delivered from a CDN like Rackspace Cloud Files. You have to follow the way we described before for serving an HTML website from a CDN. You will need .htaccess redirection to show the proper URL. You can use WP Super Cache, look at the .htaccess rules it writes, and simply modify them. However, Google's non-asynchronous JavaScript code, including AdSense, might not load. The reason is complex, but it is Google's ad delivery server that is responsible. In that case, serve the site normally from your own server (uploaded over FTP). You can still recursively change the URLs of the static components.
- You want to get inspired (that is, basically copy) from someone's website design. Keep in mind that you must have enough grasp of HTML, CSS and Photoshop to avoid a DMCA notice. A cheat is to give the HTML site to a WordPress theme designer and ask for it to be converted into a WordPress theme. That is reverse engineering, and it is more widely practiced than you might think. But never publish any of the copied text, or Google will blacklist you. Also, do not use your real IP for any of this. It is a completely different niche that needs time and expertise; unless you are a guru, never try it.
Downloading or Cloning a Full Website in OS X and Linux with wget
To keep things organized, open your user (home) folder in Finder and create a subfolder.
Open iTerm2 and change directory to that folder (copy is the name in our example).
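Both steps can also be done entirely from the terminal. A minimal sketch, assuming the folder is called copy and sits directly inside your home folder:

```shell
# Create the working folder in the home directory (no error if it already exists).
mkdir -p "$HOME/copy"

# Move into it; everything wget downloads will land here.
cd "$HOME/copy"

# Print the current directory to confirm where we are.
pwd
```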
Downloading or Cloning a Full Website in OS X and Linux with wget : The Commands
If the website you want to rip / clone / copy is http://example.com/, run this command:
wget --mirror -w 2 http://example.com/
A subfolder named example.com will be created inside the folder copy. Copying the whole site will take a long time. If the domain name differs from the one where you will use the copy, check the URLs in the HTML files manually. An important fact: cURL cannot be used in place of wget here, because it does not download recursively.
Variations of the command:
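Checking the URLs does not have to be done entirely by hand. The sketch below is only an illustration: it builds a tiny sample folder so that it can run on its own, and the new domain static.example.org is a made-up placeholder. On a real mirror you would skip the sample page and point MIRROR_DIR at the example.com folder that wget created.

```shell
# Hypothetical values: adjust all three for your own mirror.
OLD_URL="http://example.com"
NEW_URL="http://static.example.org"   # placeholder for the domain that will serve the copy
MIRROR_DIR="$(mktemp -d "${TMPDIR:-/tmp}/mirror-demo.XXXXXX")"   # real mirror: MIRROR_DIR="example.com"

# Sample page standing in for a mirrored file (only needed for this demo).
printf '<a href="http://example.com/about/">About</a>\n' > "$MIRROR_DIR/index.html"

# 1. List every file that still mentions the old domain.
grep -rl "$OLD_URL" "$MIRROR_DIR"

# 2. Rewrite the old domain in place, keeping a .bak copy of each file.
#    The -i.bak form works with both the BSD sed on OS X and GNU sed on Linux.
grep -rl "$OLD_URL" "$MIRROR_DIR" | while IFS= read -r file; do
  sed -i.bak "s|$OLD_URL|$NEW_URL|g" "$file"
done

# 3. Show the rewritten page.
cat "$MIRROR_DIR/index.html"
```

The .bak files make it easy to undo the change; delete them once the copy looks right.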
wget -r http://example.com/
-r is for recursive.
wget --mirror -w 2 -p --html-extension --convert-links -P folder-name http://example.com
folder-name is the path (-P) of the folder in which the copy is saved. Other variations you will find here :
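For reference, the longer command can be wrapped in a small shell function with each option commented. The function name mirror_site and its two arguments are made up for this sketch; the options themselves are the ones used in the command above. Nothing is downloaded until the function is called.

```shell
# mirror_site URL [FOLDER]  (hypothetical helper, not part of wget)
mirror_site() {
  url="$1"
  dest="${2:-folder-name}"

  # --mirror          recursion with unlimited depth plus timestamping
  # -w 2              wait 2 seconds between requests
  # -p                also fetch the images, CSS and other files a page needs
  # --html-extension  save pages with an .html suffix
  #                   (newer wget releases call this --adjust-extension)
  # --convert-links   rewrite links so the copy can be browsed locally
  # -P "$dest"        directory prefix: save everything under this folder
  wget --mirror -w 2 -p --html-extension --convert-links -P "$dest" "$url"
}
```

Calling mirror_site http://example.com copy-of-example should then save the site under copy-of-example/example.com.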
From a practical point of view, most of them are of little use. If robots.txt blocks the copying, you have to force it by creating a .wgetrc file in the root directory or your home directory (depending on how you have set up ZSH or Bash) and writing the line robots = off inside it. You can mimic a browser by adding a few extra things to the command. However, we do not recommend it, as it opens the cache and cookies of your browser. If you want to hide your IP, it is better to use a temporary server and configure it.
Never run this against a medium-sized or larger professional blog, because they will notice that some IP is misbehaving. All of them use fully monitored, managed servers 24×7, and within a few minutes you will be tricked into downloading a few GB of files. Never do it with Google's webpages.
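The .wgetrc step can be done from the terminal as well. A minimal sketch, assuming the file belongs in your home folder:

```shell
# Add "robots = off" to the per-user wget settings, unless the line is already there.
WGETRC_FILE="$HOME/.wgetrc"
grep -qx 'robots = off' "$WGETRC_FILE" 2>/dev/null || printf 'robots = off\n' >> "$WGETRC_FILE"

# Show the file to confirm.
cat "$WGETRC_FILE"
```

For a one-off run, wget also accepts the same setting on the command line as -e robots=off, which leaves .wgetrc untouched.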