Concept One These days in this crazy world it makes sense to archive locally any critical online data. That way when the Internet is not working (for whatever reason), you still have access to your important information and data. For those who are listening and interested in being prepared, here is the quickest, easiest way that I have been able to find to archive complete offline copies of websites to your Mac (or iOS device). It makes use of a free app, and requires about 60 seconds to get started. From there, backing up web pages can be done in the background, while you work on other projects. Multitasking, baby.
Here is the full tutorial, don’t blink..
Download the free SiteSucker app from Rick’s Apps. Then install the app and configure settings as desired. Enter a URL in the box and click the “Download” button to make it go. It’s that easy to get started. Then once you see how it works, there are some important things that you should keep in mind..
For anyone new to all of this..
Just because anyone can rip an entire website with almost no effort, doesn’t mean they should. Consecutively and rapidly downloading thousands of files can put strain on the website server. So if you go crazy and start sucking down hundreds of thousands of pages, n-levels deep, and just let the app download crazy in the background for hours and days on end, you probably will end up dealing with very angry people, a blocked IP address, and possibly worse.
Instead of risky behavior, play it cool and take advantage of the recommended settings explained later in the article. Also, it is important to download only what you need, and be responsible while downloading. Don’t just start going around hammering on people’s servers. Be smart, be kind.
Understand that this particular program will literally download the entire Internet if you let it. Well maybe not that extreme, but it will download waay more than any typical hard drive can hold. So you want to be smart with how you configure the app settings. If ever in doubt, rolling with the default settings and not going crazy is a good place to start.
Then once you’re more familiar with how it works, you can go in and tweak the settings however you like. To give you an idea of what to look for, check out the following recommended settings for serious download adventures.
- Suppress Login Dialog — enabled
- Ignore Robot Exclusions — disabled
- Always Download HTML and CSS — enabled
- Ask for Destination — disabled
- File Replacement — With Newer
- File Modification — Localize
- Path Constraint — Host
- Destination Folder —
- Log Errors — enabled
- Max Number of Levels — varies, recommended <= 6
- Max Number of Files — normally leave disabled, but may be wise to set an upper limit, some large-ish number
- Identity — recommended to rotate frequently
- Exclude Patterns — recommended! (see example below)
- All other settings I usually leave at the default values
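To give a feel for what leaving “Ignore Robot Exclusions” disabled means in practice, here is a minimal Python sketch of how a polite downloader might check a site's robots.txt rules before fetching a page. This is not SiteSucker's own code. The robots.txt content, the example.com URLs, and the "ExampleArchiver" agent name are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real site serves its own at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
AGENT = "ExampleArchiver"  # made-up user agent name

# Pages outside the disallowed directory may be fetched.
allowed_public = parser.can_fetch(AGENT, "https://example.com/docs/index.html")

# Pages under /private/ are off limits for a well-behaved downloader.
allowed_private = parser.can_fetch(AGENT, "https://example.com/private/notes.html")

# Some sites also ask for a pause (in seconds) between requests.
delay = parser.crawl_delay(AGENT)

print(allowed_public, allowed_private, delay)
```

Honoring these rules is the "be kind" part: the site owner has already told you which directories they would rather not have crawled.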
The smarter you configure your settings, the quicker you will be able to download exactly what you want, without wasting time and resources downloading stuff that you don’t need.
The more you end up using the SiteSucker app, the more you’ll want to optimize settings to achieve the best results. As mentioned above, the goal is precise, efficient downloads. In other words, grab only what you need and do it without trashing the site’s server. To help achieve this, here are some key settings for getting lean and mean with all your downloadz.
There are two places where you can access settings in this app. So first visit the App Preferences and look for the setting, “Connections for new documents”. The default value, 6, works great in most cases. But you can dial it down for “quieter” downloads, which as explained previously are advised whenever possible. Or if you want to just gulp down some site very quickly, you can increase the value of this setting to as high as you are willing to go (or risk).
In addition to the App Preferences, there are the App Settings, which are located within the floating SiteSucker app window (UI). There you will find a button that looks like a gear icon. Click on it to access the following settings.
- Max Number of Levels — experiment with lower numbers for targeted downloads
- Max Number of Files — if you need to leave the app running while you are away, and are downloading an unknown number of pages, use this setting to keep things sane
- Exclude Patterns — this is the BEST way to help you download only what is needed (see example below)
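To picture how the level and file limits rein in a download, here is a rough Python sketch that walks a tiny, made-up site map breadth first. It is only an illustration of the idea. The LINKS map and the plan_download function are hypothetical, and counting the start page as level 1 is my assumption, not necessarily how SiteSucker counts.

```python
from collections import deque

# Hypothetical site map: each page lists the pages it links to.
LINKS = {
    "/": ["/docs/", "/blog/"],
    "/docs/": ["/docs/install/", "/docs/usage/"],
    "/blog/": ["/blog/post-1/"],
    "/docs/install/": ["/docs/install/mac/"],
    "/docs/usage/": [],
    "/blog/post-1/": ["/"],
    "/docs/install/mac/": [],
}


def plan_download(start, max_levels, max_files=None):
    """Breadth-first walk that stops at max_levels and at max_files."""
    seen = {start}
    queue = deque([(start, 1)])  # assumption: the start page is level 1
    planned = []
    while queue:
        page, level = queue.popleft()
        # Stop completely once the file cap is reached.
        if max_files is not None and len(planned) >= max_files:
            break
        planned.append(page)
        # Do not follow links from pages already at the deepest level.
        if level >= max_levels:
            continue
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return planned


shallow = plan_download("/", max_levels=2)
capped = plan_download("/", max_levels=6, max_files=4)
print(shallow)
print(capped)
```

With two levels only the home page and its direct links are planned. With the file cap, the walk stops after four pages no matter how deep the site goes.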
For the Exclude Patterns setting, here is a set of patterns that I used recently, to give you an example of how it works.
Here are the patterns all listed together (they are entered separately per-line in the app settings).
.*/de/.*
.*/es/.*
.*/fr/.*
.*/ja/.*
.*/pt_BR/.*
.*/ro/.*
.*/tr/.*
.*/zh/.*
The trick to understanding these exclusion patterns (regular expressions, or “regex”) is knowing that .* works as a wildcard: the dot matches any single character, and the asterisk repeats it zero or more times, so together they match any run of characters. So when you write something like .*/directory/.* you are matching any URL that includes the string /directory/. So with this example, all of the following URLs would be excluded and NOT downloaded:
..and so forth. Knowing this, and a little trial and error, it is possible to go into any website, and download only what you need in whatever language you need it in. The trick of course is getting a “map” or basic idea of the site’s layout. In order to exclude specific directories and files, you need to know their location, or URI on the server.
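If you want to test patterns before pasting them into the app, a few lines of Python can act as a dry run. This is just a sketch: the example.com URLs are made up, and I am assuming each pattern has to match the whole URL (hence fullmatch), which may or may not be exactly how SiteSucker applies them.

```python
import re

# The same exclusion patterns listed above, one per entry.
EXCLUDE_PATTERNS = [
    r".*/de/.*", r".*/es/.*", r".*/fr/.*", r".*/ja/.*",
    r".*/pt_BR/.*", r".*/ro/.*", r".*/tr/.*", r".*/zh/.*",
]
COMPILED = [re.compile(pattern) for pattern in EXCLUDE_PATTERNS]


def is_excluded(url: str) -> bool:
    """True if any pattern matches the entire URL (assumed behavior)."""
    return any(pattern.fullmatch(url) for pattern in COMPILED)


# Hypothetical URLs, for illustration only.
urls = [
    "https://example.com/docs/intro.html",
    "https://example.com/de/docs/intro.html",
    "https://example.com/docs/fr/intro.html",
    "https://example.com/pt_BR/",
    "https://example.com/design/index.html",
]

kept = [url for url in urls if not is_excluded(url)]
for url in kept:
    print(url)
```

Notice that /design/ is kept even though it starts with "de". The slashes on both sides of the directory name in the pattern are what keep it from matching lookalike paths.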
Take-home message: for best results it can take some trial and error for each site you want to download. My strategy is to spend a few minutes beforehand, just surfing the target site to get an idea of its directory structure. Then I tweak exclusions and settings as needed to get the most concise, efficient, and server-friendly download possible.
Until next time..
Hopefully this is useful for you. Keeping archived copies of online content is super important these days. And yes there are other ways to download entire websites and directories, just get out there on your favorite search engine and surf around. All sorts of possibilities for Mac, Windows, Linux, or whatever operating system you may be using.