<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://www.townx.org" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>townx - Using wget to mirror a website - Comments</title>
 <link>http://www.townx.org/blog/elliot/using-wget-mirror-website</link>
 <description>Comments for &quot;Using wget to mirror a website&quot;</description>
 <language>en</language>
<item>
 <title>Well,</title>
 <link>http://www.townx.org/blog/elliot/using-wget-mirror-website#comment-41654</link>
 <description>&lt;p&gt;in that case, +1 because I was also trying to mirror a site [admittedly one that I do not own, but also part of a rather large &quot;cloud&quot; site so I don&#039;t feel too bad mirroring a small 80-page site ignoring robots.txt, or at least so I tell myself] &amp;amp; couldn&#039;t seem to get out of wget &quot;playing nice&quot; mode.&lt;/p&gt;</description>
 <pubDate>Fri, 25 Nov 2011 22:10:08 -0600</pubDate>
 <dc:creator>PaxSkeptica</dc:creator>
 <guid isPermaLink="false">comment 41654 at http://www.townx.org</guid>
</item>
<item>
 <title>I put a short (and probably</title>
 <link>http://www.townx.org/blog/elliot/using-wget-mirror-website#comment-40680</link>
 <description>&lt;p&gt;I put a short (and probably not very helpful) suggestion in reply to your other comment. Always happy to get a genuine comment from someone who found my blog useful!&lt;/p&gt;</description>
 <pubDate>Sat, 29 May 2010 06:49:24 -0500</pubDate>
 <dc:creator>elliot</dc:creator>
 <guid isPermaLink="false">comment 40680 at http://www.townx.org</guid>
</item>
<item>
 <title>Nifty use of wget</title>
 <link>http://www.townx.org/blog/elliot/using-wget-mirror-website#comment-40617</link>
 <description>&lt;p&gt;Nifty use of wget. It seems so simple and useful that I not only bookmarked it but also cut &amp;amp; pasted your article into my personal Linux help document. Don&#039;t want to risk a page not found some time off in the distant future. ;)&lt;br /&gt;
Incidentally, I stumbled onto your blog while searching for an easy way to implement spam filtering. The how-to was very helpful, but left me with one question when training my inbox (see my comment on that blog entry, if you have any insight).&lt;br /&gt;
Thanks, John&lt;/p&gt;</description>
 <pubDate>Wed, 19 May 2010 09:26:51 -0500</pubDate>
 <dc:creator>John</dc:creator>
 <guid isPermaLink="false">comment 40617 at http://www.townx.org</guid>
</item>
<item>
 <title>Using wget to mirror a website</title>
 <link>http://www.townx.org/blog/elliot/using-wget-mirror-website</link>
 <description>&lt;p&gt;Occasionally you need to mirror a website (or a directory inside one). If you&#039;ve only got &lt;span class=&quot;caps&quot;&gt;HTTP &lt;/span&gt;access, there are tools like &lt;a href=&quot;http://www.httrack.com/&quot;&gt;httrack&lt;/a&gt; which are pretty good (albeit pretty ugly) at doing this. However, as far as I can tell, you can&#039;t use httrack on a password-protected website.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://curl.haxx.se/&quot;&gt;curl&lt;/a&gt; can probably do this too, and supports authentication, but it wasn&#039;t obvious how.&lt;/p&gt;

&lt;p&gt;So I ended up using &lt;a href=&quot;http://www.gnu.org/software/wget/&quot;&gt;wget&lt;/a&gt;, as it supports mirroring and credentials. But the issue here is that wget plays nice and respects robots.txt, which can actually prevent you from mirroring a site you own. And nothing in the man page explains how to ignore robots.txt.&lt;/p&gt;

&lt;p&gt;Eventually, I came up with this incantation, which works for me (access to password-protected site, full mirror, ignoring robots.txt):&lt;/p&gt;



&lt;pre&gt;
wget -e robots=off --wait 1 -x --user=xxx --password=xxx -m -k &lt;a href=&quot;http://domain.to.mirror/&quot; title=&quot;http://domain.to.mirror/&quot;&gt;http://domain.to.mirror/&lt;/a&gt;
&lt;/pre&gt;



&lt;p&gt;where:&lt;/p&gt;


&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;-e robots=off&lt;/strong&gt; tells wget to ignore robots.txt&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;--wait 1&lt;/strong&gt; forces a pause between gets (so the site doesn&#039;t get hammered)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;--user&lt;/strong&gt; and &lt;strong&gt;--password&lt;/strong&gt;: self-evident&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;-x&lt;/strong&gt; creates a local directory structure which &quot;mirrors&quot; (see what I did there?) the directory structure on the site you&#039;re mirroring&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;-m&lt;/strong&gt; turns on mirror mode: &quot;turns on recursion and time-stamping, sets infinite recursion depth and keeps &lt;span class=&quot;caps&quot;&gt;FTP &lt;/span&gt;directory listings&quot; (from the man page)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;-k&lt;/strong&gt; converts links after download so that &lt;span class=&quot;caps&quot;&gt;URL&lt;/span&gt;s in the mirrored files reference local files&lt;/li&gt;
&lt;/ul&gt;
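
&lt;p&gt;As far as I can tell, &lt;strong&gt;-e&lt;/strong&gt; runs its argument as if it were a line in a wget startup file, which is why &lt;strong&gt;-e robots=off&lt;/strong&gt; works at all. So the same options could probably live in ~/.wgetrc instead of on the command line. This is only a sketch which I haven&#039;t tested here; the xxx values are the same placeholders as above, and the comments sit on their own lines because I&#039;m not sure wget accepts trailing comments:&lt;/p&gt;

&lt;pre&gt;
# ~/.wgetrc: rough equivalent of the command line above (untested sketch)
# same as -e robots=off
robots = off
# same as --wait 1
wait = 1
# same as -x
dirstruct = on
# same as -m
mirror = on
# same as -k
convert_links = on
# same as --user=xxx and --password=xxx (placeholders)
user = xxx
password = xxx
&lt;/pre&gt;

&lt;p&gt;With that in place, the command should shrink to plain &lt;strong&gt;wget http://domain.to.mirror/&lt;/strong&gt;, and the password no longer shows up in the process list. The catch is that these settings then apply to every wget run for that user, so it only makes sense on an account you use for mirroring.&lt;/p&gt;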



&lt;p&gt;Don&#039;t use it carelessly on someone else&#039;s website, as they might get angry...&lt;/p&gt;</description>
 <comments>http://www.townx.org/blog/elliot/using-wget-mirror-website#comments</comments>
 <category domain="http://www.townx.org/tech">tech</category>
 <category domain="http://www.townx.org/howtos">howtos</category>
 <pubDate>Wed, 12 May 2010 04:29:46 -0500</pubDate>
 <dc:creator>elliot</dc:creator>
 <guid isPermaLink="false">793 at http://www.townx.org</guid>
</item>
</channel>
</rss>

