So, today I was trying to download an entire C programming tutorial from a website. It was split across several different HTML files, and I wanted to have it all so I could read it while offline.

My first thought was to use wget to automatically grab it all with the following parameters ("-r" for recursive download and "-np" so it does not climb up to the parent directory):

#wget -r -np http://www.xxx.com/docs/stuff/yeah/

The output:

--2009-04-08 02:34:59--  http://www.xxx.com/docs/stuff/yeah/
Resolving http://www.xxx.com... 75.126.69.23
Connecting to http://www.xxx.com|75.126.69.23|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2009-04-08 02:35:00 ERROR 403: Forbidden.

So, you might ask yourself: “but then how the hell did my browser get the HTML files without any error?”

The web server uses a kind of security configuration where it refuses any “user agent” that does not look like a browser. For example, when you use wget to download the HTML, the web server answers with “ERROR 403: Forbidden”, because the user agent is not a valid browser, it is Wget. I just don’t know yet how it works on the server side; hopefully I will be writing more about it in the next posts.
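My guess (I have not seen this server’s configuration, so take it only as a sketch) is that it is something like an Apache mod_rewrite rule in the site’s .htaccess that returns 403 whenever the User-Agent header contains “Wget”:

RewriteEngine On
# If the User-Agent header contains "Wget" (case insensitive)...
RewriteCond %{HTTP_USER_AGENT} Wget [NC]
# ...answer every request with 403 Forbidden
RewriteRule .* - [F]

Any client whose user agent does not match the pattern goes through normally, which would explain the behavior we see.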

So now we can try the following parameters in order to get around it:

#wget -U firefox -r -np http://www.xxx.com/docs/stuff/yeah/

Where “-U” sets the “user agent” string (it is the short form of --user-agent).

It works flawlessly! 😀
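By the way, “-U firefox” just sends the literal string “firefox” as the User-Agent header; this particular server apparently only cares that the agent is not Wget. If you hit a pickier server that checks for a full browser identification, you can pass a complete one instead (this Firefox 3 string is only an example):

#wget -U "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko/2009032711 Firefox/3.0.8" -r -np http://www.xxx.com/docs/stuff/yeah/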

Please share your experiences with this with us.

References:

http://www.checkupdown.com/status/E403.html

http://www.gnu.org/software/wget/manual/
