Generate a list of a site's URLs using wget

You can use wget to generate a list of the URLs on a website.

Spider example.com, writing the URLs to urls.txt and filtering out common asset and image files (css, js, png, etc.):

wget --spider -r http://www.example.com 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\|JPG\)$' > urls.txt
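
Broken across lines with comments, this is the same pipeline as above (just reformatted, not a different technique):

# spider the site recursively; wget writes its request log to stderr, hence 2>&1
wget --spider -r http://www.example.com 2>&1 |
  # request lines look like "--2014-07-19 10:00:00--  http://www.example.com/page"
  grep '^--' |
  # the URL is the third whitespace-separated field on those lines
  awk '{ print $3 }' |
  # drop stylesheets, scripts and images before writing the list
  grep -v '\.\(css\|js\|png\|gif\|jpg\|JPG\)$' > urls.txt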

Note that the resulting list contains duplicate URLs.
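
If the duplicates are a problem, one way to strip them out afterwards is to run the list through sort -u (a quick sketch; urls-unique.txt is just an arbitrary output name):

sort -u urls.txt > urls-unique.txt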

If you mirror the site instead of spidering it, you seem to get a more comprehensive list without duplicates:

wget -m http://www.example.com 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\|JPG\)$' > urls.txt

Unlike the spider run, this downloads every page of the site into a local directory named after the host (www.example.com here).
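
If you only want the URL list and would rather not keep a copy of the site on disk, wget's --delete-after option removes each file once it has been downloaded and scanned for links. A sketch using -r rather than -m (I haven't checked how --delete-after interacts with -m's timestamping):

wget -r --delete-after http://www.example.com 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\|JPG\)$' > urls.txt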

Last modified: 19/07/2014
