wget to get all the files in a directory only returns index.html

I’m new to using bash, and I have been trying to wget all the files from a website to the server I have been working on. However, all I’m getting back is an index.html file. I let it run for 15 minutes and the index.html file was still downloading, so I killed it. Might my files be downloaded after the index.html file?

Here is the code I have been trying:

$ wget --no-parent -R index.html -A "Sample" -nd --random-wait \
   -r -p -e robots=off -U Mozilla --no-check-certificate \
   http://somewebsite.com/hasSamples/Sample0

I’m trying to download all the files in a subdirectory that starts with Sample. I have searched quite a bit on the internet to find a resolution, and at this point I’m stumped. I probably just haven’t found the right combinations of options, but any help would be much appreciated. Here is my understanding of the code:

  • --no-parent means don’t search parent directories
  • -R index.html means reject downloading the index.html file; I also tried "index.html*", but it still downloaded it anyway
  • -A "Sample" kind of acts like Sample* would in bash
  • -nd means download the files without creating any of the directories
  • --random-wait to make sure you don’t get blacklisted from a site
  • -r recursively downloads
  • -p not sure really
  • -e robots=off ignores robots.txt files
  • -U Mozilla makes the user agent look like it’s Mozilla, I think
  • --no-check-certificate is just necessary for the website.
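For reference, the command with these flags can be assembled in a shell array; it is echoed rather than executed below, since somewebsite.com is only the question's placeholder domain:

```shell
# Build the wget argument list as an array (not executed here, since
# somewebsite.com is the question's placeholder domain).
args=(--no-parent -R 'index.html*' -A 'Sample*' -nd --random-wait
      -r -p -e robots=off -U Mozilla --no-check-certificate
      http://somewebsite.com/hasSamples/Sample0)

# Echo the full command; drop the leading 'echo' to actually run it.
echo wget "${args[@]}"
```

Quoting the patterns ('index.html*', 'Sample*') keeps the shell from expanding them before wget sees them.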
Asked By: njBernstein


What happens when you connect to the page with your browser?

If you browse the pages without any problem, then there might be a couple of things you’re missing.

The page might be checking your browser, and "Mozilla" alone is not the correct answer; pick one of the browser strings from here (the whole string, not just Mozilla) and see if it works.

If it doesn’t, then you might need cookies: fetch the main page with wget and store the cookies it sets, then run wget again with those cookies to download the pages.
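As a sketch of that two-step cookie dance (the URLs are the question's placeholders, so the commands are echoed rather than run):

```shell
# Sketch of the cookie approach; 'run' just echoes the command so
# nothing touches the network. Swap 'echo wget' for 'wget' to run it.
run() { echo wget "$@"; }

# Step 1: hit the main page once, saving any cookies it sets.
run --save-cookies cookies.txt --keep-session-cookies \
    -O /dev/null http://somewebsite.com/

# Step 2: reuse those cookies for the recursive download.
run --load-cookies cookies.txt -r -np \
    http://somewebsite.com/hasSamples/Sample0
```

--keep-session-cookies matters because many login cookies are session cookies, which wget discards from the cookie file by default.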

Try --mirror option if you want to mirror it.

If nothing works, then I would study the connection and the pages. The Live HTTP Headers add-on for Firefox is a pretty cool tool: you can see the whole communication between your browser and the web page. Try to mimic that behaviour with wget to obtain what you’re looking for.
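Headers observed in the browser can be replayed with wget's --header option. A sketch (the header values and URL below are only examples, and the command is echoed rather than run):

```shell
# Replay browser-like headers with wget's repeatable --header option.
# 'run' echoes the command so nothing hits the network.
run() { echo wget "$@"; }

run --header='Accept-Language: en-US,en;q=0.5' \
    --header='Referer: http://somewebsite.com/' \
    -U 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0' \
    http://somewebsite.com/hasSamples/Sample0
```

Note that -U sets the User-Agent specifically; arbitrary other headers each need their own --header flag.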

Answered By: YoMismo

-A “Sample” kind of acts like a Sample* would in bash

Not by my reading of man wget:

  • -A acclist --accept acclist
  • -R rejlist --reject rejlist

Specify comma-separated lists of file name suffixes or patterns to accept or reject. Note that if
any of the wildcard characters, *, ?, [ or ], appear in an element of acclist or rejlist, it will
be treated as a pattern, rather than a suffix.

So your usage (no wildcards) is a suffix match, equivalent to the bash glob *Sample.
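That suffix behaviour is easy to check with a plain shell pattern match (the file names below are made up):

```shell
# wget's -A "Sample" (no wildcards) accepts names *ending* in "Sample",
# i.e. the bash glob *Sample. A quick local demonstration:
is_accepted() { case "$1" in *Sample) echo accepted;; *) echo rejected;; esac; }

is_accepted data-Sample      # accepted: ends in "Sample"
is_accepted Sample0.tar.gz   # rejected: "Sample" is a prefix here, not a suffix
```

To match files that *start* with Sample, the pattern would need an explicit wildcard, e.g. -A 'Sample*'.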

Wget works by scanning links, which is probably why it is trying to download an index.html (you haven’t said what the content of that is, if any, just that it took a long time): it has to have somewhere to start.

To explain further: a URL is not a file path. You cannot scan a web server as if it were a directory hierarchy, saying, “give me all the files in directory foobar”. If foobar corresponds to a real directory (it certainly doesn’t have to, because it’s part of a URL, not a file path), a web server may be configured to provide an autogenerated index.html listing the files, providing the illusion that you can browse the filesystem. But that’s not part of the HTTP protocol; it’s just a convention used by default with servers like Apache.

So what wget does is scan, e.g., index.html for <a href= and <img src=, etc., then it follows those links and does the same thing, recursively. That’s what wget’s “recursive” behaviour refers to: it recursively scans links because (to reiterate) it does not have access to any filesystem on the server, and the server does not have to provide it with ANY information regarding such.

If you have an actual .html web page that you can load and click through to all the things you want, start with that address, and use just -r -np -k -p (recursive, no parent, convert links for local viewing, and fetch page requisites).
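Under that assumption, the simplified invocation would look like this (the URL is the question's placeholder, so the command is echoed rather than run):

```shell
# Minimal recursive fetch starting from a real HTML page:
#   -r  recurse through links        -np don't ascend to the parent
#   -k  convert links after download -p  also grab images/CSS requisites
run() { echo wget "$@"; }   # drop the 'echo' in run() to execute for real

run -r -np -k -p http://somewebsite.com/hasSamples/Sample0/index.html
```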

Answered By: goldilocks