Download HTTPS website available only through username and password with wget?

I’m trying to recursively download a website that is normally available only when you are logged in.

I have a valid username and password, but the problem is that I need to log in through a web interface, so using --user=user and --password=password doesn’t help.

wget downloads only one web page, with the text:
Sorry this page is not available, maybe you’ve forgotten to login?

Is it possible to download?

I can’t use --user and --password even on the login page, because the site does not use the FTP/HTTP authentication described in man wget:

--user=user
--password=password
    Specify the username user and password password for both FTP and
    HTTP file retrieval.

Instead, the site uses a classic graphical login form.

If I try to log in with the POST method and save the cookies, like this:

wget --save-cookies coookies --keep-session-cookies --post-data='j_username=usr&j_password=pwd' 'https://idp2.civ.cvut.cz/idp/Authn/UserPassword'

the coookies file is empty and the saved page is an error page.

The URL is https://idp2.civ.cvut.cz/idp/Authn/UserPassword. When I want to log in, the site redirects me to this page, and after a successful login it redirects me back to the page I came from or to the page I was trying to reach (for example, https://progtest.fit.cvut.cz/).

Asked By: MichalH

||

The session information is probably saved in a cookie to allow you to navigate to other pages after you have logged in.

If this is the case, you could do this in two steps:

  1. Use wget’s --save-cookies mycookies.txt and --keep-session-cookies options on the login page of the website, along with your --user and --password options.
  2. Use wget’s --load-cookies mycookies.txt option on the subsequent pages you are trying to retrieve.
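
The two steps above might look like this in a small shell script. It is only a sketch: the URLs and credentials are placeholders, the helper names login and fetch_with_session are made up for illustration, and the WGET variable is there only so that the commands can be printed with echo instead of being run.

```shell
# Sketch of the two-step cookie flow; every URL and credential is a placeholder.
WGET=${WGET:-wget}                        # set WGET=echo for a dry run
COOKIE_JAR=${COOKIE_JAR:-mycookies.txt}

# Step 1: authenticate on the login page and store the session cookies.
# Usage: login USER PASSWORD LOGIN_URL
login() {
    "$WGET" --save-cookies "$COOKIE_JAR" --keep-session-cookies \
        --user="$1" --password="$2" "$3"
}

# Step 2: reuse the stored cookies for any later page.
# Usage: fetch_with_session URL
fetch_with_session() {
    "$WGET" --load-cookies "$COOKIE_JAR" "$1"
}

# Example calls (not executed here):
#   login myuser mypassword 'https://example.org/login'
#   fetch_with_session 'https://example.org/private/page.html'
```

Without --keep-session-cookies, wget drops cookies that have no expiry date when it writes the file, which would explain an empty cookie file after login.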

EDIT

If the --user and --password options don’t work, you must find out what information the login page sends to the server and mimic it:

  • For a GET request, you can add the GET parameters directly to the address wget must fetch (make sure you properly quote the &, = and other special characters). The URL would probably look something like https://the_url?user=foo&pass=bar.
  • For a POST request, you can use wget’s --post-data=the_needed_info option to send the needed login info with the POST method.

EDIT 2

It seems that you do need the POST method with j_username and j_password set. Try passing the --post-data='j_username=yourusername&j_password=yourpassword' option to wget.
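
One thing to watch for: as far as I know, wget sends the --post-data string as it is, so a password containing characters such as & or = has to be percent-encoded first. Below is a minimal sketch of how that could be done in plain shell. The helper names urlencode and build_post_data are made up for this example, and the encoder is only meant for ASCII input.

```shell
# Percent-encode every character outside the RFC 3986 unreserved set.
# Runs in a subshell (note the parentheses) so its variables do not leak.
urlencode() (
    LC_ALL=C
    s=$1
    out=
    while [ -n "$s" ]; do
        c=${s%"${s#?}"}     # first character of s
        s=${s#?}            # remainder of s
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;   # "'c" yields the character code
        esac
    done
    printf '%s\n' "$out"
)

# Build the form body for the j_username / j_password fields.
# Usage: build_post_data USER PASSWORD
build_post_data() {
    printf 'j_username=%s&j_password=%s\n' "$(urlencode "$1")" "$(urlencode "$2")"
}

# Example call (not executed here):
#   wget --post-data="$(build_post_data usr 'p@ss&word')" 'https://idp2.civ.cvut.cz/idp/Authn/UserPassword'
```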

EDIT 3

With the page of origin, I was able to understand a little more of what is happening. However, I cannot confirm that it works because, well, I don’t have (nor do I want) valid credentials.

Here is what’s happening:

  1. The page https://progtest.fit.cvut.cz/ sets a PHPSESSID cookie and presents you with login options.
  2. Clicking the login button sends a request to https://progtest.fit.cvut.cz/shibboleth-fit.php, which takes the PHPSESSID cookie (not sure if it uses it) and redirects you to the SSO engine with a URL crafted just for you, which looks like this: https://idp2.civ.cvut.cz/idp/profile/SAML2/Redirect/SSO?SAMLRequest=SOME_VERY_LONG_AND_UNIQUE_ID
  3. The SSO response sets a new cookie named _idp_authn_lc_key and redirects you to the page https://idp2.civ.cvut.cz:443/idp/AuthnEngine, which redirects you again to https://idp2.civ.cvut.cz:443/idp/Authn/UserPassword (the real login page).
  4. You enter your credentials and send the post data j_username and j_password along with the cookie from the SSO response
  5. ???

The first four steps can be done with wget like this:

origin='https://progtest.fit.cvut.cz/'

# Get the PHPSESSID cookie
wget --save-cookies phpsid.cki --keep-session-cookies "$origin"

# Get the _idp_authn_lc_key cookie
wget --load-cookies phpsid.cki  --save-cookies sso.cki --keep-session-cookies --header="Referer: $origin" 'https://progtest.fit.cvut.cz/shibboleth-fit.php'

# Send your credentials
wget --load-cookies sso.cki --save-cookies auth.cki --keep-session-cookies --post-data='j_username=usr&j_password=pwd' 'https://idp2.civ.cvut.cz/idp/Authn/UserPassword'

Note that wget follows redirects on its own, which helps us quite a bit in this case.
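
Since step 5 is unknown, the following is only a guess at what could come next: if auth.cki ends up holding an authenticated session, the recursive download itself might look like this. The function name mirror_with_session is made up, and the WGET variable exists only so the command can be printed with echo. I could not test this against the real site. If the response to the login POST is an HTML form that a browser would submit through JavaScript, wget will not submit it, and one more POST would be needed before this step.

```shell
# Hypothetical step 5: mirror the site using the cookies saved in auth.cki.
# Assumes auth.cki already contains an authenticated session (unverified).
WGET=${WGET:-wget}                        # set WGET=echo for a dry run
origin='https://progtest.fit.cvut.cz/'

# Usage: mirror_with_session [URL]   (defaults to $origin)
mirror_with_session() {
    "$WGET" --load-cookies auth.cki \
        --recursive --no-parent --page-requisites --convert-links \
        "${1:-$origin}"
}

# Example call (not executed here):
#   mirror_with_session
```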

Answered By: user43791

Why are you playing around with wget? It would be better to use a headless browser to automate this task.

What is a headless browser, you ask?

A headless browser is a web browser without a graphical user interface.
Headless browsers provide automated control of a web page in an environment similar to popular web browsers, but they are driven from a command line interface or over network communication.

Two popular headless browsers are phantomjs (JavaScript) and Ghost.py (Python).

Solution using phantomjs

First, install phantomjs. On Ubuntu-based systems, you can install it with the package manager, or you can build it from the source available on its home page.

sudo apt-get install phantomjs

After this, write a JavaScript script and run it with phantomjs:

phantomjs script.js

That’s it.

To learn how to implement it for your case, head over to its quickstart guide. As an example, to log in to Facebook automatically and take a snapshot, one could use the gist provided here:

// Logs in to a Facebook account and takes a screenshot of the resulting page.
var page = require('webpage').create();

// Runs inside the page: fill in the login form and submit it.
var fillLoginInfo = function () {
    var frm = document.getElementById("login_form");
    frm.elements["email"].value = 'your fb email/username';
    frm.elements["pass"].value = 'password';
    frm.submit();
};

// Called after every page load, including the one that follows the form submission.
page.onLoadFinished = function () {
    if (page.title == "Welcome to Facebook - Log In, Sign Up or Learn More") {
        // Still on the login page: submit the credentials and wait for the next load.
        page.evaluate(fillLoginInfo);
        return;
    }
    // Any other page: save a screenshot and quit.
    page.render('./screens/some.png');
    console.log("completed");
    phantom.exit();
};

page.open('https://www.facebook.com/');

Look through the documentation to implement it for your specific case. If you run into SSL errors with your HTTPS website, run your script like this:

phantomjs --ssl-protocol=any script.js

Solution using Ghost.py

To install Ghost.py, you will need pip:

sudo apt-get install python-pip   # On a Debian-based system
sudo pip install Ghost.py

Ghost.py is now installed. To use it inside a Python script, follow the documentation on its home page. I tried Ghost.py on an HTTPS website, but it somehow didn’t work for me. Do try it and see if it works for you.

UPDATE : GUI based solution

You can also use tools like Selenium to automate the login process and retrieve the information. It is pretty easy to use: install a plugin for your browser from here, then record your process and replay it later on.

Answered By: shivams

Try using 'curl'

curl --data "j_username=value1&j_password=value2" https://idp2.civ.cvut.cz/idp/Authn/UserPassword

You may need to look at the response type and set the Content-Type header to match, e.g. XML or JSON.
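
Because the login on this site depends on cookies and redirects, the curl call will probably also need a cookie jar and --location. Here is a rough sketch, not tested against the real site. The function names curl_login and curl_fetch and the file name curl_cookies.txt are made up, and the CURL variable is there only so the commands can be printed with echo.

```shell
# Sketch: log in with curl, keep the cookies, then reuse them.
CURL=${CURL:-curl}                        # set CURL=echo for a dry run

# Usage: curl_login USER PASSWORD
# --data-urlencode percent-encodes the value after the first "=".
curl_login() {
    "$CURL" --location --cookie-jar curl_cookies.txt \
        --data-urlencode "j_username=$1" \
        --data-urlencode "j_password=$2" \
        'https://idp2.civ.cvut.cz/idp/Authn/UserPassword'
}

# Usage: curl_fetch URL
# --cookie with a file name reads the cookies stored by curl_login.
curl_fetch() {
    "$CURL" --location --cookie curl_cookies.txt "$1"
}

# Example calls (not executed here):
#   curl_login usr pwd
#   curl_fetch 'https://progtest.fit.cvut.cz/'
```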

Answered By: jas-

In addition to cookies, set the user agent to that of a browser such as Firefox or Chrome, because many servers reject download managers.
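
With wget, that could look roughly like the sketch below. The User-Agent string is just an example value, the function name fetch_as_browser is made up, and auth.cki is assumed to be a cookie file saved earlier with --save-cookies.

```shell
# Sketch: send a browser-like User-Agent together with saved cookies.
WGET=${WGET:-wget}                        # set WGET=echo for a dry run

# Example value only; copy the string your own browser sends if in doubt.
USER_AGENT='Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0'

# Usage: fetch_as_browser URL
fetch_as_browser() {
    "$WGET" --user-agent="$USER_AGENT" --load-cookies auth.cki "$1"
}

# Example call (not executed here):
#   fetch_as_browser 'https://progtest.fit.cvut.cz/'
```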

Alternatively, use the Firefox extension ScrapBook:

  • easy to use
  • GUI

Answered By: totti

The way I would do it is this: first, I would use the Live HTTP Headers plugin for Firefox to analyze the communication. Referers and all that stuff may be needed. Once I have that information, I would mimic it with wget, saving cookies and loading them when needed.
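
If you have no browser plugin at hand, wget itself can show part of that communication through its --server-response option. The sketch below narrows the output to status lines, redirects and cookies. The function names filter_headers and trace_login are made up, and this only shows what the server sends to wget, which may differ from what it sends to a real browser.

```shell
# Sketch: trace redirects and cookies with wget instead of a browser plugin.
WGET=${WGET:-wget}

# Keep only status lines, Location headers and Set-Cookie headers.
filter_headers() {
    grep -iE '^[[:space:]]*(HTTP/[0-9.]+ |Location:|Set-Cookie:)'
}

# Usage: trace_login URL
# wget prints the headers on stderr, so stderr is redirected into the filter.
trace_login() {
    "$WGET" --server-response --output-document=/dev/null "$1" 2>&1 | filter_headers
}

# Example call (not executed here):
#   trace_login 'https://progtest.fit.cvut.cz/shibboleth-fit.php'
```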

Answered By: YoMismo

For a more modern tool for this kind of operation, check out HTTPie:

https://httpie.org/

Answered By: Adam Erickson

I was looking for the same thing and finally wrote a tool based on user43791’s answer.

So what you are looking for may be something like this:
https://github.com/atar-axis/TUM-tools/blob/master/moodle_grabber/moodle_grabber.sh

It should be very easy to adapt this to any other Shibboleth site, and it is based only on wget!

Answered By: flood