Sunday, November 23, 2008

Using WGET for screen scraping HTML pages.

(THIS POST IS STILL IN PROGRESS)

Ever want to get some content from a web page on a schedule? That's a pretty simple programming task, but sometimes the content is behind a password-protected page. What do you do then?

Wouldn't it be nice to be able to do the above from a simple batch file, and then run the batch file using the Windows scheduler? This quick tutorial will show you how.

The following procedure will work on WINDOWS or UNIX (at least in concept), but I'll be writing it from the point of view of a Windows user.

STEP 0: Plan
Figure out what you want to automate. In this post, I'm going to show you how to log in to GMAIL using your account info and get the first email in the list. This is not particularly useful (there are easier ways to get your email), but it shows the process.

STEP 1: Install WGET and PERL.
Install CYGWIN and select the WGET component (from the "Web" category) during the installation. Another good component we'll be using is the PERL language (from the "Interpreters" category), which is good at simple text processing of the WGET output.

Both of these tools are command line tools, which means this entire example will be runnable from a batch file.

The PERL module is optional.

STEP 2: Install HTTPFOX plugin.
Install FIREFOX and the HTTPFOX plugin. This plugin only works on Firefox. HTTPFOX lets you examine the streams of data going between Firefox and the website -- kinda like a network analyzer, but all handled by the plugin instead of a network driver.

STEP 3: Record a session with HTTPFOX
For this step, it is a really good idea to make sure you only have one FIREFOX window open, and only one tab open. Close all the rest.

Start HTTPFOX (it's a small green icon in bottom right corner of your browser) and then navigate to the site: http://gmail.com (I assume you have an account there).

When you start HTTPFOX, it creates a window at the bottom of the screen. There is a START button for starting recording, and a STOP button for when you are done.

So start recording. Just as a note, HTTPFOX isn't recording what you type. It records all the HTTP traffic between your browser and the webserver.

Then, in the GMAIL web page, type in your username and password. Press SIGN IN.

You'll see a bunch of activity in the HTTPFOX window. Each entry is a request sent by the browser to the server, one for each HTML file, image file, javascript file, css file, or media file in the web page. Plus there can be additional traffic for AJAX HTTP requests.

Each request has a header, cookie, query string, post data, and content that you can see as you click on each one.

After the login is complete and you see your mail, press the STOP button in HTTPFOX.

STEP 4: Analyze the HTTPFOX data.

This is really the toughest step. You have to figure out which HTTP requests are important for the problem you are trying to solve. In our case, we're only interested in the HTML content (no images, scripts, etc.).

In this case we're looking for the request where the password was sent to the server, and the main page returned. This is often a POST request, which sends data to the server and gets a response back. The POST in this case sends your username and password over HTTPS. It is probably the second item in the list:

00:00:13.430 0.274 1428 533 POST 302 Redirect to: https://www.google.com/accounts/CheckCookie?continue=http%3A%2F%2Fmail.google.com%2Fmail%2F%3F&hl=en&service=mail&ltmpl=default&chtml=LoginDoneHtml https://www.google.com/accounts/ServiceLoginAuth?service=mail

If you click on that request, and then look through the tabs, you'll see a "POST Data" tab.

There is a radio button below that lets you choose "RAW" format, which is really useful:

ltmpl=default&ltmplcache=2&continue=http%3A%2F%2Fmail.google.com%2Fmail%2F%3F&service=mail&rm=false&ltmpl=default&hl=en&ltmpl=default&Email=YOURMAIL%40gmail.com&Passwd=PASSWORD&rmShown=1&signIn=Sign+in&asts=
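
If the raw string is hard to read, one option is to split it on the & separators so each field sits on its own line. This is only a sketch for the Cygwin shell. The POST_DATA and FIELDS variable names are my own, and the value is the placeholder string from above, not real credentials.

```shell
#!/bin/sh
# Raw POST data as copied from HTTPFOX (placeholder values, not real credentials).
POST_DATA='ltmpl=default&ltmplcache=2&continue=http%3A%2F%2Fmail.google.com%2Fmail%2F%3F&service=mail&rm=false&ltmpl=default&hl=en&ltmpl=default&Email=YOURMAIL%40gmail.com&Passwd=PASSWORD&rmShown=1&signIn=Sign+in&asts='

# Fields are separated by '&'; put one name=value pair on each line.
FIELDS=$(printf '%s\n' "$POST_DATA" | tr '&' '\n')
printf '%s\n' "$FIELDS"

# Show only the two fields that carry the login.
printf '%s\n' "$FIELDS" | grep -E '^(Email|Passwd)='
```

The values are still URL-encoded (for example %40 stands for @), which is how they need to stay when they are sent back to the server.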

The username and password are the Email and Passwd fields above (replaced here with YOURMAIL and PASSWORD). Since Google is using HTTPS for the login, no one on the internet between your browser and Google can read them, though.

You'll want to keep that RAW POST DATA handy because we are going to copy it later.

There's another interesting tab, the cookie tab, which shows you all the cookies that were sent and returned with this request. That's important to know about, although the details aren't important. Generally any site that sends cookies, probably needs them to work properly, so we'll need to simulate the cookies when we automate with WGET.

STEP 5: Learn about WGET and its command line params.
WGET does what your browser does, but from the command line. WGET only does one HTTP request at a time, so you may need to run it multiple times.

Here is a sample command line that will get a web page:

wget --keep-session-cookies --load-cookies=cookies.txt --save-cookies=cookies.txt -q --no-check-certificate -O - "https://www.google.com/accounts/ServiceLoginAuth?service=mail" --post-data="ltmpl=default&ltmplcache=2&continue=http%3A%2F%2Fmail.google.com%2Fmail%2F%3F&service=mail&rm=false&ltmpl=default&hl=en&ltmpl=default&Email=YOURMAIL%40gmail.com&Passwd=PASSWORD&rmShown=1&signIn=Sign+in&asts="

A couple important notes:
1) The first three command parameters make sure all cookies are kept in a local file, cookies.txt. This will be the repository for the cookies across all requests. Cookies are LOADED from and SAVED to the file. Since the server sends the cookies to the browser (or WGET in this case), if you run WGET a second time, you want the cookies to be sent back with that request. In other words, you maintain a session.

2) The "-O -" option sends the received html to STDOUT. You can also specify a filename there, if you want to keep the HTML.

3) The URL is self explanatory. I copied this URL directly from the POST request in HTTPFOX.

4) The --post-data is copied from HTTPFOX, too. It's the RAW POST DATA that was discussed in the previous step. Pretty nifty eh? [NOTE: I modified the text above to hide my real password]
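
Going by the notes above, a follow-up request would use the same cookie options and simply leave out --post-data. Here is a minimal sketch for the Cygwin shell. The function name fetch_with_session and the WGET override variable are my own inventions for illustration; the options are the ones from the command above.

```shell
#!/bin/sh
# Cookie file shared by every request, as in the command above.
COOKIES=cookies.txt
# WGET can be overridden to preview the command without touching the network.
WGET=${WGET:-wget}

# Fetch one URL with the saved session cookies and print the HTML to STDOUT.
# There is no --post-data here, so WGET sends a plain GET request.
fetch_with_session() {
    "$WGET" --keep-session-cookies \
        --load-cookies="$COOKIES" --save-cookies="$COOKIES" \
        -q --no-check-certificate -O - "$1"
}
```

After the login command has run once, something like this should send the stored cookies back with the request:

fetch_with_session "http://mail.google.com/mail/" > inbox.html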

STEP 6: Write a batch file.
Simulate the interaction with the website using WGET.
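
As a rough sketch of what such a file might look like, here it is as a shell script for the Cygwin shell (a Windows batch file would follow the same two steps). The names session_wget, run_session, login.html, page.html and scrape.sh are made up for this example, and PAGE_URL is only a guess based on the continue= field in the POST data; the real second URL may have to come from the login response. POST_DATA holds the placeholder string, so paste in your own RAW POST DATA.

```shell
#!/bin/sh
# Sketch of a two-request session: log in, then fetch a page with the same cookies.
COOKIES=cookies.txt
WGET=${WGET:-wget}    # override to try the script without touching the network
LOGIN_URL="https://www.google.com/accounts/ServiceLoginAuth?service=mail"
PAGE_URL="http://mail.google.com/mail/"    # assumption, see the note above
# RAW POST DATA from HTTPFOX (placeholder values shown).
POST_DATA='ltmpl=default&ltmplcache=2&continue=http%3A%2F%2Fmail.google.com%2Fmail%2F%3F&service=mail&rm=false&ltmpl=default&hl=en&ltmpl=default&Email=YOURMAIL%40gmail.com&Passwd=PASSWORD&rmShown=1&signIn=Sign+in&asts='

# Options shared by both requests; extra arguments are passed through.
session_wget() {
    "$WGET" --keep-session-cookies \
        --load-cookies="$COOKIES" --save-cookies="$COOKIES" \
        -q --no-check-certificate "$@"
}

run_session() {
    # Request 1: POST the login form and keep the response in login.html.
    session_wget -O login.html --post-data="$POST_DATA" "$LOGIN_URL" || return 1
    # Request 2: GET the page we actually want, reusing the cookies.
    session_wget -O page.html "$PAGE_URL"
}

# Only contact the live site when called as: sh scrape.sh run
if [ "${1:-}" = "run" ]; then
    run_session
fi
```

Keeping the shared options in one function means the cookie handling cannot drift apart between the two requests.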

STEP 7: Parse the output from WGET.
If you look at the response, you won't see much. Google is actually throwing us a bit of a curve here. The Google webserver just returns a web page that redirects you (via javascript location.replace) to another, non-HTTPS page:


Some HTML...

location.replace("http://mail.google.com/mail/?auth\x3dDQAAAHwAAADmne2aBCW1j
C8CLoxi-cm6bOgBHKMF-wyh8A_RyKDLjHrbEPCnA53NjBOxPYYk-y6QXQJKeVGyzD8_g9JIDbTbZISyP
H74ub7JP3FiNVGxPrXKGj2V4o-iYOQ9o2RPdYCSOJyil8VApveTpUpBDZqRNtiuR1A-oLTLEnh4gikaW
A\x26gausr\x3ddejasurf%40gmail.com")

More HTML...

This is where PERL will shine: the easiest thing to do is to parse the HTML and get the new URL, then WGET that URL. The following PERL script does it all:

TBD:
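
While the PERL script is still to come, the parsing step itself can be sketched with sed in the Cygwin shell. This is not the PERL script mentioned above, just a stand-in under a few assumptions: the whole location.replace(...) call arrives on one line, and the only escapes that need undoing are the \x3d (=) and \x26 (&) seen in the response. The function name extract_redirect_url and the SAMPLE string (with a placeholder token) are made up for illustration.

```shell
#!/bin/sh
# Print the URL inside location.replace("..."), with the \x3d and \x26
# JavaScript escapes turned back into = and &.
# Assumes the whole location.replace(...) call sits on one line.
extract_redirect_url() {
    sed -n 's/.*location\.replace("\([^"]*\)").*/\1/p' |
        sed -e 's/\\x3d/=/g' -e 's/\\x26/\&/g'
}

# Made-up sample with a placeholder token, shaped like the response above.
SAMPLE='<script>location.replace("http://mail.google.com/mail/?auth\x3dTOKEN\x26gausr\x3dYOURMAIL%40gmail.com")</script>'
printf '%s\n' "$SAMPLE" | extract_redirect_url
```

The idea would be to pipe the output of the login WGET command through extract_redirect_url, then hand the resulting URL to a second WGET run that uses the same cookie options.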

STEP 8: Make a final batch file to run it all.
You can just run the above PERL script like this from the Windows Scheduler (or a Unix cron job):

perl myscript.pl


2 comments:

Kris Bergstrom said...

Thank you for this great tutorial! I had a lot of success following your instructions, until I realized my bank asks "security questions" every few logins. Life would certainly be easier if there were API's for everything. :)

Anyway, thank you again. The post is very informative and was just what I was looking for!

Jorge Monasterio said...

Thanks, Kris:

>> Security questions every few logins

You can probably, if you have some programming ability, use PERL or a similar language to examine the HTML that comes back and look for the "Security questions" in the HTML. Then, you only have a few options for answers, so send those back. Hard to test, but once you get it working, you'll be back to automating your bank access.