Talk:Website fingerprinting

Hm.. interesting idea. Try this:

Which sites to store?
The Alexa.com top 1 million + selected categories in DMOZ.org like banks.

Which pages to store?
The main page and the login pages.

How to find the login pages?
Check the link text of each link off the main page for a set of keywords, e.g. login, signin, sign-in. Index each page that matches, up to a limit of 500.

If no matching link text is found, then check each link's destination page for login forms. They can be identified by input tags with type="password" and other keywords. If this doesn't work then try training a Bayesian text classifier.
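A rough sketch of those two checks in Python, using only the standard library. The keyword list, the function names and the reading of "limit of 500" as a per-site cap on matched links are my assumptions, not something fixed above. Fetching the pages is left out.

```python
from html.parser import HTMLParser

# Assumed keyword set; extend as needed.
LOGIN_KEYWORDS = ("login", "log in", "signin", "sign in", "sign-in")


class LinkTextCollector(HTMLParser):
    """Collects (href, link text) pairs for every <a href=...> on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


class PasswordFieldDetector(HTMLParser):
    """Sets .found when the page contains <input type="password">."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "input" and (dict(attrs).get("type") or "").lower() == "password":
            self.found = True


def find_login_links(html, limit=500):
    """Return hrefs whose link text contains a login keyword, capped at limit."""
    collector = LinkTextCollector()
    collector.feed(html)
    matches = [href for href, text in collector.links
               if any(k in text.lower() for k in LOGIN_KEYWORDS)]
    return matches[:limit]


def has_login_form(html):
    """Fallback check: does the destination page contain a password field?"""
    detector = PasswordFieldDetector()
    detector.feed(html)
    return detector.found
```

Run find_login_links on the main page first. Only if it returns nothing, fetch each linked page and run has_login_form on it.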

How to make a fingerprint?

 * 1) a hash of the sequence of HTML tags, e.g. html,head,title,/title,/head
 * 2) a hash of the HTML except for URLs, e.g. strip the URL in <a href="">, style declarations, etc.
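Here is one way the two fingerprints could look in Python. SHA-256, the regexes and the function names are my own choices for illustration. The regex stripping is deliberately crude and will also hit attribute-like text outside tags, so treat it as a starting point.

```python
import hashlib
import re
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Records tag names in document order, closing tags prefixed with '/'."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_fingerprint(html):
    """Fingerprint 1: hash of the tag sequence, e.g. html,head,title,/title,/head."""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return hashlib.sha256(",".join(parser.tags).encode("utf-8")).hexdigest()


# Assumed set of URL-carrying attributes; quoted values only.
URL_ATTR = re.compile(r'''\b(href|src|action)\s*=\s*("[^"]*"|'[^']*')''', re.I)
STYLE_BLOCK = re.compile(r"<style\b.*?</style\s*>", re.I | re.S)
STYLE_ATTR = re.compile(r'''\bstyle\s*=\s*("[^"]*"|'[^']*')''', re.I)


def content_fingerprint(html):
    """Fingerprint 2: hash of the HTML with URLs blanked and style declarations removed."""
    stripped = STYLE_BLOCK.sub("", html)
    stripped = STYLE_ATTR.sub("", stripped)
    stripped = URL_ATTR.sub(r'\1=""', stripped)
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()
```

Two copies of a page that differ only in where their links point should get the same value from both functions, which is the property a mirrored phishing page would trip over.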

How do I know if the fingerprint method is good?
Collect an archive of mirrored phishing websites and test.

How do I implement it?
Collect the fingerprints and identify login pages using WhatWeb (http://www.morningstarsecurity.com/research/whatweb). I wrote WhatWeb BTW and you'll need to write a couple of custom plugins, maybe I will help you.... maybe not.

Make a web browser plugin.
It sends the URL + fingerprint to the server.

Make a server
It receives URL + page fingerprint pairs and responds with one of:
 * 1) url found, fingerprint matches. all good in da hood
 * 2) url found, fingerprint doesn't match. maybe they redesigned their website, maybe it's a MITM attack. better have the central server fetch the actual URL to verify.
 * 3) url not found, fingerprint not found. whatever... don't lose your trust in small businesses
 * 4) url not found, fingerprint found. maybe it's a phishing site.
 * 5) url not found, fingerprint found for >1 trusted websites. probably a false positive for a CMS default login page.

Meh.. that's about it. I wouldn't bother the user unless you get condition 2 or 4. You should pop something up and let them choose what to do, eg. redirect to the trusted site with the same fingerprint.
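The five server responses boil down to two lookups. A minimal sketch, assuming the server keeps two in-memory indexes (URL to known fingerprints, fingerprint to trusted URLs). The names Verdict, classify and should_warn are made up for illustration.

```python
from enum import Enum


class Verdict(Enum):
    OK = 1                # url found, fingerprint matches
    CHANGED_OR_MITM = 2   # url found, fingerprint doesn't match
    UNKNOWN = 3           # url not found, fingerprint not found
    POSSIBLE_PHISH = 4    # url not found, fingerprint found for one trusted site
    SHARED_TEMPLATE = 5   # url not found, fingerprint found for >1 trusted sites


def classify(url, fingerprint, by_url, by_fingerprint):
    """by_url: dict of url -> set of fingerprints.
    by_fingerprint: dict of fingerprint -> set of trusted urls."""
    if url in by_url:
        if fingerprint in by_url[url]:
            return Verdict.OK
        return Verdict.CHANGED_OR_MITM
    owners = by_fingerprint.get(fingerprint, set())
    if not owners:
        return Verdict.UNKNOWN
    if len(owners) > 1:
        return Verdict.SHARED_TEMPLATE
    return Verdict.POSSIBLE_PHISH


def should_warn(verdict):
    """Only bother the user on conditions 2 and 4."""
    return verdict in (Verdict.CHANGED_OR_MITM, Verdict.POSSIBLE_PHISH)
```

For condition 4, by_fingerprint[fingerprint] holds the single trusted URL the popup could offer to redirect to.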

What's the greatest challenge involved in this?
Who's gonna pay for the bandwidth? It could work for a single corp, an antivirus vendor or Google.

Who wrote this up?
Andrew Horton / urbanadventurer