Protecting Whois Sites from Datamining

Nexus · Oct 11, 2004

OK. So I've just finished developing a fairly bulletproof scheme for protecting whois output from data miners, for sites that display whois. The output data is essentially encrypted into gibberish that requires a special function to decrypt it (this function is included in a JavaScript include).

Both the manner of encryption and the function that decrypts it mutate in tandem, so there is no standard means of "decrypting" the data: the methodology mutates each time someone views the output. To the casual observer, the results show as nicely formatted HTML, but the source code shows only a pile of gibberish being run through a function.

For instance, one mutation might make:
"DNforum.com - The Place to Talk Domains"

Look like this:
"KuK1@MS=ItPsK=Lg{tSm.0&JGvLrH1LdLKn4>Pu[LMnuH=rdLunQ<2P5GuLsWtPe{L<1oqzz"

Likewise, Whois output like this:
[quote]Registrant:
MONKEYMOJOmedia, Inc. (MONKEYMO-DOM)
23445 122nd Ave SW
Bellevue, MA 56566
US

Domain Name: MONKEYMOJORAMA.COM

Administrative Contact:
Smedley, Bruce (1481420I) [email protected]
23445 122nd Ave SW
Bellevue, MA 56566
US
(888) 555-5864 fax: (888) 555-0898[/quote]

Would look like this in the HTML source (without carriage returns):

Code:

<script>document.write(o('U0c5eGMybHFhRDVrYWpvTkNsZEZSRlZQZTFkRlZFV
jNiMjV6UGl4QlUyUThXa0VvVjBWRVZVOTdWMFV0VGtWWEtRMEt
Nak0wTkRWQk1USXlaRzVCSUd4dlFVbE5EUXA5YjNaMmIyeHJie
XhCVnlCQk5UWTFOallOQ2t0SkRRb05DazVsZHo1elpFRkVQbmR
2T2tGWFJVUlZUM3RYUlZSRlNDQlhJRm9sUlZjTkNnMEtJRzUzY
zJSemFXcG9QbXB6Ykc5QkpXVmthajQ4YWpvTkNrbDNiMjUyYjF
zc1FYMW9henh2UVNneE5EZ3hOREl3VXlsQlFHaHJQRzlpZDJWa
2RXOWJhWEp6Wkc5YVBHVjNEUW95TXpRME5VRXhNakprYmtFZ2J
HOUJTVTBOQ24xdmRuWnZiR3R2TEVGWElFRTFOalUyTmcwS1Mwa
05DaWc0T0RncFFUVTFOUzAxT0RZMFFYQStKanBCS0RnNE9DbEJ
OVFUxTFRBNE9UZz0='));</script>
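To make the "mutate in tandem" idea concrete, here is a minimal sketch, not the actual implementation. It assumes a random character substitution wrapped in Base64, with a fresh key generated on every page view and a matching decoder emitted for the JavaScript include. The names `ALPHABET`, `shuffled`, and `protect` are hypothetical; `o` is the decoder name from the sample above. It also assumes the whois text is plain ASCII.

```javascript
// Printable ASCII range used as the substitution alphabet (assumption: ASCII-only text).
const ALPHABET = Array.from({ length: 95 }, (_, i) => String.fromCharCode(32 + i)).join("");

// Fisher-Yates shuffle. Math.random is fine for a sketch, not for real cryptography.
function shuffled(text) {
  const chars = text.split("");
  for (let i = chars.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [chars[i], chars[j]] = [chars[j], chars[i]];
  }
  return chars.join("");
}

// Server side: build a fresh key per page view, encode the text, and produce
// the source of the matching decoder that would be served in the include.
function protect(plain) {
  const key = shuffled(ALPHABET);
  const substituted = plain
    .split("")
    .map((c) => {
      const i = ALPHABET.indexOf(c);
      return i === -1 ? c : key[i]; // characters outside the alphabet (CR, LF) pass through
    })
    .join("");
  const payload = btoa(substituted);
  const decoderSource =
    "function o(s){var k=" + JSON.stringify(key) + ",a=" + JSON.stringify(ALPHABET) +
    ",t=atob(s),r='';for(var i=0;i<t.length;i++){var n=k.indexOf(t.charAt(i));" +
    "r+=n<0?t.charAt(i):a.charAt(n);}return r;}";
  return { payload, decoderSource };
}
```

The page would then serve `decoderSource` as the include and write `document.write(o('<payload>'))` into the HTML, so two views of the same record carry different gibberish and a different `o`.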

Given the mutating nature of both the decrypting function and the output itself, my bet is that data miners wouldn't really want to bother trying to mine from a whois site like this. But I guess the question is (if no other exceptions come up for anyone): would miners use "manual" methods of stealing data, such that I'd need to implement a function that replaces e-mail addresses with images (the kind of method that Whois.sc uses)? I'm trying to avoid that method if possible, as it imposes certain server requirements I'd rather avoid. I'm creating a commercial whois solution and I'd like to have as few requirements as possible.

Outside of that, I'm considering "flood" controls, etc. The easiest "protection" is definitely requiring users to "register" to use the tool... but that's not for most casual users, as it raises privacy concerns (unless it's in a forum environment like this one, where this type of protection is already part of the site). Also, I do not necessarily want to RESTRICT people from creating outside links to whois data, and thereby FORCE them to ONLY make queries from my homepage (or not use the tool at all).
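As a rough idea of what a "flood" control could look like without requiring registration, here is a minimal sketch of a per-client sliding window. The limits `MAX_QUERIES` and `WINDOW_MS`, the `recentQueries` map, and the `allowQuery` function are all hypothetical, and it assumes the client address is available to the script.

```javascript
// Hypothetical limits: at most MAX_QUERIES lookups per WINDOW_MS for each client address.
const MAX_QUERIES = 5;
const WINDOW_MS = 60 * 1000;
const recentQueries = new Map(); // client address -> timestamps of recent lookups

// Returns true if this lookup is allowed, false if the client is flooding.
function allowQuery(clientAddress, now = Date.now()) {
  const cutoff = now - WINDOW_MS;
  // Keep only the timestamps that still fall inside the window.
  const kept = (recentQueries.get(clientAddress) || []).filter((t) => t > cutoff);
  const allowed = kept.length < MAX_QUERIES;
  if (allowed) kept.push(now);
  recentQueries.set(clientAddress, kept);
  return allowed;
}
```

Outside links to whois data keep working with this approach, since the check is on query volume rather than on where the request came from.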

So, again... the question is... is client-side data encryption more than likely enough to prevent "mining", or is it likely miners will manually copy & paste results from the page, or use some automated tool to do so? Is "image" replacement the surest means (esp. used alongside the method above)? The more I think about it, it may well be much too easy to create a tool to sneak around my method (using the PC clipboard).

~ Nexus

theparrot · Oct 11, 2004

why would a whois data miner use a web interface to begin with?

also, there are javascript modules available so a robot could be made to mine the data just the same.

Nexus · Oct 11, 2004

theparrot said:
why would a whois data miner use a web interface to begin with?

I get the impression they do this simply to abuse someone else's connection. They'd maybe collect a number of "whois" sites into their software and use them all as remote zombies, without having to worry about their own IPs (or IP blocks) being banned directly by the whois servers. They get to puppet around these other sites until they reach the amount of data they want.

theparrot said:
also, there are javascript modules available so a robot could be made to mine the data just the same.

Doh. Yeah... that was kind of the type of thing I was hoping/not-hoping to hear. If the robot can parse JavaScript, that kind of throws a monkey wrench into the whole plan. I guess sufficiently blurry/tweaked images are the best solution for at least protecting e-mail addresses...

Ran into this site:
http://www.scss.com.au/family/andrew/webdesign/emaillinks/
This explanation hits the nail on the head about my concerns. Thanks Parrot for the note about js parsing being a possibility. Something to think over a bit.

How This Technique Could Be Defeated
The only way I believe this technique could be defeated would be by harvesting software implementing a complete Javascript and DOM interpreter. By running every script on every page, then scanning through the resulting document objects, such a system could easily find anchor tags and the decoded href attributes.

This isn't necessarily difficult as open source browsers (such as Mozilla and Konqueror) provide a ready-to-go engine. All the spammers would need to do is create a modified version of the browser that can spider automatically.

What might stop this technique from being practical is that all the extra processing would significantly slow down harvesting.
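The attack described in this quote does not even need a full browser when the page only calls `document.write`. Here is a minimal sketch of such a robot, under stated assumptions: `includeSource` and `pageHtml` are made-up stand-ins for a protected page (with a trivial reversing decoder rather than the real mutating one), and `inlineScripts` and `renderWrites` are hypothetical helper names. Running untrusted scripts this way is unsafe, so a real harvester would more likely use a sandboxed browser engine.

```javascript
// Hypothetical page pieces: the decoder from the JavaScript include and the inline call.
const includeSource = "function o(s){return atob(s).split('').reverse().join('');}";
const pageHtml =
  "<html><body><script>document.write(o('" + btoa("moc.elpmaxe@ofni") +
  "'));</script></body></html>";

// Pull the bodies of inline <script> elements out of the HTML with a simple regex.
function inlineScripts(html) {
  const found = [];
  const pattern = /<script[^>]*>([\s\S]*?)<\/script>/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) found.push(match[1]);
  return found;
}

// Run the include plus the inline scripts against a stub `document` that
// records everything the page tries to write, then return the decoded text.
function renderWrites(include, html) {
  const written = [];
  const stubDocument = { write: (text) => written.push(String(text)) };
  const source = include + "\n" + inlineScripts(html).join("\n");
  new Function("document", source)(stubDocument);
  return written.join("");
}

console.log(renderWrites(includeSource, pageHtml)); // prints "info@example.com"
```

Because the robot simply executes whatever decoder the page ships, it does not matter how often that decoder mutates.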

~ Nexus

Scott.Mc · Oct 14, 2004

Just on the email thing: I did an experiment recently. It was a crawler to see how long it would take to collect solid site URLs, but I also ran a test at the same time on those pages for emails. In 2 days I got over 33,000 URLs and over 10,000 emails.

Each email record also has a field, masking='yes' or masking='no'.
Only around 400 attempted to mask the email, using one of:
x at x.com
[at]
(at)
Replace x with x

And quite a number of other things. I made my script parse these and attempt to recover the email.
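A script that parses those masks can be very short. This is a minimal sketch, not the crawler described above: the pattern `MASKED_EMAIL` and the function `findMaskedEmails` are hypothetical, and it only handles the " at ", "[at]" and "(at)" forms, not "Replace x with x" instructions. It will also give false positives on ordinary prose such as "look at example.com".

```javascript
// Local part, then one of the masking forms, then a dotted domain name.
const MASKED_EMAIL = /([\w.+-]+)\s*(?:\[at\]|\(at\)|\sat\s)\s*([\w-]+(?:\.[\w-]+)+)/gi;

// Return every masked address found in the text, rewritten with a real "@".
function findMaskedEmails(text) {
  const emails = [];
  for (const match of text.matchAll(MASKED_EMAIL)) {
    emails.push(match[1] + "@" + match[2]);
  }
  return emails;
}

const sampleText = "Contact bruce at example.com or sales[at]example.org, or admin (at) example.net.";
console.log(findMaskedEmails(sampleText));
// prints [ 'bruce@example.com', 'sales@example.org', 'admin@example.net' ]
```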

Now that I have finished my test everything has been cleared, but it's quite amazing how many emails a bot can pick up in a day.

And all I did was start at 1 site and crawl only the index pages. What would they pick up if they crawled each site fully?