Re: TFCentral Renewal Project
Old thread, but it bears discussion.
I was surprised at how effective two little questions on your old form were at flagging potential bots, even though that wasn't the original intent.
"What is your time zone?" This used a pull-down list whose default answer was somewhere in eastern Siberia. Real humans don't usually leave the default, and almost none of our applicants would actually be living in eastern Siberia.
"What is your favorite thing to transform into?" (or something like that). That was a required question, and the correct answer isn't any specific thing; what matters is that the answer isn't just a copy of your username.
If you asked someone what country they were from, you could couple that with the timezone answer, but the danger is that somebody could be from Belgium (for example) but be part of a UN peacekeeping force located in some other part of the world. So there's no guarantee the "where are you from" and the timezone would match. If they don't, you could flag the account for personal scrutiny before enabling it. Still, if those kinds of questions were even a little bit common, a bot-master might bother to program in the correct answers each run. So even bots might get the answers to match.
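As a rough illustration of the country/timezone cross-check, here is a minimal Python sketch. The lookup table, the function name and the choice to send unknown countries to review are all assumptions made for the example, not anything the old form did. A mismatch only flags the account for a human to look at; it never rejects by itself.

```python
# Hypothetical lookup table: country code -> plausible UTC offsets in hours.
# Only a few illustrative entries; a real form would need a complete table.
PLAUSIBLE_OFFSETS = {
    "BE": {1, 2},   # Belgium: CET in winter, CEST in summer
    "GB": {0, 1},   # United Kingdom: GMT / BST
    "JP": {9},      # Japan: JST all year
}

def needs_manual_review(country_code, utc_offset_hours):
    """Return True when the claimed country and time zone disagree."""
    expected = PLAUSIBLE_OFFSETS.get(country_code.upper())
    if expected is None:
        # Assumption: a country we know nothing about goes to a human.
        return True
    return utc_offset_hours not in expected
```

A Belgian peacekeeper posted abroad would be flagged here, which is exactly why the result should lead to scrutiny rather than automatic rejection.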
I think one key is to ask a few required questions that use a text box rather than a pull-down menu and that have no real right or wrong answer. The answers given will be a strong clue as to whether the prospective member is a human or a bot. When link spammers program their bot, they won't take the time to plug in a bunch of different answers each run for our special questions. And maybe that's another trick: label the form elements with something non-obvious. Here's how it might work:
What username would you like? [__________]
... <other stuff> ...
What is the thing you'd most like to transform into? [_________]
If you could be any food object, what would it be? [_________]
What is your favorite time of the day, week or year? [_________]
Why is it your favorite? [___________________]
Now a human will sit and think of answers, even if they are flip and silly. A link spammer programming a bot may not take the time to set up their bot to do something clever with these answers or to come up with a unique answer for each of these questions. This is how a bot might answer:
What username would you like? [_BananaStan_]
... <other stuff> ...
What is the thing you'd most like to transform into? [_BananaStan_]
If you could be any food object, what would it be? [_BananaStan_]
What is your favorite time of the day, week or year? [_BananaStan_]
Why is it your favorite? [_BananaStan________]
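The bot pattern above could be caught with something like the following Python sketch. The function names are made up for the example, and counting both "same as the username" and "same as an earlier answer" as an echo is an assumption about what is worth flagging.

```python
def normalize(text):
    """Lower-case the text and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def echo_fraction(username, answers):
    """Fraction of free-text answers that repeat the username
    or repeat an earlier answer (0.0 = none, 1.0 = all of them)."""
    if not answers:
        return 0.0
    name = normalize(username)
    seen = set()
    echoes = 0
    for answer in answers:
        key = normalize(answer)
        if key == name or key in seen:
            echoes += 1
        seen.add(key)
    return echoes / len(answers)
```

For the BananaStan form every answer is an echo, so the fraction comes out at 1.0, while a human's four silly but different answers give 0.0.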
Probably the best, most telling question of all is one you need to ask of the web server, and it needs to record the answer: "How many seconds did this user take from page download until they punched the 'submit' button?" Some answers are only possible if the potential member is a bot--for example, "1.8 seconds". That is especially true if you add a few required free-form questions that most humans need to think about for a bit.
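One way the server might record that answer, sketched in Python: put a signed timestamp in a hidden form field when the page is served, and check it on submit. The secret, the 10-second threshold and the function names are assumptions for the example; a server-side session would work just as well.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"change-me"     # hypothetical server-side secret
MIN_HUMAN_SECONDS = 10.0      # assumed threshold; tune it from real data

def _sign(issued):
    return hmac.new(SECRET_KEY, issued.encode(), hashlib.sha256).hexdigest()

def issue_token(now=None):
    """Value for a hidden form field, created when the page is served."""
    issued = repr(time.time() if now is None else now)
    return issued + ":" + _sign(issued)

def seconds_to_submit(token, now=None):
    """Elapsed seconds since the page was served, or None if the
    token is missing or has been tampered with."""
    issued, _, sig = token.partition(":")
    if not hmac.compare_digest(sig.encode(), _sign(issued).encode()):
        return None
    current = time.time() if now is None else now
    return current - float(issued)

def too_fast(token, now=None):
    """True for forged tokens and for implausibly quick submissions."""
    elapsed = seconds_to_submit(token, now)
    return elapsed is None or elapsed < MIN_HUMAN_SECONDS
```

Signing the timestamp matters because a bot can put anything it likes in a hidden field; without the signature it could simply claim the page was loaded a minute ago.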
Another system question that should be recorded is: "What page did this user come from to get to the membership form?" The correct answer is one of a few pages on this site that link to the form, but a very incorrect answer is "unknown". "Unknown" almost always means the user came from a bookmark. I doubt a bot in production mode will traverse the website before it finds the form and fills it in. Bots work from a list of target pages... a list of bookmarks.
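In HTTP terms this is the Referer header that the browser sends with the request. A minimal Python sketch of the check follows; the host name and the list of linking pages are placeholders, not the real site layout. Keep in mind that the header can be forged, and some privacy tools strip it, so this is one signal among several rather than proof.

```python
from urllib.parse import urlsplit

SITE_HOST = "forum.example"                      # hypothetical host name
LINKING_PATHS = {"/", "/index.php", "/faq.php"}  # hypothetical pages that link to the form

def classify_referer(referer):
    """Return 'expected', 'unexpected' or 'unknown' for a Referer value."""
    if not referer:
        return "unknown"          # no header at all: bookmark, bot list, or stripped
    parts = urlsplit(referer)
    if parts.hostname == SITE_HOST and (parts.path or "/") in LINKING_PATHS:
        return "expected"
    return "unexpected"
```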
You could also ask the system what kind of OS and browser the user is using, but bots lie so I doubt it'd be of much use.
Another possibility is to figure out which kinds of common plugins the user has installed--plugins no bot would ever have. This is probably beyond the scope of the project.
There are several different ways the system could ask the potential member what their IP address is. When you compare the answers you can discover: they have a direct connection to the Net; they are behind a caching proxy; they are behind an anonymizing proxy; they are behind a router doing NAT; they are behind a firewall doing script filtering; and other answers. Such techniques are beyond the scope of this project. It'd be a lot of extra work, but one can look through even an anonymizing proxy to get the true IP address. With this information a blacklist would work. Even the fact that a site visitor is behind an anonymizing proxy is enough that I'd disallow the account--human or not.
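The simplest of those comparisons, sketched in Python, is between the address the connection actually came from and what the X-Forwarded-For and Via request headers say. The function name and the wording of the verdicts are made up for the example. This only catches proxies that announce themselves; an anonymizing proxy normally strips these headers, and a client can forge them, so it does not cover the harder cases mentioned above.

```python
import ipaddress

def describe_connection(remote_addr, headers):
    """Compare the socket address with proxy-related request headers.

    `headers` is a plain dict of HTTP header names to values."""
    lower = {name.lower(): value for name, value in headers.items()}
    forwarded = lower.get("x-forwarded-for", "")
    via = lower.get("via", "")
    if not forwarded and not via:
        return "no proxy announced"
    # The first X-Forwarded-For entry is the client as the proxy saw it.
    claimed = forwarded.split(",")[0].strip()
    try:
        client = ipaddress.ip_address(claimed)
    except ValueError:
        return "proxy announced, client address withheld"
    if client.is_private:
        return "proxy announced, client on a private network"
    if str(client) == remote_addr:
        return "addresses agree"
    return "proxy announced, client address " + str(client)
```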
I can think of at least another half-dozen tricks to ferret out the differences between humans and bots. The captcha is supposed to do this, but the first key is to remember what the captcha is supposed to be: a Turing test. Once you free yourself from "captcha" and start thinking "Turing test" you can come up with all kinds of ideas--most of them invisible and running in the background--to discover if you're dealing with a human or a machine.
It probably wouldn't hurt to throw in a captcha too, but because many humans get it wrong if it's hard, I'd not put too much faith in the answer. However, it might be invaluable in convincing a bot-master who is manually scouting out your site that the captcha is the test to beat, so that they forget "Turing test" and think "captcha". Later, when their bot keeps failing at generating new accounts, they may waste a lot of time fiddling with getting their bot to read the captcha.
The second key is that the answer is never black or white. I'd assign a weight to each test and then assign a score to each answer. Multiply them out and add them up. Let's say a total score of 0 is almost certainly a bot (maybe a 99.999% chance), and 100 is almost certainly a human. A human will probably get a high score, like 80 to 100. A bot will probably get a score of 0 to 20. If someone gets a really great score, I'd activate their account immediately.
However, if they get a lower score, take them to a message page that says something like, "Due to the high number of bot-generated accounts, all applications will be reviewed by one of our admins before we activate your new account." If the score is below 20, toss the information and don't bother the admin. If it is between 20 and 80, bother the admin.
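The weighting and the three outcomes might be sketched in Python like this. The test names and weights are invented for illustration, and treating a score of exactly 80 as good enough to activate is an assumption, since the boundary isn't spelled out above.

```python
# Hypothetical tests and weights. The weights sum to 100, so the total
# lands on the 0-100 scale described above.
WEIGHTS = {
    "time_on_form": 35,
    "referer": 20,
    "free_text_answers": 30,
    "timezone": 10,
    "captcha": 5,
}

def total_score(scores):
    """Weighted sum of per-test scores, each from 0.0 (bot-like)
    to 1.0 (human-like). A test with no result counts as 0."""
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())

def triage(score, low=20, high=80):
    """Map a 0-100 score to what happens to the application."""
    if score < low:
        return "discard"    # almost certainly a bot; don't bother the admin
    if score < high:
        return "review"     # show the 'an admin will review' page
    return "activate"       # almost certainly a human
```

For example, an applicant who took a sensible amount of time and picked a believable time zone, but arrived with no referer, failed the captcha and gave half-convincing free-text answers, scores 35 + 15 + 10 = 60 and lands in the review pile.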
Well, TF Central is a nice laboratory for this sort of experimentation, but I don't have access to the system and it'd probably be a bad idea to go hacking around in the vBulletin code base because it'd take the site up and down for weeks until the code was fully written and debugged. Not good.
On one of my personal sites I have a little "contact me" form which has been discovered by an e-mail spammer who's been using a bot. I'm not quite sure what the value is of hitting one website to e-mail spam one person, but they do. As the e-mails are clearly marked as having come from my form, and my contact form destroys any HTML tags, I've been kind of lazy about fixing the problem. I think that little form can become my laboratory.
Scotty