02-17-2006, 10:43 PM
i have an idea for a captcha system that i'd like to run by other people before i begin working on it, just to see if there's something i've missed.
my idea is based on the fact that most captchas are either pictures or sound files. the problem with this is that they're not very friendly to people with either mobile devices or special needs (like blind people with screen readers). i came to the conclusion that in order for a captcha to be accessible, but also useful, it would need to be text based, but not easily parseable by a bot. what i came up with follows:
pick a large amount of public domain text. (eg Shakespeare, the Bible, US Constitution, etc.)
randomly select no more than 5 lines from your text
select a random piece of data from within the text (eg: first word, first letter of last word, etc)
create random phrasing asking for identification of the random data. eg:
"Please identify the first word in the last sentence",
"What is the last word in the first sentence?",
"Could you tell me the third word in the third sentence, please?"
it seems to me that such a task would be sufficiently difficult to keep out most bots, but is at the same time accessible to people. does anyone see any problems here, or reasons why i shouldn't go ahead with this?
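The steps in the post above can be sketched roughly as follows. This is a minimal illustration in Python, not the poster's implementation: `CORPUS`, `make_challenge`, and `check_answer` are hypothetical names, and the tiny hard-coded corpus only stands in for a large public domain text.

```python
import random

# Hypothetical stand-in corpus; a real deployment would load a large
# public domain text instead of a handful of lines.
CORPUS = [
    "Shall I compare thee to a summer's day",
    "We the People of the United States in Order to form a more perfect Union",
    "In the beginning God created the heaven and the earth",
    "To be or not to be that is the question",
    "All the world's a stage and all the men and women merely players",
    "Friends Romans countrymen lend me your ears",
]

ORDINALS = ["first", "second", "third", "fourth", "fifth"]

# Assumed phrasings, modelled on the examples in the post.
TEMPLATES = [
    "Please identify the {w} word in the {l} line.",
    "What is the {w} word in the {l} line?",
    "Could you tell me the {w} word in the {l} line, please?",
]


def make_challenge(lines, rng):
    """Return (passage, question, answer) for one text challenge."""
    # Pick no more than 5 lines from the text.
    passage = rng.sample(lines, k=rng.randint(1, min(5, len(lines))))
    # Pick a random piece of data: here, one word of one line.
    line_idx = rng.randrange(len(passage))
    words = passage[line_idx].split()
    word_idx = rng.randrange(min(len(ORDINALS), len(words)))
    # Pick a random phrasing for the question.
    question = rng.choice(TEMPLATES).format(
        w=ORDINALS[word_idx], l=ORDINALS[line_idx]
    )
    return passage, question, words[word_idx]


def _normalize(text):
    """Ignore case, surrounding whitespace, and trailing punctuation."""
    return text.strip().strip(".,;:!?").lower()


def check_answer(expected, given):
    return _normalize(expected) == _normalize(given)
```

A caller would show `passage` and `question` to the visitor, keep `answer` on the server side, and compare the submitted value with `check_answer`.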
02-17-2006, 10:52 PM
In my humble opinion, this is a wonderful idea - just keep it simple :thumbsup:
02-18-2006, 12:54 AM
Well, as long as there are no benefits to writing a bot for that, it should work. The bot would probably even use some of the captcha code, as the captcha would also have to know what it is asking for.
ralph l mayo
02-18-2006, 01:26 AM
Sounds like a lot of work for the db to pick random lines from all that text, and it's not really a hard AI problem unless the directions are hard to parse (i.e., in an image like a regular captcha). How about picking small, easily identifiable icons of everyday things and asking questions about an array of them in a dynamic GD image? I.e., what item is in the 1st row of the 2nd column, which items are there the most of, what do these items have in common, whatever. It'll be harder AI and probably more technically manageable.
Without distortion or a large bank of pictures it's still pretty easily breakable, but you're going to have to balance technical complexity, user annoyance, accessibility, and attack resistance in appropriate proportions.
02-18-2006, 08:39 PM
marek: i thought about the possibility of bots using the captcha code too. i think that can be mitigated by making the instructions sufficiently random, and sufficiently complex
ralph: reverting to media files, either pictures or anything else, defeats the entire purpose. media files don't work as well for people with alternate browsers. that was the entire reason for doing it text based.
also, why would pulling a random entry from a DB be hard? i don't understand what you mean by that.
ralph l mayo
02-18-2006, 09:01 PM
Text based protections are inherently defeatable by regex and will only keep out the least inclined attacker. At this point I think users that can't either see an image or listen to a short sound clip are negligible. They can buy a new treo or install fluxbox or something. :) You can't design for 100 percent of the population and if you could the last thing you would implement would be a captcha. While I'm sure all 17 lynx users appreciate your position, captchas limit access *by definition*, and if they're not annoying or discouraging to normal users they aren't to attackers either.
Random access from a large set of data is expensive because of the way MySQL (and possibly other RDBs) handles it internally, which essentially results in a test for every row. To get around it you can select the count and do your own random key, which I guess really isn't too bad.
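The "select count and do your own random key" workaround might look something like the sketch below. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for MySQL, and the table name `corpus_lines` and the helper names are made up for illustration.

```python
import random
import sqlite3


def build_demo_db(n_rows=1000):
    """Create a throwaway in-memory table of text lines (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE corpus_lines (id INTEGER PRIMARY KEY, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO corpus_lines (body) VALUES (?)",
        ((f"line number {i}",) for i in range(n_rows)),
    )
    return conn


def random_line(conn, rng):
    """Pick one row by counting rows and choosing the offset ourselves,
    instead of asking the database to randomly order the whole table."""
    (count,) = conn.execute("SELECT COUNT(*) FROM corpus_lines").fetchone()
    offset = rng.randrange(count)
    row = conn.execute(
        "SELECT body FROM corpus_lines ORDER BY id LIMIT 1 OFFSET ?",
        (offset,),
    ).fetchone()
    return row[0]
```

The offset query still has to walk past the skipped rows, so if the ids are known to be dense it may be cheaper to pick a random id and look it up directly.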
02-18-2006, 09:43 PM
marek: i thought about the possibility of bots using the captcha code too. i think that can be mitigated by making the instructions sufficiently random, and sufficiently complex.
But the CAPTCHA mechanism must be able to generate that random thing and know which word it points to. This will become quite limited, as the number of possibilities will be the number of different questions the CAPTCHA can generate, not the length of the text it uses.
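To put a rough number on that point: with a template-based generator, the question space is just the product of the template count and the position choices, however long the source text is. The templates and position lists below are assumptions for illustration, not anything specified in the thread.

```python
from itertools import product

# Assumed question templates and position vocabularies.
TEMPLATES = [
    "Please identify the {w} word in the {l} sentence.",
    "What is the {w} word in the {l} sentence?",
    "Could you tell me the {w} word in the {l} sentence, please?",
]
WORD_POSITIONS = ["first", "second", "third", "last"]
LINE_POSITIONS = ["first", "second", "third", "fourth", "fifth", "last"]


def all_questions():
    """Enumerate every distinct question the generator can ever emit."""
    return {
        template.format(w=w, l=l)
        for template, w, l in product(TEMPLATES, WORD_POSITIONS, LINE_POSITIONS)
    }


# 3 templates * 4 word positions * 6 line positions
print(len(all_questions()))  # → 72
```

A bot only needs to recognise these question forms; the size of the text the lines are drawn from does not add to the count.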
02-19-2006, 12:03 AM
I like your reasons for doing this, but at the end of the day the more accessible it is to users with varying disabilities, the more open it is to parsing by bots... semi-defeating the purpose in the process.
The reason for this is that assistive technologies are merely user-directed bots. If their bot can make sense of it, so can a bot built with malicious intentions.
The only way around the above problem is by (as you say) "making the instructions sufficiently random, and sufficiently complex".
Now we're not only blocking users on the basis of their physical ability, but also on the basis of their cognitive ability. ouch :(
In my opinion, all CAPTCHAs and usability are mutually exclusive to a degree, regardless of the CAPTCHA type. This is because no matter where you move the tipping point of "block users vs. block bots", it will always exist. Healthy balances can be found for most situations, but I believe the site's/app's user base needs to be thoroughly researched prior to implementing something like this. For that reason, I have difficulty supporting the idea of any 'one size fits all' solution :)
I like your solution because it moves the usability tipping point further in favour of users, but in doing so, also makes the CAPTCHA much less effective at blocking machines.
As Ralph stated earlier, you got text, we got regex. Everything you can output, we can deconstruct ;)
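As a rough illustration of the regex argument, here is a sketch of how a bot might answer the three example phrasings from the first post. The pattern, the `ORDINALS` table, and `solve` are hypothetical, and it only covers questions of the "Nth word in the Mth sentence" shape.

```python
import re

# Map ordinal words to list indexes; "last" maps to Python's -1.
ORDINALS = {
    "first": 0, "second": 1, "third": 2,
    "fourth": 3, "fifth": 4, "last": -1,
}
_ORD = "|".join(ORDINALS)

# Matches e.g. "third word in the third sentence", with "the" optional.
QUESTION_RE = re.compile(
    rf"\b({_ORD})\s+word\s+in\s+(?:the\s+)?({_ORD})\s+(?:sentence|line)\b",
    re.IGNORECASE,
)


def solve(passage_lines, question):
    """Return the requested word, or None if the question is not understood."""
    match = QUESTION_RE.search(question)
    if match is None:
        return None
    word_pos = ORDINALS[match.group(1).lower()]
    line_pos = ORDINALS[match.group(2).lower()]
    try:
        return passage_lines[line_pos].split()[word_pos]
    except IndexError:
        return None


passage = ["alpha beta gamma", "delta epsilon zeta", "eta theta iota"]
print(solve(passage, "Please identify the first word in last sentence"))  # → eta
print(solve(passage, "What is the last word in the first sentence?"))  # → gamma
```

Each new phrasing costs the attacker one more pattern, which is why the variety of the instructions matters more than the variety of the text.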
02-19-2006, 02:09 AM
I dunno, I like the idea, I don't like captcha, I am fully sighted (and not entirely dense ;)) and I still get captchas wrong sometimes if the image fonts are weird enough (or the distinction between lower and upper case is fuzzy).
If you were echoing a random quote and then asking for a random $x'th word of that quote for verification and you were using an image for the question and the quote ... I think that would be hard enough to put off most bots?
02-19-2006, 03:27 AM
well, i'm pretty sure you guys are over-estimating what can be done with Regexes. there's plenty of things that are hard to do with pattern matching. the real problem is in keeping the instructions from being repetitive, like marek says: if the bank of questions you build from is too limited, then a bot could brute force for question samples, and then build a DB of questions to work off of. the bulk of the code in this project would be randomly generating instructions that are simple enough that you could follow them as they were read aloud to you, but that are too difficult for a machine to follow. i think that this is difficult, but possible.
i'd still like to give it a try at any rate, though for all i know, it could be beyond my coding ability; i've never implemented a decision tree before. i'll update here as the project progresses