Quick points:
Eam It is easy for a user to make a password 20x better. 20 26, which means you only have to add 1 random lowercase letter to a password to do that. Or, 20 2^5 which means you need less than 5 extra bits of entropy, which is going to be only 3-4 letters (worstcase) in a passphrase.
Mike What Eam said - never use a custom hash function. Use an off the shelf one. (Look up Schneier’s law. This applies to hash functions as much as crypto algorithms. Anything custom is almost certainly going to be much worse than a published, peer-reviewed and well-attacked algorithm.)
Jeff, I don’t know of an existing tool to do this, but…
Take a good size corpus of English text and generate a Markov model of it.
As you create the model, if you come across a word that matches a name in your Cracker’s Name Database (e.g. “John” or “Smith”), or just a word starting with a capital letter that is not at the beginning of a sentence, tag it somehow in the model with “name”. Do the same for numbers - tag them so you can easily tell they’re a number. Dates would be good too. If you can grab “January 15 2005” as a single token and mark it as a date, brilliant! Ditto times.
Create indices for your Markov model for previous word and order of probability. You want to be able to look up by previous word in O(1), and by order of probability in O(1).
That can be done beforehand.
Then, to generate passphrases, go through all one-word passphrases, starting with the word most likely to appear at the start of a sentence, and finishing when you get to the word least likely to appear at the start of a sentence (but still non-zero possibility) Then two-word passphrases; Start with the most common word, and it’s most common successor, etc…
When you hit a “name” token, first try that name, but then try all the other names in your Cracker’s Name Database. Ditto numbers - try the number first, and then numbers “close to” that number, and/or round numbers near that number. Ditto dates.
Add leet. Try the, say, 10 most common leet substitutions on words. Say for an average 4 letter word, 2 of those could be leetable, giving you 4 possible combinations. Try the, say, 16 most likely leet substitutions for each word? Then ALL-UPPER, CaMeL1, cAmEl2 and Initial Caps.
You could also try replacing_spaces_with_underscores and removingspacesaltogether.
16 leet substitions + all those gives me an extra 1024 combinations per word to try, but meh. There aren’t that many words.
Let’s say we’ve got a vocabulary of 10,000 words. How many of them are valid (p 0.001?) at a given point? Um - I’m going to guess at 1,000 here. I’ve no idea how accurate that might be.
So, that’s:
1,000,000 1-word phrases (negligible)
1,000,000,000 2-word phrases (5 seconds)
1,000,000,000,000 3-word phrases. (5000 seconds ~ 1.4 hours)
1,000,000,000,000,000 4-word phrases (5,000,000 seconds ~ 57 days)
Hmmm…and cracking your 6-letter passphrase will take a million times longer than that.
OK, that’s starting to take a long time. Maybe moving on to longer passphrases before exhausting all shorter ones might help. Or maybe there are less than 1000 valid words at any given point. But that’s the idea. But it will certainly get your passphrase before a brute-force alphanumeric would. 63^28 (~10^50) tests by my calculation, using upper, lower, numbers and space, compared to 10^24 tests for mine.