The Problem With URLs

URLs are simple things. Or so you'd think. Let's say you wanted to detect a URL in a block of text and convert it into a bona fide hyperlink. No problem, right?


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2008/10/the-problem-with-urls.html

/* As some people have already said, the find-URLs problem is trivial and can be solved very easily. Notice that this solution uses no whitelists (Jeff's new favorite buzzword), so it can handle pretty much anything you throw at it: Unicode, ftp://, whatever. The only trouble spots are "Did you mean http://example.com?" and URLs containing brackets (but see http://en.wikipedia.org/wiki/Template:Bracketed). */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

void extractURL(const char *text, int *start, int *length)
{
#define RETURN(a,b) do { *start = (a); *length = (b); return; } while (0)
    const char *t = text;
find_next_colon: ;
    const char *colon = strstr(t, "://");
    if (colon == NULL)
        RETURN(-1, -1);
    /* Get the preceding protocol ID; e.g., "http" or "ed2k". */
    const char *s = colon;
    while (s > text && isalnum((unsigned char)s[-1])) --s;
    if (s == colon) {
        /* We have "$@#://http://example.com". Keep going. */
        t = colon + 3;
        goto find_next_colon;
    }
    *start = (int)(s - text);
    s = colon + 3;
    /* URLs end with whitespace, quotes, or brackety things. Unbalanced
     * parentheses also end the URL; consider "(at http://example.com)"
     * as opposed to "(http://example.com", for example. Note that strchr
     * also finds the terminating '\0', so the loop stops at end of string. */
    int parens = 0;
    while (!isspace((unsigned char)*s) && !strchr("<>\"[]{}", *s)) {
        if (*s == '(') ++parens;
        if (*s == ')') --parens;
        if (parens < 0) break;
        ++s;
    }
    /* Consider "http://en.wikipedia.org/wiki/Bang!". I've
     * decided arbitrarily that ! and ? may end a URL, but
     * we must correctly handle "I like http://example.com." */
    if (strchr(".,:;", s[-1])) --s;
    if (s == colon + 3) {
        /* Reject "http://" with nothing following it. */
        t = colon + 3;
        goto find_next_colon;
    }
    /* Accept the rest. */
    RETURN(*start, (int)(s - text) - *start);
#undef RETURN
}

/* For testing. */
int main(void)
{
    char buffer[1000];
    while (fgets(buffer, sizeof buffer, stdin) != NULL) {
        const char *text = buffer;
        int start = 0;
        int len = 0;
        while (1) {
            extractURL(text, &start, &len);
            if (start == -1) break;
            printf("%d %d: %.*s\n", start, len, len, text + start);
            text += start + len;
        }
    }
    return 0;
}

What Barry Kelly said - you’re never going to get it to be Completely Right with a regexp (especially, as others have said, with a non-Roman or even non-low-ASCII alphabet), and the regexp will rapidly, as you try, become completely write-only gibberish.

Use a real parser, in the form of an actual grammar.

(The "99% close enough" solution linked by a previous commenter of course fails utterly on the non-Roman/low-ASCII case by pretending that a-zA-Z0-9 is sufficient to recognise a name.

If one is willing to live with assuming that users will never want to link anywhere that uses an umlaut, an accent, or a non-Roman character, one is certainly free to… but that’s probably a bad idea.)

(Is the Anonymous Coward with the C code deliberately writing obfuscated code for some reason? Or is it long-term C exposure that makes people think that’s good style?

But at least that has the advantage of being an actual parser - if an ugly one - rather than trying to fit everything into a regular expression.)
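
For what it's worth, here is a rough Python sketch of the same scanning idea the C code above uses, just written for readability; the function name and the test string are illustrative assumptions, not anything from the original comment:

def extract_urls(text):
    """Scan for scheme://... candidates the way the C parser above does:
    back up over the scheme, then walk forward tracking paren balance."""
    urls = []
    i = text.find("://")
    while i != -1:
        start = i
        while start > 0 and text[start - 1].isalnum():
            start -= 1                      # back up over "http", "ftp", ...
        if start < i:                       # a scheme was found, not a bare "://"
            j = i + 3
            parens = 0
            while j < len(text) and not text[j].isspace() and text[j] not in '<>"[]{}':
                if text[j] == "(":
                    parens += 1
                elif text[j] == ")":
                    parens -= 1
                    if parens < 0:          # unbalanced ")" ends the URL
                        break
                j += 1
            if j > i + 3 and text[j - 1] in ".,:;":
                j -= 1                      # drop trailing sentence punctuation
            if j > i + 3:
                urls.append(text[start:j])
        i = text.find("://", i + 3)
    return urls

print(extract_urls("(see http://example.com/foo_(bar), or ftp://host/x.)"))
# ['http://example.com/foo_(bar)', 'ftp://host/x']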

You could do something like this: <http://website.com>

This URI is wrapped in angle brackets, and yet the parser recognizes the closing angle bracket as part of the URI.

Perfect example of another broken parser. :smiley:

Yup, I am doing a little project for myself, looking at duplicate sites: http://www.google.com http://google.com google.com ftp://google.com etc. Regex, IndexOf, and Substring are all needed to check the site.

Cheers, Sarkie.

URL extraction can indeed be extremely troublesome.
Concerning your point 1: in the age of IDNs, the character-whitelist approach is at least problematic.

Are you really advocating avoiding the use of perfectly valid characters in URLs, just because they make a URL difficult to identify in code?

Many websites use regular expressions to validate email addresses, and these too will often fail to correctly identify perfectly valid email addresses. Would you recommend the victims of these coding failures just change their email address?
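
To make that concrete: the pattern below is a hypothetical stand-in for the kind of validation many sites use (an assumption for illustration, not any particular site's rule), and it rejects addresses that are perfectly valid.

import re

# A hypothetical "typical" validation pattern, assumed for illustration only.
naive = re.compile(r"^[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$")

# All of these are valid addresses, yet the naive pattern rejects every one:
for addr in ("user+tag@example.com",      # plus addressing in the local part
             "o'brien@example.ie",        # apostrophe is legal in the local part
             "someone@example.museum"):   # top-level domain longer than 4 letters
    print(addr, bool(naive.match(addr)))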

You tried to solve a problem using regular expressions…and then you had two problems.

Sorry - couldn’t resist.

But why the dilemma of telling people to escape their parentheses? Square brackets aren’t legitimate characters in URLs from what you’ve stated, yes? So…

My website [http://www.example.com] is awesome.

…should work just fine.
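
A minimal sketch of that idea, assuming square brackets are reserved as URL delimiters (the pattern and sample text here are illustrative only):

import re

text = "My website [http://www.example.com/foo_(bar)] is awesome."

# Brackets are not legal unescaped URL characters, so a bracketed URL can be
# pulled out unambiguously even when it contains parentheses or punctuation.
bracketed = re.compile(r"\[(https?://[^\]\s]+)\]")
print(bracketed.findall(text))   # ['http://www.example.com/foo_(bar)']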

I suppose this post should demonstrate whether your regex works as expected :slight_smile:

Can’t forget https, ftp and file URLs.

I display my latest Twitter entry on my homepage and decided to use the following to parse the text:

preg_replace('{\b(https?|ftp|file)://[-A-Za-z0-9+@#/%?=~_|!:,.;]*[-A-Za-z0-9+@#/%=~_|]\b}', '<a href="\0">\0</a>', substr($item['title'], 10));

I wrote this only yesterday and completely forgot about parens.

There are a few comments suggesting that you validate the URL by requesting it and checking whether you get a 404. There are a few reasons against this:

  1. Many dynamic sites written by newer coders won't give you a 404 if you request a bad page. E.g., take http://example.com/page?id=4 , tack a bracket on the end, and you likely get a bad ID. The page would tell you so, but you won't get a 404.

  2. It could open both you and a poor target up to a DoS attack. Imagine someone submitting a post with 1,000,000 references to http://example.com .

Umm, you are actually missing all URLs out there that contain non-ASCII characters in their domain names, which are perfectly valid: http://en.wikipedia.org/wiki/Internationalized_domain_name

On a side note, comments posted here don't respect RFC 2396.
Example: <http://www.example.com> – the trailing angle bracket gets included in the URL.

While I usually agree with your points, this one leaves me baffled: by definition, automating semantic extraction from text without a context-aware parser is not possible, so auto-linking will always be far from working perfectly as intended by the user.

The point of the problem is right there: as intended.
Users are not required to know how to format a perfect href tag, nor is it desirable to allow rendering HTML through custom text, but users should know how to play by the rules. If they want an autolink, they had better know that they can't use spaces, because spaces are treated as a link boundary, and that spaces should be escaped as %20.

As a solution, I'd prefer a live or batched preview that lets users test their links before posting. Enabling links during writing lets users see how the boundary system works and avoid mistakes.

A better heuristic for extracting URLs would be to use a stronger pattern formalism than regular expressions, such as a context-free grammar. Since URLs in text are generally produced by humans, you can expect them to be highly unlikely to contain unbalanced parens. Regular grammars can't express this constraint, but a context-free grammar can.
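
Short of a full grammar, the same balance constraint can also be bolted onto a regex match as a post-processing step. A rough sketch, where the candidate pattern and the test string are assumptions for illustration:

import re

# Permissive candidate pattern; the paren balancing is done afterwards.
candidate = re.compile(r'https?://[^\s<>"]+')

def balanced_urls(text):
    for m in candidate.finditer(text):
        url = m.group(0).rstrip(".,;:")                # trailing punctuation
        # Trim trailing ')' while the URL has more ')' than '(' characters.
        while url.endswith(")") and url.count(")") > url.count("("):
            url = url[:-1]
        yield url

print(list(balanced_urls(
    "See http://en.wikipedia.org/wiki/Tbilisi_(city) "
    "(and also http://example.com).")))
# ['http://en.wikipedia.org/wiki/Tbilisi_(city)', 'http://example.com']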

What's up with domains with umlauts?
They're perfectly legal by now and work in all modern browsers, as fs111 correctly stated. And they are already in use here in Germany.

Example (does not actually exist, but could, and would be valid):
http://www.mllrr.de/
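
For anyone wondering how such a domain actually travels over the wire: IDNs are converted to an ASCII "xn--" (Punycode) form. A quick Python illustration, using a made-up umlaut hostname rather than the example above:

# The Unicode form is what users type and paste; the "xn--" form is what
# goes into DNS. An ASCII-only whitelist only ever matches the latter.
host = "www.müller.de"          # hypothetical umlaut domain, for illustration
print(host.encode("idna"))      # b'www.xn--mller-kva.de'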

Since conscientious users use the preview feature, the URL detection can be minimal and we can propose a specific syntax for exceptional cases.

Well maybe we need a preview here also :slight_smile:

Great, how does that work with the international characters allowed in domain names recently?

Actually, this problem can be solved with a single regular expression, although it’s not an easy one. I have split the regex over several lines for clarity:

(?<=\()
\bhttp://[-A-Za-z0-9+@#/%?=~_()|!:,.;]*[-A-Za-z0-9+@#/%=~_()|]
(?=\))
|
(?<=(?<wrap>[=~_|#]))
\bhttp://[-A-Za-z0-9+@#/%?=~_()|!:,.;]*[-A-Za-z0-9+@#/%=~_()|]
(?=\k<wrap>)
|
\bhttp://[-A-Za-z0-9+@#/%?=~_()|!:,.;]*
[-A-Za-z0-9+@#/%=~_()|]

This will match any URL that is surrounded by parentheses, and also any that is surrounded by one of the following characters: '=', '~', '|', '_', '#'.
Of course, it will fail in some very borderline cases, but I think it matches 99.9% of URLs entered by users.

I think this is one of those situations where, as you've stated, you can't get a solution to fit all cases. Therefore you have to take a pragmatic approach. The most pragmatic, I think, is to not allow () in URLs; in the 0.1% of URLs that do have (), people can easily cut and paste the URL rather than clicking.

Here is the simple Python snippet I use to auto-link URLs:

import re

r = r"((?:ftp|https?)://[^ \t\n\r()']+)"
comment = re.sub(r, r'<a rel="nofollow" href="\1">\1</a>', comment)

This is why I always write URLs on a separate line: not only because it's more likely that any automatic link creator will detect them, but also because they're easier to select and copy. At least I always leave a space between the URL and any punctuation.

Visit my website at http://www.example.com, it’s awesome!
is hard to select…

Visit my website at http://www.example.com , it’s awesome!
is better but typographically wrong…

I prefer:
Visit my website, it’s awesome:
http://www.example.com