The Problem With URLs

Nice, I’ll have to update my JavaScript solution to include parens:

http://knol.google.com/k/adam-eivy/javascript-html-format-links-in-text/2a9qcf9a3ig0u/14#

I would also suggest that everyone avoid the comma (,), even though it is a legal URL character.

It is not uncommon to find <a href="http://www.example1.com/test.html">http://www.example1.com/test.html</a>, <a href="http://www.example2.com/test2.html">http://www.example2.com/test2.html</a>,

I recently tackled this problem myself, and came to a similar regular expression (minor differences, and I think that Jeff’s is better) plus some additional parsing to handle edge-cases and prevent the regular expression from becoming a complete mess. I think that every single person in the comments assumed that this problem is related solely to message boards. It’s not. There are plenty of reasons that you might want to linkify text. You might be writing a web-based e-mail client. You might be writing a client for a chat or IM protocol. You might be trying to turn flat text files into something slightly more presentable on the web. In none of these cases can you reasonably expect the text to contain easy-to-parse URIs with bbcode-style tags or spaces surrounding the link text.
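
Something along these lines, though this is a simplified sketch rather than the actual code (the function name and the exact regex are only illustrative):

```javascript
// A minimal linkify sketch (illustrative only): wrap bare http(s) URLs in
// anchor tags, trimming trailing punctuation that is more likely to be prose
// (commas, periods, semicolons) than part of the URL.
function linkify(text) {
  return text.replace(/\bhttps?:\/\/[^\s<]+/g, function (url) {
    var trailing = '';
    var m = url.match(/[.,;:!?]+$/);
    if (m) {
      trailing = m[0];
      url = url.slice(0, -trailing.length);
    }
    return '<a href="' + url + '">' + url + '</a>' + trailing;
  });
}
```

That gets commas and periods right; parentheses are where the additional parsing comes in.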

In my case, it was a personal project relating to a web interface for searching and viewing IRC logs. Lots of links get posted in IRC. The IDN problem is a non-issue, but other problems of parsing certainly are not. And I’m neither capable of enforcing URI standards, nor would I want to if I could.

As others have pointed out, all we can do is get good enough. I had to accept that my algorithm was going to make mistakes, and move on. But what struck me most while reading through the comments were the people who either a) assumed that the problem is simple (discounting those edge cases that are becoming more and more common on the web) or b) were stuck in their own little world where the problem can be solved by waving a big stick at your users. On a coding forum, I was quite surprised at the number of assumptions that people made about the types of situations in which this becomes useful.

RFC 2396 has a Recommendations for Delimiting URI in Context section that talks about how URIs should be set off from surrounding text. Not everyone follows that part, though.
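
If I remember the appendix right, it recommends wrapping URIs in double quotes or angle brackets when they appear in running text, e.g. See my site (at <http://example.com/wiki/Foo_(bar)>), which would sidestep the trailing-punctuation guessing entirely. Hardly anyone actually writes links that way, though.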

Dude, that’s so easy, just have your server try to access the URL with and without the final paren, and see which one actually works :wink:

Would it be a good approach to have the autolinker request the potential URL and see if it comes back with a 404?

Then you could spot bad URLs and ask the poster to fix them.

Damn jwickers types faster.

jwickers, Graham Stewart - please take a moment to consider the malicious uses for this idea.

Here are just two ideas for exploiting a ‘validating autolinker’:

  1. Create a DoS condition on the host, or on a third-party site, by passing in hundreds or thousands of URLs that need to be tested, potentially executing expensive (in resource terms) requests.

  2. Access and modify protected resources which are only accessible from ‘inside’ the firewall (management sites, router configuration settings, and many other things)

Always assume that any input you receive is a deliberate attempt to exploit or subvert your application. Certainly validate that the input is legal, but you should not automatically request unknown third-party resources without significant constraints around it.
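
To be concrete, before a server ever fetches a user-supplied URL it needs at minimum something like the following (a rough Node.js sketch with hypothetical helper names, IPv4 only, and nowhere near a complete defense):

```javascript
// Rough sketch of minimal constraints before fetching user-supplied URLs
// (hypothetical helpers, IPv4 only, not a complete SSRF/DoS defense).
const dns = require('dns').promises;

// Loopback, link-local, and RFC 1918 private ranges.
function isPrivateIPv4(ip) {
  return /^(127\.|10\.|192\.168\.|169\.254\.|172\.(1[6-9]|2\d|3[01])\.)/.test(ip);
}

async function isSafeToFetch(urlString, requestsSoFar) {
  if (requestsSoFar >= 5) return false;                   // cap URLs per post
  let url;
  try { url = new URL(urlString); } catch (e) { return false; }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return false;
  const { address, family } = await dns.lookup(url.hostname);
  return family === 4 && !isPrivateIPv4(address);         // skip IPv6 in this sketch
}
```

Even then you would still want timeouts and rate limiting on top, which is why just not fetching at all is the saner default.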

As for solving the issue Jeff is talking about - wouldn’t a backref test in the regex be the easiest solution?

How about spaces in URLs?

http://www.google.com/codesearch?q=jeff atwood

Is it draconian to require the user to put a %20 there instead?

To make matters worse, this doesn’t correctly parse the following:
See my site (at http://example.com)

Okay, so having spent a bit of time trying to wrangle a backref test… I’ll admit it’s not so easy. I’m sure there’s a way to test that a backref contains something, but maybe I’m getting my XSLT and regexes mixed up.
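
A post-processing pass is probably easier than a backref anyway: only strip a trailing ‘)’ when the parens inside the URL don’t balance. A rough sketch (the function name is made up):

```javascript
// Strip a trailing ")" only when the URL's parens don't balance, i.e. the
// ")" most likely belongs to the surrounding prose rather than the URL.
function trimTrailingParen(url) {
  if (url.charAt(url.length - 1) !== ')') return url;
  var opens = (url.match(/\(/g) || []).length;
  var closes = (url.match(/\)/g) || []).length;
  return closes > opens ? url.slice(0, -1) : url;
}

// trimTrailingParen('http://example.com)')
//   -> 'http://example.com' (the paren came from the prose)
// trimTrailingParen('http://en.wikipedia.org/wiki/Tree_(data_structure)')
//   -> unchanged (the paren is part of the URL)
```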

Useful regex resources for those not wanting to reinvent the wheel:

RegExLib web site - Very useful library of common regular expressions
http://regexlib.com/

Mastering Regular Expressions in case you really want to understand how regular expressions work
http://oreilly.com/catalog/9780596528126/index.html

Adium has a nice library for detecting hyperlinks: http://cloggedtubes.com/development/the_aihyperlinks_framework_or_how_adium_finds_links

What I would like to see is a regular expression that will avoid any links that have already been enclosed in <a> tags.

That is, linkify this link: http://www.google.com

But do not re-linkify this link: <a href="http://www.google.com/">http://www.google.com/</a>
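
One way to get close, as a sketch rather than a real solution (it reuses the hypothetical linkify() from earlier and is no substitute for an actual HTML parser): split the text on existing <a>…</a> chunks and only linkify the pieces in between.

```javascript
// Sketch: avoid double-linking by splitting on existing <a>...</a> elements
// and only running linkify() on the plain-text pieces between them.
function linkifyOutsideAnchors(html) {
  return html
    .split(/(<a\b[^>]*>[\s\S]*?<\/a>)/i)   // capture group keeps the anchors
    .map(function (piece, i) {
      return i % 2 === 1 ? piece : linkify(piece); // odd indices are anchors
    })
    .join('');
}
```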

The first example does not work with the first code example. There is a trailing comma!

http://www.example.com,

IMHO the best approach would be to force your users to enter their text like this:

My website http://www.example.com is awesome.

URLs are hard, let’s go shopping :slight_smile:

I think that we are oversolving the problem.

First, Jeff, you have gone a little too far in suggesting that people change the URLs they enter because the poor little computer can’t autolink correctly.

Second, the whole of the URL text is present even if not correctly autolinked. A savvy user will simply copy/paste the link. An unsavvy user shouldn’t be on the internet anyway. So make a good effort, and then call it a day. You will catch 90% of everything.

-df5

I agree with Dave Schenk; people aren’t so stupid that they can’t use simplified markup.

Or you could actually ping the URL (assuming you only do this check once).

As for URL construction, I still like the way the PHP site does it.

Great post Jeff.

I can’t say that I’ve ever thought about detecting parentheses in urls at all, much less the implications of parentheses surrounding a url. I am enlightened once again.