The Problem With URLs

And unfortunately some sites’ URLs even end in periods, which you called end-of-hyperlink characters.

e.g. http://en.wikisource.org/wiki/1911_Encyclop%C3%A6dia_Britannica/Aga_Khan_I.

Yes, the . is part of the URL.

The parentheses on Wikipedia pages are particularly annoying. I paste URLs into identi.ca and have trained myself to put %29 at the end because I know otherwise a URL with parentheses won’t work.

Can you use the regex balancing group technique to avoid matching an ending parenthesis when one is detected at the front, before the http?

http://blog.stevenlevithan.com/archives/balancing-groups
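
Something along these lines might work in .NET, which does support balancing groups. A rough, untested sketch of my own (the pattern is an assumption, not anything from the post):

using System.Text.RegularExpressions;

// A trailing ')' is only consumed when it pops a '(' that the
// balancing group pushed earlier inside the URL itself.
var url = new Regex(@"https?://(?:[^\s()]|\((?<p>)|\)(?<-p>))*(?(p)(?!))");

// url.Match("Here's a link (http://google.com)").Value
//   -> "http://google.com"              (stray trailing ')' excluded)
// url.Match("an ugly link http://google.com/file(stuff)").Value
//   -> "http://google.com/file(stuff)"  (balanced parens kept)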

There isn’t a problem with URLs; the real problem is using regexes for URL matching!

By the way, you should be using s.Length - 2 to strip both the first and last parentheses. Using s.Substring(1, s.Length - 1) will have the same effect as s.Substring(1), since the remaining length after removing the first character is s.Length - 1.
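
To illustrate with a quick C# sketch (the variable names are mine):

string s = "(http://example.com)";

// Substring(startIndex, length): after dropping the first character,
// only s.Length - 2 characters remain ahead of the trailing ')'.
string wrong = s.Substring(1, s.Length - 1); // "http://example.com)" (same as s.Substring(1))
string right = s.Substring(1, s.Length - 2); // "http://example.com"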

Shouldn’t you be stripping the leading parenthesis, and only removing the closing one if the leading one is missing? For example, it would seem (http://example.com/ Example Site) would capture the leading parenthesis, which would never get stripped since it doesn’t have a closing parenthesis.

It’s not just parens, it can be any characters surrounding the URL. The first example shows a URL followed by a comma. A comma is legal in URLs, so is it in or out? There’s no way to write a regex to correctly delimit a URL in all cases; you have to know the grammar of the data. And in human communications the grammar is informal, a matter of convention in a particular group.

Rather than checking that first/last char are parentheses, I’d suggest removing any closing paren unless there’s an unbalanced matching open paren in the URL itself. (I’m going on the assumption it’s unlikely a programmer-type will construct a URL that intentionally has unbalanced parentheses.)

The trim-off-first/last strategy won’t correctly deal with:

Hey, try this (my friend's site at http://google.com)

This alternative strategy would handle that, as well as the following:

Here’s a link (http://google.com)

Here’s an ugly link: http://google.com/file(stuff)

Here’s an ugly link (http://google.com/file(stuff))

Here’s another one (with a comment http://google.com/file(stuff))
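
In C#, that heuristic might look something like this (my own sketch; the helper name is made up):

// Drop a trailing ')' unless it balances an unmatched '(' in the URL.
static string TrimTrailingParens(string url)
{
    while (url.EndsWith(")"))
    {
        string body = url.Substring(0, url.Length - 1);
        int depth = 0;
        foreach (char c in body)
        {
            if (c == '(') depth++;
            else if (c == ')') depth--;
        }
        if (depth > 0)
            break; // the ')' closes a '(' inside the URL, so keep it
        url = body;
    }
    return url;
}

// TrimTrailingParens("http://google.com)")            -> "http://google.com"
// TrimTrailingParens("http://google.com/file(stuff)") -> unchanged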

The issue of international characters and other such things could probably be circumvented by using a well-tested, long-used existing regular expression for this problem.

http://search.cpan.org/~abigail/Regexp-Common-2.122/lib/Regexp/Common/URI/http.pm

This came up pretty quickly. The author of that is a pretty smart guy.

You’d potentially still have to wrap this regex inside another to apply your same approach to the parentheses, but that’s trivial.

Oh, and yes, that’s Perl, but extracting the actual regex in use from that thing shouldn’t be too difficult, and most languages out there use PCRE or something very, very, very close to it.

Best tool for the job and all that.

Just nitpicking, but shouldn’t that code return s.Substring(1, s.Length - 2) if the idea is to remove both the opening and closing parens?

Oh, and recognizing you might not be familiar with how Perl imports libraries, the regex linked earlier looks to be this:

my $http_uri = "(?k:(?k:http)://(?k:$host)(?::(?k:$port))?" .
               "(?k:/(?k:(?k:$path_segments)(?:[?](?k:$query))?))?)";

(. is Perl’s string concatenation operator), with the $ variables defined here:
http://search.cpan.org/src/ABIGAIL/Regexp-Common-2.122/lib/Regexp/Common/URI/RFC2396.pm

As you can see, getting these regex right is harder than it would appear at first blush.

Why not save some back-end processing time and just give the users a WYSIWYG editor?

You get easy-to-parse (X)HTML, the user clicks buttons.

I’ve noticed that you have a certain tendency to see too many problems as nails that you can hit with your regex hammer :-)

The trouble is that regexes (provably) can only deal with very limited grammars.

As someone else pointed out, you’re never going to get this perfect, as you are ultimately dealing with a human language, which no parsers yet written deal with perfectly. And what are you going to do if the URL is just in an example and not supposed to be a real one (in a code sample, for example)?

If this is just for markup purposes, just specify the format. People will learn that quicker than you can write code to parse English, or whatever.

Did you post this at 2:30 in the morning?

you might need to look at this: {http://crazy-videoz.com/cool-stories/suggestions-for-sleeping-at-work/)}

Why not just check for cases where it might be possible or likely the parse just got confused, and simply prompt the user before form submit? I know it’s an additional step, but it’s actually not that huge of an obstruction.

Jeff’s on a roll lately.

This is why I prefer VB code for this purpose. Who was ever hurt by a little ?

I just force people to use [URL][/URL] if they want to include a URL. Then I don’t have to worry about all these special cases… unless of course someone uses [URL] in their URL, but that’s their own fault for having an absurd URL.
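
That scheme can be as simple as something like this (a rough C# sketch of my own, not their actual code):

using System.Text.RegularExpressions;

// Everything between the explicit [URL]...[/URL] delimiters is taken
// verbatim, so there's no guessing about trailing parens or commas.
string input = "See [URL]http://example.com/foo(bar)[/URL] for details.";
string html = Regex.Replace(input, @"\[URL\](.*?)\[/URL\]",
    "<a href=\"$1\">$1</a>", RegexOptions.IgnoreCase);
// -> See <a href="http://example.com/foo(bar)">http://example.com/foo(bar)</a> for details.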

Heh. Linkification in Firefox failed to handle most of the problematic URLs in the post and comments. The trailing paren in the first wikipedia URL doesn’t get linkified, nor do the %28s. The one with the umlauts was somehow split into two URLs at the first u-umlaut. Guess it is harder than it looks.

I guess this blog really has become focused entirely on web development. Sucks for me since I don’t do web dev and couldn’t care less about auto-linking URLs. When StackOverflow was started, CodingHorror jumped the shark :-(

What it really comes down to is that parsing text for anything is one of the biggest pains in the ass when it comes to programming. Quite simply you never know what’s coming.