Mixing Oil and Water: Authorship in a Wiki World

When you visit Wikipedia's entry on asphalt, you get some reasonably reliable information about asphalt. What you don't get, however, is any indication of who the author is. That's because the author is irrelevant. Wikipedia is a community effort, the result of tiny slices of effort contributed by millions of people around the world. The focus is on the value of the aggregated information, not who the individual authors are.

This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2009/02/mixing-oil-and-water-authorship-in-a-wiki-world.html

I’ve always hated the idea of anonymous authorship. Wiki is a cabal and no one has to be held accountable or responsible.

I still don’t particularly like people changing my posts - It only takes a 1% edit to make an answer completely wrong.

Interesting analysis on authorship stats, but it’s important to remember that Wikipedia encourages detailed references. Authorship really is unimportant for content that cites non-wiki references. Any content that does not should be taken with a grain of wiki salt, knowing that the vast majority of it is correct.

Agreed. Wikipedia tends to tout whatever is popular as fact. Sometimes it is… oftentimes it is not.

Isn’t the problem of computing a very good approximation of the minimal set of differences on text, source code, and other data, already quite elegantly solved by the *nix-like diff tool? (and by rsync, which also uses a similar algorithm.)

I simply sum the total size of all line contributions (insertions or deletions) from any given author in a revision, with a small bonus multiplier of 2x for the original author. We report the highest percentage of authorship in the final revision.
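For anyone who wants to play with the quoted rule, here is a rough sketch of one way to read it, not the site's actual code. The assumptions are mine: "size" means characters on inserted or deleted lines, the first revision counts as all insertions, Python's `difflib` does the line diff, and the names `authorship_percentages` and `top_author` are made up for illustration.

```python
import difflib
from collections import defaultdict


def authorship_percentages(revisions, original_bonus=2.0):
    """Return {author: percent} for a list of (author, text) revisions, oldest first.

    Assumption: a contribution's size is the character count of the lines
    inserted or deleted in that revision.
    """
    scores = defaultdict(float)
    original_author, first_text = revisions[0]
    previous = first_text.splitlines()
    # The first revision is treated as pure insertion by the original author.
    scores[original_author] += sum(len(line) for line in previous)

    for author, text in revisions[1:]:
        current = text.splitlines()
        matcher = difflib.SequenceMatcher(None, previous, current, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                continue
            deleted = sum(len(line) for line in previous[i1:i2])
            inserted = sum(len(line) for line in current[j1:j2])
            scores[author] += deleted + inserted
        previous = current

    # The 2x bonus for the original author mentioned in the quote.
    scores[original_author] *= original_bonus
    total = sum(scores.values())
    if total == 0:
        return {}
    return {author: 100.0 * score / total for author, score in scores.items()}


def top_author(revisions):
    """Return (author, percent) for the highest share of authorship."""
    percentages = authorship_percentages(revisions)
    return max(percentages.items(), key=lambda item: item[1])
```

Under this reading, a one-character fix on a line counts the whole line as deleted and re-inserted, so small edits spread over many lines score heavily.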

Yikes. So if I make a spelling correction in each line, I’m 100% the author?

And I’d hate to see my name (even with a percentage) next to something I didn’t 100% write.

I wonder if the authorship percentage could be calculated by the weighted change in stemmed words in the article for each revision…

Reminds me of how one might ‘fingerprint’ audio streams…
the order isn’t nearly as important as the frequency analysis…
(I always thought fingerprinting took into account how the song progressed, but it doesn’t seem to)
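A minimal sketch of that stemmed-word idea, under loud assumptions: the "stemmer" below is a crude suffix stripper I made up (a real one, such as a Porter stemmer, would behave differently), every stem has weight 1, and word order is ignored, so only the frequency of each stem matters. The function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately crude suffix list; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def stem_counts(text):
    """Count how often each stem occurs, ignoring order and case."""
    return Counter(crude_stem(w) for w in re.findall(r"[a-z']+", text.lower()))


def stem_authorship(revisions):
    """Return {author: percent} for a list of (author, text) revisions, oldest first.

    Each revision is credited with the total change in stem frequencies
    relative to the previous revision.
    """
    scores = defaultdict(float)
    previous = Counter()
    for author, text in revisions:
        current = stem_counts(text)
        stems = set(current) | set(previous)
        scores[author] += sum(abs(current[s] - previous[s]) for s in stems)
        previous = current
    total = sum(scores.values())
    if total == 0:
        return {}
    return {author: 100.0 * score / total for author, score in scores.items()}
```

With this scoring, an edit that only changes inflections ("jumped" to "jump") earns no credit, because the stem counts are unchanged.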

I really wish I could punch people that post first :)

Nice feature, but I have to wonder about one thing: why is it inconsistent?

I searched for the term Alan Kay to find examples.

Example 1: http://stackoverflow.com/questions/432922/significant-new-inventions-in-computing-since-1980
community wiki
5 revisions, 4 users
Alan Kay 76%

Example 2: http://stackoverflow.com/questions/58640/great-programming-quotes
community wiki
9 revisions, 7 users
epatel (82%)

Example 3: http://stackoverflow.com/questions/359877/are-there-famous-developers-using-stackoverflow
community wiki
9 revisions, 5 users

How come the first example has no parentheses, the second example has them, and the third example doesn’t even have the user/percentage?

I think this is the first time that I disagree with you.

Please don’t call SO a wiki. It’s a forum with community editing features.

wiki does not imply community editing.

‘From Wikipedia, the free encyclopedia’ on every wiki page … get it … encyclopedia. The ‘community editing’ sets it apart, making it an encyclopedia by the people for the people.

SO is not an encyclopedia. It’s a bunch of opinionated programmers. Don’t get me wrong, it’s fine with what it is, but it is NOT a wiki and will never be.

I for one thought there was going to be a q/a section and a wiki ‘reference’ section, so that SO became the wiki for programmers. I was sorely disappointed. Allowing people to edit other people’s posts is just that, allowing them to edit other people’s posts.

Nothing on a wiki is personal, and that’s the way it should be. As much as I like Alan Kay, I don’t care if he said something interesting, or it was some kid in India, or a WoW addict.

SO is a game. Write some stuff, get rewarded, show off, etc.

Looks good except for when it shows up in your search results:


It is cut off.

We at swarmforce are attempting to solve this problem with swarm ai. Our first product was debates, and it was tough, but we particalized data, handled revisions and corrections, edits, etc, and assigned each person a contribution percentage (and performance index we call karma) all using swarm ai. Our article product is in development and should be out soon (we also have a twitter product tackling tweet noise, called swatter). There are a bunch of companies popping up all trying to solve the same problem - too much noise on the net with not enough quality and authorship.

pumpitup, wiki doesn’t mean encyclopaedia. It means something nearer to website with simple low-overhead collaborative editing. The fact that some people say wiki when they mean Wikipedia doesn’t change that.

I don’t know whether Stack Overflow is in fact a wiki; I’ve been there maybe twice ever. But if it isn’t, the reason isn’t that it’s not an encyclopaedia.

The problem with Wikipedia is that many of the most active registered users (those with the most edits, not content) believe they own Wikipedia. When a new person adds valuable content, these registered users come in and delete or modify what was written as if to take credit. Then the contributor has to fight to include valuable information and the registered users falsely say that the contributor is trying to claim ownership of the article.

It’s a frustrating exercise and why there are so many contributors that never return.

How unexpected. A genuinely interesting contribution.

For those who might be interested in really efficient differencing algorithms, the strategy used by rsync is actually very interesting, and understandable with only a general background in hashing functions and some exposure to developing algorithms.

Here’s the thesis written by Andrew Tridgell (the guy who put together rsync in the first place):


A CS PhD thesis that a regular coder can read and understand!

I’ve used this idea of a rolling checksum in some of my own apps (differential backup, for example), and it’s remarkable how well they work. Rsync uses large block sizes because of network latency, but you can get a very tight diff by using small block sizes if you have local access to the files.
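To make the rolling checksum idea concrete, here is a small sketch in the style of rsync's weak checksum (two 16-bit running sums). It is an illustration, not rsync's code: the function names and the tiny default block size are my own choices, matches are confirmed by comparing bytes directly where rsync would use a strong hash, and it does not skip ahead after a match the way a real implementation would.

```python
M = 1 << 16  # both running sums are kept modulo 2**16


def weak_checksum(block: bytes):
    """Compute the (a, b) pair for a block from scratch."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * byte for i, byte in enumerate(block)) % M
    return a, b


def roll(a, b, old_byte, new_byte, block_size):
    """Slide the window one byte: drop old_byte, append new_byte, in O(1)."""
    a = (a - old_byte + new_byte) % M
    b = (b - block_size * old_byte + a) % M
    return a, b


def find_matching_blocks(old: bytes, new: bytes, block_size: int = 4):
    """Return (offset_in_new, offset_in_old) for blocks of `old` found in `new`."""
    # Index the fixed, non-overlapping blocks of the old data by checksum.
    table = {}
    for offset in range(0, len(old) - block_size + 1, block_size):
        block = old[offset:offset + block_size]
        table.setdefault(weak_checksum(block), []).append(offset)

    matches = []
    if len(new) < block_size:
        return matches

    a, b = weak_checksum(new[:block_size])
    pos = 0
    while True:
        for old_offset in table.get((a, b), []):
            # The weak checksum can collide, so confirm with a byte comparison.
            if old[old_offset:old_offset + block_size] == new[pos:pos + block_size]:
                matches.append((pos, old_offset))
                break
        if pos + block_size >= len(new):
            break
        a, b = roll(a, b, new[pos], new[pos + block_size], block_size)
        pos += 1
    return matches
```

For example, `find_matching_blocks(b"abcdefgh", b"XXabcdefgh")` finds both old blocks shifted two bytes to the right, which is exactly the case where a fixed-offset block comparison fails.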

The thing that differs between SO and a normal wiki is that ownership has a completely different meaning.

In a place like Wikipedia you want to create listings of ideas and define them. Thus, each successive edit to the original post is a refinement of the original idea. However, I would be very surprised to find a strong correlation between the author of the idea and the original author of the article. Thus the original poster is just the first in a long string of refinements (at least one hopes so) that should converge on the most correct definition.

In SO, however, the original poster asked a question, so he has a strong vested interest in the answers provided. Also, edits will mostly correct errors or rephrase the question so it is better understood, but they must remain in the spirit of the original; otherwise it is a different question. Thus the original author, however badly worded his question was, should always be shown as the author of the question. Not so much as a token of ownership… but as a token of interest. Then, if you wish, you could create a metric for the largest contributor to the question.

Thus, changing the signature at the bottom whenever someone makes an edit (whether fixing typos or completely rephrasing) will hide the original person who asked the question. To draw a parallel with a classroom: when a student asks a question, of course the teacher will address the whole class in answering it, as he/she knows full well that if one student asked it, 10 others are just burning to ask it as well. However, even if another student added to the original question, a good teacher will always return to the first one who asked and ensure that the question was answered to his satisfaction. I find that not doing so is disrespectful to the student who dared to ask it.

The same goes for SO: although questions benefit the whole community, one must never lose sight of who asked the question in the first place. After all, of all the people interested in the answer, he is surely the one who wants it the most.

The answers, however, are a different game. They are more like wikis in some regards, as the goal is to provide the best possible answer. Thus modifying an existing answer should be encouraged over creating a new one, producing the convergence effect of a wiki. Ownership tokens are not as important here, so the displayed author could simply be the person whose contribution was the largest according to some metric. Maybe have different metrics to measure different aspects of contributions. However, the original person who answered is, in my opinion, more like the original poster of a wiki article: just the one who submitted a good draft to work on.

anyways… my 2 cents on the subject

It doesn’t make sense if you edit your own post