Concluding the Great MP3 Bitrate Experiment

I still rip all my CDs to FLACs since it gives me the freedom to transcode albums in the future without worrying about artifacts from lossy compression.

Jeff: You’re drawing way too many conclusions from a test with only a single sample.

Milan: “Enhancement” is a misnomer. It does not make the audio better, it only makes it different.

I really, REALLY hope you won’t try using this experiment to justify ripping your entire CD collection to a lossy format.

First, there’s the obvious issue of choosing a pretty bad track, and a pretty bad method of doing the test. Even if you want to make the tests “realistic” by including crappy headphones and the like, you at least need a decent variation in music.

Second, there’s the obvious issue of “what happens if I need another lossy playback format for my portable player sometime in the future?” - transcoding between lossy formats sucks.

Yup, I’m fully aware that for the majority of my music, I might not be able to hear the difference between 192kbps MP3 and FLAC on a portable music player while jogging near a crowded road, but listening to certain tracks at home I just might.

And when storage is as cheap as it is, why would you rip your collection to a lossy format? Especially when you’re doing it for archival purposes, with the intent of disposing of the physical media?

The problem with these blind tests is that accurately picking the difference between songs is a very difficult task, but not necessarily because people can’t hear the difference.

Unless people have practice in this type of test, they start thinking too much and the results become biased for all types of unimportant reasons. That’s why you have such an outlier in the first sample.

For those of you who don’t know much about wine - think of those times when you’ve been wine tasting. You know that there is difference between the wines, but can you actually discern which is the ‘best’ when you’re at the cellar door sipping on 10 different choices? How often have you bought something, only to find when you get home that it wasn’t what you expected? Whilst the best wine will almost certainly be much more enjoyable when it comes time to drink it, actually picking the best wine is best left to the pros.

I’m pretty surprised I got everything right. The 320kbps and the uncompressed ones might have been a lucky guess. They were nearly indistinguishable, and I’d hate to resort to unquantifiable quack terms to describe them.

I wouldn’t say I have dog ears, though.

I’ll wager that spotting bad compression is an acquired skill, much like spotting bugs and code smells. As a musician and producer, I can reliably hear MP3 compression at 192kbps and below, given decent headphones and source material. It’s mostly about knowing what to listen for. Percussion with a lot of random high-frequency content is usually the easiest giveaway. I did a blind test on myself ten years ago or so, and found I was unable to discern between 192kbps and the original CD.

Another possible bias is the fact that a study at Stanford shows that more students each year actually prefer the MP3 compressed sound.

I don’t take issue with your results. I do take issue with your interpretation. Using your results to state that “people can’t hear a difference at bitrates above 128kbps” and “nobody can hear the difference between a 320kbps CBR audio file and the CD” may be accurate for the average person (based on your statistics), but it cannot be said to be true 100% of the time.
As you discovered in your experiment, there are always outliers.

As for the people who make a living off of audio and audio quality (and I don’t mean DJs; I mean audio engineers, mastering engineers, and audio technicians), I guarantee that any of them worth their salt will be able to accurately assess each sample.

You have conducted a good experiment and collected good data on the AVERAGE person. It is important to make this clear when reporting your results, because if your sample had included more of the above-mentioned professionals, your results would look very different.

For the love of god, please stop with the:

“I know this one dude that can tell the difference between flac and 320kbps, so your test is totally wrong”

Yes, there will always be audiophiles, there will always be experts. I’m sure there is someone who can tell the difference between a Maryland squirrel and a Pennsylvania squirrel, but to everyone else, they just look like squirrels. The outliers don’t matter, and experts are outliers.

I had them correctly rated. On-board laptop sound card and Sennheiser 595 headphones. Semi-audiophile.

Played them all full length and entered the grades while listening.
Did some additional testing in random order and arrived at the same grades.

My primary tell was the ease of listening, next to the (lack of) dynamic range. I often listen to streaming radio, and sometimes it takes 15 minutes to figure out that the quality stinks. I think this is a subconscious thing; it irks me to hear very low quality music. So I wonder about the effect of lossy compression on non-dogs’ minds…

Looks like I was right on the mark other than thinking Gouda was the worst.

There are at least two “industry standards” AFAIK: ATRAC at 292kbps (Sony) and AAC at 256kbps (Apple iTunes Plus). The bitrates were chosen carefully by audio engineers given the efficiency of the codecs they used. MP3 is inferior to both codecs, especially AAC, so it is not surprising people prefer 320kbps “to be safe”. All these references are designed to be perceptually indistinguishable given human psychoacoustics. So if it were easy to tell the difference, the codecs at such bitrates would simply be a failure of audio research and engineering, and I don’t think that is true!

Of course the choice of sample makes a big difference too. The reference is supposed to be transparent even for the trickiest music we would ever hear, most of the time. The sample we have here has no quality, and we don’t even know how the original should sound (it’s highly processed and distorted anyway). I can tell Feta (128kbps) is the worst from the drums at around 0:15; the pre-echo noise is a signature of bad MP3. I can also tell Limburger (160kbps) sounds different from the rest. Given that I don’t believe I can tell a 320kbps encode from the original in such a bad sample, logically Limburger (160kbps) should be the 2nd worst. As for the other three, I think it is completely reasonable that I can’t tell the difference!

The result from the poll is not surprising at all. I am one of the average listeners who can tell Limburger and Feta are different. But I don’t rate Limburger best, simply by logic!

The thing is that, while it’s not out of the question that someone might tell the difference, a large quantity of the posters here claim they can. These are the same people who would have taken the test. The fact that the results came out like they did means that a lot of people are simply fooling themselves.

Interesting to me (as an amateur): all of the compressed versions induced clipping, but the CBR samples had substantially less clipping than the VBR samples. However, since the first reliable clipping is at about 0:56, I doubt it was significant.
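(If anyone wants to poke at this themselves, here’s a rough sketch of one way to count clipped samples, assuming each version has been decoded to 16-bit WAV first; the file names are just placeholders.)

```python
# Rough sketch: count samples at (or within one step of) full scale in a decoded WAV.
# Assumes 16-bit PCM; decode each MP3/FLAC sample to WAV first. File names are placeholders.
import numpy as np
from scipy.io import wavfile

def clipped_sample_count(path, threshold=32766):
    rate, data = wavfile.read(path)                     # data is an int16 array
    return int(np.sum(np.abs(data.astype(np.int32)) >= threshold))

for name in ["gouda.wav", "cheddar.wav", "brie.wav", "limburger.wav", "feta.wav"]:
    print(name, clipped_sample_count(name))
```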

We have 3TB hard drives nowadays. The question isn’t why compress with FLAC, it’s “Why not?”

I use FLAC for all my rips on the PC, but then transcode to Ogg when I copy to my mobile. Banshee and Rhythmbox both make this entirely transparent. I just needed to set a setting however many years ago. Surely everything does this.
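If your player doesn’t do the transcode for you, here’s a minimal sketch of the same FLAC-to-Ogg conversion done by hand with ffmpeg; the paths and the quality setting are just examples.

```python
# Minimal sketch: transcode a folder of FLAC rips to Ogg Vorbis using ffmpeg.
# Requires ffmpeg on the PATH; the source/destination paths are examples.
import subprocess
from pathlib import Path

src = Path("~/Music/flac").expanduser()
dst = Path("~/Music/ogg").expanduser()

for flac in src.rglob("*.flac"):
    out = dst / flac.relative_to(src).with_suffix(".ogg")
    if out.exists():
        continue                                        # already transcoded
    out.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(flac), "-c:a", "libvorbis", "-q:a", "6", str(out)],
        check=True,                                     # stop if ffmpeg reports an error
    )
```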

I studied Audiology and I have read a few studies on these topics: One thing that could explain the abnormally good score of the 160 VBR is that this is the sound that people are used to. Thus, they tend to rate familiar sounds higher than objectively better ones.

On a different note, I wrote my thesis on a somewhat related theme. There, I discovered that different kinds of music indeed performed differently when evaluated for sound quality. Overall trends were constant across genres, but some music showed them more prominently than others.

May I suggest a few quibbles with your data and analysis, Jeff? I’ll explain here, but you can follow along with my updated spreadsheet if you like.

My first observation is that fully 58% of your respondents scored one sample a 1, one sample a 2, etc., all the way through 5. In other words, they most likely misread the instructions and ranked the samples by quality rather than rating them independently. I don’t think you have 3511 sets of ratings; you have 2045 sets of ranking data and 1466 sets of rating data.
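For anyone following along at home, here’s roughly how that split can be made; it assumes the responses have been exported to a CSV with one column per cheese (the file name and column names here are my own, not Jeff’s).

```python
# Sketch: separate "rankers" (used each of 1-5 exactly once) from "raters" (everyone else).
# The CSV path and column names are assumptions about how the data is laid out.
import pandas as pd

df = pd.read_csv("responses.csv")          # one row per respondent
cheeses = ["cheddar", "gouda", "brie", "limburger", "feta"]

is_ranking = df[cheeses].apply(lambda row: sorted(row) == [1, 2, 3, 4, 5], axis=1)
rankings, ratings = df[is_ranking], df[~is_ranking]
print(len(rankings), "sets of rankings;", len(ratings), "sets of true ratings")
```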

Both data sets are useful and informative, but you can’t freely intermix the numbers and analyze them conjointly. The data types are not commensurate, and they require different statistical techniques for analysis.

As it happens, though, removing the ranked data and looking only at the 1466 sets of presumed true ratings doesn’t really change the patterns you mentioned above. A t-Test matrix on the revised data shows that Feta (128kbps) is clearly distinct from all other samples. And no pairing from the pool of Cheddar (320kbps), Gouda (raw), and Brie (192kbps) shows a statistically significant difference.

But… Limburger (160kbps) is in fact rated higher than all other samples, and each of those pairings has a p value far smaller than 0.05. The largest p value is 0.00003. That is very strong and consistent statistical evidence, and you can’t just wave it away because it’s “clearly insane” (i.e., it doesn’t agree with your preconceptions).
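Roughly what that t-test matrix looks like in code, continuing from the rankers/raters split sketched above (I’m using Welch’s t-test here; the spreadsheet may use a slightly different variant):

```python
# Sketch: pairwise t-tests on the presumed true ratings.
# Continues from the rankers/raters split above; Welch's t-test (unequal variances).
from itertools import combinations
from scipy.stats import ttest_ind

for a, b in combinations(cheeses, 2):
    stat, p = ttest_ind(ratings[a], ratings[b], equal_var=False)
    print(f"{a:>9} vs {b:<9}  p = {p:.5f}")
```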

I agree with you that it’s highly unlikely that 160kbps MP3s actually sound better than their higher- and lower-bitrate counterparts. My theory is that you’ve demonstrated that the order of presentation of the samples influences the responses. In other words, respondents may tend to interpret the first-heard sample as a reference baseline.

As near as I can tell, you didn’t randomize the order of presentation in your original post (or at least, the samples keep coming up in the same order for me…). I bet that if you reran the test with Gouda (raw) listed first, the results would directly (though probably erroneously) contradict your original thesis.

It would be interesting to take a look at the ranked data as well to see what it has to say. This is getting beyond my level of statistical knowledge, but I suspect that something like the following treatment (roughly sketched in code below) would be appropriate: 1) Restrict the data set to the 1-5 rankings. 2) Drop the Feta column (since we all agree that Feta is distinguishable; we want to see if Limburger can be distinguished from the others). 3) Recode the rankings to the range 1-4; in other words, for each person’s rankings, assign the lowest value a “1”, the next-lowest value a “2”, etc. 4) Prepare a summary table of cheese vs. rank 1-4 with a count of the appropriate responses in each cell. 5) Prepare a reference table similar to #4, but with an even distribution (the 2045 ranking responses divided by 4) in each cell. 6) Use a chi-squared test to check whether the distribution observed in #4 is distinguishable from the reference distribution in #5.
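A rough sketch of that treatment, again continuing from the split above; I’m not a statistician, so treat this as a starting point rather than the definitive way to do it:

```python
# Sketch of steps 1-6: ranked data only, Feta dropped, ranks recoded to 1-4,
# then the observed counts are tested against an even reference distribution.
import numpy as np
from scipy.stats import chisquare

others = ["cheddar", "gouda", "brie", "limburger"]
recoded = rankings[others].rank(axis=1, method="first").astype(int)   # 1..4 per respondent

observed = np.array([
    recoded[c].value_counts().reindex([1, 2, 3, 4], fill_value=0) for c in others
])                                                     # cheese x rank count table
expected = np.full(observed.shape, observed.sum() / observed.size)    # even distribution

chi2, p = chisquare(observed.ravel(), expected.ravel())
print("chi-squared =", chi2, "p =", p)
```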

As Barbie says, survey design is hard…

Some people can discern between bitrates if they originally listened to the song in high quality and then listen to it in a lower quality format. My father does that often: I play a song he knows on my computer as a 192kbps MP3 and he tells me it sounds a little off. These are not the kind of differences you can notice on mediocre audio equipment, which is something these kinds of tests cannot account for. Obviously someone with crappy headphones won’t be able to tell the difference. It would be like watching a 1080p video on a 640x480 screen and saying there’s no difference between 1080p and 720p.

Woohoo! I got all five in the right order. Do I get a cookie or something of equivalent value?

I could definitely tell the raw CD from the <320kbps samples (although the drop in quality was admittedly slight in all but the 128kbps one, which sounded really nasty), but discerning the difference between uncompressed and the 320kbps was much harder.

Ultimately this test has reaffirmed my personal beliefs that 1) FLAC isn’t worth it purely in terms of sound quality and 2) 320kbps CBR or V0 VBR are still superior to 192kbps if you have good quality headphones. The music just sounds richer and more detailed; I can’t explain it beyond that.

Tested on a Samsung netbook with Sennheiser HD 25-1 IIs (no amplifier).

There is some research suggesting that people now prefer the “sizzling sounds” of MP3s to raw CD quality, which could be the reason why 160kbps does so well.

You’d think 128 kbps would do even better, but it’s probably SO bad that it’s easy to detect.

http://radar.oreilly.com/2009/03/the-sizzling-sound-of-music.html

Congratulations on completely confirming that untrained ears can tell the difference between them!

There’s an interesting bit of psychology that states that the first thing we hear is treated as the baseline. And you didn’t randomize the order on page load.

So for most of us, the “Limburger” file (160kbps) was treated as the baseline, anomalies and all. Then in all the other files, the changes in audio quality were treated as anomalies - and they got a lower rating.

All your test proves is that most of us are not familiar with the exact notes of a single arbitrary song.

Must say: routers and MP3s seem like commodity things these days that don’t warrant so much attention. Is it just nostalgia?