The problem with your article is that you imply that kernel mode is faster than user space because it has direct-access to hardware. This is wrong.
Sometimes PCI hardware can directly-access a memory region mapped to user process memory. Sometimes a user process can write directly to a memory region that is mapped to a PCI device’s onboard memory (i.e. video memory). In such cases the kernel is bypassed and is only used to setup the mappings or issue commands to start an I/O transfer.
For instance under UNIX, if your user process bypasses the kernel cache to write out a file(called Direct I/O under UNIX/Linux) using the write() system call, the kernel will start an I/O transfer to the filesystem such that the write data will be directly read from your process’s memory. The user space “hit” in performance is the latency of calling a kernel system call which involves a context switch.
Some people get all worked up on context switches because they are heavy, etc. But under Linux, the context switch time is less than ONE microsecond.
BTW, the TUX article you reference is really old. Nowadays no one uses tux because the Linux kernel added sendfile() syscall which takes a file from user-space and directly maps it to a socket resulting in zero-copy sending.