When many CPUs lock a rw semaphore for read concurrently, cache line
bouncing occurs. Each time a CPU acquires the rw semaphore for read, it
writes to the cache line holding the semaphore. Consequently, the cache
line keeps moving between CPUs, and this slows down semaphore
acquisition.

This patch introduces new percpu rw semaphores. They are functionally
identical to existing rw semaphores, but locking the percpu rw semaphore
for read is faster and locking for write is slower.

The percpu rw semaphore is implemented as a percpu array of rw
semaphores, one semaphore per CPU. When a thread needs to lock the
semaphore for read, only the semaphore belonging to the current CPU is
locked for read. When a thread needs to lock the semaphore for write,
the semaphores of all CPUs are locked for write. Readers on different
CPUs therefore write to different semaphores, which avoids cache line
bouncing.

Note that a thread that is locking a percpu rw semaphore may be
rescheduled onto another CPU. This does not cause a bug, but cache line
bouncing occurs in this case.