For details, see: http://en.wikipedia.org/wiki/False_sharing
A diagram helps illustrate the problem:
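In short, a CPU takes data from its local cache rather than from main memory whenever it can, and data moves between memory and cache in fixed-size units called cache lines, typically 64 bytes. When several variables happen to sit in the same cache line and multiple threads write to them concurrently, each write invalidates the other cores' cached copies of that line, forcing them to reload it, and the program slows down. The following program demonstrates the effect: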
    public final class FalseSharing implements Runnable {
        public static int NUM_THREADS = 4; // change this to vary the thread count
        public final static long ITERATIONS = 500L * 1000L * 1000L;

        private final int arrayIndex;            // slot in longs[] that this thread writes
        private static VolatileLong[] longs;

        public FalseSharing(final int arrayIndex) {
            this.arrayIndex = arrayIndex;
        }

        public static void main(final String[] args) throws Exception {
            Thread.sleep(10000); // pause before the test begins
            System.out.println("starting....");
            if (args.length == 1) {
                NUM_THREADS = Integer.parseInt(args[0]);
            }

            // One VolatileLong per thread, allocated back to back.
            longs = new VolatileLong[NUM_THREADS];
            for (int i = 0; i < longs.length; i++) {
                longs[i] = new VolatileLong();
            }

            final long start = System.nanoTime();
            runTest();
            System.out.println("duration = " + (System.nanoTime() - start));
        }

        private static void runTest() throws InterruptedException {
            Thread[] threads = new Thread[NUM_THREADS];
            for (int i = 0; i < threads.length; i++) {
                threads[i] = new Thread(new FalseSharing(i));
            }
            for (Thread t : threads) {
                t.start();
            }
            for (Thread t : threads) {
                t.join();
            }
        }

        // Each thread repeatedly writes only to its own VolatileLong.
        public void run() {
            long i = ITERATIONS + 1;
            while (0 != --i) {
                longs[arrayIndex].value = i;
            }
        }

        public final static class VolatileLong {
            public volatile long value = 0L;
            public long p1, p2, p3, p4, p5, p6; // padding intended to keep each value on its own cache line
        }
    }
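If we remove the padding line:
    public long p1, p2, p3, p4, p5, p6;
the program can run roughly ten times slower (the exact factor depends on the hardware and JVM). Without the padding, several VolatileLong objects fit into one cache line, so the threads keep invalidating each other's copy of it even though each thread writes only its own variable.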
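To see both cases side by side, the comparison can be folded into one program. The following is a minimal sketch, not the original benchmark: the class name PaddingComparison and its helper methods are made up for illustration, and it assumes a HotSpot-style object layout in which superclass fields are placed before subclass fields. Timings from a single run without warm-up like this are only indicative.

```java
public final class PaddingComparison {
    static final long ITERATIONS = 50L * 1000L * 1000L;

    // Unpadded: small enough that several instances can share one 64-byte line.
    static class PlainCounter {
        volatile long value;
    }

    // Padded: six extra longs make the object larger than a cache line
    // (assumption: the JVM lays these out after the inherited value field).
    static final class PaddedCounter extends PlainCounter {
        long p1, p2, p3, p4, p5, p6;
    }

    // Allocates n counters back to back, padded or not.
    static PlainCounter[] newCounters(int n, boolean padded) {
        PlainCounter[] counters = new PlainCounter[n];
        for (int i = 0; i < n; i++) {
            counters[i] = padded ? new PaddedCounter() : new PlainCounter();
        }
        return counters;
    }

    // Starts one writer thread per counter and returns the elapsed nanoseconds.
    static long timeWriters(PlainCounter[] counters, long iterations) {
        Thread[] workers = new Thread[counters.length];
        for (int t = 0; t < workers.length; t++) {
            final PlainCounter c = counters[t];
            workers[t] = new Thread(() -> {
                for (long i = iterations; i > 0; i--) {
                    c.value = i; // each thread writes only its own counter
                }
            });
        }
        long start = System.nanoTime();
        for (Thread w : workers) {
            w.start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writers", e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int threads = 4;
        long plain = timeWriters(newCounters(threads, false), ITERATIONS);
        long padded = timeWriters(newCounters(threads, true), ITERATIONS);
        System.out.println("unpadded ns = " + plain);
        System.out.println("padded   ns = " + padded);
    }
}
```

On a multi-core machine the unpadded run would be expected to take noticeably longer, though the ratio varies with the CPU, the JVM version, and how the allocator happens to place the objects.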
This is also why Jetty's BlockingArrayQueue implementation includes the following fields:
    private long _space0;
    private long _space1;
    private long _space2;
    private long _space3;
    private long _space4;
    private long _space5;
    private long _space6;
    private long _space7;
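One caveat with padding fields declared inside a single class is that the JVM is free to reorder fields, so the padding is not guaranteed to end up between the two fields it is meant to separate. A common workaround is to spread the fields across a class hierarchy. The sketch below illustrates that idea only; PaddedIndices and its nested classes are hypothetical names, not Jetty code, and it assumes a HotSpot-style layout where superclass fields precede subclass fields.

```java
public final class PaddedIndices {
    // Layer 1: index written by the consumer thread.
    static class HeadField {
        volatile long head;
    }

    // Layer 2: eight longs (64 bytes) of padding between the two hot fields.
    static class Padding extends HeadField {
        long p0, p1, p2, p3, p4, p5, p6, p7;
    }

    // Layer 3: index written by the producer thread.
    static final class Indices extends Padding {
        volatile long tail;
    }

    // Advances head and tail from two separate threads, n steps each.
    static void advanceBoth(Indices idx, long n) {
        Thread consumer = new Thread(() -> {
            for (long i = 1; i <= n; i++) {
                idx.head = i;
            }
        });
        Thread producer = new Thread(() -> {
            for (long i = 1; i <= n; i++) {
                idx.tail = i;
            }
        });
        consumer.start();
        producer.start();
        try {
            consumer.join();
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        Indices idx = new Indices();
        long start = System.nanoTime();
        advanceBoth(idx, 100L * 1000L * 1000L);
        System.out.println("elapsed ns = " + (System.nanoTime() - start));
        System.out.println("head = " + idx.head + ", tail = " + idx.tail);
    }
}
```

Under the stated layout assumption, head and tail sit at least 64 bytes apart, so the two threads should not contend for the same cache line.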
C programs use the same technique. For example, the Erlang runtime system (ERTS) contains a similar construct:
    typedef union {
        erts_smp_rwmtx_t rwmtx;
        /* pad the union to a cache-line-aligned size */
        byte cache_line_align_[ERTS_ALC_CACHE_LINE_ALIGN_SIZE(sizeof(erts_smp_rwmtx_t))];
    } erts_meta_main_tab_lock_t;

    erts_meta_main_tab_lock_t main_tab_lock[16];
Finally, if you are interested, this article is worth reading:
https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads/