I posted this message to the Beowulf mailing list yesterday and did not get any responses. Since then I have tried different CentOS kernels to see whether the behaviour changes, and it did not change much, so I am posting it here. I hope no one minds, and thanks in advance for any pointers or clues.
I have an issue with a new cluster setup where the nodes run RHEL 5.1 (with the latest 5.2 kernel). When I write data over NFS, the nodes scale linearly until they reach the 10th node: the bandwidth and throughput seen from the NFS server on the other side of the nodes show a linear increase from around 100+ Mbyte/sec up to 1 Gbyte/sec. However, when we add one more node, the bandwidth/throughput becomes erratic and inconsistent, and drops to around 500-700 Mbyte/sec. If I try the same setup with RHEL4U6, I do not get the same behaviour; it sustains the bandwidth at 1 Gbyte/sec.

The setup is like this: 48 nodes share a 48-port access switch that is uplinked over a 10G link to a Cisco 6509 switch, which is linked to a clustered NFS file system consisting of eight heads, each head connected to the 6509 by a 10G link.

The above was a write test, so I thought maybe TCP congestion control had kicked in, or that it was a sliding-window problem. However, when I do a read test it gets worse: the scalability is now reduced to 5 nodes. That is, one node is able to read around 100 MBps, two will read double that, and so on until you add the fifth node, where the bandwidth drops from around 500+ MBps to around 300 MBps. Again, with RHEL4 the behaviour is different.
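As a rough sanity check on the write figures above, the arithmetic can be sketched as follows. This is only an illustration, not a measurement: the per-node rate of about 100 Mbyte/sec and the 10G uplink come from the description above, while the function names and the assumption that the uplink's raw capacity (ignoring protocol overhead) is the only ceiling are mine.

```python
# Illustrative arithmetic only: the inputs are the approximate figures quoted
# in this post, and the function names are hypothetical.

PER_NODE_MBPS = 100        # approx. rate of a single node, in Mbyte/sec
UPLINK_MBPS = 10_000 / 8   # raw capacity of a 10 Gbit/sec link in Mbyte/sec (1250)


def expected_aggregate(nodes, per_node=PER_NODE_MBPS, cap=UPLINK_MBPS):
    """Aggregate Mbyte/sec if scaling stayed linear up to the uplink capacity."""
    return min(nodes * per_node, cap)


def shortfall(nodes, observed_mbps):
    """Fraction of the expected aggregate that is missing at this node count."""
    expected = expected_aggregate(nodes)
    return 1 - observed_mbps / expected


if __name__ == "__main__":
    # Write test with 11 nodes: observed range was roughly 500-700 Mbyte/sec.
    for observed in (500, 700):
        print(f"11 nodes, observed {observed}: "
              f"expected {expected_aggregate(11):.0f}, "
              f"shortfall {shortfall(11, observed):.0%}")
```

Under these assumptions, 11 nodes should still fit within the uplink (about 1100 of 1250 Mbyte/sec), so the observed 500-700 Mbyte/sec is roughly 36% to 55% below what linear scaling would give.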
CentOS mailing list