It often happens that a job is submitted to the scheduler requesting fewer resources (in terms of memory, slots, time, etc.) than the job actually uses. In these circumstances the job may overload one or more compute nodes on which it is running.
Node overload in turn affects the overall performance of the cluster and can lead to premature job termination and node downtime.
In the interest of preventing degraded performance, jobs that use more memory (or computational time) than requested are automatically terminated by the scheduler. Core over-subscription, however, is handled differently: you will generally receive a notification (with or without job termination) reporting the actual load caused by your job and the number of slots you requested for it.
Over-subscription means that the load generated by your job on the compute node(s) is greater than the number of slots you requested. In this case you will need to terminate your job and resubmit it requesting a number of slots equal to the load caused by your job, rounded up to the nearest integer. This is because each unit of load roughly corresponds to one scheduler slot, so if your load is 6.9 you should have requested 7 slots.
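As a minimal illustration of the rounding rule, the short shell snippet below rounds a reported load value up to a whole number of slots (the load value 6.9 is a hypothetical figure taken from an over-subscription notification, and python3 is assumed to be available):

load=6.9                                                      # hypothetical load reported for the job
slots=$(python3 -c "import math; print(math.ceil($load))")    # round up to the nearest integer
echo "Resubmit requesting at least ${slots} slots (e.g. -pe shared ${slots})"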
The example below shows how to change the resources requested for a multi-threaded job that had erroneously been submitted requesting only 1 slot (while the job uses 8 threads, which results in a load value between 7 and 8).
Original request:
#$ -l h_data=16G,h_rt=HHH:MM:SS
Notice that, since no parallel environment and no exclusive flag were specified, you are implicitly requesting only 1 slot.
To match the load generated by the 8-threaded job, you will need to change your request to:
#$ -l h_data=2G,h_rt=7:30:00
#$ -pe shared 8
The parallel environment request (-pe shared 8) allocates 8 slots (which on Hoffman2 correspond to 8 cores) from a single compute node, so that all 8 cores are on the same physical server. Please notice that, since h_data (or h_mem) is a per-slot value, instead of the originally requested 16GB (for one slot) we are now requesting 2GB per slot (as 2GB x 8 slots = 16GB of total memory). A complete example submission script is sketched at the end of this section.

If you have pending jobs that had been erroneously submitted requesting only one slot, you can alter the pending request with the following command:
qalter -l h_data=2G,h_rt=HHH:MM:SS -pe shared 8 JOB_ID_NO
where HHH is the number of hours, MM the number of minutes, SS the number of seconds, and JOB_ID_NO is the numeric ID of the pending job.
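For reference, a complete submission script implementing the corrected request could look like the following minimal sketch. The output options, the program name my_threaded_program, and the use of OMP_NUM_THREADS are illustrative assumptions to be adapted to your own workflow; $NSLOTS is set by the scheduler to the number of slots actually granted, so the thread count always matches the slot request:

#!/bin/bash
# Run from the current working directory and merge stderr into stdout:
#$ -cwd
#$ -j y
# Write the scheduler log to joblog.<job id>:
#$ -o joblog.$JOB_ID
# 2GB of memory per slot, 7.5 hours of run time, 8 slots on a single node:
#$ -l h_data=2G,h_rt=7:30:00
#$ -pe shared 8

# Size the thread pool from the number of slots granted by the scheduler,
# so that the load generated matches the slots requested:
export OMP_NUM_THREADS=$NSLOTS

# my_threaded_program is a placeholder for your own executable:
./my_threaded_program

The script would then be submitted with qsub (for example, qsub myjob.sh, where myjob.sh is the illustrative name of the script above).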