Problem with ISIS simulations on Caparmor
Tina,
Last Friday, I launched 3 simulation plans of 1500 jobs each (1 multi-job of 1500 sub-jobs each). The first one ran to the end, but the next two were either killed partway through or never started.
Do you know why those jobs were killed or did not start? And is there a way for us to run such plans without any problems?
Best regards,
Jean Couteau
Jean Couteau wrote:
Tina,
Last Friday, I launched 3 simulation plans of 1500 jobs each (1 multi-job of 1500 sub-jobs each). The first one ran to the end, but the next two were either killed partway through or never started. In fact, none went through: one got 1445/1500 jobs done, one got 0/1500, and one got 0/1575 :(.
Do you know why those jobs were killed or did not start? And is there a way for us to run such plans without any problems?
Best regards,
Jean Couteau
_______________________________________________
Isis-fish-devel mailing list
Isis-fish-devel@list.isis-fish.org
http://list.isis-fish.org/cgi-bin/mailman/listinfo/isis-fish-devel
Hi Jean,
You need to find the PBS job ID, which is the number you get when you run qsub xxx. With this number, you (or I) can type tracejob -n days xxx to see when the job died and why. Here, days is the number of days to look back to the submission; for example, today is the 4th and you submitted on the 28th, so it is 7.
If your software does not keep the PBS job ID, let me know and I will try to look at the logs.
Tina
Jean Couteau wrote:
Tina,
Last Friday, I launched 3 simulation plans of 1500 jobs each (1 multi-job of 1500 sub-jobs each). The first one ran to the end, but the next two were either killed partway through or never started.
Do you know why those jobs were killed or did not start? And is there a way for us to run such plans without any problems?
Best regards Jean Couteau
--
===================================================
Tina Odaka
RIC - IDM - IFREMER
Tel: +33 (0)2 98 22 41 85
Fax: +33 (0)2 98 22 45 46
email: Tina.Odaka@ifremer.fr
http://www.ifremer.fr/pcim
==================================================
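The day count in Tina's example can be reproduced with a small helper. This is only a sketch: it assumes that `tracejob -n` counts today as day 1, which matches her figure of 7 for a job submitted on the 28th and traced on the 4th. The function names are hypothetical, and the month and year (November to December 2009) are inferred from the log timestamps in this thread.

```python
from datetime import date


def tracejob_days(submitted: date, today: date) -> int:
    """Value for `tracejob -n`, counting today as day 1 (an assumption)."""
    if submitted > today:
        raise ValueError("submission date is in the future")
    return (today - submitted).days + 1


def tracejob_command(job_id: str, submitted: date, today: date) -> list[str]:
    """Build the tracejob argument list for a given PBS job ID."""
    return ["tracejob", "-n", str(tracejob_days(submitted, today)), job_id]


if __name__ == "__main__":
    # Dates follow Tina's example: submitted on the 28th, traced on the 4th.
    cmd = tracejob_command("99338[].service4", date(2009, 11, 28), date(2009, 12, 4))
    print(" ".join(cmd))  # tracejob -n 7 99338[].service4
```

The argument list can be passed to `subprocess.run` on a host where `tracejob` is installed.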
Tina ODAKA wrote:
Hi Jean,
You need to find the PBS job ID, which is the number you get when you run qsub xxx. With this number, you (or I) can type tracejob -n days xxx to see when the job died and why. Here, days is the number of days to look back to the submission; for example, today is the 4th and you submitted on the 28th, so it is 7.

OK, here is what I got:
poussin@service4:~> tracejob -n 7 99338[].service4
Job: 99338[].service4
11/28/2009 12:36:28 S Job Modified at request of Scheduler@service4.ice.ifremer.fr
11/28/2009 12:36:28 A user=poussin group=emh jobname=simulation-as_S queue=sequentiel ctime=1259336077 qtime=1259336077 etime=1259336077 start=0 array_indices=0-1574 Resource_List.mem=3gb Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.place=pack Resource_List.select=1:mem=3gb:ncpus=1 Resource_List.walltime=96:00:00
11/28/2009 12:37:07 L Considering job to run
11/28/2009 12:37:07 L Queue sequentiel per-user job limit reached
11/28/2009 12:37:13 S delete job request received
11/28/2009 12:37:13 S Job to be deleted at request of root@service4.ice.ifremer.fr
11/28/2009 12:37:13 A requestor=root@service4.ice.ifremer.fr
11/28/2009 12:37:20 S delete job request received
11/28/2009 12:37:20 S Job to be deleted at request of root@service4.ice.ifremer.fr
11/28/2009 12:37:20 A requestor=root@service4.ice.ifremer.fr
11/28/2009 12:37:21 S dequeuing from sequentiel, state 7
11/28/2009 12:37:21 A user=poussin group=emh jobname=simulation-as_S queue=sequentiel ctime=1259336077 qtime=1259336077 etime=1259336077 start=0 array_indices=0-1574 Resource_List.mem=3gb Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.place=pack Resource_List.select=1:mem=3gb:ncpus=1 Resource_List.walltime=96:00:00 session=0 end=1259411841 Exit_status=0

poussin@service4:~> tracejob -n 7 99198[].service4
Job: 99198[].service4
11/28/2009 12:36:17 L Considering job to run
11/28/2009 12:36:17 L Queue sequentiel per-user job limit reached
11/28/2009 12:36:24 S delete job request received
11/28/2009 12:36:24 S Job to be deleted at request of root@service4.ice.ifremer.fr
11/28/2009 12:36:24 A requestor=root@service4.ice.ifremer.fr
11/28/2009 12:37:13 S delete job request received
11/28/2009 12:37:13 S Job to be deleted at request of root@service4.ice.ifremer.fr
11/28/2009 12:37:13 S dequeuing from sequentiel, state 7
11/28/2009 12:37:13 A requestor=root@service4.ice.ifremer.fr
11/28/2009 12:37:13 A user=poussin group=emh jobname=simulation-as_r queue=sequentiel ctime=1259331670 qtime=1259331671 etime=1259331671 start=0 array_indices=0-1499 Resource_List.mem=3gb Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.place=pack Resource_List.select=1:mem=3gb:ncpus=1 Resource_List.walltime=96:00:00 session=0 end=1259411833 Exit_status=0
11/28/2009 12:37:20 S delete job request received
11/28/2009 12:37:20 S Unknown Job Id

poussin@service4:~> tracejob -n 7 99346[].service4
Job: 99346[].service4
11/28/2009 12:37:07 L Considering job to run
11/28/2009 12:37:07 L Queue sequentiel per-user job limit reached
11/28/2009 12:37:13 S delete job request received
11/28/2009 12:37:13 S Job to be deleted at request of root@service4.ice.ifremer.fr
11/28/2009 12:37:13 S dequeuing from sequentiel, state 1
11/28/2009 12:37:13 A requestor=root@service4.ice.ifremer.fr
11/28/2009 12:37:20 S delete job request received
11/28/2009 12:37:20 S Unknown Job Id
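The traces above can be scanned mechanically for the two events that stand out: the "per-user job limit reached" message and the delete requests. The following is a rough sketch, not a general tracejob parser. It assumes each entry sits on one line in the form "date time letter message", as in the output pasted in this thread; `parse` and `summarize` are hypothetical helper names, and the sample text is the trace of job 99346[].service4 copied from above.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    timestamp: str
    source: str   # single-letter column as printed above (S, A or L)
    message: str


# Assumed line shape: "MM/DD/YYYY HH:MM:SS X message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+([A-Za-z])\s+(.*)$")

SAMPLE = """\
Job: 99346[].service4
11/28/2009 12:37:07 L Considering job to run
11/28/2009 12:37:07 L Queue sequentiel per-user job limit reached
11/28/2009 12:37:13 S delete job request received
11/28/2009 12:37:13 S Job to be deleted at request of root@service4.ice.ifremer.fr
11/28/2009 12:37:13 S dequeuing from sequentiel, state 1
11/28/2009 12:37:13 A requestor=root@service4.ice.ifremer.fr
11/28/2009 12:37:20 S delete job request received
11/28/2009 12:37:20 S Unknown Job Id
"""


def parse(text: str) -> list[Entry]:
    """Keep only lines that look like timestamped log entries."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entries.append(Entry(*match.groups()))
    return entries


def summarize(entries: list[Entry]) -> dict:
    """Report whether the queue limit was hit and who requested deletion."""
    limit_hit = any("per-user job limit reached" in e.message for e in entries)
    deleted_by = sorted(
        {e.message.split("=", 1)[1] for e in entries if e.message.startswith("requestor=")}
    )
    return {"limit_hit": limit_hit, "deleted_by": deleted_by}


if __name__ == "__main__":
    print(summarize(parse(SAMPLE)))
```

For this sample the summary flags the queue limit and lists root@service4.ice.ifremer.fr as the only requestor of the deletion. The trace itself does not state why root issued the delete.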
participants (2)
- Jean Couteau
- Tina ODAKA