Fwd: Re: Fwd: Re: Fwd: Re: HelpDesk Nv3 / n°=191007 : Re: Accès à plus de coeurs sur caparmor ?
Hello,

It appears there was a misunderstanding between IFREMER support and us about which queues ISIS can use on caparmor:

- We still cannot use the "parallel" queues with ISIS.
- There is indeed a special queue for ISIS, "isisfish", which lets you run simulations on more than 16 cores at once, BUT each simulation on this queue must last less than ten minutes.
- So for now the only solution for large simulations is to use the default queue, "sequentiel", which has 16 cores.

Tina took a look at my simulations on this queue and finds that we are not using the full power of the processor. In particular, she sees "futex wait" calls and asks whether that is normal. Here is her message:

-------- Original message --------
Subject: Re: Fwd: Re: [Isis-fish-users] Fwd: Re: HelpDesk Nv3 / n°=191007 : Re: Accès à plus de coeurs sur caparmor ?
Date: Tue, 17 Dec 2013 11:08:40 +0100
From: Tina ODAKA <Tina.Odaka@ifremer.fr>
Organisation: IFREMER
To: Loic GASCHE <Loic.Gasche@ifremer.fr>

hi,
i checked your jobid "0"
you have 56 min of walltime whereas CPU was only used for 43 minutes; i.e. it only uses about 75% of cpu power.
I tried to strace your java process, and i do not know what it is waiting for, but i saw a lot of "futex wait"
can you ask code-lutin if it is normal??
what i did is checking your java process's system calls.

lgasche 29123 29100 78 09:08 ?
00:45:56 /home3/caparmor/poussin/jdk64/bin/java -Djava.library.path=jri64 -DR.type=jni -Xmx3000M -jar isis-fish-4.2.1.2.jar --option launch.ui false --option perform.vcsupdate false --option perform.migration false --option perform.cron false --simulateRemotellyWithPreScript as_7DV10_toutesRegles_SansCantonnements_ParamsPrincipaux_2013-12-17-10-08_1 /home1/caparmor/lgasche/isis-tmp/simulation-as_7DV10_toutesRegles_SansCantonnements_ParamsPrincipaux_2013-12-17-10-08-preparation.zip /home1/caparmor/lgasche/isis-tmp/simulation-as_7DV10_toutesRegles_SansCantonnements_ParamsPrincipaux_2013-12-17-10-08_1-result.zip /home1/caparmor/lgasche/isis-tmp/simulation-as_7DV10_toutesRegles_SansCantonnements_ParamsPrincipaux_2013-12-17-10-08_1-prescript.bsh r5i0n0:~ # strace -p 29123 -f [pid 29249] sched_yield( <unfinished ...> [pid 29243] futex(0x4012ed28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29236] sched_yield( <unfinished ...> [pid 29228] <... sched_yield resumed> ) = 0 [pid 29249] <... sched_yield resumed> ) = 0 [pid 29243] <... futex resumed> ) = 0 [pid 29240] <... sched_yield resumed> ) = 0 [pid 29236] <... sched_yield resumed> ) = 0 [pid 29233] sched_yield( <unfinished ...> [pid 29228] sched_yield( <unfinished ...> [pid 29222] sched_yield( <unfinished ...> [pid 29249] sched_yield( <unfinished ...> [pid 29243] futex(0x2afb140008b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2afb140008b0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29249] <... sched_yield resumed> ) = 0 [pid 29245] <... futex resumed> ) = 0 [pid 29243] <... futex resumed> ) = 1 [pid 29245] futex(0x40130a28, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29240] sched_yield( <unfinished ...> [pid 29236] sched_yield( <unfinished ...> [pid 29240] <... sched_yield resumed> ) = 0 [pid 29233] <... sched_yield resumed> ) = 0 [pid 29228] <... sched_yield resumed> ) = 0 [pid 29222] <... 
sched_yield resumed> ) = 0 [pid 29249] sched_yield( <unfinished ...> [pid 29243] futex(0x40130a28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29240] sched_yield( <unfinished ...> [pid 29249] <... sched_yield resumed> ) = 0 [pid 29236] <... sched_yield resumed> ) = 0 [pid 29228] sched_yield( <unfinished ...> [pid 29222] sched_yield( <unfinished ...> [pid 29249] sched_yield( <unfinished ...> [pid 29245] <... futex resumed> ) = 0 [pid 29249] <... sched_yield resumed> ) = 0 [pid 29245] futex(0x40130a28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29243] <... futex resumed> ) = 1 [pid 29245] <... futex resumed> ) = 0 [pid 29240] <... sched_yield resumed> ) = 0 [pid 29236] sched_yield( <unfinished ...> [pid 29233] sched_yield( <unfinished ...> [pid 29228] <... sched_yield resumed> ) = 0 [pid 29222] <... sched_yield resumed> ) = 0 [pid 29249] futex(0x40132d24, FUTEX_WAIT_PRIVATE, 50941, NULL <unfinished ...> [pid 29243] futex(0x40130fb4, FUTEX_WAIT_PRIVATE, 50909, NULL <unfinished ...> [pid 29245] futex(0x2afb140008b4, FUTEX_WAIT_PRIVATE, 50671, NULL <unfinished ...> [pid 29240] futex(0x4012f244, FUTEX_WAIT_PRIVATE, 50727, NULL <unfinished ...> [pid 29236] <... sched_yield resumed> ) = 0 [pid 29233] <... sched_yield resumed> ) = 0 [pid 29228] futex(0x401299f4, FUTEX_WAIT_PRIVATE, 51249, NULL <unfinished ...> [pid 29233] futex(0x4012b764, FUTEX_WAIT_PRIVATE, 50907, NULL <unfinished ...> [pid 29222] futex(0x40132d24, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x40132d20, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29249] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29222] <... futex resumed> ) = 0 [pid 29249] futex(0x40132728, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29222] futex(0x40127c84, FUTEX_WAIT_PRIVATE, 50875, NULL <unfinished ...> [pid 29249] <... futex resumed> ) = 0 [pid 29249] futex(0x40130fb4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x40130fb0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 29243] <... 
futex resumed> ) = 0 [pid 29249] futex(0x4012ed28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29243] futex(0x4012ed28, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29249] <... futex resumed> ) = 0 [pid 29243] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29249] futex(0x40132d24, FUTEX_WAIT_PRIVATE, 50943, NULL <unfinished ...> [pid 29243] futex(0x4012ed28, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29243] futex(0x4012f244, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4012f240, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29240] <... futex resumed> ) = 0 [pid 29236] futex(0x4012b7a4, FUTEX_WAIT_PRIVATE, 51125, NULL <unfinished ...> [pid 29243] <... futex resumed> ) = 1 [pid 29240] futex(0x4012cf28, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29243] futex(0x4012cf28, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29240] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29243] futex(0x40130fb4, FUTEX_WAIT_PRIVATE, 50911, NULL <unfinished ...> [pid 29240] futex(0x4012cf28, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29240] futex(0x2afb140008b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2afb140008b0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29245] <... futex resumed> ) = 0 [pid 29240] <... futex resumed> ) = 1 [pid 29245] futex(0x40130a28, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29240] futex(0x40130a28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29245] <... futex resumed> ) = 0 [pid 29240] <... futex resumed> ) = 1 [pid 29245] futex(0x40130a28, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29245] futex(0x401299f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x401299f0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 29228] <... futex resumed> ) = 0 [pid 29245] futex(0x2afb140008b4, FUTEX_WAIT_PRIVATE, 50673, NULL <unfinished ...> [pid 29228] futex(0x40127728, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29240] futex(0x4012f244, FUTEX_WAIT_PRIVATE, 50729, NULL <unfinished ...> [pid 29228] <... 
futex resumed> ) = 0 [pid 29228] futex(0x4012b764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4012b760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29233] <... futex resumed> ) = 0 [pid 29228] <... futex resumed> ) = 1 [pid 29233] futex(0x40129428, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29228] futex(0x40129428, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29233] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29228] <... futex resumed> ) = 0 [pid 29233] futex(0x40129428, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29228] futex(0x401299f4, FUTEX_WAIT_PRIVATE, 51251, NULL <unfinished ...> [pid 29233] <... futex resumed> ) = 0 [pid 29233] futex(0x40127c84, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x40127c80, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 29222] <... futex resumed> ) = 0 [pid 29233] futex(0x4012b764, FUTEX_WAIT_PRIVATE, 50909, NULL <unfinished ...> [pid 29222] futex(0x40125928, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29222] futex(0x40132d24, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x40132d20, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29249] <... futex resumed> ) = 0 [pid 29222] <... futex resumed> ) = 1 [pid 29249] futex(0x40132728, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29222] futex(0x40132728, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29249] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29222] <... futex resumed> ) = 0 [pid 29249] futex(0x40132728, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29222] futex(0x40127c84, FUTEX_WAIT_PRIVATE, 50877, NULL <unfinished ...> [pid 29249] <... futex resumed> ) = 0 [pid 29249] futex(0x40130fb4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x40130fb0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 29243] <... futex resumed> ) = 0 [pid 29249] futex(0x4019f884, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4019f880, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29243] futex(0x4012ed28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29256] <... 
futex resumed> ) = 0 [pid 29249] <... futex resumed> ) = 1 [pid 29256] futex(0x4019f128, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29249] futex(0x4019f128, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29256] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29249] <... futex resumed> ) = 0 [pid 29256] futex(0x4019f128, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29249] futex(0x40132d24, FUTEX_WAIT_PRIVATE, 50945, NULL <unfinished ...> [pid 29256] <... futex resumed> ) = 0 [pid 29243] <... futex resumed> ) = 0 [pid 29243] futex(0x4012b7a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4012b7a0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 29236] <... futex resumed> ) = 0 [pid 29243] futex(0x40130fb4, FUTEX_WAIT_PRIVATE, 50913, NULL <unfinished ...> [pid 29236] futex(0x4012b228, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29236] futex(0x4012f244, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4012f240, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29240] <... futex resumed> ) = 0 [pid 29236] <... futex resumed> ) = 1 [pid 29240] futex(0x4012cf28, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29236] futex(0x4012cf28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29240] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29236] <... futex resumed> ) = 0 [pid 29240] futex(0x4012cf28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29236] futex(0x4012b7a4, FUTEX_WAIT_PRIVATE, 51127, NULL <unfinished ...> [pid 29240] <... futex resumed> ) = 0 [pid 29240] futex(0x2afb140008b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2afb140008b0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29245] <... futex resumed> ) = 0 [pid 29240] <... futex resumed> ) = 1 [pid 29245] futex(0x40130a28, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29240] futex(0x40130a28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29245] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29240] <... 
futex resumed> ) = 0 [pid 29245] futex(0x40130a28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29240] futex(0x4012f244, FUTEX_WAIT_PRIVATE, 50731, NULL <unfinished ...> [pid 29245] <... futex resumed> ) = 0 [pid 29245] futex(0x401299f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x401299f0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 29228] <... futex resumed> ) = 0 [pid 29245] futex(0x2afb140008b4, FUTEX_WAIT_PRIVATE, 50675, NULL <unfinished ...> [pid 29228] futex(0x40127728, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29228] futex(0x4012b764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4012b760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 29233] <... futex resumed> ) = 0 [pid 29228] futex(0x401299f4, FUTEX_WAIT_PRIVATE, 51253, NULL <unfinished ...> [pid 29233] futex(0x40129428, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29233] futex(0x40127c84, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x40127c80, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 29222] <... futex resumed> ) = 0 [pid 29233] futex(0x4012b764, FUTEX_WAIT_PRIVATE, 50911, NULL <unfinished ...> [pid 29222] futex(0x40125928, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29222] futex(0x40127c84, FUTEX_WAIT_PRIVATE, 50879, NULL <unfinished ...> [pid 29296] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 29296] futex(0x2afb1402ee28, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29296] futex(0x4179ee94, 0x189 /* FUTEX_??? */, 1, {1387274027, 162380000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 29296] futex(0x2afb1402ee28, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29296] futex(0x4179ee94, 0x189 /* FUTEX_??? */, 1, {1387274027, 212506000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 29296] futex(0x2afb1402ee28, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29296] futex(0x4179ee94, 0x189 /* FUTEX_??? */, 1, {1387274027, 262659000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 29296] futex(0x2afb1402ee28, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29296] futex(0x4179ee94, 0x189 /* FUTEX_??? 
*/, 1, {1387274027, 312820000}, ffffffff <unfinished ...> [pid 29256] mprotect(0x2afb0bbe0000, 4096, PROT_READ) = 0 [pid 29256] futex(0x401ec634, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x401ec630, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 29411] <... futex resumed> ) = 0 [pid 29411] futex(0x41b25828, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29256] futex(0x4019f884, 0x189 /* FUTEX_??? */, 25587, {1387274028, 305424000}, ffffffff <unfinished ...> [pid 29411] <... futex resumed> ) = 0 [pid 29411] futex(0x401ae004, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x401ae000, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 29262] <... futex resumed> ) = 0 [pid 29411] futex(0x2afb2357b984, 0x189 /* FUTEX_??? */, 1, {1387274027, 405532000}, ffffffff <unfinished ...> [pid 29262] futex(0x401a5e28, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29262] futex(0x401ae004, FUTEX_WAIT_PRIVATE, 77, NULL <unfinished ...> [pid 29296] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 29296] futex(0x2afb1402ee28, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29296] futex(0x4179ee94, 0x189 /* FUTEX_??? */, 1, {1387274027, 362969000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 29296] futex(0x2afb1402ee28, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29296] futex(0x4179ee94, 0x189 /* FUTEX_??? */, 1, {1387274027, 413120000}, ffffffff <unfinished ...> [pid 29411] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 29411] futex(0x4229a828, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29411] futex(0x4019f884, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4019f880, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 29411] futex(0x401ec634, FUTEX_WAIT_PRIVATE, 8799, NULL <unfinished ...> [pid 29256] <... 
futex resumed> ) = 0 [pid 29256] futex(0x4019f128, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29256] mprotect(0x2afb0bbe1000, 4096, PROT_READ) = 0 [pid 29256] mprotect(0x2afb0bbe1000, 4096, PROT_READ|PROT_WRITE) = 0 [pid 29256] mprotect(0x2afb0bbe0000, 4096, PROT_NONE) = 0 [pid 29256] futex(0x40132d24, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x40132d20, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 29249] <... futex resumed> ) = 0 [pid 29256] futex(0x4019f884, FUTEX_WAIT_PRIVATE, 25589, NULL <unfinished ...> [pid 29249] futex(0x40132728, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29249] futex(0x40130fb4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x40130fb0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 29243] <... futex resumed> ) = 0 [pid 29243] futex(0x4012ed28, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29243] futex(0x4012b7a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4012b7a0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 29236] <... futex resumed> ) = 0 [pid 29243] futex(0x4012b228, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29236] futex(0x4012b228, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29243] <... futex resumed> ) = 0 [pid 29236] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29249] sched_yield( <unfinished ...> [pid 29243] sched_yield( <unfinished ...> [pid 29249] <... sched_yield resumed> ) = 0 [pid 29236] futex(0x4012b228, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29236] futex(0x4012f244, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4012f240, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29240] <... futex resumed> ) = 0 [pid 29236] <... futex resumed> ) = 1 [pid 29240] futex(0x4012cf28, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29236] futex(0x4012cf28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29240] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29236] <... futex resumed> ) = 0 [pid 29249] sched_yield() = 0 [pid 29240] futex(0x4012cf28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29236] sched_yield( <unfinished ...> [pid 29240] <... 
futex resumed> ) = 0 [pid 29236] <... sched_yield resumed> ) = 0 [pid 29249] sched_yield( <unfinished ...> [pid 29240] futex(0x2afb140008b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2afb140008b0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29249] <... sched_yield resumed> ) = 0 [pid 29236] sched_yield( <unfinished ...> [pid 29245] <... futex resumed> ) = 0 [pid 29240] <... futex resumed> ) = 1 [pid 29236] <... sched_yield resumed> ) = 0 [pid 29249] sched_yield( <unfinished ...> [pid 29245] futex(0x40130a28, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29240] futex(0x40130a28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29236] sched_yield( <unfinished ...> [pid 29245] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29240] <... futex resumed> ) = 0 [pid 29236] <... sched_yield resumed> ) = 0 [pid 29249] <... sched_yield resumed> ) = 0 [pid 29245] futex(0x40130a28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29240] sched_yield( <unfinished ...> [pid 29236] sched_yield( <unfinished ...> [pid 29245] <... futex resumed> ) = 0 [pid 29240] <... sched_yield resumed> ) = 0 [pid 29236] <... sched_yield resumed> ) = 0 [pid 29249] sched_yield( <unfinished ...> [pid 29245] futex(0x401299f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x401299f0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29249] <... sched_yield resumed> ) = 0 [pid 29240] sched_yield( <unfinished ...> [pid 29236] sched_yield( <unfinished ...> [pid 29245] <... futex resumed> ) = 1 [pid 29240] <... sched_yield resumed> ) = 0 [pid 29236] <... sched_yield resumed> ) = 0 [pid 29228] <... futex resumed> ) = 0 [pid 29228] futex(0x40127728, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29249] sched_yield() = 0 [pid 29228] <... 
futex resumed> ) = 0 [pid 29249] sched_yield( <unfinished ...> [pid 29245] sched_yield( <unfinished ...> [pid 29240] sched_yield( <unfinished ...> [pid 29228] futex(0x4012b764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4012b760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29249] <... sched_yield resumed> ) = 0 [pid 29233] <... futex resumed> ) = 0 [pid 29228] <... futex resumed> ) = 1 [pid 29233] futex(0x40129428, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29228] futex(0x40129428, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29233] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29228] <... futex resumed> ) = 0 [pid 29245] <... sched_yield resumed> ) = 0 [pid 29243] <... sched_yield resumed> ) = 0 [pid 29240] <... sched_yield resumed> ) = 0 [pid 29236] sched_yield( <unfinished ...> [pid 29233] futex(0x40129428, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29249] sched_yield( <unfinished ...> [pid 29245] sched_yield( <unfinished ...> [pid 29233] <... futex resumed> ) = 0 [pid 29228] sched_yield( <unfinished ...> [pid 29233] futex(0x40127c84, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x40127c80, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29228] <... sched_yield resumed> ) = 0 [pid 29233] <... futex resumed> ) = 1 [pid 29222] <... futex resumed> ) = 0 [pid 29249] <... sched_yield resumed> ) = 0 [pid 29245] <... sched_yield resumed> ) = 0 [pid 29240] sched_yield( <unfinished ...> [pid 29236] <... sched_yield resumed> ) = 0 [pid 29249] sched_yield( <unfinished ...> [pid 29243] sched_yield( <unfinished ...> [pid 29233] futex(0x40125928, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29228] sched_yield( <unfinished ...> [pid 29233] <... futex resumed> ) = 0 [pid 29228] <... sched_yield resumed> ) = 0 [pid 29222] futex(0x40125928, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29249] <... sched_yield resumed> ) = 0 [pid 29249] sched_yield( <unfinished ...> [pid 29245] sched_yield( <unfinished ...> [pid 29243] <... 
sched_yield resumed> ) = 0 [pid 29245] <... sched_yield resumed> ) = 0 [pid 29240] <... sched_yield resumed> ) = 0 [pid 29236] sched_yield( <unfinished ...> [pid 29233] sched_yield( <unfinished ...> [pid 29222] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29228] sched_yield( <unfinished ...> [pid 29233] <... sched_yield resumed> ) = 0 [pid 29228] <... sched_yield resumed> ) = 0 [pid 29222] futex(0x40125928, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29249] <... sched_yield resumed> ) = 0 [pid 29249] sched_yield( <unfinished ...> [pid 29243] sched_yield( <unfinished ...> [pid 29233] sched_yield( <unfinished ...> [pid 29245] sched_yield( <unfinished ...> [pid 29222] <... futex resumed> ) = 0 [pid 29249] <... sched_yield resumed> ) = 0 [pid 29249] futex(0x40127c84, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x40127c80, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29245] <... sched_yield resumed> ) = 0 [pid 29243] <... sched_yield resumed> ) = 0 [pid 29245] futex(0x2afb140008b4, FUTEX_WAIT_PRIVATE, 50677, NULL <unfinished ...> [pid 29243] futex(0x40130fb4, FUTEX_WAIT_PRIVATE, 50915, NULL <unfinished ...> [pid 29240] futex(0x4012f244, FUTEX_WAIT_PRIVATE, 50733, NULL <unfinished ...> [pid 29236] <... sched_yield resumed> ) = 0 [pid 29233] <... sched_yield resumed> ) = 0 [pid 29236] futex(0x4012b7a4, FUTEX_WAIT_PRIVATE, 51129, NULL <unfinished ...> [pid 29233] futex(0x4012b764, FUTEX_WAIT_PRIVATE, 50913, NULL <unfinished ...> [pid 29228] sched_yield( <unfinished ...> [pid 29222] futex(0x40127c84, FUTEX_WAIT_PRIVATE, 50881, NULL <unfinished ...> [pid 29228] <... sched_yield resumed> ) = 0 [pid 29222] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29228] futex(0x401299f4, FUTEX_WAIT_PRIVATE, 51255, NULL <unfinished ...> [pid 29222] futex(0x40127c80, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29249] <... futex resumed> ) = 1 [pid 29222] <... 
futex resumed> ) = 0 [pid 29249] futex(0x40132d24, FUTEX_WAIT_PRIVATE, 50947, NULL <unfinished ...> [pid 29222] futex(0x40127c80, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29222] futex(0x40125928, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29222] futex(0x4012b764, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4012b760, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29233] <... futex resumed> ) = 0 [pid 29222] <... futex resumed> ) = 1 [pid 29233] futex(0x40129428, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29222] futex(0x4019f884, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4019f880, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29233] <... futex resumed> ) = 0 [pid 29222] <... futex resumed> ) = 1 [pid 29256] <... futex resumed> ) = 0 [pid 29233] futex(0x4012b7a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4012b7a0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29222] futex(0x40127c84, FUTEX_WAIT_PRIVATE, 50883, NULL <unfinished ...> [pid 29256] futex(0x4019f128, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29236] <... futex resumed> ) = 0 [pid 29233] <... futex resumed> ) = 1 [pid 29236] futex(0x4012b228, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29256] <... futex resumed> ) = 0 [pid 29233] futex(0x4012b228, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29296] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 29236] <... futex resumed> ) = 0 [pid 29233] <... futex resumed> ) = 1 [pid 29296] futex(0x2afb1402ee28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29236] futex(0x4012b228, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29233] futex(0x4012b764, FUTEX_WAIT_PRIVATE, 50915, NULL <unfinished ...> [pid 29236] <... futex resumed> ) = 0 [pid 29236] futex(0x40130fb4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x40130fb0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29243] <... futex resumed> ) = 0 [pid 29236] <... 
futex resumed> ) = 1 [pid 29243] futex(0x4012ed28, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29236] futex(0x4012ed28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29243] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29236] <... futex resumed> ) = 0 [pid 29243] futex(0x4012ed28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29236] futex(0x4012b7a4, FUTEX_WAIT_PRIVATE, 51131, NULL <unfinished ...> [pid 29243] <... futex resumed> ) = 0 [pid 29243] futex(0x2afb140008b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2afb140008b0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29245] <... futex resumed> ) = 0 [pid 29243] <... futex resumed> ) = 1 [pid 29245] futex(0x40130a28, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29243] futex(0x40130a28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29245] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29243] <... futex resumed> ) = 0 [pid 29245] futex(0x40130a28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29243] futex(0x40130fb4, FUTEX_WAIT_PRIVATE, 50917, NULL <unfinished ...> [pid 29245] <... futex resumed> ) = 0 [pid 29245] futex(0x4012f244, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4012f240, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 29240] <... futex resumed> ) = 0 [pid 29245] futex(0x4012cf28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29240] futex(0x4012cf28, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29245] <... futex resumed> ) = 0 [pid 29240] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29245] futex(0x2afb140008b4, FUTEX_WAIT_PRIVATE, 50679, NULL <unfinished ...> [pid 29240] futex(0x4012cf28, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29240] futex(0x401299f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x401299f0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 [pid 29228] <... 
futex resumed> ) = 0 [pid 29240] futex(0x4012f244, FUTEX_WAIT_PRIVATE, 50735, NULL <unfinished ...> [pid 29228] futex(0x40127728, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29228] futex(0x40132d24, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x40132d20, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...> [pid 29249] <... futex resumed> ) = 0 [pid 29228] <... futex resumed> ) = 1 [pid 29249] futex(0x40132728, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> [pid 29228] futex(0x40132728, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29249] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 29228] <... futex resumed> ) = 0 [pid 29249] futex(0x40132728, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 29228] futex(0x401299f4, FUTEX_WAIT_PRIVATE, 51257, NULL <unfinished ...> [pid 29249] <... futex resumed> ) = 0 [pid 29249] futex(0x40132d24, FUTEX_WAIT_PRIVATE, 50949, NULL <unfinished ...> [pid 29296] <... futex resumed> ) = 0 [pid 29296] futex(0x4179ee94, 0x189 /* FUTEX_??? */, 1, {1387274027, 481023000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) [pid 29296] futex(0x2afb1402ee28, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 29296] futex(0x4179ee94, 0x189 /* FUTEX_??? */, 1, {1387274027, 531188000}, ffffffff^C <unfinished ...>
Process 29123 detached Process 29203 detached Process 29222 detached Process 29228 detached Process 29233 detached Process 29236 detached Process 29240 detached Process 29243 detached Process 29245 detached Process 29249 detached Process 29256 detached Process 29262 detached Process 29263 detached Process 29281 detached Process 29285 detached Process 29289 detached Process 29293 detached Process 29296 detached Process 29305 detached Process 29307 detached Process 29312 detached Process 29313 detached Process 29315 detached Process 29326 detached Process 29328 detached Process 29329 detached Process 29341 detached Process 29343 detached Process 29356 detached Process 29409 detached Process 29411 detached
r5i0n0:~ #

On 17/12/2013 10:11, Loic GASCHE wrote:
I have submitted it already with sequentiel.
On 17/12/2013 10:10, Tina ODAKA wrote:
anyway, try to submit it with sequentiel, and mail me when you submit? i will try to check how the job is running to see if anything can be sped up
tina
On 17/12/2013 09:55, Tina ODAKA wrote:
If one simulation needs 5 hours to run, you need to use the default queue, which is sequentiel (you do not need to put -q sequentiel, as it is the default queue)
tina
On 17/12/2013 09:49, Loic GASCHE wrote:
Hi Tina,
Thanks a lot.
Does "each job can run max 10 minutes" mean that one simulation can take no longer than 10 minutes to run, or it will be killed ?
We use a pretty complex model this time and run it for twelve years, so one simulation needs about 5 hours to run...
Loïc
On 16/12/2013 21:34, todaka wrote:
hi loic,
yes, there was a misunderstanding. so, i put you in the users who can use 'isisfish' instead of '-q parallel32' plz use '-q isisfish'
you can then run more than 16 jobs, but each job can run for at most 10 minutes.
Tina
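As a rough illustration of the rule Tina gives here, the choice between the two queues open to ISIS could be sketched in Python as below. The function name `choose_queue`, the constant name and the strict "under ten minutes" comparison are assumptions made for the example; only the queue names and the 10-minute cap come from this thread.

```python
# Hypothetical helper: only the queue names ("isisfish", "sequentiel") and the
# 10-minute cap come from the thread; everything else is illustrative.
ISISFISH_MAX_MINUTES = 10  # per-job cap on the "isisfish" queue


def choose_queue(expected_minutes: float) -> str:
    """Return the queue name to pass to -q for one ISIS simulation."""
    if expected_minutes < 0:
        raise ValueError("expected_minutes must be non-negative")
    # Conservative choice (an assumption): a job expected to take exactly
    # 10 minutes is sent to the default queue rather than risk the cap.
    if expected_minutes < ISISFISH_MAX_MINUTES:
        return "isisfish"
    return "sequentiel"


if __name__ == "__main__":
    # A simulation expected to run for about 5 hours.
    print(choose_queue(5 * 60))
```

With the 5-hour simulations discussed in this thread, such a helper would always fall back to "sequentiel".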
On 2013-12-16 16:55, Loic GASCHE wrote:
hello again tina,
Here is my latest exchange with the people at Code Lutin about using the "parallel" queues with ISIS.
Apparently they had managed to run simulations on parallel8 and had tried parallel32.
They knew that ISIS did not work on these queues, but had assumed Denis was pointing us to the parallel queues because changes to caparmor now gave us access to them... but in fact no.
So all of this is really one big misunderstanding between Denis, me and the Code Lutin team.
Loïc
-------- Original message --------
Subject: Re: [Isis-fish-users] Fwd: Re: HelpDesk Nv3 / n°=191007 : Re: Accès à plus de coeurs sur caparmor ?
Date: Thu, 21 Nov 2013 09:46:01 +0100
From: Eric Chatellier <chatellier@codelutin.com>
Reply-To: isis-fish-users@list.isis-fish.org
Organisation: Codelutin
To: isis-fish-users@list.isis-fish.org
On 21/11/2013 08:44, Loic GASCHE wrote:
> All night long I received error messages from caparmor:
>
> Hello lgasche, your job 5931603.service0, jobname simulation-sim_ using 33
> cores have performance ratio as 0.00. Your real time (wall time) is 03:39
> where as your CPU time is 00:00. This job blocks 33 cores, thus your cpu time
> should get closer to 33 * your real time (wall time). If you can improve the
> performance of your job, your calculation runs faster (and you can make
> economy of computational resource).
> Please check your code, and see if you do not do unnecessary io access or bad
> usage of MPI or OpenMP, or running non optimised paralleljob.
> This is an automatic e-mail from caparmor.
>
> Apparently it is unhappy because job 5931603 is not running.
>
> The funny thing is that it says this job is running on 33 cores...

I was the one who launched the same job twice on two different queues.

> Should I kill this job?

Yes.

> So the day I need more than 8 cores, I just have to type -q parallel nbCoeurs
> to use one of the queues, up to 256?

No, it is "parallel8" or "parallel256" (no space). There are only five or six queues specifically available.

> In his mail Denis says the queues have a time limit, for example 18 hours for
> the 256-core one. What does that mean? What happens if my AS has not finished
> running after 18 hours?

It is a caparmor constraint, and more specifically part of the resource allocation policy on supercomputers. They are happy for you to take more cores, but on condition that you "monopolise" them for less time.

That is up to you, depending on your region. If you think the AS will take more than 18 hours, you need to use fewer cores. Otherwise caparmor will kill the jobs that run too long.
--
===================================================
Tina Odaka
RIC - IDM - IMN - IFREMER
Pôle de Calcul Intensif pour la Mer (PCIM)
Tel: +33 (0)2 98 22 41 85
Fax: +33 (0)2 98 22 45 46
email: Tina.Odaka@ifremer.fr
http://www.ifremer.fr/pcim
==================================================
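The automatic caparmor mail quoted above compares CPU time with the number of cores multiplied by wall time, and Tina's check of 43 minutes of CPU against 56 minutes of wall time is the same comparison for a single process. A minimal Python sketch of that arithmetic follows. The function name `cpu_utilisation` is made up for the example, and the exact formula caparmor uses for its "performance ratio" is not given in this thread, so the one below is an assumption.

```python
def cpu_utilisation(cpu_minutes: float, wall_minutes: float, cores: int = 1) -> float:
    """Fraction of the reserved CPU capacity that was actually used.

    Assumed formula: cpu_time / (cores * wall_time).
    """
    if wall_minutes <= 0 or cores <= 0:
        raise ValueError("wall_minutes and cores must be positive")
    return cpu_minutes / (cores * wall_minutes)


if __name__ == "__main__":
    # Tina's figures: 43 min of CPU time for 56 min of wall time.
    print(f"{cpu_utilisation(43, 56):.2f}")
    # The automatic mail: CPU time 00:00 on 33 cores. The unit of the
    # "03:39" wall time is not stated; with zero CPU time it does not matter.
    print(f"{cpu_utilisation(0, 3 * 60 + 39, cores=33):.2f}")
```

The first call gives roughly 0.77, in line with Tina's "about 75%"; the second gives 0.00, as in the automatic mail.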
On 17/12/2013 11:19, Loic GASCHE wrote:
> Hello,
>
> It appears there was a misunderstanding between IFREMER support and us about
> which queues ISIS can use on caparmor:
>
> - We still cannot use the "parallel" queues with ISIS.
> - There is indeed a special queue for ISIS, "isisfish", which lets you run
>   simulations on more than 16 cores at once, BUT each simulation on this
>   queue must last less than ten minutes.
> - So for now the only solution for large simulations is to use the default
>   queue, "sequentiel", which has 16 cores.
>
> Tina took a look at my simulations on this queue and finds that we are not
> using the full power of the processor. In particular, she sees "futex wait"
> calls and asks whether that is normal.

That should be normal: ISIS "loses" quite a lot of time reading and writing things on the hard disk, and that lowers the average CPU usage time relative to the overall simulation time.
--
Éric Chatellier - Code Lutin
Tel: 02.40.50.29.28 - http://www.codelutin.com
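Éric's explanation can be illustrated with a small Python experiment: a task that alternates computation and blocking accumulates wall time while it waits, but no CPU time. Here `time.sleep` stands in for disk reads and writes, which is an assumption of the sketch; the function name `measure` and the durations are made up and say nothing about how ISIS itself behaves.

```python
import time


def measure(blocking_seconds: float = 0.2) -> float:
    """Return CPU time / wall time for a task that computes, then blocks."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()

    sum(i * i for i in range(200_000))  # some CPU-bound work
    time.sleep(blocking_seconds)        # stand-in for waiting on the disk

    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall


if __name__ == "__main__":
    print(f"CPU/wall ratio: {measure():.2f}")
```

The ratio comes out below 1 because the time spent blocked counts towards wall time only.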
Participants (2): Eric Chatellier, Loic GASCHE