From 0ee690fcb712718ca7ad179ec1a29a9803c80ed6 Mon Sep 17 00:00:00 2001
From: Andrew Geissler
Date: Mon, 17 Sep 2018 10:36:08 -0500
Subject: Increase StartLimitIntervalSec to 240s

The DefaultTimeoutStartSec is 90s. If a service is hitting this timeout
repeatedly then the StartLimitIntervalSec needs to be set in a way to
handle this worse case scenario so that the service which is timing out
does not continuously get restarted.

This means it needs to be set to:
  StartLimitBurst*DefaultTimeoutStartSec + StartLimitBurst* (30s)
which currently would be 2x90 + 2x30

Ref: systemd-system.conf

Tested: Verified that if 90s timeout is hit in service that it is no
longer restarted after 2 attempts.

Resolves openbmc/openbmc#3379

(From meta-phosphor rev: ee52526c80eaca65a581c01bcf703861ec1a80b6)

Change-Id: I8ff4febeb46a746dd3e5e625c5bdc3735963799b
Signed-off-by: Andrew Geissler
Signed-off-by: Brad Bishop
---
 .../phosphor-systemd-policy/service-restart-policy.conf | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

(limited to 'meta-phosphor/recipes-phosphor')

diff --git a/meta-phosphor/recipes-phosphor/systemd-policy/phosphor-systemd-policy/service-restart-policy.conf b/meta-phosphor/recipes-phosphor/systemd-policy/phosphor-systemd-policy/service-restart-policy.conf
index 54516c2d47..17c9e6beae 100644
--- a/meta-phosphor/recipes-phosphor/systemd-policy/phosphor-systemd-policy/service-restart-policy.conf
+++ b/meta-phosphor/recipes-phosphor/systemd-policy/phosphor-systemd-policy/service-restart-policy.conf
@@ -13,19 +13,23 @@
 # restarting once does the job or restarting all 5 times does not help
 # and we just end up hitting the 5 limit anyway.
 #
-# - Change the StartLimitIntervalSec to 30s
+# - Change the StartLimitIntervalSec to 240s
 # The BMC CPU performance is already challenged. When a service is
 # failing and a core dump is being generated and collected into a dump,
 # it's even more challenged. Recent failures have shown situations where
 # the service does not fail again until 15-20 seconds after the initial
 # failure which means the default of 10s for this results in the service
-# being restarted indefinitely. Change this to 30s to only allow a service
-# to be restarted StartLimitBurst times within a 30s interval before
-# being put in a permanent fail state.
+# being restarted indefinitely.
+# Another issue that has cropped up recently is that the DefaultTimeoutStartSec
+# is 90s. If a service is hitting this timeout repeatedly then there
+# is a similar issue as noted above. Because of this, the StartLimitIntervalSec
+# needs to be StartLimitBurst*DefaultTimeoutStartSec +
+# StartLimitBurst* worst case processing time (30s)
+# which currently would be 2x90 + 2x30
 #
 # See systemd-system.conf(5) for details on the conf files
 
 [Manager]
 DefaultRestartSec=1s
 DefaultStartLimitBurst=2
-DefaultStartLimitIntervalSec=30s
+DefaultStartLimitIntervalSec=240s
-- 
cgit v1.2.3
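
Note (not part of the patch): the sizing rule in the commit message can be checked with a few lines of arithmetic. The sketch below is illustrative only; the helper name `start_limit_interval_sec` and its parameter names are hypothetical, and the 30s per-attempt processing allowance is the figure quoted in the commit message, not something read from systemd.

```python
def start_limit_interval_sec(burst: int, timeout_start_sec: int, processing_sec: int) -> int:
    """Interval long enough to cover `burst` worst-case start attempts.

    Each attempt is assumed to cost the full start timeout plus a fixed
    processing allowance (for example, core dump collection on a slow BMC).
    """
    return burst * timeout_start_sec + burst * processing_sec


# Values from the patch: DefaultStartLimitBurst=2, DefaultTimeoutStartSec=90s,
# and the 30s worst-case processing time named in the commit message.
interval = start_limit_interval_sec(burst=2, timeout_start_sec=90, processing_sec=30)
print(f"DefaultStartLimitIntervalSec={interval}s")  # prints DefaultStartLimitIntervalSec=240s
```

If DefaultStartLimitBurst or DefaultTimeoutStartSec is changed later, the same expression suggests the interval should be recomputed rather than left at 240s.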