Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

occasional spinlocks assertion fails on event_post from ISR with zephyr 4.1 #87060

Open
tguenth opened this issue Mar 13, 2025 · 1 comment
Open
Assignees
Labels
area: Kernel bug The issue is a bug, or the PR is fixing a bug

Comments

@tguenth
Copy link
Contributor

tguenth commented Mar 13, 2025

spinlock assert fails on event_post from ISR

My application occasionally gets stuck posting k_events from DMA ISR handler (similar to #30074 (comment)).
It can take some seconds to up to hours to obtain that issue.

setup + signal/event flow

description

  • custom board with stm32-g474xx
  • zephyr version 4.1
  • adc connected via spi

signal/event flow
The adc is configured to sample data on its own at a given data rate at some kHz and notify the cpu via DR on each data available. The CPU then reads 3 bytes into a double-buffer. This is repeated until one half of the buffer is filled. Then a interrupt is triggered by the DMA. The CPU and the ADC do not share a common clock source, each of them has a distinct quartz oscillator attached. So scheduling events and the ISR from the DMA do have a varying alignment. I think this might be important for the observed behavior.

Reading the data is completely offloaded into HW of the SOC to have the buffer filled without any CPU time.
The chain is realized as such:

  • ADC data ready is connected to an EXTI pin
  • DMAMUX is triggered by EXTI
  • DMAMUX triggers 3 transfers on SPI
  • for each data ready on the SPI incoming data are transferred into a buffer via another DMA channel
  • on DMA half complete or transfer complete an interrupt is triggered
  • in the ISR handler i need to inform the application by posting an k_event

Posting an event from ISR handler does seem to yield a recursive spinlock fighting for timeout_lock in zephyrbase/kernel/timeout.c.

To get a more reasonable stack trace i have added the following to my proj.conf

CONFIG_ASSERT=y
CONFIG_NO_OPTIMIZATIONS=y
**stack trace** with locations of simultaneous write access on `timeout_lock`
  arch_system_halt@0x08011b6c (PROJECTROOT/zephyrbase/kernel/fatal.c:30)
  k_sys_fatal_error_handler@0x08011b82 (PROJECTROOT/zephyrbase/kernel/fatal.c:44)
  z_fatal_error@0x08006b08 (PROJECTROOT/zephyrbase/kernel/fatal.c:119)
  z_arm_fatal_error@0x0800b46c (PROJECTROOT/zephyrbase/arch/arm/core/fatal.c:86)
  z_arm_fault@0x08000f88 (PROJECTROOT/zephyrbase/arch/arm/core/cortex_m/fault.c:1080)
  z_arm_usage_fault@0x08000fe0 (PROJECTROOT/zephyrbase/arch/arm/core/cortex_m/fault_s.S:102)
  <signal handler called>@0xfffffff1 (Unknown Source:0)
  assert_post_action@0x0800b3d8 (PROJECTROOT/zephyrbase/lib/os/assert.c:44)
  z_spinlock_validate_pre@0x0800a0cc (PROJECTROOT/zephyrbase/include/zephyr/spinlock.h:136)
  k_spin_lock@0x0800a0cc (PROJECTROOT/zephyrbase/include/zephyr/spinlock.h:193)
> z_abort_timeout@0x0800a0cc (PROJECTROOT/zephyrbase/kernel/timeout.c:152)
  event_walk_op@0x08012b46 (PROJECTROOT/zephyrbase/kernel/events.c:125)
  z_sched_waitq_walk@0x080097be (PROJECTROOT/zephyrbase/kernel/sched.c:1490)
  k_event_post_internal@0x0800a6bc (PROJECTROOT/zephyrbase/kernel/events.c:160)
  z_impl_k_event_post@0x08012b64 (PROJECTROOT/zephyrbase/kernel/events.c:184)
  k_event_post@0x08010d6c (PROJECTROOT/build/zephyr/include/generated/zephyr/syscalls/kernel.h:810)
  msp_irq_handler@0x0800594c (PROJECTROOT/drivers/sensor/limatica_probe/src/lprobe_core.c:444)
  dma_stm32_irq_handler@0x08002b54 (PROJECTROOT/zephyrbase/drivers/dma/dma_stm32.c:130)
  dma_stm32_irq_0_0@0x0800c29c (PROJECTROOT/zephyrbase/drivers/dma/dma_stm32.c:766)
  _isr_wrapper@0x08001420 (PROJECTROOT/zephyrbase/arch/arm/core/cortex_m/isr_wrapper.c:77)
  <signal handler called>@0xfffffff1 (Unknown Source:0)
  first@0x08009d42 (PROJECTROOT/zephyrbase/kernel/timeout.c:48)
> sys_clock_announce@0x0800a262 (PROJECTROOT/zephyrbase/kernel/timeout.c:239)
  sys_clock_isr@0x080044e0 (PROJECTROOT/zephyrbase/drivers/timer/cortex_m_systick.c:218)
  <signal handler called>@0xfffffffd (Unknown Source:0)
  __ISB@0x0800b67a (PROJECTROOT/modules/hal/cmsis/CMSIS/Core/Include/cmsis_gcc.h:261)
  arch_cpu_idle@0x0800b67a (PROJECTROOT/zephyrbase/arch/arm/core/cortex_m/cpu_idle.c:102)
  k_cpu_idle@0x08011d1a (PROJECTROOT/zephyrbase/include/zephyr/kernel.h:6192)
  idle@0x08007086 (PROJECTROOT/zephyrbase/kernel/idle.c:75)
  z_thread_entry@0x080007f4 (PROJECTROOT/zephyrbase/lib/os/thread_entry.c:48)
  ??@0x00000000 (Unknown Source:0)

insights

Two different locations in the stack trace are using the same spin lock from timeout.c.

From systick 0x200028ac <timeout_lock> is used at

sys_clock_announce@0x0800a262 (PROJECTROOT/zephyrbase/kernel/timeout.c:239)

From ISR 0x200028ac <timeout_lock> is used at

z_abort_timeout@0x0800a0cc (PROJECTROOT/zephyrbase/kernel/timeout.c:152)

I assume the varying point in time of DMA ISR relative to systick due to different clock sources can lead to this issue. When i repeat that experiment, it can take some seconds to up to hours to have that issue occurring.

question / assumption

For me this seems to be the underlying problem from that issue #30074 (comment)

  • similar setup (external ADC, SPI + DMA)
  • they closed the ticket by updating zepyhr; for me not possible - already using latest
  • i think the underlying issue might be the same in both cases
  • unfortunately there was never a stack trace posted

Any suggestions on how to create a reproducer here?

For the zephyr scheduler is it safe to do do k_event_post from an ISR which can be triggered at any time relative to systick?

@tguenth tguenth added the bug The issue is a bug, or the PR is fixing a bug label Mar 13, 2025
@peter-mitsis
Copy link
Collaborator

If I am understanding the information presented correctly, it would appear that the DMA triggered interrupt is occurring during sys_clock_announce(). Furthermore based on the stack trace, it would appeaer that the DMA ISR is getting serviced while sys_clock_announce() already held the timeout_lock spinlock.

This certainly seems unusual. I am not an ARM expert, but my expectation would be that for something like this to occur the priority of the serviced DMA interrupt would not just have to be higher than that of the timer interrupt, but likely unmaskable so that it could be serviced even while interrupts were locked by the timer ISR. If this is indeed true in that it matches your setup, then that DMA interrupt's ISR must not call Zephyr kernel routines.

@henrikbrixandersen henrikbrixandersen changed the title occasional spinlocks assertion fails on event_post from ISR with zepyhr 4.1 occasional spinlocks assertion fails on event_post from ISR with zephyr 4.1 Mar 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Kernel bug The issue is a bug, or the PR is fixing a bug
Projects
None yet
Development

No branches or pull requests

4 participants