sample.smf.hsm_psicc2 is flaky on SMP platforms #86954

Open
fabiobaltieri opened this issue Mar 11, 2025 · 11 comments
Labels: area: Logging, area: SMP, area: State Machine Framework, bug, priority: medium

Comments

@fabiobaltieri
Member

Describe the bug
Hi, I noticed a few qemu_x86_64/atom tests being a bit flaky recently: we see them fail on the main CI and then pass after a retry. This is a sample run:

bad: https://github.com/zephyrproject-rtos/zephyr/actions/runs/13785932905/attempts/1
retry: https://github.com/zephyrproject-rtos/zephyr/actions/runs/13785932905/attempts/2

See:

INFO    - 1181/1181 qemu_x86_64/atom          sample.smf.hsm_psicc2                              FAILED Timeout (qemu 119.885s <zephyr>)

Is this something you Intel folks could look into?

To Reproduce
No idea, west twister -p qemu_x86_64/atom -s sample.smf.hsm_psicc2 seems to work locally.

Expected behavior
Pass

Impact
Annoyance in CI, if it's failing for a good reason maybe we can exclude the platform.

fabiobaltieri added the bug and platform: Intel labels on Mar 11, 2025
fabiobaltieri added the priority: low label on Mar 11, 2025
@aescolar
Member

aescolar commented Mar 12, 2025

The same(?) issue occurs with
twister -p qemu_x86_64/atom -T tests/lib/c_lib/thrd -s libraries.libc.c11_threads.picolibc.notls
(The test tends to hang and time out quite often, both in CI and locally.)
https://github.com/zephyrproject-rtos/zephyr/actions/runs/13807339089/job/38620827891#step:11:2121
I'd say this is worse than low priority. We cannot have a platform's tests failing this often. It may be an indication of something being rotten.

The issue is also present in 4.1.0 and in 4.0.0.

aescolar added the priority: medium label and removed the priority: low label on Mar 12, 2025
@kwd-doodling
Collaborator

It reproduces at a high rate in my WSL environment, and I spent a while debugging it today.
It seems that the case runs fine but its logs somehow are not printed.
With CONFIG_DEBUG=y and CONFIG_NO_OPTIMIZATIONS=y set, the firmware hangs during early boot.
In my ISH Simics environment, the Shell log backend triggers an assertion inside k_spin_lock().

@kwd-doodling
Collaborator

kwd-doodling commented Mar 14, 2025

Ignore the assertion; it was caused by a stack overflow after setting CONFIG_NO_OPTIMIZATIONS=y.
The latest finding is that the logs are somehow filtered out by the logging API LOG_INF(). I need more time to track down the logging filter settings.

@nashif
Member

nashif commented Mar 14, 2025

when the sample fails, which is easy to reproduce under load, I get:

ESCcSeaBIOS (version rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org)
Booting from ROM...


uart:~$ *** Booting Zephyr OS build v4.1.0-544-g4bb5ffd7867d ***
uart:~$ State Machine Framework Demo
uart:~$ See PSiCC2 Fig 2.11 for the statechart
uart:~$ https://www.state-machine.com/psicc2

and not the expected logging:

    regex:
      - ".*<inf> hsm_psicc2_thread: initial_entry.*"
      - ".*<inf> hsm_psicc2_thread: s_entry.*"
      - ".*<inf> hsm_psicc2_thread: s2_entry.*"
      - ".*<inf> hsm_psicc2_thread: s21_entry.*"
      - ".*<inf> hsm_psicc2_thread: s211_entry.*"
      - 
      - 

I suspect this is an SMP-related issue; it is a known issue that SMP-based QEMU targets fail under heavy CI load...
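
For illustration only, the mismatch between the captured console output and the expected patterns can be checked with a short Python sketch. It is not how twister itself evaluates the harness; it simply assumes each pattern is searched against each console line. The names `EXPECTED`, `FAILING_OUTPUT`, and `missing_patterns` are hypothetical, and the data is copied from the output and regex list quoted above.

```python
import re

# Patterns copied from the regex list quoted above (only the non-empty entries).
EXPECTED = [
    r".*<inf> hsm_psicc2_thread: initial_entry.*",
    r".*<inf> hsm_psicc2_thread: s_entry.*",
    r".*<inf> hsm_psicc2_thread: s2_entry.*",
    r".*<inf> hsm_psicc2_thread: s21_entry.*",
    r".*<inf> hsm_psicc2_thread: s211_entry.*",
]

# Console lines copied from the failing run quoted above.
FAILING_OUTPUT = [
    "uart:~$ *** Booting Zephyr OS build v4.1.0-544-g4bb5ffd7867d ***",
    "uart:~$ State Machine Framework Demo",
    "uart:~$ See PSiCC2 Fig 2.11 for the statechart",
    "uart:~$ https://www.state-machine.com/psicc2",
]


def missing_patterns(lines, patterns):
    """Return the patterns that match none of the given console lines."""
    return [p for p in patterns if not any(re.search(p, line) for line in lines)]


if __name__ == "__main__":
    # Assumption: a pattern counts as satisfied if any single line matches it.
    for pattern in missing_patterns(FAILING_OUTPUT, EXPECTED):
        print("no match:", pattern)
```

With the failing output above, every pattern is reported as missing, which is consistent with the printk banner appearing while none of the `<inf>` log lines do.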

@nashif
Member

nashif commented Mar 14, 2025

#87099 does not fix the issue for me. I'm still getting the same output as above, with no logging output.

nashif added the area: SMP and area: Logging labels on Mar 14, 2025
@nashif
Member

nashif commented Mar 14, 2025

btw, to reproduce easily, I use stress --cpu 22 and launch twister on this sample in another terminal.
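
A rough sketch of that reproduction recipe as a Python script, in case it helps to count failures over several attempts. The `stress --cpu 22` and `west twister` arguments come from this thread; the helper names `build_commands` and `count_failures`, the attempt count, and the assumption that twister exits non-zero when the scenario fails are mine and may need adjusting.

```python
import subprocess


def build_commands(cpu_workers=22, platform="qemu_x86_64/atom",
                   scenario="sample.smf.hsm_psicc2"):
    """Build the argv lists for the background load and the twister run."""
    stress_cmd = ["stress", "--cpu", str(cpu_workers)]
    twister_cmd = ["west", "twister", "-p", platform, "-s", scenario]
    return stress_cmd, twister_cmd


def count_failures(attempts, run=subprocess.run, popen=subprocess.Popen):
    """Run twister `attempts` times under CPU load and count failing runs.

    Assumption: a non-zero twister exit code means the scenario failed.
    `run` and `popen` are injectable so the loop can be exercised without
    the real tools installed.
    """
    stress_cmd, twister_cmd = build_commands()
    load = popen(stress_cmd)  # keep the host busy in the background
    failures = 0
    try:
        for _ in range(attempts):
            if run(twister_cmd).returncode != 0:
                failures += 1
    finally:
        # Always stop the background load, even if a run raises.
        load.terminate()
        load.wait()
    return failures


if __name__ == "__main__":
    attempts = 20  # arbitrary choice for illustration
    print(f"{count_failures(attempts)} of {attempts} runs failed")
```

Running it from a Zephyr workspace with `stress` installed should give a rough failure rate for the sample under load.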

@nashif
Member

nashif commented Mar 14, 2025

@peter-mitsis one of those flaky CI SMP issues

nashif changed the title from "qemu_x86_64/atom:sample.smf.hsm_psicc2 is flaky" to "sample.smf.hsm_psicc2 is flaky on SMP platforms" on Mar 14, 2025
@nashif
Member

nashif commented Mar 14, 2025

It also fails on qemu_cortex_a53/qemu_cortex_a53/smp: samples/subsys/smf/hsm_psicc2/sample.smf.hsm_psicc2 FAILED Timeout (qemu 59.176s <zephyr>)

@nashif
Member

nashif commented Mar 14, 2025

My suggestion is not to run this on SMP QEMU systems in CI.

@kwd-doodling
Collaborator

kwd-doodling commented Mar 14, 2025

@nashif two patches are needed. #87097 lowers the reproduce rate to 1/20 on my WSL, and with #87099 I ran the case over 100 times.
I tried your stress setup and qemu_cortex_a53/qemu_cortex_a53/smp, and yes, there is still a chance of failure. But this seems to be a different failure case that I never saw in my previous days of debugging. This time even printk output does not appear.
I think the two problems I found are real, and I suggest going ahead with merging. I will take a look at the new issue next week.

@kwd-doodling
Collaborator

Also, the memory dump shows that the case has finished running, so this still looks like a lost-log issue.
