
Segfault in tb_target_set_jmp_target_arm #2048

Open · futhewo opened this issue Oct 31, 2024 · 12 comments

futhewo commented Oct 31, 2024

In a very long emulation, I hit a segfault (null-pointer dereference) in qemu/tcg/aarch64/tcg-target.inc.c:tb_target_set_jmp_target.

In qemu/accel/tcg/cpu-exec.c, tb_set_jmp_target calls tb_target_set_jmp_target with tc_ptr = 0, and therefore tc_ptr + offset = 0.

tb_target_set_jmp_target then calls atomic_set((uint64_t *)jmp_addr, pair) with jmp_addr = tc_ptr + offset = 0, which segfaults.

I patched it (a dirty fix) by adding the following guard in tb_set_jmp_target:

if (TCG_TARGET_HAS_direct_jump && tb->tc.ptr) {
    […]
}

I do not know what this patch may break, but it solves the crash.
I am on commit 6ae0c97.

Feel free to ask me anything.

wtdcode (Member) commented Oct 31, 2024

Thanks for the report and investigation. Your version seems pretty old; could you try the current dev branch?

1144822034 commented:
I have encountered the same problem; this issue has been very helpful to me.

futhewo (Author) commented Nov 11, 2024

Unfortunately, I cannot try a newer unicorn version, sorry.
I am using unicorn as part of unicornafl++, which pins the version I pointed out.

wtdcode (Member) commented Nov 11, 2024

unicornafl++

What is that? Do you mean unicornafl?

futhewo (Author) commented Nov 11, 2024

Yes, the one that is part of afl++, that you may find here:
https://github.com/AFLplusplus/unicornafl

wtdcode (Member) commented Nov 11, 2024

Yes, the one that is part of afl++, that you may find here: https://github.com/AFLplusplus/unicornafl

Okay, unicornafl will bump to 2.1.2 once I fix this.

futhewo (Author) commented Nov 11, 2024

That would be great! Thanks.

sadeli413 commented:
Tagging #1923 for visibility.
My temporary workaround is to destroy and reconstruct the emulator once every few million runs.

wtdcode (Member) commented Jan 4, 2025

I assume 2c688ba should solve this. Could you give it a try, @futhewo?

samuel-beckett commented:
Hi,

Experienced the same segfault on an aarch64 target. After investigation, the null tc.ptr comes from last_tb in tb_find.

Analysis

The TCG highwater has been lowered.

void tcg_prologue_init(TCGContext *s)
{
...
    s->code_gen_highwater = (char *)s->code_gen_buffer + (total_size - TCG_HIGHWATER) - s->uc->qemu_real_host_page_size;
}

Thus tcg_gen_code, called from tb_find, can fail tb_alloc because of
the lower highwater. However, last_tb is still set within tb_find.

static inline TranslationBlock *tb_find(CPUState *cpu,
                                        TranslationBlock *last_tb,
                                        int tb_exit, uint32_t cf_mask)
{
...
        tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
}

Back from tb_gen_code in tb_find, the call to
tb_add_jump/tb_set_jmp_target will trigger the fault:

static inline TranslationBlock *tb_find(CPUState *cpu,
                                        TranslationBlock *last_tb,
                                        int tb_exit, uint32_t cf_mask)
{
...
    if (last_tb) {
        tb_add_jump(last_tb, tb_exit, tb);
        /* -> tb_set_jmp_target(tb, n, (uintptr_t)tb_next->tc.ptr);
         *    SEGV here */
    }
}

My baseline is d568885d64c89db5b9a722f0c1bef05aa92f84ca, with Sai Cao's
patch applied:

git show d01904d3f33bd58c54271a7e28c402dbaddda847
commit d01904d3f33bd58c54271a7e28c402dbaddda847
Author: Sai Cao <[email protected]>
Date:   Thu Dec 19 11:29:20 2024 +0800

    Fix bug that tcg_region_alloc tbs were not flush
    this cause uc_ctl_remove_cache remove a invalid tbs.

diff --git a/qemu/tcg/tcg.c b/qemu/tcg/tcg.c
index dcacec7c..2b643b49 100644
--- a/qemu/tcg/tcg.c
+++ b/qemu/tcg/tcg.c
@@ -808,6 +808,7 @@ TranslationBlock *tcg_tb_alloc(TCGContext *s)
         if (tcg_region_alloc(s)) {
             return NULL;
         }
+        tb_flush(s->uc->cpu);
         goto retry;
     }
     s->code_gen_ptr = next;

The tb_flush done during the failed alloc (caused by the lower
highwater) is responsible for invalidating last_tb in tb_find
without letting it know.

Candidate fix

We could check last_tb->tc.ptr before calling tb_set_jmp_target,
but it is safer to detect the tb_flush earlier and reset last_tb,
for example:

diff --git a/qemu/accel/tcg/cpu-exec.c b/qemu/accel/tcg/cpu-exec.c
index fc47eecb..bf9a8c3b 100644
--- a/qemu/accel/tcg/cpu-exec.c
+++ b/qemu/accel/tcg/cpu-exec.c
@@ -259,11 +259,20 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
     tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, cf_mask);
     if (tb == NULL||tb->cflags&CF_NOCACHE) {
         mmap_lock();
+
+        unsigned tb_flush_count = uc->tcg_ctx->tb_ctx.tb_flush_count;
+
         tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
         mmap_unlock();
         /* We add the TB in the virtual pc hash table for the fast lookup */
         cpu->tb_jmp_cache[tb_jmp_cache_hash_func(cpu->uc, pc)] = tb;
 
+        if (uc->tcg_ctx->tb_ctx.tb_flush_count != tb_flush_count) {
+            uc->last_tb = last_tb = NULL;
+        }
+
         if (uc->last_tb) {
             UC_TB_COPY(&cur_tb, tb);
             UC_TB_COPY(&prev_tb, uc->last_tb);

wtdcode (Member) commented Mar 4, 2025

> Experienced same segfault on aarch64 target. After investigation the null tc.ptr comes from the last_tb in tb_find. […]
Could you try commit 2128e01efc81404fecfcdfe1c7bb282ebd3e87d3? It includes a relevant fix.

This ensures tb_flush is done correctly.

wtdcode (Member) commented Mar 4, 2025

Also, I would appreciate a reproduction, even a complex, non-minimal one. @samuel-beckett
