| 1 | +## Setup |
| 2 | + |
| 3 | +First, we create a root qdisc of the DRR type. |
| 4 | +Any classful qdisc can be used, but this choice affects the sizes of the objects involved in the exploitation. |
| 5 | + |
| 6 | +Next, we create two child classes: 0x10001 and 0x10002. |
| 7 | + |
| 8 | +We also add two filters attached to the root qdisc, bound to the same 0x10001 class, with handles 1 and 2 and a bitmask of 0xf. |
| 9 | +The fw filter matches by masking the packet's mark value with the filter's mask and then comparing the result to the filter handle, so in this simple setup packets with mark value 2 are matched by the filter with handle 2. |
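The matching rule can be illustrated with a small userspace sketch. This is not the kernel's cls_fw code; `fw_filter_matches` and `fw_filter_example` are hypothetical names that only mirror the mask-then-compare logic described here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a kernel function): mask the packet mark with
 * the filter mask, then compare the result with the filter handle. */
bool fw_filter_matches(uint32_t mark, uint32_t mask, uint32_t handle)
{
    return (mark & mask) == handle;
}

/* Worked example using the values from this setup (mask 0xf, handles 1 and 2). */
void fw_filter_example(void)
{
    assert(fw_filter_matches(2, 0xf, 2));   /* mark 2 hits the filter with handle 2 */
    assert(!fw_filter_matches(2, 0xf, 1));  /* mark 2 does not hit handle 1 */
}
```

Because only the low four bits survive the 0xf mask, a mark such as 0x12 would also match the filter with handle 2 under this rule.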
| 10 | + |
| 11 | +Finally, we add an iptables rule to set the fw mark on all OUTPUT packets to 2 - this will allow us to trigger a match on a filter and execute the enqueue() function. |
| 12 | + |
| 13 | +## Triggering use-after-free |
| 14 | + |
| 15 | +To trigger the vulnerability, we use [scenario 2](vulnerability.md#scenario-2). |
| 16 | + |
| 17 | +We change the bound class of the first filter (handle 1) to class 0x10002. |
| 18 | +This causes the filter_cnt field of class 0x10001 to drop to 0, which allows us to delete the class even though the filter with handle 2 is still bound to it. |
| 19 | + |
| 20 | +## Getting RIP control |
| 21 | + |
| 22 | +Deleting a DRR class frees two objects: |
| 23 | +- struct drr_class, freed directly by kfree() in drr_destroy_class() |
| 24 | +- struct Qdisc, scheduled to be freed through RCU in __qdisc_destroy() |
| 25 | + |
| 26 | + |
| 27 | +### LTS/COS exploit |
| 28 | + |
| 29 | +The simplest way to get code execution is to target the ->enqueue() function pointer of the qdisc struct: |
| 30 | +``` |
| 31 | +struct Qdisc { |
| 32 | + int (*enqueue)(struct sk_buff *, struct Qdisc *, struct sk_buff * *); /* 0 0x8 */ |
| 33 | + struct sk_buff * (*dequeue)(struct Qdisc *); /* 0x8 0x8 */ |
| 34 | +... |
| 35 | +``` |
| 36 | + |
| 37 | +->enqueue() is called from dev_qdisc_enqueue() after a packet is classified to a given class by fw_classify(). |
| 38 | + |
| 39 | +struct Qdisc is allocated from kmalloc-512, so we need a heap allocation primitive that allocates from that cache and gives us control of the first 8 bytes of the object. This excludes some popular options such as keys and xattrs. |
| 40 | + |
| 41 | +In this exploit we used a netlink allocation primitive (other socket types can be used as well). When a message is sent on a netlink socket, the kernel allocates a buffer for the data in __alloc_skb() through the kmalloc_reserve() wrapper, which for the sizes we are interested in uses a regular general-purpose kmalloc cache. |
| 42 | +User data is copied to the very beginning of this buffer, and the skb metadata (skb_shared_info) is placed at the end. |
| 43 | + |
| 44 | +One last thing to remember is that we have to sleep for a bit to give RCU time to perform the callback that will free the qdisc object. |
| 45 | + |
| 46 | +Once the contents of the qdisc object are under our control, we just have to send any packet to get RIP control. |
| 47 | + |
| 48 | +In this exploit we rely on an external KASLR leak, so the kernel text address is known beforehand. |
| 49 | + |
| 50 | + |
| 51 | +### Mitigation exploit |
| 52 | + |
| 53 | +When I was writing the original mitigation exploit, I incorrectly assumed that targeting the qdisc object wouldn't work on the mitigation instance because of the static/dynamic kmalloc cache split. |
| 54 | +Looking back at this now, I see that the original LTS exploit would work perfectly fine on the mitigation instance (after adjusting offsets, etc.), because the qdisc is allocated from a dynamic cache: its size is calculated at runtime in qdisc_alloc(). |
| 55 | + |
| 56 | +I'll document this unnecessarily complicated approach here anyway, in case someone finds it interesting. |
| 57 | + |
| 58 | +The mitigation exploit targets the other object freed when the class is deleted - struct drr_class. |
| 59 | +This object has a qdisc pointer at offset 0x60. If we manage to overwrite it with a pointer to a fake qdisc object, we can gain code execution in a similar way to the LTS exploit. |
| 60 | + |
| 61 | +There are two issues to overcome with this approach: |
| 62 | +- drr_class is a fixed-size allocation, so with CONFIG_KMALLOC_SPLIT_VARSIZE enabled we have to find an object that is allocated from kmalloc-128, has a fixed size, and has user-controlled data at offset 0x60 |
| 63 | +- We need a location with a known address in which to store the fake qdisc object |
| 64 | + |
| 65 | +The first issue is solved by struct clusterip_config, which has the following field at the end: |
| 66 | +``` |
| 67 | +struct clusterip_config { |
| 68 | +... |
| 69 | + char ifname[16]; /* 0x60 0x10 */ |
| 70 | + |
| 71 | +} |
| 72 | + |
| 73 | +``` |
| 74 | + |
| 75 | +This object is allocated when creating an iptables rule with the CLUSTERIP target. |
| 76 | +The network interface provided in the ifname field has to exist, but this is not a problem because we can rename the loopback interface to any name we want. |
| 77 | +The only restriction is that we can't use null or whitespace bytes. |
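A minimal sketch of that byte restriction, assuming the name carries an 8-byte value: `value_fits_in_ifname` and `ifname_example` are hypothetical helpers, not part of the exploit or the kernel. The extra rejection of '/' and ':' is an assumption based on my reading of the kernel's interface name validation and goes beyond what the text states.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if none of the 8 bytes of `value` is a byte
 * that an interface name cannot contain. */
bool value_fits_in_ifname(uint64_t value)
{
    unsigned char bytes[sizeof(value)];

    memcpy(bytes, &value, sizeof(value));
    for (size_t i = 0; i < sizeof(bytes); i++) {
        /* Restriction stated in the text: no null or whitespace bytes. */
        if (bytes[i] == '\0' || isspace(bytes[i]))
            return false;
        /* Assumption (not from the text): '/' and ':' are rejected too. */
        if (bytes[i] == '/' || bytes[i] == ':')
            return false;
    }
    return true;
}

/* Worked example with made-up values. */
void ifname_example(void)
{
    assert(value_fits_in_ifname(0xffff888012345678ULL));   /* no forbidden byte */
    assert(!value_fits_in_ifname(0xffff888012340078ULL));  /* contains 0x00 */
    assert(!value_fits_in_ifname(0xffff888012342078ULL));  /* contains 0x20 (space) */
}
```

The check does not depend on byte order, since a forbidden byte anywhere in the value makes the name unusable.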
| 78 | + |
| 79 | +The issue with placing the payload at a known address was solved by exploiting a separate infoleak vulnerability in the xfrm subsystem: |
| 80 | + |
| 81 | +``` |
| 82 | +author Herbert Xu <[email protected]> 2023-02-09 09:09:52 +0800 |
| 83 | +committer Steffen Klassert <[email protected]> 2023-02-13 13:38:58 +0100 |
| 84 | +commit 8222d5910dae08213b6d9d4bc9a7f8502855e624 |
| 85 | + |
| 86 | +xfrm: Zero padding when dumping algos and encap |
| 87 | +``` |
| 88 | + |
| 89 | +https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8222d5910dae08213b6d9d4bc9a7f8502855e624 |
| 90 | + |
| 91 | +This vulnerability allows us to leak 64 bytes of uninitialized data from the xfrm_algo object allocated from the kmalloc-96 cache. |
| 92 | +We'll use it to leak data of a simple_xattr object, which looks like this: |
| 93 | +``` |
| 94 | +struct simple_xattr { |
| 95 | + struct list_head list; /* 0 0x10 */ |
| 96 | + char * name; /* 0x10 0x8 */ |
| 97 | + size_t size; /* 0x18 0x8 */ |
| 98 | + char value[]; /* 0x20 0 */ |
| 99 | + |
| 100 | +}; |
| 101 | +``` |
| 102 | + |
| 103 | +Here's a procedure to get our data at a known address: |
| 104 | + |
| 105 | +1. Create xattr x1 with data length of 96, but name length of 256 bytes so that simple_xattr is allocated from kmalloc-96, but name from kmalloc-256. |
| 106 | +2. Remove x1 and allocate xfrm_algo in place of its simple_xattr. |
| 107 | +3. Read x1->name pointer using infoleak. |
| 108 | +4. Create a new xattr x2 with a data length that places it in kmalloc-256. It will be allocated in place of x1's name buffer, which was freed when x1 was removed. We now have controlled data at the leaked address + 0x20 (the simple_xattr header takes 0x20 bytes). |
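The 0x20 offset in step 4 follows directly from the simple_xattr layout. The sketch below is a userspace mirror of that layout for illustration only: it assumes a 64-bit target, and the `*_mirror` types and `xattr_value_address` are hypothetical names, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirrors of the kernel layouts (assumption: 8-byte pointers and size_t). */
struct list_head_mirror {
    uint64_t next;
    uint64_t prev;
};

struct simple_xattr_mirror {
    struct list_head_mirror list;  /* offset 0x00, size 0x10 */
    uint64_t name;                 /* offset 0x10, pointer stored as an integer */
    uint64_t size;                 /* offset 0x18 */
    char value[];                  /* offset 0x20, user-controlled data */
};

/* The header occupies 0x20 bytes, so the data starts right after it. */
static_assert(offsetof(struct simple_xattr_mirror, value) == 0x20,
              "value[] is expected to start at offset 0x20");

/* Hypothetical helper: address of value[] given the address of the object. */
uint64_t xattr_value_address(uint64_t object_address)
{
    return object_address + offsetof(struct simple_xattr_mirror, value);
}
```

For example, `xattr_value_address(0x1000)` yields `0x1020`; the same arithmetic applies to the leaked address of the buffer that x2 reuses.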
| 109 | + |
| 110 | +The next step is to rename the loopback interface so that its name consists of the bytes of the pointer to our fake qdisc object stored in x2. |
| 111 | +After this is done, we trigger the use-after-free in the same way as in the LTS exploit and create an iptables CLUSTERIP rule with the input interface set to the fake qdisc's address. |
| 112 | +This overwrites struct drr_class with struct clusterip_config, setting qdisc pointer of the former to the address of our fake object. |
| 113 | + |
| 114 | +Finally, we send a network packet to be matched by fw_classify() leading to the ->enqueue function pointer of our fake qdisc being called. |
| 115 | + |
| 116 | +## Pivot to ROP |
| 117 | + |
| 118 | +When ->enqueue() is called, the registers are as follows: |
| 119 | +- RDI - pointer to the skb |
| 120 | +- RSI - pointer to the qdisc |
| 121 | +- RAX - copy of RSI |
| 122 | + |
| 123 | +RDI is not that useful to us, but RSI and RAX point directly to the data under our control. |
| 124 | + |
| 125 | +The stack pivot has three stages, each using a different gadget. |
| 126 | + |
| 127 | +#### Gadget 1 |
| 128 | + |
| 129 | +``` |
| 130 | +lea rdi, [rax + 0x20] |
| 131 | +mov rax, qword ptr [rax + 0x30] |
| 132 | +jmp __x86_indirect_thunk_rax |
| 133 | +``` |
| 134 | + |
| 135 | +This adds 0x20 to our controlled data pointer, stores the result in RDI and jumps to the next gadget. |
| 136 | +Adding 0x20 is very helpful, because we can't use the very beginning of the buffer as the start of the ROP chain - it contains the address of our first gadget. |
| 137 | + |
| 138 | +#### Gadget 2 |
| 139 | + |
| 140 | +``` |
| 141 | +push rdi |
| 142 | +jmp qword [rsi+0x0F] |
| 143 | +``` |
| 144 | + |
| 145 | +This pushes the location of our ROP chain onto the stack and jumps to the next gadget. |
| 146 | + |
| 147 | +#### Gadget 3 |
| 148 | + |
| 149 | +``` |
| 150 | +pop rsp |
| 151 | +ret |
| 152 | +``` |
| 153 | + |
| 154 | +Finally, we pop the previously pushed ROP chain location into RSP, completing the pivot. |
| 155 | + |
| 156 | +The gadgets above are from the LTS exploit. The COS and mitigation versions work exactly the same way; they differ only in the registers used and the offsets. |
| 157 | + |
| 158 | +## Second pivot |
| 159 | + |
| 160 | +At this point we have full ROP, but there is not much space - most of our buffer is taken up by the skb_shared_info struct at the end. |
| 161 | +To have enough space to execute all of the privilege escalation code, we have to pivot again. |
| 162 | +This is quite simple - we choose an unused read/write area in the kernel and use copy_user_generic_string() to copy the second-stage ROP chain from userspace to that area. |
| 163 | +Then we use a `pop rsp ; ret` gadget to pivot there. |
| 164 | + |
| 165 | +## Privilege escalation |
| 166 | + |
| 167 | +To escalate our process's privileges, we execute the following functions from the ROP chain: |
| 168 | + |
| 169 | +- commit_creds(init_cred) |
| 170 | +- switch_task_namespaces(find_task_by_vpid(1), init_nsproxy) |
| 171 | + |
| 172 | +Then we set up the registers for the return to userspace and jump to the `swapgs ; sysret` gadget in return_via_sysret. |
| 173 | + |
| 174 | +After getting back to userspace we call setns() on namespaces of pid 1 to complete escape to the initial namespace. |