The story of exploiting kmalloc() overflows
v.0.2.1
- qobaiashi


I. Intro
""""""""

Recently many kernel level heap (kmalloc) overflows have been discovered
which were rated "unclear" with regard to exploitability. This paper aims
at explaining the kernel's heap management from a security point of view
and at exploiting heap overflows in kernel space.


II. The Slab allocator
""""""""""""""""""""""

The slab allocator serves the need for small, dynamic and thus fragmented
memory portions in kernel space. Pages requested from the buddy system
(0x1000 bytes each) are further split up into caches, which hold slabs,
which in turn hold objects - an object being the smallest unit of memory
and the thing a returned kmalloc() pointer points to. To increase
performance the slab allocator keeps several objects of the same size in
these so-called slabs, so that a just-kfree'd object can - without
touching the buddy system - be handed out again, for example to a new
instance of a driver module which repeatedly allocates objects of the
same size.
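
To make the "object" notion concrete, here is a minimal illustrative
sketch - assuming a 2.4-style module environment, not taken from any
kernel source - of two trips through the slab layer. The 30 byte request
is rounded up and served from the general purpose size-32 cache that
shows up in /proc/slabinfo below; kfree() only returns the object to its
cache, not to the buddy system.

#include <linux/module.h>
#include <linux/errno.h>
#include <linux/slab.h>

static void *obj;

int init_module(void)
{
        obj = kmalloc(30, GFP_KERNEL);  /* rounded up -> size-32 cache */
        if (obj == NULL)
                return -ENOMEM;
        return 0;
}

void cleanup_module(void)
{
        kfree(obj);                     /* back to the cache, ready for
                                         * the next same-size allocation */
}
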
II.I Caches
"""""""""""

+-------+--->+-------+----->+-------+
| CACHE |    | CACHE |      | CACHE |
|       |<---|       |<-----|       |
+---+---+    +-------+      +-------+
    |
    V
+--------------------+
| SLAB | SLAB | SLAB |
|------|------|------|
|  O   |  B   |  J   |
|------|------|------|
|  E   |  C   |  T   |
|------|------|------|
|  S   |      |      |
+--------------------+

Fig. I.
"""""""

A list of all active caches can be found in /proc/slabinfo:

qobaiashi@cocoon:~> cat /proc/slabinfo
slabinfo - version: 1.1 (SMP)
kmem_cache            80     80    244     5     5    1 :  252  126
fib6_nodes           113    113     32     1     1    1 :  252  126
ip6_dst_cache         20     20    192     1     1    1 :  252  126
ndisc_cache           30     30    128     1     1    1 :  252  126
hpsb_packet            0      0    100     0     0    1 :  252  126
ip_fib_hash          113    113     32     1     1    1 :  252  126
clip_arp_cache         0      0    128     0     0    1 :  252  126
ip_mrt_cache           0      0     96     0     0    1 :  252  126
tcp_tw_bucket         30     30    128     1     1    1 :  252  126
tcp_bind_bucket      113    113     32     1     1    1 :  252  126
tcp_open_request      40     40     96     1     1    1 :  252  126
ip_dst_cache          20     20    192     1     1    1 :  252  126
arp_cache             30     30    128     1     1    1 :  252  126
blkdev_requests     1560   1560     96    39    39    1 :  252  126
nfs_write_data         0      0    384     0     0    1 :  124   62
nfs_read_data          0      0    352     0     0    1 :  124   62
nfs_page               0      0     96     0     0    1 :  252  126
ext2_xattr             0      0     44     0     0    1 :  252  126
kioctx                 0      0    128     0     0    1 :  252  126
kiocb                  0      0     96     0     0    1 :  252  126
eventpoll pwq          0      0     36     0     0    1 :  252  126
eventpoll epi          0      0     96     0     0    1 :  252  126
dnotify_cache        338    338     20     2     2    1 :  252  126
file_lock_cache       40     40     96     1     1    1 :  252  126
async poll table       0      0    144     0     0    1 :  252  126
fasync_cache         126    202     16     1     1    1 :  252  126
uid_cache            113    113     32     1     1    1 :  252  126
skbuff_head_cache     80     80    192     4     4    1 :  252  126
sock                 216    216   1344    72    72    1 :   60   30
sigqueue              28     28    136     1     1    1 :  252  126
kiobuf                 0      0     64     0     0    1 :  252  126
cdev_cache           531    531     64     9     9    1 :  252  126
bdev_cache            59     59     64     1     1    1 :  252  126
mnt_cache             59     59     64     1     1    1 :  252  126
inode_cache        20808  20808    512  2601  2601    1 :  124   62
dentry_cache       23010  23010    128   767   767    1 :  252  126
dquot                  0      0    128     0     0    1 :  252  126
filp                1110   1110    128    37    37    1 :  252  126
names_cache            8      8   4096     8     8    1 :   60   30
buffer_head        32310  32310    128  1077  1077    1 :  252  126
mm_struct             48     48    160     2     2    1 :  252  126
vm_area_struct      1904   2408     68    43    43    1 :  252  126
fs_cache              59     59     64     1     1    1 :  252  126
files_cache           54     54    416     6     6    1 :  124   62
signal_act            51     51   1312    17    17    1 :   60   30
pae_pgd              113    113     32     1     1    1 :  252  126
size-131072(DMA)       0      0 131072     0     0   32 :    0    0
size-131072            0      0 131072     0     0   32 :    0    0
size-65536(DMA)        0      0  65536     0     0   16 :    0    0
size-65536            20     20  65536    20    20   16 :    0    0
size-32768(DMA)        0      0  32768     0     0    8 :    0    0
size-32768             3      3  32768     3     3    8 :    0    0
size-16384(DMA)        0      0  16384     0     0    4 :    0    0
size-16384             0      5  16384     0     5    4 :    0    0
size-8192(DMA)         0      0   8192     0     0    2 :    0    0
size-8192              5     10   8192     5    10    2 :    0    0
size-4096(DMA)         0      0   4096     0     0    1 :   60   30
size-4096             40     40   4096    40    40    1 :   60   30
size-2048(DMA)         0      0   2048     0     0    1 :   60   30
size-2048             20     20   2048    10    10    1 :   60   30
size-1024(DMA)         0      0   1024     0     0    1 :  124   62
size-1024             92     92   1024    23    23    1 :  124   62
size-512(DMA)          0      0    512     0     0    1 :  124   62
size-512             104    104    512    13    13    1 :  124   62
size-256(DMA)          0      0    256     0     0    1 :  252  126
size-256              75     75    256     5     5    1 :  252  126
size-128(DMA)          0      0    128     0     0    1 :  252  126
size-128             900    900    128    30    30    1 :  252  126
size-64(DMA)           0      0     64     0     0    1 :  252  126
size-64             3835   3835     64    65    65    1 :  252  126
size-32(DMA)           0      0     32     0     0    1 :  252  126
size-32              904    904     32     8     8    1 :  252  126

This list tells you the cache name, the number of active objects, the
total number of objects, the size of the managed objects, the number of
active slabs, the total number of slabs and the pages per slab (the two
values after the colon are the per-CPU limit and batch count on SMP).
Here you can see that the kernel allocates special caches for filesystem
drivers, network stuff etc. as well as general purpose caches (size-*),
each available for DMA and for ordinary memory access.
All caches are kept in a doubly linked list, which makes it easy to
traverse them in order to find a suitable cache for an allocation.
Every cache holds three linked lists of slabs: one for free, one for
partially free and one for full slabs. Additionally, every cache has a
per-CPU array of pointers to free objects in the slabs, managed in LIFO
fashion (just-kfree'd objects should be given away again asap).

+-------+
| CACHE |
|-------|                   +-------+    +----------------+
|       |------------------>| CPU_0 |--->| Array w/ ptrs  |
|       |                   | CPU_N |    | to unused objs |
| free  |-->[ SLAB HEAD ]   +-------+    | in slabs       |
|       |                                +----------------+
|partial|---------------------->+--------+---->+--------+---->+--------+---->+-
|       |<----------------------|  SLAB  |<----|  SLAB  |<----|  SLAB  |<----|
| full  |--[ SLAB HEAD ]        |  HEAD  |     |  HEAD  |     |  HEAD  |     |
|       |                       +--------+     +--------+     +--------+     +
+-------+                       |        |     |        |     |        |     |
                                |  obj   |     |  obj   |     |  obj   |     |
                                .        .     .        .     .        .
                                .        .     .        .     .        .

Fig. II
"""""""

/*
 * kmem_cache_t
 *
 * manages a cache.
 */

#define CACHE_NAMELEN 20        /* max name length for a slab cache */

struct kmem_cache_s {
/* 1) each alloc & free */
        /* full, partial first, then free */
        struct list_head        slabs_full;
        struct list_head        slabs_partial;
        struct list_head        slabs_free;
        unsigned int            objsize;
        unsigned int            flags;          /* constant flags */
        unsigned int            num;            /* # of objs per slab */
        spinlock_t              spinlock;
#ifdef CONFIG_SMP
        unsigned int            batchcount;
#endif

/* 2) slab additions /removals */
        /* order of pgs per slab (2^n) */
        unsigned int            gfporder;

        /* force GFP flags, e.g. GFP_DMA */
        unsigned int            gfpflags;

        size_t                  colour;         /* cache colouring range */
        unsigned int            colour_off;     /* colour offset */
        unsigned int            colour_next;    /* cache colouring */
        kmem_cache_t            *slabp_cache;
        unsigned int            growing;
        unsigned int            dflags;         /* dynamic flags */

        /* constructor func */
        void (*ctor)(void *, kmem_cache_t *, unsigned long);

        /* de-constructor func */
        void (*dtor)(void *, kmem_cache_t *, unsigned long);

        unsigned long           failures;

/* 3) cache creation/removal */
        char                    name[CACHE_NAMELEN];
        struct list_head        next;
#ifdef CONFIG_SMP
/* 4) per-cpu data */
        cpucache_t              *cpudata[NR_CPUS];
#endif
#if STATS
        unsigned long           num_active;
        unsigned long           num_allocations;
        unsigned long           high_mark;
        unsigned long           grown;
        unsigned long           reaped;
        unsigned long           errors;
#ifdef CONFIG_SMP
        atomic_t                allochit;
        atomic_t                allocmiss;
        atomic_t                freehit;
        atomic_t                freemiss;
#endif
#endif
};
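
/proc/slabinfo is world readable, so the fill state of any cache can be
watched from unprivileged userspace. Below is a tiny hypothetical helper
(my illustration, not an existing tool) that greps the list for one cache
name; it will come in handy later for watching object counts move while
we consume objects of a cache:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
        const char *target = (argc > 1) ? argv[1] : "size-32";
        char line[256];
        FILE *f = fopen("/proc/slabinfo", "r");

        if (f == NULL) {
                perror("fopen");
                return 1;
        }
        while (fgets(line, sizeof(line), f) != NULL)
                if (strncmp(line, target, strlen(target)) == 0)
                        fputs(line, stdout);    /* active objs, total objs, ... */
        fclose(f);
        return 0;
}

Run it e.g. as "./slabwatch filp" before and after a syscall to see which
cache the allocation lands in and how many objects are in use.
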
II.II Slabs
"""""""""""

As seen in Fig. I, slabs hold objects of the same size.
The head of each slab looks as follows:

/*
 * struct list_head {
 *         struct list_head *next, *prev;
 * };
 * typedef struct list_head list_t;
 */

typedef struct slab_s {
        struct list_head        list;
        unsigned long           colouroff;
        void                    *s_mem;         /* pointer to the objects */
        unsigned int            inuse;          /* num of objs active in slab */
        kmem_bufctl_t           free;           /* typedef unsigned int kmem_bufctl_t; */
} slab_t;

Each header is located PAGE_SIZE aligned at the beginning of an (on-slab)
slab. Every object in the slab is sizeof(void *) aligned to increase
access speed. After this header follows an array containing an int value
for every object. These values, however, are only meaningful for currently
free objects and serve as an index to the next free object in the slab. A
value called BUFCTL_END (slab.c: #define BUFCTL_END 0xffffFFFF) marks the
end of this array.
"colouroff" describes "offsetting the slab_t structure into the slab area
to maximize cache alignment." (slab.c)
It is also possible that the slab header is stored "off-slab" in an
independent object. Thanks to the *s_mem member of the slab_t struct it is
unimportant where the slab head is stored, because s_mem always points to
the beginning of the objects of a slab.
The decision for on- or off-slab is made in kmem_cache_create():

---8<---
	/* Determine if the slab management is 'on' or 'off' slab. */
	if (size >= (PAGE_SIZE>>3))	// if (size requested >= 512)
		/*
		 * Size is large, assume best to place the slab management obj
		 * off-slab (should allow better packing of objs).
		 */
		flags |= CFLGS_OFF_SLAB;	// a special flag is set
---8<---

// i.e. if the requested object size is >= 512 bytes. BUT:

---8<---
	/*
	 * If the slab has been placed off-slab, and we have enough space then
	 * move it on-slab. This is at the expense of any extra colouring.
	 */
	if (flags & CFLGS_OFF_SLAB && left_over >= slab_size) {
		flags &= ~CFLGS_OFF_SLAB;
		left_over -= slab_size;
	}
---8<---

If the header fits into the left-over slab space, it is stored on-slab
after all!

Exploitation note:
""""""""""""""""""

As we are writing towards upper memory addresses it is NOT possible to
overwrite the current (previously written) slab header:

<--0x00                                   .-overflowing obj          0xff-->
[SLAB HEADER][COLOUR][obj 1][obj 2][ctrl_][aaaaa][aaaaa][aaaaa][aaaaa]

but we can overwrite other objects allocated after "us" inside our slab.
Here we must keep in mind that the kernel tries to balance the load
between all available slabs, so all "partial" slabs should have roughly
the same count of in-use objects..... the problem is that, naively, we can
only overflow *random* other objects that happen to be allocated by other
kernel routines after our bug trigger.

randomness or increasing reliability:
"""""""""""""""""""""""""""""""""""""

By carefully consuming objects of a certain cache you can - with the above
stuff in mind - control quite well which object comes to lie next to your
overflowing object, and thus exploit the use of that other allocated
object quite reliably... ..got it.. use your brains! It's time to find
kernel functions you can trigger from userspace that allocate objects in
the cache your overflow lives in (a small grooming sketch is appended
below). And erm, successful exploitation is actually possible!

have fun!

-q
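
Appendix - a grooming sketch
""""""""""""""""""""""""""""

A minimal, hypothetical sketch of the object-consuming idea from above.
The open()/filp pair is only an example: each open() takes one 128 byte
object from the "filp" cache (see the slabinfo listing); for a real
target you would substitute whatever userspace-triggerable allocation
hits the cache your overflow lives in, and tune the count while watching
/proc/slabinfo.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define NOBJS 512               /* tune by watching /proc/slabinfo */

int main(void)
{
        int fd[NOBJS];
        int i;

        for (i = 0; i < NOBJS; i++) {
                fd[i] = open("/dev/null", O_RDONLY);    /* one filp object per call */
                if (fd[i] < 0) {
                        perror("open");
                        break;
                }
        }

        /* ...trigger the overflow here: with the cache filled up, the
         * object next to the overflowing one is far more predictable... */

        while (--i >= 0)
                close(fd[i]);   /* kfree: objects go back to the cache */
        return 0;
}
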