KPTI: implementation mechanism, performance, and overhead

1 KPTI Overview
KPTI stands for Kernel Page Table Isolation and is derived from the KAISER patch. Before KPTI, the address space of a process was divided into a kernel address space and a user address space. The kernel address space maps the entire physical address space, while the user address space maps only the physical memory assigned to it. Both share one page global directory (PGD, the top-level table that describes the whole address space of the process), and the Meltdown vulnerability takes advantage of this. In the time window between an illegal access to a kernel address and the CPU handling the resulting exception, micro-operations that have already been issued can read the kernel data, and the attacker recovers it from there. To completely prevent user programs from obtaining kernel data, the kernel address space and the user address space are given two separate sets of page tables (that is, two PGDs are used).
Figure 1 modified process address space
2 Problems
Of course, things are not that simple. There are two problems:
Problem 1: On the x86 architecture, part of the memory must be valid in both kernel space and user space during the switch between them, because the kernel starts running before CR3 is switched.
Problem 2: When CR3 is modified, the CPU flushes the TLB, which causes a significant performance penalty.
3 KPTI implementation mechanism
In the KAISER paper, the following solutions are proposed for these two problems.
3.1 Shadow Address Spaces
Under KPTI each process has two address spaces. The first can only be used in kernel mode and maps both the kernel and user space (the user-space part is protected by SMAP and SMEP; see the Intel manual for details). The second is called the shadow address space and contains only user space. However, because of context switching, a small part of the kernel address space must also be included in the shadow address space so that the interrupt entry and exit code is mapped.
When an interrupt occurs in user mode, CR3 has to be switched from the shadow address space to the kernel address space. The top half of interrupt handling must be as fast as possible, so the CR3 switch must be fast as well. To achieve this, KAISER places the kernel-space PGD and the user-space PGD next to each other in one 8 KB block of memory. The block must be 8 KB aligned, which reduces the CR3 switch to setting or clearing the 13th bit of the CR3 value (bit 12 when counting from zero).
Schematic diagram of PGD distribution of user space and kernel space
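The bit flip described above can be sketched in C. This is a minimal user-space model, not kernel code: the names `PGTABLE_SWITCH_BIT`, `pgd_pair_is_aligned`, `user_pgd_of` and `kernel_pgd_of` are hypothetical, and the sketch assumes that the kernel PGD occupies the lower 4 KB page of the 8 KB pair and the shadow (user) PGD the upper one.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name: bit 12 selects one of the two 4 KB halves of the pair. */
#define PGTABLE_SWITCH_BIT  12
#define PGTABLE_SWITCH_MASK (UINT64_C(1) << PGTABLE_SWITCH_BIT)

/* The trick only works if the pair starts on an 8 KB boundary. */
static inline int pgd_pair_is_aligned(uint64_t kernel_pgd_phys)
{
    return (kernel_pgd_phys & (2 * 4096 - 1)) == 0;
}

/* Kernel PGD -> shadow (user) PGD: set bit 12. */
static inline uint64_t user_pgd_of(uint64_t kernel_pgd_phys)
{
    assert(pgd_pair_is_aligned(kernel_pgd_phys));
    return kernel_pgd_phys | PGTABLE_SWITCH_MASK;
}

/* Shadow (user) PGD -> kernel PGD: clear bit 12. */
static inline uint64_t kernel_pgd_of(uint64_t user_pgd_phys)
{
    return user_pgd_phys & ~PGTABLE_SWITCH_MASK;
}
```

Because the two tables differ in exactly one address bit, no memory lookup is needed to find the other PGD; a single OR or AND on the register value is enough.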
3.2 Minimum mapping of kernel space
As mentioned above, the kernel has to start running before CR3 is switched from the shadow address space to the kernel address space, so the shadow address space must contain part of the kernel address space.
In the figure below, the shaded part is the kernel data and code that need to be mapped in kernel mode. Figure a is the address space of a process on a regular OS. Figures b and c are the process address spaces after the page tables are isolated; the difference between the two depends on whether the SMAP and SMEP mechanisms are used.
So how do we determine which kernel data the shadow address space should map? Since an interrupt may occur in user mode, it must contain the interrupt descriptor table (IDT), the interrupt stack, and the interrupt vectors. In addition to the kernel stack, the GDT and the TSS should also be mapped into the shadow address space.
4 Performance and overhead
4.1 TLB
As described in the Intel manual, the upper bits of a linear address are called the page number and the lower bits are called the page offset (with a 4 KB page size, the offset is the lower 12 bits). The upper bits of a physical address are called the page frame.
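As an illustration of this split, here is a minimal C sketch that assumes 4 KB pages. The helper names (`page_number`, `page_offset`, `translate`) are hypothetical and only model the arithmetic, not any real MMU interface.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT       12
#define PAGE_SIZE        (UINT64_C(1) << PAGE_SHIFT)
#define PAGE_OFFSET_MASK (PAGE_SIZE - 1)

/* Upper bits of the linear address: the page number. */
static inline uint64_t page_number(uint64_t linear)
{
    return linear >> PAGE_SHIFT;
}

/* Lower 12 bits of the linear address: the page offset. */
static inline uint64_t page_offset(uint64_t linear)
{
    return linear & PAGE_OFFSET_MASK;
}

/* Physical address = page frame base + unchanged page offset. */
static inline uint64_t translate(uint64_t frame_base, uint64_t linear)
{
    assert((frame_base & PAGE_OFFSET_MASK) == 0); /* frames are page aligned */
    return frame_base | page_offset(linear);
}
```

Only the page number takes part in the translation; the offset is carried over unchanged, which is why a cache keyed by page number is sufficient.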
The TLB is essentially a cache that speeds up the translation from linear addresses to physical addresses. It uses the page number to look up the base address of the page frame that backs the linear address. Each TLB entry contains the following:
The physical address of the page frame that corresponds to the page number
Page access rights (R/W, U/S)
Page attributes (dirty flag, memory type)
Figure 4-1 TLB-based memory access process
A processor may contain different types of TLBs, such as TLBs dedicated to instruction fetches and TLBs for data accesses.
When CR3 is switched, the CPU implicitly flushes the TLB. The penalty of a TLB miss can reach 10 to 100 clock cycles. Some pages in memory (such as shared libraries) are shared by all processes. These pages are marked with the global bit (G) of the page table entry, and global pages are not affected by the implicit TLB flush.
There are two ways to prevent data leakage: the first is to flush the entire TLB, and the second is to disable the global bit in the page table entries.
Using PCID can alleviate the performance problems caused by flushing the TLB.
4.2 Process-Context Identifiers (PCID)
PCID stands for process-context identifier. The PCIDE bit of the CR4 register indicates whether the PCID feature of the CPU is enabled; PCIDE = 1 means it is enabled. When enabled, the lower 12 bits of CR3 (the page-table base register) store the PCID, and each process has its own PCID. When PCID is not enabled, the current PCID is always 000H.
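The CR3 layout with PCID enabled can be modeled in a few lines of C. This is a sketch under stated assumptions: the names `make_cr3`, `cr3_pcid` and `cr3_pgd` are hypothetical, the PGD is assumed to be 4 KB aligned, and bit 63 is assumed to be clear.

```c
#include <assert.h>
#include <stdint.h>

#define CR3_PCID_BITS 12
#define CR3_PCID_MASK ((UINT64_C(1) << CR3_PCID_BITS) - 1)

/* Combine a 4 KB-aligned PGD physical address with a 12-bit PCID. */
static inline uint64_t make_cr3(uint64_t pgd_phys, uint16_t pcid)
{
    assert((pgd_phys & CR3_PCID_MASK) == 0); /* low 12 bits must be free */
    assert(pcid <= CR3_PCID_MASK);           /* PCID must fit in 12 bits */
    return pgd_phys | pcid;
}

/* Extract the PCID from the low 12 bits. */
static inline uint16_t cr3_pcid(uint64_t cr3)
{
    return (uint16_t)(cr3 & CR3_PCID_MASK);
}

/* Extract the page-table base (assumes bit 63 is clear). */
static inline uint64_t cr3_pgd(uint64_t cr3)
{
    return cr3 & ~CR3_PCID_MASK;
}
```

With 12 bits there are 4096 possible PCID values, far more than the handful of identifiers the KPTI patch actually uses per CPU.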
The Intel manual explains TLB invalidation in detail. When CR3 is written with a mov instruction (MOV to CR3), TLB entries are invalidated as follows:
If CR4.PCIDE = 0 (PCID is not enabled), the CPU invalidates all TLB entries associated with PCID 000H, except those for global pages.
If CR4.PCIDE = 1 (PCID is enabled) and bit 63 of the source operand is 0, bits 0-11 of the source operand specify a PCID. The CPU invalidates all TLB entries associated with that PCID, except those for global pages. TLB entries associated with other PCIDs are not invalidated.
If CR4.PCIDE = 1 and bit 63 of the source operand is 1, the CPU does not invalidate any TLB entries.
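The three rules above can be replayed on a toy TLB. The following C sketch is only a model for illustration: `struct tlb_entry` and `mov_to_cr3` are hypothetical names, and a real TLB is of course not a flat array that software can inspect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CR3_NOFLUSH   (UINT64_C(1) << 63)
#define CR3_PCID_MASK UINT64_C(0xFFF)

struct tlb_entry {
    bool     valid;
    bool     global; /* G bit was set in the page table entry */
    uint16_t pcid;   /* PCID the translation was cached under */
};

/* Apply the MOV-to-CR3 invalidation rules to a toy TLB of n entries. */
static void mov_to_cr3(struct tlb_entry *tlb, size_t n,
                       bool pcide, uint64_t operand)
{
    uint16_t target;

    assert(tlb != NULL || n == 0);

    /* PCIDE = 1 and bit 63 set: nothing is invalidated. */
    if (pcide && (operand & CR3_NOFLUSH))
        return;

    /* PCIDE = 1: PCID from bits 0-11; PCIDE = 0: always PCID 000H. */
    target = pcide ? (uint16_t)(operand & CR3_PCID_MASK) : 0;

    /* Drop every non-global entry tagged with the target PCID. */
    for (size_t i = 0; i < n; i++)
        if (tlb[i].pcid == target && !tlb[i].global)
            tlb[i].valid = false;
}
```

Entries tagged with other PCIDs survive the write, which is what lets a CPU keep the translations of several address spaces cached at the same time.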
5 Code analysis
We use Linux 4.15 to show how the KPTI patch is distributed across the kernel. The PTI (page table isolation) diff stat, taken here for version 4.16, shows that 45 files were modified in total, with 1636 lines of code inserted and 202 lines deleted.
The three files with the most added lines are:
arch/x86/mm/pti.c
arch/x86/include/asm/tlbflush.h
arch/x86/entry/calling.h
5.1 arch/x86/mm/pti.c
pti.c is a new file added by the patch. Its entry function is pti_init(), which is called from the mm_init() function in init/main.c. The file contains two kinds of functions. The first kind, such as pti_clone_user_shared(), copies kernel page table entries into the user page tables. The second kind, such as pti_user_pagetable_walk_p4d(unsigned long address), returns a pointer to the page table entry that corresponds to the virtual address passed as the parameter.
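void __init pti_init(void)
{
    /* Do nothing unless page table isolation is enabled on this CPU. */
    if (!static_cpu_has(X86_FEATURE_PTI))
        return;

    pr_info("enabled\n");

    /* Clone the kernel mappings that must stay visible in user mode. */
    pti_clone_user_shared();
    pti_clone_entry_text();
    pti_setup_espfix64();
    pti_setup_vsyscall();
}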
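A walk helper of the second kind has to turn a virtual address into one table index per level. As a rough illustration, the sketch below computes those indices in user-space C. It assumes x86-64 with 4-level paging and 512 entries per table; the helper `table_index` is a hypothetical name, and the real kernel helpers also allocate missing tables, which is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT     12
#define PMD_SHIFT      21
#define PUD_SHIFT      30
#define PGD_SHIFT      39
#define PTRS_PER_TABLE 512

/* Extract the 9-bit table index that starts at bit position 'shift'. */
static inline unsigned int table_index(uint64_t addr, unsigned int shift)
{
    assert(shift >= PAGE_SHIFT && shift <= PGD_SHIFT);
    return (unsigned int)((addr >> shift) & (PTRS_PER_TABLE - 1));
}

static inline unsigned int pgd_index(uint64_t addr)
{
    return table_index(addr, PGD_SHIFT);  /* bits 39-47 */
}

static inline unsigned int pud_index(uint64_t addr)
{
    return table_index(addr, PUD_SHIFT);  /* bits 30-38 */
}

static inline unsigned int pmd_index(uint64_t addr)
{
    return table_index(addr, PMD_SHIFT);  /* bits 21-29 */
}

static inline unsigned int pte_index(uint64_t addr)
{
    return table_index(addr, PAGE_SHIFT); /* bits 12-20 */
}
```

Each index selects one entry in the table found at the previous level, so a walk is a chain of four such lookups starting from the PGD.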
5.2 arch/x86/include/asm/tlbflush.h
This file contains a series of functions related to TLB flushing. KPTI does not use the PCID alone, because the address space identifier that the kernel assigns to a process must start from 0. The ASID is therefore the canonical identifier of an address space. And because the address space of a process now has two parts, two PCIDs are needed: kPCID is the identifier used for the kernel address space, and uPCID is the identifier used for the user address space.
/*
 * ASID  - [0, TLB_NR_DYN_ASIDS-1]
 *         the canonical identifier for an mm
 *
 * kPCID - [1, TLB_NR_DYN_ASIDS]
 *         the value we write into the PCID part of CR3;
 *         ASID+1, because PCID 0 is special.
 *
 * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
 *         for KPTI each mm has two address spaces and thus needs two
 *         PCID values, but we can still do with a single ASID denomination
 *         for each mm. Corresponds to kPCID + 2048.
 */
#define CR3_HW_ASID_BITS 12
# define PTI_CONSUMED_PCID_BITS 1
/*
 * 6 because 6 should be plenty and struct tlb_state will fit in two cache
 * lines.
 */
#define TLB_NR_DYN_ASIDS 6
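The mapping from ASID to the two PCID values follows directly from the ranges in the comment. A minimal user-space sketch, assuming that the user/kernel distinction is carried by PCID bit 11 (1 << 11 == 2048); the macro name `USER_PCID_BIT` is hypothetical, and the function names are chosen for readability here rather than quoted from the source:

```c
#include <assert.h>
#include <stdint.h>

#define TLB_NR_DYN_ASIDS 6
#define USER_PCID_BIT    11 /* hypothetical name; 1 << 11 == 2048 */

/* kPCID = ASID + 1, because PCID 0 is special. */
static inline uint16_t kern_pcid(uint16_t asid)
{
    assert(asid < TLB_NR_DYN_ASIDS);
    return (uint16_t)(asid + 1);
}

/* uPCID = kPCID + 2048; OR is equivalent because kPCID < 2048. */
static inline uint16_t user_pcid(uint16_t asid)
{
    return (uint16_t)(kern_pcid(asid) | (1u << USER_PCID_BIT));
}
```

Reserving one PCID bit for the user/kernel distinction leaves 11 bits for the identifier itself, which matches PTI_CONSUMED_PCID_BITS being 1.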
5.3 arch/x86/entry/calling.h
calling.h contains the macros that the system call entry code uses to save registers when a system call is made. A system call involves switching from user mode to kernel mode, so calling.h had to be modified.
The following assembly macros switch between the user PGD and the kernel PGD. A few of them are picked out here for illustration:
1. SWITCH_TO_KERNEL_CR3
This macro clears the user PCID bit stored in CR3 and clears the 13th bit of CR3 (bit 12) so that CR3 points to the kernel PGD.
.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
    ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
    mov     %cr3, \scratch_reg
    ADJUST_KERNEL_CR3 \scratch_reg
    mov     \scratch_reg, %cr3
.Lend_\@:
.endm
2. SWITCH_TO_USER_CR3_NOSTACK
This macro uses the ASID of the process to decide whether its TLB entries need to be flushed. If no flush is needed, it sets the no-flush bit in the value written to CR3. It then converts the kPCID to the uPCID and points CR3 at the user PGD. All of this takes very little time, because it only manipulates the value of the CR3 register.
.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
    ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
    mov     %cr3, \scratch_reg

    ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID

    /*
     * Test if the ASID needs a flush.
     */
    movq    \scratch_reg, \scratch_reg2
    andq    $(0x7FF), \scratch_reg      /* mask ASID */
    bt      \scratch_reg, THIS_CPU_user_pcid_flush_mask
    jnc     .Lnoflush_\@

    /* Flush needed, clear the bit */
    btr     \scratch_reg, THIS_CPU_user_pcid_flush_mask
    movq    \scratch_reg2, \scratch_reg
    jmp     .Lwrcr3_pcid_\@

.Lnoflush_\@:
    movq    \scratch_reg2, \scratch_reg
    SET_NOFLUSH_BIT \scratch_reg

.Lwrcr3_pcid_\@:
    /* Flip the ASID to the user version */
    orq     $(PTI_USER_PCID_MASK), \scratch_reg

.Lwrcr3_\@:
    /* Flip the PGD to the user version */
    orq     $(PTI_USER_PGTABLE_MASK), \scratch_reg
    mov     \scratch_reg, %cr3
.Lend_\@:
.endm
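For readers who find the assembly hard to follow, the control flow of this macro can be restated as a C function. This is only a user-space model under stated assumptions: the function name `switch_to_user_cr3` is hypothetical, the PTI feature is assumed to be enabled, the flush mask is modeled as a 16-bit per-CPU bitmap, and the mask values (bit 11 for the user PCID, bit 12 for the user PGD, bit 63 for no-flush) are assumptions consistent with the text above rather than values quoted from the source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR3_NOFLUSH       (UINT64_C(1) << 63)
#define USER_PCID_MASK    (UINT64_C(1) << 11) /* kPCID -> uPCID */
#define USER_PGTABLE_MASK (UINT64_C(1) << 12) /* kernel PGD -> user PGD */
#define ASID_MASK         UINT64_C(0x7FF)

/*
 * kernel_cr3 is the current CR3 value. *flush_mask has one bit per
 * kernel PCID whose user-side TLB entries still need a flush.
 * Returns the value the macro would write to CR3.
 */
static uint64_t switch_to_user_cr3(uint64_t kernel_cr3, bool has_pcid,
                                   uint16_t *flush_mask)
{
    uint64_t cr3 = kernel_cr3;

    if (has_pcid) {
        uint64_t asid = cr3 & ASID_MASK;
        uint16_t bit;

        assert(asid < 16); /* must fit the 16-bit bitmap of this model */
        bit = (uint16_t)(1u << asid);

        if (*flush_mask & bit)
            *flush_mask &= (uint16_t)~bit; /* flush needed: clear the bit */
        else
            cr3 |= CR3_NOFLUSH;            /* no flush: keep TLB entries */

        cr3 |= USER_PCID_MASK;             /* flip the ASID to the user version */
    }

    cr3 |= USER_PGTABLE_MASK;              /* flip the PGD to the user version */
    return cr3;
}
```

When a flush is pending, the no-flush bit is left clear, so the write to CR3 itself invalidates the stale entries of the user PCID and no separate flush instruction is needed.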
