A Brief Look at the nv_peer_memory Implementation

nv_peer_memory is a kernel module that needs to be installed in order to use GPUDirect RDMA from Linux user space. Once it is loaded, a developer can ibv_reg_mr() a chunk of GPU memory directly from user space and then run RDMA on that memory, without staging the data through host memory.
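
To make this concrete, below is a minimal user-space sketch (hypothetical, with most error handling omitted) that registers a cudaMalloc'ed buffer exactly as if it were host memory; with nv_peer_mem loaded the registration succeeds, and the resulting lkey/rkey can then be used with normal RDMA verbs.

/* sketch: register GPU memory for RDMA; assumes a CUDA GPU, an IB NIC,
 * and the nv_peer_mem module loaded */
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
	struct ibv_device **devs = ibv_get_device_list(NULL);
	struct ibv_context *ctx = ibv_open_device(devs[0]);
	struct ibv_pd *pd = ibv_alloc_pd(ctx);

	void *gpu_buf;
	size_t len = 1 << 20;
	cudaMalloc(&gpu_buf, len);             /* device memory, not host RAM */

	/* ib_core routes this VA range to the peer client instead of
	 * get_user_pages(), so the registration works even though
	 * gpu_buf does not point into system memory. */
	struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
				       IBV_ACCESS_LOCAL_WRITE |
				       IBV_ACCESS_REMOTE_READ |
				       IBV_ACCESS_REMOTE_WRITE);
	if (!mr) {
		perror("ibv_reg_mr(GPU memory)");
		return 1;
	}
	printf("lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

	ibv_dereg_mr(mr);
	cudaFree(gpu_buf);
	ibv_dealloc_pd(pd);
	ibv_close_device(ctx);
	ibv_free_device_list(devs);
	return 0;
}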

This sounds fancy enough that you might expect the implementation of nv_peer_memory to be complicated. In reality, the most complex part, the DMA-mapping interface, is provided by the NVIDIA driver; the nv_peer_mem module itself does not do much. It simply relies on the IB PeerDirect mechanism so that the driver's interface for DMA-mapping GPU memory can be conveniently reached from user space.

PeerDirect

PeerDirect lets a Mellanox IB NIC move data directly to and from another PCIe device, driven from user space. Take nv_peer_mem: once the module is installed, GPU memory can be registered with the IB NIC from user space, after which RDMA operations can target that GPU memory directly.

Using PeerDirect requires implementing a kernel driver that registers a peer_memory_client with ib_core (via ib_register_peer_memory_client()); a minimal registration skeleton is sketched after the header listing below.

The definitions around peer_memory_client live in mlnx-ofed-kernel/include/peer_mem.h:

/**
 *  struct peer_memory_client - registration information for user virtual
 *                              memory handlers
 *
 * The peer_memory_client scheme allows a driver to register with the ib_umem
 * system that it has the ability to understand user virtual address ranges
 * that are not compatible with get_user_pages(). For instance VMAs created
 * with io_remap_pfn_range(), or other driver special VMA.
 *
 * For ranges the interface understands it can provide a DMA mapped sg_table
 * for use by the ib_umem, allowing user virtual ranges that cannot be
 * supported by get_user_pages() to be used as umems.
 */
struct peer_memory_client {
	char name[IB_PEER_MEMORY_NAME_MAX];
	char version[IB_PEER_MEMORY_VER_MAX];

	/**
	 * acquire - Begin working with a user space virtual address range
	 *
	 * @addr - Virtual address to be checked whether belongs to peer.
	 * @size - Length of the virtual memory area starting at addr.
	 * @peer_mem_private_data - Obsolete, always NULL
	 * @peer_mem_name - Obsolete, always NULL
	 * @client_context - Returns an opaque value for this acquire use in
	 *                   other APIs
	 *
	 * Returns 1 if the peer_memory_client supports the entire virtual
	 * address range, 0 or -ERRNO otherwise. If 1 is returned then
	 * release() will be called to release the acquire().
	 */
	int (*acquire)(unsigned long addr, size_t size,
		       void *peer_mem_private_data, char *peer_mem_name,
		       void **client_context);
	/**
	 * get_pages - Fill in the first part of a sg_table for a virtual
	 *             address range
	 *
	 * @addr - Virtual address to be checked whether belongs to peer.
	 * @size - Length of the virtual memory area starting at addr.
	 * @write - Always 1
	 * @force - 1 if write is required
	 * @sg_head - Obsolete, always NULL
	 * @client_context - Value returned by acquire()
	 * @core_context - Value to be passed to invalidate_peer_memory for
	 *                 this get
	 *
	 * addr/size are passed as the raw virtual address range requested by
	 * the user, it is not aligned to any page size. get_pages() is always
	 * followed by dma_map().
	 *
	 * Upon return the caller can call the invalidate_callback().
	 *
	 * Returns 0 on success, -ERRNO on failure. After success put_pages()
	 * will be called to return the pages.
	 */
	int (*get_pages)(unsigned long addr, size_t size, int write, int force,
			 struct sg_table *sg_head, void *client_context,
			 u64 core_context);
	/**
	 * dma_map - Create a DMA mapped sg_table
	 *
	 * @sg_head - The sg_table to allocate
	 * @client_context - Value returned by acquire()
	 * @dma_device - The device that will be doing DMA from these addresses
	 * @dmasync - Obsolete, always 0
	 * @nmap - Returns the number of dma mapped entries in the sg_head
	 *
	 * Must be called after get_pages(). This must fill in the sg_head with
	 * DMA mapped SGLs for dma_device. Each SGL start and end must meet a
	 * minimum alignment of at least PAGE_SIZE, though individual sgls can
	 * be multiples of PAGE_SIZE, in any mixture. Since the user virtual
	 * address/size are not page aligned, the implementation must increase
	 * it to the logical alignment when building the SGLs.
	 *
	 * Returns 0 on success, -ERRNO on failure. After success dma_unmap()
	 * will be called to unmap the pages. On failure sg_head must be left
	 * untouched or point to a valid sg_table.
	 */
	int (*dma_map)(struct sg_table *sg_head, void *client_context,
		       struct device *dma_device, int dmasync, int *nmap);
	/**
	 * dma_unmap - Unmap a DMA mapped sg_table
	 *
	 * @sg_head - The sg_table to unmap
	 * @client_context - Value returned by acquire()
	 * @dma_device - The device that will be doing DMA from these addresses
	 *
	 * sg_head will not be touched after this function returns.
	 *
	 * Must return 0.
	 */
	int (*dma_unmap)(struct sg_table *sg_head, void *client_context,
			 struct device *dma_device);
	/**
	 * put_pages - Unpin a SGL
	 *
	 * @sg_head - The sg_table to unpin
	 * @client_context - Value returned by acquire()
	 *
	 * sg_head must be freed on return.
	 */
	void (*put_pages)(struct sg_table *sg_head, void *client_context);
	/* Obsolete, not used */
	unsigned long (*get_page_size)(void *client_context);
	/**
	 * release - Undo acquire
	 *
	 * @client_context - Value returned by acquire()
	 *
	 * If acquire() returns 1 then release() must be called. All
	 * get_pages() and dma_map()'s must be undone before calling this
	 * function.
	 */
	void (*release)(void *client_context);
};

enum {
	PEER_MEM_INVALIDATE_UNMAPS = 1 << 0,
};

struct peer_memory_client_ex {
	struct peer_memory_client client;
	size_t ex_size;
	u32 flags;
};

/*
 * If invalidate_callback() is non-NULL then the client will only support
 * umems which can be invalidated. The caller may call the
 * invalidate_callback() after acquire() on return the range will no longer
 * have DMA active, and release() will have been called.
 *
 * Note: The implementation locking must ensure that get_pages(), and
 * dma_map() do not have locking dependencies with invalidate_callback(). The
 * ib_core will wait until any concurrent get_pages() or dma_map() completes
 * before returning.
 *
 * Similarly, this can call dma_unmap(), put_pages() and release() from within
 * the callback, or will wait for another thread doing those operations to
 * complete.
 *
 * For these reasons the user of invalidate_callback() must be careful with
 * locking.
 */
typedef int (*invalidate_peer_memory)(void *reg_handle, u64 core_context);

void *
ib_register_peer_memory_client(const struct peer_memory_client *peer_client,
	invalidate_peer_memory *invalidate_callback);
void ib_unregister_peer_memory_client(void *reg_handle);
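
To see how a client module wires itself up, here is a minimal registration skeleton. It is a sketch only, loosely modeled on nv_peer_mem's init path; the demo_* callbacks are hypothetical stubs standing in for a real implementation.

#include <linux/module.h>
#include "peer_mem.h"                       /* the header quoted above */

/* hypothetical stubs; a real client implements each hook against its
 * own device driver, as nv_peer_mem does */
static int demo_acquire(unsigned long addr, size_t size, void *priv,
			char *name, void **client_context)
{ return 0; /* this stub declines every range */ }
static int demo_get_pages(unsigned long addr, size_t size, int write,
			  int force, struct sg_table *sg_head,
			  void *client_context, u64 core_context)
{ return -EOPNOTSUPP; }
static int demo_dma_map(struct sg_table *sg_head, void *client_context,
			struct device *dma_device, int dmasync, int *nmap)
{ return -EOPNOTSUPP; }
static int demo_dma_unmap(struct sg_table *sg_head, void *client_context,
			  struct device *dma_device)
{ return 0; }
static void demo_put_pages(struct sg_table *sg_head, void *client_context) {}
static void demo_release(void *client_context) {}

static struct peer_memory_client demo_client = {
	.name      = "demo_peer_mem",
	.version   = "1.0",
	.acquire   = demo_acquire,
	.get_pages = demo_get_pages,
	.dma_map   = demo_dma_map,
	.dma_unmap = demo_dma_unmap,
	.put_pages = demo_put_pages,
	.release   = demo_release,
};

static invalidate_peer_memory mem_invalidate_cb;
static void *reg_handle;

static int __init demo_init(void)
{
	/* ib_core returns a registration handle and hands back the
	 * invalidate callback the client may fire later (see
	 * "Releasing Resources" below) */
	reg_handle = ib_register_peer_memory_client(&demo_client,
						    &mem_invalidate_cb);
	return reg_handle ? 0 : -EINVAL;
}

static void __exit demo_exit(void)
{
	ib_unregister_peer_memory_client(reg_handle);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");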

The Implementation Flow

When a user program registers an MR, IB core tries each client's acquire() function in turn, letting every client judge whether it can translate the given virtual address range into physical pages. In nv_peer_mem, nv_mem_acquire() probes by calling nvidia_p2p_get_pages() to fetch the physical pages; if that succeeds, it returns 1, which means "this memory range is mine to handle!"
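
A condensed sketch of that probe logic follows (hypothetical names and simplified bookkeeping, assuming the nvidia_p2p_* interface from the NVIDIA driver's nv-p2p.h):

#include <linux/kernel.h>
#include <linux/slab.h>
#include "nv-p2p.h"                      /* NVIDIA driver's P2P interface */

#define GPU_PAGE_SIZE  (64UL * 1024)     /* GPU pages are 64 KB */
#define GPU_PAGE_MASK  (~(GPU_PAGE_SIZE - 1))

/* hypothetical per-registration state, carried through all callbacks */
struct nv_mem_ctx {
	u64 va, len;                             /* GPU-page-aligned range */
	u64 core_context;                        /* for invalidation later */
	struct nvidia_p2p_page_table *page_table;
	struct nvidia_p2p_dma_mapping *dma_mapping;
};

static void nv_mem_dummy_callback(void *data) { /* probe only; no-op */ }

static int acquire_sketch(unsigned long addr, size_t size,
			  void *peer_mem_private_data, char *peer_mem_name,
			  void **client_context)
{
	struct nvidia_p2p_page_table *pt;
	struct nv_mem_ctx *ctx;
	u64 va = addr & GPU_PAGE_MASK;
	u64 len = ALIGN(addr + size, GPU_PAGE_SIZE) - va;

	/* probe: will the NVIDIA driver pin this VA range for us? */
	if (nvidia_p2p_get_pages(0, 0, va, len, &pt,
				 nv_mem_dummy_callback, NULL))
		return 0;                        /* not GPU memory; decline */
	nvidia_p2p_put_pages(0, 0, va, pt);      /* probe only; release now */

	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
	if (!ctx)
		return -ENOMEM;
	ctx->va = va;
	ctx->len = len;
	*client_context = ctx;
	return 1;                                /* this range is ours */
}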

Once IB core has found the right peer client, it calls that client's get_pages() to translate the virtual address range into physical pages. nv_peer_mem again uses nvidia_p2p_get_pages() for the translation and stores the result in its context.
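
Sketched with the same hypothetical nv_mem_ctx as above; the difference from the acquire() probe is that the pages stay pinned, a live free callback is registered (shown under "Releasing Resources" below), and the page table is kept in the context:

static void free_callback_sketch(void *data);    /* defined further below */

static int get_pages_sketch(unsigned long addr, size_t size, int write,
			    int force, struct sg_table *sg_head,
			    void *client_context, u64 core_context)
{
	struct nv_mem_ctx *ctx = client_context;

	ctx->core_context = core_context;   /* needed to invalidate later */
	/* pin for real this time and remember the page table; the live
	 * callback fires if the GPU mapping ever goes away */
	return nvidia_p2p_get_pages(0, 0, ctx->va, ctx->len,
				    &ctx->page_table,
				    free_callback_sketch, ctx);
}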

IB core then calls dma_map() to map the physical pages to bus addresses, i.e. DMA addresses, and save the result in sg_head (sg_head is a scatterlist pointer passed in by the caller; a scatterlist is the kernel structure for describing a set of physical memory segments). nv_peer_mem implements this with nvidia_p2p_dma_map_pages(), which needs the peer device as a parameter; conveniently, ib_core passes dma_device into dma_map().
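
A sketch of that step with the same hypothetical context: the NVIDIA driver hands back one bus address per 64 KB GPU page, which also satisfies the PAGE_SIZE alignment rule documented for dma_map() above.

#include <linux/pci.h>
#include <linux/scatterlist.h>

static int dma_map_sketch(struct sg_table *sg_head, void *client_context,
			  struct device *dma_device, int dmasync, int *nmap)
{
	struct nv_mem_ctx *ctx = client_context;
	struct nvidia_p2p_dma_mapping *dm;
	struct scatterlist *sg;
	int i, ret;

	/* let the NVIDIA driver translate GPU physical pages into bus
	 * addresses as seen by the peer PCIe device (the IB NIC) */
	ret = nvidia_p2p_dma_map_pages(to_pci_dev(dma_device),
				       ctx->page_table, &dm);
	if (ret)
		return ret;
	ctx->dma_mapping = dm;

	/* expose the bus addresses to ib_core via the caller's sg_table */
	ret = sg_alloc_table(sg_head, dm->entries, GFP_KERNEL);
	if (ret)
		return ret;
	for_each_sg(sg_head->sgl, sg, dm->entries, i) {
		sg_dma_address(sg) = dm->dma_addresses[i];
		sg_dma_len(sg) = GPU_PAGE_SIZE;  /* one GPU page per SGL */
	}
	*nmap = dm->entries;
	return 0;
}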

That is the whole path taken when user space calls ibv_reg_mr() on a chunk of GPU memory with nv_peer_mem installed.

Releasing Resources

When the IB NIC no longer needs these pages (on dereg_mr, presumably; I am not sure), dma_unmap(), put_pages() and release() are called to free the resources.

A peer client can use the invalidate_peer_memory() callback to mark a registered memory range as no longer DMA-active. This matters for nv_peer_memory because the GPU may reallocate its MMIO pages during a context switch, at which point the mapping from the original virtual addresses to GPU physical pages can become invalid. The call to nvidia_p2p_get_pages() registers the callback nv_get_p2p_free_callback(); when the mapping goes stale the driver invokes it, and in addition to freeing the nvidia_p2p_page_table it calls invalidate_peer_memory() to tell ib_core that this range is no longer usable.
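
A sketch of that path, reusing the hypothetical names from the earlier sketches (mem_invalidate_cb and reg_handle were saved at registration time; the real nv_get_p2p_free_callback() does the same work with more locking and error handling):

static void free_callback_sketch(void *data)
{
	struct nv_mem_ctx *ctx = data;

	/* tell ib_core this umem is dead; ib_core reacts by driving
	 * dma_unmap()/put_pages()/release() on this client */
	mem_invalidate_cb(reg_handle, ctx->core_context);

	/* the GPU-side mapping is already being torn down inside the
	 * NVIDIA driver; only the wrapper structures remain to free */
	nvidia_p2p_free_dma_mapping(ctx->dma_mapping);
	nvidia_p2p_free_page_table(ctx->page_table);
}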

One detail worth noting: nvidia_p2p_get_pages() is also called in acquire(), but there the registered callback is nv_mem_dummy_callback(). That callback is unlikely ever to fire, because the pages obtained inside acquire() are freed right away; even if some corner case did trigger it, the dummy callback only interacts with the NVIDIA driver and never talks to IB core, which is what distinguishes it from nv_get_p2p_free_callback().

References

  1. How To Implement PeerDirect Client using MLNX_OFED. Some of the parameter descriptions in that document are out of date; the function definitions quoted above are the reference.