在llvm ir中实现union

LLVM并没有内置c风格的union类型，为了实现发射到LLVM IR的编译器前端，必须要在LLVM IR上实现union类型的类型表示、union的成员变量的获取并集成到原本存在的类型推导、AST -> LLVM IR的翻译系统中。

环境介绍

我正在尝试实现的是一款C99标准的编译器前端，其基础类型采用以下方式存储

单个ctype节点，存储最基础的类型信息

struct ctype {
    enum type_qualifier qualifier;
    int type;
    enum tok_type signint;
    // used for struct, union, enum, holds a reference only, do not free it
    struct astn *user_defined_type;
} ctype;

一个符号的声明结构，使用一个经过修改的单向链表（添加到头或者添加到尾部的时间复杂度均为O(1)，但是只能从前往后遍历）存储type_chain

struct declaration {
    sds ident;
    // initializer/function body/struct declaration bitfield (expr)
    struct astn *extdata;
    size_t uid;
    struct astn *scope_ref;
    enum tok_type storage_class;
    /* the logical type chain of the declaration,
    * it contains those node type:
    * ast_expr_unary is used for array declaration [<expr>],
    * ast_parameters is used for function parameters(<parameter-type-list>),
    * ast_ctype
    */
    struct slist type_chain;
    typed_value V;
} declaration;

对于typedef的类型，会在每解析完一个declaration后使用一个名为parser_unfold_type_chain的函数将其展开，由于typedef同样也算作一个declaration，如果它定义的类型中包括了typedef类型，一样会被展开，这样就避免了嵌套typedef的类型展开问题。这样在随后的build（codegen）阶段，就不需要考虑typedef问题而专注于代码生成了。

clang的实现

clang可以发射到llvm ir，我们可以先看一看clang生成的结果。为了降低难度，我们采用较新的clang/llvm版本：19.1.7

// a.c
union s {
  int a ;
  char b ;
  struct s2 {
    int c;
    float d;
  } s;
};
int F() {
  union s s;
  s.a = 1;
  s.b = 2;
}

执行clang -S -emit-llvm a.c，打开a.ll

%union.s = type { %struct.s2 }
%struct.s2 = type { i32, float }

; Function Attrs: noinline nounwind optnone uwtable
define dso_local i32 @F() #0 {
  %1 = alloca %union.s, align 4
  store i32 1, ptr %1, align 4
  store i8 2, ptr %1, align 4
  %2 = load i32, ptr %1, align 4
  ret i32 %2
}

可以看到，它的实现是创建了一个结构体，但是其中只塞了一个类型。当写入或者读取成员时再具体从指针中store/load不同类型。接下来我们将要仿照此方法来实现union。

由于较新版本的clang/llvm默认启用了Opaque Pointers特性，降低了指针的类型操作和访问操作的复杂度。

在尚未启用此特性的版本，指针仍然需要保存其指向的类型，而直接从不同类型的指针中读取数据则需要进行转换。这也是我使用较新的llvm的原因之一。

选择union所创建的结构体的成员类型

union的尺寸由其最大尺寸的成员决定。所以我们需要查找先前生成的union的AST，并获取所有成员的类型在对应平台上的尺寸，最后选择出尺寸最大的一个，作为union的具体存储大小。

我们应该如何获取对应类型的尺寸？这个工作十分琐碎，而且真的很折磨。万幸经过搜索我找到了可以使用的LLVM API：LLVMABISizeOfType。

既然我们有了API获取类型的在目标平台的尺寸，接下来可以通过遍历union的AST来得到尺寸最大的数据类型。

dynarray build_declaration_type_union_member(builder b, astn n, dynarray arr) {
  // get the max size member type
  assert(n->type == ast_struct_union_declaration);
  LLVMTypeRef max_size_type = NULL;
  astn union_member_declaration;
  slist_foreach(&n->struct_union_declaration.member_declarations,
                union_member_declaration) {
    LLVMTypeRef t =
        build_declaration_variable_type(b, union_member_declaration);
    if (union_member_declaration->declaration.extdata) {
      log_panic("Unsupported union member declaration with bitfield");
    }
    if (max_size_type == NULL) {
      max_size_type = t;
    } else {
      size_t old_size = LLVMABISizeOfType(b->data_layout, max_size_type);
      size_t sz = LLVMABISizeOfType(b->data_layout, t);
      if (old_size < sz) {
        max_size_type = t;
      }
    }
  }
  dynarray_add(arr, &max_size_type);
  return arr;
}

其中的dynarray是我自己实现的动态数组的库，虽然此函数只会添加一个类型，但是由于struct和union不管从parse还是codegen过程都十分相似，而且我首先实现了struct，为了兼容性仍然使用了动态数组。

获得了尺寸最大的成员的类型后，接下来的步骤与创建结构体的过程别无二致，因此可以直接汇入到结构体创建的过程中。

struct dynarray dyn_elements;
dynarray_default(&dyn_elements, sizeof(LLVMTypeRef));
if (n->ctype.type == TOK_KW_STRUCT) {
    build_declaration_type_struct_member(b, udt, &dyn_elements);
} else {
    build_declaration_type_union_member(b, udt, &dyn_elements);
}
size_t elements_count = dyn_elements.used;
LLVMTypeRef *elements = dyn_elements.data;
if (udt->struct_union_declaration.ident) {
    sds name;
    name = sdscatprintf(
        sdsempty(), n->ctype.type == TOK_KW_STRUCT ? STRUCT_FMT : UNION_FMT,
        udt->struct_union_declaration.ident,
        udt->struct_union_declaration.uid);
    log_debug("create struct/union definition with name: %s", name);
    t = LLVMStructCreateNamed(b->context, name);
    LLVMStructSetBody(t, elements, elements_count, false);
    sdsfree(name);
} else {
    log_debug("create anonymous struct/union definition");
    t = LLVMStructTypeInContext(b->context, elements, elements_count, false);
}
dynarray_free(&dyn_elements);
udt->struct_union_declaration.V = t;

struct的赋值操作

为了更好支持指针类型，满足我们的需要，我们封装了LLVMTypeRef和自己实现的type_chain

typedef struct typed_value {
  LLVMValueRef v;
  /** the physical llvm type chain of the declaration, should be used only in non-base type processing
   * if it is in a declaration, it may be a ptr which points to the type chain of the declaration
   */
  struct slist type_chain;
} *typed_value;

并封装了LLVM的load和store操作为build_value_load和build_value_store。

这是一个链式赋值的案例

1	`v1 = v2 = v3;`

为了支持链式赋值，对于每个赋值类型的expression，我们会在赋值完成后对lhs重新进行一次load以生成对应的rvalue，虽然可能会生成冗余的ir，但是可以方便地解决这种链式赋值的情况。

但是当struct被引入后，处理逻辑发生了变化。

1	`stu1 = stu2 = stu3;`

struct的赋值实现一般有两种情况：

load每个成员再赋值给目标结构体（GEP）
根据结构体的尺寸进行复制（memcpy）

无论哪种情况都需要得到结构体的指针才能进行下一步操作。而由于前文所言为了支持链式赋值，会在赋值完成后对lvalue进行load从而产生一个新的rvalue，无论llvm ir是否支持从结构体指针中load结构体，都无法获得其指针。

即使在构建lhs和rhs之前通过读取语法树，解决了stu1 = stu2的赋值问题，但是它既不能合并入现有代码，也不能很好地解决stu1 = stu2 = stu3的链式赋值的问题或者更加复杂的stu_arr[0] = stu_arr[1] = stu_arr[2]表达式——这种复杂表达式需要在构建后才能获得最准确的类型，因此需要一种新的、可以兼容当前代码的解决方案。

经过思考，我想到了一种适用于现状的解决方案：当前所有和结构体有关的有效操作实际上都不会使用结构体本身，而是使用其指针来间接操作结构体，这就意味着我们可以在load的过程中针对struct结构体的指针进行特殊操作：只修改其type_chain而不会进行实际的LLVMLoad操作。此时的type_chain和对应的LLVMValueRef之间出现了一个偏移——虽然type_chain标识的是结构体类型，但是LLVMValueRef实际存储的类型为指向此结构体类型的指针。

typed_value build_value_load(builder b, typed_value v) {
  // get lhs points type
  slist points_to_type_chain =
      build_type_get_points_to_type_chian(b, &v->type_chain);
  astn points_to_base_type = slist_peek_head(points_to_type_chain);
  LLVMTypeRef points_to_type =
      build_type_ctype_convert_to_llvm(b, points_to_base_type);
  LLVMValueRef load;
  if (points_to_base_type->type == ast_ctype &&
      g_is_struct_or_union_token(points_to_base_type->ctype.type)) {
    log_trace("Try to load struct or union type from pointer, return the "
              "struct pointer "
              "directly");
    load = v->v; // <- here
  } else {
    load = LLVMBuildLoad2(b->builder, points_to_type, v->v, "load");
  }
  return typed_value_new(load, points_to_type_chain);
}

这样就确保了struct的load操作可以完全兼容其他类型。同时修改我们封装的load，检查到指针指向的类型为结构体时跳转到专门进行结构体赋值的函数

void build_value_store(builder b, typed_value v, typed_value ptr) {
  // get ptr points to type
  slist points_to_type_chain =
      build_type_get_points_to_type_chian(b, &ptr->type_chain);
  astn points_to_base_type = slist_peek_head(points_to_type_chain);
  // special case for struct or union
  if (points_to_base_type->type == ast_ctype &&
      g_is_struct_or_union_token(points_to_base_type->ctype.type)) {
    log_trace("Try to store struct or union type to pointer");
    return build_value_cpy_struct_union(b, ptr, v);
  }
  // cast rhs to ptr points to type
  v = build_type_convert_by_type_chain(b, v, points_to_type_chain);
  // normal: store rhs to where ptr points to
  LLVMBuildStore(b->builder, v->v, ptr->v);
}

此时，v的type_chain标识它是一个结构体，而实际上LLVMValueRef存储的是它的指针。从而可以确保经过load，变为rvalue后也能取得struct的指针：

// assert has been removed
void build_value_cpy_struct_union(builder b, typed_value lhs_ptr,
                                  typed_value rhs_struct) {
  astn rhs_base_type = slist_peek_head(&rhs_struct->type_chain);
  slist lhs_points_to_type_chain =
      build_type_get_points_to_type_chian(b, &lhs_ptr->type_chain);
  astn lhs_base_type = slist_peek_head(lhs_points_to_type_chain);

  LLVMTypeRef struct_or_union_type =
      build_declaration_struct_or_union(b, rhs_base_type);
  LLVMBuildMemCpy(b->builder, lhs_ptr->v, 1, rhs_struct->v, 1,
                  LLVMSizeOf(struct_or_union_type));
}

这样，我们就可以在不修改大量代码（赋值表达式，变量初始化等）的情况下直接添加结构体赋值的功能

struct和union的关系

前文描述的操作让结构体赋值操作集成进入原本只有基础类型的系统中。而struct又和union有什么关系呢？经过之前对clang生成的ir进行分析，我们知道了union的底层实现是单个成员的结构体，因此上文所描述的代码经过简单的修改后同样兼容了union的赋值操作。

union的获取成员操作（类型转换）

我们知道，union的获取成员相比于结构体的获取成员的内容的过程，更加类似于强制类型转换。

由于Opaque Pointers特性，类型转换比想象中简单得多：获取union ast中的成员的type_chain，然后将其和union的指针结合产生新的typed_value并直接返回。随后根据type_chain从LLVMValueRef直接load不同类型的值。

union和struct的BNF格式几近相同，实现时代码几乎也相同。因此在实现了struct的获取成员的指针的基础上只需要少量的改动即可兼容union

typed_value build_expr_unary_get_member_ptr(builder b, astn n) {
  typed_value struct_or_union_ptr = build_lvalue_expression(b, n->unary.expr);
  astn member = n->unary.extdata;
  slist points_to_struct_type_chain =
      build_type_get_points_to_type_chian(b, &struct_or_union_ptr->type_chain);
  astn points_to_base_type = slist_peek_head(points_to_struct_type_chain);
  astn member_declaration;
  int member_idx = build_type_struct_type_get_member(
      points_to_base_type, member->ident, &member_declaration);
  if (member_idx < 0) {
    log_panic("Member %s not found in struct or union", member->ident);
  }

  slist member_type_chain = &member_declaration->declaration.type_chain;
  if (points_to_base_type->ctype.type == TOK_KW_STRUCT) {
    LLVMValueRef gep = LLVMBuildStructGEP2(
        b->builder, build_declaration_struct_or_union(b, points_to_base_type),
        struct_or_union_ptr->v, member_idx, "struct_gep");
    slist member_ptr_type_chain =
        build_type_chain_add_pointer(b, member_type_chain);
    return typed_value_new(gep, member_ptr_type_chain);
  } else {
    slist member_ptr_type_chain =
        build_type_chain_add_pointer(b, member_type_chain);
    return typed_value_new(struct_or_union_ptr->v, member_ptr_type_chain);
  }
}

解决方案可能并不是很优秀，毕竟是自己想出来的，但是依旧很有解决问题的成就感。希望早日可以实现编译器自举。

参考

mapping-high-level-constructs-to-llvm-ir ：指出llvm中不存在union但是可以模拟
OpaquePointers的描述
LLVMABISizeOfType对应的llvm::DataLayout::getTypeAllocSize