# c2goasm: C to Go Assembly 

## Introduction

This is a tool to convert assembly as generated by a C/C++ compiler into Golang assembly. It is meant to be used in combination with [asm2plan9s](https://github.com/minio/asm2plan9s) in order to automatically generate pure Go wrappers for C/C++ code (that may for instance take advantage of compiler SIMD intrinsics or `template<>` code).

Mode of operation:
```
$ c2goasm -a /path/to/some/great/c-code.s /path/to/now/great/golang-code_amd64.s
```

You can optionally nicely format the code using [asmfmt](https://github.com/klauspost/asmfmt) by passing in an `-f` flag. 

This project has been developed as part of developing a Go wrapper around [Simd](https://github.com/fwessels/go-cv-simd). However it should also work with other projects and libraries. Keep in mind though that it is not intented to 'port' a complete C/C++ project in a single action but rather do it on a case-by-case basis per function/source file (and create accompanying high level Go code to call into the assembly code).

## Command line options
```
$ c2goasm --help
Usage of c2goasm:
  -a	Immediately invoke asm2plan9s
  -c	Compact byte codes
  -f	Format using asmfmt
  -s	Strip comments
```

## A simple example

Here is a simple C function doing an AVX2 intrinsics computation:
```
void MultiplyAndAdd(float* arg1, float* arg2, float* arg3, float* result) {
    __m256 vec1 = _mm256_load_ps(arg1);
    __m256 vec2 = _mm256_load_ps(arg2);
    __m256 vec3 = _mm256_load_ps(arg3);
    __m256 res  = _mm256_fmadd_ps(vec1, vec2, vec3);
    _mm256_storeu_ps(result, res);
}
```

Compiling into assembly gives the following
```
__ZN14MultiplyAndAddEPfS1_S1_S1_: ## @_ZN14MultiplyAndAddEPfS1_S1_S1_
## BB#0:
        push          rbp
        mov           rbp, rsp
        vmovups       ymm0, ymmword ptr [rdi]
        vmovups       ymm1, ymmword ptr [rsi]
        vfmadd213ps   ymm1, ymm0, ymmword ptr [rdx]
        vmovups       ymmword ptr [rcx], ymm1
        pop           rbp
        vzeroupper
        ret
```

Running `c2goasm` will generate the following Go assembly (eg. saved in `MultiplyAndAdd_amd64.s`)
```
//+build !noasm !appengine
// AUTO-GENERATED BY C2GOASM -- DO NOT EDIT

TEXT ·_MultiplyAndAdd(SB), $0-32

	MOVQ vec1+0(FP), DI
	MOVQ vec2+8(FP), SI
	MOVQ vec3+16(FP), DX
	MOVQ result+24(FP), CX

	LONG $0x0710fcc5             // vmovups    ymm0, yword [rdi]
	LONG $0x0e10fcc5             // vmovups    ymm1, yword [rsi]
	LONG $0xa87de2c4; BYTE $0x0a // vfmadd213ps    ymm1, ymm0, yword [rdx]
	LONG $0x0911fcc5             // vmovups    yword [rcx], ymm1

	VZEROUPPER
	RET
```

This needs to be accompanied by the following Go code (in `MultiplyAndAdd_amd64.go`)
```
//go:noescape
func _MultiplyAndAdd(vec1, vec2, vec3, result unsafe.Pointer)

func MultiplyAndAdd(someObj Object) {

	_MultiplyAndAdd(someObj.GetVec1(), someObj.GetVec2(), someObj.GetVec3(), someObj.GetResult()))
}
```

And as you may have gathered the amd64.go file needs to be in place in order for the arguments names to be derived (and allow `go vet` to succeed).

## Benchmark against cgo

We have run benchmarks of `c2goasm` versus `cgo` for both Go version 1.7.5 and 1.8.1. You can find the `c2goasm` benchmark test in `test/` and the `cgo` test in `cgocmp/` respectively. Here are the results for both versions:
```
$ benchcmp ../cgocmp/cgo-1.7.5.out c2goasm.out 
benchmark                      old ns/op     new ns/op     delta
BenchmarkMultiplyAndAdd-12     382           10.9          -97.15%
```
```
$ benchcmp ../cgocmp/cgo-1.8.1.out c2goasm.out 
benchmark                      old ns/op     new ns/op     delta
BenchmarkMultiplyAndAdd-12     236           10.9          -95.38%
```

As you can see Golang 1.8 has made a significant improvement (38.2%) over 1.7.5, but it is still about 20x slower than directly calling into assembly code as wrapped by `c2goasm`.

## Converted projects

- [go-cv-simd (WIP)](https://github.com/fwessels/go-cv-simd)

## Internals

The basic process is to (in the prologue) setup the stack and registers as how the C code expects this to be the case, and upon exiting the subroutine (in the epilogue) to revert back to the golang world and pass a return value back if required. In more details:
- Define assembly subroutine with proper golang decoration in terms of needed stack space and overall size of arguments plus return value. 
- Function arguments are loaded from the golang stack into registers and prior to starting the C code any arguments beyond 6 are stored in C stack space.
- Stack space is reserved and setup for the C code. Depending on the C code, the stack pointer maybe aligned on a certain boundary (especially needed for code that takes advantages of SIMD instructions such as AVX etc.).
- A constants table is generated (if needed) and any `rip`-based references are replaced with proper offsets to where Go will put the table. 

## Limitations

- Arguments need (for now) to be 64-bit size, meaning either a value or a pointer (this requirement will be lifted)
- Maximum number of 14 arguments (hard limit -- if you hit this maybe you should rethink your api anyway...)
- Generally no `call` statements (thus inline your C code) with a couple of exceptions for functions such as `memset` and `memcpy` (see `clib_amd64.s`)

## Generate assembly from C/C++

For eg. projects using cmake, here is how to see a list of assembly targets
```
$ make help | grep "\.s"
```

To see the actual command to generate the assembly
```
$ make -n SimdAvx2BgraToGray.s
```

## Supported golang architectures

For now just the AMD64 architecture is supported. Also ARM64 should work just fine in a similar fashion but support is lacking at the moment.

## Compatible compilers

The following compilers have been tested:
- `clang` (Apple LLVM version) on OSX/darwin
- `clang` on linux

Compiler flags:
```
-masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti
```

| Flag                              | Explanation                                        |
|:----------------------------------| :--------------------------------------------------|
| `-masm=intel`                     | Output Intel syntax for assembly                   |
| `-mno-red-zone`                   | Do not write below stack pointer (avoid [red zone](https://en.wikipedia.org/wiki/Red_zone_(computing)))  |
| `-mstackrealign`                  | Use explicit stack initialization                  |
| `-mllvm -inline-threshold=1000`   | Higher limit for inlining heuristic (default=255)  |
| `-fno-asynchronous-unwind-tables` | Do not generate unwind tables (for debug purposes) |
| `-fno-exceptions`                 | Disable exception handling                         |
| `-fno-rtti`                       | Disable run-time type information                  |

The following flags are only available in `clang -cc1` frontend mode (see [below]()):

| Flag                              | Explanation                                                        |
|:----------------------------------| :------------------------------------------------------------------|
| `-fno-jump-tables`                | Do not use jump tables as may be generated for `select` statements |

#### `clang` vs `clang -cc1` 

As per the clang [FAQ](https://clang.llvm.org/docs/FAQ.html#driver), `clang -cc1` is the frontend, and `clang` is a (mostly GCC compatible) driver for the frontend. To see all options that the driver passes on to the frontend, use `-###` like this:

```
$ clang -### -c hello.c
"/usr/lib/llvm/bin/clang" "-cc1" "-triple" "x86_64-pc-linux-gnu" etc. etc. etc.
```

#### Command line flags for clang

To see all command line flags use either `clang --help` or `clang --help-hidden` for the clang driver or `clang -cc1 -help` for the frontend.

#### Further optimization and fine tuning

Using the LLVM optimizer ([opt](http://llvm.org/docs/CommandGuide/opt.html)) you can further optimize the code generation. Use `opt -help` or `opt -help-hidden` for all available options.

An option can be passed in via `clang` using the `-mllvm <value>` option, such as `-mllvm -inline-threshold=1000` as discussed above.

Also LLVM allows you to tune specific functions via [function attributes](http://llvm.org/docs/LangRef.html#function-attributes) like `define void @f() alwaysinline norecurse { ... }`.

#### What about GCC support?

For now GCC code will not work out of the box. However there is no reason why GCC should not work fundamentally (PRs are welcome).

## Resources

- [A Primer on Go Assembly](https://github.com/teh-cmc/go-internals/blob/master/chapter1_assembly_primer/README.md)
- [Go Function in Assembly](https://github.com/golang/go/files/447163/GoFunctionsInAssembly.pdf)
- [Stack frame layout on x86-64](http://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64)
- [Compiler Explorer (interactive)](https://go.godbolt.org/)

## License

c2goasm is released under the Apache License v2.0. You can find the complete text in the file LICENSE.

## Contributing

Contributions are welcome, please send PRs for any enhancements.
