Diego Bravo Estrada
This is a very-very-very basic introduction to programming in assembly for 64-bit Linux based in AMD/Intel cpus. We are using Debian 12, but should be usable with any modern Linux system.
We follow several sources, like:
-
Seyfarth, R. (2011). Introduction to 64 Bit Intel Assembly Language Programming: Getting the most out of your computer. CreateSpace.
-
Kusswurm, D. (2014). Modern X86 Assembly Language Programming 32-bit, 64-bit. SSE, and AVX, Apress.
-
Intel® 64 and IA-32 Architectures Software Developer’s Manual (edition of December 2023)
-
Stackoverflow
Intro
We’ll use the popular nasm
and (for further testing) yasm
, which are
recognized assemblers for x86 processors:
sudo apt install nasm sudo apt install yasm
We recommend checking the installation of the binutils
, build-essential
and make
packages.
For more information, see https://www.nasm.us/ and https://yasm.tortall.net/
All the examples should be workable with both assemblers.
On 64-bit architecture:
"Intel found itself in the position of adopting the ISA which AMD had created as an extension to Intel’s own x86 processor line. Intel’s name for this instruction set has changed several times. […] within weeks they began referring to it as IA-32e (for IA-32 extensions) and in March 2004 unveiled the "official" name EM64T (Extended Memory 64 Technology). In late 2006 Intel began instead using the name Intel 64 for its implementation, paralleling AMD’s use of the name AMD64."
The Tutorial
From this point, we’ll implement several small programs in order to explain the relevant concepts.
First program
We’ll replicate the following C program:
#include <stdio.h>
#include <stdlib.h>
int main() {
printf("%#x\n", 0x12345678);
exit(0);
}
Which may be compile with:
diego@dataone:~/devel/ASM$ gcc -Wall -o model model.c diego@dataone:~/devel/ASM$ ./model 12345678
We assume the reader understands this program. Now the assembly version:
; test1.asm
default rel
extern printf ; extern symbols
section .rodata
format: db "%#x", 10, 0 ; C 0-terminated string: "%#x\n"
section .text
global _start
_start:
sub rsp, 8 ; re-align the stack to 16 before calling another function
; call printf
mov esi, 0x12345678 ; "%x" takes a 32-bit unsigned int
lea rdi, [format]
xor eax, eax ; AL=0 no FP args in XMM regs
call printf wrt ..plt
; call exit
mov rax, 60
xor rdi, rdi
syscall
Note
|
This example was adapted from https://stackoverflow.com/questions/8194141/how-to-print-a-number-in-assembly-nasm/32853546#32853546 |
Build with:
nasm -f elf64 -g -l test1.lst test1.asm ld --dynamic-linker=/lib64/ld-linux-x86-64.so.2 -o test1 test1.o -lc
or
yasm -f elf64 -g dwarf2 -l test1.lst test1.asm ld --dynamic-linker=/lib64/ld-linux-x86-64.so.2 -o test1 test1.o -lc
Execute with:
./test1 0x12345678
Before explaining the code, let’s understand the build steps.
Assembling
nasm
or yasm
got our assembly listing test1.asm
which in turn will
generate an "object file" named test1.o
. This is the assembler’s
main work.
We also ask for an "hexadecimal listing" of the object (with -l test1.lst
), which
provides a detailed view of the corresponding machine code opcodes.
The -f elf64
option instructs the assembler to generate an object
file following the ELF standard (used in Linux and some other
unixes) using the "64-bit model", and -g
asks
for the inclusion of debugging information (for example, to be
used by gdb
.)
Note
|
Dwarf stands for "Debug With Arbitrary Record Format". |
Linking
The "link editor" ld
is used to create the executable. As we’ll
create an executable "dynamically linked" (i.e. will load
required libraries at runtime) then we need to provide
the path for the "dynamic linker/loader" which (sadly) is not
provided by ld
so we add the --dynamic-linker
option. The
file /lib64/ld-linux-x86-64.so.2
ls really a symbolic link
to the real library. For example, in out test system it is:
diego@dataone:~/devel/ASM$ ls -l /lib64/ld-linux-x86-64.so.2 lrwxrwxrwx 1 root root 32 Dec 7 11:38 /lib64/ld-linux-x86-64.so.2 -> /lib/x86_64-linux-gnu/ld-2.27.so
Note
|
At least in the binutils world (where Linux ld lives), it is not
expected the direct usage of ld by the user; instead, the
C compiler gcc should be employed, even if the source is not
in the C language. See more information
in https://bugzilla.redhat.com/show_bug.cgi?id=868662 .
|
Finally, we instruct ld
to set a dynamic link for the
executable with the C standard library (libc.so.nnn) with
the option -lc
. More libraries may be added with
the syntax -lXXX
corresponding to libraries named libXXX.so.nnn
.
The linked libraries may be inspected with ldd
:
diego@dataone:~/devel/ASM$ ldd test1 linux-vdso.so.1 (0x00007ffc915f7000) libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f720d284000) /lib64/ld-linux-x86-64.so.2 (0x00007f720d67c000)
Note
|
The linux-vdso is a shared library provided by the
kernel allowing the access to some services without the
rather slow "jump" to kernel mode of a system call.
|
Entry point
When a process is started, the "entry point" is specified
by a symbol named _start
. That is the reason our
program has such symbol in the place where we want our
code to start its execution:
_start: sub rsp, 8 ; Call printf.
The _start
symbol must be marked as global, so the
operating system may find it when the process is
executed:
global _start
The instructions are stored in a memory "section"
(a region in RAM dynamically assigned at runtime) named .text
. There
is another section for read-only data suitable
named .rodata
, where we put the arguments to the
printf()
call:
section .rodata format: db "%#x", 10, 0 ; C 0-terminated string: "%#x\n"
The db
assembly directive instructs the assembler to
consider the arguments as plain data which must be
stored in the object file (unlike the assembly cpu
instructions below.) The db
(roughly "define bytes")
directive accepts several comma-separated arguments
including strings and numbers as in the example. Note that
the 10
corresponds to an ASCII new-line (or \n
), and
the last 0
is the mandatory end mark
required by C language strings. The format
is simply
a label to reference this data from the code below.
Registers
As the reader probably knows, the cpu contains a number of
"registers" which may be considered some kind of very fast
integer predefined "variable" in the assembly language. The
mov
family of instructions are used to put some value
into the registers, and to extract the value contained
in the registers. A syntax like mov a, b
means that
the value represented by b
is assigned into a
(we’ll
see some examples below.) This order for the
assignations is known as the "Intel syntax". For more
information, see: http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html#s3
Process termination: System Calls
The first thing we’ll review is how to terminate
a running process. To that effect we must use the
exit()
system call, which is provided by the
Linux kernel (and every POSIX compliant operating
system.)
The system calls are identified by a numeric code
which may be located in a Linux header file (albeit
its exact location may vary); for example, our test
system has the following contents
inside /usr/include/x86_64-linux-gnu/asm/unistd_64.h
:
... #define __NR_execve 59 #define __NR_exit 60 #define __NR_wait4 61 ...
Following
https://blog.rchapman.org/posts/Linux_System_Call_Table_for_x86_64/
we’ll discover that the desired system call must be identified by
such numeric code put into the rax
cpu register.
So we conclude that the code for the exit()
system
call is 60, and we must put this value in rax
. Also, this
system call does require a single numeric parameter (which will be the
"process return value" for the operating system); such value
must be sent in the rdi
register (xor rdi, rdi
is
an usual abbreviated way to call mov rdi, 0
.) So in order
to call exit(0);
, we issue the assembly instructions:
mov rax, 60 xor rdi, rdi syscall
Calling library functions
A couple of citations:
An application binary interface (ABI) is an interface between two binary program modules. Often, one of these modules is a library or operating system facility, and the other is a program that is being run by a user.
An ABI defines how data structures or computational routines are accessed in machine code, which is a low-level, hardware-dependent format.
Also:
Calling convention is an implementation-level (low-level) scheme for how […] functions receive parameters from their caller and how they return a result. […] These transfers typically done via certain registers or within a stack frame on the call stack.
Calling conventions are usually considered part of the application binary interface (ABI).
We call printf
which at runtime is a global
symbol external to our code (is located in the
C standard library.) Such external symbols must
be signaled with:
extern printf
Now, for the call to printf
we prepare the
arguments: a formatting argument and a numeric
one with the registers rdi
and rsi
. Also,
al
(via eax
) is zeroed, since it is used
to indicate the number of floating point arguments
present in the call (here, zero.)
In general, the AMD64 ABI uses the sequence of
registers rdi
, rsi
, rdx
, rcx
, r8
and r9
for
integer or pointer-type parameters:
https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf
for a friendlier set of related slides:
https://bart.disi.unige.it/zxgio/phd-course-2017/x86intro_slides.pdf
Note
|
The floating point arguments are expected
to be stored in (rather heavy) "vector registers". The
zeroed eax is an optimization advice for integer-only
variadic calls.
|
Note that the first argument to printf
is a "pointer" to
the memory containing the formatting string. We could try
something like
mov rdi, format
but that will introduce some problems with PIC (see below.) So we prefer the "load effective address" instruction which does (almost) the same:
lea rdi, [format]
Now, the actual call:
call printf wrt ..plt
Note
|
In other assemblers this is usually written
as call printf@PLT .
|
PLT means "Procedure Linkage Table"; from https://eli.thegreenplace.net/2011/11/03/position-independent-code-pic-in-shared-libraries "The PLT is part of the executable text section, consisting of a set of entries (one for each external function the shared library calls). Each PLT entry is a short chunk of executable code. Instead of calling the function directly, the code calls an entry in the PLT, which then takes care to call the actual function."
So this is used by "position independent code" to make calls to shared library’s functions whose real address is not known at process startup. This resolution is done in a lazy way: only if the function is actually called, then its address is resolved.
Note
|
The System V ABI requires the functions to preserve the registers rbx, rbp, and r12-r15. |
Position Independent Code (PIC)
Refers to compiled code whose internal addresses (for example, for jumps or for loading some data) is only known at runtime. The usual case are shared libraries, which are loaded in demand, and stored in memory at effectively random addresses.
As of 2020, recent versions of Linux also provide executable programs composed by PIC; this is done for security, with the intention of making difficult some attacks which depend on previous knowledge of (non-random) memory addresses.
So we call with the help of a PLT.
Also, references to memory must be relative. So instead of
lea rdi, [format]
we should use something like
lea rdi, [rel format]
In yasm
another available syntax is:
lea rdi, [format wrt rip]
We avoid this with the assembler directive:
default rel
then the rip
index is added automatically to
the memory references.
Note
|
In the following examples we’ll always use default rel .
|
The stack pointer
As an ABI requirement, the stack pointer address
must be multiple of 16 at the start of the execution of the
called function. Since the call
instruction consumes 8
bytes of the stack, we start our function adding another
8 bytes:
sub rsp, 8
Note that the stack is filled in descending order, so when subtracting the address we are effectively filling it with the specified bytes.
Before returning from a function, the stack should be
"corrected" to its original value. Here we don’t do that
since the exit
call will stop the process execution
and we no longer care about additional (caller) code.
Note
|
the stack alignment of 16 bytes is not required at all for code which does not call external functions. |
Note
|
even if no function calls are to be issued, the stack contents should be aligned to 8 bytes to avoid further unexpected failures: code which tries to push quad-words will fail if the stack is not 8-byte aligned. |
Using Gcc
This is the preferred method for building the executable. Also, it automatically links with the C library.
; test2.asm
default rel
extern printf
section .rodata
format: db "%#x", 10, 0 ; C 0-terminated string: "%#x\n"
section .text
global main
main:
sub rsp, 8 ; re-align the stack to 16 before calling another function
; Call printf.
mov esi, 0x12345678 ; "%x" takes a 32-bit unsigned int
lea rdi, [format]
xor eax, eax ; AL=0 no FP args in XMM regs
call printf wrt ..plt
; Return from main.
xor eax, eax
add rsp, 8
ret
section .note.GNU-stack noalloc noexec nowrite progbits
The C compiler adds the _start
initialization code and looks for
the popular main
global symbol used by every C program. Here
we implement a main
function just providing such symbol.
Assembly from Gcc
The -S
option generates assembly code for a C program. For example:
gcc -S model.c
generates the following listing in my system:
.file "model.c" .text .section .rodata .LC0: .string "%#x\n" .text .globl main .type main, @function main: .LFB6: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp .cfi_def_cfa_register 6 movl $305419896, %esi leaq .LC0(%rip), %rax movq %rax, %rdi movl $0, %eax call printf@PLT movl $0, %edi call exit@PLT .cfi_endproc .LFE6: .size main, .-main .ident "GCC: (Debian 12.2.0-14) 12.2.0" .section .note.GNU-stack,"",@progbits
It is very similar to the previous test*.asm
listings, but:
-
The syntax is for the Gnu assembler
gas
, which uses the AT&T instruction syntax, where the assignments are in reversed order:mov a, b
movesa
intob
. -
The register names are prefixed by a percentage sign
This is evident in the instruction
leaq .LC0(%rip), %rdi
which is analog to the test3.asm (the .LC0
label
corresponds to our format
label):
lea rdi, [rel format]
Another useful tool for analyzing already
compiled code is objdump
. For example, to
disassemble test1.o
simply issue
the command objdump -d -M intel test1.o
. Also works
in the final executable test1
.
Since we are generating ELF-format object and
executable files, the readelf
command may be
helpful too. For example. readelf -a test1.o
provides
the file’s ELF-related information.
More about the PLT
To further illustrate the point, we suggest to the reader
to create a small modification from our test2
into an equivalent
version with "absolute addressing" (without default rel
):
diego@dataone:~/devel/ASM$ diff test2.asm test3.asm 1d0 < default rel 14c13 < lea rdi, [format] --- > lea rdi, [rel format]
Note that we may force the generation of a (less secure) non PIC executable, so we can use absolute addressing:
diego@dataone:~/devel/ASM$ diff test2.asm test4.asm 1d0 < default rel 14c13 < lea rdi, [format] --- > mov rdi, format
We must build with -no-pie
(no position independent executable):
nasm -f elf64 -g -l test4.lst test4.asm # also: yasm -f elf64 -g dwarf2 -l test4.lst test4.asm gcc -no-pie -o test4 test4.o diego@dataone:~/devel/ASM$ ./test4 0x12345678
Note that the executables with PIC and without PIC may be
queried with file
:
diego@dataone:~/devel/ASM$ file test3 test4 test3: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, [...] with debug_info, not stripped test4: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, [...] with debug_info, not stripped
The addressing (RIP/PIC) here presented isusing the default "small" code model of the x64 architecture. For more information check this article: https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models
Returning from a function
Since we are running the main
function, to terminate the
process it is enough with returning from it (of course,
calling exit
is allowed as before.)
Following the ABI specification, the (integer) return value goes
in the eax
register, here zeroed with xor:
xor eax, eax add rsp, 8 ret
Note that the stack pointer is returned to its starting value by adding 8 bytes before returning to the caller.
Note
|
The section .note.GNU-stack added to the listings
avoids some warnings which are not interesting at this
time. For more information see for example:
https://www.redhat.com/en/blog/linkers-warnings-about-executable-stacks-and-segments
|
Digression: 32-bit ABI
There are LOTS of code built using
the 32-bit ABI. The calling conventions are
distinct; for example, for system calls
instead of the syscall
indexed by rax
with
rdi+rsi+rdx…
as parameters, the "interruption 0x80" is used
indexed by eax
with the ebx+ecx+…
as parameters, as shown
below:
; exit32.asm segment .text global _start _start: mov eax , 1 ; 1 is the exit sys call number mov ebx , 5 ; the status value to return int 0x80 ; execute a system call
Compile with:
nasm -f elf64 -g -l exit32.lst exit32.asm # also: yasm -f elf64 -g dwarf2 -l exit32.lst exit32.asm ld -o exit32 exit32.o
This code usually works in 64-bits Linux since
legacy support for 32 bits is very common (but
it is expected to decline with the time.) Note that
no link to the C library is made, since it is not
called (unlike previous examples which use printf
.)
See more in: https://stackoverflow.com/questions/46087730/what-happens-if-you-use-the-32-bit-int-0x80-linux-abi-in-64-bit-code
For more examples: https://github.com/cirosantilli/x86-assembly-cheat
For signed integers the x86 family cpus use the "two’s complement" representation: for positive values the MSB is zero. Negative numbers have MSB=1, their the absolute value bit representation is inverted (complemented) and a unit is added.
This representation allows for adding and subtracting following the usual arithmetic rules (in base 2) if the two-complement meaning is consistently used to interpret the operations result.
Three flavors: 32 bits, 64 bits, and 80 bits. Approximated precisions of 7, 16 and 19 digits.
The mapping from logic memory addresses (of a process) o the physical addresses is made using pages. The X84-64 CPU maps the memory in two page sizes: 4kB and 2MB. In modern CPUs also 1 GB pages are supported.
The Linux process' memory comprehends the regions: text, data, heap and stack. The logical addresses are 47-bit wide.
In the initial addresses there is the .text
segment
which does not require to grow. Next the initialized
data segment .data
. On top there is the .bss
segment
(block started by symbol) which is statically assigned
data (initially zeroes) and does not exist at all in the
executable file.
The heap
contains dynamically allocated data. The last is the
stack, restricted by to kernel to (usually) 16 MB.
The heap
may also map shared libraries and memory
mapped files.
Conditional jumps
The following example shows how to print this:
diego@dataone:~/devel/ASM$ ./triangle * ** *** **** ***** ****** ******* ******** *********
Now, the program:
; triangle.asm: adapted from https://cs.lmu.edu/~ray/notes/nasmtutorial/
default rel
global main
section .text
main:
lea rdx, [output] ; rdx holds address of next byte to write
mov r8, 1 ; initial line length
mov r9, 0 ; number of stars written on line so far
line:
mov byte [rdx], '*' ; write single star
inc rdx ; advance pointer to next cell to write
inc r9 ; "count" number so far on line
cmp r9, r8 ; did we reach the number of stars for this line?
jne line ; not yet, keep writing on this line
lineDone:
mov byte [rdx], 10 ; write a new line char
inc rdx ; and move pointer to where next char goes
inc r8 ; next line will be one char longer
mov r9, 0 ; reset count of stars written on this line
cmp r8, maxlines ; wait, did we already finish the last line?
jle line ; if not, begin writing this line
; jng line
done:
mov rax, 1 ; system call for write
mov rdi, 1 ; file handle 1 is stdout
lea rsi, [output] ; address of string to output
mov rdx, dataSize ; number of bytes
syscall ; invoke operating system to do the write
xor rax, rax
ret
section .note.GNU-stack noalloc noexec nowrite progbits
section .bss
maxlines: equ 9
dataSize: equ (maxlines+1)*(maxlines+2)/2-1
output: resb dataSize
Some macro comments:
-
We use the
equ
directive to define two constants: the first one sets the labelmaxline
to the constant 9. That’s it. -
The
maxline: equ
may be located at any place in the file, since its workings are totally internal to the assembly phase; i.e. it is not directly converted into machine code -
The second
equ
defines a constant via a constant expression; here it is used to calculate the required buffer to accumulate all the asterisks to be printed with a single call to write. -
Another way would have been to define a single-line buffer and make several write calls
-
The
resb
directive reserves "uninitialized" data blocks; here its size is the (just calculated) constantdataSize
; there blocks should be inside the.bss
section
About the flow:
-
The
rdx
register is used as a pointer to write into the buffer; the instructionmov byte [rdx], ?
is analogous to the C version*rdx=?
where?
is a character constant (asterisk) or the ASCII code for newline (10.) -
The
cmp r9, r8
compares those registers. If not equal then jump back toline
to get more asterisks.jne
stands for jump-if-not-equal. -
After the asterisks for a line, add a newline and compare
r8
(the number of asterisks in a line AND the number of lines to print) withmaxlines
; here we must jump back to print the next line ifr8⇐maxlines
sojle
stands for jump-if-less-or-equal -
The last comparison is logically the same as
!r8>maxlines
, so thejle
synonym instructionjng
stands for jump-if-not-greater.
Externally linked function
The following file defines a function with a global symbol
xdump
:
; xdump.asm - dump hex bytes to stdout
; arguments:
;
; rdi: pointer to buffer
; esi: number of bytes
;
default rel
extern printf
section .text
global xdump
xdump:
push rbp
mov rbp, rsp
sub rsp, 32
; between calls to printf we need to save the buffer
; pointer, the offset and the total count
buffer: equ 0
total: equ 8
xoffset: equ 16
; save the parameters rdi and esi in variables
mov [rsp+buffer], rdi
mov [rsp+total], rsi
; offset starts in zero
xor r8, r8
mov [rsp+xoffset], r8
; for the comparison offset->r8 total->rsi
nextbyte: cmp r8, rsi
je finalize
; Call printf.
lea rdi, [format]
mov rsi, [rsp+buffer]
add rsi, r8
mov al, [rsi]
mov rsi, rax
and rsi, 0xff
xor eax, eax
call printf wrt ..plt
; increment offset
mov r8, [rsp+xoffset]
inc r8
mov [rsp+xoffset], r8
; get the total
mov rsi, [rsp+total]
jmp nextbyte
finalize:
lea rdi, [newline]
xor eax, eax
call printf wrt ..plt
xor eax, eax
leave
ret
section .note.GNU-stack noalloc noexec nowrite progbits
section .rodata
format: db "%02x", 0 ; C 0-terminated string
newline: db 10, 0 ; 0-terminated new line
This function does print the provided buffer contents one byte
at a time using printf
. The buffer size must be provided too.
Here we illustrate the preferred way to deal with the stack:
push rbp
mov rbp, rsp
These instructions save the rbp
register in the stack (which
effectively uses 8 bytes so is equivalent to the subtraction
to rsp
we used before, but this way is conventionally used
by the debugger to navigate in the stack frames.
The next instruction "reserves" some space in the stack for local variables:
sub rsp, 32
We must use multiples of 16 in order to preserve the
stack alignment. Here we’ll use just 24 bytes in order
to store the buffer pointer, the buffer size and
an offset counter. This information will go to and
from some registers and need to be saved since
we’ll call printf
for each buffer byte.
With this arrangement, we’ll put our "variables"
at positions rsp
, rsp+8
and rsp+16
. To
easy this scheme a bit, three constants
are defined:
buffer: equ 0
total: equ 8
offset: equ 16
So we may reference the variables as rsp+buffer
, etc. and
their contents as [rsp+buffer]
, etc.
The buffer address is stored in rsp+buffer
so it is extracted
as [rsp+buffer]
into rsi
; this address is then increased
by the current offset (we use r8
for that) and then the
content of the pointed byte is stored in al (mov al, [rsi]
.)
This is sent to rsi for ABI requirements and the higher
bits are cleared with an and
instruction.
We call this with this test program:
#include <string.h>
extern int xdump(const char *ptr, int sz);
int main() {
const char *data = "A1 *** @@@ !!!";
xdump(data, strlen(data));
return 0;
}
The compilation steps and test:
nasm -f elf64 -g -l xdump.lst xdump.asm # also: yasm -f elf64 -g dwarf2 -l xdump.lst xdump.asm gcc -c xdump1.c gcc -o xdump1 xdump1.o xdump.o diego@dataone:~/devel/ASM$ ./xdump1 4131202a2a2a2040404020212121
Addressing
The following program shows several memory/register transfer sizes:
;
; sizes.asm
;
; what happens when we move data between registers, memory and code
;
default rel
extern xdump, printf
section .text
global main
main:
push rbp
mov rbp, rsp
%macro title 1
lea rdi, [%1]
xor eax, eax
call printf wrt ..plt
call justones ; fill 111111111....
%endmacro
; MEMORY <- value
; writing in memory
; in immediate mode
; must set write size
title tv2mb
mov byte [testbuf1], 0x34
mov byte [testbuf1+1], 0xa7
mov byte [testbuf2], 0x35
; mov [testbuf1], 0x34 error!
; mov byte [testbuf1], 0x3412 warning!
; mov word [testbuf1], 0x341264 warning!
call prn
title tv2mw
mov word [testbuf1], 0x53
mov word [testbuf2+1], 0x5556
mov word [testbuf2+2], 0x5758
; mov word [testbuf2+2], 0x575859 warning!
call prn
title tv2md
mov dword [testbuf1], 0xab
mov dword [testbuf2+1], 0xbe444545
mov dword [testbuf2], 0xbeaa5566
call prn
title tv2mq
mov qword [testbuf1], 0xbeaa
mov qword [testbuf2], 0x4eaa5566
; mov qword [testbuf2], 0xbeaa5566 warning!
mov qword [testbuf2+8], 0x36373839
; mov qword [testbuf2], 0xbcbdbebfa0a1a2a3 warning!
call prn
; MEMORY <- (PARTIAL)REG
; from register to memory
; must set write size
title tr2m
mov rax, 0xbcbdbebfa0a1a2a3
mov qword [testbuf1], rax
mov rax, 0x252627
mov qword [testbuf1+8], rax
mov rbx, 0x4444444444444444
mov qword [testbuf2], rbx
mov rbx, 0x123456
mov qword [testbuf2+8], rbx
call prn
title tr2ms
mov rax, 0x445566778899aabb ; saving subregisters
; mov qword [testbuf1], rax ok
; mov dword [testbuf1], rax error!
; mov qword [testbuf1], eax error!
mov dword [testbuf1], eax
; mov word [testbuf1], eax error!
; mov qword [testbuf1+8], ax error!
; mov dword [testbuf1+8], ax error!
mov word [testbuf1+8], ax
; mov qword [testbuf2], ah error!
; mov dword [testbuf2], ah error!
; mov word [testbuf2], ah error!
mov byte [testbuf2], ah
mov [testbuf2+1], ah ; exceptions: byte assumed
mov [testbuf2+2], al
call prn
; (PARTIAL)REG <- MEMORY
; partial memory loads
title tm2r
mov rax, 0x445566778899aabb ; fill a pattern in rax
mov qword [testbuf1], rax ; store in memory
mov rax, 0xcccccccccccccccc ; fill another pattern in rax
mov rax, [testbuf1] ; load from memory
mov qword [testbuf1+8], rax ; store in memory
mov rax, 0xcccccccccccccccc ; fill another pattern in rax
mov eax, [testbuf1] ; partial load from memory -> clears the rest of rax!
mov qword [testbuf2], rax ; store in memory -> see zeros in high bits!
mov rax, 0xcccccccccccccccc ; fill another pattern in rax
mov ax, [testbuf1] ; partial load from memory
mov qword [testbuf2+8], rax ; store in memory -> pattern remains in high bits!
call prn
; al/ah tests
; memory loads
title tm2rs
mov rax, 0x445566778899aabb ; fill a pattern in rax
mov qword [testbuf1], rax ; store in memory
mov rax, 0xcccccccccccccccc ; fill another pattern in rax
mov al, [testbuf1] ; partial load from memory
mov qword [testbuf1+8], rax ; store in memory -> pattern remains in high bits!
mov rax, 0xcccccccccccccccc ; fill another pattern in rax
mov ah, [testbuf1] ; partial load from memory
mov qword [testbuf2], rax ; store in memory -> pattern remains in high bits!
call prn
; (PARTIAL)REG <- value
; immediate mode load
title tv2r
mov rax, 0x445566778899aabb ; fill a pattern in rax
mov eax, 0xdddddddd ; fill a pattern in eax
mov qword [testbuf1], rax ; store in memory -> dddddddd00000000
mov rax, 0x445566778899aabb ; fill a pattern in rax
mov ax, 0xeeee ; fill a pattern in ax
mov qword [testbuf1+8], rax ; store in memory -> eeee998877665544
mov rax, 0x445566778899aabb ; fill a pattern in rax
mov ah, 0xff ; fill a pattern in ah
mov qword [testbuf2], rax ; store in memory -> bbff998877665544
mov rax, 0x445566778899aabb ; fill a pattern in rax
mov al, 0xff ; fill a pattern in al
mov qword [testbuf2+8], rax ; store in memory -> ffaa998877665544
call prn
; (PARTIAL)REG <- (PARTIAL)REG
; register copy
title tr2r
mov rax, 0x445566778899aabb ; fill a pattern in rax
mov rbx, 0xeeeeeeeeeeeeeeee ; and in rbx
mov eax, ebx ; partial copy
mov qword [testbuf1], rax ; store in memory -> dddddddd00000000
mov rax, 0x445566778899aabb ; fill a pattern in rax
mov rbx, 0xeeeeeeeeeeeeeeee ; and in rbx
mov ax, bx ; partial copy
mov qword [testbuf1+8], rax ; store in memory -> eeee998877665544
mov rax, 0x445566778899aabb ; fill a pattern in rax
mov rbx, 0xeeeeeeeeeeeeeeee ; and in rbx
mov ah, bh ; partial copy
mov qword [testbuf2], rax ; store in memory -> bbff998877665544
mov rax, 0x445566778899aabb ; fill a pattern in rax
mov rbx, 0xeeeeeeeeeeeeeeee ; and in rbx
mov al, bl ; partial copy
mov qword [testbuf2+8], rax ; store in memory -> ffaa998877665544
call prn
xor rax, rax ; exit
leave
ret
justones:
; fill with 0x12
xor esi, esi
lea rdi, [testbuf1]
nxt:
mov byte [rdi], 0x11
inc rdi
inc esi
cmp esi, 32
jne nxt
ret
prn:
; dump 48 bytes
push rbp
mov rbp, rsp
lea rdi, [format]
xor eax, eax
call printf wrt ..plt
lea rdi, [testbuf1]
mov rsi, 32
call xdump wrt ..plt
leave
ret
section .note.GNU-stack noalloc noexec nowrite progbits
section .bss
testbuf1: resb 16
testbuf2: resb 16
section .rodata
format: db "000102030405060708090A0B0C0D0E0F"
db "000102030405060708090A0B0C0D0E0F", 10, 0
tv2mb: db "MEMORY byte <- value", 10, 0
tv2mw: db "MEMORY word <- value", 10, 0
tv2md: db "MEMORY dword <- value", 10, 0
tv2mq: db "MEMORY qword <- value", 10, 0
tr2m: db "MEMORY <- REG", 10, 0
tr2ms: db "MEMORY <- PARTIAL REG", 10, 0
tm2r: db "REG <- MEMORY", 10, 0
tm2rs: db "PARTIAL REG <- MEMORY", 10, 0
tv2r: db "(PARTIAL) REG <- value", 10, 0
tr2r: db "PARTIAL REG <- PARTIAL REG", 10, 0
Note the use of the %macro…%endmacro
assembler directives
used to define the title
macro which receives one argument
(a message to be printed.)
The commented warning for the line 49 is subtle: There is
no available mov
mode to transfer a 64-bit value (qword
) directly
to memory (without the intervention of some register.) Thus the value
0xbeaa5566
is interpreted as a 64 bit one which must
be converted to a signed number of 32 bits, but this
requires to be in the range -0x80000000 … +0x7FFFFFFF
, hence
the warning.
It is interesting that specifying dword
(instead of qword
) the
value is correctly interpreted as an unsigned 32 bit integer. See
https://stackoverflow.com/questions/69415210/why-do-32-bit-immediates-with-imul-give-warnings-in-nasm-and-errors-in-gcc
and
https://stackoverflow.com/questions/66577023/why-nasm-warns-when-i-store-qword-directly-onto-stack
The results:
diego@dataone:~/devel/ASM$ ./sizes | sed 's/\(.\{16\}\)\(.\{16\}\)\(.\{16\}\)\(.*\)/\1|\2|\3|\4/' MEMORY byte <- value 0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F 34a7111111111111|1111111111111111|3511111111111111|1111111111111111 MEMORY word <- value 0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F 5300111111111111|1111111111111111|1156585711111111|1111111111111111 MEMORY dword <- value 0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F ab00000011111111|1111111111111111|6655aabebe111111|1111111111111111 MEMORY qword <- value 0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F aabe000000000000|1111111111111111|6655aa4e00000000|3938373600000000 MEMORY <- REG 0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F a3a2a1a0bfbebdbc|2726250000000000|4444444444444444|5634120000000000 MEMORY <- PARTIAL REG 0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F bbaa998811111111|bbaa111111111111|aaaabb1111111111|1111111111111111 REG <- MEMORY 0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F bbaa998877665544|bbaa998877665544|bbaa998800000000|bbaacccccccccccc PARTIAL REG <- MEMORY 0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F bbaa998877665544|bbcccccccccccccc|ccbbcccccccccccc|1111111111111111 (PARTIAL) REG <- value 0001020304050607|08090A0B0C0D0E0F|0001020304050607|08090A0B0C0D0E0F dddddddd00000000|eeee998877665544|bbff998877665544|ffaa998877665544
The build does require the xdump.o
module to link:
nasm -f elf64 -g -l sizes.lst sizes.asm # also: yasm -f elf64 -g dwarf2 -l sizes.lst sizes.asm gcc -O0 -o sizes sizes.o xdump.o
For more sizing operands, see also: https://stackoverflow.com/questions/12063840/what-are-the-sizes-of-tword-oword-and-yword-operands
Printing a float
The following example shows how to do this:
; test5.asm: print a float number
;
default rel
extern printf
section .text
global main
main:
push rbp
mov rbp, rsp
lea rdi, [format]
mov eax, 1
movsd xmm0, [piaprox]
call printf wrt ..plt
ret
section .note.GNU-stack noalloc noexec nowrite progbits
section .rodata
format: db "%g", 10, 0 ; C 0-terminated string
piaprox: dq 3.1416
We’re using the SSE extension xmm*
registers (with 128 bytes.)
There is no immediate mode for setting a float value (like
mov xmm0, 1.4
) but constants may be loaded from memory.
See that we’re setting the eax
register to 1 in order to
notify printf
about one floating point argument.
See also the dq
assembler instruction to signal a quad-byte
floating point number.
See https://stackoverflow.com/questions/12063840/what-are-the-sizes-of-tword-oword-and-yword-operands for more information about the size related operands.
The next example expands these ideas and approximates the PI constant with the (rather slowly convergent) Leibniz formula:
PI=4*(1-1/3+1/5-1/7...)
See the usage of the call
instruction to call a
local (not globally exported) subroutine. Here we
show several instructions to do arithmetic operations:
; test6.asm: approximate pi via Leibniz formula
;
default rel
extern printf
section .text
global main
main:
push rbp
mov rbp, rsp
lea rdi, [format]
mov ecx, 10000
call calcpi
mov eax, 1
call printf wrt ..plt
ret
calcpi:
mov rax, 1
cvtsi2sd xmm1, rax ; xmm1 contains 1 for ever
movapd xmm2, xmm1
addpd xmm2, xmm2 ; xmm2 contains 2 in the loop
movapd xmm0, xmm1 ; accumulator starts in 1
movapd xmm4, xmm1 ; divisor starts in 1
mov bl, 0 ; subtract
nextpi:
addpd xmm4, xmm2 ; divisor += 2
movapd xmm3, xmm1 ; xmm3<- 1
divpd xmm3, xmm4 ; xmm3<- 1/divisor
not bl
cmp bl, 0
jne l_subtract
l_add:
addpd xmm0, xmm3 ; xmm0 += quotient
dec ecx
jnz nextpi
jmp finalize
l_subtract:
subpd xmm0, xmm3 ; xmm0 -= quotient
dec ecx
jnz nextpi
finalize:
addpd xmm2, xmm2 ; xmm2 contains 4
mulpd xmm0, xmm2 ; 4*accumulator
ret
section .note.GNU-stack noalloc noexec nowrite progbits
section .rodata
format: db "%g", 10, 0 ; C 0-terminated string
Note that we don’t use the memory in order to initialize
the required constants, but make a conversion from
an integer value with the cvtsi2sd xmm1, rax
instruction. Also,
the alternating operations are obtained by continuous negation of the
bl
subregister which turn it from all-zeros to all-ones.
A note about SIMD
The xmm*
registers and its related instructions are
facilities collectively known as "single instruction multiple data"
(or SIMD.) For example, a 128-bits wide register is able to
represent (or contain) two 64-bits double precision floating point
values, or four 32-bits single precision floating point values, and
(most importantly) to execute arithmetic operations simultaneously
with all the contained values.
Since these facilities were not included in the original 8086 cpu, they are announced as "extensions" to be created/supported by the successive releases of Intel and AMD cpus. The extensions which provided SIMD facilities were (chronologically) MMX, SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions.)
MMX introduced eight 64-bits wide registers (MM0…MM7) able to
hold integers of 8, 16, 32 and 64 bits. These are aliased
to the x87 registers (share the same bits.) AMD added the capability
to store 32-bit single precision floating point numbers
with its extension named 3DNow!. Later, the SSE extensions provided
the 128-bits xmm*
registers used in the previous example. MMX
and x87 instructions are currently obsolete.
Fibonacci
The next program asks the user to provide a positive integer number and tries to calculate the n-th number of the well known Fibonacci' series:
1, 1, 2, 3, 5, 8, 13, 21...
In order to get some text from the user, we’ll call fgets
; but first
note that this is not portable enough, since stdin
is not always
an external symbol but a macro:
https://stackoverflow.com/questions/3689919/how-to-call-fgets-in-x86-assembly
; test7.asm: calculate Fibonacci series number
;
default rel
extern printf, fgets, atoi, stdin
section .text
global main
main:
push rbp
mov rbp, rsp
; show prompt
lea rdi, [prompt]
xor eax, eax
call printf wrt ..plt
; Get a number by calling
; char *fgets(char *s, int size, FILE *stream);
lea rdi, [inbuf]
mov rsi, 8
mov rdx, [stdin]
call fgets wrt ..plt
; Get an integer by calling
; int atoi(const char *nptr);
lea rdi, [inbuf]
call atoi wrt ..plt
cmp rax, 0 ; integer must be positive
jg nok
; else reject
lea rdi, [errneg]
xor eax, eax
call printf wrt ..plt
leave
ret
nok:
cmp rax, 2 ; >=2 to for calculations
jg g2ok
mov rax, 1 ; else just print 1
call printer
leave
ret
g2ok:
; r8=r9=1
mov r8, 1
mov r9, 1
sub rax, 2
fstep:
; swap r8/r9, r8=r8+r9
mov r10, r8
mov r8, r9
mov r9, r10
add r8, r9
dec rax
jnz fstep
mov rax, r8
call printer
leave
ret
printer:
push rbp
mov rbp, rsp
lea rdi, [format]
mov rsi, rax ; atoi() answer
xor eax, eax
call printf wrt ..plt
leave
ret
section .note.GNU-stack noalloc noexec nowrite progbits
; C 0-terminated strings
section .rodata
format: db "%d", 10, 0
errneg: db "The n-th value must be positive", 10, 0
prompt: db "Please enter a positive integer> ", 0
section .bss
inbuf resb 8
The text buffer is converted to an integer using atoi()
and the
result (in rax
) is tested for positiveness. If not positive,
an error message is printed and the program terminates. Else, rax
is tested against 2
since the first couple of terms are 1
and does not need further calculation.
Next, we use rax
as a counter, and setup r9
and r10
as
"current terms" of the series. Then a simple loop is executed
until rax
becomes zero (jumps to retry while not zero, jnz
.)
Square Root x87
Here we’ll make use of the (now obsolete) x87 Floating Point
Unit. It provides eight 80-bits-wide registers which are organized
in a stack fashion; the stack elements are referenced by st0
,
st1
, etc. Some assemblers use st
as an abbreviation for st0
.
The last "pushed" element corresponds to st0
; that push
operation automatically moves the previous value of st0
into
st1
and so on.
The x87 registers store floating point numbers, and interchange those values with the memory. The memory values may be values encoded as:
-
signed integer word (16 bits)
-
signed integer double word (32 bits), usually C’s
int
type -
signed integer quad word (64 bits), usually C’s
long
type -
single precision floating point (32 bits), usually C’s
float
type -
double precision floating point (64 bits), usually C’s
double
type -
double extended precision floating point (80 bits)
The x87 instructions do not allow the data interchange with the standard cpu registers: the memory must be used as temporary storage.
The following example shows how to calculate and print the square root of five:
; test8: calculate and print sqrt(5)
;
default rel
extern printf
section .text
global main
main:
push rbp
mov rbp, rsp
mov rax, 5
mov [numbuf], rax
fild qword [numbuf] ; store 5 in st(0)
fsqrt
fstp qword [numbuf] ; save sqrt(5)
call printer
leave
ret
printer:
push rbp
mov rbp, rsp
lea rdi, [format]
movsd xmm0, [numbuf]
mov eax, 1
call printf wrt ..plt
leave
ret
section .note.GNU-stack noalloc noexec nowrite progbits
; C 0-terminated strings
section .rodata
format: db "sqrt(5) is %g", 10, 0
section .bss
numbuf: resb 8
An integer five is stored into [numbuf]
as a quad word integer, which
in turn is used to feed the x87 FPU:
fild qword [numbuf]
Now st0
does contain 5.0; the fsqrt
just turns st0
into sqrt(st0)
so this register contents is sent into the same memory buffer:
fstp qword [numbuf]
Note that there is a sister instruction fst
with the same purpose, but the
fstp
version additionally "pops" the stack so it is effectively emptied. Several
x87 instructions have a *p
version for a complementary stack pop.
The buffer content is then stored into xmm0
in order to be sent into printf()
as previously shown.
Fibonacci revisited
Now we’ll make use of the FPU in order to calculate Fibonacci numbers by using the Binet’s formula:
Fn = (phi^n - (-1/phi)^2)/sqrt(5)
where phi
is the golden ratio: (1+sqrt(5))/2
. See https://en.wikipedia.org/wiki/Fibonacci_number
for more information.
; test9.asm: Binet formula
;
default rel
extern printf, atoi
section .text
global main
main:
push rbp
mov rbp, rsp
cmp rdi, 2 ; arc must be 2
je a1
lea rdi, [errarg]
xor eax, eax
call printf wrt ..plt
leave
ret
a1:
mov rdi, [rsi+8] ; call atoi(argv[1])
call atoi wrt ..plt
cmp rax, 0 ; integer must be positive
jg nok
; else reject
lea rdi, [errneg]
xor eax, eax
call printf wrt ..plt
leave
ret
nok:
cmp rax, 2 ; >=2 to for calculations
jg g2ok
fld1
fstp qword [numbuf]
call printer
leave
ret
g2ok:
mov rdi, rax
call calc5
call calcgold
call fibo
call printer
leave
ret
fibo:
mov [pow_exp], rdi
mov rax, [gold]
mov [pow_base], rax
call power ; pow
fld1 ; 1 pow
fld st1 ; pow 1 pow
fdivp ; 1/pow pow
and rdi, 0x1 ; test if rdi is even
jnz fibo2
; negate st0
fldz ; 0 1/pow pow
fxch ; 1/pow 0 pow
fsubp ; -1/pow pow
fibo2:
faddp ; [+-]1/pow + pow
; divide by sqrt(5)
fld qword [sq5] ; sq5 ([+-]1/pow + pow)
fdivp ; (+)/sq5
fstp qword [numbuf]
ret
calc5:
mov rax, 5
mov [sq5], rax
fild qword [sq5]
fsqrt
fstp qword [sq5]
ret
calcgold:
fld qword [sq5]
fld1
faddp
fld1
fld1
faddp
fdivp
fstp qword [gold]
ret
power:
push rbp
mov rbp, rsp
fild qword [pow_exp]
fld qword [pow_base]
fyl2x
fld1
fld st1
fprem
f2xm1
faddp
fscale
fxch st1
fstp st0
leave
ret
printer:
push rbp
mov rbp, rsp
lea rdi, [format]
movsd xmm0, [numbuf]
mov eax, 1
call printf wrt ..plt
xor rax, rax ; don't care printf retval
leave
ret
section .note.GNU-stack noalloc noexec nowrite progbits
; C 0-terminated strings
section .rodata
format: db "pow is %g", 10, 0
prompt: db "Please enter a positive integer> ", 0
errneg: db "The n-th value must be positive", 10, 0
errarg: db "Must provide one command line argument", 10, 0
section .bss
inbuf: resb 8
pow_base: resb 8
pow_exp: resb 8
numbuf: resb 8
sq5: resb 8
gold: resb 8
The program is an extension from the two previous ones.
Note
|
The power routine was adapted from https://www.madwizard.org/programming/snippets?id=36 |
The power routine makes use of the identity:
x^y = exp(y*log(x))
but the FPU provides two-based functions, so the identity must be rewritten as:
x^y = 2^(y*log2(x)) = 2^K
The FPU provides the fyl2x
instruction in order to compute K = y*log2(x))
, but
there is a quirk: the calculation of 2^K
must be done in two stages. If K=[K]+{K}
(integer and fractional parts of K
), then:
x^y = 2^([K]+{K]) = 2^[K] * 2^{k} = 2^[K] * (2^{k} -1 + 1)
The f2xm1
instruction is able to calculate 2^x -1
, provided -1 < x < 1
, and
the fscale
calculates y * 2^[x]
. So, in pseudocode:
K = fyl2x(x,y) fscale(K, add(f2xm1(K), 1)
Note also the use of fld1
to load a 1
in st0. It is used in order
to get the fractional part of K
with the fprem
instruction (remainder
of division by 1.)
Finally, after the fscale
execution, one unneeded value remains in st1. It is exchanged with st0
(so the unneeded value goes into st0), and that value is "popped"
with fstp st0
. The final st0 is used as return value.
Other point of interest is the implementation of the Binet’s formula. This is implemented as:
phi^n + ((+/-) phi^-n)
where the sign depends on the parity of the exponent. This is tested
by checking the last bit of the exponent in rdi
with an and 0x1
instruction. The negated sign version is obtained by subtracting
from zero (the zero is loaded with fldz
.)
Finally, the program expects the n-th term to be provided from
command line (not via fgets
.) The argc,argv
arguments come
in rdi,rsi
, so we first check that rdi==2
, then get
the pointer argv[1]
in the following way:
argv==rsi
→ argv+1==rsi+8
→ *(argv+1)==argv[1]==[rsi+8]
.
Parenthesis: most used instructions
Here we made a trivial and gross analysis about the most
used instructions in the Linux kernel (downloaded on March 4, 2021
from https://github.com/torvalds/linux); we excluded the crypto
subdir since it contains code which is too specialized to be
considered as a reference to study (but I wonder if all the
subdirs are too specialized in some way.)
The source files were concatenated into /tmp/linux-raw.c
in
order to apply the gcc
pre-compiler for comments removal, else
many non-assembly words leak out to the output. Of course, no C
compilation is performed at all!
We get usage counters for the instructions, which are grouped by 20-units-frequency. Note that several words are not instructions but macros or assembler directives. Also, note that the frequency "0" does mean the 0-19 range.
There is also assembly code embedded in many C language files, which was not extracted since the parsing was not trivial. The instructions use the AT&T syntax.
diego@dataone:~/Desktop/linux-master/arch/x86$ cat $(find $(ls | grep -v crypto) -name '*.S') > /tmp/linux-raw.c diego@dataone:~/Desktop/linux-master/arch/x86$ cd /tmp diego@dataone:/tmp$ gcc -fpreprocessed -dD -E linux-raw.c > linux-raw.out diego@dataone:/tmp$ cat linux-raw.out | grep -v '^;' | sed 's/^[ ]*//' | grep '^[a-z]' | grep -v ':' | sed 's/[ \.].*//' | grep '^[a-z]*$' | sort | uniq -c | awk '{a=20*int($1/20); print a,$2}' | sort -nr > /tmp/linux-freq.txt diego@dataone:/tmp$ awk '{dat[$1]=dat[$1] " " $2}END{for(i in dat){print i,dat[i]}}' /tmp/linux-freq.txt | sort -nr 1160 movl 580 movq 280 pushl 240 jmp 200 pushq 180 popl adcl 160 xorl mov 140 ret call addl 120 jz 100 pop cmpl 80 subl push movw leaq jnz andl 60 testl popq orl jne je 40 xor shrl mull movb leal jb addq 20 testb subq shrd sbbl rep rcrl popfl leave lea jnc jae divl decl cmpw cmpq cmp cld 0 xorw xorq xchgl xchg wrmsr writepost wrgsbase wbinvd text testq test sysretq sysretl sysexit syscall swapgs subw subb sub str sti std stc source sldt sidt shrw shrq shrb shr shlq shll shld shl sgdtl sgdt sfence setz setne setaeb sbbb sbb sarl sar rorq roll retq retl rdmsr rcll pushw pushfq pushfl pushf pushal popw popfq popf popal pgtno percpu pause outb orw orq orb or notq notl note nop negw negl movzwq movzwl movzbq movzbl movswl movsq movsl movsb movabsq movabs ltr lss lretw lretq lret lodsb lmsw lldt ljmpw ljmpl ljmp lidtl lidt lgdtl lgdt lfence js jns jnb jmpq jmpl jle jl jiffies jge jg jecxz jc jbe ja iretq iret int init incw incq incl inc imulq imull idtentry i hlt enclu dynamic dest decw decq dec data cpuid cmpb clts cli clc cdq calll btsq btsl bts btrl btr btl bt bsrl andw andq andb and addw addb add adcq
For comparison, we did the same for the venerable Microsoft DOS 2.0, available
from https://github.com/microsoft/MS-DOS; the DOSMAC.ASM
contained several
macros for the assembler which must be excluded from the output, but several
assembler directives remain:
diego@dataone:~/Desktop/MS-DOS-master/v2.0/source$ grep -ai macro DOSMAC.ASM | grep '^[a-zA-Z]' | awk '{print $1}' | tr 'a-z' 'A-Z' | sed 's/.*/^&$/' > /tmp/macros.tmp diego@dataone:~/Desktop/MS-DOS-master/v2.0/source$ cat *.ASM | grep -av '^;' | grep -av ':' | grep -a '^ [^ ]' | sed 's/^ //' | sed 's/[ ].*//' | grep '^[a-zA-Z]' | tr 'a-z' 'A-Z' | egrep -av -f /tmp/macros.tmp | sed 's/^M//' | sort | uniq -c | awk '{a=20*int($1/20); print a,$2}' | sort -nr > /tmp/dos-raw.txt diego@dataone:~/Desktop/MS-DOS-master/v2.0/source$ awk '{dat[$1]=dat[$1] " " $2}END{for(i in dat){print i,dat[i]}}' /tmp/dos-raw.txt | sort -nr 6160 MOV 1780 CALL 1660 POP 1640 PUSH CMP 1120 JZ 1080 JMP 900 JNZ 760 INT 640 DB 600 INC 520 XOR 460 OR ADD 400 DW 380 RET 340 DEC 320 ENDIF 300 IF 280 PUBLIC 260 SUB 220 LODSB 200 TEST STOSB RETURN JC 180 XCHG 160 SHR AND 140 SHL 120 REP LDS JNC JB 100 STOSW LOOP JE INCLUDE 80 LES JCXZ JAE 60 STC PAGE JNE JBE JA ELSE 40 SUBTTL STI MUL END CLD CLC 20 RETZ REPNE PUSHF POPF OUT ORG MOVSW MOVSB LODSW IRET DIV CONTEXT CLI CBW ADC 0 XLAT TRANSFERIF TITLE SYS STD SCASB SBB SAVE SAR SAHF ROR ROL RETNZ RETNC RETC RESTORE REPZ REPNZ REPE RCR RCL NOT NEG NAME MESSAGE LOOPNZ LOOPNE LOCAL LEA LAHF JS JPE JNS JNB JNA JLE JL JGE JG IN IFIDN IFDEF ENDM DO_EXT DAA CMPSB CMC AAM AAD
Getting the time
The next program uses the clock_gettime()
POSIX function used to
get the current time with a (theoretical) precision of nanoseconds. The
real precision should depend on the operating system and hardware. We
introduce a dummy delay of 50000 microseconds (0.05 seconds) using
the usleep()
POSIX function.
; timing.asm: showing a delay and time computation
;
default rel
extern printf, clock_gettime, usleep
%macro prelude 0
push rbp
mov rbp, rsp
%endmacro
section .text
global main
main:
prelude
sub rsp, 32
mov rdi, 0
mov rsi, rsp
call clock_gettime wrt ..plt
mov rdi, 50000
call usleep wrt ..plt
mov rdi, 0
lea rsi, [rsp+16]
call clock_gettime wrt ..plt
mov rax, 1000000000
cvtsi2sd xmm4, rax
mov rax, [rsp]
cvtsi2sd xmm0, rax
mov rax, [rsp+8]
cvtsi2sd xmm1, rax
mov rax, [rsp+16]
cvtsi2sd xmm2, rax
mov rax, [rsp+24]
cvtsi2sd xmm3, rax
subpd xmm3, xmm1
subpd xmm2, xmm0
divpd xmm3, xmm4
movapd xmm0, xmm2 ;
addpd xmm0, xmm3
lea rdi, [format]
mov eax, 1
call printf wrt ..plt
leave
ret
section .note.GNU-stack noalloc noexec nowrite progbits
; C 0-terminated strings
section .rodata
format: db "time dif is %g sec", 10, 0
From the man page, the clock_gettime()
retrieves the time in the
structure:
struct timespec {
time_t tv_sec; /* seconds */
long tv_nsec; /* nanoseconds */
};
After compiling a dummy C program which calls such function, we discover that (in our system) both fields are long integers using 8 bytes, compactly stored as 16 consecutive bytes. So we reserve 32 bytes for storing two structures used in two calls to the function:
sub rsp, 32
The structures are located at addresses rsp
and rsp+16
, corresponding
to the tv_sec
fields. At rsp+8
and rsp+24
we’ll get the
tv_nsec
fields.
Then we make the following calculation using double precision arithmetic,
using pseudo-code for the times t1
and t2
:
(t2.tv_sec-t1.tv_sec) + (t2.tv_nsec-t1.tv_nsec)/1e9
The resulting time difference is shown using printf()
:
diego@dataone:~/devel/ASM$ ./timing time dif is 0.0502603 sec diego@dataone:~/devel/ASM$ ./timing time dif is 0.0502895 sec diego@dataone:~/devel/ASM$ ./timing time dif is 0.0502988 sec diego@dataone:~/devel/ASM$ ./timing time dif is 0.0502095 sec
Note about MMX
For floating point calculations, the x87 FPU is (in practice) superseded by the SSE2 extension’s instructions. Before SSE there was the MMX extension which provided some 64-bit registers.
From a draft for the first edition of The Art of Assembly (Randall Hyde):
"Although MM0..MM7 appear as separate registers in the Intel Architecture, the Pentium processors alias these registers with the FPU’s registers (ST0..ST7). Each of the eight MMX 64-bit registers is physically equivalent to the L.O. 64-bits of each of the FPU’s registers […]. The MMX registers overlay the FPU registers in much the same way that the 16-bit general purpose registers overlay the 32-bit general purpose registers."
The combination of MMX and the FPU is discouraged:
"Because the MMX registers overlay the FPU registers, you cannot mix FPU and MMX instructions in the same computation sequence. You can begin executing an MMX instruction sequence at any time; however, once you execute an MMX instruction you cannot execute another FPU instruction until you execute a special MMX instruction, EMMS (Exit MMX Machine State)."
There is a chance that a program is executed in a cpu which does not implements a required extension; in this case it usually crashes because an unsupported instruction; in a later section we comment about the runtime discovery of cpu extensions to better handle these cases.
So, in practice, we’ll use the SSE registers (xmm*
) as in the
following example:
Inverting a matrix
The following C program shows a simple (dummy) implementation of the high school procedure for matrix inversion, applying "elementary operations" into the matrix and into a helper identity matrix; when the original matrix is converted into an identity, the helper matrix is the desired inverse. Note that we avoid several checks in order to simplify the implementation:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>
void show(double *m, int n) {
if(n > 10) {
printf("Too big to show\n");
return;
}
int r, c;
for(r = 0 ; r < n ; r ++) {
for(c = 0 ; c < n ; c ++) {
printf("%10g ", m[r*n+c]);
}
putchar('\n');
}
putchar('\n');
}
int isIdentity(double *m, int n) {
int r, c;
for(r = 0 ; r < n ; r ++) {
for(c = 0 ; c < n ; c ++) {
if(r == c) {
if(fabs(m[r*n+c]-1.0) > 1e-5) return 0;
} else {
if(fabs(m[r*n+c]-0.0) > 1e-5) return 0;
}
}
}
return 1;
}
double matrixTotal(double *m, int n) {
int r, c;
double t = 0.0;
for(r = 0 ; r < n ; r ++) {
for(c = 0 ; c < n ; c ++) {
t += m[r*n+c];
}
}
return t;
}
double matrixTrace(double *m, int n) {
int r;
double t = 0.0;
for(r = 0 ; r < n ; r ++) {
t += m[r*n+r];
}
return t;
}
void xcopy(double *a, double *b, double *prod, int n) {
/* multiply copy by m2 */
int c, r, x;
for(c = 0 ; c < n ; c ++) {
for(r = 0 ; r < n ; r ++) {
prod[c*n+r]=0.0;
for(x = 0 ; x < n ; x ++) {
prod[c+n*r] += a[c*n+x]*b[x*n+r];
}
}
}
}
int main(int argc, char **argv) {
int r, c, x;
int n = atoi(argv[1]);
double rmax = RAND_MAX;
struct timespec tp1, tp2;
srand(0);
double *m1 = calloc(n * n, sizeof(double));
double *m2 = calloc(n * n, sizeof(double));
double *prod = calloc(n * n, sizeof(double));
/* build identity */
for(r = 0 ; r < n ; r ++) {
m2[r*n+r] = 1.0;
}
/* build random matrix */
for(r = 0 ; r < n ; r ++) {
for(c = 0 ; c < n ; c ++) {
m1[r*n+c] = rand()/rmax;
}
}
show(m1, n);
show(m2, n);
/* copy the random matrix in prod for final check */
memcpy(prod, m1, n * n * sizeof(double));
/* invert the matrix */
clock_gettime(CLOCK_REALTIME, &tp1);
for(c = 0 ; c < n ; c ++) {
double ipivot = 1.0/m1[c*n+c];
/* pivot row multiply by 1/pivot */
for(x = 0 ; x < n ; x ++) {
m1[c*n+x] *= ipivot;
m2[c*n+x] *= ipivot;
}
/* process rows except pivot */
for(r = 0 ; r < n ; r ++) {
if(r != c) {
double factor = m1[r*n+c];
for(x = 0 ; x < n ; x ++) {
m1[r*n+x] -= factor * m1[c*n+x];
m2[r*n+x] -= factor * m2[c*n+x];
}
}
}
}
clock_gettime(CLOCK_REALTIME, &tp2);
double ds = tp2.tv_sec - tp1.tv_sec;
double dn = tp2.tv_nsec - tp1.tv_nsec;
double dif = ds + dn / 1e9;
printf("time dif is %g sec\n", dif);
show(m1, n); /* identity */
show(m2, n); /* inverted matrix */
/* m1 returns to the original matrix */
memcpy(m1, prod, n * n * sizeof(double));
/* multiply m1*m2 into prod for checking */
xcopy(m1, m2, prod, n);
printf("is identity = %d\n", isIdentity(prod, n));
printf("total = %g\n", matrixTotal(m2, n));
printf("trace = %g\n", matrixTrace(m2, n));
return 0;
}
Additionally we implemented a matrix product routine used to check that the inverse multiplied by the original matrix result in an identity. A helper function checks it is indeed an identity (up to some precision.)
In order to quickly compare the results with an assembly version, we calculate the sum of all the inverse’s elements, and its trace:
diego@dataone:~/devel/ASM$ ./model-inv 500 time dif is 1.26518 sec is identity = 1 total = 1.94742 trace = 3.69845 diego@dataone:~/devel/ASM$ ./model-inv 500 time dif is 1.21957 sec is identity = 1 total = 1.94742 trace = 3.69845
After compiling with optimization level 3 (-O3 option), the performance improves:
diego@dataone:~/devel/ASM$ ./model-inv 500 time dif is 0.531943 sec is identity = 1 total = 1.94742 trace = 3.69845 diego@dataone:~/devel/ASM$ ./model-inv 500 time dif is 0.500119 sec is identity = 1 total = 1.94742 trace = 3.69845
A corresponding assembly version follows:
;
; inv.asm: inverting a matrix
;
default rel
extern printf, clock_gettime, rand, srand, calloc, atoi, putchar
%macro prelude 0
push rbp
mov rbp, rsp
%endmacro
section .text
global main
main:
prelude
sub rsp, 48
mov rdi, [rsi+8] ; get matrix size from argv[1]
call atoi wrt ..plt
mov r15, rax
mul eax ; n^2
mov edi, eax
mov rsi, 16 ; alignment for xmm* registers
ccal:
call calloc wrt ..plt
mov qword [m1], rax
mov rax, r15
mul eax ; n^2
xor rdi, rdi
mov edi, eax
mov rsi, 16 ; alignment for xmm* registers
call calloc wrt ..plt
mov qword [m2], rax
; srand(0)
xor rdi, rdi
call srand wrt ..plt
; build random on m1
mov rax, [randmax]
cvtsi2sd xmm4, rax
xor r12, r12
nextr:
xor r13, r13
nextc:
movdqu oword [rsp+32], xmm4
; movdqu dqword [rsp+32], xmm4
call rand wrt ..plt
movdqu xmm4, oword [rsp+32]
; movdqu xmm4, dqword [rsp+32]
cvtsi2sd xmm0, rax
divpd xmm0, xmm4 ; range is 0-1
mov rax, r15
imul rax, r12
add rax, r13
imul rax, 16
; movapd [m1+rax], xmm0 -> relocation R_X86_64_32
mov r14, qword [m1]
add r14, rax
movapd [r14], xmm0
inc r13
cmp r13, r15
jl nextc
inc r12
cmp r12, r15
jl nextr
; build identity on m2
xor r12, r12
nextri:
mov rax, r15
imul rax, r12
add rax, r12
imul rax, 16
; movapd [m2+rax], xmm0 -> relocation R_X86_64_32
mov r14, qword [m2]
add r14, rax
mov rax, 1
cvtsi2sd xmm0, rax
movapd [r14], xmm0
inc r12
cmp r12, r15
jl nextri
; printing
mov r14, qword [m1]
call show
mov r14, qword [m2]
call show
; get start time
mov rdi, 0
mov rsi, rsp
call clock_gettime wrt ..plt
; begin invertion
mov rax, 1
cvtsi2sd xmm5, rax ; xmm5 contains 1.0
xor r13, r13 ; main loop: column iteration
inextc:
mov rax, r15
imul rax, r13
add rax, r13
imul rax, 16
mov r14, qword [m1]
add r14, rax
movapd xmm0, [r14]
movapd xmm3, xmm5
divpd xmm3, xmm0 ; xmm3 has ipivot = 1.0/m1[c*n+c];
xor rbx, rbx ; normalize pivot cell: row factor loop
inextx:
mov rax, r15
imul rax, r13
add rax, rbx
imul rax, 16
mov r14, qword [m1]
add r14, rax
movapd xmm0, [r14] ; get m1[c*n+x]
mulpd xmm0, xmm3
movapd [r14], xmm0 ; m1[c*n+x] *= ipivot;
mov r14, qword [m2]
add r14, rax
movapd xmm0, [r14] ; get m1[c*n+x]
mulpd xmm0, xmm3
movapd [r14], xmm0 ; m2[c*n+x] *= ipivot;
inc rbx
cmp rbx, r15
jl inextx
xor r12, r12
inextr:
cmp r12, r13 ; if(r != c)
je inextre
mov rax, r15
imul rax, r12
add rax, r13
imul rax, 16
mov r14, qword [m1]
add r14, rax
movapd xmm3, [r14] ; factor = m1[r*n+c];
xor rbx, rbx
inextxx:
mov rax, r15
imul rax, r13
add rax, rbx
imul rax, 16
mov r14, qword [m1]
add r14, rax
movapd xmm2, [r14] ; m1[c*n+x]
mulpd xmm2, xmm3 ; factor * m1[c*n+x]
mov r14, qword [m2]
add r14, rax
movapd xmm1, [r14] ; m2[c*n+x]
mulpd xmm1, xmm3 ; factor * m2[c*n+x]
mov rax, r15
imul rax, r12
add rax, rbx
imul rax, 16
mov r14, qword [m1]
add r14, rax
movapd xmm0, [r14] ; m1[r*n+x]
subpd xmm0, xmm2
movapd [r14], xmm0 ; m1[r*n+x] -= factor * m1[c*n+x]
mov r14, qword [m2]
add r14, rax
movapd xmm0, [r14] ; m2[r*n+x]
subpd xmm0, xmm1
movapd [r14], xmm0 ; m2[r*n+x] -= factor * m2[c*n+x]
inc rbx
cmp rbx, r15
jl inextxx
inextre:
inc r12
cmp r12, r15
jl inextr
inc r13
cmp r13, r15
jl inextc
; get end time
mov rdi, 0
lea rsi, [rsp+16]
call clock_gettime wrt ..plt
; calculate time difference
mov rax, 1000000000
cvtsi2sd xmm4, rax
mov rax, [rsp]
cvtsi2sd xmm0, rax
mov rax, [rsp+8]
cvtsi2sd xmm1, rax
mov rax, [rsp+16]
cvtsi2sd xmm2, rax
mov rax, [rsp+24]
cvtsi2sd xmm3, rax
subpd xmm3, xmm1
subpd xmm2, xmm0
divpd xmm3, xmm4
movapd xmm0, xmm2 ;
addpd xmm0, xmm3
; show time difference
lea rdi, [format]
mov eax, 1
call printf wrt ..plt
; printing again
mov r14, qword [m1]
call show
mov r14, qword [m2]
call show
; get total
mov r14, qword [m2]
call matrixtotal
lea rdi, [ftotal]
mov eax, 1
call printf wrt ..plt
; get trace
mov r14, qword [m2]
call matrixtrace
lea rdi, [ftrace]
mov eax, 1
call printf wrt ..plt
leave
ret
; r15 has matrix size, r14 has pointer to the data, result in xmm0
matrixtotal:
prelude
sub rsp, 16
mov qword [rsp], r14
pxor xmm0, xmm0
xor r12, r12
tonextr:
xor r13, r13
tonextc:
mov rax, r15
imul rax, r12
add rax, r13
imul rax, 16
mov r14, qword [rsp]
add r14, rax
movapd xmm1, [r14]
addpd xmm0, xmm1
inc r13
cmp r13, r15
jl tonextc
inc r12
cmp r12, r15
jl tonextr
leave
ret
; r15 has matrix size, r14 has pointer to the data, result in xmm0
matrixtrace:
prelude
sub rsp, 16
mov qword [rsp], r14
pxor xmm0, xmm0
xor r12, r12
trnextr:
mov rax, r15
imul rax, r12
add rax, r12
imul rax, 16
mov r14, qword [rsp]
add r14, rax
movapd xmm1, [r14]
addpd xmm0, xmm1
inc r12
cmp r12, r15
jl trnextr
leave
ret
; r15 has matrix size, r14 has pointer to the data
show:
prelude
sub rsp, 16
cmp r15, 10
jg toobig
mov qword [rsp], r14
xor r12, r12
snextr:
xor r13, r13
snextc:
mov rax, r15
imul rax, r12
add rax, r13
imul rax, 16
mov r14, qword [rsp]
add r14, rax
movapd xmm0, [r14]
lea rdi, [mformat]
mov eax, 1
call printf wrt ..plt
inc r13
cmp r13, r15
jl snextc
mov rdi, 10
call putchar wrt ..plt
inc r12
cmp r12, r15
jl snextr
mov rdi, 10
call putchar wrt ..plt
leave
ret
toobig: lea rdi, [ftoobig]
xor eax, eax
call printf wrt ..plt
leave
ret
section .note.GNU-stack noalloc noexec nowrite progbits
; C 0-terminated strings
section .rodata
format: db "time dif is %g sec", 10, 0
ftoobig: db "Too big to show", 10, 0
ftotal: db "total = %g", 10, 0
ftrace: db "trace = %g", 10, 0
mformat: db "%10g ", 0
randmax: dq 2147483647
section .bss
n: resb 8
m1: resb 8
m2: resb 8
-
We made strong use of the "preserved" registers r12-r15 and rbx
-
The RAND_MAX constant is implementation dependent; the value was located in
/usr/include/stdlib.h
(an alternative way to look it up is generating the assembly of a dummy C program which uses such value) -
The
pxor
instruction was used to setxmm0
to zero; see https://stackoverflow.com/questions/33666617/what-is-the-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and for more information regarding the zeroing of the registers -
Note that we aligned all the matrix elements to 16 bytes in memory, since the
xmm*
instructions does require it for load/store operations; we’re wasting half of the memory -
The returned
calloc
memory block was 16-bytes aligned. This is not guaranteed, and we should take some precautions for a distinct scenery (see below)
Some results:
diego@dataone:~/devel/ASM$ ./inv 500 time dif is 1.17459 sec total = 1.94742 trace = 3.69845 diego@dataone:~/devel/ASM$ ./inv 500 time dif is 1.21626 sec total = 1.94742 trace = 3.69845
Using the debugger
The debugger is a powerful tool for assembly
programs. For example, we may start a debugging
session of the inv
program:
diego@dataone:~/devel/ASM$ gdb ./inv ... (gdb)
Usually, we instruct the debugger to run until some point in the program
identified by a label (breakpoint.) For example, to mark the
inextc
label as breakpoint:
(gdb) break inextc Breakpoint 1 at 0x913: file inv.asm, line 104.
Then the program is executed, providing command line arguments:
(gdb) run 4 Starting program: /home/diego/devel/ASM/inv 4 0.840188 0.394383 0.783099 0.79844 0.911647 0.197551 0.335223 0.76823 0.277775 0.55397 0.477397 0.628871 0.364784 0.513401 0.95223 0.916195 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 Breakpoint 1, inextc () at inv.asm:104 104 mov rax, r15 (gdb)
Here we may list the surrounding code:
(gdb) list 99 mov rax, 1 100 cvtsi2sd xmm5, rax ; xmm5 contains 1.0 101 102 xor r13, r13 ; main loop: column iteration 103 inextc: 104 mov rax, r15 105 imul rax, r13 106 add rax, r13 107 imul rax, 16 108 mov r14, qword [m1] (gdb)
The xmm5
register was loaded with 1.0 from rax
(line 100.) It may
be shown this way:
(gdb) print $xmm5 $1 = {v4_float = {0, 1.875, 0, 0}, v2_double = {1, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, -16, 63, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 16368, 0, 0, 0, 0}, v4_int32 = {0, 1072693248, 0, 0}, v2_int64 = {4607182418800017408, 0}, uint128 = 4607182418800017408} (gdb)
Since the xmm*
registers have 128 bits, they are able
to store two double precision floating point values (8 bytes
each one.) This is shown as v2_double = {1, 0}
in the
previous output. The cvtsi2sd
just filled the "low"
64 bits of the register.
Now, we execute the following instructions with the
next
command. Note that the step
command here
is equivalent; it is useful when
we are interested in the execution of the called
functions, since next
just steps over them.
(gdb) next 105 imul rax, r13 (gdb) next 106 add rax, r13 (gdb) next 107 imul rax, 16 (gdb) next 108 mov r14, qword [m1] (gdb) next 109 add r14, rax (gdb) next 110 movapd xmm0, [r14] (gdb) next 111 movapd xmm3, xmm5 (gdb) print $xmm0 $3 = {v4_float = {8.64733415e+15, 1.83504689, 0, 0}, v2_double = {0.84018771715470952, 0}, v16_int8 = {-93, -59, -11, 89, -47, -30, -22, 63, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {-14941, 23029, -7471, 16362, 0, 0, 0, 0}, v4_int32 = {1509279139, 1072358097, 0, 0}, v2_int64 = { 4605742957725074851, 0}, uint128 = 4605742957725074851} (gdb)
After the
execution of line 110 (when the 111 is shown) then the
xmm0
register is printed, showing the double values
v2_double = {0.84018771715470952, 0}
. This is the result
from the movapd
instruction which did read a 16 bytes
block, which was previously filled with a random double
and (implicitly) with a zero.
We may avoid the memory waste by the use of the full
128 bits xmm*
registers, by working in "pairs" of
double precision values in the memory.
The best news of this scheme is that the arithmetical operations are automatically applied to both parts of the register, so we effectively duplicate the computation speed.
Loading information in packed form
The xmm*
and ymm*
registers may load floating
point data from memory aligned at 16 and 32 bytes
(resp.)
;
; packed.asm: loading packed doubles
;
default rel
%macro prelude 0
push rbp
mov rbp, rsp
%endmacro
section .text
global main
main:
prelude
xload:
movapd xmm0, [xdata1]
vmovapd ymm0, [ydata1]
movapd xmm0, [xdata2]
leave
ret
section .note.GNU-stack noalloc noexec nowrite progbits
; C 0-terminated strings
section .rodata
align 16
xdata1: dq 1.1
dq 2.2
dq 3.3
dq 4.4
align 32
ydata1: dq 5.5
dq 6.6
dq 7.7
dq 8.8
align 16
xdata2: dq 11.1
dq 22.2
We’ll run this with a debugger. In the following display we show only the "double precision" section:
diego@dataone:~/devel/ASM$ gdb ./packed ... (gdb) break xload Breakpoint 1 at 0x5e4: file packed.asm, line 16. (gdb) run Starting program: /home/diego/devel/ASM/packed Breakpoint 1, xload () at packed.asm:16 16 movapd xmm0, [xdata1] (gdb) print $xmm0 $1 = {v2_double = { 7.0632744564454665e-304, 0} ...} (gdb) print $ymm0 $2 = { v4_double = {7.0632744564454665e-304, 0, 0, 0} ...} (gdb) next 17 vmovapd ymm0, [ydata1] (gdb) print $xmm0 $3 = { v2_double = {1.1000000000000001, 2.2000000000000002} ...} (gdb) print $ymm0 $4 = { v4_double = {1.1000000000000001, 2.2000000000000002, 0, 0} ...} (gdb) next 18 movapd xmm0, [xdata2] (gdb) print $xmm0 $5 = { v2_double = {5.5, 6.5999999999999996} ...} (gdb) print $ymm0 $6 = { v4_double = {5.5, 6.5999999999999996, 7.7000000000000002, 8.8000000000000007} ...} (gdb) next 20 leave (gdb) print $xmm0 $7 = { v2_double = {11.1, 22.199999999999999} ...} (gdb) print $ymm0 $8 = { v4_double = {11.1, 22.199999999999999, 7.7000000000000002, 8.8000000000000007} ...} (gdb)
The last instruction shows how the movapd
fully loads the
xmm0
register and only touches the lower part of the ymm0
register (the higher part is not altered.)
Shuffling registers
Here we show the shufpd
instruction which "shuffles" the
parts of two registers into the first one, upon two controlling
bits passwd as third parameter:
;
; shuffle.asm: shuffling registers
;
default rel
%macro prelude 0
push rbp
mov rbp, rsp
%endmacro
section .text
global main
main:
prelude
xload:
movapd xmm0, [xdata3]
movapd xmm1, [xdata4]
shufpd xmm0, xmm1, 0x00
movapd xmm0, [xdata3]
movapd xmm1, [xdata4]
shufpd xmm0, xmm1, 0x01
movapd xmm0, [xdata3]
movapd xmm1, [xdata4]
shufpd xmm0, xmm1, 0x02
movapd xmm0, [xdata3]
movapd xmm1, [xdata4]
shufpd xmm0, xmm1, 0x03
movapd xmm0, [xdata3]
shufpd xmm0, xmm0, 0x00
leave
ret
section .note.GNU-stack noalloc noexec nowrite progbits
; C 0-terminated strings
section .rodata
align 16
xdata3: dq 37.0
dq 41.0
align 16
xdata4: dq 43.0
dq 53.0
Again, we use gdb
to inspect the behavior:
Breakpoint 1, xload () at shuffle.asm:16 16 movapd xmm0, [xdata3] (gdb) next 17 movapd xmm1, [xdata4] (gdb) next 18 shufpd xmm0, xmm1, 0x00 (gdb) print $xmm0.v2_double $7 = {37, 41} (gdb) print $xmm1.v2_double $8 = {43, 53} (gdb) next 19 movapd xmm0, [xdata3] (gdb) print $xmm0.v2_double $9 = {37, 43} (gdb) print $xmm1.v2_double $10 = {43, 53} (gdb) next 20 movapd xmm1, [xdata4] (gdb) 21 shufpd xmm0, xmm1, 0x01 (gdb) 22 movapd xmm0, [xdata3] (gdb) print $xmm0.v2_double $11 = {41, 43} (gdb) next 23 movapd xmm1, [xdata4] (gdb) 24 shufpd xmm0, xmm1, 0x02 (gdb) 25 movapd xmm0, [xdata3] (gdb) print $xmm0.v2_double $12 = {37, 53} (gdb) next 26 movapd xmm1, [xdata4] (gdb) 27 shufpd xmm0, xmm1, 0x03 (gdb) 29 leave (gdb) print $xmm0.v2_double $13 = {41, 53} (gdb)
So, we have the initial values
xmm0 = {37, 41}, xmm1 = {43, 53}
; after the
shuffles, we get the following results for
xmm0
(xmm1
does not change), dependent on
the control bits:
The bit-0 controls the lower lower part of the
destination. If zero, it gets the lower part of
the first argument; else, the higher part of the
first argument. Here we are in the former case,
effectively doing nothing, so the lower part
of xmm0
remains in 37
.
The bit-1 controls the higher part in a similar
way as before, but using the second argument
as source. So the higher part of xmm0
gets
the lower part of the second argument (43
.) So
we got as result: xmm0 = {37, 43}
.
Only the lower part gets the higher part of
the first argument, so we get xmm0 = {41, 43}
.
The lower part remains from the first argument (37
)
and the higher part gets the higher part of the second
argument, so we get xmm0 = {37, 53}
.
Combining the previous cases, we get xmm0 = {41, 53}
.
The last shuffling is the instruction shufpd xmm0, xmm0, 0x00
,
which from the previous explanations results in the
copy of the lower part of the register into the upper part.
Matrix Inversion revisited
To avoid unhelpful complications, we’ll require the matrix dimensions to be an even number, in order to always store the numbers in pairs; the following version improves the inner loop of the computation, which is where the performance benefits at most.
;
; inv2.asm: inverting a matrix
;
default rel
extern printf, clock_gettime, rand, srand, calloc, atoi, putchar
%macro prelude 0
push rbp
mov rbp, rsp
%endmacro
section .text
global main
main:
prelude
sub rsp, 48
mov rdi, [rsi+8] ; get matrix size from argv[1]
call atoi wrt ..plt
mov r15, rax
mov edi, eax
and edi, 1
jz is_even
lea rdi, [noteven]
mov eax, eax
call printf wrt ..plt
leave
ret
is_even:
mul eax ; n^2
mov edi, eax
mov rsi, 8 ; no wasted space
ccal:
call calloc wrt ..plt
mov qword [m1], rax
mov rax, r15
mul eax ; n^2
xor rdi, rdi
mov edi, eax
mov rsi, 8 ; no wasted space
call calloc wrt ..plt
mov qword [m2], rax
; srand(0)
xor rdi, rdi
call srand wrt ..plt
; build random on m1
mov rax, [randmax]
cvtsi2sd xmm4, rax
xor r12, r12
nextr:
xor r13, r13
nextc:
movdqu oword [rsp+32], xmm4
call rand wrt ..plt
movdqu xmm4, oword [rsp+32]
cvtsi2sd xmm0, rax
divpd xmm0, xmm4 ; range is 0-1
mov rax, r15
imul rax, r12
add rax, r13
imul rax, 8
; movapd [m1+rax], xmm0 -> relocation R_X86_64_32
mov r14, qword [m1]
add r14, rax
movlpd [r14], xmm0 ; low quadword
inc r13
cmp r13, r15
jl nextc
inc r12
cmp r12, r15
jl nextr
; build identity on m2
xor r12, r12
nextri:
mov rax, r15
imul rax, r12
add rax, r12
imul rax, 8
; movapd [m2+rax], xmm0 -> relocation R_X86_64_32
mov r14, qword [m2]
add r14, rax
mov rax, 1
cvtsi2sd xmm0, rax
movlpd [r14], xmm0 ; low quadword
inc r12
cmp r12, r15
jl nextri
; printing
mov r14, qword [m1]
call show
mov r14, qword [m2]
call show
; get start time
mov rdi, 0
mov rsi, rsp
call clock_gettime wrt ..plt
; begin invertion
mov rax, 1
cvtsi2sd xmm5, rax ; xmm5 contains 1.0
xor r13, r13 ; main loop: column iteration
inextc:
mov rax, r15
imul rax, r13
add rax, r13
imul rax, 8
mov r14, qword [m1]
add r14, rax
movlpd xmm0, [r14]
movapd xmm3, xmm5
divpd xmm3, xmm0 ; xmm3 has ipivot = 1.0/m1[c*n+c];
xor rbx, rbx ; normalize pivot cell: row factor loop
inextx:
mov rax, r15
imul rax, r13
add rax, rbx
imul rax, 8
mov r14, qword [m1]
add r14, rax
movlpd xmm0, [r14] ; get m1[c*n+x]
mulpd xmm0, xmm3
movlpd [r14], xmm0 ; m1[c*n+x] *= ipivot;
mov r14, qword [m2]
add r14, rax
movlpd xmm0, [r14] ; get m1[c*n+x]
mulpd xmm0, xmm3
movlpd [r14], xmm0 ; m2[c*n+x] *= ipivot;
inc rbx
cmp rbx, r15
jl inextx
xor r12, r12
inextr:
cmp r12, r13 ; if(r != c)
je inextre
mov rax, r15
imul rax, r12
add rax, r13
imul rax, 8
mov r14, qword [m1]
add r14, rax
movlpd xmm3, [r14] ; factor = m1[r*n+c];
shufpd xmm3, xmm3, 0x00
xor rbx, rbx
inextxx:
mov rax, r15
imul rax, r13
add rax, rbx
imul rax, 8
mov r14, qword [m1]
add r14, rax
movapd xmm2, [r14] ; m1[c*n+x] and next
mulpd xmm2, xmm3 ; factor * m1[c*n+x]
mov r14, qword [m2]
add r14, rax
movapd xmm1, [r14] ; m2[c*n+x] and next
mulpd xmm1, xmm3 ; factor * m2[c*n+x]
mov rax, r15
imul rax, r12
add rax, rbx
imul rax, 8
mov r14, qword [m1]
add r14, rax
movapd xmm0, [r14] ; m1[r*n+x] and next
subpd xmm0, xmm2
movapd [r14], xmm0 ; m1[r*n+x] -= factor * m1[c*n+x]
mov r14, qword [m2]
add r14, rax
movapd xmm0, [r14] ; m2[r*n+x]
subpd xmm0, xmm1
movapd [r14], xmm0 ; m2[r*n+x] -= factor * m2[c*n+x]
inc rbx
inc rbx ; rbx+=2
cmp rbx, r15
jl inextxx
inextre:
inc r12
cmp r12, r15
jl inextr
inc r13
cmp r13, r15
jl inextc
; get end time
mov rdi, 0
lea rsi, [rsp+16]
call clock_gettime wrt ..plt
; calculate time difference
mov rax, 1000000000
cvtsi2sd xmm4, rax
mov rax, [rsp]
cvtsi2sd xmm0, rax
mov rax, [rsp+8]
cvtsi2sd xmm1, rax
mov rax, [rsp+16]
cvtsi2sd xmm2, rax
mov rax, [rsp+24]
cvtsi2sd xmm3, rax
subpd xmm3, xmm1
subpd xmm2, xmm0
divpd xmm3, xmm4
movapd xmm0, xmm2 ;
addpd xmm0, xmm3
; show time difference
lea rdi, [format]
mov eax, 1
call printf wrt ..plt
; printing again
mov r14, qword [m1]
call show
mov r14, qword [m2]
call show
; get total
mov r14, qword [m2]
call matrixtotal
lea rdi, [ftotal]
mov eax, 1
call printf wrt ..plt
; get trace
mov r14, qword [m2]
call matrixtrace
lea rdi, [ftrace]
mov eax, 1
call printf wrt ..plt
leave
ret
; r15 has matrix size, r14 has pointer to the data, result in xmm0
matrixtotal:
prelude
sub rsp, 16
mov qword [rsp], r14
pxor xmm0, xmm0
xor r12, r12
tonextr:
xor r13, r13
tonextc:
mov rax, r15
imul rax, r12
add rax, r13
imul rax, 8
mov r14, qword [rsp]
add r14, rax
movlpd xmm1, [r14]
addpd xmm0, xmm1
inc r13
cmp r13, r15
jl tonextc
inc r12
cmp r12, r15
jl tonextr
leave
ret
; r15 has matrix size, r14 has pointer to the data, result in xmm0
matrixtrace:
prelude
sub rsp, 16
mov qword [rsp], r14
pxor xmm0, xmm0
xor r12, r12
trnextr:
mov rax, r15
imul rax, r12
add rax, r12
imul rax, 8
mov r14, qword [rsp]
add r14, rax
movlpd xmm1, [r14]
addpd xmm0, xmm1
inc r12
cmp r12, r15
jl trnextr
leave
ret
; r15 has matrix size, r14 has pointer to the data
show:
prelude
sub rsp, 16
cmp r15, 10
jg toobig
mov qword [rsp], r14
xor r12, r12
snextr:
xor r13, r13
snextc:
mov rax, r15
imul rax, r12
add rax, r13
imul rax, 8
mov r14, qword [rsp]
add r14, rax
movlpd xmm0, [r14]
lea rdi, [mformat]
mov eax, 1
call printf wrt ..plt
inc r13
cmp r13, r15
jl snextc
mov rdi, 10
call putchar wrt ..plt
inc r12
cmp r12, r15
jl snextr
mov rdi, 10
call putchar wrt ..plt
leave
ret
toobig: lea rdi, [ftoobig]
xor eax, eax
call printf wrt ..plt
leave
ret
section .note.GNU-stack noalloc noexec nowrite progbits
; C 0-terminated strings
section .rodata
noteven: db "Matrix must have even dimensions", 10, 0
format: db "time dif is %g sec", 10, 0
ftoobig: db "Too big to show", 10, 0
ftotal: db "total = %g", 10, 0
ftrace: db "trace = %g", 10, 0
mformat: db "%10g ", 0
randmax: dq 2147483647
section .bss
n: resb 8
m1: resb 8
m2: resb 8
-
When creating the random and identity matrices, we explicitly ask for the low part of the
xmm*
register; for example withmovlpd [r14], xmm0
-
Also, we calculate memory locations by a factor of 8 bytes (
imul rax, 8
) -
In the inner loop we use memory aligned access of 16 bytes (128 bits)
-
The addresses are obtained with the formula
(r15 * RC + rbx)*8
whereRC=r12
for rows andRC=r13
for columns. Sincer15
andrbx
are even, the result is 16-byte aligned -
We use the
shufpd
instruction to copy the lower part ofxmm3
into the higher one, since this same value is to be multiplied by other two-valued registers.
The time is improved:
diego@dataone:~/devel/ASM$ ./inv2 500 time dif is 0.61319 sec total = 1.94742 trace = 3.69845 diego@dataone:~/devel/ASM$ ./inv2 500 time dif is 0.587854 sec total = 1.94742 trace = 3.69845
Matrix Inversion with AVX
The following assembly implementation builds on the
previous one, making use of the
so-called AVX (advanced vector extensions) instructions, which
allow some "vectorized" operations to be carried on. This
is employed only in the most inner loop of the matrix inversion, where
the 256-bit registers ymm*
are used to store four double-precision
values, pointing to 32-bytes aligned memory regions.
Note that we reserve 32 additional bytes in the calloc
calls in order to
align the memory block for the matrices.
To avoid unhelpful complications, we’ll require the matrix dimensions to be a multiple of four.
;
; inv3.asm: inverting a matrix
;
default rel
extern printf, clock_gettime, rand, srand, calloc, atoi, putchar
%macro prelude 0
push rbp
mov rbp, rsp
%endmacro
section .text
global main
main:
prelude
sub rsp, 48
mov rdi, [rsi+8] ; get matrix size from argv[1]
call atoi wrt ..plt
mov r15, rax
mov edi, eax
and edi, 3
jz is_m4
lea rdi, [notm4]
mov eax, eax
call printf wrt ..plt
leave
ret
is_m4:
mul eax ; n^2
add eax, 4 ; 32 extra bytes
mov edi, eax
mov rsi, 8 ; no wasted space
ccal:
call calloc wrt ..plt
mov edi, eax
and edi, 0x1f
jz a32fixed1
mov esi, 32
sub esi, edi
add rax, rsi
a32fixed1:
mov qword [m1], rax
mov rax, r15
mul eax ; n^2
add eax, 4 ; 32 extra bytes
xor rdi, rdi
mov edi, eax
mov rsi, 8 ; no wasted space
call calloc wrt ..plt
mov edi, eax
and edi, 0x1f
jz a32fixed2
mov esi, 32
sub esi, edi
add rax, rsi
a32fixed2:
mov qword [m2], rax
; srand(0)
xor rdi, rdi
call srand wrt ..plt
; build random on m1
mov rax, [randmax]
cvtsi2sd xmm4, rax
xor r12, r12
nextr:
xor r13, r13
nextc:
movdqu oword [rsp+32], xmm4
call rand wrt ..plt
movdqu xmm4, oword [rsp+32]
cvtsi2sd xmm0, rax
divpd xmm0, xmm4 ; range is 0-1
mov rax, r15
imul rax, r12
add rax, r13
imul rax, 8
; movapd [m1+rax], xmm0 -> relocation R_X86_64_32
mov r14, qword [m1]
add r14, rax
movlpd [r14], xmm0 ; low quadword
inc r13
cmp r13, r15
jl nextc
inc r12
cmp r12, r15
jl nextr
; build identity on m2
xor r12, r12
nextri:
mov rax, r15
imul rax, r12
add rax, r12
imul rax, 8
; movapd [m2+rax], xmm0 -> relocation R_X86_64_32
mov r14, qword [m2]
add r14, rax
mov rax, 1
cvtsi2sd xmm0, rax
movlpd [r14], xmm0 ; low quadword
inc r12
cmp r12, r15
jl nextri
; printing
mov r14, qword [m1]
call show
mov r14, qword [m2]
call show
; get start time
mov rdi, 0
mov rsi, rsp
call clock_gettime wrt ..plt
; begin invertion
mov rax, 1
cvtsi2sd xmm5, rax ; xmm5 contains 1.0
xor r13, r13 ; main loop: column iteration
inextc:
mov rax, r15
imul rax, r13
add rax, r13
imul rax, 8
mov r14, qword [m1]
add r14, rax
movlpd xmm0, [r14]
movapd xmm3, xmm5
divpd xmm3, xmm0 ; xmm3 has ipivot = 1.0/m1[c*n+c];
xor rbx, rbx ; normalize pivot cell: row factor loop
inextx:
mov rax, r15
imul rax, r13
add rax, rbx
imul rax, 8
mov r14, qword [m1]
add r14, rax
movlpd xmm0, [r14] ; get m1[c*n+x]
mulpd xmm0, xmm3
movlpd [r14], xmm0 ; m1[c*n+x] *= ipivot;
mov r14, qword [m2]
add r14, rax
movlpd xmm0, [r14] ; get m1[c*n+x]
mulpd xmm0, xmm3
movlpd [r14], xmm0 ; m2[c*n+x] *= ipivot;
inc rbx
cmp rbx, r15
jl inextx
xor r12, r12
inextr:
cmp r12, r13 ; if(r != c)
je inextre
mov rax, r15
imul rax, r12
add rax, r13
imul rax, 8
mov r14, qword [m1]
add r14, rax
movlpd xmm3, [r14] ; factor = m1[r*n+c];
shufpd xmm3, xmm3, 0x00
vinserti128 ymm3, ymm3, xmm3, 1
xor rbx, rbx
inextxx:
mov rax, r15
imul rax, r13
add rax, rbx
; imul rax, 8
shl rax, 3
mov r14, qword [m1]
add r14, rax
vmovapd ymm2, [r14] ; m1[c*n+x] and next
vmulpd ymm2, ymm3 ; factor * m1[c*n+x]
mov r14, qword [m2]
add r14, rax
vmovapd ymm1, [r14] ; m2[c*n+x] and next
vmulpd ymm1, ymm3 ; factor * m2[c*n+x]
mov rax, r15
imul rax, r12
add rax, rbx
; imul rax, 8
shl rax, 3
mov r14, qword [m1]
add r14, rax
vmovapd ymm0, [r14] ; m1[r*n+x] and next
vsubpd ymm0, ymm2
vmovapd [r14], ymm0 ; m1[r*n+x] -= factor * m1[c*n+x]
mov r14, qword [m2]
add r14, rax
vmovapd ymm0, [r14] ; m2[r*n+x]
vsubpd ymm0, ymm1
vmovapd [r14], ymm0 ; m2[r*n+x] -= factor * m2[c*n+x]
add rbx, 4
cmp rbx, r15
jl inextxx
inextre:
inc r12
cmp r12, r15
jl inextr
inc r13
cmp r13, r15
jl inextc
; get end time
mov rdi, 0
lea rsi, [rsp+16]
call clock_gettime wrt ..plt
; calculate time difference
mov rax, 1000000000
cvtsi2sd xmm4, rax
mov rax, [rsp]
cvtsi2sd xmm0, rax
mov rax, [rsp+8]
cvtsi2sd xmm1, rax
mov rax, [rsp+16]
cvtsi2sd xmm2, rax
mov rax, [rsp+24]
cvtsi2sd xmm3, rax
subpd xmm3, xmm1
subpd xmm2, xmm0
divpd xmm3, xmm4
movapd xmm0, xmm2 ;
addpd xmm0, xmm3
; show time difference
lea rdi, [format]
mov eax, 1
call printf wrt ..plt
; printing again
mov r14, qword [m1]
call show
mov r14, qword [m2]
call show
; get total
mov r14, qword [m2]
call matrixtotal
lea rdi, [ftotal]
mov eax, 1
call printf wrt ..plt
; get trace
mov r14, qword [m2]
call matrixtrace
lea rdi, [ftrace]
mov eax, 1
call printf wrt ..plt
leave
ret
; r15 has matrix size, r14 has pointer to the data, result in xmm0
matrixtotal:
prelude
sub rsp, 16
mov qword [rsp], r14
pxor xmm0, xmm0
xor r12, r12
tonextr:
xor r13, r13
tonextc:
mov rax, r15
imul rax, r12
add rax, r13
imul rax, 8
mov r14, qword [rsp]
add r14, rax
movlpd xmm1, [r14]
addpd xmm0, xmm1
inc r13
cmp r13, r15
jl tonextc
inc r12
cmp r12, r15
jl tonextr
leave
ret
; r15 has matrix size, r14 has pointer to the data, result in xmm0
matrixtrace:
prelude
sub rsp, 16
mov qword [rsp], r14
pxor xmm0, xmm0
xor r12, r12
trnextr:
mov rax, r15
imul rax, r12
add rax, r12
imul rax, 8
mov r14, qword [rsp]
add r14, rax
movlpd xmm1, [r14]
addpd xmm0, xmm1
inc r12
cmp r12, r15
jl trnextr
leave
ret
; r15 has matrix size, r14 has pointer to the data
show:
prelude
sub rsp, 16
cmp r15, 10
jg toobig
mov qword [rsp], r14
xor r12, r12
snextr:
xor r13, r13
snextc:
mov rax, r15
imul rax, r12
add rax, r13
imul rax, 8
mov r14, qword [rsp]
add r14, rax
movlpd xmm0, [r14]
lea rdi, [mformat]
mov eax, 1
call printf wrt ..plt
inc r13
cmp r13, r15
jl snextc
mov rdi, 10
call putchar wrt ..plt
inc r12
cmp r12, r15
jl snextr
mov rdi, 10
call putchar wrt ..plt
leave
ret
toobig: lea rdi, [ftoobig]
xor eax, eax
call printf wrt ..plt
leave
ret
section .note.GNU-stack noalloc noexec nowrite progbits
; C 0-terminated strings
section .rodata
notm4: db "Matrix must have dimensions multiple of four", 10, 0
format: db "time dif is %g sec", 10, 0
ftoobig: db "Too big to show", 10, 0
ftotal: db "total = %g", 10, 0
ftrace: db "trace = %g", 10, 0
mformat: db "%10g ", 0
randmax: dq 2147483647
section .bss
n: resb 8
m1: resb 8
m2: resb 8
The times were barely improved from the previous version, maybe because of memory bandwith limitations.
diego@dataone:~/devel/ASM$ ./inv3 500 time dif is 0.523782 sec total = 1.94742 trace = 3.69845 diego@dataone:~/devel/ASM$ ./inv3 500 time dif is 0.534893 sec total = 1.94742 trace = 3.69845
To copy the lower part of ymm3 into its high part, we used
the VINSERTI128
instruction, effectively quadrupling the
factor
value. From
https://www.felixcloutier.com/x86/vinserti128:vinserti32x4:vinserti64x2:vinserti32x8:vinserti64x4
its behavior is:
TEMP[255:0] <-SRC1[255:0] CASE (imm8[0]) OF 0: TEMP[127:0]<-SRC2[127:0] 1: TEMP[255:128]<-SRC2[127:0] ESAC DEST <-TEMP
In this reference this instruction is listed as AVX2 in the col
"CPUID Feature Flag". That the "AVX2" extension is
present in our test cpu can be seen with the /proc/cpuinfo
file:
processor : 0
vendor_id : AuthenticAMD
cpu family : 21
model : 101
model name : AMD A8-9600 RADEON R7, 10 COMPUTE CORES 4C+6G
...
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp
lm constant_tsc rep_good acc_power nopl nonstop_tsc cpuid extd_apicid aperfmper
f pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx
f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefet
ch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perf
ctr_nb bpext ptsc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep
bmi2 xsaveopt arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid
decodeassists pausefilter pfthreshold avic vgif overflow_recov
It is interesting to note that this program crashed in another equipment which doesn’t have such extension:
$ ./inv3 500 Too big to show Too big to show Illegal instruction (core dumped) $ more /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 42 model name : Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz ... flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm cons tant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmp erf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid ss e4_1 sse4_2 popcnt tsc_deadline_timer xsave avx lahf_lm epb pti ssbd ibrs ibpb s tibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts md_clear
Note
|
The core dumped message is misleading. See for example
https://stackoverflow.com/questions/2065912
|
Running a session of gdb
:
$ gdb ./inv3 GNU gdb (Debian 13.1-3) 13.1 ... (gdb) run 500 Starting program: /home/diego/git/asm-repo/inv3 500 ... Program received signal SIGILL, Illegal instruction. inextr () at inv3.asm:176 176 vinserti128 ymm3, ymm3, xmm3, 1
The culprit seems to be the vinserti128
instruction, which
is part of the AVX2
extension.
Now how to figure out the supported cpu features (like AVX2
,
SSE
, etc.) in runtime? in
brief, the cpuid
instruction is in charge (see
for example https://stackoverflow.com/a/17754393/4876728); for the
AVX2
case, the Intel Software Developer’s Manual (section
14.7.1: Detection of Intel® AVX2) provides sample code which
we adapted for an inv4.asm
version that improves
over the previous inv3.asm
(we only show the differences
between those source files):
$ diff -c inv3.asm inv4.asm
*** inv3.asm
--- inv4.asm
***************
*** 15,20 ****
--- 15,38 ----
prelude
sub rsp, 48
+ mov eax, 1
+ cpuid
+ and ecx, 0x018000000
+ cmp ecx, 0x018000000
+ jne n_avx2
+ mov eax, 7
+ mov ecx, 0
+ cpuid
+ and ebx, 0x20
+ cmp ebx, 0x20
+ je w_avx2
+ n_avx2:
+ lea rdi, [notavx2]
+ mov eax, eax
+ call printf wrt ..plt
+ leave
+ ret
+ w_avx2:
mov rdi, [rsi+8] ; get matrix size from argv[1]
call atoi wrt ..plt
mov r15, rax
***************
*** 365,370 ****
--- 383,389 ----
; C 0-terminated strings
section .rodata
+ notavx2: db "This cpu does not support the AVX2 extension", 10, 0
notm4: db "Matrix must have dimensions multiple of four", 10, 0
format: db "time dif is %g sec", 10, 0
ftoobig: db "Too big to show", 10, 0
This way, in the cpu missing AVX2
we get:
$ ./inv4 500 This cpu does not support the AVX2 extension
Appendix: Makefile
PROGS = test1 test2 test3 test4 exit32 model write \
triangle test5 test6 test7 test8 test9 timing \
sizes inv model-inv inv2 inv3 inv4 packed shuffle \
xdump1
AS = nasm
ifeq ($(AS),nasm)
ASOPTS = -f elf64 -g -l
else
ASOPTS = -f elf64 -g dwarf2 -l
endif
# for yasm, run with:
# make "AS=yasm"
all: $(PROGS) assembler.html
.SUFFIXES: .o .asm
assembler.html: assembler.adoc
asciidoctor assembler.adoc
sizes: sizes.asm xdump.o
$(AS) $(ASOPTS) sizes.lst sizes.asm
gcc -O0 -o sizes sizes.o xdump.o
xdump1: xdump1.c xdump.o
gcc -g -c xdump1.c
gcc -o xdump1 xdump1.o xdump.o
model-inv: model-inv.c
gcc -o model-inv model-inv.c
model: model.c
gcc -S model.c
gcc -o model model.c
test1: test1.asm
$(AS) $(ASOPTS) test1.lst test1.asm
ld --dynamic-linker=/lib64/ld-linux-x86-64.so.2 -o test1 test1.o -lc
test4: test4.asm
$(AS) $(ASOPTS) test4.lst test4.asm
gcc -no-pie -o test4 test4.o
test3: test3.asm
$(AS) $(ASOPTS) test3.lst test3.asm
gcc -o $@ test3.o
exit32: exit32.asm
$(AS) $(ASOPTS) exit32.lst exit32.asm
ld -o exit32 exit32.o
.o:
gcc -o $@ $<
.asm.o:
$(AS) $(ASOPTS) $*.lst $<
clean:
rm -f $(PROGS) *.o *.lst *.s assembler.html