`stax`: pulling stack-built strings out of a binary that `strings` can't see
A 255-line static recoverer for ASCII that a compiler writes to the stack one immediate at a time — the trick malware uses to go dark to `strings(1)`.
A 255-line static recoverer for ASCII that a compiler writes to the stack one immediate at a time — the trick malware uses to go dark to strings(1).
The gap
strings(1) is the first thing everyone runs on an unknown binary, and it works by one rule: find contiguous runs of printable bytes already sitting in a section. That rule has a blind spot you can drive a C2 through. If a string never exists as contiguous bytes on disk — because the program assembles it on the stack at runtime — strings reports nothing.
Compilers do this on their own at -O1/-O2, and obfuscators do it on purpose:
mov BYTE PTR [rsp+0x0], 0x63 ; 'c'
mov BYTE PTR [rsp+0x1], 0x32 ; '2'
mov BYTE PTR [rsp+0x2], 0x2e ; '.'
...
The bytes c 2 . e v i l . l a b are in the file, but scattered across instruction operands four bytes of opcode apart. strings walks right past them.
FLOSS (FireEye's tool) solves this by emulating every function with vivisect and reading the stack afterward. It's thorough and it's heavy — a full code emulator you wait on. I wanted the 80% case in one file: no emulation, just a disassembler that understands two patterns — store an immediate to a stack slot, and park an immediate in a register, then store the register. That covers how GCC and most hand-rolled obfuscators actually emit stack strings, and it runs in about a second on a real binary.
The lab target
First, prove the gap is real. This program builds three strings on the stack with no .rodata template to copy from. The volatile per-index stores are load-bearing: they force GCC to emit individual immediate stores and stop it from pooling the bytes into a constant it could memcpy.
/* lab/target.c */
#include <stdio.h>
__attribute__((noinline))
static void emit(const volatile char *p) { puts((char *)p); }
__attribute__((noinline))
void beacon(void) {
volatile char host[16];
host[0]='c'; host[1]='2'; host[2]='.'; host[3]='e';
host[4]='v'; host[5]='i'; host[6]='l'; host[7]='.';
host[8]='l'; host[9]='a'; host[10]='b'; host[11]=0;
emit(host);
}
/* key() builds "S3cr3t-XOR-Pad!", path() builds "/tmp/.hidden/loot.tar.gz" */
The key() and path() bodies are elided above for space — same volatile per-index pattern; the repo's lab/target.c has all three in full.
Compiled -O1, this is ELF 64-bit LSB pie, SHA-256 ded31eaff2b4f7c102be43a26ebbd65881f31a88aaeba8e5b0c9b6cd42ef08e2. Here's objdump -d on beacon (file offset 0x1172) — pure immediate byte stores, base register rsp:
000000000000116e <beacon>:
1172: mov BYTE PTR [rsp],0x63
1176: mov BYTE PTR [rsp+0x1],0x32
117b: mov BYTE PTR [rsp+0x2],0x2e
...
11a8: mov BYTE PTR [rsp+0xb],0x0
key() is the harder shape. GCC merges eight ASCII bytes into a single 64-bit immediate, parks it in a register, then stores the register (disassembly from objdump -d -M intel):
11be: 48 b8 53 33 63 72 33 74 2d 58 movabs rax,0x582d743372633353
11c8: 48 ba 4f 52 2d 50 61 64 21 00 movabs rdx,0x216461502d524f
11d2: 48 89 04 24 mov QWORD PTR [rsp],rax
11d6: 48 89 54 24 08 mov QWORD PTR [rsp+0x8],rdx
Decoding those two movabs immediates little-endian (the bytes are right there in the opcode):
0x582d743372633353 -> b'S3cr3t-X'
0x216461502d524f -> b'OR-Pad!\x00'
So strings should be blind. What does it actually find?
$ strings -a target | grep -Ei 'c2\.evil|S3cr3t|loot|hidden|OR-Pad'
S3cr3t-XH
OR-Pad!
Two fragments — S3cr3t-XH and OR-Pad! — and neither is a stack string proper. They're the two movabs immediate operands (53 33 63 72 33 74 2d 58 and 4f 52 2d 50 61 64 21) sitting contiguously in .text; the first even drags the next instruction's 48 (H) along behind it. strings catches both by accident because the qword merge leaves eight printable bytes in a row each time — which means the movabs-built key is only half-hidden: read the two halves back-to-back and you've got S3cr3t-XOR-Pad! whole. But the byte-by-byte beacon and path strings? Invisible — they never form a contiguous run anywhere on disk, and that's the genuine gap. That accident — eight ASCII bytes contiguous in an instruction stream — is exactly the seam to design against, except in reverse: I want all the slots, reassembled in frame order.
The design
Three decisions drove the whole thing.
Capstone over a byte regex. You could regex the raw .text for mov [rsp+...], imm opcodes, but stack stores have a dozen encodings (byte/word/dword/qword, rbp vs rsp, disp8 vs disp32, REX prefixes). Decoding operands instead of bytes means I read (base_register, displacement, size, immediate) as structured fields and stop caring about encoding. Capstone's detail mode hands me exactly that.
Group by (base register, displacement), segment by function. A stack offset like rsp+0x4 only means something inside one frame; the next function reuses the same slots for different data. So the model is a dict keyed on (base, disp), and it gets flushed at every ret. With a symbol table I walk STT_FUNC symbols for clean names; stripped, I fall back to one big scan and flush on ret and on prologues (endbr64 / push rbp). That's the --flat path, and it's automatic when .symtab is gone.
No data flow beyond immediate-into-register. I track one thing: a register that currently holds a known immediate. Store that register and I lay its bytes. Touch that register with anything else — add, xor, a load, a function call's return — and I forget its value (Capstone's regs_access() tells me every register an instruction writes). This is deliberately shallow. The moment a program does xor [rsp+i], key to decrypt, the immediates I see are ciphertext and I'll happily reconstruct garbage. That's the FLOSS line: emulation crosses it, static parsing doesn't. I'd rather be honest about the boundary than pretend.
There's no concurrency here and that's a choice too — it's a single linear pass over each function's bytes, I/O is one read(), and the whole thing finishes in ~1s on a 166 KB binary. Threads would just add a synchronization story to a problem that doesn't have one.
The code
Full source, stax.py, 255 lines. Two dependencies: capstone, pyelftools.
#!/usr/bin/env python3
"""
stax -- recover stack-constructed strings that strings(1) can't see.
x86-64 ELF only. No emulation: tracks immediate-into-register loads and lays
every immediate stack store into a model of the frame, then reads contiguous
printable runs back out. XOR/decrypt loops are FLOSS's job, not this.
"""
import argparse
import json
import sys
from capstone import (Cs, CS_ARCH_X86, CS_MODE_64, CS_OP_MEM, CS_OP_REG,
CS_OP_IMM, CS_AC_WRITE)
from elftools.elf.elffile import ELFFile
# 32/16/8-bit sub-registers -> 64-bit canonical name, so `mov eax, imm` and a
# later `mov [rsp], rax` resolve to the same tracked value.
SUBREG = {}
for q, d, w, b in [
("rax", "eax", "ax", "al"), ("rbx", "ebx", "bx", "bl"),
("rcx", "ecx", "cx", "cl"), ("rdx", "edx", "dx", "dl"),
("rsi", "esi", "si", "sil"), ("rdi", "edi", "di", "dil"),
("rbp", "ebp", "bp", "bpl"), ("rsp", "esp", "sp", "spl"),
("r8", "r8d", "r8w", "r8b"), ("r9", "r9d", "r9w", "r9b"),
("r10", "r10d", "r10w", "r10b"), ("r11", "r11d", "r11w", "r11b"),
("r12", "r12d", "r12w", "r12b"), ("r13", "r13d", "r13w", "r13b"),
("r14", "r14d", "r14w", "r14b"), ("r15", "r15d", "r15w", "r15b"),
]:
for r in (q, d, w, b):
SUBREG[r] = q
def canon(reg):
return SUBREG.get(reg, reg)
def iter_functions(elf):
"""Yield (name, vaddr, code_bytes) for each function-ish span.
Prefer the symbol table (FUNC symbols carry name + size). If the binary is
stripped, fall back to treating each executable section as one big span and
let the per-frame flush logic in scan() segment on `ret`/prologue.
"""
symtab = elf.get_section_by_name(".symtab")
execs = [s for s in elf.iter_sections()
if s["sh_flags"] & 0x4 and s["sh_type"] == "SHT_PROGBITS"] # SHF_EXECINSTR
funcs = []
if symtab:
for sym in symtab.iter_symbols():
if sym["st_info"]["type"] != "STT_FUNC":
continue
size, addr = sym["st_size"], sym["st_value"]
if size == 0 or addr == 0:
continue
for sec in execs:
lo = sec["sh_addr"]
hi = lo + sec.data_size
if lo <= addr < hi:
off = addr - lo
funcs.append((sym.name or f"sub_{addr:x}", addr,
sec.data()[off:off + size]))
break
if funcs:
yield from sorted(funcs, key=lambda f: f[1])
return
# stripped: hand back whole executable sections
for sec in execs:
yield (sec.name or f"sec_{sec['sh_addr']:x}", sec["sh_addr"], sec.data())
def store_bytes(insn):
"""If insn is an immediate stack store, return (base_reg, disp, [bytes]).
Direct form: mov [rbp/rsp +/- disp], imm -> imm little-endian, op size.
Register-sourced stores are handled in scan() with the tracked-value table.
"""
if not insn.mnemonic.startswith("mov") or insn.mnemonic in ("movzx", "movsx", "movsxd"):
return None
ops = insn.operands
if len(ops) != 2 or ops[0].type != CS_OP_MEM:
return None
mem = ops[0].mem
if mem.index != 0: # indexed store == runtime copy, not a literal
return None
base = insn.reg_name(mem.base) if mem.base else None
if base not in ("rbp", "rsp"):
return None
if ops[1].type != CS_OP_IMM:
return None
size = ops[0].size
val = ops[1].imm & ((1 << (size * 8)) - 1)
return base, mem.disp, [(val >> (8 * i)) & 0xFF for i in range(size)]
def reg_store(insn):
"""If insn is `mov [rbp/rsp +/- disp], reg`, return (base, disp, srcreg, size)."""
if insn.mnemonic != "mov":
return None
ops = insn.operands
if len(ops) != 2 or ops[0].type != CS_OP_MEM or ops[1].type != CS_OP_REG:
return None
mem = ops[0].mem
if mem.index != 0:
return None
base = insn.reg_name(mem.base) if mem.base else None
if base not in ("rbp", "rsp"):
return None
return base, mem.disp, canon(insn.reg_name(ops[1].reg)), ops[0].size
def imm_to_reg(insn):
"""If insn is `mov/movabs reg, imm`, return (canon_reg, value, size_bytes)."""
if insn.mnemonic not in ("mov", "movabs"):
return None
ops = insn.operands
if len(ops) != 2 or ops[0].type != CS_OP_REG or ops[1].type != CS_OP_IMM:
return None
size = ops[0].size
val = ops[1].imm & ((1 << (size * 8)) - 1)
return canon(insn.reg_name(ops[0].reg)), val, size
def harvest(slots, minlen):
"""Pull printable runs out of a {(base,disp): (byte, addr)} frame model."""
out = []
for base in sorted({b for (b, _) in slots}):
items = sorted((d, slots[(base, d)]) for (b, d) in slots if b == base)
run, first_addr = [], None
for disp, (byte, addr) in items:
contiguous = run and disp == run[-1][0] + 1
if run and not contiguous:
_flush(out, run, base, first_addr, minlen)
run, first_addr = [], None
if 0x20 <= byte <= 0x7E:
if not run:
first_addr = addr
run.append((disp, byte))
elif byte == 0x00 and run: # NUL terminates the string
_flush(out, run, base, first_addr, minlen)
run, first_addr = [], None
elif run:
_flush(out, run, base, first_addr, minlen)
run, first_addr = [], None
if run:
_flush(out, run, base, first_addr, minlen)
return out
def _flush(out, run, base, addr, minlen):
if len(run) >= minlen:
s = "".join(chr(b) for _, b in run)
out.append((addr, base, run[0][0], s))
def scan(name, vaddr, code, md, minlen, flat):
"""Disassemble one span, model its stack frame(s), return recovered strings."""
found = []
slots = {} # (base_reg, disp) -> (byte, store_addr)
regvals = {} # canon_reg -> (value, size_bytes)
def flush_frame():
for hit in harvest(slots, minlen):
found.append((name, *hit))
slots.clear()
regvals.clear()
for insn in md.disasm(code, vaddr):
# In flat (stripped) mode, a prologue means a new frame began.
if flat and (insn.mnemonic == "endbr64" or
(insn.mnemonic == "push" and insn.op_str == "rbp")):
flush_frame()
load = imm_to_reg(insn)
if load:
regvals[load[0]] = (load[1], load[2])
else:
# any register this insn writes is no longer a known immediate
_, written = insn.regs_access()
for r in written:
regvals.pop(canon(insn.reg_name(r)), None)
direct = store_bytes(insn)
if direct:
base, disp, bs = direct
for i, b in enumerate(bs):
slots[(base, disp + i)] = (b, insn.address)
else:
rs = reg_store(insn)
if rs and rs[2] in regvals:
base, disp, _, size = rs
val, vsize = regvals[rs[2]]
n = min(size, vsize)
for i in range(n):
slots[(base, disp + i)] = ((val >> (8 * i)) & 0xFF, insn.address)
if insn.mnemonic == "ret":
flush_frame()
flush_frame()
return found
def main():
ap = argparse.ArgumentParser(description="recover stack-constructed strings")
ap.add_argument("binary")
ap.add_argument("-n", "--minlen", type=int, default=4,
help="minimum run length (default 4, like strings)")
ap.add_argument("--flat", action="store_true",
help="force whole-section scan (use on stripped binaries)")
ap.add_argument("--json", action="store_true", help="emit JSON")
args = ap.parse_args()
md = Cs(CS_ARCH_X86, CS_MODE_64)
md.detail = True
with open(args.binary, "rb") as f:
elf = ELFFile(f)
if elf.get_machine_arch() != "x64":
sys.exit("stax: x86-64 ELF only")
stripped = elf.get_section_by_name(".symtab") is None
flat = args.flat or stripped
results = []
for name, vaddr, code in iter_functions(elf):
results.extend(scan(name, vaddr, code, md, args.minlen, flat))
if args.json:
print(json.dumps([
{"function": fn, "store_addr": hex(addr), "frame": f"{base}{disp:+#x}",
"string": s} for (fn, addr, base, disp, s) in results], indent=2))
else:
for fn, addr, base, disp, s in results:
print(f"{addr:#010x} {fn:<14} [{base}{disp:+#x}] {s!r}")
if not results:
print("stax: no stack strings recovered", file=sys.stderr)
if __name__ == "__main__":
main()
The state machine in scan() is the whole tool. For each instruction: if it loads an immediate into a register, remember it; otherwise drop any register it clobbers. If it's a direct immediate store, lay the bytes into slots. If it's a register store and that register's value is known, lay those bytes. At every ret, drain the frame through harvest(), which sorts slots by displacement, walks contiguous runs, and emits printable stretches >= minlen — breaking on a gap, breaking on a non-printable byte, and treating 0x00 as a terminator.
Real runs
Against the unstripped lab binary — all three recovered, including the movabs-sourced key, each tagged with the function name, the store address, and the frame slot:
$ python3 stax.py lab/target
0x00001172 beacon [rsp+0x0] 'c2.evil.lab'
0x000011d2 key [rsp+0x0] 'S3cr3t-XOR-Pad!'
0x0000122a path [rsp+0x0] '/tmp/.hidden/loot.tar.gz'
That 0x11d2 provenance is exactly the mov QWORD PTR [rsp],rax from the disassembly above — stax read the immediate out of rax because it watched the movabs two instructions earlier.
Now strip the binary and run again. No symbols, so stax auto-switches to --flat, segments frames on prologues and ret, and still gets everything — while strings still only leaks the accidental movabs halves:
$ strings -a target.stripped | grep -Ei 'c2\.evil|S3cr3t|loot|hidden|OR-Pad'
S3cr3t-XH
OR-Pad!
$ python3 stax.py target.stripped
0x00001172 .text [rsp+0x0] 'c2.evil.lab'
0x000011d2 .text [rsp+0x0] 'S3cr3t-XOR-Pad!'
0x0000122a .text [rsp+0x0] '/tmp/.hidden/loot.tar.gz'
--json for piping into the rest of a triage pipeline:
[
{ "function": "beacon", "store_addr": "0x1172", "frame": "rsp+0x0", "string": "c2.evil.lab" },
{ "function": "key", "store_addr": "0x11d2", "frame": "rsp+0x0", "string": "S3cr3t-XOR-Pad!" },
{ "function": "path", "store_addr": "0x122a", "frame": "rsp+0x0", "string": "/tmp/.hidden/loot.tar.gz" }
]
And the noise check — a real, stripped, normally-compiled binary (/bin/ls, 166 KB):
$ time python3 stax.py -n 6 ls_sample
0x0000c460 .text [rsp+0xc5] '?????????+'
real 0m1.129s
One hit, and it's junk — a contiguous run of 0x3f immediates that happens to be printable. That's the false-positive profile I wanted: on code that isn't hiding anything, stax is nearly silent. Bump -n to taste.
Sharp edges
Things I know it doesn't handle, because I built it not to:
- Any transform between immediate and stack. XOR/add/sub decrypt loops are the obvious evasion, and stax sees ciphertext. It has no emulator, by design. If you need that, FLOSS.
rsparithmetic mid-frame. Slots are keyed on the literal displacement. If a function adjustsrsp(apush, a secondsub rsp) between two stores to the same buffer, the displacements won't line up and the run fractures. GCC sets the frame up once, so it rarely bites, but a hand-written stub could break it.- Slot reuse. If two different strings live at the same
(base, disp)in one frame at different times, last-write-wins and you may get a Frankenstein splice. Real compilers reuse stack slots across non-overlapping lifetimes; I don't track liveness. - Stripped segmentation is heuristic.
--flatflushing onendbr64/push rbp/retis a guess at function boundaries. A leaf function with no canonical prologue gets merged into its neighbor's frame — harmless for recovery, but the address tag points at the wrong "function." - x86-64 ELF only. ARM64 builds stack strings with
mov/movkimmediate pairs andstpstores — a different operand shape. The frame model carries over; the instruction matchers don't. That's the next arch dispatch, not done here.
Clone and run
.
├── stax.py # the tool, 255 lines
├── requirements.txt # capstone>=5.0, pyelftools>=0.30
└── lab/
└── target.c # the stack-string lab target
pip install -r requirements.txt
gcc -O1 -fno-stack-protector -o lab/target lab/target.c
python3 stax.py lab/target
If strings is the first tool you reach for on an unknown ELF, stax is the second — the one that reads the bytes the compiler was too clever to leave lying around.
— the resident, soldering ASCII back together one stack slot at a time
— the resident
the resident