strxref: ask a binary which function prints "access denied"
A 240-line Capstone tool that inverts `strings` — from "what literals are in here" to "who uses this one."
When you drop a binary into Ghidra, one of the first moves is string-to-code pivoting: you find an interesting literal — access denied, an API key format, a debug log — and you ask who references it. Ghidra does this beautifully. It's also a multi-hundred-MB JVM app with a project import step. radare2 and rizin do exactly this on the command line too — izz to harvest strings, axt to cross-reference an address — but only behind a full aaa analysis pass. For a quick, zero-analysis triage, the classic Unix tools each give you half the answer and refuse the other half:
- `strings` lists the literals but knows nothing about code.
- `objdump -d` resolves a `lea`/`mov` to an address, and then labels that address something actively unhelpful.
Here's the gap, live. The lab target is a tiny PIE auth stub, vault (SHA-256 7fe5383d517d49026569fbb9de1f751b41d5a102e7bcb2c0d74e5aae9a5ea851). objdump of its check_pin function:
0000000000001159 <check_pin>:
1165: lea 0xebc(%rip),%rdx # 2028 <_IO_stdin_used+0x28>
1176: call 1040 <strcmp@plt>
117f: lea 0xeaa(%rip),%rax # 2030 <_IO_stdin_used+0x30>
1189: call 1030 <puts@plt>
1195: lea 0xeb5(%rip),%rax # 2051 <_IO_stdin_used+0x51>
119f: call 1030 <puts@plt>
objdump did the RIP-relative math for me — 0x2028, 0x2030, 0x2051 — and then labelled all three <_IO_stdin_used+0x...>. That label is a lie of convenience: it just names the nearest preceding symbol. What actually lives at 0x2028? A readelf -x .rodata says:
0x00002020 74726962 75746500 38363735 33303900 tribute.8675309.
0x00002030 61636365 73732067 72616e74 65643a20 access granted:
0x00002050 00616363 65737320 64656e69 65643a20 .access denied:
0x2028 is 8675309 — the hardcoded PIN. check_pin is doing a strcmp against a literal password and I had to hand-walk a hex dump to learn that. strxref does the walk for you, and inverts it: it indexes by string and tells you the function.
The design
Four moving parts, each small:
- Harvest strings from allocated, non-executable `PROGBITS` sections (`.rodata`, `.data.rel.ro`, …): NUL-terminated printable runs, with their virtual addresses.
- Build a function table from `.symtab`/`.dynsym` (`STT_FUNC` symbols with address+size), sorted for binary search.
- Disassemble the executable sections with Capstone in detail mode and, per instruction, resolve every data reference.
- Map each resolved address back to its enclosing string and the function that issued the reference.
The only genuinely interesting design choice is step 3, the reference resolver. On x86-64 PIE, string addresses are materialized RIP-relatively: lea rdx, [rip + 0xebc]. Capstone hands me that as a memory operand whose base register is RIP; the target is insn.address + insn.size + disp. On non-PIE / 32-bit code the same load is an absolute immediate (mov esi, 0x402028), so I also check immediate operands that fall inside the string address range. That's the whole resolver:
import capstone  # provides capstone.x86_const (operand and register constants)

def resolve_refs(insn, mach, str_lo, str_hi):
    """Yield data addresses this instruction points at (RIP-rel + absolute imm)."""
    X86 = capstone
    for op in insn.operands:
        if op.type == X86.x86_const.X86_OP_MEM:
            m = op.mem
            # RIP-relative with no index register: target = next-instruction address + disp
            if mach == "x64" and m.base == X86.x86_const.X86_REG_RIP and m.index == 0:
                yield insn.address + insn.size + m.disp
        elif op.type == X86.x86_const.X86_OP_IMM:
            # absolute address baked into an immediate (non-PIE / 32-bit code)
            if str_lo <= op.imm < str_hi:
                yield op.imm
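As a sanity check on the RIP-relative formula, the loads from the objdump listings can be recomputed by hand. This is a minimal sketch: `rip_target` is a hypothetical helper, not part of strxref, and the 7-byte instruction length is an assumption based on the usual encoding of `lea r64, [rip+disp32]` (REX.W prefix, opcode, ModRM, 4-byte displacement). It agrees with every target objdump printed.

```python
def rip_target(insn_addr, insn_size, disp):
    """RIP points at the *next* instruction, so the target is addr + size + disp."""
    return insn_addr + insn_size + disp

LEA_RIP_SIZE = 7  # assumed length of `lea r64, [rip+disp32]`

# (instruction address, displacement, target objdump printed)
CASES = [
    (0x1165, 0xEBC, 0x2028),      # check_pin: "8675309"
    (0x117F, 0xEAA, 0x2030),      # check_pin: "access granted: ..."
    (0x1195, 0xEB5, 0x2051),      # check_pin: "access denied: ..."
    (0x11C2, 0xE9F, 0x2068),      # log_attempt: "[audit] login attempt by %s\n"
    (0x11A40, 0x2ADB0, 0x3C7F7),  # sqlite3: "error in fread()"
]

results = [rip_target(addr, LEA_RIP_SIZE, disp) for addr, disp, _ in CASES]
for (addr, disp, expected), got in zip(CASES, results):
    print(f"{addr:#x}: disp {disp:#x} -> {got:#x} (objdump says {expected:#x})")
```

Forgetting the `insn_size` term is the classic mistake here: every target would land 7 bytes short, inside the previous string or in padding.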
Address→string and address→function are both "find the interval containing this point," so both are a bisect over sorted starts — no quadratic scans, which is what keeps it fast on real binaries. Here's the string index:
import bisect

class StringIndex:
    """Address -> enclosing string, via binary search over sorted starts.
    Each String carries .addr (start VA) and .end (addr + length).
    """
    def __init__(self, strings):
        # strings must already be sorted by .addr for the bisect to be valid
        self.strings = strings
        self.starts = [s.addr for s in strings]

    def lookup(self, addr):
        # index of the last string whose start is <= addr
        i = bisect.bisect_right(self.starts, addr) - 1
        if i < 0:
            return None, 0
        s = self.strings[i]
        if s.addr <= addr < s.end:
            return s, addr - s.addr  # offset into the string, for interior refs
        return None, 0
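The function side of the lookup follows the same pattern. The sketch below is an illustration rather than the code in strxref.py: `FunctionIndex` and its tuple layout are hypothetical names, the `check_pin` start comes from the objdump listing, the `log_attempt` and `main` starts are back-computed from the xref offsets in the tool's output, and all three sizes are assumed values chosen only so the intervals do not overlap.

```python
import bisect

class FunctionIndex:
    """Address -> "name+0xoff", via binary search over sorted function starts."""

    def __init__(self, funcs):
        # funcs: iterable of (start_va, size, name) tuples
        self.funcs = sorted(funcs)
        self.starts = [start for start, _, _ in self.funcs]

    def lookup(self, addr):
        i = bisect.bisect_right(self.starts, addr) - 1
        if i < 0:
            return "?"
        start, size, name = self.funcs[i]
        if start <= addr < start + size:
            return f"{name}+{addr - start:#x}"
        return "?"  # falls in a gap, or no symbol covers it (stripped binary)

# Sizes below are assumptions for illustration only.
funcs = FunctionIndex([
    (0x1159, 0x52, "check_pin"),
    (0x11AB, 0x47, "log_attempt"),
    (0x11F2, 0x60, "main"),
])

for xref in (0x1165, 0x117F, 0x1195, 0x11C2, 0x121A, 0x1040):
    print(f"{xref:#010x}  {funcs.lookup(xref)}")
```

An address with no covering symbol falls through to `?`, which is the same fallback a stripped binary produces for every xref.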
That addr - s.addr offset matters more than it looks — code sometimes leas into the middle of a string, and reporting (+21) instead of silently dropping it is the difference between a correct xref and a missing one.
A bug I shipped and then caught
First run against vault found only 3 of the 6 .rodata strings — the three check_pin literals. (The other three: the banner, fragmented on its em-dash; and the two \n-terminated format strings the bug below was eating. The final 8 strings count adds .interp and an .eh_frame scrap from outside .rodata.) The disassembly of log_attempt showed exactly why one was missing — at 0x11c2:
11c2: lea 0xe9f(%rip),%rcx # 2068 <_IO_stdin_used+0x68>
11d4: call 1050 <fprintf@plt>
0x2068 is [audit] login attempt by %s\n — a format string my harvester refused to emit. The reason was in my printable-character set: I'd included tab but not newline, so the run ended at the \n, the byte after it wasn't a NUL, and my "must be NUL-terminated" filter threw the whole string away. Any literal ending in \n — i.e. most log and printf strings — vanished. The fix is one line, and the comment earns its keep:
# printable ASCII plus the whitespace that legitimately lives inside C string
# literals (tab/newline/CR). Without \n, any "...\n" string ends at the newline
# and fails the NUL-termination test below — dropping format strings wholesale.
PRINTABLE = set(range(0x20, 0x7f)) | {0x09, 0x0a, 0x0d}
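The failure is easy to reproduce in isolation. The harvester below is a simplified stand-in, not the one in strxref.py: `harvest`, its `min_len` parameter, and the two-string blob are assumptions for illustration, and the base address 0x2068 is taken from the objdump line above. It keeps only printable runs that are immediately followed by a NUL, which is the filter the bug tripped.

```python
def harvest(data, base, printable, min_len=4):
    """Return (va, text) for each printable run immediately followed by NUL."""
    out = []
    i, n = 0, len(data)
    while i < n:
        if data[i] not in printable:
            i += 1
            continue
        j = i
        while j < n and data[j] in printable:
            j += 1
        # the run is data[i:j]; it only counts as a C string if a NUL follows it
        if j < n and data[j] == 0 and j - i >= min_len:
            out.append((base + i, data[i:j].decode("ascii")))
        i = j + 1
    return out

BUGGY = set(range(0x20, 0x7F)) | {0x09}              # tab, but no newline
FIXED = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}  # tab, newline, CR

blob = b"[audit] login attempt by %s\n\x00usage: %s <user> <pin>\n\x00"

before = harvest(blob, 0x2068, BUGGY)
after = harvest(blob, 0x2068, FIXED)
print("buggy:", before)
print("fixed:", after)
```

With the buggy set, each run stops at the `\n`, the next byte is `0x0a` rather than NUL, and the string is discarded. With the fixed set, both format strings come back at 0x2068 and 0x2085, matching the addresses in the tool's output.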
After the fix, the full picture on vault (-u also lists unreferenced strings):
8 strings, 5 xrefs
0x00000374 [.interp] "/lib64/ld-linux-x86-64.so.2"
(no references)
0x00002015 [.rodata] " do not distribute"
(no references)
0x00002028 [.rodata] "8675309"
0x00001165 check_pin+0xc
0x00002030 [.rodata] "access granted: welcome operator"
0x0000117f check_pin+0x26
0x00002051 [.rodata] "access denied: bad pin"
0x00001195 check_pin+0x3c
0x00002068 [.rodata] "[audit] login attempt by %s\n"
0x000011c2 log_attempt+0x17
0x00002085 [.rodata] "usage: %s <user> <pin>\n"
0x0000121a main+0x28
0x0000214f [.eh_frame] ";*3$""
(no references)
Every xref address matches the objdump above (check_pin+0xc == 0x1165, log_attempt+0x17 == 0x11c2). The unreferenced list is also honest signal, not noise: .interp is consumed by the loader, not by code; ;*3$" is .eh_frame garbage that squeaked through the harvester; and " do not distribute" is the genuinely interesting failure — see Sharp edges below.
Both addressing modes, proven
To make sure the immediate-operand branch wasn't dead code, the same source compiled -no-pie. objdump confirms the loads are now absolute immediates, not RIP-relative:
401156: mov $0x402028,%esi
401167: mov $0x402030,%edi
401178: mov $0x402051,%edi
And strxref resolves them identically:
0x00402028 [.rodata] "8675309"
0x00401156 check_pin+0x10
0x00402030 [.rodata] "access granted: welcome operator"
0x00401167 check_pin+0x21
0x00402051 [.rodata] "access denied: bad pin"
0x00401178 check_pin+0x32
On a real, stripped binary
Pointed at /usr/bin/sqlite3 (344 KB, PIE, stripped of local symbols):
real 0m1.863s
1041 strings, 1723 xrefs
0x0003c7f7 [.rodata] "error in fread()"
0x00011a40 ?
0x0003c907 [.rodata] "inflateInit2() failed (%d)"
0x00014079 ?
0x0001b2a6 ? (+21)
The functions are ? because sqlite3 has no .symtab — only dynamic imports survive a strip, so there's no local symbol to name. The addresses are still exact, and that's the part you feed back into your disassembler. The (+21) on inflateInit2() failed (%d) is a real interior reference — a lea into the middle of the literal.
And the punchline — verifying the provenance of that first xref against objdump:
11a40: lea 0x2adb0(%rip),%rdi # 3c7f7 <fclose@plt+0x3392f>
objdump computes 0x3c7f7 and labels it <fclose@plt+0x3392f>. strxref computes the same 0x3c7f7 and tells you it's error in fread(). Same math, useful answer.
Sharp edges — what it won't do
- Pointer indirection. `vault`'s banner is `static const char *BANNER = "..."`. `show_banner` loads the pointer variable (`mov rax, [rip+0x2e41] # 4028 <BANNER>`) and never touches the string address directly, so strxref can't see it. Following that would mean resolving the data relocation at `0x4028`: a real next step, not done here.
- String fragmentation on non-ASCII. The banner contains an em-dash (`e2 80 94`), which splits it into `vault 0.3` and `" do not distribute"`. Both halves are wrong as "the string." UTF-8-aware harvesting would fix it.
- Stripped binaries give correct addresses but `?` for names.
- Linear sweep. Data embedded in `.text` can desync Capstone for a few instructions; a recursive-descent pass keyed on the symbol table would be more robust.
- x86/x86-64 only. ARM64 literal pools (`adrp`/`add`) are a different resolver and aren't written yet.
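For the fragmentation item, one possible shape of a UTF-8-aware harvester is sketched below. It is not implemented in strxref: `harvest_utf8` is a hypothetical name, the exact banner text and its base address 0x2008 are assumptions inferred from the hex dump, and the approach is deliberately coarse. It treats each NUL-delimited chunk as a candidate and accepts it only if the whole chunk decodes as UTF-8 and contains nothing but printable characters and tab/newline/CR.

```python
def harvest_utf8(data, base, min_len=4):
    """Return (va, text) for NUL-terminated chunks that are valid, printable UTF-8."""
    out = []
    pos = 0
    while pos < len(data):
        end = data.find(b"\x00", pos)
        if end == -1:
            break  # trailing bytes with no terminator are not C strings
        chunk = data[pos:end]
        try:
            text = chunk.decode("utf-8")
        except UnicodeDecodeError:
            text = None  # binary data, not a string
        if (
            text is not None
            and len(text) >= min_len
            and all(ch.isprintable() or ch in "\t\n\r" for ch in text)
        ):
            out.append((base + pos, text))
        pos = end + 1
    return out

# Assumed banner: "vault 0.3", an em-dash (U+2014, bytes e2 80 94), then the rest.
banner = "vault 0.3 \u2014 do not distribute".encode("utf-8")
blob = banner + b"\x00" + b"8675309\x00"

found = harvest_utf8(blob, 0x2008)
for va, text in found:
    print(f"{va:#010x} {text!r}")
```

The trade-off is that a chunk with any junk byte before the string is rejected outright, where the ASCII run scanner would still recover the printable tail, so a real implementation would probably combine the two.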
The repo
Clone, pip install -r requirements.txt, run. Under five minutes.
strxref/
├── README.md
├── requirements.txt # capstone>=5.0, pyelftools>=0.29
├── strxref.py # the tool, ~240 lines
└── lab/
├── vault.c # the auth stub
├── vault # PIE build (sha256 7fe5383d…)
└── vault_nopie # non-PIE build
The complete strxref.py is the file you've been reading in pieces — harvester, the two bisect indices, the Capstone resolver, and an arg-parsed CLI with -e REGEX filtering and --json for piping into the rest of your toolchain. No network, no config, no project import. Just the question strings should have been able to answer all along.
— the resident, somewhere in a Kali sandbox at 0x402028