Technical · 9 min

Ten Rounds of Breaking My Own App: The Tools, and the Lessons

The bespoke tooling that made a ten-round self-pentest possible, the attacks that failed, the full round-by-round scoreboard, and the lessons that outlive the app. Part 4 of a four-part series.


The final part of a four-part series on the ten-round pentest of MoodHaven Journal. Part 1 covered the why and the lab; Part 2 the findings; Part 3 the flagship feature that never engaged and the rounds that closed the campaign. This part zooms out: the tools, what held, and what it all taught me.


The tools I built to do this

Here’s the thing nobody tells you about pentesting your own software: for an app like this, the tools don’t exist yet. The off-the-shelf security scanners — the ones that crawl a website or hammer a public API — have nothing to say about a local-first desktop app that speaks its own private, encrypted protocol between two of your own machines. There’s no URL to point them at. So a real test of this app meant building the instrumentation myself, and that turned out to be as much of the work — and as much of the skill on display — as finding the bugs. The honest meta-lesson of the whole campaign is that the gap between “I designed it securely” and “I proved it” was only closable with bespoke tooling, and that tooling caught bugs the generic scanners never could have.

In plain terms, here’s what I had to build:

  • A real attack lab. A dedicated Kali Linux attacker machine and two “victim” machines (Windows and Ubuntu) running the actual installed app, plus an occasional real Mac when it was on the network — all driven from a single orchestrator over SSH, so one command could build, install, attack, and tear down across every machine.
  • A from-scratch clone of the app’s own encrypted sync protocol. This is the centerpiece. To test the network features properly, I reimplemented the app’s entire secure handshake — the cryptographic key exchange, the device-identity challenge-and-signature, the derived session key, and the encrypted message framing — as a standalone attack client. That let me act like a trusted device and probe exactly one thing at a time. It’s also the tool that caught the dropped-connection bug live.
  • A pairing-screen fuzzer and a traffic-anonymization pipeline. One tool hammered the device-pairing server with malformed and oversized requests to confirm it holds up. Another captured the real network traffic off the wire and then carefully scrubbed it — stripping IP addresses, device IDs, keys, and PINs — so the captures could become honest, blog-safe evidence without leaking anything.
  • A multi-OS build-and-install harness. Getting the installed app (not a dev build) onto each operating system and into a testable on-screen state was its own engineering problem — automating a Windows GUI from a headless session, getting the encrypted-database build to compile correctly, and installing and running the real .msi and .deb artifacts a user would actually download.
  • An AI-orchestrated testing loop with a verification gate. The whole campaign ran as fan-out investigations that were then adversarially verified — every candidate finding re-checked from scratch against the source before it counted — with a security-review step acting as the gate and a background health monitor watching the lab. The point of the verification gate is that an attack tool throwing an error isn’t a finding until it can be re-derived from the code.

None of this is the app. It’s the scaffolding around the app — and building it well is the difference between “I ran a scanner” and “I actually attacked this.”

For the technically inclined — the v2 sync-client emulator, the fuzzer + pcap anonymization, and the orchestration

The lab. Orchestrator (laptop, Claude Code) drives everything over SSH: red (Kali, the attacker, which holds the probe library at ~/pt6/ and a stable Ed25519 pentest identity in red_peer_key.bin, 0600, so it can be added to a victim’s trusted_devices.json), green (Windows 11, installed release .msi), purple (Ubuntu 22.04, .deb/AppImage), and an opportunistic macbook when it’s on the LAN. Red’s network probes are OS-agnostic: the same script hits green, purple, or the Mac; only the victim IP and the device-derived port change. The sync port is 44000 + (the first four hex digits of device_id, read as an integer, mod 1000); the pairing port uses the 43000 base.
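The port rule above is small enough to sketch directly. This is a minimal illustration, not the app's code: it assumes device_id is a hex string, that "first-4-hex" means its first four characters parsed as a base-16 integer, and that the pairing port applies the same offset to its own base. The function names are hypothetical.

```python
SYNC_BASE = 44000      # from the post
PAIRING_BASE = 43000   # from the post


def device_port(device_id: str, base: int) -> int:
    """Return base + (first four hex digits of device_id, as an integer, mod 1000)."""
    prefix = int(device_id[:4], 16)  # raises ValueError if the prefix is not hex
    return base + (prefix % 1000)


def sync_port(device_id: str) -> int:
    return device_port(device_id, SYNC_BASE)


def pairing_port(device_id: str) -> int:
    # Assumption: pairing uses the same offset rule, only with the 43000 base.
    return device_port(device_id, PAIRING_BASE)
```

Under these assumptions every port lands in 44000..44999 (sync) or 43000..43999 (pairing), so a probe only needs the victim's device_id and IP to find its target.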

The custom v2 sync-client emulator (e3_restore_gate.py). A full reimplementation of MoodHaven’s peer-sync v2 protocol in Python (using cryptography), enough to be a trusted client and exercise the restore consent gate:

  1. Load a stable Ed25519 identity (red_peer_key.bin) so the victim recognizes the device.
  2. Compute the victim’s sync port from its device_id.
  3. Handshake: send Hello { did, eph_pub } with a fresh X25519 ephemeral public key → receive Ok { eph_pub, challenge } → answer the Ed25519 challenge by signing "moodhaven-hello-auth-v1:" || challenge_bytes and sending Auth { signature }.
  4. Derive the session key exactly as the app does: SHA-256("moodhaven-sync-v2:" || X25519_shared || sorted(static_pub_A, static_pub_B)).
  5. Speak the framed transport — [4-byte BE length][12-byte random nonce][AES-256-GCM ciphertext] — to send a single encrypted RestoreRequest and observe the consent gate: unarmed → encrypted Err/reject; armed → the server begins streaming, and the probe aborts immediately (it never exfiltrates). A whoami mode prints red’s device_id + public key so it can be added to the victim’s trusted list.
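Steps 3 and 4 can be sketched with the standard library alone, treating the X25519 shared secret and the public keys as raw bytes. This is an illustration under stated assumptions, not the emulator's source: it assumes "sorted" means byte-wise lexicographic order of the two static public keys, concatenated in that order, and the function names are hypothetical. In a real client, the shared secret and the signature would come from the cryptography package's X25519 and Ed25519 key objects.

```python
import hashlib

HELLO_AUTH_PREFIX = b"moodhaven-hello-auth-v1:"   # from step 3
SESSION_KEY_PREFIX = b"moodhaven-sync-v2:"        # from step 4


def auth_message(challenge: bytes) -> bytes:
    """Bytes the client signs with its Ed25519 identity to answer the challenge."""
    return HELLO_AUTH_PREFIX + challenge


def derive_session_key(shared_secret: bytes,
                       static_pub_a: bytes,
                       static_pub_b: bytes) -> bytes:
    """SHA-256(prefix || X25519 shared secret || sorted static public keys)."""
    # Sorting makes the result independent of which side calls the function,
    # so both peers arrive at the same 32-byte key.
    first, second = sorted((static_pub_a, static_pub_b))
    return hashlib.sha256(
        SESSION_KEY_PREFIX + shared_secret + first + second
    ).digest()
```

The 32-byte digest is the right size for an AES-256-GCM key, which matches the framed transport in step 5.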

Reimplementing the protocol (rather than driving the app’s own client) is what made it a probe: it can send one malformed or out-of-sequence message at a time, with an attacker’s identity, and observe the server in isolation. It’s also what surfaced the non-blocking-socket bug — the emulator’s handshake kept failing where the source said it should succeed, isolating the defect to the accepted socket’s mode rather than the protocol logic.
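To make "one malformed message at a time" concrete, here is a hedged sketch of the frame layout from step 5, with a knob for lying about the length prefix. It assumes the 4-byte length covers the nonce plus the ciphertext, which the post does not state. The 16 MiB limit and the "Frame too large" wording come from the fuzzing results later in this post; the other error strings and all function names are made up for illustration.

```python
import os
import struct

MAX_FRAME = 16 * 1024 * 1024   # 16777216, the cap reported by the app
NONCE_LEN = 12


def pack_frame(nonce: bytes, ciphertext: bytes, declared_len=None) -> bytes:
    """Build [4-byte BE length][nonce][ciphertext].

    declared_len lets a probe send a length prefix that does not match the body.
    """
    body = nonce + ciphertext
    length = len(body) if declared_len is None else declared_len
    return struct.pack(">I", length) + body


def parse_frame(buf: bytes):
    """Defensive parse: check the declared length before trusting it."""
    if len(buf) < 4:
        raise ValueError("truncated length prefix")
    (length,) = struct.unpack(">I", buf[:4])
    if length > MAX_FRAME:
        raise ValueError(f"Frame too large (limit {MAX_FRAME})")
    if length < NONCE_LEN:
        raise ValueError("frame shorter than nonce")
    body = buf[4:4 + length]
    if len(body) != length:
        raise ValueError("truncated frame")
    return body[:NONCE_LEN], body[NONCE_LEN:]
```

A probe would call pack_frame(os.urandom(12), b"", declared_len=0xFFFFFFFF) to send a near-4 GB length prefix with almost no body. The point of the parser's ordering is that the cap is checked before any allocation or read is sized from the attacker's number.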

The pairing fuzzer (e1_pairing_probes.sh) + pcap pipeline. e1 nmaps the pairing port range, then fires oversized bodies (2 MB > the 1 MB cap → expect HTTP 400), malformed JSON (expect 400, server survives), a 5-attempt PIN brute force (expect 429 + server closes on the 5th), and a post-lockout probe (expect connection refused). Packet captures run on red (which is a party to the TCP connections, so no span port or ARP trickery is needed; it sees mDNS multicast natively). The anonymization pipeline keeps exhibits real but blog-safe: tcprewrite/bittwiste rewrite L2/L3 (real LAN IPs → RFC 5737 documentation IPs, MACs → placeholders), and because those tools don’t touch application-layer strings, the published figures are hand-built tshark field extracts with sed redaction of device IDs (xxxx…xxxx), public keys (len=32, ab12…ef90), challenge nonces, the 6-digit PIN (######), and ciphertext (short prefix + length only). A pre-publish checklist greps the final files for any real 192.168.1.x, full device ID, full key, or PIN before anything ships.
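The pre-publish check at the end of that pipeline can be approximated in a few lines. This is a sketch of the idea, not the actual checklist: the patterns are assumptions (a device ID or key is guessed to appear as a run of 32 or more hex characters, a PIN as a standalone six-digit number), and the names are hypothetical. It deliberately over-flags, since a false alarm before publishing is cheap.

```python
import re

# Assumed shapes of sensitive values; adjust to the real formats.
LEAK_PATTERNS = {
    "real LAN IP": re.compile(r"\b192\.168\.1\.\d{1,3}\b"),
    "long hex string (device ID or key?)": re.compile(r"\b[0-9a-fA-F]{32,}\b"),
    "6-digit PIN": re.compile(r"(?<![0-9a-fA-F])\d{6}(?![0-9a-fA-F])"),
}


def find_leaks(text: str):
    """Return (line number, label, matched text) for every suspicious string."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in LEAK_PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((lineno, label, match.group(0)))
    return hits
```

Run over each exhibit before it ships, an empty list is the pass condition. Redacted placeholders such as xxxx…xxxx, ab12…ef90, and ###### do not match any of the patterns, while RFC 5737 addresses like 192.0.2.10 pass the IP check.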

The build/install harness. The hardest automation problem was getting a GUI app into a testable state on a headless-driven Windows box: solved with interactive Scheduled Tasks (LogonType Interactive, RunLevel Highest) that build the release .msi, install it elevated via msiexec, and launch the installed exe onto the logged-in RDP desktop. A hard-won build lesson is baked in: do not set OPENSSL_DIR/OPENSSL_STATIC, because rusqlite’s bundled-sqlcipher-vendored-openssl feature must compile its own vendored OpenSSL (pointing it at a prebuilt OpenSSL was what first masked the SQLCipher readback bug), and nasm is required on Windows for the vendored asm but not on Linux. The rule throughout: build from the PR branch and test the installed artifact, not tauri dev.

The orchestrated loop + health monitor. Each round is static-analysis → live-exploit → packet-capture → memory-forensics → fix → re-test, with subagents fanning out per surface (lock guards / crypto-zeroization / network-injection / browser-shim parity). Every candidate finding is adversarially verified by a fresh-context pass that must quote the motivating file:line from source or the finding is suppressed. This is the same discipline the cso (“Chief Security Officer”) security-review skill formalizes as a confidence gate, and it doubles as the de-flaker that keeps transient network noise from becoming a “finding.” A proactive health monitor (harness-monitor.sh) tails per-victim phase status and distinguishes an infrastructure failure (victim never came up) from a genuine security PASS/FAIL, so a flaky SSH hop never gets misread as a clean result.
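The "quote file:line or be suppressed" rule can be checked mechanically. The sketch below is one possible shape for such a gate, not the campaign's implementation: it assumes a citation is a repo-relative path and a 1-based line number joined by a colon, and that the quoted snippet must appear on exactly that line. The function name and the citation format are hypothetical.

```python
import re
from pathlib import Path

# Assumed citation shape: "relative/path.ext:LINE"
CITATION = re.compile(r"^(?P<path>[^:\s]+):(?P<line>\d+)$")


def citation_holds(citation: str, quoted: str, repo_root: str) -> bool:
    """True only if the quoted text really appears at the cited file:line."""
    match = CITATION.match(citation)
    if not match:
        return False
    path = Path(repo_root) / match["path"]
    if not path.is_file():
        return False
    lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
    index = int(match["line"]) - 1          # citations are 1-based
    if not 0 <= index < len(lines):
        return False
    snippet = quoted.strip()
    return snippet != "" and snippet in lines[index]
```

A finding whose citation fails this check is dropped rather than downgraded, which is what keeps an attack tool's transient error from being promoted to a result.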


What the testing didn’t find (and why that counts)

A good penetration test is not measured only by the holes it finds. The roughly two dozen attacks that failed are evidence that the defenses I’d built actually work under fire:

  • Cross-site scripting via book and tag names didn’t work — React escapes text by default, and the app never uses the unsafe raw-HTML escape hatch (the two raw-HTML sinks that do exist are run through DOMPurify).
  • A denial-of-service flood against the sync engine didn’t work — an Ed25519 challenge-response rejects untrusted devices before any large data is read, and a hard frame cap blocks memory-exhaustion payloads. This round I fuzzed the sync frame parser directly: 4 GB and 256 MB length prefixes, truncated frames, a garbage HELLO, zero-length frames, and random bytes. Every one was rejected cleanly — the oversized prefixes hit Frame too large (limit 16777216) (the 16 MiB cap), the malformed HELLO came back missing field did — and the app survived all of them with no panic and no out-of-memory.
  • Settings injection from a malicious peer didn’t work — only a single, explicitly allowlisted preferences blob is allowed to sync; credentials and auth secrets are blocked at the data layer.
  • Brute-forcing the recovery key is infeasible: its 24 characters from a 32-symbol alphabet give 120 bits of entropy (32^24 ≈ 1.3×10^36 combinations). Even at a wildly optimistic 10^12 guesses/second, exhausting that space takes on the order of 10^16 years (and 10^19 at a more realistic 10^9/s), which is millions to billions of times the age of the universe.
  • Pulling secrets out of the shipped binary didn’t work — no hardcoded keys or passwords; the release build is stripped and hardened (strip, lto, panic = "abort").
  • Lifting the device’s signing key off disk didn’t work — I checked, and the Ed25519 private key (peer_key.bin) isn’t sitting in the app’s data directory at all; it’s held in the OS keyring. File-system access alone doesn’t hand an attacker the device identity key.
  • Memory dumps after the fixes came back clean — the key-shaped hex strings that appeared in earlier dumps were gone. (One honest caveat: the live memory-dump pass against the latest build — the empirical re-check of key-zeroization on the just-locked process — is the one test still pending; see the note below.)
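The recovery-key arithmetic in the list above is easy to reproduce. The sketch below uses only the figures from the post (24 characters, 32 symbols, 10^12 and 10^9 guesses per second), plus one outside constant, an age of the universe of roughly 1.38×10^10 years, which is an assumption added here for comparison.

```python
import math

ALPHABET_SIZE = 32
KEY_LENGTH = 24
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10   # assumption, not from the post

keyspace = ALPHABET_SIZE ** KEY_LENGTH                  # 32^24 == 2^120
entropy_bits = KEY_LENGTH * math.log2(ALPHABET_SIZE)    # 24 * 5 bits


def years_to_exhaust(guesses_per_second: float) -> float:
    """Years needed to try every key at a constant guess rate."""
    return keyspace / guesses_per_second / SECONDS_PER_YEAR


def universe_ages(guesses_per_second: float) -> float:
    """The same duration expressed in multiples of the age of the universe."""
    return years_to_exhaust(guesses_per_second) / AGE_OF_UNIVERSE_YEARS
```

Under these figures, the 10^12 rate works out to a few times 10^16 years and the 10^9 rate to a few times 10^19 years, that is, millions and billions of universe lifetimes respectively. An attacker expects to succeed after searching half the space on average, which halves those numbers without changing the conclusion.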

Confirming that a defense works is a different kind of value from finding a hole, but it’s real value. It turns “I think this is safe” into “I tried to break this and couldn’t.”


The ten rounds at a glance

| Round | Focus | Confirmed | Status |
|---|---|---|---|
| PT1 | Sync engine, browser build, conflict resolution | 3 | Fixed (commit 3cd3a60) |
| PT2 | Encryption at rest, key files, lockout | 4 | Fixed (v1.7.1) |
| PT3 | The encryption migration + live network traffic | 6 | Fixed (v1.7.2, PR #122) |
| PT4 | Memory forensics, startup recovery, binary hardening | 5 | Fixed (v1.7.3, PRs #123, #124) |
| PT5 | Completeness sweep: file paths, reset, every key path | 3 | Fixed (v1.7.4, PR #125) |
| QA | Running the real build through first-time setup | 2 | Fixed (v1.7.5, PR #127) |
| PT6 | Access-control audit across the full command surface | 7 | Fixed (PR #133) |
| PT7 | Verify the prior fixes, then hunt again | 11 | Fixed (PR #133) |
| PT8 | Prove the encryption actually works + Windows reset + Ubuntu victim | 2 | SQLCipher fix verified end-to-end on green Windows (commit e6fb416); Linux re-validation pending |
| PT9 | Turn the custom sync-client tool on the running app | 6 | Fixed + committed (949f9a9, 0774a3e, 07e9d44, 4443a2b) |
| PT10 | Red-team the PT9 fixes; then an independent verification hunt | 2 | Both fix-bugs fixed (2334269, fa2d299); verification hunt clean → campaign closed |
| Total | 65+ targets | 41 through PT7 (all fixed) + 2 in PT8 + 6 in PT9 + 2 in PT10 | All fixed; PT8 verified on green Windows; PT10 verification hunt clean → closed |

A note on PT7 that captures the spirit of the whole thing: before hunting anything new, I pointed the testing at the previous round’s pull request with one job — prove it fixes what it claims. That verification pass found two commands the project’s own documentation said were protected and which, in fact, were not. A fix’s own documentation is not evidence the fix is complete. You verify against the code.

PT8 took that same instinct to its logical extreme — and it paid off twice over. First, instead of trusting that the flagship SQLCipher encryption was working because it was documented and had passed earlier rounds, I set out to prove it on a real machine. It wasn’t working at all: the database had been plaintext on every install the whole time. Then, holding to the same standard, I refused to mark the fix done until the corrected build was reinstalled from scratch on Windows and confirmed encrypted-on-disk with a clean unlock — which it now is. The lesson PT7 hinted at, PT8 made undeniable: you don’t verify a fix against its documentation, you verify it against a running build — and ideally against a minimal reproduction that can’t lie to you.
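One example of a reproduction in that spirit: a plaintext SQLite file always begins with the 16-byte header "SQLite format 3" followed by a NUL byte, so reading the first 16 bytes of the on-disk database is enough to prove a database is not encrypted. The sketch below is a generic check built on that fact, not the reproduction used in the campaign, and the function name is hypothetical.

```python
SQLITE_MAGIC = b"SQLite format 3\x00"   # fixed 16-byte header of a plaintext database


def looks_like_plaintext_sqlite(path: str) -> bool:
    """True if the file starts with the plaintext SQLite header."""
    with open(path, "rb") as handle:
        return handle.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC
```

The check is one-directional. A True result proves the file is readable without a key; a False result only shows the header is not plaintext, so confirming that encryption is correct still requires opening the database with the right key and failing to open it with a wrong one.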

PT9 is where that discipline turned into automation. The custom sync-client emulator built to probe the app became a tool that found a bug on its own — a Windows-only dropped-connection defect that no amount of source review would have reliably surfaced — and the round closed out six issues in total, including two data-loss bugs that the eighth round’s own encryption fixes had introduced.

PT10 is where the loop terminated, honestly. I red-teamed PT9’s two trickiest fixes as their own attack surface and found a bug in each — the recovery probe accepting an empty-database decoy, and the restore-salt transfer writing an attacker-controlled salt unchecked. Both fixed. Then an independent verification hunt for new high-severity issues came back clean while every happy path still worked. That’s the close-out signal: you don’t reach zero bugs, you converge — the externally-reachable surface goes to zero, the residuals settle into local-access / lockout-class items, the no-journal-content-exposure invariant holds, and a fresh hunt comes back empty. Every fix is new attack surface; you keep attacking your own fixes until a clean pass comes back.


Lessons worth keeping

These generalize past this one app, which is really why I’m writing them down.

  1. Network capture beats code review for whole classes of bugs. The leaked-public-key findings were invisible in the source and obvious the moment I watched the actual traffic. Static analysis has a ceiling; a packet capture doesn’t.

  2. Fix the root cause, not the symptoms. Three “separate” findings were one. Always ask which findings share a cause before scheduling them as independent work.

  3. A fix is new, untested code — and “we wrote it” is not “it works.” Every round found something introduced by the previous round’s fixes — including two critical bugs that lived inside earlier fixes, a flagship encryption feature that turned out never to have engaged at all, and then two more data-loss bugs hiding inside the fixes for that. Re-test after you patch, and verify the fix actually executes correctly on a real build, not just that the code was written.

  4. The browser build needs the same discipline as the native build. Some of the worst access-control gaps were in the browser version, where there’s no shared type system to enforce parity with the backend. You have to audit them together.

  5. When you fix a class of bug, sweep the whole codebase. Fixing key-wiping in two functions wasn’t enough; later rounds found the same mistake in three, then four more places. Do the completeness pass.

  6. At some point you have to install the thing and break it. The most serious bugs survived multiple rounds of code review and showed up only when the real build ran on a real machine — including a security feature that looked correct in the source but was silently inert in production for its entire life. Code review can’t observe runtime state. Build it and break it. And when a result still surprises you, write a ten-line standalone reproduction that proves the mechanism — it can’t lie to you the way a passing-looking review can.

  7. For a custom app, you have to build your own tools — and that’s where the real signal is. No off-the-shelf scanner can speak a private encrypted desktop protocol. Reimplementing the app’s own sync handshake as a standalone attack client wasn’t a side quest; it was the thing that found a bug code review never would have, and it’s the clearest signal of the work. If the instrumentation doesn’t exist, building it well is the test.

  8. You converge, you don’t reach zero — and a clean independent pass is how you know. The honest end of a campaign like this isn’t “no bugs left,” which no finite test can prove. It’s that the externally-reachable surface has gone to zero, the residuals have migrated from “reads your data” to “needs your already-unlocked machine and only denies you access,” the one invariant that matters never bent, and a fresh, independent hunt for new high-severity issues comes back empty while the happy paths still work. The tenth round is what produced that pass — by attacking the ninth round’s own fixes first.

  9. Describing what you want built is not the same as knowing it works — and AI widens that gap. You can produce a lot of plausible, confident, wrong code fast. The flagship encryption feature was designed right, reviewed, documented, and believed — and it had never once run. Describing intent is cheap; verifying it on a real machine is the actual engineering. The honest move is to say exactly what you’ve proved and what you haven’t, rather than mistake a convincing first draft for a finished one.


Why this is a portfolio piece, not a postmortem

I’m sharing this because it demonstrates the way I like to work, and because “we take security seriously” should mean something concrete.

What I want it to show:

  • A security mindset by default. The app was designed to be private; this campaign was about proving it, adversarially, rather than asserting it.
  • Engineering rigor. Real lab, real installed builds, two victim operating systems, network captures, memory forensics, custom protocol tooling, and — the part most teams skip — re-testing every fix instead of trusting it.
  • Follow-through. Every confirmed finding through PT7 was fixed, each tied to a specific release or pull request; the eighth round’s flagship encryption fix is now verified end-to-end on the installed Windows build; the ninth round’s six fixes are committed and proven by reproductions and regression tests; and the tenth round red-teamed those fixes, fixed the bug it found in each, and closed on a clean independent verification pass. Found and fixed is the only standard that counts — and I’d rather tell you exactly where a fix stands than round it up to “done.”
  • Tooling as a deliverable. The hardest and most telling part wasn’t the bugs — it was building the lab, the from-scratch sync-protocol attack client, the fuzzer and the anonymization pipeline, and the orchestrated, adversarially-verified loop that drove it all. Bespoke instrumentation was required, and it caught what generic scanners couldn’t.
  • A modern, AI-assisted workflow used honestly. The AI orchestrator made the campaign faster and more thorough; I’ve been equally clear about what it couldn’t do and where a human had to stay in the loop.

The honest takeaway is the one I started with, only sharper now: I thought I’d built something secure, and I had — mostly. I even thought I’d fixed the parts that weren’t, and on one flagship feature I was wrong about that for a long time. Trying hard to break it — on real machines, with tools I had to build myself and reproductions that can’t be argued with — is what turned “mostly” into something I can actually stand behind. The defenses I couldn’t break are the ones I now trust; and the feature I thought was protecting me, but wasn’t, is exactly the kind of thing this whole process exists to catch.


That’s the series

Ten rounds, 65+ attacks, 41-plus confirmed-and-fixed vulnerabilities, and a flagship feature I had to catch lying to me. If you read all four parts: thank you — it was the long version on purpose.

MoodHaven Journal is the app all of this was in service of. And if you want the next thing I take apart — security writeups, build logs, the occasional dispatch from the trail — the box below is where it lands.
