Yes, but the incentives created by that system lead to insurance adjudicators operating with extreme adversariality towards the insured. Add to that the extreme inelasticity of demand for insured products (e.g. healthcare, or access to a car to commute with after one is totaled), regulatory capture of insured products/services by insurers, and time, and you get pretty toxic systems wherein insurers exert upward price pressure without significant checks.
Not as well as they can reason about (or others can google) something as standardized as Kubernetes. There's just less context (in both senses of the term) needed to understand something running on a common substrate versus something bespoke, even if the bespoke thing is itself composed of standardized parts.
For a project set up by a qualified engineer, there would be little difference to the end user in practice. The LLM would work out a solution with a negligible difference in speed. Maybe debugging would even be faster for the LLM without the abstraction layers, given the low-level access?
Shit just gets really weird when your network isn't split for k8s in a way equivalent to what GCP/AWS expect. Like, if you have other services running on the nodes that you want things inside k8s to talk to, or if the nodes are in a flat subnet with other stuff in it, things get annoying. Those are worst practices for a reason, but they're pretty common in environments with home-rolled k8s clusters.
That is indeed a weirdly cursed requirement. Why? A black box of legacy stuff? A system that was never designed to run as multiple instances, but does so as long as all the nodes think they're the same machine? Defeating a license restriction?
This is a need it fails at miserably. k8s reminds me of the RAID recentralization anti-pattern, where you fix a hardware failure that rarely occurs in exchange for knowing that simple higher-level mistakes or security problems will now tank something that has grown too large to fail.
After peeking at the source, a few possible areas of improvement:
- You can use `fstat` and keep a file handle around, likely further improving performance (well, reducing the performance hit to other users of the filesystem, since you're not resolving VFS nodes on every check). If you do this, you'll have to check for file deletions.
- If you do stick with stat(2), it might be a good idea to track the inode number from the stat result in addition to the (time, size) tuple. That handles the "t,s = 1,2; honker gets SIGSTOPped/CRIU'd; database file replaced; honker started again" scenario, as well as renameat/symlink-swap fiddling. A changed inode should probably just trigger a crash (a sketch of this and the next check follows this list).
- Also check the device number from the stat call. It sounds fringe, but the number of weird hellbugs I've dealt with in my career caused by code continually interacting with a file while something else mounted an equivalent path "over" the directory the file was originally in is nonzero.
- It's been a few years since I fought with this, but aren't there edge cases here if the system clock goes backwards? IIRC the inode timestamp isn't monotonic--right? If so, there are various strategies of varying reliability for detecting clock adjustment that you could use here. Just checking whether the mtime-vs-system-clock diff is negative is a start.
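Concretely, the inode/device checks from the two stat(2) bullets might look something like this (untested sketch; the filename and handle_change() are stand-ins, not your API):

import os
import time

PATH = 'honker.db'  # illustrative filename

def stat_key(path):
    st = os.stat(path)
    # identity: which file object this is; times: has its content changed?
    return (st.st_ino, st.st_dev), (st.st_mtime_ns, st.st_size)

identity, times = stat_key(PATH)
while True:
    time.sleep(0.001)  # 1 kHz poll
    new_identity, new_times = stat_key(PATH)
    if new_identity != identity:
        # Inode or device changed: file renamed over, symlink swapped,
        # or something mounted over the parent directory. Crash hard.
        raise RuntimeError('database file identity changed')
    if new_times != times:
        times = new_times
        handle_change()  # hypothetical change callback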
That covers the more common of the "vanishingly uncommon but I've still seen 'em" cases related to file modification detection. Whether you choose to cope with people messing with the file via utime(2) is up to you (past a point, it feels like coping with malicious misuse rather than edge cases). But since your code runs in a loop, you're well-positioned to do that (and detect drift/manipulations of the system clock): track a monotonic clock and use it to approximate the elapsed wall time between honker poller ticks (say it fast with an accent, and you get https://www.bbc.com/news/world-latin-america-11465127); if the timestamp reported by (f)stat(2) ever doesn't advance at the same rate, fall back to checksumming the file, or crashing or something. But this is well into the realm of abject paranoia by now.
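And a sketch of that drift check, with made-up names and a tolerance pulled out of thin air:

import os
import time

PATH = 'honker.db'         # illustrative
TOLERANCE_NS = 50_000_000  # 50 ms of allowed disagreement; tune to taste

last_mtime = os.stat(PATH).st_mtime_ns
last_mono = time.monotonic_ns()
baseline_valid = False  # the first delta includes pre-start history; skip it

while True:
    time.sleep(0.001)  # poller tick
    mtime = os.stat(PATH).st_mtime_ns
    if mtime == last_mtime:
        continue
    mono = time.monotonic_ns()
    # The wall-clock gap between two modifications should match the
    # monotonic gap between the ticks that observed them.
    drift = (mtime - last_mtime) - (mono - last_mono)
    if baseline_valid and abs(drift) > TOLERANCE_NS:
        fall_back_to_checksum()  # hypothetical: hash the file instead
    last_mtime, last_mono = mtime, mono
    baseline_valid = True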
It's been a decade or so since I worked in this area, so some of that knowledge is likely stale; after developing this library you probably know a lot more than I do, even before considering how out-of-date my knowledge might be. When I worked on this stuff, I remember that statx(2) was going to solve all the problems any day now, and then didn't. More relevantly, I also remember that the lsyncd (https://github.com/lsyncd/lsyncd) and watchman (https://github.com/facebook/watchman) codebases were really good sources of "what didn't I think of" information in this area.
But seriously, again, nice work! Those are nitpicks; this is awesome as-is!
I actually looked at fstat, but the "check for deletions" piece, given that I'm polling at 1 kHz, was the reason I decided not to use it. On older hardware this actually was a big issue, but hardware is fast enough now that I decided it wasn't a problem.
I'll ignore the malicious ones bc [out of scope declaration]. Abject paranoia is an artifact of build trauma and I respect that lmao.
I've just looked into the device number and system clock issues. I think what I'll end up doing is actually a combo of ncruces's comment above and your feedback: a 1 kHz data_version check and a 10 Hz stat() with a version check. This gets around syscall load, avoids clock issues, avoids the WAL truncation issues that others have mentioned, and is both lighter weight and less bugabooable than my previous design.
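Rough shape of what I'm thinking, for the curious (untested, names illustrative):

import os
import sqlite3
import time

PATH = 'honker.db'
conn = sqlite3.connect(PATH)

def data_version():
    # Changes iff another connection has committed to the database.
    return conn.execute('PRAGMA data_version').fetchone()[0]

last_version = data_version()
last_identity = None
tick = 0

while True:
    time.sleep(0.001)  # 1 kHz: cheap pragma, no stat(2) syscall pressure
    version = data_version()
    if version != last_version:
        last_version = version
        handle_change()  # stand-in for the real notification path
    tick += 1
    if tick % 100 == 0:  # every 100 ticks = 10 Hz: stat(2) cross-check
        st = os.stat(PATH)
        identity = (st.st_ino, st.st_dev)
        if last_identity is not None and identity != last_identity:
            raise RuntimeError('database file was replaced')
        last_identity = identity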
One clarification: by "check for deletions" I didn't mean that you need to read back through the filesystem; you can check for deletions for free using fstat(2)'s result. The hard-link count fstat returns for a file descriptor's underlying file includes the "existential" hard link of the file itself, and drops to zero when the file's deleted and the open handle is an orphan:
import os
import time
from threading import Thread, Event

f = '/tmp/foo.test'
ev = Event()
# Once signaled, delete the file while we still hold an open handle to it.
Thread(target=lambda: ev.wait() and os.unlink(f), daemon=True).start()
with open(f, 'w+') as fh:
    print("before delete:", os.fstat(fh.fileno()).st_nlink)  # -> 1
    ev.set()
    time.sleep(1)  # give the deleter thread time to unlink
    print("after delete:", os.fstat(fh.fileno()).st_nlink)   # -> 0
There's more process-based concurrency than you'd expect in shops that use those languages.
Cron jobs might need to coordinate with webservers. Even heavily threaded webservers might have some subprocesses/forking to manage connection pools and hot reloads and whatnot. Suid programs are process-separated from non-suid programs. Plenty of places are in the "permanent middle" of a migration from e.g. Java 7 to Java 11 and migrate by splitting traffic to multiple copies of the same app running on different versions of the runtime.
If you're already heavily using SQLite for your DB, you're probably reluctant to replace those setups with multiple servers coordinating around a central DB.
Nit:
> languages that only have process based concurrency python/JS/TS/ruby
Not true. There are tons and tons of threaded Python web frameworks/server harnesses, and there were even before GIL-removal efforts started. Just because gunicorn/multiprocessing are popular doesn't mean there aren't loads of huge deployments running threads (and not suffering for it much, because most web stacks are IO bound). Ruby's similar, though threads are less heavily-used than in Python. JS/TS as well: https://nodejs.org/api/worker_threads.html
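For instance, the stdlib alone will happily do thread-per-connection serving (a toy example, not a production setup):

from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
import threading

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Each connection is handled on its own thread.
        body = f"handled on {threading.current_thread().name}\n".encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

ThreadingHTTPServer(("127.0.0.1", 8000), Handler).serve_forever()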
That's a low-leverage place to intervene. Whether or not the internal admin system was directly OAuth linked to Google, by the time the attacker was trying that, they already had a ton of sensitive/valuable info from the employee's Google Workspace account.
If you can only fix one thing (ideally you'd do both, but working in infosec has taught me that you can usually do one thing at most before the breach-urgency political capital evaporates), fix the Google token scope/expiry, or fix the environment-variable storage system.
I'm not sure that's necessarily a "problem", though it is fundamental to secrets. We wouldn't say that it's a fundamental problem that doors on houses need a key--that's what the key is for--the problem is if the key isn't kept secure from unauthorized actors.
Like, sure, you can go HAM here and use network proxy services to do secret decryption, and only talk from the app to those proxies via short-lived tokens; that's arguably a qualitative shift from app-uses-secret-directly, and it has some real benefits (and costs, namely significant complexity/fragility).
Instead, my favored option is to scope secret use to network locations. If, for example, a given NPM token can only be used for API calls issued from the public IP endpoint of the user's infrastructure, that's a significant added layer of security. People don't agree on whether or not this counts as a "token ACL", but it's certainly ACL-like in its functionality--just controlled by location, rather than identity.
This approach can also be adopted gradually and with less added fragility than the proxy-all-the-things approach: token holders can initially allowlist broad or shared network location ranges, and narrow allowed access sources over time as their networks are improved.
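Server-side, the check is almost embarrassingly simple--this is entirely hypothetical (no provider exposes exactly this), just the shape of the idea:

import ipaddress

# issued token -> network locations it may be used from (illustrative data)
TOKEN_SCOPES = {
    "npm_tok_abc123": ["203.0.113.0/24"],
}

def token_allowed(token: str, source_ip: str) -> bool:
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr)
               for cidr in TOKEN_SCOPES.get(token, ()))

assert token_allowed("npm_tok_abc123", "203.0.113.7")
assert not token_allowed("npm_tok_abc123", "198.51.100.7")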
Of course, that's a fantasy. API providers would have to support network-scoped API access credentials, and almost none of them do.
Speaking of fantasies... another approach would be holder binding: DPoP (RFC 9449) has been stable for a couple of years, and AWS SigV4 does it too. The key holder proves control of a key at call time, so a captured token without the key is useless.
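A toy of the underlying idea (nothing like the real DPoP or SigV4 wire formats): the identifier travels with the request, the signing secret never does, so a replayed capture without the secret is useless:

import hashlib
import hmac
import time

SECRET = b"never-leaves-the-holder"  # illustrative key material

def sign(method: str, path: str, ts: int) -> str:
    msg = f"{method}\n{path}\n{ts}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def verify(method: str, path: str, ts: int, sig: str, max_skew: int = 300) -> bool:
    if abs(time.time() - ts) > max_skew:  # stale proofs are rejected
        return False
    return hmac.compare_digest(sign(method, path, ts), sig)

ts = int(time.time())
sig = sign("GET", "/v1/secrets", ts)
assert verify("GET", "/v1/secrets", ts, sig)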