[{"content":"Hey, I\u0026rsquo;m Gabriel, a self-taught Czech developer with strong, merit-based opinions\u0026mdash;and, I\u0026rsquo;ve been told, perhaps a slight tendency to embark on crusades when I encounter evil.\nThis is where I preach.\n","date":null,"permalink":"/","section":"","summary":"Hey, I\u0026rsquo;m Gabriel, a self-taught Czech developer with strong, merit-based opinions\u0026mdash;and, I\u0026rsquo;ve been told, perhaps a slight tendency to embark on crusades when I encounter evil.\nThis is where I preach.","title":""},{"content":"","date":null,"permalink":"/tags/agentic-systems/","section":"Tags","summary":"","title":"agentic systems"},{"content":"","date":null,"permalink":"/posts/","section":"Articles","summary":"","title":"Articles"},{"content":"","date":null,"permalink":"/tags/constrained-decoding/","section":"Tags","summary":"","title":"constrained decoding"},{"content":"","date":null,"permalink":"/tags/json/","section":"Tags","summary":"","title":"JSON"},{"content":"","date":null,"permalink":"/tags/llm/","section":"Tags","summary":"","title":"LLM"},{"content":"","date":null,"permalink":"/tags/structured-output/","section":"Tags","summary":"","title":"structured output"},{"content":"","date":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags"},{"content":"Constraining LLM output to valid JSON is not a new idea. Every major API provider supports it, every serious open-source inference engine has some form of it, and if you\u0026rsquo;re building agentic systems, you\u0026rsquo;re almost certainly relying on it. The basic premise is straightforward: you give the model a JSON schema, and it produces output that conforms to that schema.\nWhat\u0026rsquo;s less obvious is how this guarantee is achieved\u0026mdash;and the mechanism has consequences that are easy to underestimate. I certainly did. I spent a couple of days debugging a problem that, in hindsight, follows directly and obviously from the mechanism. 
But I didn\u0026rsquo;t understand the mechanism well enough to see it, and I suspect most people don\u0026rsquo;t.\nSo let\u0026rsquo;s fix that.\nThe mechanism #When people hear \u0026ldquo;constrained JSON output,\u0026rdquo; the mental model they tend to form is something like: the model has been taught, or configured, to produce JSON. It hasn\u0026rsquo;t. The model has no idea it\u0026rsquo;s being constrained. What actually happens is this:\nThe model produces logits\u0026mdash;a probability distribution over its entire vocabulary of tokens. A separate grammar engine determines which tokens are valid continuations of the partially generated JSON so far. All invalid tokens have their logits set to negative infinity. Sampling proceeds as usual over whatever remains. That\u0026rsquo;s the whole thing. It\u0026rsquo;s a post-hoc filter applied to each token independently. The model produces the same probability distribution it always would have, and then a separate system discards everything that doesn\u0026rsquo;t fit the schema. The model never \u0026ldquo;learns\u0026rdquo; to produce JSON\u0026mdash;it simply isn\u0026rsquo;t allowed to produce anything else.\nThe specific implementations of this filter vary quite a bit. Outlines converts JSON schemas to regular expressions, compiles those into finite-state machines, and pre-computes an index mapping each FSM state to its valid token set\u0026mdash;the foundational approach described in \u0026ldquo;Efficient Guided Generation for Large Language Models\u0026rdquo; by Willard \u0026amp; Louf. llama.cpp uses a character-level backtracking stack parser in src/llama-grammar.cpp, checking each candidate token against a GBNF grammar. XGrammar introduced an adaptive token mask cache that precomputes validity for ~99% of the vocabulary, needing runtime checks for only the remaining ~1%. 
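Whatever the engine, the operation at the core is identical. Here is a minimal sketch in plain Python, with a toy three-token vocabulary and a made-up validity check standing in for the compiled grammar (illustrative only, not how any of these engines are actually invoked):

```python
import math
import random

def constrained_sample(logits, is_valid_continuation):
    # Grammar-engine step: every token that is not a valid continuation
    # of the partial output gets its logit set to negative infinity.
    masked = [l if is_valid_continuation(i) else -math.inf
              for i, l in enumerate(logits)]
    # Softmax over whatever remains, then ordinary sampling.
    m = max(masked)
    exps = [math.exp(l - m) for l in masked]
    total = sum(exps)
    weights = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=weights, k=1)[0]

# Toy vocabulary: token 0 is an opening brace, token 1 is prose, token 2
# is a digit. Suppose the grammar says only token 0 may start a JSON object.
logits = [1.0, 5.0, 2.0]                      # the model prefers prose
token = constrained_sample(logits, lambda i: i == 0)
```

Even though the prose token carries the highest logit, the mask removes it before sampling ever sees it. The model still produced its usual distribution; the filter just discarded most of it.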
llguidance, Microsoft\u0026rsquo;s Rust-based engine, uses an Earley parser combined with derivative-based regular expressions, achieving ~50us per mask computation with essentially zero startup cost. SGLang contributes the compressed-FSM concept, collapsing multi-token deterministic paths into single steps.\nThe differences matter for performance and correctness\u0026mdash;we\u0026rsquo;ll get to both\u0026mdash;but the conceptual core is the same everywhere. It\u0026rsquo;s a filter over logits. No magic.\nThe performance consequence: your GPU is waiting for your CPU #If you\u0026rsquo;re running open-source models locally through Ollama, you may have noticed that enabling format=\u0026quot;json\u0026quot; makes generation substantially slower. This is well-documented in the issue tracker\u0026mdash;ollama#4370 reports a tenfold slowdown, ollama#3154 describes what llamafile handles in seconds taking Ollama two minutes\u0026mdash;and it\u0026rsquo;s not really a bug. It\u0026rsquo;s an architectural consequence of where grammar checking happens.\nOllama uses llama.cpp under the hood. In llama.cpp, token generation proceeds in three stages: graph preparation (CPU), model evaluation (GPU), and sampling (CPU). Grammar enforcement happens entirely within the sampling stage. NVIDIA documented this pipeline in a technical blog post about optimizing llama.cpp with CUDA graphs\u0026mdash;the grammar check is CPU-bound work that runs between GPU inference passes, and there\u0026rsquo;s nothing you can do about it from the outside.\nThe magnitude of the overhead is nontrivial. In llama.cpp#7554, someone profiled grammar sampling on an RTX 3090 with Llama-3-8B: sampling time went from 1.66 ms/token to 85 ms/token\u0026mdash;a 51x increase\u0026mdash;while GPU utilization dropped from over 70% to around 10%. The GPU was sitting idle, waiting for the CPU to finish checking grammar constraints.\nThere has been real progress. 
PR #6555 fixed combinatorial explosions in repetition rules, achieving an 8\u0026ndash;18x speedup in grammar processing. But the fundamental architecture remains: grammar checking is serial CPU work that blocks the GPU pipeline.\nModern inference engines address this differently. SGLang and vLLM overlap CPU grammar computation with GPU inference, so the mask for step n is computed while the GPU runs step n+1. XGrammar\u0026rsquo;s paper explicitly describes co-designing the grammar engine with the inference pipeline to enable this overlap, reporting up to 100x speedup over prior approaches. The SqueezeBits blog provides a good overview of this parallelization strategy.\nNone of this architectural work is present in Ollama\u0026rsquo;s pipeline.\nIf you\u0026rsquo;re on Apple Silicon, the situation is somewhat better. XGrammar ships with mlx-lm support for macosx_arm64, and Apple\u0026rsquo;s unified memory architecture eliminates data transfer overhead between CPU-generated masks and GPU logits. Outlines has an official mlx-lm integration, and LM Studio uses it for structured output with MLX models. There\u0026rsquo;s also llm-structured-output, a purpose-built MLX library using an Earley-style acceptor. I haven\u0026rsquo;t found any published benchmarks comparing these specifically on Apple Silicon\u0026mdash;the comparison would be valuable, but as far as I can tell, nobody\u0026rsquo;s done it yet.\nThe design consequence: field order significantly impacts correctness #This is the one that cost me two days.\nI had an agent with a simple set of actions\u0026mdash;ReadFile, WriteFile, and a few others. 
The schema looked roughly like this:\nReadFile: { \u0026#34;path\u0026#34;: \u0026#34;\u0026lt;path\u0026gt;\u0026#34;, \u0026#34;type\u0026#34;: \u0026#34;ReadFile\u0026#34; } | WriteFile: { \u0026#34;path\u0026#34;: \u0026#34;\u0026lt;path\u0026gt;\u0026#34;, \u0026#34;contents\u0026#34;: \u0026#34;\u0026lt;contents\u0026gt;\u0026#34;, \u0026#34;type\u0026#34;: \u0026#34;WriteFile\u0026#34; } The instructions were clear. The model had a thinking block where it would reason through the situation step by step, arrive at the correct conclusion\u0026mdash;\u0026ldquo;We need to call WriteFile\u0026rdquo;\u0026mdash;and then, wouldn\u0026rsquo;t you know it: it produced a ReadFile action.\nThe reasoning was perfect. The conclusion was correct. But the output was wrong.\nIt turned out to be a consequence of not having thought carefully enough about what left-to-right token generation means for schema design.\nHere\u0026rsquo;s the problem. The model starts generating the JSON, and the first field it encounters is path. Both ReadFile and WriteFile have a path field, so at this point the grammar constraint is maximally permissive\u0026mdash;everything valid for either action type is allowed. The model generates a path value.\nNow it moves on. In the previous turn, the model had sent a ReadFile action. That makes it plausible that the token sequence for ReadFile has elevated probability\u0026mdash;it appeared recently in context, the model\u0026rsquo;s attention mechanism is primed for it. By the time the model gets to the type field, it\u0026rsquo;s already generated a path value that\u0026rsquo;s perfectly consistent with ReadFile, the local context favors ReadFile, and the correct answer (WriteFile) was concluded many tokens ago in the thinking block. The probability distribution at the type field is skewed.\nThe grammar constraint can\u0026rsquo;t help. Both values are valid. 
The mechanics of LLM generation just pick the wrong one.\nThe fix is deceptively simple\u0026mdash;just put the discriminator first:\nReadFile: { \u0026#34;type\u0026#34;: \u0026#34;ReadFile\u0026#34;, \u0026#34;path\u0026#34;: \u0026#34;\u0026lt;path\u0026gt;\u0026#34; } | WriteFile: { \u0026#34;type\u0026#34;: \u0026#34;WriteFile\u0026#34;, \u0026#34;path\u0026#34;: \u0026#34;\u0026lt;path\u0026gt;\u0026#34;, \u0026#34;contents\u0026#34;: \u0026#34;\u0026lt;contents\u0026gt;\u0026#34; } That\u0026rsquo;s it. Force the model to commit to the action type before it generates any shared fields. Once \u0026quot;type\u0026quot;: \u0026quot;WriteFile\u0026quot; is in the output, everything downstream is conditioned on it, and the ambiguity disappears.\nThis isn\u0026rsquo;t just my anecdote. The Predibase/LoRAX blog documents the same phenomenon quantitatively: a model fine-tuned to output fields in a specific order, forced by alphabetical schema ordering to generate them differently, saw accuracy drop from 0.804 to 0.650. A 15-point degradation from field reordering alone. The wrong ordering also caused infinite whitespace generation loops.\nOpenAI\u0026rsquo;s documentation confirms that outputs are produced in schema key order. Google\u0026rsquo;s Vertex AI added a dedicated propertyOrdering field for explicit control. Dataiku recommends structuring JSON so that reasoning-dependent content is generated before outcome-dependent content.\nThe general principle is this: the model should encounter decisive fields before ambiguous ones. If you have a discriminated union, put the discriminator first. 
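As a concrete sketch, here is one way the fixed, discriminator-first union might be written as a JSON Schema, using the standard anyOf and const keywords (the exact shape your provider expects may differ):

```python
# Sketch: the same union expressed as a JSON Schema, discriminator first.
# Providers that emit fields in schema key order will then force the
# model to commit to the action before any shared fields.
read_file = {
    'type': 'object',
    'properties': {
        'type': {'const': 'ReadFile'},     # discriminator at position zero
        'path': {'type': 'string'},
    },
    'required': ['type', 'path'],
}
write_file = {
    'type': 'object',
    'properties': {
        'type': {'const': 'WriteFile'},
        'path': {'type': 'string'},
        'contents': {'type': 'string'},
    },
    'required': ['type', 'path', 'contents'],
}
action = {'anyOf': [read_file, write_file]}
```

Python dicts preserve insertion order, so serializing this schema keeps the discriminator in position zero.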
If your schema has both a reasoning field and an answer field, put reasoning first\u0026mdash;otherwise the model commits to an answer before it finishes thinking.\nAlways describe the schema in your prompt #There\u0026rsquo;s a related insight that also follows directly from the mechanism, but which I initially found counterintuitive: you should describe the expected JSON structure in your prompt text, even when you\u0026rsquo;re already using schema-based constrained decoding.\nThe reason is that constrained decoding only masks invalid tokens. It does nothing to increase the probability of the correct valid tokens. If the model\u0026rsquo;s natural distribution assigns low probability to the output you want, you\u0026rsquo;re sampling from the long tail of the distribution where everything is roughly equally improbable. The filter ensures the output is valid JSON, but it doesn\u0026rsquo;t ensure it\u0026rsquo;s good JSON.\nDataiku explains this clearly: the LLM is \u0026ldquo;unaware\u0026rdquo; of the constraints when computing next-token probabilities. Specifying the constraint in the prompt reduces the gap between the unconstrained and constrained probability distributions. vLLM\u0026rsquo;s documentation makes the same point: indicating in the prompt that JSON should be generated, and describing which fields to fill and how, improves results notably. A recent paper on draft-conditioned constrained decoding confirms the principle\u0026mdash;generating an unconstrained draft first, then applying constrained decoding conditioned on it, improves accuracy by up to 24 percentage points on GSM8K.\nIn practice: describe the schema\u0026rsquo;s fields and their semantics in your system prompt. Give examples. Explain what each action type means and when to use it. 
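One low-effort way to follow this advice, sketched below with a hypothetical action table, is to generate the prompt section from the same data that produces the schema, so the two cannot drift apart:

```python
# Hypothetical action table; in a real agent this would be derived from
# the same source of truth as the JSON schema.
ACTIONS = {
    'ReadFile': 'Read the file at path and return its contents.',
    'WriteFile': 'Create or overwrite the file at path with contents.',
}

def schema_prompt_section(actions):
    # Describe the structure in prose so that the natural distribution
    # already favors the constrained output; the grammar then acts only
    # as a safety net.
    lines = [
        'Respond with exactly one JSON object.',
        'The first field must be type, naming the action.',
        'Available actions:',
    ]
    for name, doc in actions.items():
        lines.append('- ' + name + ': ' + doc)
    return '\n'.join(lines)

system_prompt = schema_prompt_section(ACTIONS)
```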
Treat the schema constraint as a safety net, not as the primary guidance mechanism.\nThe bug landscape #I should note that even with careful schema design and prompt engineering, the constrained decoding ecosystem remains structurally fragile. There are critical, sometimes interacting bugs across every major implementation, and a few patterns worth being aware of.\nInfinite loops and hangs. These appear everywhere. llama.cpp#10321 documents crashes from recursive grammars. Outlines#658 shows large schemas consuming 32GB+ RAM. SGLang#7639 reports recursive schemas crashing both xgrammar and Outlines backends\u0026mdash;only llguidance handles them, making deployments vulnerable to adversarial inputs.\nThinking mode conflicts. If you\u0026rsquo;re using reasoning models with structured output\u0026mdash;which is likely how you encountered this article\u0026mdash;the interaction is particularly unpleasant. There are various issues (Ollama#10538, Ollama#15260, SGLang#6675) that plague thinking mode combined with structured output.\nXGrammar\u0026rsquo;s incomplete JSON Schema support. XGrammar is the default backend in both vLLM and SGLang, so its limitations propagate widely. The tracking issue is vllm#12131: missing $ref support (vllm#10935), missing minItems/maxItems breaking tool calls (vllm#16880), and complex schemas hanging the server without cancellation (vllm#14151).\nThese aren\u0026rsquo;t edge cases you\u0026rsquo;ll never hit. If you\u0026rsquo;re building agentic systems on open-source models, you will likely encounter at least one of them, and knowing that the issue is in the infrastructure rather than in your code will save you time.\nAPI providers have a structural advantage #There\u0026rsquo;s a reason constrained decoding tends to work better through OpenAI, Anthropic, and Google than with open-source setups, and it goes beyond model quality. 
API providers do two things: they fine-tune their models to understand and produce structured output, and they apply constrained decoding on top as a guarantee layer.\nOpen-source constrained decoding can only do the second thing. The model was never specifically trained to produce the schema you\u0026rsquo;re asking for. All the work is done by the token mask, which\u0026mdash;as we\u0026rsquo;ve established\u0026mdash;only filters, never guides. This is why the \u0026ldquo;describe the schema in your prompt\u0026rdquo; advice matters disproportionately for open-source: you\u0026rsquo;re compensating for the missing fine-tuning step with prompt engineering.\nOpenAI explicitly describes their dual approach. Anthropic launched native structured outputs in November 2025, with constrained decoding, schema compilation, and 24-hour caching. JSONSchemaBench, an independent benchmark of 10K real-world schemas, found 2x differences in schema support across frameworks\u0026mdash;the implementation details matter a great deal.\nWrapping up #If I could go back and give myself the briefing before I started building:\nPut discriminator fields first. If your schema represents a union of action types, type goes at position zero. This single change would have saved me those two days, and I\u0026rsquo;m still a little salty about it.\nDescribe the schema in your prompt. The constraint only masks invalid tokens\u0026mdash;it doesn\u0026rsquo;t make correct tokens more likely. Your prompt is what moves probability mass. Examples, field descriptions, explanations of when each action type applies\u0026mdash;all of it helps.\nPut reasoning before conclusions. If your schema has both a reasoning field and an answer field, reasoning goes first. Let the model think before it commits.\nKnow the performance cost. On Ollama with a discrete GPU, grammar enforcement runs on CPU and can slow generation by 10\u0026ndash;50x. 
If that\u0026rsquo;s a problem, look into SGLang, vLLM, or XGrammar + mlx-lm on Apple Silicon.\nExpect bugs. Especially around recursive schemas, thinking-mode models, and XGrammar\u0026rsquo;s incomplete JSON Schema coverage. Keep schemas simple, test edge cases, and if something fails inexplicably, check the issue trackers before spending three days blaming yourself. Unlike me.\n","date":"10 April 2026","permalink":"/posts/constrained-json-decoding/","section":"Articles","summary":"Constrained JSON decoding works by filtering logits, not by teaching the model. This has consequences for performance, correctness, and schema design that are easy to miss.","title":"The unexpected implications of constrained JSON decoding"},{"content":"The way we interact with AI is surprisingly primitive. You have a chat window. You type a message. The AI responds. You type another message. It\u0026rsquo;s linear, sequential, one-thread-at-a-time\u0026mdash;the same interaction model as a 1990s IRC channel.\nBut the things we build with AI are not linear. They\u0026rsquo;re documents with sections, codebases with files, images with regions. When you\u0026rsquo;re working on a structured artifact with an AI, you\u0026rsquo;re not having one conversation\u0026mdash;you\u0026rsquo;re having many, about many different parts of the thing, all at once. The chat window forces you to serialize all of that into a single stream. You end up splicing multiple topics into one thread, accumulating TODO lists, mentally tracking what was addressed and what wasn\u0026rsquo;t, scrolling back and forth to maintain context. Things get lost. Things get tangled.\nWhat you actually want is to point at a specific part of the artifact and start a conversation right there\u0026mdash;and have that conversation exist independently from all the others. 
You want to be able to jump between topics without losing context, revisit earlier discussions without scrolling, and work on multiple things in parallel without waiting for the AI to finish one before starting another.\nIn other words, you want threaded, anchored conversations\u0026mdash;many parallel interactions, each tied to a specific piece of the artifact you\u0026rsquo;re working on.\nHere\u0026rsquo;s the thing\u0026mdash;nothing about my solution is particularly ground-breaking. I\u0026rsquo;m almost embarrassed to publish this as if it were some kind of revolutionary idea, almost as much as I\u0026rsquo;m embarrassed by the fact that it\u0026rsquo;s taken me this long to have it in the first place. Honestly, I wouldn\u0026rsquo;t be surprised if I found out most of the world was already doing this, and I\u0026rsquo;m the last one to the party. It\u0026rsquo;s so stupid.\nYet I can\u0026rsquo;t help but be amazed at the impact it had on my everyday work, and now I\u0026rsquo;m pissed every time I find myself in a situation where I can\u0026rsquo;t use it.\nDevelopers already have this #If you write code, you already use this interaction model every day. It\u0026rsquo;s called a pull request. A PR gives you threaded discussions anchored to specific lines. You can jump between files, start conversations wherever you want, revisit earlier threads, resolve them when they\u0026rsquo;re done, and see the full context of each discussion. Nothing gets lost, nothing gets tangled. Each thread is its own mini-conversation.\nThe principle generalizes beyond code. Any artifact with addressable parts\u0026mdash;paragraphs in a document, regions in an image, timestamps in a video\u0026mdash;could support this kind of interaction. 
It just so happens that for code, the tooling already exists and works well.\nMaking Claude participate #The missing piece was getting Claude to actually live in this interface\u0026mdash;not as a one-shot tool, but as an active participant that responds to conversations in-place. Claude Code\u0026rsquo;s recent addition of scheduled tasks made this particularly easy to set up.\nThe setup is simple. I tell Claude Code to:\nCreate a PR with the generated code Start a background cron that periodically checks the PR for new comments For each new comment: read the context, make the fix, reply to the thread, and push Then I leave the chat and live entirely in the PR. I start conversations on specific lines, ask questions, request changes. Claude picks up each thread on the next poll cycle, responds, and pushes. What\u0026rsquo;s key is that I don\u0026rsquo;t wait for that to happen\u0026mdash;I just continue reading and firing off comments. At some point, I go back, check the results, continue the conversation in each thread if needed, or resolve and move on.\nThe difference is night and day. Instead of a single serial conversation where I have to queue up all my thoughts and track their resolution, I have a dozen parallel ones, each with its own context, each progressing independently. I can leave six comments across three files, go do something else, and come back to find all of them addressed. I don\u0026rsquo;t have to babysit the chat.\nTry it #I\u0026rsquo;ve published the Claude Code skill and supporting scripts as a Gist, or you can just ask Claude to build it for you from scratch.\nOne detail worth mentioning: fixes are amended into the original commits, not added as separate \u0026ldquo;address review\u0026rdquo; commits. The history stays clean, as if the fix was always there. 
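The poll cycle described above can be sketched roughly as follows. The gh api call is the standard GitHub CLI passthrough to the REST route for pull-request review comments; the handler and surrounding names are hypothetical:

```python
import json
import subprocess

def new_comments(comments, seen_ids):
    # Pure helper: review comments we have not handled yet.
    return [c for c in comments if c['id'] not in seen_ids]

def fetch_review_comments(repo, pr_number):
    # 'gh api' passes the request straight through to the GitHub REST API.
    out = subprocess.run(
        ['gh', 'api', 'repos/%s/pulls/%d/comments' % (repo, pr_number)],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)

def poll_once(repo, pr_number, seen_ids, handle_thread):
    # One cycle of the loop: fetch, diff against what we have seen, hand
    # each new thread to the agent (read context, fix, reply, push),
    # and remember the ids.
    for comment in new_comments(fetch_review_comments(repo, pr_number), seen_ids):
        handle_thread(comment)
        seen_ids.add(comment['id'])
```

Wrapped in a sleep loop or a scheduled task, this is essentially the whole babysitter; everything interesting happens inside the handler.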
To set it up:\nCopy skill.md into .claude/skills/babysit-pr/ in your project Put the three shell scripts somewhere on your PATH (or adjust the paths in the skill file) You\u0026rsquo;ll need gh (GitHub CLI) and jq Run /babysit-pr https://github.com/you/repo/pull/42, or just tell Claude to babysit a PR Stop it anytime by closing the Claude session, or asking it to stop babysitting the PR.\nThe bigger picture #It\u0026rsquo;s worth stepping back and noticing what I actually did here: I routed my AI interactions through GitHub pull requests. That\u0026rsquo;s not a natural (or particularly secure) home for this\u0026mdash;it\u0026rsquo;s a workaround. A pretty good one, but still\u0026mdash;I\u0026rsquo;m piggybacking on code review infrastructure because it happens to have the threading model I want.\nSo, why doesn\u0026rsquo;t this exist natively? Chat became the default AI interaction model, I think, mostly because LLMs arrived as chatbots. That made sense at first. But somewhere along the way, AI went from \u0026ldquo;thing I ask questions\u0026rdquo; to \u0026ldquo;thing I build stuff with,\u0026rdquo; and the interface didn\u0026rsquo;t really keep up. When you\u0026rsquo;re collaborating on an artifact, the artifact is the main thing. The conversation is secondary\u0026mdash;it\u0026rsquo;s how you shape the artifact, not the point of the interaction.\nAt the same time, it should be said that threaded conversations on artifacts aren\u0026rsquo;t a new idea at all. Google Docs has had comment threads on paragraphs for years. Figma lets you pin discussions to specific points on a design. PRs anchor threads to lines of code. The pattern is well-established. It\u0026rsquo;s just that when AI got added to these kinds of workflows, it mostly showed up as a sidebar chat rather than plugging into the threading that was already there. 
I\u0026rsquo;m sure there are tools that do this right, but most of the time the interaction model for AI, especially on the web and in agents, is still a chat window.\nI think there\u0026rsquo;s something to the idea of AI agents, and AI web interfaces, supporting this natively\u0026mdash;not \u0026ldquo;here\u0026rsquo;s a chat window, and also here\u0026rsquo;s your document,\u0026rdquo; but \u0026ldquo;here\u0026rsquo;s your document, and you can talk to the AI about any part of it, right there.\u0026rdquo; Each thread as its own conversation with its own context. Six discussions going at once on six different parts of the thing, none of them stepping on each other. Claude Code sorta-kinda took a step in this direction with its recent addition of /btw, but that\u0026rsquo;s still a long way from what I\u0026rsquo;m talking about here.\nAnd it doesn\u0026rsquo;t have to be code. Any artifact with addressable parts could work this way\u0026mdash;a contract where you\u0026rsquo;re discussing clause 4.2 in one thread and the indemnification section in another, a data pipeline config where you\u0026rsquo;re asking about the transform step separately from the source connector, a design mockup with one conversation about the navigation and another about the color palette. Even an image region, or a section of an audio file.\nFor now, I\u0026rsquo;ve got a polling loop built on gh and jq, and honestly, it works better than I expected. But I\u0026rsquo;d love to not have to need it.\n","date":"23 March 2026","permalink":"/posts/babysit-pr/","section":"Articles","summary":"Chat is an inconvenient interaction model for working with AI on structured artifacts. 
Here\u0026rsquo;s what I use instead.","title":"A better way to interact with AI — threaded conversations on artifacts"},{"content":"","date":null,"permalink":"/tags/ai-interaction/","section":"Tags","summary":"","title":"AI interaction"},{"content":"","date":null,"permalink":"/tags/claude-code/","section":"Tags","summary":"","title":"claude code"},{"content":"","date":null,"permalink":"/tags/developer-tooling/","section":"Tags","summary":"","title":"developer tooling"},{"content":"","date":null,"permalink":"/tags/workflow/","section":"Tags","summary":"","title":"workflow"},{"content":"","date":null,"permalink":"/tags/distributed-systems/","section":"Tags","summary":"","title":"distributed systems"},{"content":"","date":null,"permalink":"/tags/goto/","section":"Tags","summary":"","title":"goto"},{"content":"","date":null,"permalink":"/tags/microservices/","section":"Tags","summary":"","title":"microservices"},{"content":"","date":null,"permalink":"/tags/services/","section":"Tags","summary":"","title":"services"},{"content":"","date":null,"permalink":"/tags/structured-concurrency/","section":"Tags","summary":"","title":"structured concurrency"},{"content":"","date":null,"permalink":"/tags/structured-cooperation/","section":"Tags","summary":"","title":"structured cooperation"},{"content":"Now that we\u0026rsquo;ve talked at length about what you get when you apply the rule of structured cooperation\u0026mdash;and I hope I\u0026rsquo;ve convinced you that it\u0026rsquo;s quite a bit\u0026mdash;and also how to implement it, I want to finally talk about why it works. What makes that rule so special, and why is it so unreasonably effective?\nAs it turns out, these questions actually have an answer, and structured cooperation didn\u0026rsquo;t just fall out of the sky. 
In fact, as we explore the answers and take a journey through some of the most fascinating parts of programming language history, you will find that it is actually just the newest incarnation of a principle, an idea, we\u0026rsquo;ve been applying for over half a century, and that has fundamentally affected each and every piece of code written during that time.\nStanding on the shoulders of giants #I have to confess that I\u0026rsquo;m actually a little embarrassed to put structured cooperation and Scoop out with my name next to it, because essentially all I\u0026rsquo;ve done is plagiarize the work of two giants\u0026mdash;Nathaniel J. Smith and Roman Elizarov, along with the rest of the Kotlin team.\nWhile Nathan emphasizes he\u0026rsquo;s not the author of structured concurrency, his phenomenal article, Notes on structured concurrency, or: Go statement considered harmful, introduced me to the idea, and his explanation is what allowed me to recognize that the way we currently design distributed systems suffers from fundamentally the same type of problem that structured concurrency was invented to solve. I really can\u0026rsquo;t oversell his article\u0026mdash;it\u0026rsquo;s been years since I first read it, I\u0026rsquo;ve read it many times since, and I will never shut up about it. I\u0026rsquo;ve mentioned it in my talks, as did Roman Elizarov in his, and I think it should be considered required reading for every single developer. I think it\u0026rsquo;s one of the best programming articles out there, period, because apart from doing an incredible job of explaining a non-trivial subject, between the lines, it reveals something very fundamental about the essence of programming. 
If you haven\u0026rsquo;t read it yet, I thoroughly recommend you do so, along with his article on timeouts and cancellations, which I copied pretty much verbatim in Scoop.\nRoman needs no introduction to anyone even vaguely aware of Kotlin, of which he was the lead designer for years, publishing many insightful articles along the way. When you start learning about how Scoop is implemented, you\u0026rsquo;ll discover that it is little more than a crude implementation of distributed coroutines on top of Postgres. I wouldn\u0026rsquo;t have recognized that that was the proper way to model what I was trying to achieve if I hadn\u0026rsquo;t previously interacted with Kotlin\u0026rsquo;s coroutine implementation. I can say unequivocally that Scoop would not exist in its current form were it not for the work of Roman and his team\u0026mdash;not merely because I had an existing implementation that I could reference when I needed to, but most importantly because interacting with coroutines in Kotlin, along with their ecosystem, gave me perspective, and framing Scoop in that way allowed me to anticipate features and capabilities I should be implementing. Scoop is much more cohesive as a consequence.\nThere has never been a more fitting time to say: If I have seen further, it is by standing on the shoulders of Giants.\nGOTO statement considered harmful #Ask any programmer what they think of GOTO, and I\u0026rsquo;m willing to bet that the overwhelming majority answer with \u0026quot;GOTO is bad and should not be used\u0026quot;. Now ask them why, and I\u0026rsquo;m willing to bet that an overwhelming majority can\u0026rsquo;t really give a satisfactory answer. 
The term 'spaghetti code' will probably be uttered in a Pavlovian reflex we've all been conditioned to have, but even asking about that will probably quickly lead to an argumentum ad populum – it's true because that's what everyone says.

Here's the thing, though—you don't actually need GOTO in your language to be able to write spaghetti code. For example, Nathan gives the following spectacular example of some incantations written in FLOW-MATIC, a language that has GOTO:

Here's essentially the same code, jumping around in exactly the same way, rewritten in JavaScript (or so Claude assures me), a language that has no GOTO:

```javascript
function flomaticProcessor() {
  let operation = 0;
  while (true) {
    switch (operation) {
      case 0:
        if (inventoryIndex < inventoryFile.length) currentInventoryItem = inventoryFile[inventoryIndex];
        if (priceIndex < priceFile.length) currentPriceItem = priceFile[priceIndex];
      case 1:
        if (!currentInventoryItem) { operation = 14; break; }
        if (!currentPriceItem) { operation = 12; break; }
        if (currentInventoryItem.productNo > currentPriceItem.productNo) operation = 10;
        else if (currentInventoryItem.productNo === currentPriceItem.productNo) operation = 5;
        else operation = 2;
        break;
      case 2:
        unpricedInvFile.push({ ...currentInventoryItem });
        operation = 8;
        break;
      case 5:
        pricedInvFile.push({ ...currentInventoryItem, unitPrice: currentPriceItem.unitPrice });
      case 8:
        inventoryIndex++;
        if (inventoryIndex >= inventoryFile.length) { currentInventoryItem = null; operation = 14; break; }
        currentInventoryItem = inventoryFile[inventoryIndex];
        operation = operation9Target;
        break;
      case 10:
        priceIndex++;
        if (priceIndex >= priceFile.length) { currentPriceItem = null; operation = 12; break; }
        currentPriceItem = priceFile[priceIndex];
        operation = 1;
        break;
      case 12:
        operation9Target = 2;
        operation = 2;
        break;
      case 14:
        if (!currentPriceItem) { operation = 16; break; }
        if (currentPriceItem.productNo === "ZZZZZZZZZZZZ") operation = 16;
        else operation = 15;
        break;
      case 15:
        priceIndex = 0;
        currentPriceItem = priceFile[priceIndex];
      case 16:
        return "PROGRAM_END";
      default:
        return "ERROR";
    }
  }
}
flomaticProcessor();
```

It's clear that the WTFs/min for this function have no upper bound, but nobody's in any hurry to ban switch statements and while loops—it's clear that spaghettification susceptibility isn't in itself a sufficient reason to remove a language feature. So why did we ban GOTO, and why do we all think delicious pasta is the reason?

In what manner doth the Spaghetti Monster squiggle? #

We all intuitively understand that the code above is abysmal, but it might not be immediately clear how to formulate what is wrong with it. The code is clearly very difficult to make sense of, but why?

Fundamentally, it's because we can't organize the path of execution into a hierarchy—the path the code takes forms a graph, not a tree. Without this property, in order to determine what path a program will take, we need to be aware of the contents of each "node" (intuitively representing a function call, but in general any "block of statements") in the diagram above. Any one of them could at any point send the execution flow someplace else, and the only way to find out is to open each and every one and see what it does.
As a consequence, we're deprived of one of the key faculties needed to grasp complex code—the ability to think in terms of black boxes.

If you spend some time thinking about it, you will realize that in order to be able to form this kind of execution flow hierarchy, the code must be structured in such a way that it follows what Nathan calls the "black box rule": whenever the flow of execution goes into something, it must be guaranteed to always, at some point, come back out. That "something" could be anything representing a group of statements—a function, an if statement, a for loop, whatever you like.

Here's the key thing: even though the JavaScript example above breaks that rule, it only breaks it locally. By that, I mean I can take that whole mess, throw it in a function, close my eyes and forget what I just saw, and that function will behave just like any other function would—execution flow will go in, it'll do its thing, and then it'll come out. There is no way for the function (or almost any other modern construct, for that matter1) to wrest control away from the caller and not give it back. Whatever mess the block of code decides to inflict on itself, it must always eventually yield control back to its caller.

But the old-school GOTO, in its full, unfettered glory, allows any piece of code to jump to any other piece of code anywhere in the program. By doing that, it allows anyone to take over the execution flow and not give it back, and there's fundamentally no way to control or restrict that power. The only way to guarantee that it is not (mis-)used is by reading every single instruction that comprises the entire execution path, and verifying that either GOTO isn't used, or it's used in a way that doesn't break the black box rule.
Having to open, parse, and understand each and every function that is called, and the functions they call, the functions they call, and so on, is fundamentally what breaks our ability to compartmentalize, which, in turn, breaks our ability to reason about code of any real complexity. If messy execution flow were cancer (which it is), GOTO is what allows it to metastasize.

So spaghettification is not, in fact, the reason why GOTO was banned, but rather its ability to potentially make the spaghetti strand squiggle across the entire codebase. Even with GOTO gone, you can still make spaghetti, but each strand can only squiggle within the confines of some delimited code block (typically a function), and must eventually always come out again.

This fundamental problem with GOTO was described by none other than E. W. Dijkstra (pronounced Dyke-strah—now you finally know), when, in 1968, he published Go To Statement Considered Harmful. It's only a page and a half; I suggest you read it. There, he argues that GOTO "should be abolished from all 'higher-level' programming languages", triggering an industry-wide tantrum. As Nathan writes:

From here in 2018, this seems obvious enough. But have you seen how programmers react when you try to take away their toys because they're not smart enough to use them safely? Yeah, some things never change. In 1969, this proposal was incredibly controversial. Donald Knuth defended goto. People who had become experts on writing code with goto quite reasonably resented having to basically learn how to program again in order to express their ideas using the newer, more constraining constructs.
And of course it required building a whole new set of languages.

But, as we now know, cooler heads prevailed, and history reduced GOTO to the butt of jokes.

One door closes, many others open #

Enforcing a hierarchy on execution flow by removing GOTO, and requiring that all constructs obey the black box rule, bore fruit beyond "just" the fact that it made code much easier to reason about. It also allowed us to make the position in this hierarchy explicit, which is pretty much what we know today as a call stack. That, in turn, allowed us to build exceptions, complete with stack unwinding and stack traces. Can you even imagine maintaining a codebase without having a stack trace at your disposal when something breaks? If not, try writing some reactive code, see how long it takes for you to cry2. And crucially, this would have been very difficult, if at all possible, to do if GOTO were still around.

Other features enabled by killing off GOTO are the various flavors of resource-handling constructs, e.g. try-with-resources. How would that even work if you could just jump into the middle of a try block from wherever?3 What about garbage collection? Various compiler optimizations? None of those would be possible in the way we're used to if we kept GOTO alive. Removing GOTO led directly to the advent of many of the features we now take for granted, features that wouldn't be possible otherwise.

This should come as no surprise—restricting the set of permissible programs allows you to rely on patterns that become enforced by those restrictions, and you can take advantage of those patterns to build new features.
That's also fundamentally why the capabilities of IDEs for statically typed languages are (and always will be) better and more reliable than those for dynamic ones.

In programming, as in other things, less is often more.

Visualizing control flow #

Nathan introduces an excellent visual notation that I will be switching to for the remainder of this post, and one that clearly shows how and why GOTO is different from other language constructs. The notation represents the "shape" of execution flow, but frankly it's so intuitive that using words to describe it only makes it less so.

Here are the depictions of the constructs we're familiar with:

Here's GOTO:

As Nathan writes:

For everything except goto, flow control comes in the top → [stuff happens] → flow control comes out the bottom. We might call this the "black box rule": if a control structure has this shape, then in contexts where you don't care about the details of what happens internally, you can ignore the [stuff happens] part, and treat the whole thing as regular sequential flow. And even better, this is also true of any code that's composed out of those pieces. When I look at this code:

```
print("Hello world!")
```

I don't have to go read the definition of print and all its transitive dependencies just to figure out how the control flow works. Maybe inside print there's a loop, and inside the loop there's an if/else, and inside the if/else there's another function call… or maybe it's something else.
It doesn't really matter: I know control will flow into print, the function will do its thing, and then eventually control will come back to the code I'm reading.

Structured concurrency—GO statement considered harmful #

Fast forward about half a century—the year is now 2016, GOTO is dead and buried, and concurrent programming is all the rage. Every major language has introduced some sort of primitive that represents "spawn a separate branch of execution, and execute both branches concurrently"—the eponymous go of Go, new Thread(...).start() in Java, and so on.

Some suspiciously familiar problems were arising in that area. Concurrent programs were notoriously difficult to reason about, error handling was error-prone (the industry default for handling errors in something spawned from the main thread was—and still is—often just "drop the error and hope for the best"[1], [2], [3], [4]), and when something did inevitably break, debugging was a nightmare. Stack traces only showed information about the stack of the thread that failed, but it was impossible to determine where it was spawned from or under what conditions, unless you somehow kept track of that yourself.

But there were also new problems. Race conditions and synchronization became a thing you needed to deal with, and a slew of APIs were built to address this issue, including synchronized methods, various flavors of mutexes, semaphores, and other things. You needed to figure out how, when, and when not to share data between different branches of execution, leading to things like atomic variables, ThreadLocal, and an entirely separate set of concurrency-safe collections. If you screwed up, you would at best get an exception, at worst a deadlock or silent bug, and it would only happen sometimes.
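To make that "drop the error and hope for the best" default concrete, here's a small JavaScript sketch. The spawn helper is hypothetical, standing in for go or new Thread(...).start(); it is not a real API:

```javascript
// Hypothetical fire-and-forget "spawn", standing in for go / new Thread(...).start().
// The caller gets no handle to the spawned work, so the only options are to
// swallow errors or crash the whole process -- here we swallow, as is tradition.
function spawn(fn) {
  fn().catch(() => { /* drop the error and hope for the best */ });
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function main() {
  spawn(async () => {
    throw new Error("boom"); // no caller ever observes this failure
  });
  await sleep(10);
  // main() finishes "successfully"; the failure above simply vanished
  return "done";
}
```

Running main() resolves to "done" even though a spawned task blew up along the way—nothing connects the failure back to the code that spawned it.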
God forbid you needed to deal with something non-trivial, like timeouts, interruptions, or cancellations (think Ctrl-C).

Needless to say, all of this is really difficult, and all of these difficulties compound. Concurrent programming is a pain. And as it so happens, this is the control flow diagram for all of the primitives that were mentioned above (Nathan labels them all as go):

Coincidence? I think not!

I'll defer to Nathan's article for the blow-by-blow, but suffice it to say that, yes, the similarity of the diagrams is no coincidence—thread-spawning mechanisms are guilty of the same type of sin as GOTO, and consequently cause the same type of problems. They don't do this by snatching control away from the caller, but rather by splitting the execution into separate strands, and only "returning" one to the caller. In that sense, you could argue that this is an even more precarious state of affairs than with GOTO, because instead of a single spaghetti strand, you can now have 2 or more strands squiggling through the codebase.

I should note that when I say "thread", I mean it in the general sense of "branch/path of execution"—it could be an OS thread, a green thread (e.g., a goroutine or virtual thread), a coroutine, a promise chain, or something else.

The solution? Structured concurrency. I recommend you read up on the concept if you're not familiar with it, since I won't be going into sufficient detail for my explanation to qualify as an introduction (Nathan's article is an excellent place to start).
However, I do want to walk through the essentials, because you'll immediately recognize their counterparts in structured cooperation.

In a nutshell, the basic idea is to mandate that separate threads of execution can only be launched from within a special code block—in Kotlin, it's the lambda parameter passed to the coroutineScope function; in Nathan's Trio, it's inside the with block that delimits the lifetime of the Nursery object; in Java (still in preview at the time of writing), it's inside the try-with-resources block that delimits the lifetime of the StructuredTaskScope object; in Swift, it's inside the lambda parameter passed to the withTaskGroup function, and so on.

Crucially, that code block does not exit until everything it spawned has finished running. If the execution of the spawning thread gets to the end of the block before all the threads it spawned have finished, it stays there and waits. This is the fundamental rule that structured concurrency is built around.

Here's a simple example in Kotlin that drives this home:

```kotlin
print("1")
coroutineScope {
    print("2")
    launch {
        delay(1000L)
        print("4")
    }
    print("3")
}
print("5")
// prints 12345, with a ~1s delay between printing 3 and 4
```

In most implementations, the code block is typically associated with an object whose lifetime is tied to it, i.e., it is created just before the code block is entered, and ceases to exist right after the code block is exited. In Kotlin, this object is the CoroutineScope; in Trio, it's the Nursery; in Java, it's the StructuredTaskScope; in Swift, it's the TaskGroup.
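The same container idea can be sketched in a few lines of JavaScript. This hypothetical Nursery/scope pair is not a real library, just an illustration of the rule built on Promises:

```javascript
// Hypothetical illustration of the structured concurrency rule, not a real library.
// The Nursery is the "container" object; scope() ties its lifetime to a code block
// and refuses to return until every task spawned into it has finished.
class Nursery {
  constructor() { this.tasks = []; }
  spawn(fn) { this.tasks.push(fn()); }
}

async function scope(block) {
  const nursery = new Nursery();
  try {
    await block(nursery);
  } finally {
    // The fundamental rule: wait for all children before the block exits
    await Promise.allSettled(nursery.tasks);
  }
  // Propagate the first child failure to the parent, if any
  for (const result of await Promise.allSettled(nursery.tasks)) {
    if (result.status === "rejected") throw result.reason;
  }
}

// Mirrors the Kotlin example above: builds up "12345" in order
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function demo() {
  let out = "1";
  await scope(async (nursery) => {
    out += "2";
    nursery.spawn(async () => {
      await sleep(20);
      out += "4";
    });
    out += "3";
  });
  return out + "5"; // "12345"
}
```

In Kotlin, Trio, Java, and Swift, the scope object mentioned above plays exactly the role this Nursery plays here.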
This object conceptually or literally represents a sort of "container" into which any spawned execution threads are placed, along with any child containers, and you can pass it around and spawn tasks "into" it from (lexically) outside the code block it represents (as long as the block hasn't finished in the meantime). Any such tasks are treated as if they were spawned from within the code block.

If this is the first time you're encountering structured concurrency, it might not be immediately obvious that this is an improvement over doing it any other way, but as anyone who has actually used it for any amount of time knows, and as the adoption across a wide range of languages and libraries backs up, this approach makes concurrent programming much, much easier to deal with. It by no means solves all the problems inherent to concurrent programming—concurrent programming is still hard, still full of unique challenges, and it's still easy to shoot yourself in the foot. But by imposing a hierarchy on the threads of execution and mandating that each parent needs to wait for its children, it cuts away entire classes of difficulties and reintroduces features that were previously lost, as you'll see in the next section.

Why?

Well, this is the control flow diagram of blocks that obey the rule of structured concurrency:

And as you can see, it obeys the black box rule—a single thread of execution goes in, and a single thread of execution comes out.

Coincidence? I think not.

One door closes, many others open #

Reasoning #

By restricting what programmers could do, we recovered many features that were lost when concurrent programming first took off. Chief among those, as you can probably guess, is the ability to think in black boxes.
Previously, any function anywhere, at any depth, could just decide to spawn a thread, and the only way to know for sure is to open up every function recursively and check.

In practice, that's actually not that much of a problem—spawning a separate thread of execution is usually (though not always) a pretty expensive operation, so people don't tend to spawn them willy-nilly just for the fun of it (and if they do, their code is unlikely to proliferate much). But there is another, more realistic concern: for functions that legitimately do spawn threads, even when they openly advertise it, you still have no guarantees about what happens to those threads when the function exits. Do they keep living in the background? Are they finished, and did the function clean them up? Will that always happen, even when exceptions get thrown? Do I need to kill them or manage them somehow? Is there any situation where they become my responsibility? Is there some orchestrator somewhere else that also interacts with them? And in order to answer those questions, you have no choice but to concern yourself with what that function does and how it's implemented. You can no longer treat it as a black box.

With structured concurrency, you don't need to worry about that, ever. If a function returns (either normally or by throwing an exception), everything it spawned has either finished and been cleaned up or was placed into some sort of "scope" (the container objects mentioned above) that needs to be passed in.
If that's the case, you can clearly see that the scope object is part of the function's signature, and therefore know that the threads are someone else's problem (specifically, whoever provided the scope object).

Synchronization #

When you tackle a problem by decomposing it into parts that run in parallel, structured concurrency also makes synchronization much easier—problems of the "don't do X until Y happens" persuasion. This, in turn, dramatically decreases the number of times you need to use lower-level primitives such as mutexes or synchronized methods. It makes the code more understandable and predictable—it's infinitely easier to understand a parallel algorithm written using structured concurrency than without it, and you're much less likely to make a mistake when writing it. Nathan himself attests to having the same experience with Trio.

Exceptions #

Structured concurrency also allows us to recover the concept of a call stack. Since each parent is guaranteed to wait for each child, a sensible, informative call stack can be obtained by essentially concatenating the call stack of the parent with the call stack of the child. This in turn allows us to a) have meaningful stack traces, and b) have a well-defined place to propagate exceptions to—the parent.

The latter is especially important since it allows you to manage errors properly. When an exception isn't handled in a child, it is simply rethrown in the parent. If it's unhandled there, all other children of the parent are cancelled/interrupted, and the exception is propagated up another level. It's basically just regular stack unwinding, except the stack is now essentially a tree, with each node being a place where multiple threads were spawned.
Unwinding past a branching point means first unwinding all of the other branches and then continuing up the tree.

Cancellations & Timeouts #

This also makes cancellations/interruptions much easier to deal with, as there is always a well-defined hierarchy across which the cancellation/interruption can propagate. There is no risk of forgetting to cancel something and leaving it hanging in a vacuum. Timeouts are another thing that became much easier to manage, since, again, you have a well-defined hierarchy across which you can track total execution time, and which you can cancel or interrupt if you need to.

Resources #

Finally, we also recover the ability to do reasonable resource handling. Previously, if you spawned a new thread in the middle of a with-resources block, you would have no guarantee the resource would actually still be available:

```
// pseudo code
with (someResource) {
    spawn {
        sleep(1 day)
        // Resource is no longer available here, because parent has long since exited
        doStuff(someResource)
    }
}
```

However, with structured concurrency, this is no longer a concern:

```
// pseudo code
with (someResource) {
    scope {
        spawn {
            sleep(1 day)
            // Resource is still available here, because the enclosing scope
            // waits for this task to finish
            doStuff(someResource)
        }
    } // Won't exit until everything is done
}
```

Much like with GOTO, moving away from GO-like constructs and introducing structured concurrency allowed us to go from a world of tangled spaghetti, where multiple threads of execution squiggled chaotically all over the codebase with no guarantees whatsoever, to a world where all strands were neatly organized in a well-defined hierarchy.
And by doing so, everything suddenly became much easier to manage, while also allowing us to recover many fundamental features that were previously lost.

Are you starting to see the pattern?

Structured cooperation—distributed GO considered harmful #

Our travels finally bring us to distributed systems, where, armed with our newfound understanding of programming language design, we see exactly the same problems that we discussed in the previous paragraphs.

And no wonder! GOTO allowed a single strand to squiggle across a single codebase. GO allowed a known number of strands to squiggle across a single codebase. But when emitting a message in a distributed system, you have no idea if, and how many, listeners it might be triggering. Each such listener corresponds to a new splitting of the thread of execution, and on top of that, it's executing in a completely different codebase as well. Distributed message-based systems allow a possibly unknown number of strands to squiggle across an unknown number of codebases!

It's funny how each time this problem appears, we've managed to amplify it and make it worse. We've effectively taken the problems of GOTO, combined them with the problems of GO, and spread the result across multiple applications. The result is a spaghetti octopus of messy execution flow that can only be reasonably called spaghettopus code.

By now, it should be completely obvious where I'm going with this—it's clear that we need to fix things so that the black box rule is obeyed. To do that, we can just do pretty much exactly what we did in the case of concurrency, and recover exactly the same features and properties as we did with structured concurrency.
Indeed, if you go through the previous section, you'll find that every single thing discussed has a practically identical counterpart in structured cooperation.

As it turns out, the title of the first article contained all the information you actually needed to read—structured cooperation, and all of its benefits, is literally what you get when you apply the principles of structured concurrency to distributed systems.

Conclusion #

In this series of articles, I've tried to convince you that the adoption of structured cooperation helps combat the inherent complexities that are often associated with distributed systems.

Just as the programming world learned to deprecate the unrestricted GOTO statement in favor of structured programming decades ago and more recently embraced structured concurrency to solve similar problems in concurrent operations, the observations above have led me to believe that we now face a similar evolutionary step in distributed environments. We've seen that the core problem lies in the uncontrolled flow of execution, where operations within a distributed system can jump unpredictably, making it difficult to reason about system behavior or impose any kind of hierarchy on the flow of execution. This absence of a clear, hierarchical execution flow—not following the "black box rule"—is what structured cooperation is designed to remedy.

By applying a single, simple rule—that a message handler does not proceed until all handlers of all emitted messages have completed—structured cooperation brings a lot to the table. It completely eliminates certain problems, such as race conditions due to eventual consistency. For us humans, it makes it easier to reason about, manage, and debug distributed systems.
And it reintroduces powerful features often lost in distributed contexts, such as true distributed exceptions, complete with stack traces that span multiple services, and what amounts to stack unwinding, again across multiple services. However, it doesn't force you into anything—you're always free to only apply structured cooperation to those things that need to be synchronized, while not using it (or launching an independent hierarchy) for those things that don't. This also allows you to introduce structured cooperation gradually.

Scoop serves as a practical proof-of-concept, demonstrating how the principles of structured cooperation can be implemented. Though not a production-ready library, Scoop showcases how one can implement many features needed from a non-blocking orchestration framework on top of a few well-chosen primitives—EventLoopStrategy and CooperationContext—while leveraging structured cooperation to facilitate step coordination and rollbacks.

Ultimately, structured cooperation is the latest iteration of an enduring idea that has consistently shaped software development for over half a century, and I do believe that by embracing the lessons history has taught us and applying the same idea once again, we can make all our lives much easier.

One notable exception to this is, well, exceptions, which is fundamentally why it's so easy to screw up a codebase by not using them diligently. This was the subject of one of my talks. ↩︎

I was stupid enough to start with the reactive version when prototyping Scoop. I cried a lot. ↩︎

This is one reason why it's difficult to build a practical language with full-blown continuations.
If you don\u0026rsquo;t understand what any of that means, don\u0026rsquo;t worry about it.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","date":"8 July 2025","permalink":"/posts/framing-structured-cooperation/","section":"Articles","summary":"Why is GOTO bad, why was structured concurrency invented, and how does that explain the unreasonable effectiveness of structured cooperation.","title":"The Unreasonable Effectiveness of Structured Cooperation"},{"content":"","date":null,"permalink":"/tags/scoop/","section":"Tags","summary":"","title":"scoop"},{"content":"In the previous article, I gave you a whirlwind tour of some of the features of Scoop, a POC implementation of something I\u0026rsquo;m calling structured cooperation. I showed you how making the components of a distributed system obey a single, simple rule made the resulting system easier to reason about, behave predictably in the presence of failures, and also allowed us to recover features we know from \u0026ldquo;regular\u0026rdquo; programming, such as distributed exceptions and resource handling. However, I didn\u0026rsquo;t go into too much detail about how exactly Scoop implements structured cooperation, and also glossed over some very fundamental issues that arise in this context\u0026mdash;most especially the topic of \u0026ldquo;how does Scoop figure out which handlers it should be waiting for in the first place.\u0026rdquo; This is what I\u0026rsquo;d like to discuss in this article.\nTo give you an idea about the level of abstraction this article is situated at\u0026mdash;if, in the previous article, I had talked about \u0026ldquo;what would we get if computers could communicate statelessly across the internet,\u0026rdquo; then this article would be about the HTTP protocol. 
In that metaphor, Scoop is a POC webserver, and the technical details of how that webserver is implemented are an equally interesting subject—in the case of the actual Scoop, it involves an implementation of distributed coroutines on top of Postgres. However, that's not something that we'll be discussing here. For that, I'll direct you to the README in the repository and the heavily commented codebase itself—I wrote it with the assumption that it was going to be read, and did my best to make it accessible and understandable.

I should also note that I often use the terms "saga" and "(message) handler" interchangeably, because in Scoop, they are—all message handlers are sagas, and all sagas are message handlers. Naturally, that is not the case outside of Scoop.

Implementing Structured Cooperation #

Let's restate the fundamental rule of structured cooperation:

A message handler (= saga) does not continue with the next step until all handlers of all messages emitted during the current step have finished executing.

To implement that, you need two things:

1. Keep track of which messages were emitted from which step of which saga
2. Be able to figure out if a handler has finished or not

Keeping track of where messages came from #

Scoop does this by associating a list of unique UUIDs, called a cooperation lineage, to each emitted message, and to each saga run.
Any time a message gets emitted somewhere, and a saga picks it up and starts running, that entire run (i.e., all the steps involved in that execution, including any rollbacks) is assigned a UUID (called a cooperation ID), which is appended to the cooperation lineage.

When a message is emitted from within a saga run, it is tagged with the cooperation lineage of the run and the name of the step, and any sagas that run as a result of this message build their own cooperation lineage by appending their unique cooperation ID to the cooperation lineage of the message, and so on. Any top-level message (i.e., one emitted outside a saga, or on the global scope, which is the same thing) has a cooperation lineage of length one.

Doing it like this has the effect of structuring the saga runs into a parent-child hierarchy, and allows us to refer to useful parts of this hierarchy. A specific run of a saga can be queried for as WHERE cooperation_lineage = <some UUIDs>. A run of a saga and its entire sub-hierarchy, i.e., any sagas that were triggered as the result of a message, directly or indirectly, can be queried for as cooperation_lineage is prefixed by <some UUIDs>1.

This "subtree" is conceptually (but not literally) enclosed by something called a CooperationScope, which intuitively corresponds to the three large colored rectangles on the picture above. You will find an object called CooperationScope in the code, and it is meant to represent something that spans the entire saga run, and all its children. If you're familiar with structured concurrency, this is, to an extent, the conceptual counterpart of something found in almost all implementations—in Kotlin coroutines, it's the CoroutineScope; in Trio, it's the Nursery; in Java, it's the StructuredTaskScope; in Swift, it's the TaskGroup.
It's also the type of that scope parameter you've been seeing in all the code examples, and the picture above lends interpretation to "launching a message on a scope."

If the stuff about CooperationScope is too confusing or abstract for you, don't worry about it; you don't need to understand all that to understand structured cooperation. It's more about connecting it to other existing concepts some people might be familiar with.

Keeping track of execution state #

Anytime something interesting happens, e.g., a handler sees a message published on the topic it's subscribed to, or finishes executing a step, or a rollback gets triggered, or the entire saga run finishes, etc., Scoop makes a note of that fact. It does so by writing an entry into a special table, called message_events, that stores, well, events related to message handling.

That table serves two purposes:

1. To determine if a given saga is ready to move on to the next step
2. To build the state necessary to actually run that step (we need to know things like what step we should execute next, if any children failed during the last one, if a rollback is already in progress, etc.)

I think it might be easier to show you what it looks like before I dive any deeper.
Here is some (simplified) code that defines two sagas and publishes a message:

```kotlin
messageQueue.subscribe(
    "root-topic",
    saga("root-handler") {
        step { scope, message -> scope.launch("child-topic", JsonObject()) }
        step { scope, message -> }
    },
)

messageQueue.subscribe(
    "child-topic",
    saga("child-handler") {
        step { scope, message -> }
        step { scope, message -> }
    },
)

messageQueue.launch("root-topic", JsonObject())
```

And here's what message_events would look like after this finishes running, ordered by time:

| message_id | type | coroutine_name | coroutine_identifier | step | cooperation_lineage |
|---|---|---|---|---|---|
| b796 | EMITTED | null | null | null | {56b9} |
| b796 | SEEN | root-handler | 5191 | null | {56b9,5310} |
| 4501 | EMITTED | root-handler | 5191 | 0 | {56b9,5310} |
| b796 | SUSPENDED | root-handler | 5191 | 0 | {56b9,5310} |
| 4501 | SEEN | child-handler | 62b8 | null | {56b9,5310,6817} |
| 4501 | SUSPENDED | child-handler | 62b8 | 0 | {56b9,5310,6817} |
| 4501 | SUSPENDED | child-handler | 62b8 | 1 | {56b9,5310,6817} |
| 4501 | COMMITTED | child-handler | 62b8 | 1 | {56b9,5310,6817} |
| b796 | SUSPENDED | root-handler | 5191 | 1 | {56b9,5310} |
| b796 | COMMITTED | root-handler | 5191 | 1 | {56b9,5310} |

I truncated all UUIDs to the last four characters to save visual space, and I'm also leaving out a few columns for the same reason: id, the primary key of the row; created_at, which contains the timestamp the row was inserted; exception, which only comes into play when dealing with rollbacks; and context, which is used to share data across steps, or even across different parts of the cooperation hierarchy. We'll talk more about the latter two later on. In this example, both columns are null everywhere.

| Column | Explanation |
|---|---|
| message_id | The id of the associated message in the messages table |
| type | The identifier of the type of event that occurred. More on this below |
| coroutine_name | The name of the handler/saga, if present |
| coroutine_identifier | The identifier of the handler/saga (for situations where the service is scaled horizontally) |
| step | The step name (or its ordinal, if not specified), if applicable |
| cooperation_lineage | Explained above |

With this setup, adhering to the rule of structured cooperation essentially boils down to building a (moderately complicated) query that checks if a saga is ready to proceed or not. In Scoop, that amounts to asking "for all messages emitted in the previous step, does every SEEN have a corresponding COMMITTED?" (plus the equivalent for rollbacks, which we'll get to).

In other words, when you strip it down, Scoop is essentially this:

1. Provide a way to publish messages to topics.
2. Provide a way to associate an object (= saga/handler) containing a list of functions (= steps) with a topic.
3. When that association is created, launch a periodic process that, on each "tick":
   - Checks if an as-yet-unseen message was emitted to that topic. If so, it launches a new saga run (which amounts to writing the SEEN record above). Also performs something similar for rollbacks.
   - Runs the query mentioned above and checks if any existing run is ready to continue. If so, it uses the contents of message_events to pick the correct function (step) to run, and invokes it. When it finishes, it writes the appropriate row to message_events, depending on the result2.

This also makes it non-blocking by design, i.e., there aren't any processes waiting around until something else finishes.
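The heart of that tick, the readiness check, can be sketched as an in-memory simulation in plain Kotlin. The event model and names below are simplified for illustration; they are not Scoop's actual internals:

```kotlin
// Toy model of a message_events row, reduced to what the check needs.
data class MessageEvent(
    val messageId: String,
    val type: String,            // EMITTED, SEEN, SUSPENDED, COMMITTED, ...
    val step: Int?,              // step ordinal, if applicable
    val lineage: List<String>,   // cooperation lineage, a list of UUIDs
)

// "For all messages emitted in the previous step, does every SEEN
// have a corresponding COMMITTED?"
fun readyToResume(events: List<MessageEvent>, lineage: List<String>, lastStep: Int): Boolean {
    val emittedIds = events
        .filter { it.type == "EMITTED" && it.lineage == lineage && it.step == lastStep }
        .map { it.messageId }

    return emittedIds.all { id ->
        events
            .filter { it.messageId == id && it.type == "SEEN" }
            .all { seen ->
                // the child run is identified by its extended lineage
                events.any {
                    it.messageId == id && it.type == "COMMITTED" && it.lineage == seen.lineage
                }
            }
    }
}
```

One detail worth noticing: a message that nobody has written a SEEN for yet passes this check trivially, which is exactly the "which handlers should we be waiting for?" question the EventLoopStrategy section below grapples with.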
And that's pretty much all there is to it.

With one exception.

EventLoopStrategy #

This entire time, we've been skirting a fundamental issue, one that profoundly influences the properties of the system you end up building with Scoop.

The rule of structured cooperation is "don't continue until all handlers have finished executing." So how do you know which handlers you should be waiting for in the first place? When you think about it, it's not obvious what the correct answer is. A naive approach could be: "After a message emission, just wait for X amount of time, after which all registered handlers will surely have seen the new message and written their SEEN event. Then, wait until those SEENs are terminated."

But what if a network partition happens, and some handler is not available for a period of time longer than it takes all the others to complete? What if you just deployed a new service that happens to be listening to that message? What happens if the SEEN was written, so you know you should be waiting for it, and then a network partition occurs? Or you take the service offline deliberately?

All of these, and more, are questions you need to ask yourself, and there is no one correct answer; the answer is "it depends." Scoop recognizes this, provides an API that's deliberately designed around the fact that you need to figure this out, and forces you to make an explicit decision.

It turns out that this is a specific instance of a more general question: under what conditions is a saga ready to be resumed? Naturally, the way we've been answering this question this entire time is "when the rule of structured cooperation is obeyed," but that's not the only useful answer.

Here's another one: when a certain amount of time has elapsed. And boom!
You just got sleep() for free. And when you have sleep(), you also have task scheduling. Just like that.

An associated question you can ask is: under what conditions should you give up on a saga? Here are some useful answers:

- when a deadline has passed,
- when a cancellation has been requested.

Again, you get powerful and well-established features, but you don't get them as primitives that need to be built in; rather, they are built on something even more fundamental. If Scoop didn't implement timeouts or cancellations, you could just as easily do it yourself.

The API in question is called an EventLoopStrategy, an interface exposing five methods: resumeHappyPath, giveUpOnHappyPath, resumeRollbackPath, giveUpOnRollbackPath, and start. We've introduced the first two, and the following two are their equivalents for rollbacks. The giveUp variants are checked by Scoop automatically on the "boundaries" of a step, i.e., before a step starts executing and after it stops. You can also check them on demand by calling scope.giveUpIfNecessary() inside a step, which will throw the appropriate exception if necessary (Scoop is cooperative). Finally, start is there to allow you to customize the conditions under which you want to react to a message at all; e.g., you typically don't want a newly deployed saga to start reacting to messages from a year ago.

All these methods return raw SQL that is injected3 directly into the query that checks if some saga is ready to continue. Scoop provides reasonable implementations for all of these methods, and you're encouraged to build your own.
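As a sketch, the contract could look something like the following. The five method names come from the text; the zero-argument, SQL-returning signatures and the column names in the sample strategy are assumptions made purely for illustration (the real interface lives in Scoop's source):

```kotlin
// Each method returns a raw SQL fragment that gets spliced into the
// event loop's "is this saga ready?" query.
interface EventLoopStrategy {
    fun start(): String                // when should a message be reacted to at all?
    fun resumeHappyPath(): String      // when is a run ready for its next step?
    fun giveUpOnHappyPath(): String    // when should a run be abandoned (triggering rollback)?
    fun resumeRollbackPath(): String   // rollback-path equivalents of the above two
    fun giveUpOnRollbackPath(): String
}

// "Resume when a certain amount of time has elapsed": the strategy idea behind
// sleep(). The SLEEP_UNTIL context key and the column names are invented here.
class SleepStrategy : EventLoopStrategy {
    override fun start() = "true"
    override fun resumeHappyPath() = "now() >= (context ->> 'SLEEP_UNTIL')::timestamptz"
    override fun giveUpOnHappyPath() = "false"
    override fun resumeRollbackPath() = "true"
    override fun giveUpOnRollbackPath() = "false"
}
```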
It's called EventLoopStrategy because an event loop is the technical term for a thing that "periodically checks if there's something to do, and does it."

Discoverability of handler topologies #

Since it so fundamentally influences the system you end up building, I want to spend a little time discussing the original question that motivated EventLoopStrategy: how do we determine which handlers we should be waiting for? This question stands at the center of how you implement resumeHappyPath (and resumeRollbackPath).

A good question to start with is whether the handler topology of your system is fundamentally discoverable or not. What I mean by that fancy word cocktail is a system where you can, at any time, determine which components are participating, what topics they're listening to, and what messages they emit. Perhaps you have a static registry in a JSON file that's updated with each deployment and shipped as part of each deployed service. Perhaps you have a component dedicated to providing that information, e.g., some sort of service registry. Perhaps it's something else. But somehow, it is always possible to determine the correct answer to that question.

The reason this question is important is that it determines what options you have from the perspective of the CAP theorem: if your system is not discoverable, you can never guarantee consistency, and all you can focus on is availability. Structured cooperation is fundamentally about enforcing consistency, so if you want to use it for that purpose, you need to invest the effort necessary to make your topology discoverable.

At least in theory, I can imagine there could be situations where it really might not be possible, by design.
In those instances, you can still do a kind of \u0026ldquo;best-effort\u0026rdquo; structured cooperation\u0026mdash;for example, you can guarantee that any service that ACKs a new message within X amount of time will be included in structured cooperation. It\u0026rsquo;s kind of like a train schedule\u0026mdash;you need to be on board by a certain time, and once you\u0026rsquo;re on board, you all move as one piece. But if you miss the train, you get left behind.\nMy personal hunch is that the vast majority of practical systems are discoverable, at least in theory, so I want to spend some time discussing them. Even under the assumption of discoverability, things are not exactly trivial when combined with the reality of any distributed system of any complexity\u0026mdash;things keep changing all the time.\nBuilding and maintaining a handler topology #In order to build a handler topology for the system, each service needs to publish this information about itself in some manner. Ideally, that would be done in some automated fashion directly from the code\u0026mdash;there are a thousand ways to do that in any mature ecosystem. If you\u0026rsquo;re dealing with some horrible legacy system, maybe you can ask your MQ, or use something more advanced. Or maybe you like to watch the world burn, and require filling out a form with that information as part of the deployment process. But somehow, you need to do this.\nThen, you need to make this data accessible to all other services. Collecting that data for all services and bundling it with each app would be nice, but not very practical, as you would need to deploy a new version of every service whenever it changed for any of them. A much more realistic approach would be to publish this information to some central place\u0026mdash;could be in a database that\u0026rsquo;s accessible to all services, or your own custom service, or maybe something like Backstage. 
Just keep in mind that whatever this place is, it is a SPOF, and needs to be placed into the same QoS category as your MQ: if it goes down, the world goes down, or at least gets severely degraded.

That's the easy part.

The difficult part is dealing with deployments that change this handler topology. There are two issues here:

1. How to let the individual services know that something has changed?
2. How to deal with running cooperation hierarchies that will be affected by what's being deployed?

The former is basically cache invalidation. Each time the event loop of each service checks if there's a saga that can be resumed, it needs to access the information about which other sagas it should be waiting for. Obviously, that can't be a network request, because a) it would take forever, and b) it would probably hammer your registry into oblivion. So each service needs to cache a copy and be notified when it's no longer up-to-date. This needs to be part of the deployment pipeline.4

The latter involves determining which services will be affected by the change (e.g., those that have a handler that publishes messages whose handlers are affected by the deployed change: either their behavior is being changed, or a new handler is added, or an old one removed, etc.), and doing two things:

1. Notifying them to pause processing of new messages (and unpausing them after the deployment is finished)
2. Waiting for them to finish processing any existing messages (or, alternatively, cancelling any that are currently running)

If you don't do that, you could get into any one of many inconsistent states: some handlers are running old code while some are running new; some know they should be waiting for a new handler at a certain step while some don't; some handlers start processing a message from the middle of a series of steps without having processed the messages emitted in previous steps; and so on.

Obviously, this isn't trivial and involves consideration of lots of edge cases, e.g., how long it's acceptable for a given message to not be processed, or how long it's acceptable to wait for handlers to finish doing their thing. You also need to properly handle any failures, e.g., if a handler was paused and then the deployment fails down the line, you need to unpause it again5. As hinted at in the previous two footnotes, all this corresponds exactly to what a saga is: publishing messages to topics is a very natural way to solve this problem, and the challenges involved are precisely what structured cooperation was designed to solve.

I don't presume to tell you how you should or shouldn't solve any of this, mainly because the landscape is so heterogeneous that I don't really believe there is any way of formulating a sensible answer.
But whatever the answer might be in your particular case, you have the option of using structured cooperation itself to help you implement it.

One final note on EventLoopStrategy: Scoop contains a dummy implementation that only works if all services are actually instances of the same service (i.e., the same .jar). Each time a saga starts listening on some topic, it's added to a local registry (basically a HashMap) that is then used for reference when building the list of handlers to wait for. Naturally, this isn't an implementation that's viable for production use, although it's still powerful enough to allow you to scale a given app horizontally with no additional work. Its main purpose is, as with Scoop itself, to convey an idea.

Rollbacks #

We've mentioned rollbacks quite a few times, but we've been putting off taking a closer look at their semantics. Let's fix that.

First, to recap:

- Each saga step also defines an associated compensating action, which reverses whatever effect the step itself had. The default value is a lambda that does nothing.
- When an unhandled exception is thrown during the execution of some step, the compensating actions are executed in reverse order.
- As a consequence, if any messages were emitted during the execution of a step, and any sagas reacted to those messages, the compensating actions of those sagas are run first (again, in reverse order), and only then is the compensating action of the "parent" step run.

As we did when explaining the rudiments of how Scoop works a few paragraphs ago, it'll be better if I first show you a few examples of what the message events look like in various rollback scenarios, so here are a few sagas (omitting the EventLoopStrategy) with their associated message events when they are triggered.
As before, I'm truncating any UUIDs to the last 4 characters, and excluding some columns (id, created_at, coroutine_identifier and cooperation_lineage). Unlike before, I'm no longer excluding the exception column, but I'm obviously not including the full serialized JSON, instead opting for a symbolic placeholder.

```kotlin
// a handler failing in a step never actually publishes any
// message emitted from that step, since the transaction
// doesn't commit
saga("root-handler") {
    step { scope, message ->
        scope.launch("child-topic", JsonObject())
        throw RuntimeException("Geronimo!")
    }
}
```

| message_id | type | coroutine_name | step | exception |
|---|---|---|---|---|
| 8aaa | EMITTED | null | null | null |
| 8aaa | SEEN | root-handler | null | null |
| 8aaa | ROLLING_BACK | root-handler | 0 | RuntimeException("Geronimo!") |
| 8aaa | ROLLED_BACK | root-handler | Rollback of 0 | null |

```kotlin
saga("root-handler") {
    step(
        invoke = { scope, message -> },
        rollback = { scope, message, throwable ->
            throw IllegalArgumentException("Geronimo again!")
        },
    )
    step { scope, message ->
        scope.launch("child-topic", JsonObject())
        throw RuntimeException("Geronimo!")
    }
}
```

| message_id | type | coroutine_name | step | exception |
|---|---|---|---|---|
| a8bf | EMITTED | null | null | null |
| a8bf | SEEN | root-handler | null | null |
| a8bf | SUSPENDED | root-handler | 0 | null |
| a8bf | ROLLING_BACK | root-handler | 1 | RuntimeException("Geronimo!") |
| a8bf | SUSPENDED | root-handler | Rollback of 0 (rolling back child scopes) | null |
| a8bf | ROLLBACK_FAILED | root-handler | Rollback of 0 | IllegalArgumentException("Geronimo again!") |

```kotlin
saga("root-handler") {
    step { scope, message ->
        scope.launch("child-topic", JsonObject())
    }
}

saga("child-handler") {
    step { scope, message -> }
    step { scope, message ->
        throw RuntimeException("Geronimo!")
    }
}
```

| message_id | type | coroutine_name | step | exception |
|---|---|---|---|---|
| 5e96 | EMITTED | null | null | null |
| 5e96 | SEEN | root-handler | null | null |
| c714 | EMITTED | root-handler | 0 | null |
| 5e96 | SUSPENDED | root-handler | 0 | null |
| c714 | SEEN | child-handler | null | null |
| c714 | SUSPENDED | child-handler | 0 | null |
| c714 | ROLLING_BACK | child-handler | 1 | RuntimeException("Geronimo!") |
| c714 | SUSPENDED | child-handler | Rollback of 0 (rolling back child scopes) | null |
| c714 | SUSPENDED | child-handler | Rollback of 0 | null |
| c714 | ROLLED_BACK | child-handler | Rollback of 0 | null |
| 5e96 | ROLLING_BACK | root-handler | 0 | ChildRolledBackException(RuntimeException("Geronimo!")) |
| c714 | ROLLBACK_EMITTED | root-handler | Rollback of 0 (rolling back child scopes) | ParentSaidSoException(ChildRolledBackException(RuntimeException("Geronimo!"))) |
| 5e96 | SUSPENDED | root-handler | Rollback of 0 (rolling back child scopes) | null |
| 5e96 | SUSPENDED | root-handler | Rollback of 0 | null |
| 5e96 | ROLLED_BACK | root-handler | Rollback of 0 | null |

The "scopes" in "rolling back child scopes" refers to the CooperationScope we briefly mentioned above. Basically, it is a synonym for what could intuitively be called "message hierarchies," i.e., "rolling back child message hierarchies."

I'd like to give more examples, but you can see that the table starts getting pretty long, so I'll refer you to the test suite instead: I think playing around with different scenarios and looking at what the message_events table looks like after is the easiest way to understand it.

Here are the most important points:

- If an unhandled exception is thrown in some step, instead of writing a SUSPENDED (if it's not the final step) or COMMITTED (if it was the final step) message event, a ROLLING_BACK is written.
- A ROLLING_BACK is, in a sense, semantically similar to a SEEN, in that it signifies the "start" of something: while SEEN is the start of the happy path, a ROLLING_BACK is the start of the rollback path.
- When that saga is next picked up for execution, the last finished (= the local transaction committed) step is looked up, and the set of messages emitted in that step is built from the EMITTED records corresponding to the step. A ROLLBACK_EMITTED is written for each such message, after which the saga suspends (a SUSPENDED is written), and is not picked up again until all child sagas have finished rolling back (as determined by the EventLoopStrategy). Only then is the compensating action of the step actually run, after which another SUSPENDED is written, and the process continues for previous steps.
- Child sagas react to the ROLLBACK_EMITTED by writing their own ROLLING_BACK, starting the same process for themselves.
- If all steps rolled back successfully, a ROLLED_BACK is emitted: the equivalent of COMMITTED for rollbacks.
- If an unhandled exception is thrown during the execution of any compensating action, a ROLLBACK_FAILED is emitted, and the compensation is halted there.
- If multiple failures happen in different children, the exceptions are combined (i.e., no failure is thrown away).

All this is really verbose to write out, but the principles are actually fairly simple. The basic rule to remember is that rollbacks amount to "running the sagas backwards": pretty much all aspects of the implementation follow from that.

There is one last thing I should mention before moving on: each step() actually accepts a third parameter, also a lambda, called handleChildErrors. Any unhandled exceptions that bubble up from children are first sent into this function.
If it returns normally (i.e., doesn't throw an exception), the exception is considered handled, and execution goes on as if the child completed normally. This is basically the semantic equivalent of a catch block wrapped around all child executions from that step, and allows you to do things like retries (you can emit messages from there in the same way), or ignoring child failures.

However, you should know that this is one of the things whose actual implementation is a little half-baked, and there are various things you need to be aware of when using it. All of them are solvable; I just didn't invest the time to do so. The main thing you need to take away is that having something like handleChildErrors is important for the ability to implement certain features.

The default implementation always throws whatever exception is passed into it.

CooperationFailure & CooperationException #

Since structured cooperation is, by design, language agnostic, you can have services running code in completely different languages all participating in a message hierarchy. If a failure happens in one place, it needs to be representable wherever it bubbles up to, i.e., if a handler in Python emits a message that is picked up by a handler in Haskell, and that handler fails somehow, we need a way to represent that failure in the Python handler as well.

Therefore, a common protocol for representing failures is a necessary part of structured cooperation. Scoop provides an implementation via CooperationFailure, a data structure containing the data typically associated with a failure: the type of failure, a message, a stack trace, and a list of causes. This is what determines the serialized data in the exception column. Inside Scoop, a JVM implementation of structured cooperation, this is then translated to a CooperationException, which is what is then thrown around.
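As a sketch of what such a protocol object and its JVM translation might look like (the field names follow the description above, but are assumptions, not Scoop's actual wire format):

```kotlin
// A language-agnostic failure description: a type, a message, a stack
// trace, and a list of causes. Field names are illustrative assumptions.
data class CooperationFailure(
    val type: String,                 // e.g. "java.lang.RuntimeException"
    val message: String?,
    val stackTrace: List<String>,
    val causes: List<CooperationFailure> = emptyList(),
)

// A JVM-side translation into a throwable, analogous in spirit to Scoop's
// CooperationException. JVM exceptions have a single cause, so this sketch
// chains only the first one; combined multi-child failures need more.
class CooperationException(failure: CooperationFailure) :
    RuntimeException("[${failure.type}] ${failure.message}") {
    init {
        failure.causes.firstOrNull()?.let { initCause(CooperationException(it)) }
    }
}
```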
The concept of CooperationFailure pertains to structured cooperation as a whole, and its format must be agreed upon and adhered to by all participants (the same as, e.g., the fact that there is a message_events table, what is written to it, and when). The concept of CooperationException is specific to Scoop, which is one of many possible structured cooperation implementations, in one of many possible languages.

You should also note that even though there is a standard representation of a failure across systems, you should be very wary of actually using that representation as part of any sort of code logic (e.g., checking for a specific type of exception bubbling up from children), because if you do that, you're tightly coupling your handler to code in a service that's far away, and the format of the failure must then be treated as part of the public API of that specific service: just refactoring the name of an exception could break your code.

While structured cooperation fundamentally does introduce coupling, it only introduces causal coupling, not procedural/behavioral coupling. If procedural coupling is what you need, then you probably should be using some form of RPC, not "fire-and-forget" messages.6

Cancellation requests #

Canceling a running saga on request is actually really simple to implement. A special kind of message event type, CANCELLATION_REQUESTED, is written, and the default EventLoopStrategy used by Scoop checks for it in its giveUpOnHappyPath and giveUpOnRollbackPath implementations. That's pretty much all it takes.

Undo/rollback requests #

Due to the way it's designed, Scoop supports doing rollbacks whenever you want: even if a saga has already completed, you can just write a ROLLBACK_EMITTED and all the magic just works. So you get undo for free.

The only thing you should be wary about is rolling back anything but a top-level message hierarchy.
While there's nothing technically stopping you from rolling back a sub-hierarchy, you're essentially transitioning the system into an inconsistent state, which should be done with caution. But Scoop allows you to do both.

CooperationContext #

The last fundamental building block incorporated by Scoop, which we've been glossing over until now, is CooperationContext, a way to share data across the entire message hierarchy. Basically, it's a writable object that's accessible in all steps of all sagas that participate in a given message hierarchy, and it's crucial for implementing certain features, such as the try-finally used for resources, or deadlines (discussed below). As we mentioned in the last article, you can think of it as the equivalent of reactive context, CoroutineContext, etc., if you're familiar with those concepts. If not, don't worry about it.

Here's an example:

```kotlin
data object MyContextKey : CooperationContext.MappedKey<MyContextValue>()

data class MyContextValue(val value: Int) : CooperationContext.MappedElement(MyContextKey)

saga("root-handler") {
    step { scope, message ->
        scope.context += MyContextValue(3)
        scope.launch("child-topic", JsonObject())
    }
    step { scope, message ->
        // logs 3
        log.info(scope.context[MyContextKey]!!.value)
    }
}

saga("child-handler") {
    step { scope, message ->
        // logs 3
        log.info(scope.context[MyContextKey]!!.value)
    }
}
```

The implementation is heavily inspired by Kotlin's CoroutineContext, although you don't need to know anything about that to understand it.
All you need to know is that a CooperationContext can be thought of as a HashMap, associating keys (instances of CooperationContext.MappedKey) with values/elements (instances of CooperationContext.MappedElement), which are propagated across the message hierarchy in a particular way. The rules of this propagation are the main thing you need to understand.

The launch method on messageQueue, which is used to launch top-level messages from outside a saga, allows you to add a context parameter; this becomes the context in the first step of all sagas that react to that message. A saga may mutate its own context at any point, in which case the mutation will be seen in all subsequent steps. Additionally, all launch methods on scope inside a step also allow you to pass a context value, which is then combined to form the context of the first step of any child handlers that react to that message. Any changes done inside child handlers only propagate to their children, i.e., CooperationContext allows you to send data from parent to child, but not the other way around.

```kotlin
saga("root-handler") {
    step { scope, message ->
        scope.context += MyContextValue(1)
        // logs MyContextValue(1)
        log.info(scope.context[MyContextKey])

        scope.launch("child-topic", JsonObject(), ChildContextValue(2))

        scope.context += MyContextValue(3)
        // logs MyContextValue(3)
        log.info(scope.context[MyContextKey])
        // logs null
        log.info(scope.context[ChildContextKey])
    }
    step { scope, message ->
        // logs MyContextValue(3)
        log.info(scope.context[MyContextKey])
        // logs null
        log.info(scope.context[ChildContextKey])
    }
}

saga("child-handler") {
    step { scope, message ->
        // logs MyContextValue(1)
        log.info(scope.context[MyContextKey])

        scope.context += MyContextValue(10)
        // logs MyContextValue(10)
        log.info(scope.context[MyContextKey])
        // logs ChildContextValue(2)
        log.info(scope.context[ChildContextKey])
    }
}
```

Things get a little more
interesting when rollbacks are involved. The context traverses back up the steps in reverse, same as the execution flow. If, in a step, child rollbacks are triggered, the context travels to them as well, where it is combined with the context from the child's last step (the "parent" context takes precedence if both contexts have any common keys).

CooperationContext is the building block upon which deadlines and sleep() are built: the final features we'll talk about here. Since this post is already alarmingly long, I'll restrict myself to a few paragraphs and invite you to take a look at the test suite to see them in action.

Deadlines #

The concept of deadlines is how Scoop implements timeouts, inspired (as is Scoop itself) by the work of the wonderful Nathaniel J. Smith. I recommend you read his insightful thoughts on timeouts and cancellations.

A deadline is a time by which a message hierarchy must be completed. Scoop allows you to specify a deadline for the happy path, a deadline for the rollback path, and a deadline for both paths combined. All three deadlines are represented as their own CooperationContext keys, which are respected by the default EventLoopStrategy implementations' giveUpOnX methods. The way CooperationContext propagates across the hierarchy ensures that a deadline applied to a message applies to all child messages as well, while also allowing parents to set stricter deadlines for their children. Deadlines also implement a tracing mechanism that allows you to determine where a given deadline originated.

Delayed execution, scheduling & periodic execution #

Delayed execution, a.k.a. sleeping, is implemented using two things:

- A CooperationContext key containing the time until which we should be sleeping
- A dedicated handler, built into Scoop and listening on a dedicated topic, that uses a custom EventLoopStrategy which only runs the handler after that time elapses.
Sleeping is then achieved by emitting a message on that sleep topic, with the appropriate context value.

This automatically gives you scheduling: things like "send a survey 2 days after a purchase" involve creating a saga listening to the purchases topic that sleeps for 2 days as its first step. It also gives you periodic execution: a saga which first sleeps for the period of execution, then emits a message to the same topic it listens to using scope.launchOnGlobalScope, then proceeds with whatever it is supposed to do.

Next steps #

There's more we could talk about. For example, we could talk about how the contents of message_events impact observability & tracing, and how they are invaluable debugging tools: so much so that when I implemented a stripped-down version of structured cooperation in an Axon app I was maintaining, it immediately became the first place anyone would go when there was any kind of problem. It also allowed us to catch a really, really nasty bug; I shudder to think what debugging that would have looked like without it.

But I'll leave it at that and talk about one final thing: where we go from here.

Obviously, the first thing that needs to happen is that the community needs to pound at everything I'm talking about here. I expect there to be at least some pushback, because, as in both preceding applications of the idea (more on this in the next article), structured cooperation will probably make certain patterns obsolete, and people who are used to applying and thinking in terms of those patterns aren't going to like having to learn to do things a different way.

To that end, I'd like to ask any readers with thoughts on the subject to please create an issue.
Anything from \u0026ldquo;How would one do X using structured cooperation?\u0026rdquo; and \u0026ldquo;Wouldn\u0026rsquo;t X be a better approach to solve Y?\u0026rdquo; to \u0026ldquo;I see problem X in the way things are implemented\u0026rdquo; and \u0026ldquo;We should implement this in X next,\u0026rdquo; along with any other thoughts you might have, is welcome. Please do your best to first read through existing issues and see if one already matches what you want to say; I\u0026rsquo;ll do my best to answer them all as quickly as I can.\nAssuming the community decides that there is value in this approach, the next step should be to put it to practical use. As I\u0026rsquo;ve mentioned throughout both articles, structured cooperation is a concept, and it is implementable in any language, in any framework, and on top of any infrastructure. Unfortunately, that also makes it a little difficult to disseminate as a batteries-included package, because the landscape to which it is applicable is quite diverse. Scoop is built in Kotlin on Postgres, but you\u0026rsquo;re probably not using Postgres for messaging. Perhaps you want to implement the equivalent of message_event on top of whatever MQ you\u0026rsquo;re using. Perhaps you want to keep using a (shared) database for that, but perhaps it\u0026rsquo;s not Postgres. Perhaps you want to move Scoop\u0026rsquo;s event loop to a different place, e.g., run it in a dedicated service, or perhaps you\u0026rsquo;re feeling adventurous and want to implement it directly inside Postgres. Perhaps you have a completely different idea of how structured cooperation should be implemented. 
And perhaps you might want to do all that in 20 different frameworks and 20 different languages.\nThat\u0026rsquo;s one reason that I\u0026rsquo;m not currently pursuing turning Scoop into a production library\u0026mdash;this will clearly require a community effort, and solving it for a single combination of (structured cooperation implementation, infrastructure, framework, library) wouldn\u0026rsquo;t make that big of a dent. Rather, I think Scoop has value as something like a reference implementation\u0026mdash;something others can consult when implementing structured cooperation in their environment of choice.\nThat being said, if there\u0026rsquo;s serious interest from the community and people willing to participate and/or sponsor the work, I would consider working on polishing the JVM version and getting it to a production-ready state. At the same time, I should stress that I\u0026rsquo;m just one guy, with a normal 9-5 job, so my capacity is fairly limited. In any case, feel free to let me know in the corresponding issue.\nWhere to go from here #If you\u0026rsquo;ve managed to keep reading until here, thank you\u0026mdash;there\u0026rsquo;s a lot to unpack here, and it can be a little overwhelming.\nIf you want to learn more about how Scoop is implemented or play around with it, head on over to the GitHub repository. Otherwise, feel free to continue to the final article of the series. I should note that this one is not necessary for understanding structured cooperation. It will, however, help convince you that the effectiveness of structured cooperation isn\u0026rsquo;t something that just happens randomly, but that there\u0026rsquo;s a deep reason involved. In this last article, I\u0026rsquo;ll show you that it is actually just the latest incarnation of a concept with a very rich history that has had a tremendous impact on the entire programming industry over the course of the last six decades. 
See you there.\nAnd since this is UUIDs we\u0026rsquo;re talking about, we can safely change this to \u0026ldquo;cooperation_lineage contains \u0026lt;some UUIDs\u0026gt;,\u0026rdquo; or WHERE cooperation_lineage @\u0026gt; \u0026lt;some UUIDs\u0026gt; in Postgres. This kind of querying is used extensively in Scoop, and Postgres allows you to build an index for it.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nIf you\u0026rsquo;re getting dizzy imagining the mounds of data this approach generates, don\u0026rsquo;t\u0026mdash;you have lots of different options. Scoop only needs this data while the saga is running\u0026mdash;after that, you only need to retain it if you want to support user-requested rollbacks (and even then, things could be reimplemented so you\u0026rsquo;d only really need the values of the cooperation context). Therefore, from that point on, you can just treat the contents of message_event as logs\u0026mdash;define a retention period based on how long you want to be able to debug comfortably, after which you either remove the data, move it to cold storage, or whatever. Heck, you could just nuke it right after a saga finishes running. You could have different policies for different sagas. The sky is the limit here.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThis is bound to make people green in the face, but again\u0026mdash;I\u0026rsquo;m trying to communicate an idea here. Focus on that, and be a loose lily floating down an amber river.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nA nice way to do that would be to have a deployment topic that all services could listen to, where any topology changes would be published as part of the deployment pipeline. 
Of course, we\u0026rsquo;d also like to wait for all services to acknowledge they\u0026rsquo;ve updated their cache, so we don\u0026rsquo;t get into a situation where we deploy a new handler, and there\u0026rsquo;s a time window where some services know that they should wait for it, and some don\u0026rsquo;t\u0026mdash;that could lead to messy situations. In other words, we\u0026rsquo;d like to publish the message and wait for all services to finish updating their cache. Do you see where I\u0026rsquo;m going with this?\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nNotice how we want to unpause a handler regardless of how the deployment ends. Handler inactivity is an expensive thing\u0026mdash;a resource.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nI should note that messaging is absolutely a way to implement RPC, e.g. you send a message (typically called a command) to a topic, and receive a response on another.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","date":"7 July 2025","permalink":"/posts/implementing-structured-cooperation/","section":"Articles","summary":"An overview of how structured cooperation is implemented in Scoop, along with two of its fundamental components: EventLoopStrategy and CooperationContext.","title":"Towards an Implementation of Structured Cooperation"},{"content":"If you\u0026rsquo;ve ever worked as an enterprise developer in any moderately complex company, you\u0026rsquo;ve likely encountered distributed systems of the kind I want to talk about in this post\u0026mdash;two or more systems communicating together via a message queue (MQ), such as RabbitMQ or Apache Kafka. Distributed, message-based systems are ubiquitous in today\u0026rsquo;s programming landscape, especially due to the (now hopefully at least somewhat tempered) microservice architecture frenzy that swept over our field during the past decade.\nMoving away from a majestic monolith involves significant tradeoffs, all of which have been documented extensively over the years. 
Dealing with distributed systems is a famously painful experience\u0026mdash;data is only eventually consistent, errors are difficult to trace and debug, and, perhaps most frustratingly, it becomes increasingly difficult to reason about the system as a whole. This is compounded by the organic way these systems often form\u0026mdash;rather than being a thought-out and planned architectural decision, many start out as an ad hoc solution to a particular localized problem and then gradually snowball into a mess.\nNothing I\u0026rsquo;ve said so far is news\u0026mdash;everybody knows that distributed systems are a pain.\nBut why?\nIn the following posts, I want to convince you that many of the difficulties traditionally associated with distributed systems are not actually unique to distributed systems at all. In fact, they\u0026rsquo;re something our industry has encountered, and solved, not once, but twice before\u0026mdash;once, around 1968, when we started thinking about the problems with GOTO, and then again more recently, around 2016, with the advent of structured concurrency.\nWhat\u0026rsquo;s more, the solutions to both problems revolve around essentially the same idea, and it turns out that this same idea\u0026mdash;a single, simple rule\u0026mdash;is also applicable to how we design distributed systems. Applying that idea not only prevents many of these difficulties from ever arising, but also opens the door to features that are not readily available using current approaches, such as (but not limited to) true distributed exceptions\u0026mdash;stack traces that span multiple services, and what amounts to stack unwinding across services. And perhaps most importantly, it makes these systems significantly easier to reason about.\nHowever, those discussions, while interesting and educational, are also rather theoretical, and let\u0026rsquo;s be honest\u0026mdash;that\u0026rsquo;s not everyone\u0026rsquo;s cup of tea. 
So before I give you a tour of the ivory tower, I want to stay on the ground for a little while, and show you what you get when you actually apply all that ivory business. To that end, I built Scoop, and in this post, I want to talk about what it is, and what it can do. Hopefully, that will motivate you to further explore the reasoning that led me to build it in this particular fashion, and who knows, maybe you\u0026rsquo;ll even learn a thing or two along the way.\nIn this post, I\u0026rsquo;m going to concentrate on what Scoop can do, without going into too much detail about how. That will be the topic of the subsequent post, which will give you a helicopter overview of how Scoop and its features are implemented, and concentrate on a few fundamental topics that I purposefully avoid talking about here. Finally, in the third and final post, I\u0026rsquo;ll frame the core concept Scoop is built around, something I\u0026rsquo;m calling structured cooperation, in a broader context, and show you how it\u0026rsquo;s the natural continuation of an idea that has, in one form or another, been shaping our industry for over half a century.\nWhat is Scoop, and what did I build it with? #Scoop amounts to what you might call an orchestration library\u0026mdash;it helps you write business operations in distributed systems. In that sense, it is similar to, e.g., Temporal, or, to an extent, Axon.1 Scoop is small\u0026mdash;it can be read cover-to-cover in a few hours, and most of the magic happens in ~500 lines of (heavily documented) SQL.\nThe primary purpose of Scoop, at least at this point, is to convey an idea. Scoop is a POC, not a production-ready library. 
That being said, feature-wise, it packs quite a punch if I may say so myself, especially given how small it is.\nThe principles upon which Scoop is built, along with the contents of these posts, are language and infrastructure agnostic, and no specific knowledge is assumed here, other than familiarity with any mainstream programming language, SQL, and a vague familiarity with MQ\u0026rsquo;s and distributed systems in general (e.g., you know what a message or a topic is).\nOf course, I did need to write it in something. Scoop is written in Kotlin, on top of Quarkus, and uses Postgres for everything. I chose Kotlin primarily because of its syntax and type system, and Quarkus since it allows using both blocking and reactive database drivers in a single application, and I wanted to write Scoop in both flavors, for educational and demonstrational purposes. However, I also wanted to target as wide an audience as possible, and tried to minimize the number of assumptions I made about what my audience might be familiar with.\nTherefore, I:\ndon\u0026rsquo;t use any fancy Kotlin features2, apart from extensions, deliberately chose to implement a simple MQ on top of Postgres instead of using an established MQ, try to keep abstraction to a minimum, only use Quarkus for dependency injection, don\u0026rsquo;t use an ORM, write raw SQL everywhere. The result certainly won\u0026rsquo;t win any beauty contests, and I know my JVM brothers and sisters have already fainted in horror at \u0026ldquo;keep abstraction to a minimum,\u0026rdquo; but hopefully, these choices will make the idea accessible to programmers from virtually any background, and conveying the idea is all I care about.\nThere is, unfortunately, some dancing around Jackson, the JSON library. 
There is always some dancing around Jackson.\nThe Saga Begins #One of the most painful consequences of moving to a distributed architecture is the impact it has on transactional boundaries\u0026mdash;you can no longer complete an entire \u0026ldquo;business operation\u0026rdquo; within the confines of a single, traditional database transaction. There are various approaches that work around this issue, such as two-phase commits, but a common one is the saga pattern. In a saga, you give up atomicity by modelling a business operation as \u0026ldquo;a sequence of local transactions\u0026rdquo;\u0026mdash;basically, you break up the operation into a set of \u0026ldquo;steps\u0026rdquo;. Each step is wrapped in a regular transaction, and messages are emitted during their execution, which trigger operations in other services.\nThis is the approach Scoop takes. Here is an example of a saga in Scoop:\n// saga() actually takes a second required parameter,\n// but that\u0026#39;s one of the things I\u0026#39;ll be glossing over\n// in this article\nval myHandler = saga(name = \u0026#34;my-handler\u0026#34;) {\n    step { scope, message -\u0026gt;\n        println(\u0026#34;Hello from step 1 of my-handler!\u0026#34;)\n        scope.launch(\u0026#34;some-topic\u0026#34;, JsonObject().put(\u0026#34;from\u0026#34;, \u0026#34;my-handler\u0026#34;))\n    }\n    step { scope, message -\u0026gt;\n        println(\u0026#34;Hello from step 2 of my-handler!\u0026#34;)\n    }\n}\n\u0026ldquo;What\u0026rsquo;s that scope thing?\u0026rdquo; I hear you exclaim.\nDon\u0026rsquo;t worry about it\u0026mdash;for now, all you need to understand is that scope.launch(\u0026lt;topic\u0026gt;, \u0026lt;payload\u0026gt;) arranges for a message with \u0026lt;payload\u0026gt; to be published on \u0026lt;topic\u0026gt; once the (local) transaction of the step commits.3 Yes, yes, using a raw JsonObject is not at all how this would be done in an actual production implementation, but that\u0026rsquo;s not what Scoop is.\nTo have this saga actually do something, you subscribe it to some topic:\n// MessageQueue is part of Scoop, you just\n// inject it like any other component\nmessageQueue.subscribe(\u0026#34;some-topic\u0026#34;, myHandler)\n\n// This is how you would publish a message (with\n// an empty payload) on that same topic\nmessageQueue.launch(\u0026#34;some-topic\u0026#34;, JsonObject())\nAfter the subscribe line, whenever a message is published on some-topic, the saga is run, step by step\u0026mdash;we\u0026rsquo;ll talk more about that in a second. You can subscribe multiple sagas to the same topic. You can also scale sagas horizontally\u0026mdash;running 1 instance or 100 just works, no configuration needed. Just keep in mind that each step can potentially be run by a different instance of the service.4\nRollbacks \u0026amp; Coordination #There are two rather unique issues that arise with sagas (in general, not just in Scoop).\nThe first is how the equivalent of a transaction rollback happens\u0026mdash;if an error happens somewhere down the road, the transactions of the previous steps are already committed, so you can\u0026rsquo;t do a traditional rollback. Instead, you basically need to write code that manually undoes whatever changes you made in each step, usually called a \u0026ldquo;compensating action\u0026rdquo;, and when an error happens, arrange for this code to run. I\u0026rsquo;ll show you how that\u0026rsquo;s done in Scoop in a moment.\nThe second is how you coordinate the steps, since you ideally only want to run a subsequent step when the previous one has finished. 
This is actually trickier than it sounds, and there are two fundamental approaches that you can take, usually called orchestration and choreography.\nOrchestration #In an orchestrated saga, the saga is usually explicit, which means that there is a specific place in some codebase where the saga is written out in its entirety, and the \u0026ldquo;god service\u0026rdquo; that runs this code orchestrates the entire business operation by calling out to all other services in the proper order to achieve the desired result. This makes it easy to reason about, but also feels like a step in the opposite direction from the decoupled, decentralized mindset that SOA/microservices traditionally embody. The god service must have direct knowledge of all the services it calls out to and how they respond, therefore becoming tightly coupled to them. Essentially, this approach amounts to normal RPC, just with an MQ sandwiched between the systems for better resilience, and possibly some other bells and whistles.\nChoreography #In a choreographed saga, the saga is often (but not always) implicit\u0026mdash;there\u0026rsquo;s no single place in any codebase where the steps are laid out. Instead, you trigger the saga by emitting a particular message, which is handled by whatever handler represents the first step. This handler then emits its own messages, which triggers the next step, and so on. Subsequent steps are triggered in response to intercepting a particular message emitted by the previous step, and messages are fired and forgotten\u0026mdash;handlers don\u0026rsquo;t receive responses to the messages they emit. That\u0026rsquo;s more in line with the decentralized mindset of microservices, but leads to incredibly messy systems that are difficult to reason about. There\u0026rsquo;s actually a very deep reason for the inevitable messiness, and we\u0026rsquo;ll talk about it at length in the final post of this series.\nEven if you ignore that, it still gets really tricky really fast. 
Sometimes, there\u0026rsquo;s no reasonable business message to emit, because some part of the saga ends up not actually doing anything (e.g., we\u0026rsquo;re handling CreateCustomer, but the customer already exists and already contains that same data, so neither CustomerCreated nor CustomerChanged makes sense). In that case, you probably want to continue with the next step anyway (since the purpose of the step, i.e., the customer existing, was achieved), but there\u0026rsquo;s no sensible message to emit, and so nothing to actually react to. This is commonly a problem in event-sourced applications, where the messages themselves are the stored data, so you don\u0026rsquo;t want to be emitting messages that don\u0026rsquo;t mean anything. In other situations, you actually want to react to a combination of messages arriving (e.g., CustomerCreated and ContractCreated). Sometimes the order matters to you; sometimes it doesn\u0026rsquo;t. Usually, the payloads matter (ContractCreated needs a customer_id matching the id in CustomerCreated). Sometimes you want to wait for a particular number of messages of a particular type (e.g., the appropriate number of LineItemCreated) before continuing.\nNow, to be clear\u0026mdash;I\u0026rsquo;m not saying these problems don\u0026rsquo;t have solutions. They obviously do, otherwise these patterns couldn\u0026rsquo;t be used. I\u0026rsquo;m just saying that there\u0026rsquo;s significant baggage that needs to be dealt with, and dealing with it can feel like a game of whack-a-mole in terms of the tradeoffs one is forced to make, and involve significant effort to implement and maintain.\nBut Scoop doesn\u0026rsquo;t quite fit into either of these categories.\nThe Rule of Structured Cooperation #Sagas in Scoop are orchestrated in the sense that they are explicit\u0026mdash;you will find a specific place in some codebase where a saga can be seen in its entirety. 
But they are also choreographed, in the sense that they do not directly call any services, and simply fire off one or more messages without processing responses to them.\nHowever, they do obey a rule that\u0026rsquo;s somewhere halfway between orchestrated and choreographed:\nWhen they reach the end of a step in which messages were emitted, sagas suspend their execution and don\u0026rsquo;t continue with the next step until all handlers of those messages have finished executing.\nThis simple rule is at the heart of Scoop, at the heart of structured cooperation, and at the heart of everything these articles are about.5 And, as you\u0026rsquo;ll see in the rest of the article, having sagas adhere to this rule has some pretty profound consequences.\nLet\u0026rsquo;s take a look at some of them.\nNo more race conditions #In a system where structured cooperation is obeyed, it\u0026rsquo;s impossible to trip over race conditions, such as those associated with eventual consistency, unless you deliberately go out of your way to inflict them upon yourself. That\u0026rsquo;s because all handlers of all messages emitted in any previous steps are guaranteed to have finished successfully (and all handlers of any messages they emitted, and so on). Crucially, you don\u0026rsquo;t need to do anything yourself\u0026mdash;you don\u0026rsquo;t need to check for side effects other services might have in order to determine if they\u0026rsquo;ve finished or not.6 If a step is executing, the previous steps and all their side effects have finished executing, period.\nSay you\u0026rsquo;re not using structured cooperation, and, while importing some data, fire off a CustomerCreation and CustomerContractCreation message. Multiple services are listening to those messages and reacting to them. It could easily happen that a service starts reacting to CustomerContractCreation before it finishes reacting to CustomerCreation, which will lead to a CustomerNotFound error. 
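This failure mode is easy to reproduce in miniature. The following is a deliberately tiny, self-contained Kotlin sketch (the handler functions, the in-memory store, and the exception are invented for illustration; this is not Scoop's API) showing both the unlucky interleaving and the ordering that structured cooperation guarantees:

```kotlin
// Toy stand-ins for two services' message handlers. In a real system they
// would live in separate services with their own databases; here a shared
// map plays the role of the "customer service" database.
val customers = mutableMapOf<Int, String>()

fun handleCustomerCreation(id: Int, name: String) {
    customers[id] = name
}

fun handleCustomerContractCreation(customerId: Int) {
    // Nothing in a plain choreographed system guarantees the customer
    // exists yet when this handler runs.
    customers[customerId]
        ?: throw IllegalStateException("CustomerNotFound: $customerId")
}

// Unlucky interleaving: the contract handler runs before the customer
// handler has finished, and blows up.
val failure = runCatching { handleCustomerContractCreation(42) }
check(failure.isFailure)

// Under structured cooperation, CustomerCreation is emitted in an earlier
// step, so all of its handlers are guaranteed to have finished before the
// step emitting CustomerContractCreation even starts.
handleCustomerCreation(42, "Alice")
handleCustomerContractCreation(42) // succeeds
```

The point of the sketch is only the ordering: whatever runs the second pair of calls must not start the contract handler until the creation handler has completed, which is precisely what the suspension rule buys you for free.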
There are ways to deal with that, but it\u0026rsquo;s a can of worms, e.g., how do you distinguish between a customer that hasn\u0026rsquo;t been created yet vs. an actual faulty message?\nIf your system uses structured cooperation, that can never happen. CustomerCreation is fired in a step preceding the one in which CustomerContractCreation is, so you can guarantee that the entire system is in a consistent state once you start executing any subsequent step.\nThat guarantee is at the heart of what makes structured cooperation so powerful\u0026mdash;it allows you to reason about the state of the entire system without actually writing any code that would require you to know anything about it. You don\u0026rsquo;t need to care if there is one, zero, or 500 handlers listening to a message you emitted. You don\u0026rsquo;t need to care if they themselves need to fire off 1000 messages of their own in order to react to your message, or run some dreadful local calculation that takes until October to complete. You fire off your messages and suspend, and Scoop will wake you up when September ends.\nDistributed Exceptions #In a system where structured cooperation is obeyed, if I\u0026rsquo;m inside some saga that\u0026rsquo;s handling some message, and that message is a \u0026ldquo;child\u0026rdquo; message of some other saga\u0026mdash;i.e., it was emitted from within another saga\u0026mdash;I\u0026rsquo;m guaranteed that the \u0026ldquo;parent\u0026rdquo; saga is still running, patiently waiting at the end of the step that emitted the message. Therefore, if my saga fails by throwing an (unhandled) exception, I have the option to propagate that exception to the parent, and rethrow it there. 
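To see why the parent's waiting is what makes this possible, here is a toy, single-threaded model of the rule. All names are invented (ChildFailedException is a stand-in for Scoop's ChildRolledBackException, and none of this is Scoop's actual API); the trick is that "waiting for all handlers to finish" is modeled as a nested function call, so a child's failure naturally surfaces at the exact step its parent was suspended in:

```kotlin
// A child failure wrapped once per level, growing a "stack trace" one saga
// at a time -- a stand-in for Scoop's ChildRolledBackException.
class ChildFailedException(message: String, cause: Throwable) :
    RuntimeException(message, cause)

// A handler is just a named list of steps; each step receives an emit
// function it can use to publish to a topic.
class Handler(val name: String, val steps: List<((String) -> Unit) -> Unit>)

val subscriptions = mutableMapOf<String, MutableList<Handler>>()

fun runHandler(handler: Handler) {
    handler.steps.forEachIndexed { index, step ->
        val emitted = mutableListOf<String>()
        step { topic -> emitted += topic } // run the step; it may emit messages
        // The structured cooperation rule: every handler of every message
        // emitted in this step must finish before the next step runs.
        // Because "waiting" is a nested call here, a child's exception
        // surfaces at this exact suspension point.
        for (topic in emitted) {
            for (child in subscriptions[topic].orEmpty()) {
                try {
                    runHandler(child)
                } catch (e: Exception) {
                    throw ChildFailedException(
                        "${handler.name} failed while suspended in step $index", e
                    )
                }
            }
        }
    }
}

subscriptions["child-topic"] = mutableListOf(
    Handler("child-handler", listOf({ _ -> error("boom") }))
)
val parent = Handler(
    "parent-handler",
    listOf(
        { emit -> emit("child-topic") },  // step 0: emit, then wait
        { _ -> println("never reached") } // step 1: preempted by child failure
    )
)

try {
    runHandler(parent)
} catch (e: ChildFailedException) {
    println(e.message)        // parent-handler failed while suspended in step 0
    println(e.cause?.message) // boom
}
```

Scoop reproduces this same shape across process boundaries: the parent saga is suspended (persistently, in the database) rather than sitting on a call stack, but the propagation path is identical.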
If the parent doesn\u0026rsquo;t handle it, it bubbles up to its parent, and so on\u0026mdash;exactly as exceptions do in regular code.\nDistributed stack traces #The first thing this allows you to do is build something akin to a distributed stack trace\u0026mdash;a description of the place an error happened, and how the \u0026ldquo;thread of execution\u0026rdquo; got there, but across multiple services. This overlaps with the information you get from distributed tracing, except it doesn\u0026rsquo;t require any separate technology or instrumentation\u0026mdash;it\u0026rsquo;s right there, inside your exception, where you need and expect it.\nAn example is in order (notice we\u0026rsquo;re naming the steps here\u0026mdash;optional, but it makes the exceptions more informative):\n// Note: Each of the following could, in theory,\n// run in a completely different service,\n// and be written in a completely different\n// language!\n\n// Pretend it listens to \u0026#34;parent-topic\u0026#34;\nsaga(name = \u0026#34;parent-handler\u0026#34;) {\n    step(\u0026#34;First parent step\u0026#34;) { scope, message -\u0026gt;\n        logger.log(\u0026#34;1\u0026#34;)\n        scope.launch(\u0026#34;child-topic\u0026#34;, JsonObject())\n    }\n    step(\u0026#34;Second parent step\u0026#34;) { scope, message -\u0026gt;\n        logger.log(\u0026#34;This will not execute\u0026#34;)\n    }\n}\n\n// Pretend it listens to \u0026#34;child-topic\u0026#34;\nsaga(name = \u0026#34;child-handler\u0026#34;) {\n    step(\u0026#34;First child step\u0026#34;) { scope, message -\u0026gt;\n        logger.log(\u0026#34;2\u0026#34;)\n    }\n    step(\u0026#34;Second child step\u0026#34;) { scope, message -\u0026gt;\n        logger.log(\u0026#34;3\u0026#34;)\n        scope.launch(\u0026#34;grandchild-topic\u0026#34;, JsonObject())\n    }\n}\n\n// Pretend it listens to \u0026#34;grandchild-topic\u0026#34;\nsaga(name = \u0026#34;grandchild-handler\u0026#34;) {\n    step(\u0026#34;First grandchild step\u0026#34;) { scope, message -\u0026gt;\n        logger.log(\u0026#34;4\u0026#34;)\n    }\n    step(\u0026#34;Second grandchild step\u0026#34;) { scope, message -\u0026gt;\n        logger.log(\u0026#34;5\u0026#34;)\n        throw MyException(\u0026#34;My exception message\u0026#34;)\n    }\n}\nIn the above, the log entries would appear in the expected 12345 order (but, naturally, if each saga were running in a different service, the strings would get logged to different places).\nAdditionally, in the service running grandchild-handler, the following exception would be visible (for brevity, I\u0026rsquo;m truncating the stack trace here):\n{\n  \u0026#34;type\u0026#34;: \u0026#34;io.github.gabrielshanahan.scoop.blocking.coroutine.MyException\u0026#34;,\n  \u0026#34;causes\u0026#34;: [],\n  \u0026#34;source\u0026#34;: \u0026#34;grandchild-handler[0197dfd8-a424-7712-926f-b557d00203c0]\u0026#34;,\n  \u0026#34;message\u0026#34;: \u0026#34;My exception message\u0026#34;,\n  \u0026#34;stackTrace\u0026#34;: [\n    {\n      \u0026#34;fileName\u0026#34;: \u0026#34;DemoTest.kt\u0026#34;,\n      \u0026#34;className\u0026#34;: \u0026#34;io.github.gabrielshanahan.scoop.blocking.coroutine.DemoTest\u0026#34;,\n      \u0026#34;lineNumber\u0026#34;: 48,\n      \u0026#34;functionName\u0026#34;: \u0026#34;demoTest$lambda$8$lambda$7\u0026#34;\n    },\n    ...\n  ]\n}\nAfter that, in the service running child-handler, the following exception would become visible (again truncating the stack trace):\n{\n  \u0026#34;type\u0026#34;: \u0026#34;io.github.gabrielshanahan.scoop.shared.coroutine.eventloop.ChildRolledBackException\u0026#34;,\n  \u0026#34;causes\u0026#34;: [\n    {\n      \u0026#34;type\u0026#34;: \u0026#34;io.github.gabrielshanahan.scoop.blocking.coroutine.MyException\u0026#34;,\n      \u0026#34;causes\u0026#34;: [],\n      \u0026#34;source\u0026#34;: \u0026#34;grandchild-handler[0197dfd8-a424-7712-926f-b557d00203c0]\u0026#34;,\n      \u0026#34;message\u0026#34;: \u0026#34;My exception message\u0026#34;,\n      \u0026#34;stackTrace\u0026#34;: [\n        {\n          \u0026#34;fileName\u0026#34;: \u0026#34;DemoTest.kt\u0026#34;,\n          \u0026#34;className\u0026#34;: \u0026#34;io.github.gabrielshanahan.scoop.blocking.coroutine.DemoTest\u0026#34;,\n          \u0026#34;lineNumber\u0026#34;: 48,\n          \u0026#34;functionName\u0026#34;: \u0026#34;demoTest$lambda$8$lambda$7\u0026#34;\n        },\n        ...\n      ]\n    }\n  ],\n  \u0026#34;source\u0026#34;: \u0026#34;child-handler[0197dfd8-a424-73a9-926e-82e59ae4498a]\u0026#34;,\n  \u0026#34;message\u0026#34;: \u0026#34;Child failure occurred while suspended in step [Second child step]\u0026#34;,\n  \u0026#34;stackTrace\u0026#34;: []\n}\nFinally, in the service running parent-handler, the following exception would become visible (truncating the stack trace again):\n{\n  \u0026#34;type\u0026#34;: \u0026#34;io.github.gabrielshanahan.scoop.shared.coroutine.eventloop.ChildRolledBackException\u0026#34;,\n  \u0026#34;causes\u0026#34;: [\n    {\n      \u0026#34;type\u0026#34;: \u0026#34;io.github.gabrielshanahan.scoop.shared.coroutine.eventloop.ChildRolledBackException\u0026#34;,\n      \u0026#34;causes\u0026#34;: [\n        {\n          \u0026#34;type\u0026#34;: \u0026#34;io.github.gabrielshanahan.scoop.blocking.coroutine.MyException\u0026#34;,\n          \u0026#34;causes\u0026#34;: [],\n          \u0026#34;source\u0026#34;: \u0026#34;grandchild-handler[0197dfd8-a424-7712-926f-b557d00203c0]\u0026#34;,\n          \u0026#34;message\u0026#34;: \u0026#34;My exception message\u0026#34;,\n          \u0026#34;stackTrace\u0026#34;: [\n            {\n              \u0026#34;fileName\u0026#34;: \u0026#34;DemoTest.kt\u0026#34;,\n              \u0026#34;className\u0026#34;: \u0026#34;io.github.gabrielshanahan.scoop.blocking.coroutine.DemoTest\u0026#34;,\n              \u0026#34;lineNumber\u0026#34;: 48,\n              \u0026#34;functionName\u0026#34;: \u0026#34;demoTest$lambda$8$lambda$7\u0026#34;\n            },\n            ...\n          ]\n        }\n      ],\n      \u0026#34;source\u0026#34;: \u0026#34;child-handler[0197dfd8-a424-73a9-926e-82e59ae4498a]\u0026#34;,\n      \u0026#34;message\u0026#34;: \u0026#34;Child failure occurred while suspended in step [Second child step]\u0026#34;,\n      \u0026#34;stackTrace\u0026#34;: []\n    }\n  ],\n  \u0026#34;source\u0026#34;: \u0026#34;parent-handler[0197dfd8-a3ec-7e97-9a99-1ca8a5f1598c]\u0026#34;,\n  \u0026#34;message\u0026#34;: \u0026#34;Child failure occurred while suspended in step [First parent step]\u0026#34;,\n  \u0026#34;stackTrace\u0026#34;: []\n}\nThose UUIDs next to the handler name are there because all sagas in Scoop are horizontally scalable by design, and the UUID identifies the actual instance of the saga that executed that particular step. We\u0026rsquo;ll talk more about Scoop\u0026rsquo;s execution model in the following article.\nYou could even take the above a step further, store the stack trace at the point each message is actually emitted, and \u0026ldquo;concatenate\u0026rdquo; it with the stack trace of the child to get even more precise information. Scoop, being a POC, keeps things simple and doesn\u0026rsquo;t do that, but it could.\nDistributed stack unwinding #I mentioned earlier that sagas need compensating actions in order to revert the changes they made when an error causes them to fail. 
Since parents wait for their children to finish executing, Scoop can execute rollbacks in a very structured and predictable way, in effect achieving the equivalent of stack unwinding, but across multiple services.\nCompensating actions are defined as part of the step that they roll back and are executed in the opposite order to the steps\u0026mdash;subsequent steps are rolled back before preceding ones.\nLet\u0026rsquo;s look at an example:\nsaga(name = \u0026#34;parent-handler\u0026#34;) {\n    step(\n        \u0026#34;First parent step\u0026#34;,\n        invoke = { scope, message -\u0026gt;\n            logger.log(\u0026#34;1\u0026#34;)\n            scope.launch(\u0026#34;child-topic\u0026#34;, JsonObject())\n        },\n        rollback = { scope, message, throwable -\u0026gt;\n            logger.log(\u0026#34;9\u0026#34;)\n        }\n    )\n    step(\n        \u0026#34;Second parent step\u0026#34;,\n        invoke = { scope, message -\u0026gt;\n            println(\u0026#34;This will never print\u0026#34;)\n        }\n    )\n}\n\nsaga(name = \u0026#34;child-handler\u0026#34;) {\n    step(\n        \u0026#34;First child step\u0026#34;,\n        invoke = { scope, message -\u0026gt;\n            logger.log(\u0026#34;2\u0026#34;)\n        },\n        rollback = { scope, message, throwable -\u0026gt;\n            logger.log(\u0026#34;8\u0026#34;)\n        }\n    )\n    step(\n        \u0026#34;Second child step\u0026#34;,\n        invoke = { scope, message -\u0026gt;\n            logger.log(\u0026#34;3\u0026#34;)\n            scope.launch(\u0026#34;grandchild-topic\u0026#34;, JsonObject())\n        },\n        rollback = { scope, message, throwable -\u0026gt;\n            logger.log(\u0026#34;7\u0026#34;)\n        }\n    )\n}\n\nsaga(name = \u0026#34;grandchild-handler\u0026#34;) {\n    step(\n        \u0026#34;First grandchild step\u0026#34;,\n        invoke = { scope, message -\u0026gt;\n            logger.log(\u0026#34;4\u0026#34;)\n        },\n        rollback = { scope, message, throwable -\u0026gt;\n            logger.log(\u0026#34;6\u0026#34;)\n        }\n    )\n    step(\n        \u0026#34;Second grandchild step\u0026#34;,\n        invoke = { scope, message -\u0026gt;\n            logger.log(\u0026#34;5\u0026#34;)\n            throw MyException(\u0026#34;My exception message\u0026#34;)\n        },\n        rollback = { scope, message, throwable -\u0026gt;\n            logger.log(\n                \u0026#34;\u0026#34;\u0026#34;\n                This will not execute, because the transaction hadn\u0026#39;t\n                committed yet when the exception was thrown, so a standard\n                transaction rollback happened and there\u0026#39;s nothing to\n                compensate for.\n                \u0026#34;\u0026#34;\u0026#34;\n            )\n        }\n    )\n}\nFollow the numbers to understand in what order things execute, but it\u0026rsquo;s pretty intuitive\u0026mdash;you\u0026rsquo;re basically rolling back time.\nWhat if there are multiple handlers listening to one topic, and only one of them fails, while the others succeed? Glad you asked!\nsaga(name = \u0026#34;parent-handler\u0026#34;) {\n    step(\n        \u0026#34;First parent step\u0026#34;,\n        invoke = { scope, message -\u0026gt;\n            logger.log(\u0026#34;1\u0026#34;)\n            scope.launch(\u0026#34;child-topic\u0026#34;, JsonObject())\n        },\n        rollback = { scope, message, throwable -\u0026gt;\n            logger.log(\u0026#34;7\u0026#34;)\n        }\n    )\n    step(\n        \u0026#34;Second parent step\u0026#34;,\n        invoke = { scope, message -\u0026gt;\n            println(\u0026#34;This will never print\u0026#34;)\n        }\n    )\n}\n\nsaga(name = \u0026#34;child-handler-1\u0026#34;) {\n    step(\n        \u0026#34;First child-1 step\u0026#34;,\n        invoke = { scope, message -\u0026gt;\n            logger.log(\u0026#34;2a\u0026#34;)\n        },\n        rollback = { scope, message, throwable -\u0026gt;\n            logger.log(\u0026#34;6\u0026#34;)\n        }\n    )\n    step(\n        \u0026#34;Second child-1 step\u0026#34;,\n        invoke = { scope, message -\u0026gt;\n            logger.log(\u0026#34;3a\u0026#34;)\n            scope.launch(\u0026#34;grandchild-topic\u0026#34;, JsonObject())\n        },\n        rollback = { scope, message, throwable -\u0026gt;\n            logger.log(\u0026#34;5\u0026#34;)\n        }\n    )\n}\n\nsaga(name = \u0026#34;child-handler-2\u0026#34;) {\n    step(\n        \u0026#34;First child-2 step\u0026#34;,\n        invoke = { scope, message -\u0026gt;\n            logger.log(\u0026#34;2b\u0026#34;)\n        },\n        rollback = { scope, message, throwable -\u0026gt;\n            logger.log(\u0026#34;4b\u0026#34;)\n        }\n    )\n    step(\n        \u0026#34;Second child-2 step\u0026#34;,\n        invoke = { scope, message -\u0026gt;\n            logger.log(\u0026#34;3b\u0026#34;)\n            throw MyException(\u0026#34;My exception message\u0026#34;)\n        },\n        rollback = { scope, message, throwable -\u0026gt; 
logger.log( \u0026#34;\u0026#34;\u0026#34; This will not execute, because the transaction hadn\u0026#39;t committed yet when the exception was thrown, so a standard transaction rollback happened and there\u0026#39;s nothing to compensate for. \u0026#34;\u0026#34;\u0026#34; ) } ) } Again, keep in mind that each of those can potentially be running in a completely different service, written in a completely different language (assuming the same structured cooperation protocol is implemented in that language\u0026mdash;more on that in the next article).\nThe failing child handler first rolls itself back, after which control is transferred to the parent. The parent sees that one of its children has failed, so it triggers a rollback of the remaining children, waits for them to complete, then rolls back itself.\nNotice how I logged some of the numbers with a letter\u0026mdash;that\u0026rsquo;s to represent that these blocks are running in parallel, so you can\u0026rsquo;t guarantee their relative order. You could get any of 2a-2b-3a-3b-4b, 2a-3a-2b-3b-4b, 2a-2b-3b-4b-3a, or any other combination where each 2x comes before 3x and 3b comes before 4b. 
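This reverse-order rollback is the same mechanism as unwinding a call stack, and can be sketched in a single process with plain Java. The sketch below is an illustrative analogy, not Scoop's distributed implementation: each completed step pushes its compensating action onto a stack, and a failure pops and runs them in LIFO order, so subsequent steps are rolled back before preceding ones.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Single-process sketch of compensation-based rollback: completed steps
// push their compensations onto a stack, and on failure the stack is
// unwound in LIFO order, like stack unwinding for exceptions.
public class CompensationSketch {
    public static List<String> run() {
        List<String> log = new ArrayList<>();
        Deque<Runnable> compensations = new ArrayDeque<>();
        try {
            log.add("step-1");
            compensations.push(() -> log.add("rollback-1"));
            log.add("step-2");
            compensations.push(() -> log.add("rollback-2"));
            // step-3 fails before its compensation is registered, so there
            // is nothing to compensate for it -- just like the transaction
            // rollback case in the example above.
            throw new RuntimeException("step-3 failed");
        } catch (RuntimeException e) {
            // Unwind: the most recently completed step is compensated first.
            while (!compensations.isEmpty()) {
                compensations.pop().run();
            }
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run()); // [step-1, step-2, rollback-2, rollback-1]
    }
}
```

In Scoop, the "stack" is distributed across services and persisted, but the ordering guarantee it provides is the same.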
The rest of the logs will be ordered deterministically.\nOne last example: if any messages were emitted during any step that is being rolled back, compensating actions of those \u0026ldquo;child\u0026rdquo; handlers are run first.\nsaga(name = \u0026#34;parent-handler\u0026#34;) { step( \u0026#34;First parent step\u0026#34;, invoke = { scope, message -\u0026gt; logger.log(\u0026#34;1\u0026#34;) scope.launch(\u0026#34;child-topic\u0026#34;, JsonObject()) }, rollback = { scope, message, throwable -\u0026gt; logger.log(\u0026#34;7\u0026#34;) } ) step( \u0026#34;Second parent step\u0026#34;, invoke = { scope, message -\u0026gt; logger.log(\u0026#34;4\u0026#34;) throw MyException(\u0026#34;My exception message\u0026#34;) }, rollback = { scope, message, throwable -\u0026gt; logger.log( \u0026#34;\u0026#34;\u0026#34; This will not execute, because the transaction hadn\u0026#39;t committed yet when the exception was thrown, so a standard transaction rollback happened and there\u0026#39;s nothing to compensate for. \u0026#34;\u0026#34;\u0026#34; ) } ) } saga(name = \u0026#34;child-handler\u0026#34;) { step( \u0026#34;First child step\u0026#34;, invoke = { scope, message -\u0026gt; logger.log(\u0026#34;2\u0026#34;) }, rollback = { scope, message, throwable -\u0026gt; logger.log(\u0026#34;6\u0026#34;) } ) step( \u0026#34;Second child step\u0026#34;, invoke = { scope, message -\u0026gt; logger.log(\u0026#34;3\u0026#34;) }, rollback = { scope, message, throwable -\u0026gt; logger.log(\u0026#34;5\u0026#34;) } ) } There are other cases I\u0026rsquo;m not discussing here\u0026mdash;what if there are more than two child handlers? What if more than one handler fails? What if a rollback step fails? I\u0026rsquo;ll discuss some of these in the next article; others, I\u0026rsquo;ll leave up to the motivated reader to look up in tests. 
For now, suffice it to say that in all those scenarios, Scoop is well-behaved, and you can probably figure out what that behavior is just by thinking about what it should be.\nAs a consequence of this approach to handling failures, you get a lot of non-trivial features for free or very little work, such as cancellations, timeouts, rollbacks triggered by a user action, and more. Scoop supports all of these, and we\u0026rsquo;ll talk about some of them in the next article.\nResource handling #Another key feature recovered by adhering to structured cooperation is resource handling. By that, I mean the distributed analogue to various language constructs that allow you to delimit a block of code within which a resource is available, while also ensuring that resource is cleaned up regardless of how that block is exited (normally or exceptionally). This is typically done via try-finally.\nA resource is anything that\u0026rsquo;s considered expensive. In the context of distributed systems, think less \u0026ldquo;opening a file\u0026rdquo; and more \u0026ldquo;spinning up a cluster of 100 servers to run a calculation\u0026rdquo;.\nIn Scoop, because of the way failures are guaranteed to propagate, this is easy to do:\nsaga(\u0026#34;root-handler\u0026#34;) { tryFinallyStep( invoke = { scope, message -\u0026gt; k8Service.spinUp(requestId = \u0026#34;123\u0026#34;, num = 100) scope.launch(\u0026#34;do-intensive-calculation\u0026#34;, JsonObject()) }, finally = { scope, message -\u0026gt; k8Service.spinDown(requestId = \u0026#34;123\u0026#34;) }, ) } What\u0026rsquo;s important is that tryFinallyStep isn\u0026rsquo;t some special primitive\u0026mdash;you can build it yourself using what we\u0026rsquo;ve already introduced, plus a single additional thing, CooperationContext, which we\u0026rsquo;ll talk about in the next article.\nAlternatively, you could wrap the k8Service in a saga of its own, and take advantage of the way Scoop works natively.\n// Listens on 
\u0026#34;k8-spinup\u0026#34; topic saga(name = \u0026#34;k8-spinup\u0026#34;) { step( invoke = { scope, message -\u0026gt; k8Service.spinUp(\u0026lt;extract params from message\u0026gt;) }, rollback = { scope, message, throwable -\u0026gt; k8Service.spinDown(\u0026lt;extract params from message\u0026gt;) } ) } // Listens on \u0026#34;k8-spindown\u0026#34; topic saga(name = \u0026#34;k8-spindown\u0026#34;) { step { scope, message -\u0026gt; k8Service.spinDown(\u0026lt;extract params from message\u0026gt;) } } saga(name = \u0026#34;root-handler\u0026#34;) { step { scope, message -\u0026gt; scope.launch(\u0026#34;k8-spinup\u0026#34;, JsonObject().put(\u0026#34;request-id\u0026#34;, 123).put(\u0026#34;num\u0026#34;, 100)) } step { scope, message -\u0026gt; scope.launch(\u0026#34;do-intensive-calculation\u0026#34;, JsonObject()) } step { scope, message -\u0026gt; scope.launch(\u0026#34;k8-spindown\u0026#34;, JsonObject().put(\u0026#34;request-id\u0026#34;, 123)) } } If rolling back whatever do-intensive-calculation entailed were itself also intensive, you could even consider having k8Service.spinUp as a compensating action for the k8-spindown saga step. The sky\u0026rsquo;s the limit here.\nWhat if I don\u0026rsquo;t want to cooperate? #The fundamental way structured cooperation works is by synchronizing parts of a distributed system\u0026mdash;in essence, structured cooperation is a synchronization primitive, much like structured concurrency is. It allows you to make explicit the things that depend on each other, by allowing you to order them so that whatever depends on something else only starts executing after the thing it depends on has finished executing.
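This ordering guarantee has a familiar single-process analogue. The sketch below uses plain Java threads and is purely illustrative (nothing Scoop-specific): B depends on A, so the coordinator blocks until A has finished before submitting B.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Local analogy for structured cooperation's ordering guarantee:
// B depends on A, so B is only started once A has completed, with
// the coordinator waiting in between.
public class OrderingSketch {
    public static List<String> run() throws Exception {
        List<String> log = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<?> a = pool.submit(() -> log.add("A finished"));
            a.get(); // wait for the dependency before starting B
            Future<?> b = pool.submit(() -> log.add("B finished"));
            b.get();
        } finally {
            pool.shutdown();
        }
        return log;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run()); // [A finished, B finished]
    }
}
```

The distributed version replaces `Future.get()` with handlers suspending between steps until their children's handlers have finished, but the shape of the guarantee is identical.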
Components of the distributed system cooperate to ensure this is always the case, waiting for each other if needed.\nNaturally, there are times when you don\u0026rsquo;t want this behavior\u0026mdash;when you want to fire off a message that\u0026rsquo;s independent of the operation you\u0026rsquo;re implementing.\nThis is how that\u0026rsquo;s done in Scoop:\nsaga(name = \u0026#34;root-handler\u0026#34;) { step { scope, message -\u0026gt; scope.launchOnGlobalScope(\u0026#34;some-topic\u0026#34;, JsonObject()) } } In the above, you\u0026rsquo;re explicitly saying that you\u0026rsquo;re launching a completely independent hierarchy of messages. You\u0026rsquo;re not waiting for it to complete. If it fails, you won\u0026rsquo;t be notified about it. You can\u0026rsquo;t be, because you\u0026rsquo;re not waiting\u0026mdash;there\u0026rsquo;s nobody to notify. If your saga rolls back, that message hierarchy will not be notified. It can\u0026rsquo;t be, because who knows what state it\u0026rsquo;s in\u0026mdash;it might not have completed yet, or it might have already been rolled back, or it might be in the process of rolling back, or something else.\nI want to emphasize that this is not a fringe feature. Structured cooperation is a tool, not a dogma\u0026mdash;you should only use it when you need to solve the problem it was designed to solve. If the operations performed by two different services depend on each other, then that dependency is there no matter what you do\u0026mdash;you just can\u0026rsquo;t start B before A finishes, period, and structured cooperation is an excellent tool to make that dependency explicit, and provide the necessary synchronization.\nBut if the operations performed by two services are independent, then you have the option of choosing. Do you want predictable execution, distributed exceptions, stack traces, and just general peace of mind, at the cost of additional overhead and latency?
Great\u0026mdash;keep using structured cooperation. But if performance is an issue, you always have the option of falling back to doing things the old way by launching an independent message hierarchy.\nIn Scoop, the manner in which you decouple execution hierarchies has additional advantages:\nyou\u0026rsquo;re being explicit\u0026mdash;the launchOnGlobalScope is immediately visible, can be searched for, etc.,\nwhichever some-topic handlers end up being launched can still participate in structured cooperation amongst themselves.\nSo in effect, you get the best of both worlds\u0026mdash;the ability to synchronize the parts of a distributed system that need to be synchronized, without needlessly slowing down the parts that don\u0026rsquo;t. Sometimes, the stars align, and you get to have your cake and eat it too.\nIncidentally, basically the same thing happens when only part of the (distributed) system implements structured cooperation, and another part doesn\u0026rsquo;t. The part that doesn\u0026rsquo;t is simply independent of the part that does, and you have no guarantees about anything that happens there, but that doesn\u0026rsquo;t stop you from using structured cooperation in some subset of your system. As a consequence, you can switch to structured cooperation gradually, service by service.\nWrapping up #I hope I\u0026rsquo;ve started to convince you that structured cooperation will make your interactions with distributed systems dramatically simpler, or at least piqued your curiosity. In the next article, we\u0026rsquo;ll take a closer look at how structured cooperation is implemented in Scoop.\nScoop has nothing to do with event sourcing. I\u0026rsquo;m including the comparison solely because Axon, by design, is built for distributed environments, and forces you to model things accordingly.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nAt times, you might notice the word coroutine being used, e.g., in package names.
That isn\u0026rsquo;t a reference to Kotlin coroutines, but rather to Scoop\u0026rsquo;s own (distributed) implementation. Kotlin coroutines are not used anywhere in Scoop.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nSince Scoop uses its own MQ on top of Postgres, publishing messages only when the transaction commits is easy to do\u0026mdash;the messages are part of the transaction. If it were implemented in a realistic context, an external MQ would likely be involved, which means this would need to use something like the outbox pattern.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nYou might be wondering how you share data between steps, if you have no guarantee they will all be run by the same instance of a service. This is what CooperationContext is for, and we\u0026rsquo;ll talk about it in the next article. Basically, it\u0026rsquo;s the equivalent of reactive context, CoroutineContext, etc.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nIf you\u0026rsquo;re getting structured concurrency vibes, you\u0026rsquo;re exactly right! That\u0026rsquo;s where structured cooperation gets its name from.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nWait\u0026mdash;how does Scoop find out which handlers it was supposed to have been waiting for in the first place? That\u0026rsquo;s a very important question and not trivial to answer in the context of distributed systems. The way you decide to answer it\u0026mdash;because it will be up to you\u0026mdash;is one of the key decisions you need to make when using Scoop, depends on how your system is architected, has implications related to the CAP theorem, and is what that second required parameter to saga I mentioned earlier\u0026mdash;an instance of EventLoopStrategy\u0026mdash;is there for. 
We\u0026rsquo;ll discuss this whole topic at length in the following article.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","date":"6 July 2025","permalink":"/posts/introducing-structured-cooperation/","section":"Articles","summary":"An introduction to structured cooperation, a new way to design distributed systems, and Scoop, a POC orchestration library.","title":"Taming Eventual Consistency—Applying Principles of Structured Concurrency to Distributed Systems"},{"content":"","date":null,"permalink":"/tags/basics/","section":"Tags","summary":"","title":"basics"},{"content":"","date":null,"permalink":"/tags/lisp/","section":"Tags","summary":"","title":"lisp"},{"content":"Introduction #If you\u0026rsquo;ve spent any amount of time in programming circles, you\u0026rsquo;ve probably heard some version of the \u0026ldquo;static vs. dynamic typing\u0026rdquo; debate. But beneath the tribalism lies a very real and very practical difference in what these languages optimize for, and any good engineer should understand what that difference is, what the tradeoffs are, and when they should choose which.\nI\u0026rsquo;ve already made the case for static typing in a separate article, so I won\u0026rsquo;t reproduce it here. Instead, I want to dig into why the arguments on both sides feel so compelling to the side that puts them forth, touch on an analogy that drives the whole thing home, and finish by making a short stop in a place not many people get to visit\u0026mdash;Common Lisp.\nThere is no war #I want you to think about something that seems completely unrelated\u0026mdash;Microsoft Word vs. pen \u0026amp; paper.\nPen \u0026amp; paper is a fantastically flexible tool. You can use it to write down anything, in any form\u0026mdash;words, pictures, flowcharts, graphs, and anything and everything that has ever been invented or will ever be invented. 
You can write in any direction or angle, change the size of what you\u0026rsquo;re writing to fit the space, draw an arrow and just continue elsewhere if it still doesn\u0026rsquo;t, go back and add a quick note in the space above a sentence you had previously written, or squeeze something in the margins, or anywhere else. How someone else writes on paper, or how they feel you should, has absolutely no bearing on how you do it. You don\u0026rsquo;t need to learn anything new. If you need to write something down, you can write it down in precisely the time it takes to stretch out your arm and grab what you need\u0026mdash;it\u0026rsquo;s so fast it doesn\u0026rsquo;t even register as an activity. And everything I just described requires constant, and very low, effort to do.\nBy contrast, Microsoft Word is rigid. You can only really use it to write text, and in a very specific manner\u0026mdash;in Latin languages, it goes line by line, left-to-right. You can tweak the font, size, color, and many other attributes, but you can ultimately only do so in certain ways, and beyond a certain point it tends to involve significant effort. You spend a lot of time jumping through hoops, and there are things that you just can\u0026rsquo;t do at all, period. There\u0026rsquo;s a significant learning curve, so much so that there are dedicated courses that teach you how. That\u0026rsquo;s actually a pretty absurd thing, when you think about it\u0026mdash;why should you spend weeks learning to do something that you\u0026rsquo;ve been able to do with zero effort and zero cost since you were a child?\nI think the analogy is quite clear. Now ask yourself this: is there a war between pen \u0026amp; paper and Microsoft Word? Of course not! That\u0026rsquo;s completely absurd, and I\u0026rsquo;m sure that anybody reading this agrees.\nNow ask yourself, why? What would you tell two hypothetical people arguing about Microsoft Word vs. 
pen \u0026amp; paper?\nEasy\u0026mdash;you\u0026rsquo;d explain that they\u0026rsquo;re different tools for different jobs, ask the people arguing what problem they\u0026rsquo;re trying to solve, and make a recommendation based on the task at hand. You\u0026rsquo;d recognize that there\u0026rsquo;s no single correct answer, and that it depends on context. But you can easily imagine making the case from either side, and ridiculing the other.\nWould you write a book using pen \u0026amp; paper, much less collaborate on one? Would you manage project documentation, which changes every week, using pen \u0026amp; paper? Have you ever tried making sense of someone else\u0026rsquo;s notes, or even tried deciphering someone else\u0026rsquo;s handwriting for that matter?\nFrom the other side\u0026mdash;would you open up Word just to jot something down quick? Would you use Word to map out an idea in your head, or work through something you\u0026rsquo;re not quite clear on? Do you use Word for shopping lists, tic-tac-toe, or keeping track of scores in a drinking game? Can you imagine doing all three in a single document?\nThis underscores the key point I\u0026rsquo;m trying to make: you can make the argument in both directions, and still be completely right in the specific scenarios you cherry-picked. This is the fundamental reason there even is a typing war in the first place (and probably applies to wars in general)\u0026mdash;both sides are absolutely convinced that they\u0026rsquo;re correct, because they really are, in the scenarios they cherry-picked.\nWhat\u0026rsquo;s more\u0026mdash;they don\u0026rsquo;t realize they\u0026rsquo;re cherry-picking! Their opinions are based on their life experiences and the problems they\u0026rsquo;ve encountered, and all they\u0026rsquo;re doing is optimizing for these problems. 
But they\u0026rsquo;ve lived different lives, encountered different challenges, and learned to optimize for different problems!\nA typical web designer probably doesn\u0026rsquo;t know what it\u0026rsquo;s like to maintain code that was written 10 years ago by someone that\u0026rsquo;s not around anymore, in situations where a small mistake can cost hundreds of thousands of dollars. And a typical enterprise Java developer probably has no conception of what it\u0026rsquo;s like to have an idea now, hammering together a working prototype in two hours, polishing it, and tomorrow just being\u0026hellip;done. \u0026ldquo;Done\u0026rdquo; is not a word that exists in enterprise software development.\nHow surprising is it that these two groups have different perceptions of what constitutes an everyday problem, and therefore what people should be optimizing for? Obviously, these are two very different jobs, and require a very different set of tools.\nHow do you win a type war? Simple\u0026mdash;by recognizing that there\u0026rsquo;s no such thing as a type war.\n\u0026ldquo;Different tools for different jobs\u0026rdquo;, not \u0026ldquo;different strokes for different folks\u0026rdquo; #There is a group of people who will read the previous paragraphs and draw the conclusion that, fundamentally, both approaches are always valid, and that the decision is a purely subjective one\u0026mdash;\u0026ldquo;different strokes for different folks\u0026rdquo;.\nThey are wrong.\nYes, there is no choice that\u0026rsquo;s correct universally, but that doesn\u0026rsquo;t mean that there isn\u0026rsquo;t a choice that\u0026rsquo;s correct in a given set of circumstances. 
I\u0026rsquo;m not saying there isn\u0026rsquo;t a twilight zone between night and day where things can be reasonably done in both ways, but I am saying that\u0026rsquo;s not the case in the vast majority of situations.\nAll other things being equal, I strongly believe that, in most situations, there are choices that are objectively more correct than others. I do want to emphasize the \u0026ldquo;all other things being equal\u0026rdquo; part, as in the majority of situations, all other things are not equal. The Chief Architect at Slack would still choose PHP to build a new app in 2020, and it\u0026rsquo;s certainly not because of its approach to typing, but in spite of it.\nStatically typed languages force you to solve a certain problem and explicitly express your solution, every time, all the time. Dynamically typed languages do not force you to do anything - ever. They are completely flexible and arbitrary and you can do pretty much whatever you want. They don’t prevent you from thinking hard, but they don’t force you to, either. Often, that’s fine, but it becomes easy to miss subtle problems that look simple but actually require you to think hard, because there is no warning sign. In statically typed languages, the warning sign is that it doesn’t compile, and it\u0026rsquo;s there every time.\nAs a result, statically typed languages are better at catching mistakes that you didn’t (or even couldn’t) know you made. Above and beyond that, they make code more regular, more predictable, allow new features to be built, and allow you to communicate information that is guaranteed to be true.\nNone of that is subjective.\nSo, all other things being equal, when should you choose which?\nI think the analogy in the previous section provides a very good guiding principle. 
Use dynamic languages in situations where you\u0026rsquo;d use pen \u0026amp; paper\u0026mdash;when you\u0026rsquo;re doing something small, something that will be finished soon, finished forever, and something that you\u0026rsquo;re building alone, and will always build alone. In all other situations, especially large projects with multiple developers that evolve over time, use a language that provides a strong, static and sound (looking at you, Java) type system.\nOf course, that\u0026rsquo;s almost never how the choice is actually made, but that\u0026rsquo;s a different story.\nCan you have your cake, and eat it too? #For most of my career, I\u0026rsquo;ve operated under the assumption that this choice is all-or-nothing\u0026mdash;either my language, and my entire application, is typed, or it\u0026rsquo;s not. Either I optimize the initial phases of a project and pay for it in the later stages (dynamically typed languages), or I optimize for the later stages and pay for it in the initial phases (statically typed languages).\nHowever, there are what you call gradually-typed languages, where you can type parts of your code, and leave others untyped. In the vast majority of cases, gradually-typed is a fancy name for a by-product of a dynamic language that added types in retrospect and needed to maintain backwards compatibility, which is what happened with type hints in Python and PHP. Often, the capability is not even part of the language itself, and you need an external tool, e.g. PHPStan, mypy or Sorbet.\nWhile this would seem to be a very good compromise\u0026mdash;I can start out flexible, and become more rigid as I need to\u0026mdash;the immediate counter-argument is that these type systems are usually very low-quality\u0026mdash;obviously, when anything is bolted on as an afterthought, its quality is going to suffer. 
Additionally, and perhaps even more importantly\u0026mdash;if a language started out as dynamic, and only added static types in retrospect, the majority of its users will not actually use them, or use them correctly, and neither will the majority of its ecosystem. Both of these issues significantly impact the actual benefits you can reap from this approach.\nBut let\u0026rsquo;s steelman the argument and disregard those issues\u0026mdash;after all, we can imagine a language which was designed from the ground-up to be gradually typed (even though, interestingly enough, there are practically none). There is still another, much more subtle issue with these languages: calling out to untyped code practically never requires any special ceremony.\nMost languages and tools have various knobs and dials that allow you to tweak what happens when you do this, e.g. PHP doing runtime checks on function boundaries when configured to do so, PHPStan warning when its strictness is dialed up to the highest levels, TypeScript offering strict mode, and Sorbet doing the same. However, in all of these languages/tools, anything beyond \u0026ldquo;cover your eyes and hope for the best\u0026rdquo; is opt-in, and none of them tell you at the call-site that you\u0026rsquo;re calling out to untyped code.\nIn other words, in gradually typed languages, types do not color functions. I consider this to be a fatal flaw. The boundary between typed and untyped code should be like the Korean demilitarized zone\u0026mdash;immediately visible, and not crossed lightly.\nTo my knowledge, there is only one industrial-grade language that gets this right\u0026mdash;Common Lisp1. CL itself is dynamically, but strongly, typed, with an incredibly rich type system\u0026mdash;right out of the box, you have an excellent sweet spot between flexibility and safety. Furthermore, its macro system allows you to build whatever language feature you need, and naturally, people have. 
This is how we got Coalton, an implementation of a practical subset of the Haskell type system, checked during compile-time\u0026mdash;as a library.\nThis means you can:\nomit types when you don\u0026rsquo;t want them or they don\u0026rsquo;t make sense, e.g. when writing macros, prototyping, or interacting with the REPL at any time, choose to switch to one of the strongest and most sound type systems in existence, with 0 performance cost (and, since Coalton generates native Lisp type declarations, potentially significant performance gains) your type system is just a library\u0026mdash;it\u0026rsquo;s not \u0026ldquo;another thing\u0026rdquo;, it\u0026rsquo;s right there, next to the rest of your code. Want to know what a type declaration actually does? Just go-to-definition on it. Not sure why your program isn\u0026rsquo;t compiling? Debug it as you would any other problem. Can you imagine having the ability, when needed, to understand compiler errors by putting a breakpoint in the typechecker? In Lisp, that\u0026rsquo;s just a normal Tuesday. Last but not least, Coalton colors code.\nSummary #The debate between statically and dynamically typed languages is often presented as a war, but I think it\u0026rsquo;s actually anything but.\nAt its core, static typing optimizes for maintainability\u0026mdash;explicitly modeling data and relationships, catching subtle bugs early, and enabling rich tooling, predictable design patterns, and concise communication of information to the reader. On the flip side, dynamic typing optimizes for speed of initial development. It shines when exploring new ideas, building throwaway prototypes, or working solo on small-scale tasks.\nLike pen \u0026amp; paper and Microsoft Word, they’re different tools for different jobs, rather than opposing forces. 
Most flame wars arise because people tend to optimize for different pain points, based on the different experiences they\u0026rsquo;ve lived through.\nThat being said, I strongly disagree with the idea that the choice between them is simply a subjective one. The right tool depends on the task, context, and constraints, and in any given scenario, one is usually more appropriate than the other. Knowing when to choose which is a mark of good engineering.\nIdeally, one would want to work in an ecosystem that put both tools at their disposal, and while most current gradually-typed languages haven\u0026rsquo;t fleshed out the concept enough to bring any actual value, Common Lisp with Coalton is a rare example where you truly can blend both worlds meaningfully without giving anything up.\nThere is no war. Just tools, tradeoffs, and consequences.\nAn argument could be made for Typed Racket, however a) I\u0026rsquo;m personally not convinced it has a track record that would warrant the title \u0026lsquo;industrial-grade\u0026rsquo;, and b) while its approach to the typed/untyped boundary is sound and robust, it doesn\u0026rsquo;t actually visually distinguish the call-site\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","date":"26 March 2025","permalink":"/posts/how-to-win-type-wars/","section":"Articles","summary":"Beyond the static vs. 
dynamic typing flame war: a practical guide to understanding the tradeoffs, when to use each, and how Common Lisp bridges both worlds.","title":"This is how you win type wars"},{"content":"","date":null,"permalink":"/tags/types/","section":"Tags","summary":"","title":"types"},{"content":"Introduction #The debate between static and dynamic typing has been going on for as long as the two have existed, and while I think it\u0026rsquo;s actually very easy to settle, I did want to take some time to address the most common arguments on each side.\nStatic typing #The essence of statically typed languages is that they force you to think about certain problems\u0026mdash;specifically, the shapes of data\u0026mdash;and force you to be explicit about how you solve them.\nFor instance, a (proper) statically typed language will force you to think, and be explicit, about how you represent absence\u0026mdash;is it null, {}, [], Unit, Option.empty(), or something else? Are any of those even ever permissible, and if so, which one(s)? What do they mean in the context of the problem I\u0026rsquo;m solving? You have no choice but to think about this, solve it in a way that\u0026rsquo;s consistent with the rest of the code base, and precisely and explicitly describe that solution alongside your code.\nCrucially, statically typed languages are mean, and force you to think about this even when it might make no practical difference! For example, you might just be interested in the truthiness of the return value, so any (or some, depending on the language) of the above will do. Perhaps things are completely self-evidently correct no matter the choice, or you don\u0026rsquo;t want to decide right away and instead opt to keep your options open, or you specifically do want it to work for completely different data types.
But even in those situations, you simply have no choice, and must define a specific, well-defined shape for each and every piece of data in a consistent manner, and adhere to that shape from then on.\nProponents of dynamically typed languages\u0026mdash;“dynamic typers”\u0026mdash;will balk at this, with a variation of the following arguments:\ntypes constrain my solutions, and force me to solve problems in a certain way. Types take away my guns freedom!\ntypes prolong the time it takes to deliver a solution\ntypes are a waste of time, because in 99% of cases, I can write perfectly functioning code without them\nproving my code is safe to the compiler is a waste of time, because in 99% of cases, my tests already do that\ntypes make code cluttered and difficult to read\nProponents of static languages\u0026hellip;pretty much agree completely!\nTypes constrain my solutions #Exactly!\nThey force you to write code in a well-defined, well-researched manner that was specifically designed to prevent certain kinds of errors from happening, and to do so with mathematical certainty. It doesn’t matter if you’re tired, stupid, uninformed, malicious, make a mistake, or forget something. If it compiles, it provably doesn’t contain these mistakes, period.\nBut there’s more. Since it reduces the number of different ways solutions can be coded, it actually makes things easier to understand\u0026mdash;less variety reduces the cognitive load for the reader.\nTaken further, since it forces all of us to think in similar ways, we tend to express ourselves in similar ways, which means what you write tends to be closer to what I would have written, further shortening the time it takes for me to understand it. This is pretty much the driving force behind things like Patterns of Enterprise Application Architecture\u0026mdash;“let’s solve problems consistently”.\nAnd there’s even more! Less variety = more patterns. A pattern is equivalent to saying that certain assumptions are guaranteed to be true. 
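The "how do you represent absence" question from earlier is easy to make concrete. Below is a minimal Java sketch; `findCustomer` and the in-memory `db` map are hypothetical illustrations, not taken from any real codebase.

```java
import java.util.Map;
import java.util.Optional;

// A minimal sketch of making "absence" explicit and consistent.
class AbsenceExample {
    // The return type states up front that the customer may be missing;
    // callers cannot accidentally forget to handle the empty case.
    static Optional<String> findCustomer(int id, Map<Integer, String> db) {
        return Optional.ofNullable(db.get(id));
    }

    public static void main(String[] args) {
        Map<Integer, String> db = Map.of(1, "Alice");
        String msg = findCustomer(2, db)
                .map(name -> "Hello, " + name)
                .orElse("No such customer");
        System.out.println(msg); // prints "No such customer"
    }
}
```

The `Optional` in the signature is precisely one of those guaranteed assumptions: every reader, and every tool, knows the value can be absent and that the absent case has to be handled.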
And when certain assumptions hold, features can be designed around them. This is why static languages have reliable IDE hinting, go-to-definition/show usages, refactorings, and so on. This is why banning GOTO opened the doors to try-with \u0026amp; exceptions, and the same principle also lies at the heart of structured concurrency.\nTypes prolong the time it takes to deliver a solution #Right on!\nIn a sense, I probably burn more time thinking about types, and naming them, than I do on anything else, apart from research.\nBut while types prolong the time it takes to deliver a solution, they also dramatically shorten the time it takes to deliver a continuously working solution\u0026mdash;one that works correctly now, and will continue to do so when assumptions change, when the project is 10x bigger, when none of the original authors work here anymore, etc.\nTypes expose and make explicit the coupling between distant parts of the codebase, and cause things to fail immediately if these parts are inconsistent. They also prevent you from collapsing code for different data types\u0026mdash;if you want a function that operates on both strings and integers, you need to write two functions (or do some truly funky-JavaScript-like shit that will make reviewers go Liam Neeson on your ass).\nGiven that, it should come as no surprise to anyone that it takes more time to design things in a way so that they are consistent and separated. But that same extra time you spent will get reimbursed 100x by spending less time fixing bugs due to inconsistent assumptions in far-away places, especially as time goes on and things change. In this rare instance, tight coupling is actually a good thing.\nTypes also force you to spend time making sense of things, and give names to those things. 
Things that are named have meaning, and expressing a solution through something that has meaning makes the result more understandable to the people who maintain it (which includes future-you).\nAbove and beyond what I just wrote, I actually don\u0026rsquo;t think that types themselves take up any time whatsoever. What takes up time is the problems that they force you to think about. For example, when you\u0026rsquo;re modelling a business process, they force you to think hard about what is actually transforming into what and what the what\u0026rsquo;s should be called. That\u0026rsquo;s what takes long, but that\u0026rsquo;s a measure of both the complexity of the problem and the fact you don\u0026rsquo;t understand it fully yet, not of types being time-consuming. Actually writing the types themselves is a matter of seconds, and if the problem is trivial or you understand it well, e.g. \u0026ldquo;write a function that returns the first element of a List\u0026rdquo;, you spend no time on types at all.\nI would go as far as to say that types actually decrease the amount of time it takes for me to deliver a working solution. Why? Because I\u0026rsquo;ve been using them for so long that they\u0026rsquo;ve shaped my way of thinking into immediately focusing on what goes in and what goes out. I\u0026rsquo;m so used to this, I do it automatically, and each time I\u0026rsquo;m doing it, I train my brain to be a little better at it.\nAs explained above, understanding what the what\u0026rsquo;s are, and naming them, unlocks a much deeper understanding of what\u0026rsquo;s going on. Often, it will even uncover blind spots that you hadn\u0026rsquo;t thought of up until then, and you might even realize you need to go back to your stakeholders and pick their brains some more. If you hadn\u0026rsquo;t done that, you would\u0026rsquo;ve instead started writing a bunch of code that would\u0026rsquo;ve ended up being wrong. 
If you were lucky, you would discover that at some point down the road before going live, but even then, potentially a lot of that effort would go down the drain.\nTypes allowed you to save all that time by finding out in advance, before you wrote a single line of code.\nTypes are a waste of time, because in 99% of cases, I can write perfectly functioning code without them #Fo’ sho!\nIf your understanding of the problem is complete and correct, and you know it, clap your hands it’s easy to not make a mistake. If your understanding of the problem isn’t complete or correct, and you know it, clap your hands it’s also easy to not make a mistake. But when you think your understanding is complete and correct, but it actually isn’t, or when your understanding is correct, but you end up writing something else than what you’re thinking\u0026mdash;that’s when it’s really difficult not to make a mistake. Because you simply can’t know what you don’t know\u0026mdash;that’s the point.\nStatically typed languages force you to think hard about the problem. And when you think you’ve got it right, they check your work and call you out if you’re wrong, regardless of how sure you are of yourself.\nDynamic languages do not\u0026mdash;they accept what you write without question, they will not check your work, not even if you aren’t sure of yourself.\nNow, a lot of the time, maybe even 99% of the time, that doesn’t matter\u0026mdash;naturally, you’re a programmer worth your weight in bytes, and tend to not make mistakes. But then, one day, you do make a mistake, and it does matter. And this is the key difference\u0026mdash;while dynamic languages optimize for the 99%, static languages optimize for that 1% (because they know that, more often than not, it\u0026rsquo;s actually a lot more than 1%). And that\u0026rsquo;s before you consider situations where the mistakes happen just because you make a change that\u0026rsquo;s incompatible with the implicit assumptions in some far away place. 
You have no way of knowing that. You can be the best programmer in the world, and that\u0026rsquo;s still going to happen at some point\u0026mdash;it\u0026rsquo;s not if, it\u0026rsquo;s when.\nProving my code is safe to the compiler is a waste of time, because in 99% of cases, my tests already do that #Fo’ shizzle my nizzle!\nIf your perception of the problem truly is correct, and your code is also correct, then it’s a waste of time\u0026mdash;the job is already done. If your perception is correct, but your code isn’t correct, tests will likely tell you\u0026mdash;you designed them around your correct understanding of the problem\u0026mdash;so again, a waste of time.\nBut when your perception is wrong, then your code is also wrong, but crucially, in all likelihood, so are your tests! When designed based on flawed understanding of the problem, tests are likely validating a flawed solution, not the correct one. And in those situations, you’ll never know, because you can’t know what you don’t know.\nAnd that\u0026rsquo;s even before getting into the fact that tests rarely cover 100% of the code they\u0026rsquo;re supposed to test, not to mention that they\u0026rsquo;re code themselves, which makes them prone to exactly the same types of mistakes you set out to prove don\u0026rsquo;t exist in the first place!\nTypes are, literally, automatically generated tests that run at compile time. They are always generated, always correct, and always run. They are objective, and confront your solution with reality, not your (possibly flawed) perception of it. 
And because you’re forced to satisfy them, they prevent you from making mistakes even in situations when you don’t realize your reasoning is flawed.\nAgain\u0026mdash;dynamic languages optimize for the 99%, static languages optimize for that 1% (because they know that, more often than not, it\u0026rsquo;s actually a lot more than 1%).\nTypes make code cluttered and difficult to read #Well\u0026hellip;yes, but actually, no.\n\u0026lt;T\u0026gt; MergeMatchedSetMoreStep\u0026lt;R\u0026gt; set(Field\u0026lt;T\u0026gt; field, Select\u0026lt;? extends Record1\u0026lt;T\u0026gt;\u0026gt; value);\nAnybody who’s not used to typed languages and generics will look at the above and see a bowl of ASCII soup. But I don’t\u0026mdash;on the contrary, not only do I feel completely comfortable reading it, I actually feel like I’m flying blind when reading code that doesn\u0026rsquo;t contain such declarations.\nSo what’s the difference between me and dynamic typers? It’s simple\u0026mdash;I’m used to it. I’ve been looking at types for so long that they add almost no cognitive load for me. Is reading French inherently difficult, or is it only difficult for those who are not used to it? As with beauty, difficulty is in the eye of the beholder.\nAs every static typer will know, it turns out that, far from making things more difficult, the contrary is true\u0026mdash;types make things much clearer, concisely communicating information that I would otherwise have to parse out of the solution. And while I do spend (infinitesimally) more time reading it than I would a simple function set(field, value), it’s because I’m busy learning something about the method that I will then take advantage of when reading its contents. I have the option to ignore the types if I want to, and just read the method and parameter names, but I choose not to, because the usefulness of the information contained in the types far outweighs the usefulness of the information contained in parameter names. 
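As a smaller illustration of how much a signature alone can communicate, here is a hypothetical Java sketch (the `firstOf` name is my own, assuming nothing beyond the standard library):

```java
import java.util.List;
import java.util.Optional;

class SignatureExample {
    // An unconstrained T cannot be inspected or fabricated out of thin air,
    // so the signature already tells you this method can only hand back an
    // element of xs, or report absence.
    static <T> Optional<T> firstOf(List<T> xs) {
        return xs.isEmpty() ? Optional.empty() : Optional.of(xs.get(0));
    }

    public static void main(String[] args) {
        System.out.println(firstOf(List.of(1, 2, 3))); // prints Optional[1]
        System.out.println(firstOf(List.of()));        // prints Optional.empty
    }
}
```

That is the kind of information parameter names alone could never carry, and unlike a comment, the compiler keeps it honest.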
Chief among the reasons is the fact that, unlike parameter names, types are actually checked for correctness, so I know they don’t lie.\nIn fact, a personal ideal I use for good code is \u0026ldquo;I shouldn’t have to read anything other than the method signature to understand what’s going on\u0026rdquo;. That\u0026rsquo;s not always attainable, but that\u0026rsquo;s why it\u0026rsquo;s an ideal\u0026mdash;so I can get as close to it as is reasonable given the circumstances. It shows me true north.\nSummary #It should be clear by now that I\u0026rsquo;m a strong proponent of static typing; however, that does not mean I think it should be used blindly, all the time, and without thought. On the contrary, I think that dynamic and static typing are complementary tools, and both should be used.\nHowever, that in no way means I think that the choice is a matter of mindset, as some do\u0026mdash;on the contrary, I think the decision is almost always objective.\n","date":"25 March 2025","permalink":"/posts/arguments-against-static-typing/","section":"Articles","summary":"A practical defense of static typing that addresses common complaints, agrees with them, and shows why they make static typing powerful.","title":"Arguments against static typing"},{"content":"Pretty much as far back as I can remember, I\u0026rsquo;ve always had a deep passion for technical subjects, particularly mathematics and programming. This has been the driving force behind most of my life, and strongly influenced every aspect of the path I\u0026rsquo;ve been on.\nI earned my Bachelor\u0026rsquo;s degree from University of West Bohemia, where I graduated summa cum laude. I originally studied pure mathematics for two semesters but switched to cybernetics because the math was more interesting. In the western world, cybernetics is more commonly known as control theory.\nAfter graduating, I became co-owner of SugarFactory, a SugarCRM Reseller Partner. 
With me serving as its CTO, we built SugarFactory from the ground up and became one of the fastest-growing SugarCRM partners in Europe, achieving Elite status in just under two years. SugarFactory was eventually acquired by Algotech a.s..\nTaking advantage of my experience in SugarCRM, I went on to build Glucose, a code-generation tool that dramatically reduces the time it takes to customize SugarCRM, and QuickQuery, a custom-built, user-friendly query language that transpiles to SQL, complete with parser, compiler and syntax highlighter.\nAt about the same time I joined a mid-sized Java-centric organization, where I kickstarted and oversaw a company-wide migration to Kotlin. It was during this time, and for this purpose, that I wrote the first version of what would become the Kotlin Primer.\nMy recent career has been centered in the JVM world, where I\u0026rsquo;ve built and maintained traditional applications with Spring Boot and event-sourced CQRS applications with Axon. It was my experience (and frustration) with the latter that led me to design structured cooperation and implement a POC\u0026mdash;Scoop.\nOutside of my career, I like to explore less conventional areas\u0026mdash;currently, it\u0026rsquo;s Haskell and functional programming in general, and Common Lisp.\nI also speak at meetups:\nUnexpected properties of exceptions (2024) Rust, Haskell \u0026amp; Context receivers - a brief introduction to functional error handling (2024) ","date":null,"permalink":"/author/","section":"","summary":"Pretty much as far back as I can remember, I\u0026rsquo;ve always had a deep passion for technical subjects, particularly mathematics and programming. This has been the driving force behind most of my life, and strongly influenced every aspect of the path I\u0026rsquo;ve been on.\nI earned my Bachelor\u0026rsquo;s degree from University of West Bohemia, where I graduated summa cum laude. 
I originally studied pure mathematics for two semesters but switched to cybernetics because the math was more interesting.","title":"About me"},{"content":"","date":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories"}]