The Ascetic Programmer
How asceticism benefits programming, science, and the arts
Cover illustration based on a program by Robert Nystrom [183]. Used by permission.
Many thanks to Marco Scanu and David Bylsma for providing feedback on an early draft, Tom Lindsey and my wife Maria for patient proofreading and to Maria for her unwavering support.
Introduction
Ascetic programming is the writing of computer programs under restrictions unrelated to resources such as memory or power. They range from using fewer lines of code to forgoing certain programming language features. My interest in ascetic programming started with challenges among coworkers, which we undertook to keep our skills sharp, our curiosity for programming alive, and, not least, for bragging rights.
We were not the first to think of imposing restrictions on programming other than those on resources. For instance, you may be familiar with Edsger W. Dijkstra's condemnation of the goto statement (see Sec. 1.6).
However, among possible restrictions, the one on program length has caught my long-term interest. It has connections to different facets of computing, from maintainability to testing to theory. These associations are mostly positive, meaning that shorter programs are better along several dimensions: for instance, they are more readable and maintainable and contain fewer errors.
Asceticism originated as a spiritual practice millennia ago. Ascetic restrictions traditionally target worldly pleasures and promise salvation or enlightenment in return. This book will focus mainly on code length restrictions and, more generally, description length restrictions, and our rewards will be plenty and varied, but, unfortunately, won’t include salvation.
Asceticism in programming comes in two flavors, roughly described by the slogans “doing more with less” and “less is more.” On the one hand, we seek to deliver the same functionality with fewer lines of code; on the other, we aim to identify essential features and discard the rest. I will refer to the former as conciseness and to the latter as frugality. We will discuss both as they apply to programming at different scales, from snippets of code to complete systems.
Beyond programming, I discovered unexpected examples of asceticism in statistics — where the preferred jargon is parsimony — philosophy, the visual arts, music, literature, and more. Outside science and engineering, the function of an artifact is less clear or undefined, and thus the boundary between conciseness and frugality becomes blurred. The equivalent of program length may or may not be available: for instance, it is, to a degree, for statistical models but less so for the visual arts. I hope you will find these more speculative forays interesting.
This book is a collection of cases or testimonials where brevity plays a positive role and where verbosity is the villain. The opposing view that verbosity is superior has also been allotted some space, as a counterpoint, without any pretense of equal treatment. Don’t expect a grand theory or a practical method, but rather a series of interconnected anecdotes and expert opinions, interspersed with the occasional theorem and research paper, some from giants in their respective fields and some from regular people working in the trenches. Let the case for asceticism build from the totality of the book rather than a single, decisive argument.
I hope that, by the end, you will be convinced that asceticism benefits engineering, science, the arts, and more. May this book inspire you to adopt unnecessary restrictions in your endeavors.
Who should read this book?
This book covers a wide variety of subjects, from computer programming to science to the arts. It doesn’t go particularly deep in any of them, but, reflecting my own background, it provides the most detail in the section about computing, less so in the one about science and only makes a cursory pass through the arts. I hope that anyone who has some familiarity with programming and mathematics will be able to enjoy most of the book. While there are plenty of internal connections, many of which I highlighted with cross-references, they are not essential for understanding. Therefore, skipping a subsection here or there won’t prevent you from enjoying the rest.
Reading advice
On devices that allow resizing the text view or the font, I recommend you choose settings such that the display shows a full line of code without wrapping — up to 88 characters. Where supported, long lines will scroll horizontally, and only as a last resort will they wrap. Since indentation is meaningful in computer programs, and this book often discusses line counts, wrapping would reduce clarity. For the most part, footnotes contain definitions of concepts or entities that may be unfamiliar to some. If you don't feel like you need to read a footnote to improve your understanding, it is probably safe to skip it. Other footnotes provide license information for material reproduced in this book when appropriate.
Code snippets are for illustrative purposes only, are stripped of import or equivalent statements to access libraries, and are not guaranteed to be correct even after restoring those statements.
1 Computing
1.1 Metrics
A first step towards espousing conciseness as a goal in software engineering is convincing ourselves that lines of code, or LOCs, are a valuable measure of program complexity. Certainly, 100,000 lines of well-structured code are simpler to understand or maintain than the same amount of so-called “spaghetti” code, which contains many interactions between its parts. However, one of the pioneering figures in computer science, Edsger W. Dijkstra, posed the problem of size as a fundamental one in software engineering [2]:
[…] widespread underestimation of the specific difficulties of size seems one of the major underlying causes of the current software failure.
Elsewhere, Dijkstra warned again about the dangers of accumulating large codebases [3]:
[…] if we wish to count lines of code, we should not regard them as “lines produced” but as “lines spent”: the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.
Dijkstra’s preferred solution, namely structured programming, is intended to make it easier to deal with large codebases rather than to reduce code size (see Sec. 1.2.4).
Microsoft co-founder Bill Gates, according to Internet folklore, said:
Measuring programming progress by lines of code is like measuring aircraft building progress by weight.
In more recent times, this sentiment is still present among developers. For example, free software advocate Chris Warburton writes on Hacker News2 [4]:
Software is a liability, not an asset. We should produce as little as needed, and delete it if possible.
In the classic software engineering book The Mythical Man-Month, Frederick P. Brooks Jr. states [5]:
Productivity seems constant in terms of elementary statements, a conclusion that is reasonable in terms of the thought a statement requires and the errors it may include.
Albeit expressed in statements rather than lines, in keeping with the usage of the time, this supports the idea that LOCs are a reasonable measure of complexity. This book follows the common practice of counting LOCs. Yet this count is partially dependent on the vagaries of code formatting (see Sec. 1.2.7). An alternative is counting tokens, the size of the representation of a program that a computer builds to run it. It is a formatting-independent measure, which we will adopt in one specific example (see Quicksort in Sec. 1.2.6).
We often run into invariants like the one referenced above, expressed in terms of LOCs. For instance, John Lions, the author of a well-known guide to the Unix operating system (see Sec. 1.3.6), states [6]:
It has often been suggested that 1000 lines of code represents [sic] the practical limit in size for a program which is to be understood and maintained by a single individual.
While the exact value in this statement may be debatable and its origin unclear, what is noteworthy is that it uses LOCs as its unit of measure. Code structure, programming language, or application domain appear to be secondary.
No other quantity has emerged as a viable substitute for LOCs. Herraiz and Hassan review some of the alternatives and come to the following conclusion [7]:
All the complexity metrics are highly correlated with lines of code, and therefore the more complex metrics provide no further information that could not be measured simply with lines of code.
Among the alternatives is cyclomatic complexity [8]. It focuses on the flow of control in a program by viewing it as a graph. A node represents a sequence of statements devoid of control statements, and if and while statements correspond to the subgraphs in Fig. 2. With this graph representation, the cyclomatic complexity of a program is one more than the circuit rank3 of its associated graph, that is, E - N + C + 1, where E is the number of edges, N is the number of nodes, and C is the number of connected components. If you suspected this is a long-winded way to count the number of control statements, you would be right, unless we consider multiway branching statements like switch in C.
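As a toy illustration (my own example, with one plausible way of drawing the graph), consider a function containing one while and one if:

def count_positive(xx):
    i, count = 0, 0
    while i < len(xx):   # loop header: a node with two outgoing edges
        if xx[i] > 0:    # branch: a node with two outgoing edges
            count += 1
        i += 1
    return count

A reasonable control-flow graph for this function has N = 6 nodes (the entry block, the while test, the if test, count += 1, i += 1, and the return), E = 7 edges, and C = 1 connected component, so its cyclomatic complexity is E - N + C + 1 = 7 - 6 + 1 + 1 = 3, one more than its two control statements.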
It is possible to double the size of a program without altering its control structure — say, interleaving logging statements with the original ones — and, therefore, without changing its cyclomatic complexity. However, experimental studies show that it correlates well with program size, and those that used it to predict the number of bugs4 while controlling for program size had inconsistent results. A possible conclusion is that cyclomatic complexity isn't so valuable "since program size is not a controllable feature of commercial software" [8]. However, program size, controllable or not, predicts crucial properties of software.
One of them is the number of bugs. According to Capers Jones, a consultant on software methodologies, it is more than proportional to program size, falling in the range of 0-25 defects per KLOC (a thousand LOCs) for programs smaller than 2 KLOCs, but rising to 4-100 for those over 512 KLOCs ([9]; as quoted in [10]).
However, even more important than the number of bugs is the probability of project failure, defined as substantially missing a project’s goals. For example, Stack Overflow and Discourse co-founder Jeff Atwood writes [11]:
Here’s the single most important decision you can make on your software project if you want it to be successful: keep it small. Small may not accomplish much, but the odds of outright failure — a disturbingly common outcome for most software projects — is [sic] low.
W. Wayt Gibbs, a science writer for Scientific American, explains [12]:
Software is exploding in size as society comes to rely on more powerful computer systems. That faith is often rewarded by disappointment as most large software projects overrun their schedules and many fail outright — usually after most of the development money has been spent.
1.2 Concise Software
1.2.1 Programming
I actually found a way to shorten his code by 1 word and it had only taken me 20 years to do it. […] it was the only time I’d managed to reduce some of Gosper’s code.5 It felt like a real victory. And it was a beautiful piece of code.
In this passage from an interview with Guy L. Steele Jr., inventor of the Lisp dialect Scheme, we read how the quest for conciseness transcended its practical origins and mutated into an aesthetic one [14]. Back in the day when Steele and Bill Gosper, also a founding figure for symbolic computations and Lisp, left their mark on computing, computer memory was so scarce that shortening a program even by one word could have been necessary. Twenty years later, the evolution of hardware and software had been such that there could no longer be practical motivations to shave off a word; only the intellectual ones remained.
Yukihiro Matsumoto, inventor of the programming language Ruby, agrees with the aesthetic qualities of concise code but also adds the language of ethics [15]:
Brevity is one element that helps make code beautiful. […] In the vocabulary of programming, brevity is a virtue.
Dan Ingalls, a key contributor to the implementation of the language Smalltalk, recalls that, as he learned the language FORTRAN, he “tried to see how much shorter I could make programs, and stuff like that,” suggesting that, for him, it was a learning or playful activity [14].
For an experimental study, Lutz Prechelt asked several mostly inexperienced programmers to solve a simple string problem in various languages monitoring several properties of the solutions, including length. He concludes [16]:
Interpersonal variability, that is the capability and behavior differences between programmers using the same language, tends to account for more differences between programs than a change of the programming language.
Focusing on program length, this study finds that choice of language and programmer skill or style equally contribute to variation. Yet the point that matters for us is the wide variability in program length for the same task in the same language (see Sec. 1.2.9 for language-dependent conciseness). In this study, the programmers were allowed to code according to their preferred style. Would they have performed differently had they been asked to minimize length?
They may have, but it isn't something they will often be asked to do in their careers. We have seen that code size was a priority in the pioneering age of computing, and it might still be when developing software for underpowered devices. However, I am not aware that code size currently is an externally imposed constraint on major platforms. An exception should be web development since program download is part of the load time of a page. Yet, it hasn't prevented web pages from ballooning in size (see the "web obesity crisis," Sec. 1.5). Not only is minimizing program length no longer a goal: at least in some situations, there has been pressure to increase it. In an interview, former Microsoft CEO Steve Ballmer recalls how IBM required billing by the KLOC and that there was no way to convince them that this system discouraged the search for more concise solutions [17]. Once a task was estimated to require, say, 20 KLOCs, that put a cap on how much money the developer could make, and it became a target to hit.
Free from external pressure, do programmers aim to keep their programs short? Some argue in favor of slowing down, which is related to program length since time is a limited resource. For example, Poul-Henning Kamp, the author of the web caching program Varnish, estimates that he wrote what he intended to be his "highest quality program — ever" at the breakneck speed of 5 LOCs per hour [18]. If that sounds slow, the equivalent estimate in the previously mentioned Mythical Man-Month is much slower: 10 LOCs per programmer-day.6 Kamp adds that for life-and-death software, where correctness is paramount, such as that developed for human space flight missions, the same estimate drops to one LOC per programmer-day. Surely such a coding speed must discourage code bloat (but see Sec. 1.5 for bloated space software).
It isn’t only rocket science. Matt Lacey, a mobile developer and consultant, in a post titled “You’ve only added two lines — why did that take two days!” lists plenty of reasons why, at times, development may slow down to such a crawl [19]. He reports about fixing a bug, first understanding it in detail, from which subsystems are affected to how to reproduce it. Then, unsatisfied with just making the issue disappear, Lacey endeavors to find its root cause, if the problem may represent a class of related ones and how to fix it with a high-quality solution that won’t break his program elsewhere. Finally, he spends some more time preventing this or similar problems from happening in the future. Two days to perform all these tasks, even if the solution is only two lines, don’t seem like a waste of time.
Everyone in the software industry knows that slowing down is the opposite of how it works. There are entire methodologies aimed at creating arbitrary deadlines and fostering a fictional sense of emergency. Limits on size, on the other hand, are rarely, if ever, encountered. No boss ever told my teammates or me: “you could have done this with less code,” or “let’s try to keep it under 1000 LOCs.” However, I found a counter-example (see Dwm in Sec. 1.2.11 for an additional one). An anonymous Quora user asks if he should “fire someone who unnecessarily wrote 500 lines of code” without context [20]. We don’t know what these lines were supposed to achieve or if there was a pattern of verbosity in this engineer’s work. Rejecting the rush to summary judgment, another user replies:
As a programmer, I’ve written many, many lines of unnecessary code. Why? Because sometimes you have to write 500 unnecessary lines of code to get to the 50 absolutely necessary lines of code.
(for code deletion, see Sec. 1.2.2).
Going back to Kamp and his Varnish, he further writes:
Back when I was a kid I could write 1000 lines in a single sleep-deprived session across midnight, but it didn’t take that long before I discovered that I had to throw out most of it once I woke up again.
I was 40 years old when I started Varnish and I had 22 years of professional experience, a lot of them staring at, and often fixing/improving/refactoring, other peoples [sic] source code.
Patrick Burns, the author of Tao Te Programming, agrees that experience is a factor [21]:
Novices may think that the increase in the number of lines is good. More experienced programmers will know that the good parts are when the number decreases.
He goes one step further: instead of just accruing code slowly and carefully, Burns talks about reducing it (for code deletion, see the next section). As to bug fixes, Burns recommends making them as short as possible. He writes:
Change as little as possible. A one-character change is optimal.
One character is what would have saved the Mariner I probe from destruction (but it wasn’t in its control program, see Sec. 1.5). Burns brings this line of thought to its ultimate consequences and thinks “of programming as an exercise in compression.” While others use this language (see Steele’s “Huffman encoding problem” in Sec. 1.2.9 and Armstrong’s system compressor in Sec. 1.2.12), I believe the word “compression” has a more specific meaning in computing distinct from conciseness. Compressed data aren’t directly usable: they need to undergo a process of decompression that reconstitutes the original before being, for instance, displayed as images. Concise programs are run, read, and edited as they are, and there is no way to reconstruct a verbose version from which they may have been derived. A concise program is an independent entity, which can’t be said about compressed data, tightly bound to its uncompressed version. However, we could also take the other side of the argument: that decompressing a program is one of many transformations programs may undergo before being executed, such as macro expansion or translation into intermediate or machine language. The crucial difference between a compressed program and a concise one is that the latter is human-intelligible; it is the version developers read, modify and maintain. For compressed programs, that is the uncompressed version. A compressed program isn’t intelligible to developers, at least with known compression techniques.
It is unfair to complain that engineers write unnecessarily verbose code when they are not taught or rewarded for being concise. However, one company goes against the grain. Comma.ai, an AI startup, added conciseness to its requirements for aspiring employees [22]:
We value those who can do more with less code; software engineering doesn’t have an external selection pressure for low part count and cost like EE and ME, but you pay massive costs down the line in debugging and upkeep.
In an echo of “art for art’s sake,” James Hague, a video game developer, enthusiastically declares [23]:
Writing concise to-the-purpose solutions is a primary reason for programming in the twenty-first century.
If that is the future, it isn’t “evenly distributed yet” [24].
1.2.2 Deleting Code
If code isn’t written directly in its most concise form, a solution is to delete some of it. Of course, removing random lines won’t work: we need to identify useless ones, if there are any. “I love deleting code,” declares Nathan Marz, the author of the pioneering stream processing system Storm [25]. He isn’t alone. For instance, Chromium browser developer Devlin Cronin writes in a related bug-tracking thread [26]:
It [a proposed code change] has concrete benefits […] the ability to potentially delete a bit of UI code (something that always makes me happy!).
It may sound like an odd statement for someone whose craft is writing code. Artists have destroyed or mutilated their work in fits of disappointment [27] or as a performance piece [28] or have deliberately created ephemeral works [29]. However, it isn’t a widespread practice. In other branches of engineering, there is a pressure to reduce part count, and simplifying a design may be rewarding in and of itself. However, owing to a clear distinction between design and production, unnecessary parts come off a blueprint, not a finished product. Once removed, it is no longer necessary to make, procure or assemble them, and they don’t contribute to the cost or weight of an artifact anymore. In software engineering, once created, a line of code’s further costs are in the future and harder to estimate, in the form of debugging, learning, documentation, maintenance, and, as we have seen, even project failure. In my decades of experience with several employers and many more clients, we hardly ever tried to estimate those delayed costs. Therefore, the incentives to delete are weaker. However, good programmers know how much code they have to read and that someone has to maintain each line and so on. They understand the burdensome side of a line of code.
Know-how on deleting code is sparse. Nobody taught me how to delete code. A post titled Advice on Deleting Code sparked my interest [30]. At 1600 words long, it doesn't seem to convey much more than an appeal not to leave dead code7 behind, whether commented out8 or otherwise disabled or forgotten. That alone, though, is a good suggestion, as the following anecdote exemplifies.
I once worked on a codebase where a static analysis9 marked 1/3 of the code as unreachable. We supported this evidence with an analysis based on logging in production and for an extended time. Nonetheless, we couldn’t overcome resistance to deleting code that might still be useful to someone someday. This even after giving engineers veto power on removing any code for which they knew there was a use, and even after explaining that “deleting it” just meant “preserving it forever” in version control,10 from where it could be restored on demand. The unspoken truth was that few believed dead code to be harmful enough to warrant any risk of removing active code, thus creating bugs. The business side could not believe we were spending time on a non-customer-facing issue, and, in a way, they were right because nothing came of it.
Inexperience doesn’t help. Developer Neil Kakkar writes [31]:
I was very uncomfortable deleting shit or obsolete code. I considered what was written eons ago sacred.
Under the guidance of senior engineers, he reports having become more aggressive in deleting code but still recommends removing only dead code.
Some people even advocate the appointment of “code removal leaders.” For instance, in a Hacker News discussion about career prospects for senior developers, we read [32]:
I think the most valuable engineering leader would be someone who can remove code, or at least prevent unnecessary code from being written and maintained.
That reminded me of one of the best developers with whom I have ever worked. He had discovered another coworker was generating low-quality code in copious amounts, primarily by cut and paste. To address this, he spent the last half hour of each workday reviewing the other developer’s work, reverting all of his changes if they affected foundational modules. Luckily, the functionality to replace was minimal. This arrangement seemed to work well for everyone. What did this developer get from deleting all that code? His goal was to keep the amount of code he needed to understand and maintain in check. However, that isn’t the only valid motivation.11
Joe Armstrong, creator of the language Erlang, talked about “removing” as fundamental to how he went about his job [14]:
I would write a specific algorithm removing all things that were not necessary for this program. Whenever I got the program, it became shorter as it became more specific.
Reportedly, some can go beyond that. For example, Jon Bentley, inventor of the k-d tree data structure and author of the Programming Pearls series of books, reports that he “once heard a master programmer praised with the phrase, ‘He adds function by deleting code’ ” [15].
Others would like to fix bugs by deleting code. This tongue-in-cheek exchange between Michael B. Johnson, a software architect who contributed to many Pixar films, and Pete Barber, a software engineer, proposes this approach [33]:
In order to fix that bug I had to:
- have a good night’s sleep
- take a 2 hour hike
and then write one line of code.
deleting one line to fix it would have been even better
Fine. Ruin my little victory …
According to psychological studies, fixing by removing doesn’t readily occur to regular people, not just programmers (see Sec. 2.3). Only surgeons have made an art of it, owing to the redundancy in the human body and the scarcity of spare parts for it.
In another passage from [14], Armstrong goes beyond the practical:
If you start removing things, if you get to the point where if you were to remove anything more it would not work any more — at this point it is beautiful.
We will later read a similar statement generalized to all human artifacts and even natural forms (see Sec. 2.3). A crucial property that distinguishes an engineered object from a work of art is that the former serves a specific function: a program computes something or a building must let people in and out — unless it is a prison. In the arts, authors can push conciseness to nihilism, to the silent piece of music (see Sec. 3.3) or the empty canvas (see Sec. 3.2.3). Actually, these extremes can be reached by programmers as well, on occasion. This is the description for JavaScript module Noop2 [34]:
No operation as a module™. The long awaited successor to noop. Noop2 does even less than its predecessor, and is 100% tested to guarantee nothing happens.
Its author, Yoshua Wuyts, considers it “both a joke and actually useful,” reportedly to write tests.
1.2.3 Repetition
The most obvious opportunity for deletion is repetition. With physical artifacts, it is a form of order: the pillars of a bridge, its equally sized spans, the columns of a church, and so forth. Repetition brings about economies of scale, simplifies maintenance, and provides visual rhythm. On the contrary, it is hard to justify in computer programs since it is avoidable. There are many ways to eliminate repetition: for instance, loops, to execute the same code multiple times; and functions or methods, to name a snippet of code and recall it by name. These can optionally have parameters, like their mathematical brethren (see Sec. 1.2.4). Detecting repetition doesn’t require advanced skills or creativity: it is simple and mechanical. So much so that the code analysis service Code Climate includes an analysis tool to detect duplicated code [35]. Even if, according to their website, it “uses a fairly simple algorithm,” it detected a few almost identical duplications that had snuck into my own code. Unfortunately, their removal and the testing of the resulting code are still the user’s responsibility.
Indeed, repetitions need not be identical, but that is a fuzzy statement. Where does the quasi-repetition stop, and where does the "vaguely related code" start? In a simple case, we may find a simple parametrization that unifies all the repetitions under consideration. In a complex one, the corresponding abstraction may or may not be expressible in the language of choice, or it may require highly unsafe constructs whose use is discouraged (see Sec. 1.2.9.2). Another observation is that it is easy to generate duplications even with the best of intentions. Even if we aren't maliciously wielding the cut-and-paste function in the editor to exceed our KLOC targets (see IBM in Sec. 1.2.1), a repetition could result from the convergence of two similar ideas. We wrote the first copy too long ago to remember, or it is different enough that recognizing it as duplicative requires careful analysis. Even when identification is easy, there may not be time to eliminate repetition because of the work required. If that entails writing a new function, coding standards may also require unit tests,12 which copy-and-paste code escapes. Therefore, the elimination of the repetition must wait. However, writing a ticket13 to mark some duplication we left unattended generally isn't a career-boosting move, and those tickets never get taken care of regardless. The repetition lives to see another day, and we move on to more glorious endeavors, like launching a new feature. Duplications accumulate in codebases like in genomes (see Sec. 2.4), slowly drifting apart over time.
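As a minimal illustration of such a parametrization (a made-up example, not drawn from any codebase discussed here), two near-duplicate snippets collapse into one function once the varying parts become parameters:

# Before: two near-copies differing only in a predicate and a label.
def report_slow(requests):
    slow = [r for r in requests if r["latency_ms"] > 500]
    print(f"slow requests: {len(slow)}")

def report_failed(requests):
    failed = [r for r in requests if r["status"] >= 500]
    print(f"failed requests: {len(failed)}")

# After: one function, with the differences turned into parameters.
def report(requests, label, predicate):
    matching = [r for r in requests if predicate(r)]
    print(f"{label} requests: {len(matching)}")

reqs = [{"latency_ms": 620, "status": 200}, {"latency_ms": 80, "status": 503}]
report(reqs, "slow", lambda r: r["latency_ms"] > 500)
report(reqs, "failed", lambda r: r["status"] >= 500)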
Repetitions aren’t without drawbacks, though. Burns issues the following warnings [21]:
If you want to make a change to the repetitious code, then you have to change every occurrence. And before you can change them, you have to find them. If the change is a bug fix and you miss one, then you have just created a pernicious, hard-to-find bug. (“I already fixed that bug, it has to be somewhere else.”) If the change is a modification to the behavior and you miss one, then you have just created a pernicious, hard-to-find buglet. Buglets can grow into bugs.
(Burns doesn’t define a buglet, but, if I understand it correctly, it is a software defect that hasn’t yet manifested itself, a potential bug). That is why many programmers embrace the so-called DRY principle [36]. In the words of Matsumoto [15]:
Redundancy is defined as the duplication of information. In order to eliminate redundancy, we follow the DRY principle: Don’t Repeat Yourself. The concept of DRY is the antithesis of copy-and-paste coding.
Despite Matsumoto's strong statement, there is nothing in Ruby to enforce this principle, nor in any other language I am aware of. I can only speculate why that is the case. The problem may be deciding what repetitions are worth removing, particularly how small and after how many repetitions. Is printing a new line worth a special construct? Is println() better than print("\n")? Or is x++ better than x = x + 1? Different languages give different answers to these questions. If you had to print the same message n times, for what value of n is the following solution
print(message)
print(message)
...
print(message)  # n times in total
better than this alternative:
for i in range(n):
    print(message)
I know which one most people would choose when n is very small or very large, but, for intermediate values, it is anyone's guess.14 Without the lens of conciseness, these choices are seen as debatable and even just a matter of style.
Another obstacle to enforcing the DRY principle may be the issue of approximate repetition. Banning only identical copies would only represent a modest level of enforcement. Extending the proscription to approximate ones would face the problem of choosing which ones to remove. Moreover, if we replaced two code snippets with a function f(x), invoked with different values of x to replace each one, we would need to prove or at least corroborate the equivalence of those snippets to f. The equivalence problem is, in general, extremely difficult,15 but less so when the original snippets are similar.
However, the simplest explanation may be that code quality is considered outside the purview of compilers and interpreters and something that can only be enforced and improved by programmers. Nonetheless, there are separate tools, such as CPD, the aptly named Copy-Paste Detector, to detect duplications [37]. In my experience, their use could be more widespread in the industry.
Sometimes the obstacles preventing this process are in our skills and the complexities of the code: finding repetitions and unifying them. At other times, they are a question of priorities and goals. In an interesting anecdote, Dan Abramov recalls how he identified repetitions in an otherwise working piece of code [38]. After refactoring it, this is how he described the results:
The code is half the total size, and the duplication is gone completely! So clean. If we want to change the behavior for a particular direction or a shape, we could do it in a single place instead of updating methods all over the place.
The remarkable twist is that he had to revert his elegant solution under his boss’ orders and became convinced the messy code with the duplications was better, in what may be an engineering version of Stockholm syndrome. The reasons he brings range from the social — he didn’t ask permission — to the over-engineering — the clean solution could not support some future hypothetical use cases (see YAGNI in Sec. 1.3.1). Adding and removing code aren’t just two equally important sides of programming: one is considered an ordinary, day to day activity; the other, an exceptional, dangerous one requiring a broad consensus.
1.2.4 Abstraction
When talking about approximate repetitions, we saw that we could sometimes replace two blocks of code with one, possibly shorter than the sum of the original two, which performs the function of both. To achieve this, we need to unify their commonalities and reduce their differences to a small fraction of the code or, even better, parametrize them. This process is a special case of what is known as abstraction. As Dijkstra said in his Turing Award Lecture [39]:
[…] the only mental tool by means of which a very finite piece of reasoning can cover a myriad cases is called “abstraction”; as a result the effective exploitation of his powers of abstraction must be regarded as one of the most vital activities of a competent programmer.
It is natural to think that a piece of code that covers “a myriad cases” would spare us from the effort of coding each case one by one, thus reducing code length by a corresponding factor. For instance, C. A. R. Hoare, Turing award recipient and inventor of the Quicksort algorithm, declared: “The way to shorten programs is to use procedures”.16
However, that isn’t the mechanism proposed by Dijkstra. In the same speech — indeed, the same paragraph — he advocated using abstraction to reduce the “intellectual effort” required to write a program of a given length, not to reduce its length. He didn’t formulate this recommendation based on empirical data or a mathematical theory. He started instead from the common-sense observation that if all parts of a program interact, this results in many interactions that need to be understood, debugged, maintained, etc. Their number grows with the square of the size of the program. On the other hand, if each part interacts with a small number of others, this results in a much more manageable linear number of interactions.
Dijkstra appears to assume that abstract code results in programs of the latter type, but this is far from obvious. On the contrary, a more general function creates indirect dependencies between parts of a program and, thus, more of the entanglement that Dijkstra suggests reducing. Taking sorting as an example, let’s assume we have two independent programs, one that needs to sort arrays of integers and the other of strings. In the first scenario, each program uses its specific sorting function, and there is no other code-sharing between the two. In the other, they rely on a shared library function that can sort any collection. This sharing constitutes an indirect dependence between the two programs. If changes in the sorting code were to become necessary because of the needs of one program, a bug could appear that cripples the other one.
If keeping the two programs completely independent is paramount, an option is duplicating the sorting function, no matter how general, and letting each program depend on its separate copy. When involving entire libraries, developers call this forking the shared dependency (from the idea of a fork in the road). Albeit an operation that results in more code to maintain, developer, blogger, and speaker Thomas Figg advocates for repetition to be temporarily accepted, to prevent coupling, or to let a better abstraction emerge from examples ([41], see also “interpolation” below).
There are other ways to ensure isolation between parts of a program besides duplication. For instance, modules or functions which are clearly defined and independently tested minimize the coupling risk. In my previous example of modifications to a sorting function, they would be unusual for a task that is so clearly specified. Because of this, such a function is simple to test, and extensive testing reduces the likelihood of introducing bugs when modifying it. Therefore, sharing a sorting function between modules doesn’t induce any coupling worth worrying about. Indeed, such a function is available in any language I can think of as a built-in or standard library function, and few developers would bother creating their own.
A program built primarily out of functions as clearly specified as a sorting function behaves more like several small programs than a single large one. All its components will still run on one machine, sharing its resources and crashing all at once if only one is faulty. However, from the point of view of development, the different parts are more swappable: we can replace a buggy function with another; a slow implementation can give way to a faster one. Microservices take decoupling to the next level by running parts of a program or application in isolation.17 This way, a microservice can fail without dragging down the rest, and graceful degradation is possible. At an extreme scale, that is why the Internet works despite being based on, possibly, billions of LOCs:18 it can keep working even if at any given time parts of it aren’t — there are nonetheless occasional large-scale failures [42].
In summary, conciseness and modularity are two sides of coping with size: if size can’t decrease, by splitting code neatly into modules, we can deal with smaller chunks at once.
If writing more general code is harder and introduces more dependencies, does it offer any advantages? Well, what is better than covering "a myriad cases" (see Dijkstra's talk above) with a single coding effort? "Wherever there is repetition, there is an opportunity for abstraction," states Burns [21]. For example, imagine having a sorting function for each type of collection and then for each element type: one for arrays of integers, one for lists of strings, and so on. That would result in a large amount of repetitive code. Indeed we will see an example where a library (in java.util.Arrays, see Sec. 1.5) contains seven copies of essentially the same functions because the language hampers further generalization. Moreover, bugs can lurk for a very long time. For example, an implementation of binary search,19 generally accepted as correct, hid a bug for 20 years after being published [43]. Its Java standard library version contained the same error for nine years [44]. Imagine the effort to fix this bug if every project maintained its version of binary search.
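For readers who haven't seen it, here is a minimal binary search sketch (my own, not the library code discussed in [43] and [44]); the midpoint computation is where the long-lived bug hid, since in languages with fixed-width integers lo + hi can overflow on very large arrays, a failure mode that Python's arbitrary-precision integers happen to avoid:

def binary_search(xx, target):
    # assumes xx is sorted; returns an index of target in xx, or -1 if absent
    lo, hi = 0, len(xx) - 1
    while lo <= hi:
        mid = (lo + hi) // 2  # in fixed-width arithmetic, lo + (hi - lo) // 2 avoids the overflow
        if xx[mid] < target:
            lo = mid + 1
        elif xx[mid] > target:
            hi = mid - 1
        else:
            return mid
    return -1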
Abstracting over different data types is only one instance of a general phenomenon. Often code can grow as an accumulation of special cases until we realize a general rule or solution that can replace them and covers additional ones as a bonus. Using a statistical analogy, we are replacing individual data points with a model and predicting new ones as a result (for statistical modeling, see Sec. 2.1).
Rich Skrenta has observed something similar [45]:
[…] more code often means less flexibility and functionality. This is counter-intuitive, but a lot of times a simple, elegant solution is faster and more general than the plodding mess of code produced by a programmer of lesser talent.
I had a stark demonstration of this when I was still in college. As part of standard computer science coursework, I was working with other students on implementing a toy programming language. Our instructor asked us to use a modular architecture that subdivided its compiler20 into multiple interacting programs, each assigned to a sub-team. After preparatory meetings and some development, we all met to check that all parts worked together. One of them seemed to work on the simplest programs, but it failed at the first hint of nesting — a for loop inside another for loop, for instance. The student responsible for it promised to fix it and meet us again in a couple of days. At the following meeting, his program failed again, this time on two levels of nesting. This sequence of events repeated itself a few times, with his program producing errors each time on slightly more complicated examples. Concerned about the fate of our project, we asked if we could review his source code. We found that he treated the no-, single- and double-nesting cases separately. He had written a large amount of code to cover all the examples we had provided. He had yet to use recursion21 to deal with unlimited nesting, as required by the definition of the language. It looked like he thought that, in practice, the programs to be compiled wouldn’t be too complicated and that his approach would be enough. The result was a program that had to grow in size proportionally to the complexity of the examples it was required to handle. In this respect, it had something in common with applications whose complexity is proportional to the number of screens to display; or machine learning models, whose complexity grows with the size of the training set (for more on this, see Sec. 2.1.4). Only much later in my career, I realized that my teammate22 was anticipating the future of the software industry:
- he was pioneering test-driven development;
- he focused on his customers’ requests, with his team-mates being the customers;
- he maximized customer value — higher levels of nesting are used less often;
- he met his self-imposed rapid-fire deadlines;
- he wrote code quickly.
One obstacle to achieving the level of generality of a sorting function or a compiler is that not all abstractions emerge with the same level of clarity. For example, sometimes we create a do_stuff function that we reuse somewhere else before we understand what "stuff" exactly means. One consequence is that it keeps changing as we modify code elsewhere. It may also be harder to test in isolation from the contexts in which it runs. These are the telltale signs of a poor abstraction, but abstractions are seldom perfect from the start. Sometimes the correct generalization emerges over time after seeing it applied in different contexts, akin to interpolation in mathematics.
Marz writes [46]:
I view the development of beautiful abstractions as similar to statistical regression: you have a set of points on a graph (your use cases) and you’re looking for the simplest curve that fits those points (a set of abstractions).
The alternative, letting repetitive code proliferate and fixing it at a later time, has its drawbacks, as we have seen.
Another problem is that the connection between abstraction and conciseness isn’t considered fundamental. We have seen how Dijkstra thought abstract code reduced the intellectual effort required per line written, not the number of lines in a program. Furthermore, he didn’t look kindly at attempts to solve problems with a single line of code (see “one-liners” in Sec. 1.5). In my training, different abstraction techniques were touted for their potential to reduce coupling and not for any conciseness advantage. Without a measurable goal, the best level of abstraction becomes a matter of opinion. However, there is some agreement that it can be taken too far (see the LinkedIn example in Sec. 1.2.9.4), for instance, creating hierarchies of classes23 where most classes are never instantiated.
A few pay attention to the conciseness potential of specific programming techniques. For instance, software engineer and blogger Nikolay Grozev wrote an introduction to monads, the advanced programming technique inspired by category theory [47]. First, he develops an example that he deems “far from perfect, as we have repeated ‘glue code.’ ” Later, after applying monads, he observes:
This [using monads] avoids the shortcomings of our previous approach because the bind function implements all the glue code and we don’t have to repeat it.
The crucial point is ensuring that abstractions can be combined seamlessly, providing appropriate combinators. Only then can we reap the full benefits of creating those abstractions. An example could be modifying a module for arithmetic calculations where a division by 0 terminates a program so that it would instead generate a special error value, say "err". This module features four functions, add, sub, mul, and div, for the four basic arithmetical operations. Since a division by 0 can only occur during the execution of div, we start by modifying that function:
def div(num, den):
    return "err" if den==0 else num/den
The problem is that the other functions don't know how to deal with the value "err", so we can't just write add(1, div(1, 0)) to calculate 1 + 1/0. The intended behavior is that any operation involving "err" also results in the same value. One approach is to check every time we use div:
a = div(1, 0)
b = "err" if a == "err" else add(1, a)
This approach is verbose and error prone. Another option is to modify all four operations to deal with "err" appropriately. However, the "monadic" approach is to define a function, traditionally named bind, to modify the evaluation of the operations:
def bind(op, x, y):
    return "err" if x == "err" or y == "err" else op(x,y)
This way, we can write complex expressions as follows:
bind(add, 1, bind(div, 1, 0))
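Putting the pieces together, here is a minimal, self-contained sketch of the approach (my own consolidation of the snippets above, with ordinary Python functions standing in for the module's operations):

def add(x, y): return x + y
def sub(x, y): return x - y
def mul(x, y): return x * y

def div(num, den):
    # the modified division: signal the error with the special value "err"
    return "err" if den == 0 else num / den

def bind(op, x, y):
    # propagate "err" through any operation, so call sites need no checks
    return "err" if x == "err" or y == "err" else op(x, y)

print(bind(add, 1, bind(div, 1, 0)))  # prints err
print(bind(add, 1, bind(div, 4, 2)))  # prints 3.0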
Another example from functional programming24 circles relates to higher-order functions, which accept as arguments or return other functions, and are available in many popular languages but aren’t used widely in practice. Software engineer and blogger Gio Carlo Cielo guides us through a complete example in which “our code verbosity has been significantly reduced through the use of the higher-order functions” [48]. A minimal illustration of their use could be applying a function to each element in a list. Without higher-order functions, this is what we would write in Python:
def f(x):
    # some definition for f
    ...

result = []
for x in range(10):
    result.append(f(x))
With the standard (higher-order) function map, the same example becomes:
map(f, range(10))
This version is simultaneously more concise and more abstract. We obtained a more concise version by hiding details like creating an empty list and appending elements to it. Python offers an alternate form for this task, list comprehensions, using which our example becomes:
[f(x) for x in range(10)]
This code doesn't eliminate the loop variable x and is limited in its generality. For instance, there is no list comprehension for the fold or reduce higher-order function typical of functional programming. The likely reason is that it is the element of functional programming that Python's author Guido van Rossum has "always hated the most" and for which he needs to "grab pen and paper to diagram what's actually being fed into that function before I understand what the reduce() is supposed to do" [49]. However, this issue is wider-reaching than reduce itself and van Rossum's foibles: it is whether languages should provide the most common abstractions or the tools to build them. Higher-order functions are in the second group. For instance, extending map to a data structure such as a tree is straightforward, as we will see shortly.
Let’s start with some definitions. A tree is made out of nodes. A relation exists among them such that some nodes are parents to other nodes, called children. An ancestor to a node is either parent to that node or an ancestor to its parent. An ancestor is connected to its descendant nodes through a unique path. Finally, there must be a node that is an ancestor to all other nodes, called root (see Fig. 3).
In Python, let's represent a node with the list of its children and a tree with its root node. For instance, [[],[]] represents a tree with three nodes, one the parent of the other two. Since trees are most useful when holding some data on each node, let's say the first element in the list isn't a child but such data, for instance [1,[2],[3]], where the node holding 1 is the root, which is parent to the other two nodes, containing 2 and 3 respectively. With these definitions and representation conventions, we can define a map for trees as follows:
def map_tree(T, f):
    return [f(T[0])] + list(map(partial(map_tree, f=f), T[1:])) if T else []
With two lines25 of code, we have extended the concept of a map function to trees.26 There is no such possibility for list comprehensions. They are a useful pre-defined abstraction but aren't amenable to being extended to other data structures.
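As a quick check (my own example, with the import that the book's snippets normally omit restored), applying map_tree to the three-node tree above transforms the data held at every node:

from functools import partial

def map_tree(T, f):
    return [f(T[0])] + list(map(partial(map_tree, f=f), T[1:])) if T else []

print(map_tree([1, [2], [3]], lambda v: v * 10))  # prints [10, [20], [30]]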
There isn’t anything special about functional techniques other than that they are powerful techniques for abstraction and, therefore, also for conciseness. Moreover, I wouldn’t want to generate the impression that abstraction is the result of applying sophisticated techniques or adopting esoteric languages, whereas, for the rest of us, there is copy-paste-modify. Sharing a complex method between two classes by introducing a common parent class also helps conciseness. As we have seen above, Hoare found procedures already fit for this task.
1.2.5 Rewrites
A rewrite is a controversial and radical step in software development: take what you have built, throw it away and start from scratch. Rewrites come in different sizes, from those targeting a complete application to those affecting just a few functions. They have been described as company-destroying mistakes [50] or a normal phase of development [5]. Opinion dominates, and I don't have much to add to this debate but to consider what it means for conciseness. Rewriting some portion of a program isn't guaranteed to result in less code. Famously, Netscape discarded its Navigator code base for various reasons, one being its size [51]. A prototype released to showcase the new rendering engine Gecko was about a tenth of the size of contemporary browsers and could fit on a single floppy disk — 1.44 MB. It had swollen to 17 MB by the time the 1.0 version was released, negating any size advantages.
However, it doesn’t have to end that way. Instead, we will see how switching to a higher level language or one better matched with the application can result in radical code savings, for instance, in a 60-KLOC rewrite performed at LinkedIn (see Sec. 1.2.9.4). In that case, the rewrite was management initiated. In other ones, rewrites start from the grassroots. For example, Apple developer Bill Atkinson used to report a negative number of lines contributed in a weekly form that he had to submit, including a “-2000” in the week in which he had rewritten some routines in QuickDraw to be six times faster. Eventually, he wasn’t asked to fill the form anymore [52].
Another example involves Bret Taylor, co-inventor of Google Maps [53]:
[Taylor] was frustrated that the JavaScript was really slow. It would take forever to load. So, one weekend he said, “I’m going to rewrite it.” And he rewrote the entire thing, making it 10 times as fast and a third of the size. He completely threw away the code that a team of engineers had been working on for months.
Unfortunately, we don’t know if the same result could have been achieved incrementally or by giving the right incentives to the original team.
An additional example is a cryptocurrency named XLM, initially based on a fork of a pre-existing one, XRP. Its developers were having doubts about the security of XRP and decided there were better options, eventually opting for the Stellar Consensus Protocol. However, they did not merge it into the fork: instead, they "created an entirely new codebase designed with safety, minimalism, scalability, interoperability, and multiple implementations in mind," elevating minimalism to a top-level concern [54].
In some instances, the impetus for a rewrite comes from a shift in hardware technology that makes an existing implementation obsolete. For example, the trend towards using GPUs27 for numerical computation prompted a complete rewrite for an application named Weather Research and Forecasting (WRF; [55]):
WRF is well over half a million lines of code, mostly rooted in Fortran. This new model is based on the same ideas but with less code. Still, the effort to get this up and running on Power9/Volta GPU systems took two and a half years.
We will see more of this as heterogeneous computing28 spreads to all types of devices, from supercomputers to smartphones, prompting partial or complete rewrites [56].
dogpile.cache is a generic API29 that can use different caching30 backends. It is a rewrite of a similar product, Beaker, by the developer of both, Mike Bayer, who explains [57]:
All the ideas of Beaker which “work” are re-implemented in dogpile.cache in a more efficient and succinct manner, and all the cruft […] relegated to the trash heap.
Here we have an example of a rewrite that doesn’t involve management taking a new direction (see the LinkedIn JavaScript rewrite in Sec. 1.2.9.4) or a programming superstar showing his mettle (Taylor, see above). Instead, we witness a developer realizing that an incremental path from his existing code to where he wanted to be was less practicable than a fresh start. It is also an example where succinctness rises to top-level concern instead of being a happy accident.
1.2.6 Algorithms
Algorithms are generally small enough that their code length isn’t an issue, instead relevant for libraries, applications, and systems. With algorithms, we are more interested in their correctness and efficiency. However, it doesn’t mean keeping the line count low isn’t a nice extra. For example, Daniel Lemire, inventor of the xor filter, a replacement for bloom filters31 and other related algorithms, asks [58]:
Could we do even better [on speed and memory usage] while limiting the code to something you can hold in your head?
His xor filter reportedly marks an improvement on traditional measures of algorithm efficiency (speed and memory usage) while also keeping it relatively short:
The complete implementation in Go fits in less than 300 lines and I did not even try to be short. Indeed, any semi-competent Go coder can make it fit within 200 lines.
While I doubt I can memorize 300 LOCs, it is 50 lines shorter [59] than a bloom filter implementation by the same group [60].
Moving onto a more run-of-the-mill Quicksort, a mainstream sorting algorithm, there are different ways to implement it. Angel Sola Orbaiceta, a software engineer and book author, experiments with object-oriented and functional paradigms, using Java and Clojure [61]. He finds out that “functional code is much more readable and compact” for the same task and believes “the more code there is, more time and effort has to be put in maintaining the codebase and more bugs tend to appear.” Specifically, skipping empty lines or those with just a bracket or parenthesis, he ends up with 27 lines for Java and 9 for Clojure.
I tried to replicate his experiment in Python. Quicksort is an application of the divide et impera approach:
- pick some threshold or pivot;
- partition the values to sort based on that threshold;
- sort the two splits separately;
- concatenate them.
When a split is empty or only has one element, return it as is. Some subtleties in how the partition is performed create the opportunity for some variants. Namely, the Quicksort function (abbreviated to qs in the following code blocks) should never invoke itself on its own input, or the algorithm won't terminate (and run out of memory in practice). We can achieve this by excluding the pivot or all elements equal to the pivot from further sorting. Another choice is either to sort in place — moving entries around in a single array — or to create new arrays as needed: the former is more memory-efficient, the latter quite a bit simpler.
The following version performs the sorting in place:
def qs(xx, lo = 0, hi = None):
    hi = len(xx) - 1 if hi is None else hi
    if (hi <= lo) or lo < 0:
        return
    p = partition(xx, lo, hi)
    qs(xx, lo, p - 1)
    qs(xx, p + 1, hi)

def partition(xx, lo, hi):
    pivot = xx[hi]
    i = lo - 1
    for j in range(lo, hi):
        if xx[j] <= pivot:
            i = i + 1
            xx[i], xx[j] = xx[j], xx[i]
    i = i + 1
    xx[i], xx[hi] = xx[hi], xx[i]
    return i
It takes 17 lines, which compares favorably with the 27 lines of Java in Orbaiceta’s blog post. However, on closer inspection, his version isn’t in-place. Removing that constraint, we can produce this version:
def qs(xx):
    if len(xx) <= 1:
        return xx
    return (
        qs([x for x in xx if x < (pivot := xx[0])])
        + [x for x in xx if x == pivot]
        + qs([x for x in xx if x > pivot])
    )
This code is just seven lines, ignoring the line with just a parenthesis, and it is also straightforward. There is no index accounting to master, no swapping of elements, just plain and simple divide et impera. However, the ascetic programmer can’t help but notice a certain repetitiveness in the three list comprehensions that perform the partition. The only element that changes is the comparison operator used. I looked unsuccessfully for a library function combining these steps; defining our own would take too many lines. However, we still can factor out the filtering operation:
fo = lambda op, xx: [x for x in xx if op(x, xx[0])]
where op is one of <, >, ==. We can’t pass them around as parameters in their most familiar form, but there are equivalent functions in the standard package Operator: lt, gt, and eq, respectively. With this parametrization, we can write qs as follows:
qs = lambda xx: xx if len(xx) <= 1 else qs(fo(lt, xx)) + fo(eq, xx) + qs(fo(gt, xx))
I used a short name for fo to fit qs onto one line (formatted with a popular tool, Black, which uses an 88-character bound). For the same reason, I used lambda definitions instead of def. Those tricks are possible because line count is a blunt instrument. If we used token count, a measure of program complexity insensitive to the vagaries of formatting or naming (see the next section), those tricks would not pay off.
Indeed, it is easy to perform a token count with the help of package Ast
(for abstract syntax tree, a data structure used to compile
programs):
import ast

def token_count(source):
    return len(list(ast.walk(ast.parse(source, "source", "exec"))))
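For instance, we can apply it to the source of a version held in a string (a usage sketch; the string below is just the two-line version, with line breaks of my choosing, and I make no claim about the exact number it prints):

two_line_version = (
    "fo = lambda op, xx: [x for x in xx if op(x, xx[0])]\n"
    "qs = lambda xx: xx if len(xx) <= 1 else qs(fo(lt, xx)) + fo(eq, xx) + qs(fo(gt, xx))\n"
)
print(token_count(two_line_version))  # counts the AST nodes of both lines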
According to this function, the in-place Quicksort contains 200
tokens, whereas the shorter versions are 75 and 74 tokens long,
respectively. Remarkably, the two-line version wins by just one token,
owing to its “density.” It is natural to ask if we can go any lower than
this. It is possible, indeed, by eight tokens. To achieve this feat, we
need to define the partitioning filter as a function of one instead of
two arguments. This change requires its definition to be inside the scope of xx:
def qs(xx):
    if len(xx) <= 1:
        return xx
    filter_xx = lambda op: [x for x in xx if op(x, xx[0])]
    return qs(filter_xx(lt)) + filter_xx(eq) + qs(filter_xx(gt))
It makes sense to limit the arguments of a function to those that change within a scope of interest: what else could we be filtering other than xx? What may look like a silly search to shave off a
few additional tokens — a pursuit that Dijkstra thought to lack “any
conceptual relevance” (see Sec. 1.5) — spurred me to tighten or
specialize the filtering abstraction to the needed level of generality,
and nothing more than that. There may be a performance penalty for
defining a function inside a recursive call, certainly in Python, but we
are not trying to optimize performance in this book, and occasionally we
aim for the opposite (see the Busy Beaver numbers in Sec. 1.2.11.1).
Code length is also used to evaluate the effort needed to prove the correctness of algorithms (see Sec. 2.2). For instance, Martin Kleppmann, author of Designing Data-Intensive Applications, implemented, in 60 LOCs, a distributed data structure to support group editing, but the proof of its correctness took 800 lines of math for a proof-to-code ratio over 13:1 [62]. Kleppmann maintains it was worthwhile not only to raise confidence in the correctness of a less-than-intuitive algorithm but also to increase its understanding. Nonetheless, the large size of the proof highlights an obstacle to applying formal methods in software engineering. Another barrier is that proofs can only show that an algorithm satisfies its requirements, but those requirements must reflect the real world. Nonetheless, formal methods are valuable in high-stakes applications such as the control software for the French railway network [12] or the S3 storage system [63].
1.2.7 Notation and Formatting
If recognizing and implementing abstractions is considered good programming, abbreviating identifiers or skipping white space is generally frowned upon. Identifiers, as we were taught in school, should be self-explanatory; white space should be used liberally to impart visual structure to the source code.
An exchange from a Hacker News forum shows that at least some developers, including database architect Andres Freund, challenge the orthodoxy [64]:
Usage of acronyms is one of the worst offenders in bad code.
Not really on board with that … It’s a balance. Brevity does have its value too.
The trade-off is memorizing the abbreviations in exchange for dealing with more compact and readable expressions. If a long variable name works as documentation, is having to read it every time you run into a variable a good idea? Whether or not a program uses acronyms, identifiers’ definitions and related documentation must be easily accessible, usually through features of an IDE.32 When a program deals with complex concepts, the idea that even long identifiers can be self-explanatory is unrealistic.
We should look at mathematics as a model. Variables are generally single letters. Mathematicians and physicists develop specialized notations to make expressions more compact. About Einstein’s general relativity, the biographer Walter Isaacson writes [65]:
Einstein’s final equations used the condensed notations of tensors to compress sprawling complexities into squiggly symbols and subscripts.
Clearly, “sprawling complexities” aren’t conducive to readability!
Charles Babbage, inventor of the first mechanical calculator, seemed to agree (as quoted in Kenneth Iverson’s Turing Award lecture [66]):
The quantity of meaning compressed into a small space by algebraic signs is another circumstance that facilitates the reasonings we are accustomed to carry on by their aid.
Even newlines should not get a free pass. In a thread on GitHub, a discussion developed between supporters of a format with shorter lines that was allegedly more friendly towards text-oriented diff tools33 and a more compact one. User tweakimp writes [67]:
I think you can call it [the choice of format] more than preference because if there is no other difference, less [sic] lines is [sic] better.
In hugovks [sic] example the second version is shorter, easier to read and should be preferred in my opinion.
Optimizing code to be read by primitive comparison tools sounds like a bad idea, independent of conciseness. In the seminal book Structure and Interpretation of Computer Programs (SICP [68]), we read that “programs must be written for people to read, and only incidentally for machines to execute.” Even more so, we shouldn’t be writing them to ease the work of primitive diff tools but switch to better ones, like those that analyze the structure of programs.
1.2.8 Documentation and Comments
When it comes to documentation, the more documentation, the better, correct? Actually, who enjoys reading it? Documentation is a cost like code is. It needs to be written and maintained, developers must read it, and it silently becomes outdated.
Michele Simionato, author of several Python packages, reported this about halving the size of the documentation for one of them [69]:
[dropping Python 2] is a huge bonus, since I could remove over 2,000 lines of duplicated documentation/doctests. Having to maintain separate docs for Python 2 and Python 3 effectively stopped any development on the module for several years.
What may be hard to understand with large teams becomes evident for a one-person volunteer project.
Can we eliminate documentation? Joshua Bloch, known for his work on Java, thinks so. About API design, he writes [70]:
APIs should be self-documenting: It should rarely require documentation to read code written to a good API. Indeed, it should rarely require documentation to write it.
Early drafts of APIs should be short, typically one page with class and method signatures and one-line descriptions. This makes it easy to restructure the API when you don’t get it right the first time.
It may look like a lofty goal, but Bloch was likely thinking about
some of his work on the Java standard library. Building on fundamental
concepts in software engineering, which all developers should be
familiar with, alleviates the need for documentation, and simple and
consistent API design should complete the task of minimizing the need
for documentation. Does the Java standard library meet these lofty
goals? For instance, in module java.util.Arrays
we find
entries such as [71]:
static int binarySearch(byte[] a, byte key)
static byte[] copyOf(byte[] original, int newLength)
static byte[] copyOfRange(byte[] original, int from, int to)
Their behavior should be intuitive. However, why does copyOf need a newLength argument? Shouldn’t a copy be length-preserving? And what happens if newLength is larger or smaller than the original? Luckily, the API reference provides a simple one-line explanation that copyOf is more than a copy: it will happily stop after copying only newLength elements, or continue, padding the new array with 0s until newLength has been reached. The padding value isn’t configurable and depends on the type of the original array — copyOf is overloaded34 for different types. Scanning the different forms of copyOf I found static <T> T[] copyOf(T[] original, int newLength), an entry which uses Java generics to parametrize its parameter and return types, for which the padding value is null. So what if T equals byte? Will copyOf pad its result with zeros or ‘nulls’? The answer is that Java generics cannot be instantiated with primitive types, and therefore T can never be instantiated with byte [72]. Thus the apparent ambiguity is resolved. It gets more confusing if we consider this additional form of copyOf:
static <T,U> T[] copyOf(U[] original, int newLength,
Class<? extends T[]> newType)
which represents a copy with conversion from type U to type T. What happens if you need to perform the same operation from or to primitive types? I would check with a Java expert before giving up, but I am afraid you will have to write 49 more functions, one for each pair of primitive types.35 This limitation stems from the Java generics design, like the quantity of almost identical copyOf entries for each primitive type, and the API developer must adapt (see java.util.Arrays in Sec. 1.5). However, the documentation leaves the user helpless here, suggesting a consistency in the library that is only aspirational.
In a thread about the “best code bases,” Hacker News user taspeotis has this to say about a code base touted for its line-by-line comments:36 “Every line? Oh dear” [73]. He and other users follow up by sampling the uninformative comments, which add nothing to what is already obvious from the code:
/* Set verbose flag */
verbose++;
...
/* Set advanced flag */
advanced = 1;
User teknopaul comments:
IMHO Coding guides should ban “SBO comments” Statin [sic] the Bleedin [sic] Obvious. Let devs [sic] decide what SBO is, usually its [sic] bleedin [sic] obvious.
Which sounds agreeable, but for the circular definition of “bleedin [sic] obvious.”
Martin Fowler, author of popular software engineering books, doesn’t like comments [74]:
Comments are not bad — but there are often better options. I always try to write code that doesn’t need comments …
One such option is assert
statements, which specify
properties to be satisfied by a program when executing at a certain
position in the code. While they are yet more code to write, understand
and maintain, they can express simple properties that aren’t obvious
from reading the rest of the code. These properties are checked at
run-time, helping in the early detection of bugs, near their origin,
rather than later as consequences snowball. However, even without
running any program, they work “as documentation for other developers
reading the code, who see the assert and can confidently say that its
condition holds” [75]. For
instance, we can express the property that a list is sorted with:
is_sorted = lambda xx: all([x <= y for (x,y) in zip(xx[:-1], xx[1:])])
We can use this in the implementation of the algorithm Quicksort (see Sec. 1.2.6) to check and document that each call has to return a sorted list:
def qs(xx):
    if len(xx) <= 1:
        return xx
    filter_op = lambda op: list(filter(lambda x: op(x, xx[0]), xx))
    yy = qs(filter_op(lt)) + filter_op(eq) + qs(filter_op(gt))
    assert is_sorted(yy)
    return yy
Another option is randomized or property-based tests, which express, in executable form, required properties of inputs and outputs of a function. The idea is that, instead of specifying a set of inputs with corresponding desired outputs for a function, we can define a process to generate inputs and properties that the input-output pair must verify. For instance, when testing a matrix multiplication function, we would pick a matrix generator to provide inputs. Tests could verify properties such as associativity, distributivity, and having the correct dimensions. Additionally, they could check the results of operations involving identity and null matrices. This approach originated in the Haskell language community and was implemented in the package QuickCheck, where it relies on the type system to provide useful default generators even for user-defined types [76]. It has been imitated in many languages to different degrees of fidelity. In Python, the package Hypothesis supports property-based testing [77]. To continue with our Quicksort example, a test that the output is sorted would look as follows:
@given(xx=lists(integers()))
def test_is_sorted(xx):
assert is_sorted(qs(xx))
In this test, the first line is a decorator that specifies what
generator to use for the argument to the test function, one that
produces lists of integers. Writing a generator is also a form of
documentation because it tells us that qs
accepts lists of
integers. qs
is more general than that, which should prompt
us to write generators like
lists(one_of(integers(), floats(), text()))
which will
produce lists of integers, floats, or strings (not mixed in the same
list). The second line contains the name and signature of the test
function, following the convention used by the package Pytest. The body
of the test asserts that the result of applying qs
to its
argument is sorted. This condition is necessary but not sufficient for
qs
to be correct, as it could always return the same sorted
list. We also need to ensure its output is a permutation of its input,
that is, that it contains the same number of occurrences of each
element, a task suitable for the data structure Counter
from the library Collections:
@given(xx=lists(integers()))
def test_is_permutation(xx):
assert Counter(qs(xx)) == Counter(xx)
These two property tests are equivalent to a user-selectable number
of conventional ones because a property test can repeatedly run on newly
generated inputs. They can replace hundreds or thousands of lines of
tests written one by one, provide a high level of confidence in the
implementation, and a readable specification for qs
(for
theoretical work showing how much testing is sufficient, see Sec. 1.2.11.1). This
flexibility also mitigates the issue of slow-running tests because, by
changing a configuration parameter, we can run a more superficial but
quick battery of tests or a much slower but more thorough one. Indeed I
applied this two-speed strategy in developing the Rmr2 package [78], using Quickcheck for R [79].
While adding new features, I ran a quick set of tests until most bugs
were fixed. Then, as we approached a release, I increased the number and
size of tests to build more confidence in the correctness of the
package. You may be more of a believer in quality over quantity in the
writing of tests. In my experience, handwritten tests have several
limitations. First, they are often trivial because the correct output
has to be calculated by hand. When written in response to bug reports,
they catch problems that have already been discovered. Finally, they
sometimes share the same faulty assumptions as the code to be tested
and, therefore, won’t probe it outside of those.
All these elements make property-based testing valuable, but there are a couple of catches. First, a developer must state properties that a function’s inputs and outputs must satisfy, and implement them. This task is easy for mathematical objects like matrices or well-defined algorithms like those for sorting and searching but less so in other domains. For example, how to approach testing that a graphical routine draws a circle? Another problem is that this approach relies on referential transparency, a complicated way to say that computer functions behave like mathematical functions, that is, their outputs depend only on their inputs. For example, if a function has access to a database, its return value may depend on its contents. In this case, there are workarounds, like starting with an empty test database and testing read and write functions in combination, but the code to be tested has to be amenable to this technique. Because of these limitations, we should be wary of emulating Dijkstra’s maximalism — the view that all software should be proven correct like a theorem (see testing in Sec. 1.2.11.1). Like formal methods, property testing has a domain of applicability, and within it, it is a great asset.
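As a sketch of the empty-test-database workaround mentioned above (the store, the test, and all the names are mine, with a plain dictionary standing in for a real database):

from hypothesis import given, strategies as st

class FakeDB:
    # An in-memory stand-in for a database, reset for every generated input.
    def __init__(self):
        self.rows = {}
    def write(self, key, value):
        self.rows[key] = value
    def read(self, key):
        return self.rows.get(key)

@given(st.dictionaries(st.text(), st.integers()))
def test_read_after_write(records):
    db = FakeDB()  # start from an empty database
    for key, value in records.items():
        db.write(key, value)
    # Reads and writes are only testable in combination: whatever was
    # written must be read back unchanged.
    assert all(db.read(key) == value for key, value in records.items())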
Another alternative to traditional documentation is doctests, generally written using literate programming — a technique to interleave text and code so that the latter can be executed [80]. If a chunk of code fails, the documentation related to it may need updating. One instance of this is the “Examples” section from the documentation of package Altair_recipes, which includes both the test suite and the output for each test, consisting of statistical graphics [81]. Maintaining such an extensive set of examples covering the whole library and a comprehensive test suite would have been daunting for a one-person, spare-time project. However, since they are one and the same, the cost is cut in half.
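To give a flavor of the mechanism on a small scale (my sketch, unrelated to Altair_recipes), a doctest is just a usage example in a docstring that a test runner executes and compares against the recorded output; if the code drifts, the example, and hence the documentation, fails loudly:

def qs(xx):
    """Sort a list with Quicksort.

    >>> qs([3, 1, 2])
    [1, 2, 3]
    """
    if len(xx) <= 1:
        return xx
    pivot = xx[0]
    return (qs([x for x in xx if x < pivot])
            + [x for x in xx if x == pivot]
            + qs([x for x in xx if x > pivot]))

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # runs the example in the docstring as a test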
1.2.9 Languages
Paul Graham, Lisp hacker turned venture capitalist, wrote an essay devoted to conciseness, “Succinctness is power.” He sees succinctness as a productivity multiplier and believes it to be related to other positive qualities of code, such as readability. He focuses specifically on programming languages. He makes a characteristically articulate case for it [82]:
[…] clearly succinctness is a large part of what higher-level languages are for. If it is not all they’re for, then what else are they for […]
However, his point of view, while related, is different from the one espoused in this book, as he sees conciseness as a desirable feature in languages but not necessarily in programs:
We should be clear that we are talking about the succinctness of languages, not of individual programs. It is certainly possible for individual programs to be too densely written.
While it is possible to run into cryptic code (for instance, see Sec. 1.2.11.2 and the front cover), we shouldn’t regard general programming as a substantially different activity from designing languages. The boundary between core language and standard library is often blurred regardless, more a matter of packaging and versioning than anything else. In SICP, the authors write [68]:
To appreciate this point [that language interpreters are just programs] is to change our images of ourselves as programmers. We come to see ourselves as designers of languages, rather than only users of languages designed by others.
Therefore it is hard to agree with the point made by Graham that, while higher-level languages exist primarily for the sake of program conciseness, this should not be pursued when using these languages. He considers moderate succinctness in programs acceptable but not a goal.
Any non-trivial program should define the concepts relevant to the task at hand and implement data structures and operations to represent and manipulate these concepts, which in effect extends the language. However, in many languages, some extensions aren’t possible. To pick an example in Python, let’s assume we need to increment a group of variables by one, as in the following example:
a += 1
b += 1
c += 1
It is impossible to encapsulate this snippet in a function such as
inc(a, b, c, by = 1)
to better express intent and promote
reuse. The reason is that Python passes arguments by assignment (call by object reference): rebinding a parameter inside a function, as += does for an integer, changes only the local name, not the caller’s variable. Despite this, there actually is a way to
implement this function, as shown by Stack Overflow user zwer [83]. Zwer uses a built-in mechanism to
access the call stack, a data structure used by the language run-time to
implement function calls. By doing that, we can access the list of
variables in the calling environment and modify them at will. To my
knowledge, debugging is the only recommended use for these features.
Indeed, zwer’s solution comes with so many caveats and warnings that the
author himself can’t recommend it as a viable one and named it
here_be_dragons
.
I didn’t select this example to pick on Python: languages are defined by what they allow and what they prevent. Some features are considered “dangerous” because they are believed to lead to the creation of bugs. Functional programming encourages limiting side effects, when it allows them at all. In C++, primarily an object-oriented language, parameters that allow side effects must be of reference type. Whether or not we agree with the specific choices made in a language, we need to operate within those limitations. Hence the lofty vision promoted in SICP of programmers as language designers must be tempered with the reality that it has to be achieved within the confines of the language one is using.
Possible ways to push beyond language limits are macros and
reflection. The former enables defining abstractions that generate or
modify code before execution; the latter gives the ability to operate
on the structures of the language run-time from the program being run,
which is used in here_be_dragons
. Lisp macros are notable
for being expressed in Lisp itself, unlike, for instance, C macros,
which use a dedicated macro language. This is natural in Lisp, since, in
that language, expressions are also data structures. Every interpreter
or compiler for any language needs to build a representation of the
program which can only be a data structure, but in Lisp the relationship
between code and representation is particularly simple, as programs are
represented by lists [84]. In an example, developer and
author Adam Tornhill shows how to eliminate code duplication for a
specific web application using a macro: “Now […] we have a concise way
to express web-pages with a uniform look […]” [85]. Not only is this solution concise,
but also conciseness begets consistency because the same code generates
all pages. Unfortunately, he doesn’t explain why macros are necessary in
this case. Graham devotes a chapter in his “On Lisp” to explaining their
domain of application [86]:
Macros can do two things that functions can’t: they can control (or prevent) the evaluation of their arguments, and they are expanded right into the calling context. Any application which requires macros requires, in the end, one or both of these properties.
For instance, being expanded in its calling context, a macro would
allow implementing the inc
function described above.
However, this possibility isn’t unique to macros: pointer and reference
parameters make it possible to modify the calling context in C++, as do
var
arguments in Pascal, but in a much more controlled
fashion. As to the ability to control argument evaluation, for an
alternative mechanism, see Sec. 1.2.9.3.
Besides these specific applications, Graham recommends using macros only “when nothing else will do” [86] because both macros and reflection can make programs harder to understand and hide subtle bugs [87]. Nonetheless, macros have an important role in Lisp. For example, the dominant object-oriented extension to Lisp, the Common Lisp Object System, is implemented with macros, as are several iteration constructs. Graham even shows how to implement a completely different language, Prolog (see Sec. 1.2.9.6), with just macros [86].
Steele thinks that the quest for the ideal, all-powerful language is bound to be frustrated. He says [14]:
[…] it’s really hard to make a language that’s great at everything, in part just because there are only so many concise notations to go around. There’s this Huffman encoding37 problem.
If this observation is correct, it suggests three ideas. Please note
that these are based on an abstract, mathematical view of programs
rather than a practical one. The extent to which they apply in practice
is debatable. First, languages that forbid many constructs preclude them
from expressing valid computations. Consequently, developers will have
to use longer programs to define equivalent computations. An example is
variable declarations and explicit initializations, which many advocate
on a safety basis (see the Mariner I disaster, Sec. 1.5). Second, if two programs
define the same computation, one of the two could have defined a
different one. That computation will have to use a longer program. It is
reasonable to conclude that a language with many ways to express the
same computation will lead to more verbose programs. For instance, it is
debatable whether a language needs both a map
function,
which applies a function to each element in a collection and returns a
collection of its return values, and a list comprehension:
map(function, a_list)
[function(x) for x in a_list]
Finally, defining more common computations with shorter programs and
accepting some verbosity for the less common ones could be a good
compromise but implies different languages tailored for different
application domains. For instance, a sum of two vectors in R is an
element-by-element sum, even if the two vectors have different lengths,
a problem solved with a technique known as recycling. In
Python, the sum of two lists is their concatenation. In R, the
concatenation function is concisely named c()
, probably
owing to its widespread use. In Python, there is no element-wise sum,
but we can perform it with a relatively complex expression such as list(map(lambda x, y: x + y, x_list, y_list)). There are
vector libraries for Python, such as Numpy, whereby the sum of two
vectors is defined as in R, save for the recycling rules. While we can
accomplish the same results in both, the choices of each language and
library reflect the decisions of their respective developers, which
likely took into account what the most common use cases were expected to
be and reserved the most concise forms for them.
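A small sketch of the contrast (the variable names are mine, and I assume Numpy is installed):

import numpy as np

x_list = [1, 2, 3]
y_list = [10, 20, 30]

print(x_list + y_list)                                # lists: + concatenates, [1, 2, 3, 10, 20, 30]
print(list(map(lambda x, y: x + y, x_list, y_list)))  # element-wise sum the long way, [11, 22, 33]
print(np.array(x_list) + np.array(y_list))            # element-wise, as in R minus recycling, [11 22 33]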
1.2.9.1 Lisp
We started talking about Lisp in the previous section because of its macros, but it is notable in other ways. Its inventor, John McCarthy, designed it as a concise alternative to the Turing machine38 [88]:
Another way to show that Lisp was neater than Turing machines was to write a universal Lisp function and show that it is briefer and more comprehensible than the description of a universal Turing machine.39
The idea to turn it into an actual programming language came to a student of McCarthy’s, Steve Russell, who wrote its first interpreter in assembly language.40 It wasn’t a large, challenging project, unlike other languages (both Ada and Algol68 raised concerns about their implementability; see Sec. 1.5). An interpreter for the Lisp in the original paper takes only 53 lines of Common Lisp — a more complex but also practical dialect thereof [89]. This is a remarkable combination: a simpler language should also be less powerful and therefore require more code for the same tasks, including the implementation of its own compiler. However, Lisp sidesteps this trade-off to some degree. The so-called “Computer Language Shootout,” a systematic comparison of many languages focused on small, well-defined tasks, corroborates its conciseness [90]. It ranks Common Lisp first among over 50 languages (including multiple implementations of the same language).
Clojure is a modernized cousin of Lisp, particularly focused on supporting concurrency. It hasn’t lost Lisp’s expressivity while getting none of the verbosity of the Java language, with which it shares the target virtual machine.41 Daniel Lebrero compares two programs for the same task, one in each language, and asks [91]:
So given that the inherent complexity of the problem is the same, and that the Clojure version is able to express the solution in 58 lines of code while the Java version require 441 lines of code, and that the Clojure version is easier to understand, what are those extra 383 (87% of the codebase) lines of perfectly idiomatic Java code about?
Lebrero also notices how much easier to review the Clojure version is, and that size is one or maybe the determining factor. Sure, concise code could be cryptic, and verbose code trivial, but this isn’t the case here:
[…] the Clojure version was way simpler to understand, and […] the fact of having a single file with 58 lines of code was a very important reason for it.
He ascribes the additional LOCs to “incidental complexity”, which he thinks is caused by the use of inappropriate tools. He concludes that LOCs aren’t helpful as a measure of productivity but as one of complexity (see Sec. 1.1).
1.2.9.2 Python
Paul Prescod, Python developer and book author, is quoted by Graham as stating that “Python’s goal is regularity and readability, not succinctness” [82]. Graham considers it a “damning thing to claim about a programming language.” However, many others think Python is concise, in practice, if not by design. For example, Nicholas Hahn, a software engineer, compared Python, Go, and Rust. He finds that “Python has the least code by far” but also that “it’s leaning heavily on features of the image library it’s using, but this is indicative of the general experience of using Python” [92]. For instance, in the Python version, the difference between two images requires a single method call, whereas it is done pixel by pixel in the other two languages. This observation doesn’t tell us much about the intrinsic qualities of each language while still being important in practice. Availability of libraries could depend on popularity, age — older languages accumulate more — and ease of development, which in turn depends on tools and resources available, as well as those intrinsic qualities. The “Computer Language Shootout” mentioned in the previous section ranks Python 19th out of 51 languages for conciseness [90].
It is less verbose than C in many cases. For example, the README file for a project by the CMU Cylab contains equivalent examples in several languages, including Python. However, it omits the one in C because “the same program in C will take too much space and will not fit into the README format” [93]. One of the advantages of concise code is that it is easier to discuss and present. It will fit into a few slides or, in this case, a README file. Verbose code is something to skim in long, painful, lonely sessions and to hide.
Python practitioners aren’t short on enthusiasm for their language of
choice and may occasionally overstate the power of certain features. For
example, Mahmoud Hashemi, author of package Boltons, would have us
believe that “decorators are among Python’s most elegant and succinct
language features.” However, all they replace is one assignment:
@decorator
instead of f = decorator(f)
[94].
Hashemi may be referring to the use of higher-order functions,
popularized by decorators, rather than their syntax (see higher order
functions in Sec. 1.2.4).
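To make the equivalence concrete, here is a toy example (the decorator and the decorated function are made up):

def trace(f):
    def wrapped(*args):
        print(f.__name__, args)  # log the call before delegating
        return f(*args)
    return wrapped

@trace
def add(x, y):
    return x + y

# The @trace line is shorthand for the explicit assignment:
# add = trace(add)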
1.2.9.3 R
RStudio co-founder Joe Cheng, also known for the R package Shiny, thinks R is more expressive than Python and other languages. Since he is the CTO of a company devoted to supporting this language, we should be aware of his biases. However, he is referring to specific features that are unique to R, among popular languages. In an interview, he said (best effort transcription from [95]):
[…] each one of these libraries has a bunch of features and a bunch of syntax that would not be possible certainly with Python, certainly with Java, certainly with C++ […] when you look at the interfaces of these packages users don’t necessarily know what magic we had to do to make this happen, they just know oh it’s really easy it’s really concise when I want to express my plots this way when I express my data manipulations this way […]
The “magic” he refers to includes a macro-like facility that
optionally suspends the regular evaluation of expressions and replaces
it with specialized variants. For instance, we can refer to columns in a
tabular data structure, the data frame, in certain expressions without
constantly repeating the name of the table, as in mutate(table, a = b + c), as opposed to the repetitive table$a = table$b + table$c, where a, b and c are names of columns in table. Users love these more concise forms, but having
multiple evaluation rules makes programs harder to understand and
hinders their parametrization and generalization. For instance, if we
wanted to make the above expression general with respect to the names of
the columns involved, we would need to use a different syntax (see
package Dplyr in Sec. 1.2.10.1).
One task that R performs well is operating on vectors. It had better
do, as it doesn’t even provide scalar types, replaced by vectors of
length one. While most languages offer extensions for vectors or other
types of collections, in R, by design, all library functions need to
accept vectors. R developers think in vectors, and the comparatively
modest speed of iterative constructs in R is such that they have to use
vector functions in day-to-day work. That often helps with conciseness.
For instance, Rasmus Bååth takes a simple spelling corrector in 21 lines
of Python and translates it to two lines of R without using any
extensions [96]. It is
two oversized lines, though, and, once automatically reformatted in the
R IDE RStudio, his code swells to 12 lines: more than 2, but still a
great deal shorter than the original. I took the liberty to modify this
code to eliminate one repetition and to use the pipe operator
(%>%
) to reorder the operations. The “classic” style of
function composition in R is f(g(h(a1, a2), a3), a4)
which
many find counter-intuitive because functions execute in a temporal
order opposite to that in which they appear, but also because arguments
to the same function can end up far apart. An alternative is using a
temporary variable:
x = h(a1, a2)
x = g(x, a3)
f(x, a4)
This approach adds some unnecessary steps and doesn’t convey the
rigid organization of a pipeline. Using R’s macro-like facility, we can
define a pipe operator similar to the ones in F# and the Unix shell and
available in package Magrittr, such that the above expression is
equivalent to h(a1, a2) %>% g(a3) %>% f(a4)
[96]. In
other words, the pipe operator turns the expression on its left side
into the first argument to the function call on its right side, pushing
the other arguments to the right. With these modifications, here is the
concise spell corrector in R:42
readLines("http://www.norvig.com/big.txt") %>%
  paste(collapse = " ") %>%
  tolower %>%
  strsplit("[^a-z]+") %>%
  table %>%
  sort(decreasing = TRUE) %>%
  names -> sorted_words
In this step, we build a vector of words by:
- loading a large, presumably spell-checked text available online;
- concatenating all lines;
- turning all characters to lowercase;
- splitting the result on non-alphabetic characters to generate words;
- collecting word counts;
- sorting these words from most to least common;
- and, finally, dropping the counts.
This sorting order will help return the most common among equally close matches for a word. Now onto the actual spell-correcting function:
correct <- function(word) {
  distances = adist(word, sorted_words)
  c(sorted_words[distances <= min(distances, 2)],
    word)[1]
}
It works by computing the edit distance43
between a given word and our list of correct, frequency-sorted words.
This code’s conciseness stems from using vector functions instead of
loops and the vast number of functions considered “core” to R (about
700). Bååth observes that, in particular, the function
adist
, calculating the edit distance between words,
performs much of the difficult work. He adds that R also provides a
function aspell
in the standard library, which would reduce
his program to a single line. Unlike the features referred to by Chang
earlier as “magic,” I don’t think this example highlights any
conciseness advantage intrinsic to the language, save for the use of the
pipe operator. Instead, it shows the benefits of a vectorized
programming style, whereby we can process entire vectors in one step,
and the importance of being familiar with available libraries. We could
likely achieve a similar result in Python, but this programming style
may be less natural to practitioners of that language. It is a fine line
between what a language allows and what it encourages or enforces.
1.2.9.4 JavaScript
While Python is the target of unwavering praise and R of a mix of praise and blame [97], JavaScript is subject to relentless condemnation (see Sec. 1.3.2) while, ironically, being arguably the most widely adopted language. It certainly is so when measured by the number of JavaScript programs simultaneously running since it is the programming language used for web development on the browser side. Because of its ubiquitousness and perceived shortcomings, developers have defined new languages that translate into JavaScript, correcting some of its defects but maintaining a considerable degree of compatibility. One of these, CoffeeScript, is touted for its conciseness. For example, software engineer Simon Grondin implemented “a rate limiter in 31 lines of CoffeeScript” and thinks this is “a demonstration of its exceptional expressiveness” [98]. However, when I revisited this implementation after first becoming aware of it [99], I found that it had grown to 8 times its original size [100] and that finding the core rate limiter logic among the cruft had become difficult. You may wonder whether the initial, concise implementation was more the result of a fresh take on the subject rather than the intrinsic qualities of the language. Eventually, this code seems to have succumbed to the drift toward complexity.
Not so the three-line implementation of partial application,44 also in CoffeeScript, by Smooth CoffeeScript’s author E. Hoigaard, which retained its original purity [101]. partial takes a function func of several arguments and returns a function of fewer arguments by providing some of the argument values in advance. In this specific implementation, this is accomplished by using the undefined constant as a marker for an argument that has no pre-specified value. For instance, a function rectangle(x,y,h,w), which draws rectangles centered in (x, y) with height h and width w, can be used to define a function that draws only unit squares as follows:

square = partial (rectangle, undefined, undefined, 1, 1)
Hoigaard starts from someone else’s 15-line implementation (see below), bringing it gradually down to its final svelte size, which I further trimmed down to two lines:45
partial = (func, a...) -> (b...) ->
  func (for arg in a then arg ?= b.shift())...
Even without knowing much about CoffeeScript, I could understand this
implementation of partial
after researching a couple of
constructs. The meaning of this code is that partial
is a
function that accepts another function func
plus a variable
number of arguments in vector a
and returns a function of a
variable number of arguments (vector b
). The body of this
function invokes func
with a list of arguments assembled in
a list comprehension equal to a
element by element when it
is defined, otherwise picking the next value from b
. Not
only is it remarkably concise without being cryptic but also it
implements an important feature that other languages lack and can’t add
with a simple extension. So does the 15-line version — here reduced to
14 — but its meaning must be extracted from a jungle of indexes and
tests:46
partial15lines = () ->
  [func, args...] = arguments
  wrapper = () ->
    i = 0
    j = 0
    res_args = []
    while i < args.length
      if args[i] == undefined
        res_args.push arguments[j]
        j++
      else
        res_args.push args[i]
      i++
    return func.apply null, res_args
Converted to JavaScript, the first step to execute CoffeeScript code, the two-line version expands to 21 LOCs, but an automated conversion may be more verbose than a skilled programmer’s.
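For readers more at home in Python, here is a rough analogue of the same idea (my sketch: None plays the role of undefined as a placeholder; Python’s own functools.partial fixes arguments by position or keyword instead):

def rectangle(x, y, h, w):  # stand-in for a real drawing routine
    return (x, y, h, w)

def partial(func, *fixed):
    def wrapper(*fill):
        fill = list(fill)
        # Replace each None placeholder with the next argument supplied to the wrapper.
        return func(*[arg if arg is not None else fill.pop(0) for arg in fixed])
    return wrapper

square = partial(rectangle, None, None, 1, 1)
print(square(5, 7))  # same as rectangle(5, 7, 1, 1)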
In its latest version, ES6, JavaScript has evolved to look and work
more like CoffeeScript. However, even the unimproved JavaScript, warts
and all, seems to offer the potential for significant code reductions.
Kiran Prasad, a software executive, recalls how LinkedIn switched from
Ruby to server-side JavaScript (node.js
) to meet their
scalability goals for some mobile products [102]. In the process, their code base
shrank from 60 KLOCs to 1-2 KLOCs, a reduction greater than 30-fold that
wasn’t even a goal for this transition. He explains this remarkable
achievement in three parts: the node.js
version is
“framework-free,” therefore free of framework-related cruft (I assume he
means web framework, see Sec. 1.2.10.3); they switched to a more
functional style from an object-oriented one, which had resulted in a
lot of unneeded abstraction layers; and they moved rendering47 client-side48
from server-side, which relocated, rather than eliminated, some of the
code. With so many concurrent changes, it is impossible to tell which
ones were instrumental to the code savings. For me the most important
lesson is that a 30-fold reduction in a codebase, which presumably freed
29 out of 30 engineers from maintenance work, was achieved as a side
effect of the quest for scalability rather than being an explicit goal.
Had LinkedIn selected a different approach, they might still be saddled
with a 60-KLOC code base.
1.2.9.5 Erlang
Maybe not so if they went with Erlang. In a video presentation, its
creator, Joe Armstrong, shows how a couple of pages of C translate to
one line of Erlang (near 1:20 [103]). Unfortunately, this video is too
fuzzy to read the C code. Later, a co-worker cites an internal study at
telecommunication firm Ericsson that estimates the ratio between C and
Erlang code for the same tasks as 7-fold. A factor to consider is that,
just as node.js
(see previous section) supports
event-driven programming, Erlang includes specialized features to build
reliable systems. In particular, it aims to solve the problem of
creating them despite the reality of software errors by supporting a
system’s ability to recover. Erlang achieves this goal by providing the
developer with lightweight processes that don’t share resources and
communicate via asynchronous message passing. They are organized in
hierarchies, whereby those higher up supervise those below. Rather than
trying to code defensively a single process to keep it going at all
costs, developers focus on recovering when something goes wrong. The
goal isn’t eliminating crashes but making them ordinary, recoverable
events. It is easier to understand this concept in the context of the
original target application: software for telephone switches. The goal,
in that case, is to keep a large number of conversations going, a task
that naturally lends itself to concurrency. Other applications written
in Erlang are the messaging library RabbitMQ and the messaging app
WhatsApp. However, its noteworthy features aren’t limited to its
concurrency and fault tolerance models. It is also a functional language
with pattern matching and many other modern features. Armstrong wrote a
simple side-by-side comparison of Java and Erlang programs computing the
area of geometric shapes [104]. This is the Erlang version:
area({rectangle, Width, Ht}) -> Width * Ht;
area({square, X}) -> X * X;
area({circle, R}) -> 3.14159 * R * R.
And this is the Java version:
abstract class Shape {
abstract double area();
}
class Circle extends Shape {
final double radius;
Circle(double radius) { this.radius = radius; }
double area() { return Math.PI * radius*radius; }
}
class Rectangle extends Shape {
final double ht;
final double width;
Rectangle(double width, double height) {
this.ht = height;
this.width = width;
}
double area() { return width * ht; }
}
class Square extends Shape {
final double side;
Square(double side) {
this.side = side;
}
double area() { return side * side; }
}
Skipping empty or single-character lines, we have a 3:17 ratio in favor of the Erlang program. However, despite its conciseness, more than a million lines of Erlang were necessary for its first production deployment, the ATM switch AXD 301 [105].
1.2.9.6 Prolog
Armstrong supported conciseness in contexts other than his programming language. In another interview, this is what he said about Prolog [14]:
[Prolog is] not widely used. And it’s a great shame because programs are incredibly short.
While it is possible to write all types of applications in Prolog (Armstrong used it for the first version of the Erlang interpreter [106]), that is uncommon. Prolog is a logic-based language used primarily in AI, specifically the flavor known as symbolic AI. Programs consist of facts and rules that allow inferring more facts. A classic example is the implementation of a family tree. For instance, let’s consider the ancestry of Charles II of Spain going back to Philip I and Joanna of Castile in Fig. 4.
We can implement this tree and some simple searches in a few lines of Prolog. First, we need to enter what would be considered data in other languages, that is, the parent-offspring relation by listing Prolog facts:
parent("Philip I of Castile", "Charles V Holy Roman Emperor").
parent("Joanna of Castile", "Charles V Holy Roman Emperor").
parent("Philip I of Castile", "Ferdinand I Holy Roman Emperor").
parent("Joanna of Castile", "Ferdinand I Holy Roman Emperor").
/* more parent-child pairs as shown in the graph */
Then we have a set of rules that allow deriving new facts from existing ones:
offspring(X,Y):- parent(Y, X).
ancestor(X, Y):- parent(X, Y).
ancestor(X, Y):- parent(X, Z), ancestor(Z, Y).
descendant(X, Y):- ancestor(Y, X).
These four lines define three relations:
- offspring is the inverse of parent;
- ancestor contains pairs (X, Y) for which either X is a parent to Y or there exists a Z such that X is a parent to Z and Z is an ancestor to Y;
- descendant is the inverse of ancestor.
There is nothing wrong in having two rules for ancestor
as it is a way to express disjunction: either one applied successfully
will be sufficient to establish that a specific pair is an
ancestor-descendant pair. With just a handful of LOCs, we can already
ask questions like “Who were the ancestors of Anna of Austria?”:
?- findall(X, ancestor(X, "Anna of Austria"), XX).
XX = ["Ferdinand I Holy Roman Emperor", "Anna of Bohemia and Hungary", "Philip I of Castile", "Joanna of Castile"].
If we want to see how a person is an ancestor of another, and not just whether that person is, we need only two more rules:
path(X, Y, []):- parent(X, Y).
path(X, Y, [Z|P]):- parent(X, Z), path(Z, Y, P).
which mimic the definition of ancestor but add a list variable P to accumulate the ancestry path. [Z|P] is the list obtained by prepending Z to P. We can paraphrase the second rule as “There is a path consisting of Z followed by P between X and Y if there is a Z that has X as a parent and there is a path P between Z and Y”.
?- path("Philip I of Castile", "Charles II of Spain", XX).
XX = ["Charles V Holy Roman Emperor", "Philip II of Spain", "Philip III of Spain", "Maria Anna of Spain", "Mariana of Austria"] .
We can combine path
with findall
to ask the
interpreter and provide not one but all paths connecting two members of
the dynasty (the output is truncated):
?- findall(XX, path("Philip I of Castile", "Charles II of Spain", XX), YY).
YY = [["Charles V Holy Roman Emperor", "Philip II of Spain", "Philip III of Spain", "Maria Anna of Spain", "Mariana of Austria"],
["Charles V Holy Roman Emperor", "Philip II of Spain", "Philip III of Spain", "Philip IV of Spain"], ["Charles V Holy Roman Emperor", "Maria of Spain", "Anna of Austria '49", "Philip III of Spain", "Maria Anna of Spain", "Mariana of Austria"],
["Charles V Holy Roman Emperor", "Maria of Spain", "Anna of Austria '49", "Philip III of Spain", "Philip IV of Spain"],
["Ferdinand I Holy Roman Emperor", "Maximilian II Holy Roman Emperor", "Anna of Austria '49", "Philip III of Spain"|...],
["Ferdinand I Holy Roman Emperor", "Maximilian II Holy Roman Emperor", "Anna of Austria '49"|...], ["Ferdinand I Holy Roman Emperor", "Anna of Austria"|...], ["Ferdinand I Holy Roman Emperor"|...], [...|...]|...].
In ordinary families, there should be a unique path linking an ancestor and a descendant, but not in European royal families of the 16th and 17th centuries, and certainly not in this one. Since political expedience dictated marriages, unions between cousins or an uncle and his niece were not unusual. The cumulative effect over several generations was to reduce the genetic diversity to unhealthy levels: Charles II of Spain was extraordinarily sickly and died at just 39; his sister only made it to 21.
Leaving royal inbreeding aside, we can add more features to this
genealogy system: finding if two individuals are half-siblings, if they
are related in any way, which common ancestors they share, or which one
is the closest. Each feature requires one or two new rules. I can’t even
think how long it would take me to implement this from first principles
in a more traditional language. Behind the scenes, while most languages
execute statements in the order they are written or as specified by
control structures, Prolog applies a search and unification algorithm.
The unification part solves the problem of matching parent(X, "Charles II of Spain") with parent("Philip IV of Spain", "Charles II of Spain") for X = "Philip IV of Spain". The search part explores all the
connections between matching facts and rules. It may help to think of it
as a database with a powerful query language. It looks like Prolog would
free the developer from worrying about algorithms, that is, the specific
steps needed to produce results. Unfortunately, it is more complicated.
Sometimes the rule order has a drastic impact on the search space. In
other cases, there are different sets of rules to accomplish the same
task, but they can make the search process more or less efficient: for
instance, see how different sorting algorithms can be implemented [107].
1.2.9.7 Kotlin
Kotlin is a language that runs on the JVM50 and is thus seen as an alternative to Java. It prominently advertises its conciseness [108]:
Concise
Reduce boilerplate code […] For example, one can now create a POJO [Plain Old Java Object] with getters, setters, equals(), hashCode(), toString() and copy() in a single line.
POJOs are just Java objects that don’t satisfy any additional specifications. It is customary to provide them with the methods listed in the quote, which used to be written by hand. I distinctly remember when, early in my career, I discovered I had to write POJOs for the simplest tasks, like accessing a database table. It was like running while submerged to the waist: a trivial curiosity required half an hour of mindless typing. I couldn’t believe it hadn’t been automated in some way. Senior colleagues reassured me that it was standard procedure and that I would improve with practice.
In Kotlin, data classes replace POJOs. An example would be:
data class Person(var name: String, var age: Int)
In Java, this would look like the following snippet, the longest included in this book:
public class Person {
    private String name;
    private int age;
    public Person() {
    }
    public Person(String name, int age) {
        this.name = name;
        this.age = age;
    }
    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public int getAge() {
        return age;
    }
    public void setAge(int age) {
        this.age = age;
    }
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        sb.append(Person.class.getName())
          .append('@')
          .append(Integer.toHexString( System.identityHashCode(this)))
          .append('[');
        sb.append("name");
        sb.append('=');
        sb.append(((this.name == null)?"<null>":this.name));
        sb.append(',');
        sb.append("age");
        sb.append('=');
        sb.append(this.age);
        sb.append(',');
        if (sb.charAt((sb.length()- 1)) == ',') {
            sb.setCharAt((sb.length()- 1), ']');
        } else {
            sb.append(']');
        }
        return sb.toString();
    }
    @Override
    public int hashCode() {
        int result = 1;
        result = ((result* 31)+((this.name == null)? 0 :this.name.hashCode()));
        result = ((result* 31)+ this.age);
        return result;
    }
    @Override
    public boolean equals(Object other) {
        if (other == this) {
            return true;
        }
        if ((other instanceof Person) == false) {
            return false;
        }
        Person rhs = ((Person) other);
        return (((this.name == rhs.name)
                ||((this.name!= null)
                &&this.name.equals(rhs.name)))
                &&(this.age == rhs.age));
    }
}
Java developers aren’t fools and, in this day and age, use code
generators to write this boilerplate51 for them. However,
there is a problem with this approach. Generated code needs to be read,
understood, and maintained. Granted, it doesn’t look all that difficult,
but how do we find out someone didn’t change some detail in the equals
method after its generation? That is why code generation isn’t a
suitable replacement for abstraction. Moreover, I believe POJOs tell us
something about Java. What is wrong with a language that doesn’t allow
abstracting this boilerplate away? For instance, one in which a
developer can’t express with code the following algorithm: “To implement
equality comparison for a class, for each property in that class, call
the equals
method on that property and return true if all
of these calls return true?” It seems like one of the most
straightforward and useful abstractions you could come up with. In
Python, the specific issue of boilerplate methods can be solved with a
library (for instance, Attrs; see Sec. 1.2.10.4), but, apparently, not in Java.
Kotlin offers much more than data classes, and replacing POJOs was only
one of the motivations behind the language, but one that will get the
attention of POJO-aggrieved Java programmers.
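As a taste of the Python alternative, here is a minimal sketch using the standard dataclasses module (rather than the Attrs library the text refers to, which works along similar lines):

from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int

# __init__, __repr__ and __eq__ are generated from the field declarations:
assert Person("Ada", 36) == Person("Ada", 36)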
1.2.9.8 C++
R’s vector functions perform well because they are written in other, lower-level, but more efficient languages, typically C or C++. The most downloaded R package is indeed one that facilitates mixed R and C++ development, Rcpp. Its maintainers took advantage of new features in Cpp11, a relatively recent update to the C++ standard, to refactor or, should I say, cull the Rcpp code base. Software engineer Kevin Ushey reports the following [109]:
Eliminating the code bloat and replacing [it] with variadic templates52 was awesome. […] For some hard numbers, I count 31,428 lines of code in the inst/include directory of Rcpp11, while there are 92,726 total lines of code in inst/include for Rcpp.
This process resulted in a 3-fold reduction stemming from a single
new language feature and without any impact on the Rcpp user, for whom
it is entirely transparent. Templates enable the parametrization of
variable types, but until C++11, the number of type parameters had to be
constant. As a workaround, Rcpp developers, to provide the flexibility
of functions with a variable number of arguments, painstakingly
replicated and adapted their code for each number up to a somewhat
arbitrary maximum, 20 (not unlike what my team-mate did with nesting
levels in Sec. 1.2.4, but in that case a
better solution was readily available). For example, searching for all
the Vector create
definitions in Rcpp, we find 20 instances
[110]:
static Vector create(const T1& t1)
static Vector create(const T1& t1, const T2& t2)
static Vector create(const T1& t1, const T2& t2,
const T3& t3)
static Vector create(const T1& t1, const T2& t2,
const T3& t3, const T4& t4)
static Vector create(const T1& t1, const T2& t2,
const T3& t3, const T4& t4,
const T5& t5)
[14 entries omitted]
static Vector create(const T1& t1, const T2& t2,
const T3& t3, const T4& t4,
const T5& t5, const T6& t6,
const T7& t7, const T8& t8,
const T9& t9, const T10& t10,
const T11& t11, const T12& t12,
const T13& t13, const T14& t14,
const T15& t15, const T16& t16,
const T17& t17, const T18& t18,
const T19& t19, const T20& t20)
To add insult to injury, each definition needs a largely repetitive but slightly customized implementation — here omitted. The developers were aware of this massive amount of repetition. They prefaced this code with a comment describing an imaginary abstraction that deals with the 20 cases in a unified way, which was impossible before C++11.
In Rcpp11, instead, there is only one definition for all these cases, and a single implementation [111]:
static Vector create(Args&&... args)
The line count went from 1245 to 144. However, there is a price to pay for this code reduction: an ever more sprawling C++ standard for a language already quite complex in its first incarnation (see Sec. 1.3.2). Its evolution has been guided by an unwavering commitment to backward compatibility,53 which implies features can’t be removed, and an openness to new features. Its standard committee doesn’t seem particularly concerned about the language becoming harder to learn and read. They aren’t alone in taking this position: for instance, Python has been growing as well, adding list comprehensions, coroutines, and f-strings, among other features. Unlike C++, though, Python abandoned full backward compatibility in its 3.0 version. The co-existence between the two incompatible versions continues at the time of this writing. They were officially supported in parallel for 12 years, explaining why language steering committees often accept backward compatibility as a constraint, no matter how burdensome.
1.2.9.9 Query Languages
SQL queries can quickly grow to be unmanageable. Luckily, there are libraries to side-step this problem, generating SQL from much more concise expressions. Dplyr (see Sec. 1.2.10.1) is one such library. For instance, here is a typical R expression using Dplyr:
mtcars %>%
  group_by(cyl) %>%
  filter(mpg == min(mpg) | mpg == max(mpg)) %>%
  select(mpg, model) %>%
  arrange(mpg)
First, remember the meaning of %>%
, the pipe operator
(see Sec. 1.2.9.3), which takes the expression on
its left hand side and makes it the first argument for the function call
on its right side. This code takes the data in table
mtcars
, a car-related dataset, groups it by the number of
cylinders, filters the rows that match the minimum and maximum fuel
mileage in each group, keeps only specific columns, and sorts the rows
by increasing gas mileage. The value of this expression is the following
table:
cyl mpg model
8 10.4 Cadillac Fleetwood
8 10.4 Lincoln Continental
6 17.8 Merc 280C
8 19.2 Pontiac Firebird
6 21.4 Hornet 4 Drive
4 21.4 Volvo 142E
4 33.9 Toyota Corolla
Dplyr can process datasets in memory or database tables, the latter by generating SQL queries. For example, the following is generated from the above expression (assuming that the target database is Postgres, as SQL is still subject to proprietary variations, despite the existence of a standard):
SELECT `cyl`, `mpg`, `model`
FROM (SELECT `cyl`, `mpg`, `model`
FROM (SELECT `cyl`, `mpg`, `model`,
MIN(`mpg`) OVER (PARTITION BY `cyl`) AS `q01`,
MAX(`mpg`) OVER (PARTITION BY `cyl`) AS `q02`
FROM `mtcars`) `q01`
WHERE (`mpg` = `q01` OR `mpg` = `q02`)) `q02`
ORDER BY `mpg`
It is possible that a skilled SQL developer could come up with a more concise query, but there is no denying that the R-Dplyr expression is clearer and more concise.
Impala, an SQL variant specialized for big data, adds support for complex types. One of the advantages of this extension is that “common querying patterns can be expressed concisely” [112]. This statement reprises the idea that not all patterns can be made concise (see Steele’s “Huffman encoding problem” in Sec. 1.2.9). Knowing which patterns are the most common and valuable, it is possible to make those concise, while still allowing the expression of less common patterns in more roundabout ways rather than dropping them altogether.
1.2.10 Libraries
As we have seen in Sec. 1.2.9, libraries should not be seen as totally distinct from languages. On the contrary, they are language extensions, and the line between them is sometimes blurred. It is no surprise, then, that we can look at them from the point of view of their conciseness, meaning the conciseness they enable in the programs that use them — we will talk about frugal APIs later on (see Sec. 1.3.3). It is almost a given that using a library allows us to write less code than managing without it (but see Sec. 1.2.9.4 for an exception), possibly at the cost of a steep learning curve. However, not all libraries are created equal. We will see through several examples that some libraries specifically target conciseness. Users value this property, whereas they criticize or avoid libraries that force verbose code, particularly in the form of boilerplate.
1.2.10.1 Data Analysis and Processing
A remarkable example is the library Keras for deep learning (which is covered in Sec. 2.1.4). The brainchild of François Chollet, it entered a segment populated by many strong competitors. It doesn’t intend to replace them but only to provide a better API that makes developing deep learning models easier. On the implementation side, it relies on a user-selectable preexisting deep learning library: you can see it as a thin layer of peanut butter and jelly, with the underlying bread provided by others. Therefore it isn’t any faster or more powerful than any of these libraries, but only more convenient and makes developers more productive. These properties are based, at least partly, on the conciseness it enables. Version 0.2.0 opened its README with “Keras is a minimalist, highly modular neural network library” [113]. Among the design principles explained in the same document, we find: “Minimalism. Each module54 should be kept short and simple (<100 lines of code).” Strangely this principle was dropped in later versions, but not the following one [114]:
No separate models [sic] configuration files in a declarative format. Models are described in Python code, which is compact […].
Keras avoids defining a new, separate description language, finding Python is perfectly adequate and highlighting the compactness of model descriptions, which are also modular and easy to combine. Its users agree: fast.ai co-founder Rachel Thomas calls one of the alternatives to Keras, Tensorflow, “verbose and confusing” and recommends switching [115]. Indeed Tensorflow developers gave it the strongest endorsement, by adopting it as their recommended API and incorporating it into their package distribution. A run-of-the-mill feed-forward neural network in Keras looks as straightforward as it should be:
model = Sequential()
model.add(Dense(10, input_dim=4, activation="relu"))
model.add(Dense(1))
There is a small terminology catch: what everyone else calls a
feed-forward neural network here is a sequential model, which
may suggest a model for sequences of variable length, such as time
series. Its input has length four, as the input_dim
parameter specifies. The two layers of this network are added one at a
time, calling a function named with the desired connectivity pattern,
Dense
, and parametrized with their sizes and, optionally,
activation functions. Everything else is set to reasonable defaults
customizable with additional parameters. Let’s see what this looked like
in Tensorflow before Keras was bundled with it:
inputs = placeholder(float32, shape=(None, 4))

# Hidden layer
hidden_size = 10
weights_hidden = Variable(random_normal([hidden_size, 4], stddev=0.01))
bias_hidden = Variable(constant(0.1, shape=(hidden_size, 1)))
hidden = relu(add(matmul(weights_hidden, transpose(inputs)), bias_hidden))

# Output layer
weights_output = Variable(random_normal([1, hidden_size], stddev=0.01), name='weights_output')
bias_output = Variable(random_normal([1, 1]))
output = add(matmul(weights_output, hidden), bias_output)
The general structure is the same: we start with an input layer,
created with the opaquely named placeholder
, then create a
hidden layer, and finally an output layer, but each requires 3-4 lines.
Crucial to creating each is the also opaquely named
Variable
, which creates a vector or matrix object that can
be initialized and later optimized during training. You can see
different initialization strategies being used. The layers’ definitions
make explicit the underlying mathematical operations, such as
add
, matmul
, and relu
, and are
combined by function composition instead of the add
method.
On the one hand, this approach enables the complex architectures used in
deep learning, for which the Sequential
API in Keras is
inadequate. On the other, they expose lots of details practitioners are
often happy to leave at defaults, resulting in three times as much code.
Particularly burdensome are those details about layer dimensions that
need to be compatible between layers and are all calculated
automatically in Keras. Confusingly, the layer dimensions, for which
there are no reasonable defaults, must be specified as parameters to the
initialization strategy, for which, instead, there are defaults. It may
seem a classic case of power vs. ease-of-use trade-off: after all,
reducing a layer to its mathematical definition means we can swap one
transformation for another and create all sorts of experimental neural
modules. Moreover, composing those modules by name means great
flexibility in defining complex architectures. However, Keras isn’t only
for feed-forward neural networks: its Sequential API is just a teaser
for newcomers or the result of recognizing this type of network is
perfectly adequate in many applications. For more advanced ones, it
offers a Functional API, more similar to the one native to Tensorflow,
but with significant differences. The following is how the same network
can be defined using this API:
inputs = keras.Input(shape=(4,))
dense = layers.Dense(10, activation="relu")
hidden = dense(inputs)
outputs = layers.Dense(1)(hidden)
model = keras.Model(inputs=inputs, outputs=outputs)
This approach adds the flexibility researchers need to define cutting
edge architectures while not being substantially more verbose than its
Sequential API. Like the original Tensorflow API, it uses function
composition to assemble neural modules but avoids several of its
pitfalls: this API is still centered around machine learning concepts of
input and layer rather than the more opaque Variable
and
placeholder
as in Tensorflow; the layer dimensions are
specified directly, not through initialization; and there are reasonable
defaults for initialization and numeric types. When necessary, we can
always define a custom layer at the level of mathematical
operations.
However, as good as Keras is, some people think there are better options: it is a very active field at the time of this writing. If Thomas vouches for Keras, fast.ai’s other co-founder Jeremy Howard has already moved on. Reporting on a comparison between Keras and PyTorch, another library for the same class of applications, he writes [116]:
We built models that are faster, more accurate, and more complex than those using Keras, yet were written with much less code.
This code wasn’t immediately available. While I don’t have direct experience with PyTorch, a quick look at plain-vanilla examples showed an API at an abstraction level somewhere between Keras’s functional API and Tensorflow’s, with additional object-oriented boilerplate layered on top, and I wasn’t impressed. However, PyTorch was built for experimentation and not for plain-vanilla models. Therefore, its true worth may not be evident unless someone works with models with the complexity Howard is interested in. Whichever the current champion, it doesn’t look like anyone is finding value in verbose machine learning code, while models are getting more complex over time, putting additional emphasis on compact descriptions. Moreover, no API designer can escape the compromise between the complexity of an API, its power, and the complexity of user code. Nonetheless, some navigate it better than others.
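For reference, here is a rough sketch of the same two-layer network in PyTorch, patterned on its introductory examples rather than on the advanced models Howard has in mind; the class and attribute names are arbitrary.
import torch
from torch import nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # layers are declared here...
        self.hidden = nn.Linear(4, 10)
        self.output = nn.Linear(10, 1)

    def forward(self, x):
        # ...and the data flow is spelled out here
        return self.output(torch.relu(self.hidden(x)))

model = Net()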
Large deep-learning models are trained on GPUs. While most practitioners don’t have to code directly for the GPU, those writing advanced custom models often do, a task that requires highly specialized skills. Triton, a library from OpenAI, aims to change that. In their words [117]:
Triton makes it possible to reach peak hardware performance with relatively little effort; for example, it can be used to write FP16 matrix multiplication kernels that match the performance of cuBLAS — something that many GPU programmers can’t do — in under 25 lines of code. […] although a variety of systems have recently emerged to make this process easier, we have found them to be either too verbose, lack flexibility or generate code noticeably slower than our hand-tuned baselines.
As a taste of programming for the GPU, let’s consider the “Hello World”55 of this field, which is the sum of two vectors. Here is the Triton version (lightly edited from [117]):
BLOCK = 512

@jit
def add(x, y, z, n):
    pid = program_id(0)
    index = pid * BLOCK + arange(BLOCK)
    mask = index < n
    gpu_x = load(x + index, mask=mask)
    gpu_y = load(y + index, mask=mask)
    store(z + index, gpu_x + gpu_y, mask=mask)
GPUs operate with a high degree of parallelism, and individual
threads are organized in blocks. Here, the program_id
call
provides an id for the current block as the function add
runs in parallel on each block of size BLOCK
. Within each
block, there can be even more parallelism, but the library manages this
transparently. The load
and store
calls are
responsible for moving data to the GPU and back. mask
prevents out-of-range accesses. Contrast this with the corresponding
program written in C using a library called CUDA (lightly edited from
[118]):
__global__ void add_kernel(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

#define N (2048*2048)
#define THREADS_PER_BLOCK 512
#define BLOCKS ((N + THREADS_PER_BLOCK-1) / THREADS_PER_BLOCK)

int add(int *a, int *b, int *c) {
    int *gpu_a, *gpu_b, *gpu_c;
    int size = N * sizeof(int);
    /* device allocations, needed before the copies below */
    cudaMalloc((void **) &gpu_a, size);
    cudaMalloc((void **) &gpu_b, size);
    cudaMalloc((void **) &gpu_c, size);
    cudaMemcpy(gpu_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(gpu_b, b, size, cudaMemcpyHostToDevice);
    add_kernel<<<BLOCKS,THREADS_PER_BLOCK>>>(gpu_a, gpu_b, gpu_c, N);
    cudaMemcpy(c, gpu_c, size, cudaMemcpyDeviceToHost);
    cudaFree(gpu_a);
    cudaFree(gpu_b);
    cudaFree(gpu_c);
    return 0;
}
We need a pair of functions, the first of which is a so-called
“kernel,” the part that runs on the GPU. It is marked by the keyword
__global__
, as the documentation explains:
__global__ is a CUDA C/C++ keyword meaning [function] will execute on the device [and] will be called from the host.
Here “device” means GPU. If you find __global__
a
confusing choice to mark these functions, you aren’t alone since
“global” is already a charged word in programming, mostly having to deal
with the scope of variables. Calculating which indexes to operate upon
follows a logic close to the Triton example. However, in this case,
instead of an index range, we have a purely scalar operation. The second
function runs sequentially on the CPU and takes care of memory
allocation, deallocation, and moving data in and out of the GPU. It
invokes the kernel with a unique triple chevron syntax that
provides information about the desired level of parallelism. While
already demonstrating the conciseness of Triton, this basic example
doesn’t highlight its ability to perform automatic optimizations that
are far from trivial and need to be implemented manually in C and CUDA.
An example where this ability is on display is the implementation of the
function softmax
, available in Python with Triton [117] and C with CUDA [119]. However, a complete understanding
of these examples requires a more in-depth treatment of GPU programming
than I can provide. Libraries such as PyTorch [120] and Numba [121] enable equally or even more
concise GPU programming. Unfortunately, they also impose a significant
performance penalty over Triton or hand-optimized C CUDA kernels, in a
highly performance-sensitive application domain.
The quest for conciseness isn’t limited to deep learning. For example, the authors of SparkSQL, a module that extends the big data processing system Spark with relational capabilities,56 are keen on touting its conciseness, both of its internals and of sample programs. They do so at least seven times in the paper where it was first described [122]. SparkSQL attempts to combine relational processing with semi-structured, ETL-type57 processing. A basic example from the original article is the following:
employees.join(dept, employees("deptId") === dept("id"))
.where(employees("gender") === "female")
.groupBy(dept("id"), dept("name"))
.agg(count("name"))
In this expression, employees
and dept
are
two relational tables. The authors remark that this being just a library
within a fully featured programming language (either Scala or Python)
means that we can use control structures, variables, and all the other
features of a general-purpose language together with the powerful
declarative style of SQL (see
also Sec. 1.2.10.1).
Programming directly in SQL would not afford these possibilities. Since
evaluation in SparkSQL is delayed as much as feasible, the developer can
build abstractions for common subqueries without a performance penalty,
because only the final, fully assembled query will be submitted to Spark
and therefore optimized as a unit. As a SparkSQL user put it, it is
“concise and declarative like SQL, except I can name intermediate
results,” and naming is essential to reuse.
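As a sketch of what naming an intermediate result could look like, here is a PySpark variation on the example above; the variable names are arbitrary, and, thanks to delayed evaluation, the extra names cost nothing at run time.
from pyspark.sql.functions import count

# employees and dept are the two tables from the example above
female_employees = employees.where(employees["gender"] == "female")

summary = (female_employees
           .join(dept, female_employees["deptId"] == dept["id"])
           .groupBy(dept["id"], dept["name"])
           .agg(count("name")))   # nothing runs until an action such as show()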
The authors of Scalding, also a Scala library, picked this byline for their Twitter account [123]:
Scalding is a scala DSL58 for Cascading, running on Hadoop. It’s concise and functional, and it’s just good, clean fun.
We don’t need to take a slogan at face value. On the other hand, it is what the Scalding team wanted to advertise, as opposed to, say, that it is scalable. Since it runs on top of Hadoop,59 it inherits its scalability properties. However, since Hadoop is written in Java, that was long the only language in which to develop programs running on it, and its Java API was phenomenally verbose. For example, I once counted approximately 250 lines to perform a simple relational join (that was using an early version, and I hope more concise alternatives have been developed since). Therefore, it is easy to understand why Scalding developers would like us to focus on its conciseness, which we can appreciate in the following example of a join, despite the verbiage devoted to static typing [124]:
case class Book(title: String, author: String)
case class ExtendedBook(title: String, author: String, copies: Long)
val library1: TypedPipe[Book]
val library2: TypedPipe[ExtendedBook]
// somehow load them with data
val group1: CoGroupable[String, Book] = library1.groupBy { _.title }
val group2: CoGroupable[String, ExtendedBook] = library2.groupBy { _.title }
val theJoin: CoGrouped[String, (Book, ExtendedBook)] = group1.join(group2)
Many data manipulation libraries have the data frame as their central data structure. It can be described as tabular or relational data or, in the jargon of statistics, a two-dimensional collection with one row per observation and one column per variable. There is something about this data structure that makes it so valuable. For example, Spark co-author and Databricks Chief Technologist Matei Zaharia says that “[data frames] give you a very concise way of writing expressions, doing operations” (my transcription from [125]).
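For readers who haven’t met one, here is a minimal illustration with Pandas, using a few rows from the table above: one row per observation — a car model — and one column per variable.
import pandas as pd

cars = pd.DataFrame({
    "model": ["Cadillac Fleetwood", "Merc 280C", "Volvo 142E"],
    "cyl":   [8, 6, 4],          # one column per variable
    "mpg":   [10.4, 17.8, 21.4],
})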
Data frames are built into R as the data structure to represent empirical observations, but how R developers work with them has changed over time. An R library, Dplyr, provides an alternative to the built-in capabilities and won out with users, who praise its conciseness — its speed also helps. Take, for instance, this tweet thread between Vince Buffalo and J. J. Emerson, two evolutionary geneticists [126]:
I ♥ how R makes much of data analysis trivial. How good are genotype imputations by freq class? table(real_geno, imputed_geno, cut(freq, 5))
Some of my old R code (100’s [sic] of lines) could now be reduced to a 20-liner with plyr, ggplot, etc.
Haha, yea mine too. The Hadley-sphere is a black hole for wasted/unnecessary lines of code.
The Hadley-sphere mentioned here is a set of packages, including Dplyr, by the same developer, Hadley Wickham. While not all equally popular, some — Dplyr and Ggplot2 above all — have largely replaced corresponding built-ins or core libraries bundled with the language for most users, a rare feat among libraries, as the advantages of incumbency are as real as in politics (see also Requests in Sec. 1.2.10.3).
Tony Fischetti puts it in no uncertain terms in his blog post “How dplyr replaced my most common R idioms” [127]:
Perhaps more importantly, the dplyr code uses many fewer lines and assignments and is more terse and probably more readable with its neat-o “%.%” operator.
His blog post clarifies that these conciseness advantages are more important than a 2x speed-up for his use cases. The operator mentioned in the quote is a version of the pipe operator (see Sec. 1.2.9.3).
A problem with Dplyr is that its author, to facilitate interactive use, made life harder for people who want to create programs with it [128]. R is a programming language, but, even more importantly, a console for statistical analysis by non-programmers. This dual use is uncommon among programming languages — other examples are MATLAB and SAS. Nonetheless, it is better to support interactive and programming uses with the same constructs and libraries to ease the transition from one to the other. To achieve that, Wickham designed two very similar sub-APIs in the same package, one for each use case. I was involved in a discussion about this as the developer of a related package, and I resisted that option calling it “a doubling of the API footprint” [129]. However, my solution required slightly more boilerplate in programming use cases. More recent versions of Dplyr have gone down a similar path, deprecating one of the sub-APIs (for frugal APIs, see Sec. 1.3.3).
We have already touched upon some of Dplyr’s features in Sec. 1.2.9.3, as they are the best known application of R’s
unique macro-like facility. When a data frame table
includes columns a
and b
, we can easily create
a new data frame with an additional column c
with
table = mutate(table, c = a + b)
, as opposed to the
standard R table$c = table$a + table$b
. However, what if we
wanted to parametrize the columns involved in this operation? Let’s
assume the names of the columns are stored, as strings, in three
variables a_name
, b_name
, and
c_name
. In this case, neither of the above expressions will
work. Standard R offers a different operator, [[]]
for this
case: table[[c_name]] = table[[a_name]] + table[[b_name]]
.
Dplyr provides a now deprecated sub API of functions, whose names are
suffixed with a cryptic underscore character, to support this case:
table = mutate_(table, c_name = paste(a_name, "+", b_name))
,
where paste concatenates its argument to yield the string
"a + b"
. While this form avoids the endless repetition of
the data frame’s name, it requires forming R expressions by string
operations, an error-prone and hard-to-read approach. The currently
recommended solution is to refer to the data frame not by its name but
by a conventional name that can occur inside Dplyr expressions only as
in
table = mutate(table, c_name = .data[[a_name]] + .data[[b_name]])
,
which is the longest of all the different versions examined here, and
restores the repetition of names necessary in standard R. As Steele said
about languages, “there are only so many concise notations to go
around,” and the same seems to hold for libraries (see Sec. 1.2.9). This is an instructive example
of the subtle trade-offs between complexity in a library and in its
users’ code and optimizing for different use cases.
Like Keras (see Sec. 1.2.10.1), Dplyr delegates computation to one of several backends, including relational databases. Its developer chose to add advanced SQL features known as window functions, but with limited support [130]:
[…] you can’t access rarely used window function capabilities (unless you write raw SQL), but in return, frequent operations are much more succinct.
Conciseness is difficult; therefore, as we have seen before (see Impala in Sec. 1.2.9.9), an option is making popular features and idioms more concise while allowing a more verbose expression for the less common ones. Here the pain goes further, though, since we have to “punch through the abstraction,”60 reaching for the separate, lower level language of the backend, SQL, and losing the portability61 of “pure” Dplyr expressions.
Unlike R, Python doesn’t come with data frames built-in, but the library Pandas provides equivalent capabilities. However, it is the opinion of some practitioners that it is less elegant than its R counterpart, Dplyr. The project Pandas-ply aims to provide a more Dplyr-like API for Python. Its authors write [131]:
In our opinion, this pandas-ply code is cleaner, more expressive, more readable, more concise, and less error-prone than the original pandas code.
It is still in an experimental stage, and it is far from having the impact on the Python user base that Dplyr had on R users. You can judge whether it should by looking at the following examples, from the Pandas-ply introductory material, the first of which is written using Pandas:
grouped_flights = flights.groupby(['year', 'month', 'day'])
output = pd.DataFrame()
output['arr'] = grouped_flights.arr_delay.mean()
output['dep'] = grouped_flights.dep_delay.mean()
filtered_output = output[(output.arr > 30) & (output.dep > 30)]
Pandas-ply lets you instead write:
(flights
 .groupby(['year', 'month', 'day'])
 .ply_select(arr = X.arr_delay.mean(),
             dep = X.dep_delay.mean())
 .ply_where(X.arr > 30, X.dep > 30))
While the line count is similar, the Pandas example is more “busy,”
with temporary results stored in intermediate variables and more
repeated variable names. The Pandas-ply version uses a chained method
call syntax that avoids intermediate variables and uses a special object
X
to simulate some of the Dplyr syntax based on R’s
macro-like facility, which has no equivalent in Python.62
However, even more telling is the comparison with their R closest
counterpart, Dplyr. Recalling the example (see Sec. 1.2.9.9) where we sought to compute, for
each group of data, the entries matching the maximum and minimum for a
variable, it is instructive to try to reproduce that with Pandas and
Pandas-ply. Simply applying a filter operation to grouped data doesn’t
work for either package because grouped data frames don’t have the same
methods as their ungrouped counterparts. In Pandas, we can use this
workaround:
"cyl")
mtcars.groupby(apply(lambda x: x[(x["mpg"]==x["mpg"].max()) | (x["mpg"]==x["mpg"].min())])[
."mpg", "model"]
[
]"mpg") .sort_values(
The same workaround also works for Pandas-ply:
"cyl")
mtcars.groupby(apply(lambda x: x.ply_where((X.mpg==X.mpg.max()) | (X.mpg==X.mpg.min())))
."mpg", "model")
.ply_select("mpg") .sort_values(
In Pandas or Pandas-ply, a grouped data set is more like a list of
groups, and we can use an approach from functional programming to
apply
a function to every element in this list, each a data
frame. In this function, created with a lambda
expression,
the desired selection is performed using standard Pandas data frame
operators. It is hard to ignore the numerous references to
X
, similar to the .data
identifier in Dplyr,
but for which Dplyr offers more concise, albeit less general,
alternatives. On the result of this apply
call, we can
select variables of interest and perform a sort, alternating operators
like [[]]
and methods like sort_values
, but in
closer analogy with Dplyr. Unlike Dplyr, Pandas doesn’t provide
specialized idioms for interactive sessions, as this use case is less
dominant and more recent in Python. Independent of the use case,
treating grouped data frames differently from ungrouped ones makes the
API harder to use for no increase in power. It represents a loss of
orthogonality, the ability for a set of functions or methods to
operate on different types of objects with no exclusions. Here we have
an example of a lost opportunity to use the more concise form:
table.group_by(...).filter(...)
instead of the more verbose:
table.group_by(...).apply(lambda x: x.filter(...))
The former, an arguably intuitive line of code, is simply disallowed and returns an error instead of something useful.
One of the advantages of being an incumbent, like Pandas, is third-party support. For example, using package Modin together with Pandas, we can partially parallelize a program [132]. Just by importing Modin, unmodified Pandas code will run in parallel. It is the most concise parallelization I have ever seen (compare it with MPI in Sec. 1.5).
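The recipe, in a sketch with a hypothetical file name, is to replace the Pandas import with Modin’s drop-in module and leave the rest of the program alone:
# import pandas as pd           # before
import modin.pandas as pd       # after: same API, partitioned execution

flights = pd.read_csv("flights.csv")
print(flights.groupby("month").arr_delay.mean())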
1.2.10.2 Statistical Graphics
In statistical graphics, the quest for conciseness starts with its design, independent of any implementation. In The Visual Display of Quantitative Information, author Edward Tufte defined the data-ink ratio, the ratio of ink spent to display data vs. the total amount used in a chart, and recommended maximizing it [133]. It is a somewhat counterintuitive definition, as it may seem to encourage the use of heavy marks, like thick lines and large dots; this outcome, though, is prevented by another one of Tufte’s principles, that of data density — the number of data points per unit of area. The result is graphics that display lots of data with just enough ink, minimize and de-emphasize everything that isn’t data, like grids and scales, and eschew ornamentation altogether.
SimplexCT, a data analysis consultancy, provides a step-by-step example of transforming a chart that Tufte wouldn’t hesitate to call chartjunk into one following his principles [134]. Its starting point is full of useless and distracting graphical elements, as in Fig. 5.
The final chart uses graphical elements sparingly, emphasizes the data, and presents it in two different ways while adding insightful comments (see Fig. 6).
An intermediate step between a graphic conception and its display is its high-level formal description. The best-known language for this task is the grammar of graphics. Invented by Leland Wilkinson, it identifies the elements and combinators needed to define a large variety of statistical graphics [135]. It avoids the verbose generality of using raster or vector graphics — describing a graphics point by point or line by line — but also the limitations of predefined graphics types: the scatterplot, the boxplot, and so forth. It has inspired several concrete implementations. A popular one is Ggplot2 (see below), and its author defines the grammar this way [136]:
A grammar of graphics is a tool that enables us to concisely describe the components of a graphic.
The Vega project develops implementations of the grammar in JSON, including extensions for interactive graphics. Vega-Lite is the most compact of the different implementations, because “Verbose specification impedes rapid authoring and hinders systematic exploration of alternative designs” [137]. Here is a simple example:
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"data": {"url": "data/penguins.json"},
"mark": "point",
"encoding": {
"x": {
"field": "Flipper Length (mm)",
"type": "quantitative",
"scale": {"zero": false}
},
"y": {
"field": "Body Mass (g)",
"type": "quantitative",
"scale": {"zero": false}
},
"color": {"field": "Species"},
"shape": {"field": "Species"}
}
}
The first line, $schema
, points to the exact
specification of the language this chart is written in, in this case,
Vega-Lite version 5. Then there is the data source and the type of
mark
or sign to associate with the data. The encoding entry
defines the association between visual properties of the mark, like
position, color, and shape, and variables in the data set, together with
the type of each variable, and details of the scale to use — in this
case, it doesn’t need to include zero if the data doesn’t. Many other
properties can be specified, but defaults are well thought-out. You can
write these specifications directly or use libraries in different
languages to generate them. The resulting graphic can be seen in Fig. 7.
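For instance, Altair is a Python library that generates Vega-Lite specifications; assuming the penguins data is already loaded in a data frame named penguins, a specification along the lines of the JSON above could be produced with a sketch like this:
import altair as alt

chart = alt.Chart(penguins).mark_point().encode(
    x=alt.X(field="Flipper Length (mm)", type="quantitative",
            scale=alt.Scale(zero=False)),
    y=alt.Y(field="Body Mass (g)", type="quantitative",
            scale=alt.Scale(zero=False)),
    color=alt.Color(field="Species", type="nominal"),
    shape=alt.Shape(field="Species", type="nominal"),
)
chart.to_json()   # the Vega-Lite specification, as JSON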
We can violate Tuftian principles using Vega-Lite but defaults prod the user in the right direction. There are some compromises to accept, though. For instance, it was impossible to use polar coordinates or more than 5000 data points in the version I used. I think these are more temporary omissions than fundamental limitations. The main focus for Vega-Lite isn’t reproducing every type of statistical graphics ever conceived but exploring its cutting edge, interactivity.
Much more mature is Bokeh, a statistical graphics package for Python. Its documentation reads [138]:
Its goal is to provide elegant, concise construction of novel graphics […]
It may be its goal, but it doesn’t necessarily excel at it. Newer competitors surpass it along the conciseness dimension. For instance, comparing Bokeh and Dash, Flavian Hautbois reports that the former took 139 LOCs to implement a simple dashboard, whereas the latter only took 105 [139]. He also notes that the Bokeh implementation took much longer and felt harder, beyond what the line count suggests. The above Vega-Lite example can be reproduced with the following Python Bokeh code (simplified from [140]):
# data is in data frame "data"
SPECIES = sorted(data.species.unique())
MARKERS = ['hex', 'circle_x', 'triangle']

p = figure()
p.xaxis.axis_label = 'Flipper Length (mm)'
p.yaxis.axis_label = 'Body Mass (g)'

p.scatter("flipper_length_mm", "body_mass_g", source=data,
          legend_group="species", size=12,
          marker=factor_mark('species', MARKERS, SPECIES),
          color=factor_cmap('species', 'Category10_3', SPECIES))

p.legend.location = "top_left"
p.legend.title = "Species"

show(p)
There are many differences, but the most important is that
p.scatter
call: all this code is needed to create a
pre-defined type of plot, the scatterplot, not to build one from its
constituent elements, as in the systems that espouse the Grammar of
Graphics. Indeed, Bokeh is, by its developers’ admission, only
“inspired” by it but does not implement it. Another noteworthy
difference is how the scales (mappings between data values and graphical
properties) must be created from scratch for the marker
(shape
in Vega-Lite) and the color
attributes.
Also, there is no default for axis labels and legend title, and the
default legend location obstructs the data. As a user of Vega-Lite and
Ggplot2 (see below), I seldom had to deal with these details. Picking
reasonable defaults that will work for most data and plot types is far
from trivial and essential to these libraries’ user experience.
For R users, when it comes to graphics, there is a dominant package: Ggplot2 [141]. An implementation of Wilkinson’s Grammar of Graphics, as the initialism in its name suggests, it followed the same path as Dplyr or Keras (see Sec. 1.2.10.1). It didn’t bring radically new capabilities to its users, and pre-existing alternatives are built right into the standard library. Nonetheless, it was able to displace them by sheer force of convenience, partly based on its conciseness. For instance, Bradley Eck, a research scientist, tweeted: “Still blows me away that I can make this plot with 3 lines (read, melt, qplot63) thanks to @hadleywickham and #rstats” [142]. It may be instructive to look at how to produce the same scatterplot in Fig. 7 with Ggplot2:
theme_set(theme_minimal())
ggplot(data = penguins,
aes(x = flipper_length_mm,
y = body_mass_g)) +
geom_point(aes(color = species, shape = species), size = 3)
Ggplot2 allows us to define a ubiquitous graphics type, the scatterplot, with a minimum of code and a complete focus on the essential choices to be made when designing statistical graphics.64 An aspect where Ggplot2 is starting to show its age is its lack of support for interactivity. Wickham has endeavored to build a replacement, Ggvis, to address that. Explaining its relation to some technologies Ggvis is built upon, he writes [143]:
ggvis makes many more assumptions about what you’re trying to do: this allows it to be much more concise, at some cost of generality.
It is clear that, in his view, conciseness is worth giving up some power. However, unlike other examples (see Impala in Sec. 1.2.9.9) where full generality could be achieved at the cost of more code, with Ggvis, as with Dplyr (see Sec. 1.2.10.1), we have to access an underlying, lower-level technology — a more challenging step.
If we are willing to go lower level to produce statistical graphics, an option is D3.js [144]. Billed as a “library for manipulating documents based on data,” it is more general than libraries like Ggplot2, which focus on statistical graphics and quick data exploration. On the other hand, it undoubtedly requires more work for the same output. Niko McCarty, a science journalist, endeavored to reproduce our scatterplot in D3.js, but his code is too long to show here — over 100 lines, despite help from additional libraries [145]. That doesn’t mean D3’s author, Mike Bostock, isn’t thinking about minimizing the effort, including a quite ambitious idea [146]:
I’m exploring whether we can have better passive reusability in d3.express. Where by leveraging the structure of reactive documents, we can more easily repurpose code, even if that code wasn’t carefully designed to be reusable.
As if designing for reusability wasn’t challenging enough, Bostock is aiming for unplanned reusability, the philosopher’s stone of software development. However, limited to the specific context of D3, he may be able to pull it off.
1.2.10.3 Web Development
There is nothing special about libraries for data analysis or statistical graphics regarding conciseness, and the selection up to this point reflects my professional experience and interests. Similar observations can be made about other types of libraries. For instance, many developers deal with web frameworks65 and bemoan their complexity. They may refer to two unpleasant characteristics. First, some frameworks force onto developers a specific DBMS,66 an object-relational mapping67 library, an architecture such as MVC,68 and more. This lack of modularity is sometimes positively spun as coming “batteries included,” meaning that any integration challenges are taken care of. Second, some frameworks are more verbose than others: they force developers to specify details they are not interested in or repeat boilerplate code in every application.
In a comparison of three popular frameworks, Django, Flask, and Pyramid, Ryan Scott Brown, a backend developer, crowns Flask the winner [147]:
Flask’s Hello World app69 has to be the simplest out there, clocking in at a puny 7 lines of code in a single Python file.
An advantage of a seven-line application is that we can show it here in its entirety [148]:70
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello_world():
    return "Hello, World!"

if __name__ == "__main__":
    app.run()
The main elements here are creating an application object; creating a function that returns some HTML with a Python decorator that associates it with a path, which will go to form a URL; and finally, some boilerplate code to start the app. It looks like the bare minimum, but it is possible to perform slightly better (see Hug and FastAPI below). In Pyramid, it gets a bit more complicated [149]:
def hello_world(request):
    return Response('Hello World!')

if __name__ == '__main__':
    with Configurator() as config:
        config.add_route('hello', '/')
        config.add_view(hello_world, route_name='hello')
        app = config.make_wsgi_app()
    server = make_server('0.0.0.0', 6543, app)
    server.serve_forever()
We start with a simple function that returns some content. Then, that
function and a path must be added to a Configurator
object,
in which they are associated through a name, 'hello'
. What
a round-about way of creating a mapping from URLs to content! Finally,
we need to create an app from this configuration, pass it to a web
server, and get it started. While this is a typical architecture, there
is no need to be reminded of it in every app, like there is no need to
know about inodes when opening a file in Unix. These are details that
Flask, thankfully, abstracts away. Smaller code size and tighter
abstraction go hand in hand. In addition, Pyramid provides a utility to
bootstrap a more complex project consisting of several directories and
files, but we can get away with a single file in the minimal case.
Django also includes such a utility, which is recommended in most
tutorials. Eventually, I figured out how to run a single file Django app
(based on [150]):
settings.configure(
    DEBUG=True,
    ROOT_URLCONF=__name__,
)

def hello_world(request):
    return HttpResponse("Hello World")

urlpatterns = (re_path(r"^$", hello_world),)

if __name__ == "__main__":
    execute_from_command_line(sys.argv)
After some configuration (the DEBUG
mode allows skipping
complex security-related code, too bad safety isn’t the default), we can
see the familiar content-generating function, an associated URL
definition, and a way to start the application that is close to Flask.
One way Django distinguishes itself from the other two is that it is
more monolithic, meaning that it comes with an included database, ORM,71 template language,72
and more to provide a complete environment for web development. The
other two let the developer choose these components among many available
options, a more modular approach. A final word of caution is that a
minimal application should not be considered a complete quality test of
a library by any stretch of the imagination: here, we use it to
highlight how differences in conciseness can be apparent even at this
beginner level.
Among web frameworks, the conciseness champion may be a slightly less well-known entry, Hug, targeting mostly web APIs.73 Its author, Timothy Crosley, has done the comparison work for us and concluded [151]:
For almost every common Web API task the code written to accomplish it in hug is a small fraction of what is required in other Frameworks.
Its approach is based on designing APIs in familiar Python, then turning them into web APIs by simply decorating, in the Python sense, each function we want to be invoked in response to an HTTP request. All the gory details, from (de)serializing74 arguments and return values to route definition, are taken care of with reasonable defaults. We can also get a standard command line interface out of the same code without additional effort.
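Here is a sketch patterned on Hug’s introductory examples — the function and route names are arbitrary — in which the same plain Python function becomes both a web API route and a command line tool:
import hug

@hug.cli()
@hug.get("/multiply")
def multiply(x: hug.types.number, y: hug.types.number):
    """Multiply two numbers; parsing and validation are handled by Hug."""
    return x * y

if __name__ == "__main__":
    multiply.interface.cli()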
Crosley omitted the newer FastAPI from his comparison. Later on, this is what he had to say about it [152]:
Honestly, what you’ve built looks super solid and polished. In many ways, it is what I wanted hug to be - it is really inspiring to see someone build that.
In an exchange of pleasantries, FastAPI developer Sebastián Ramírez acknowledges Hug was an inspiration to his work [153]. FastAPI is explicit about its conciseness. Among its key features, it highlights the following:
Short: Minimize code duplication. Multiple features from each parameter declaration. Fewer bugs.
FastAPI uses an approach similar to Hug’s, with decorators converting regular Python functions into web API routes. It makes better use of Python 3 type annotations and takes better advantage of open standards. While FastAPI isn’t officially intended to build web applications, there is nothing to stop us from recreating our “Hello World!” app:
app = FastAPI()

@app.get("/")
def root():
    return Response("<p>Hello World</p>")
It isn’t very different from the Flask version, save for the boilerplate necessary to start an app. Here, it is unnecessary because an application is started with a command such as Uvicorn with a file containing this code as an argument. FastAPI particularly shines in the processing of arguments and complex data. For example, we can modify the above example to accept two parameters, a string and an int, and return a JSON dictionary containing them:
app = FastAPI()

@app.get("/{fruit}/{how_many}")
def root(fruit: str, how_many: int):
    return {fruit: how_many}
For instance, a valid URL could be “https://example.com/orange/3”
returning the dictionary {"orange":3}
as a response, with
data validation, conversion and reasonable error messages built-in.
The freedom granted by microframeworks to pick your preferred template language allows choosing a new entrant that, strictly speaking, isn’t one [154]:
Dominate is a Python library for creating and manipulating HTML documents using an elegant DOM API. It enables writing HTML pages in pure Python very concisely, which eliminates the need to learn another template language, and to take advantage of Python’s more powerful features.
We usually talk about conciseness in terms of LOC reduction, and defining new domain-specific languages is a way to achieve that. However, templates don’t seem to offer decisive advantages over a library. For example, Dominate and Shiny [155], a dashboard-focused web framework for R, both eschew templates altogether, replacing them with a regular library that generates HTML with more flexibility and conciseness.
Here are some simple examples to get a feel for how to create the same page with these different libraries or template languages. A well-known template language, Jinja2, like all similar technologies, embeds bits of code in a custom programming language inside a string containing HTML [156]:
<!DOCTYPE html>
<html>
<head>
<title>Hello World</title>
</head>
<body>
<div id="header">
<ol>
{% for page in ["home", "about", "contact"] %}<li>
<a href="/{{page}}.html">{{page|title}}</a>
</li>
{% endfor %}</ol>
</div>
<div class="body">
<p>Lorem ipsum..</p>
</div>
</body>
</html>
The only bit of programming here is in that for loop in the middle,
which is used to build a three-item list of links. Statements are
surrounded by {% %}
, expressions by {{ }}
, and
functions are invoked using a pipe-like syntax: page|title
.
It is a quirky language and one more subject that developers need to
learn, but it may represent the path of least resistance for HTML
content creators who aren’t programmers. Dominate can generate the same
HTML with this code (simplified from the Dominate documentation [154]):
doc = dominate.document(title="Hello World")
with doc:
with div(id="header").add(ol()):
for page in ["home", "about", "contact"]:
li(a(page.title(), href="/%s.html" % page))
with div():
attr(cls="body")
p("Lorem ipsum..")
print(doc)
Not only is it more concise, but it also uses the familiar Python
language. However, it may go one step too far by offering three separate
ways to create complex HTML structures: the with
statement,
the add
method, and direct support for iterables75 as arguments. It may support
different programming styles, but it makes the code less readable.
Shiny’s API is exclusively based on functions. Unfortunately, since it
is intended to create dashboards, it didn’t allow me to reproduce the
same HTML exactly. However, the following two snippets generate the same
head and body as the previous code blocks:
$title("Hello World") tags
$div(id = "header",
tags$ol(lapply(list("home", "about", "contact"), function(x)
tags$li(tags$a(href = paste(x, "html", sep = "."), toTitleCase(x))))),
tags$div(class = "body",
tags$p("Lorem ipsum.."))) tags
Nesting in HTML becomes the nesting of calls in shiny. Each
tags
function can take one or more arguments or a list and
operate on it appropriately. If only we didn’t have to write
tags
every time! The likely reason for this verbiage is
that importing only selected symbols from a library isn’t well supported
in R, and the potential for name conflicts would be too high without the
tags
object.
The now-abandoned Minima took the programming-language-first approach of Dominate and Shiny and extended it to all aspects of web development [157]. Web developers must use a plethora of languages: HTML, CSS, JavaScript, and template, query, and backend languages. Minima cut that list down to a single entry: Python. It is too bad we won’t see that project brought to full fruition.
A Python library that, like Dplyr and Ggplot2 for R (see Sec. 1.2.10.1), dislodged the incumbents is Requests. Despite being a replacement for Urllib, which is bundled with every Python distribution, it is downloaded millions of times per week. It is yet another HTTP client library and performs all the operations you can expect from such a package. However, it requires little code, unlike its predecessors. Daniel Greenfeld, a Python developer, writes [158]:
Nuked a 1200 LOC spaghetti code library with 10 lines of code thanks to @kennethreitz’s request [sic] library. Today has been AWESOME.
A 120x code reduction is the second largest shrinkage factor in a concrete project reported in this book (see Sec. 1.2.11 for the record holder). I don’t have the code for this example, nor would I saddle you with 1200 LOCs of spaghetti code, but I have a short one by Requests’ author Kenneth Reitz, which I have ported to modern Python 3 and slightly shortened [159]. First, the Urllib version:
= "https://api.github.com"
gh_url = urllib.request.Request(gh_url)
req = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_manager None, gh_url, "user", "pass")
password_manager.add_password(= urllib.request.HTTPBasicAuthHandler(password_manager)
auth_manager = urllib.request.build_opener(auth_manager)
opener
urllib.request.install_opener(opener)= urllib.request.urlopen(req)
handler print(handler.getcode())
print(handler.headers.get("content-type"))
And then the Requests version:
response = requests.get('https://api.github.com', auth=('user', 'pass'))
print(response.status_code)
print(response.headers['content-type'])
The Requests version is for the laywoman. She needs to access a URL for which she provides authentication credentials. She is interested in the status code and content type header associated with the response. Knowing the basics of the HTTP protocol is enough to understand this version. The Urllib version is for some high priestess of the web. She needs to be initiated into the mysteries of password and authentication managers, URL openers and handlers. It looks like a manifesto of over-engineering, combined with a willingness to rename everything with generic names that obfuscate meaning. The usual offenders are “manager,” “handler,” and “context”: catch-all terms with an aura of competence.
1.2.10.4 More Libraries
Logging code clutters a program’s logic (that is why StackOverflow removed all of it, see Sec. 1.3.1) and is easy to get wrong if consistency is a goal. log_calls aims to solve that. Its creator writes [160]:
log_calls can save you from writing, rewriting, copying, pasting and tweaking a lot of ad hoc, debug-only, boilerplate code — and it can keep your codebase free of that clutter.
log_calls, as the name suggests, is focused on logging function calls
but sports additional features, including logging custom expressions at
any point in the code. It does so with a bare minimum of code and
produces well-organized output. It is based on decorating with
@log_calls
functions whose invocation needs to be
logged:
@log_calls(log_retval=True)
def div_w_remainder(x, y):
    return (x//y, x%y)
The call div_w_remainder(5,4)
will generate the
following output:
div_w_remainder <== called by <module>
    arguments: x=5, y=4
    div_w_remainder return value: (1, 1)
div_w_remainder ==> returning to <module>
To reproduce something of this sort without the assistance of Log_calls, here is what I came up with:
def div_w_remainder(x, y):
    caller = inspect.stack()[1][3]
    print(f"div_w_remainder <== called by {caller}")
    print(f"    arguments: x = {x}, y = {y}")
    retval = (x // y, x % y)
    print(f"    div_w_remainder return value = {retval}")
    print(f"div_w_remainder returning to {caller}")
    return retval
The amount of logging code in the latest snippet is a multiple of the non-logging code and hampers its readability. If one decorator per function or method is too noisy, we can even log all the methods in a class by decorating the class, or all callables in a module with a single call.76
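For example, going by the library’s documentation, decorating a class should log every method call in it; the toy class below is made up:
from log_calls import log_calls

@log_calls()
class Arithmetic:
    def add(self, x, y):
        return x + y

    def sub(self, x, y):
        return x - y

Arithmetic().add(2, 3)   # logged, with its arguments, without
Arithmetic().sub(5, 1)   # touching the methods themselves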
Another programming concern that should not be polluting the main
logic is data validation, which means checking that variables contain
what they are supposed to. Validation is required when accepting user
input, but it may also be appropriate for a function to perform on its
arguments. Its goal is to have a program fail early and with an
exception or an error message if an invalid argument is provided. While
assert condition(x)
may look like a good idea, the problem
is that the evaluation of condition(x)
may fail with its
own exception; or it may return False
, in which case an
AssertionError
will be raised. Developers often want more
control over what exceptions are raised. To this end, they will have to
wrap the condition evaluation in a try-except
statement.
Imagine doing so for every argument of a function.
A possible solution is Valid8, a validation package for Python that
makes a strong conciseness claim: it “provides tools to perform
validation in a minimum number of lines” [161]. While it is hard to prove its
optimality, it certainly simplifies the problem of validating data. For
example, let’s say an argument x
is required to be a
positive number. This condition could be checked as follows:
assert isinstance(x, Number) and x > 0, "Not a positive number"
However, assuming control over the exception raised is required, we need a different approach:
if not (isinstance(x, Number) and x > 0):
    print("Not a positive number", file=stderr)
    raise ValidationError
Using Valid8 we can instead write:
def is_positive(x):
    return x > 0

validate("x", x,
         instance_of=Number,
         custom=is_positive,
         help_msg="Not a positive number")
Well, not any more concise, even if we could reuse
is_positive
elsewhere, but at least it produces a nicely
formatted message with more information, for instance when
x == 'c'
:
ValidationError[TypeError]: Not a positive number. Error validating [x=c]. HasWrongType: Value should be an instance of <class 'numbers.Number'>. Wrong value: 'c'.
or x == -1
:
ValidationError[ValueError]: Not a positive number. Error validating [x=-1]. InvalidValue: Function [is_positive] returned [False] for value -1.
We still need to achieve a more concise solution, which can be done using companion packages Vtypes and Pyfields. These two packages incorporate validation into type checking and classes, respectively. With Vtypes, we can integrate validation into a type [162]:
PositiveInt = vtype('PositiveInt', int, {'should be positive': lambda x: x >= 0})
With this definition, isinstance
performs type and
validation checks as in
assert isinstance(2, PositiveInt)
.
Could we tie this into type annotations for a most concise validation of arguments? Type annotations are parameter type declarations ignored by default but available to tools and libraries for static and run-time checking. Using Vtypes as type annotations with a checking library like Pytypes would give a concise way of validating function arguments. This idea didn’t pan out since Pytypes and Vtypes don’t work together, so I decided to roll out my own decorator:77
def typechecked(f):
    @wraps(f)
    def wrapper(*args, **kwargs):
        for arg, val in getcallargs(f, *args, **kwargs).items():
            ann_type = get_annotations(f).get(arg)
            if ann_type is not None:
                if not is_vtype(ann_type):
                    name = ann_type.__name__
                    ann_type = vtype(name + "__", ann_type)
                ann_type.validate(arg, val)
        return f(*args, **kwargs)
    return wrapper
This decorator exploits type annotations to type check and validate each argument for which an annotation is available. For instance, with the following definition:
@typechecked
def f(x: PositiveInt, y, z: int):
...
before the body of f
runs, f
’s first and
last arguments will be checked to be a positive integer or any integer,
respectively, whereas no checks will affect the second argument. Now it
is concise to my satisfaction and very readable for good measure. If you
are interested in how this decorator works, here is a quick explanation.
It takes a function as an argument and returns one, like any decorator.
The returned function, wrapper
, is decorated with
wraps
from the Functools package, which adjusts its
signature and other metadata to be the same as function f
.
The body of wrapper
uses getcallargs
, from
package Inspect, to obtain a dictionary of all the arguments in the
function call and their values. Iterating through each of them, it does
nothing if there is no annotation (ann_type
); otherwise, if
the annotation isn’t a vtype
, it converts it to one and
then uses it to validate the argument value,78
before calling f
with its arguments.
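To see the decorator in action, here is a hypothetical call sequence for the f defined above (the exact exception types come from Vtypes):

f(3, "anything", 7)      # passes: 3 satisfies PositiveInt, 7 is an int
f(3, "anything", "7")    # raises: "7" fails the int check on z
f(-1, "anything", 7)     # raises: -1 fails PositiveInt's validator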
I have brought up Java mostly on the verbose side of programming (see
Scalding in Sec. 1.2.10.1, and Kotlin in
Sec. 1.2.9.7), and there will be more
opportunities to do so (see java.util.Arrays
in Sec. 1.5). However, here is some good
news: we can implement a simple but significant data structure, an LRU
cache,79 in just ten lines of Java [163]. On second thought, its author,
engineer Christopher Wu, leveraged Java’s extensive libraries rather
than the language’s intrinsic capabilities. He observed the similarity
between the task he had to solve and a lesser-known data structure
already in the standard library, the LinkedHashMap
. It is a
data structure similar to a hash map but with a guaranteed iteration
order corresponding to insertion or most recent access order (as
determined by the third argument to the super
call below).
It even provides an overridable80 eviction policy
(removeEldestEntry
, limited to “when” to remove, whereas
the order is the same as the iteration order), suggesting that the
authors of this data structure foresaw caching as its most likely use.
These features lead to a concise implementation for an LRU cache:
public class LRUCache<K, V> extends LinkedHashMap<K, V> {
    private int cacheSize;

    public LRUCache(int cacheSize) {
        super(16, 0.75f, true);
        this.cacheSize = cacheSize;
    }

    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() >= cacheSize;
    }
}
Since someone else has done the work of implementing
LinkedHashMap
and added it to the standard library,
clearing a high bar for software quality, we are unlikely ever to have
to worry about its implementation and conciseness. However, this
shortcut is unavailable when we need to build an abstraction from
lower-level ingredients. The opposite would be a language with better
abstraction capabilities but less extensive libraries. Are the
programming possibilities worth developing more libraries from scratch?
Which choice will result in fewer LOCs to develop? I don’t think there
is an answer that can be valid for every project.
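For what it's worth, Python's standard library offers a similar shortcut. This is a minimal sketch based on collections.OrderedDict, an analogue for comparison rather than the Java original:

from collections import OrderedDict

class LRUCache(OrderedDict):
    def __init__(self, cache_size):
        super().__init__()
        self.cache_size = cache_size

    def get(self, key):
        self.move_to_end(key)         # mark as most recently used
        return super().__getitem__(key)

    def put(self, key, value):
        super().__setitem__(key, value)
        self.move_to_end(key)
        if len(self) > self.cache_size:
            self.popitem(last=False)  # evict the least recently used entry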
To wrap up this sampling of conciseness-enabling libraries, let me mention three libraries that, like Vtypes (see above), are language extensions. By this, I mean that, rather than focusing on an application domain, they add language capabilities like manipulating function signatures, defining classes, and imitating constructs from other languages.
The first is Forge, which focuses on building APIs. As we will see in the section on APIs (Sec. 1.3.3), proponents of the humane interface support expanding an API with different variants or combinations of the same calls, often with simple implementations, as long as they capture common usage patterns and prevent repetitive code. Unfortunately, writing such APIs usually requires boilerplate code to shuffle arguments around, declare the same parameter list with minor variations and so on. Devin Fee, author of Forge, takes the popular Python Requests (see Sec. 1.2.10.3) as a use case. Analyzing a file in which Requests’ API is defined, he comments [164]:
The code is a little more than 150 lines, with about 90% of that being boilerplate. Using forge we can get that back down to about 10% its current size, while increasing the literacy of the code.
Indeed there are eight functions defined therein [165], each with a one-line
implementation. The first, request
, calls out to a
different module. All the others invoke request
with minor
variations in the accepted arguments. However, its author duly repeats
the documentation eight times, unaware of or uninterested in any
alternative, leading to an eight-fold duplication in some documentation
entries and many more close matches for a total of over 150 lines. In
contrast, Fee manages to get away with about 20:
request = forge.copy(requests.Session.request, exclude='self')(requests.request)

def with_method(method):
    revised = forge.modify(
        'method', default=method, bound=True,
        kind=forge.FParameter.POSITIONAL_ONLY,
    )(request)
    revised.__name__ = method.lower()
    return revised

post = with_method('POST')
get = with_method('GET')
put = with_method('PUT')
delete = with_method('DELETE')
options = with_method('OPTIONS')
head = with_method('HEAD')
patch = with_method('PATCH')
He does so in three parts: first, he replaces the signature of
request
with one from another equivalent function in that
package that doesn’t hide most arguments inside kwargs
.81 This is because he thinks that
fully specified signatures are better, and kwargs
should be
reserved for when the set of allowable arguments is infinite, for
example, for function dict
. Then he defines a higher-order
function that takes request
, binds its argument
method
to the appropriate value (say GET
or
POST
), changes its name accordingly (say "get"
or "post"
), and returns it. Then he uses this function
seven times to create all the other entries in the API. Et voilà, seven
functions in seven LOCs, with all their documentation, as the metadata
is preserved by default. However, there is a catch: to do so, he had to
force seven of the eight functions to share the same signature, whereas
in the original there are four different variants, and some of the
“excess” arguments, unsurprisingly, don’t make sense in all cases. For
example, an HTTP POST can submit data to the server, but an HTTP GET
can’t. Finding in the signature of get
an argument
data
, which will be ignored, is confusing. However, this is
just poetic license, not a limitation of the library, which
has all the tools to define several signatures that share some arguments
but not others. Even a more refined version of this example
would come out way ahead in line count and internal consistency.
Another library for Python, Attrs, focuses on defining classes [166]:
Its main goal is to help you to write concise and correct software without slowing down your code.
Its author, Hynek Schlawack, starts from the observation that many classes require the definition of dunder methods: initialization, comparisons and character representations almost invariably perform the same tasks no matter the class they belong to (see POJOs in Sec. 1.2.9.7). He wryly notes that “while we’re fans of all things artisanal, writing the same nine methods all over again doesn’t qualify for me.” Such a blatant violation of the DRY principle (see Sec. 1.2.3) required a remedy, so much so that some features of Attrs were later incorporated in Python as data classes. The original, though, is still more powerful. Here is a minimal example from the documentation:
@define
class Coordinates:
    x: int
    y: int
That’s all! How much will you miss writing:
def __init__(self, x: int, y: int):
    self.x = x
    self.y = y
Dull, isn’t it? And it isn’t just that method; you need also
__eq__
, __ne__
, and all the other comparisons,
__str__
, __repr__
, and __hash__
.
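To make the saving concrete, here is what the two-line class above already supports, assuming Attrs' define decorator as documented:

p = Coordinates(1, 2)
print(p)                       # Coordinates(x=1, y=2), courtesy of the generated __repr__
assert p == Coordinates(1, 2)  # structural equality via the generated __eq__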
Beyond that, Attrs provides features such as validation and conversion of
fields at initialization, frozen classes, and much more. Given the
popular Pythonista’s motto “explicit is better than implicit,” it isn’t
surprising that the language would require practitioners to write all
the dunder methods explicitly. Luckily, the language is flexible enough,
and its community is creative enough to remedy this design flaw.
Vadr for R “implements workalikes for the author’s (and perhaps your)
favorite features from other languages, making R programs shorter and
more expressive” [167]. The original
Vadr has been supplanted by seven more focused packages, of which,
unfortunately, only two are currently available. Among other features,
Vadr included destructuring assignments, a feature present in a growing
list of languages. It allows writing expressions like
bind[first, ...=middle, last] = a_list
instead of:
first = a_list[1]
middle = a_list[2:(length(a_list)-1)]
last = a_list[length(a_list)]
There are several ways a destructuring assignment is more concise: we
don’t need to repeat the name of the structure being operated upon or
the assignment operator again and again, and the ellipsis symbol
...
replaces working directly with the indexes in many
cases. bind
supports nested lists, and more complex
examples would further highlight its conciseness. Pipelines, also
included in Vadr, allow chaining operations in a more natural order
without creating intermediate variables, as we have seen in Sec. 1.2.9.3. I believe Vadr’s implementation predates the
one that has become the most adopted in R, from package Magrittr. Here
instead of relying on a custom operator, pipelined operations are listed
as arguments to the function chain
:
chain(n,
      seq(length=.+1, 0, 2*pi),
      cbind(sin(.), cos(.)),
      apply(2, diff),
      .^2,
      rowSums,
      sqrt, sum)
This expression calculates the perimeter of a regular polygon with n
sides. Please note the .
symbol represents the output of
the previous operation in this context.
In addition, Vadr offered macros, whose power we have already discussed talking about Lisp (see Sec. 1.2.9.1), partial application, an essential tool for functional programming, and more.
1.2.11 Programs
Before proceeding to examine some programs, a caveat is in order. The equivalence of programs is a more nuanced proposition than the equivalence of functions, and we will see examples where the larger program has more features. However, the smaller programs still broadly perform the same role while being much more concise (for programs focused on frugality, see Sec. 1.3.4).
If two programs fulfill the same need, should we care whether one is larger than the other? A smaller program takes less storage and loads faster, but those are practical rather than ascetic concerns. Indeed, nothing has prevented programs from ballooning to ever larger sizes [12]. However, some people pay attention to code length. For example, comparing VPNs,82 Ars Technica reporter Jim Salter states the following [168]:
WireGuard weighs in at around 4,000 lines of code; this compares to 600,000 total lines of code for OpenVPN + OpenSSL or 400,000 total lines of code for XFRM+StrongSwan for an IPSEC VPN. Two orders of magnitude fewer lines of code mean a lot less attack surface to find flaws in.
A much smaller codebase also means code that’s more likely to work the way it’s supposed to.
Software reviews seldom highlight conciseness or any other property of the source code. To add to this reviewer’s opinion, Linus Torvalds, author of the Linux operating system, commented:
Can I just once again state my love for it [WireGuard] and hope it gets merged soon? Maybe the code isn’t perfect, but I’ve skimmed it, and compared to the horrors that are OpenVPN and IPSec, it’s a work of art.
This praise sounds even more significant compared to Torvalds’ often abrasive posts.
Lwan is a web server, like many out there, but with a notable feature from our point of view: “Its simple architecture and tiny source code guarantees the entire code base can be easily grokked” [169]. Unlike Lwan, many software systems have reached a level of complexity such that reading the complete source code is no longer practical. This fact has several undesirable consequences, including that developers give up on reading source code beyond what is strictly necessary to add their contributions; and that they write code as if it will never be read, save maybe for a quick code review by a fellow programmer. Imagine how literature would be if writers didn’t read whole books and if no one but the authors themselves ever read most pages. Lwan is like a classic, according to the definition given by Italian writer Italo Calvino [170]: you can read it again and again, as Twitter user neurodrone reports [171]:
Lwan is a work of art. Every time I read through it, I am almost always awe-struck.
We have seen that program size isn’t considered a controllable parameter (see Sec. 1.1), at least according to some. However, there is a counterexample: a program for which a length upper bound has been set [172]:
dwm is only a single binary, and its source code is intended to never exceed 2000 SLOC.
Should there be an upper limit on how many LOCs are used to implement a program? That would be an innovation in software engineering. I am not aware of any methodology that imposes constraints of this sort. Throughout my career, I was never asked to limit my LOC output, nor were my coworkers. If you need one KLOC to fix a bug and can produce it within the available time, so be it. The maintenance bill doesn’t reach the original author. You may object that Dwm, a spartan window manager that is freely available, doesn’t prove this is a viable approach to software engineering. However, its list of features seems to omit only advanced items such as Lua scripting or remote control. On the other hand, its configuration requires recompilation, which “[…] keeps its userbase small and elitist. No novices asking stupid questions.” It is an original twist on the luxury market strategy of keeping goods and services out of reach for most.
A recent trend resulting in smaller programs is replacing statistics- or heuristic-based code with machine learning models. These models are anything but simple (see Deep Learning, Sec. 2.1.4), so you may argue that the complexity has just morphed, but it is learned from data rather than hand-coded. A stunning example is Google Translate, which, in the process of switching from a more traditional approach to deep learning, shed 99.9% of its code, from 500 KLOCs to 500 LOCs [173]. This is the largest code shrinkage factor reported in this book, by quite a margin. The newer version apparently also provides better translations [174].
This approach can benefit even small amateur projects. For example, Jacques Mattheij needed to sort 2 tons of Lego and created a robotic system to help with the task [175]. Switching his approach to machine learning, this is what changed:
[…] approximately 2000 lines of feature detection code, another 2000 or so of tests and glue was [sic] replaced by less than 200 lines of (quite readable) Keras code.
(see Keras, Sec. 1.2.10.1).
Small programs can generate very large outputs with properties such as pseudo-randomness, the attribute of a deterministic sequence of numbers that looks generated by a stochastic process and can pass tests of randomness. It replaces true randomness in most applications that need it, including Monte Carlo simulations, online security, and privacy. The art of developing programs of this type, called random number generators, is a small but active subfield of computer science. A simple option, which gives results good enough to be still in use, is the family of linear congruential generators, defined by the recursion $x_{n+1} = (a x_n + c) \bmod m$. $x_0$ is the starting value or "seed" and can be user-provided or based on sources of genuine randomness. The choice of parameters $a$, $c$, and $m$ is crucial and can result in generators of drastically different quality. More modern options use additional operations, including bit-wise logical operations, bit rotations, and shifts, but aren't much more complex. A random number generator currently used in most web browsers comes in several versions, each roughly 13 lines of C code (here showing xoroshiro128+, [176]):83
static inline uint64_t rotl(const uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

static uint64_t s[2];

uint64_t next(void) {
    const uint64_t s0 = s[0];
    uint64_t s1 = s[1];
    const uint64_t result = s0 + s1;

    s1 ^= s0;
    s[0] = rotl(s0, 24) ^ s1 ^ (s1 << 16); // a, b
    s[1] = rotl(s1, 37); // c

    return result;
}
It isn't hard to follow, but for people unaccustomed to C, let's recall that << and >> shift the bits in an integer by a given amount, left and right, respectively.
|
and ^
are bit-wise OR and XOR operators,
respectively. Understanding why successive calls to next
return a pseudo-random sequence is much less intuitive. One of the
co-authors of this generator, Sebastiano Vigna, uses a biological
metaphor to describe this class of programs (my translation from [177]):
[Random number] generators start from a few lines of code but generate billions of numbers. It's like creating a living being and making it live.
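To make the few-lines claim concrete, here is a minimal sketch of a linear congruential generator in Python, following the recursion given above; the default parameters are the widely used Numerical Recipes constants, chosen here only for illustration:

def lcg(seed, a=1664525, c=1013904223, m=2**32):
    x = seed
    while True:
        x = (a * x + c) % m   # the LCG recursion
        yield x

gen = lcg(42)
print([next(gen) for _ in range(3)])  # three pseudo-random integers below 2**32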
1.2.11.1 Concise Programs in Mathematics
If we drop the requirement that a short program outputs anything, how long can it go before stopping? It may sound silly, but this question has been the starting point for a mathematical inquiry into the busy beaver numbers [178]. The nth such number, or $BB(n)$, is the largest number of steps a Turing machine limited to $n$ states can perform before stopping. Using a Turing machine in the definition is helpful because it is easy to define its size, but similar numbers could be defined in any programming language. An alternative definition requires a Turing machine to write the largest number of 1s onto its output tape (the two definitions lead to distinct but closely related functions). Why wouldn't a simple infinite loop be enough? Because it doesn't end. The name of the game is preventing the machine from terminating for as long as possible, but not forever. Otherwise stated, we are trying to encode the biggest number that can fit in a size-constrained Turing machine. If you have ever played the game of writing the largest number that will fit on a sticky note, using digits and operators, this is a more mathematical version of that game.
$BB$ is a function with some properties that I would describe as magical. It increases at a ferocious pace: faster than exponential, even faster than the Ackermann function.84 Indeed, no function growing faster than $BB$ is computable, and neither is $BB$ itself. That is, no Turing machine can take $n$ in input and return $BB(n)$. The proof of this fact is straightforward if we recall the archetypal uncomputable problem: deciding whether a given Turing machine will halt. $BB(n)$ limits how long we have to run a machine before concluding with certainty that, if it hasn't stopped so far, it never will. This bound would provide an easy, if slow, solution for the halting problem: given a Turing machine with $n$ states, run it for $BB(n)$ steps: if it stops before reaching that mark, we have our answer. If it doesn't, it never will. In a way, it turns an infinite question into a finite one. Only a handful of $BB$ values are known, and relatively small ones are provably independent of a standard formulation of set theory. Therefore, no mathematical proof of their values is possible without non-standard axioms.
If you want to try designing one of these machines, they seem to follow a common template. The main idea is to compute some big numbers and do so as slowly as possible. They implement iterated functions, not unlike the random number generators in the previous section, but closer to the Collatz sequence:

$$f(n) = \begin{cases} n/2 & \text{if } n \text{ is even} \\ 3n+1 & \text{if } n \text{ is odd} \end{cases}$$
These produce a non-repetitive sequence of numbers, including some large numbers, when carefully tuned. All intermediate operations are implemented with inefficient algorithms that work at a glacial pace but also fit within the narrow confines of these constrained machines. Numbers are encoded using unary notation,85 which simplifies program logic — for instance, a sum becomes just a concatenation — but requires more elementary steps. Performing slow arithmetical operations on large unary numbers gives these machines their prodigious run times. It is a bit of a culture shock for engineers who prize algorithms that are thrifty with time and memory, but here the goals are reversed: use as much time as possible writing only a tiny bit of code, while memory is no object. Although this approach worked before, there isn’t a rule to generate these machines, and their discovery has been halting.
The only complaint I have about these numbers is about their name. Since they are designed to spin idly for the longest time, they should be called super-procrastinator numbers (for procrastination in human endeavors, see Sec. 3.4).
This concept of a small machine that produces a maximal string of all 1s is related to that of Kolmogorov complexity, which is, for any given string $s$, the size of the smallest Turing machine or program that can reproduce $s$ [179].86 In other words, the theory of Kolmogorov complexity is the theory of maximally concise programs for a given output. It has applications in the theory of computability, probability (see algorithmic probability in Sec. 2.1.2) and program testing. Concerning the latter, Luc Longpre, Vladik Kreinovich, and Ann Quiroz Gates formally show that short programs are easier to test [180]. An informal statement of their main result reads:
Theorem 1 says that to check whether a given program is correct it is sufficient to check this program only on inputs that are not too complicated […] The more complicated the program and/or the specification […] the more examples we need to test.
In this quote, “not too complicated” for inputs means having low Kolmogorov complexity, whereas “complicated” for programs and specifications means having more characters. Therefore short programs are easier to test. Please note that this theorem allows proving the correctness of programs from a finite test set, contradicting Dijkstra’s famous statement: “Program testing can be used to show the presence of bugs, but never to show their absence!” [3]. Unfortunately, there are many obstacles to making this research valuable in practice: first and foremost, the Kolmogorov complexity is an uncomputable function. An additional obstacle is that simple inputs can be huge, since small programs can generate them, as we have learned by looking into busy beaver numbers (see above). We are still a long way from proving real programs correct using this approach.
1.2.11.2 Obfuscation
So far, we have considered line, statement and token count (see Sec. 1.2.6), as well as number of states (see the previous section) as useful measures of program complexity, but there is a type of programming competition that focuses on character count, known as code golfing. Programs created to win these contests sacrifice everything in the name of program length, namely readability, and resort to dedicated languages created for this purpose, such as GolfScript. For instance, this program prints the first thousand digits of $\pi$ [181]:87
;''
6666,-2%{2+.2/@*\/10.3??2*+}*
`1000<~\;
Unlike most code blocks in this book, we will not try to understand this one.
Counting characters may seem to cross the line between terse and silly, but there are good reasons to do so in specific contexts. For example, we used Turing machines as a computing model while discussing Kolmogorov complexity (see the previous section). Had we used any programming language, a workable definition would have had to limit the character length of a program, not just the number of tokens or lines therein. The reason is that it is possible to encode the desired output, no matter how complex, in a single integer or string, and those count as one token in every programming language I am familiar with. Later on, we will see in Sec. 2.1.2 how statisticians have looked at statistical models as programs. Limiting their character length means, among other constraints, that they can’t include numbers of unbounded magnitude or precision in their code. This constraint is essential because allowing such numbers can have surprising statistical consequences (for instance, see example three in [182]). A limit on the number of tokens or lines alone would not enable proving theorems equivalent to those that follow from a character length limit.
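To illustrate the loophole, the snippet below is one statement and a handful of tokens, yet its character count, and hence its output, can be made arbitrarily large (the digits are truncated here only for illustration):

pi_digits = "3.14159265358979323846"   # a single literal can hide any amount of output
print(pi_digits)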
To close this section on concise programs, an example of conciseness veering into obfuscation or, maybe, the visual arts. A random dungeon generator is a program that generates layouts for the game Dungeons & Dragons. Author and developer Robert Nystrom endeavored to write one, in C, that could be printed on a business card ([183]; see also the front cover). GitHub user Joker-vD took a keen interest in it and commented:
Really nice, I had some fun uncompressing and deobfuscating it — there are some nifty tricks in it! […] it is amazing how much can be accomplished in so few lines of code.
1.2.12 Systems
What is valid for programs continues to be so at the level of system architecture, even if the program length metric may not be the one used. For instance, software engineer Brandur Leach explains how the cloud platform Heroku practices architectural minimalism [184]:
Only by concertedly building a minimal stack that’s stable and nearly perfectly operable can we maximize our ability to push forward with new products and ideas.
One takeaway is that maintaining simplicity in the face of supporting new features requires a constant, organized, resourced effort. Another one is that this minimalism is a necessary condition for the evolution of the product, not a “nice to have.”
If the pressure to build new features engenders complexity, we can work to reduce it after the fact. The progression toward complexity is reversible. For example, Stephen Kell, a lecturer in Computer Science, writes [185]:
An underrated way to make progress is by distillation. It’s to recognise repetition, to integrate and unify existing systems where possible.
It is such an important activity that Dru Riley (who also writes a concise newsletter, see Sec. 3.1.1) included it in his LinkedIn resume [186]:
Building scalable services, consolidating duplicate systems and extracting insights.
It is an unusual sighting while scanning resumes and job descriptions. The emphasis is always on building systems and delivering features. The complexity left behind is someone else’s problem, as engineers are shuffled from one assignment to the next and “builders” job hop before the complexity bill is due. Working on reducing complexity is generally considered second-class work and not career-boosting.
Summarizing a keynote by Armstrong, Kell writes [185]:
A theme […] was that creating stuff doesn’t necessarily make progress overall. Proliferation reduces value. Entropy is our enemy. He mentioned the bewildering array of build systems, packaging systems, unidentifiable system software […] and, more generally, copies of the same or similar code and data. Building a “compressor” [see Burn’s “exercise in compression” in Sec. 1.2.1 and Steele’s “Huffman encoding” in Sec. 1.2.9] was his likeably vague solution.
There are even tools to deal with complex architectures and dedicated architect roles whose task is to define them without contributing a single line of code. Both are counterproductive because they will encourage complexity, according to Gergely Orosz, who worked on large infrastructure projects at Uber and Microsoft [187]. If specialized tools are needed to describe an architecture, simplify it. Design tools won’t reduce the coding and maintenance burden for a team.
1.3 Frugal Software
1.3.1 Programming
There is already a specialization of less is more (see Sec. 3.2.1 for the origin of this phrase) for computing: worse is better. It is due to Richard P. Gabriel, a computer scientist known for his work on the Lisp language [188]. His main argument is that software quality doesn’t necessarily increase with the number of features. There is a point where adding more features makes an application less practical because, for instance, it needs more expensive hardware to run; or less usable since it becomes harder to learn. Despite this concept having been formulated early on (see Brooks on “anomalous features” in Sec. 1.3.6) and widely disseminated, it doesn’t seem to have prevented the progressive bloating of applications. Software is still developed as an accumulation of features: the more, the merrier. Marketers want to claim a feature list longer than the competition. Software reviews go through checklists of features. Hardware continues to become more powerful at a roughly exponential rate, removing one reason for restraint. Software resilience engineer Alex Gaynor observed the following [189]:
The most natural implementation of any feature request is additive […]. As this process is repeated, the simplicity of a system is lost and complexity takes its place. […] Every feature request has a constituency — some group who wants it implemented, because they benefit from it. Simplicity does not have a constituency in the same way, it’s what economists call a non-excludable good — everyone benefits from it.
However, there are a few voices in support of simplicity, if a little isolated. Will Shipley, the winner of three Apple design awards, warns us to “start with brevity” [190]. All the other good properties of software should be only increased as necessary — in his words, “as required by testing.” For instance, don’t build flexibility into some software just because flexibility is good, but only if it is a requirement. Likewise, there’s no need to make a videogame as robust as a nuclear installation’s control software.
Software architect and developer Lawrence Kesteloot is even blunter: "[…] refuse to add any features or lines of code unless you need it right now and you need it badly" [191]. Imagine if your boss asked you to perform a task, and your only retort were: "Do we need it badly?" That conversation would be short, I suspect. However, it could last marginally longer if you let your boss know that programming superstars at Google ask this type of question. In the words of Hacker News user nostrademons [192]:
For those who work inside Google, it’s well worth it to look at Jeff & Sanjay’s commit history and code review dashboard. They aren’t actually all that much more productive in terms of code written than a decent SWE3 who knows his codebase.
The reason they have a reputation as rockstars is that they can apply this productivity to tasks that really matter; they’re able to pick out the really important parts of the problem and then focus their efforts there, so that the end result ends up being much more impactful than what the SWE3 wrote. The SWE3 may spend his time writing a bunch of unit tests that catch bugs that wouldn’t really have happened anyway, or migrating from one system to another that isn’t really a large improvement, or going down an architectural dead end that’ll just have to be rewritten later.
It isn’t so much the code they write, but what they decide to work on, which implies they have the agency not to work on something less relevant. You can’t become a superstar if you have to pick the next entry in a so-called backlog.88 The programmers referred to in this excerpt are, most likely, Jeff Dean and Sanjay Ghemawat, whose work includes MapReduce, GFS, BigTable, and Spanner, cornerstones of the Google infrastructure and influential projects throughout the industry.
Tero Piirainen sees frugality as a competitive advantage [193]:
Minimalism: an undervalued development skill
Here's a cheap trick to make a difference: focus on the bare essentials and get rid of the rest. It's easy to be different, because others are mostly doing the opposite: tons of crap.
As the CEO of a company making a web analytics product, he must be concerned about page load times, which his software contributes to (see the “website obesity crisis” in Sec. 1.5). Nevertheless, his argument goes beyond that: the additional features his software doesn’t have aren’t left out in a compromise with speed; they are because they would make his product worse. However, I can’t exclude there is a hint of sour grapes as well.
Atwood goes further by calling code “our enemy” [194]:
Every new line of code you willingly bring into the world is code that has to be debugged, code that has to be read and understood, code that has to be supported. Every time you write new code, you should do so reluctantly, under duress, because you completely exhausted all your other options. Code is only our enemy because there are so many of us programmers writing so damn much of it. If you can’t get away with no code, the next best thing is to start with brevity.
Similarly, Kesteloot calls for writing not just less but “the least code that will get the job done” [195]. Elsewhere he writes [196]:
Every line of code you write is a potential bug. Do not write any line of code unless you absolutely need it right now and your program will suffer for the lack of it. Do not write routines speculatively. Do not write abstraction layers you don’t need right now. If an optimization will add any complexity whatsoever […], resist it. You will be sorry in five years when your code is riddled with potentially-buggy [sic] code that you never really needed to write.
It is a resounding rejection of planning for future or predicted needs and a warning against over-engineering. The value of keeping a project small and the uncertainty about future requirements make a strict focus on the present the best strategy. Anyway, "You Aren't Gonna Need It" (YAGNI). This principle is part of a software methodology called Extreme Programming that indeed rejects the usefulness of planning, even if its supporters don't word it this way [197].
While some of these warnings may sound like bordering paranoia, think of the Heartbleed bug (see Sec. 1.5) and other major security breaches or mission- and life-critical software. In those contexts, a defensive attitude is warranted. However, when developing a game or other entertainment software, the best balance between features and robustness or maintainability may be different.
All good software must have one feature: logging; well, until now. From now on, it is optional, with Atwood’s blessing [198]:
Logging means more code. If you’re using a traditional logging framework like log4net, every logged event is at least one additional line of code. The more you log, the larger your code grows. This is a serious problem, because code is the enemy. Visible logging code is clutter — like excessive comments, it actively obscures the code that’s doing the real work in the application. […] We’ve since removed all logging from Stack Overflow, relying exclusively on exception logging. Honestly, I don’t miss it at all.
Blogger and developer Itamar Turner-Trauring urges his readers to stop being a "10x programmer" and become not just an average one but a "0.1x programmer" [199]. Some of his suggestions are about doing more with less, from choosing a higher-level language to increasing code reuse. Others are about doing less, including "drop the feature" and a radical "drop the product." The former is an invitation to filter feature requests before queuing them. Clients may not know other ways to achieve their goals and may request capabilities that are seldom needed and for which it isn't worth developing, learning, and maintaining new features. "Dropping the product" is a reminder that not all good ideas are well received, and that it is essential to test the market fit before devoting a lot of resources to developing one. On the other hand, users may not know they need a product or believe in its feasibility until it is in their hands. Please note that, in both cases, "drop" is construed as "do not develop." Dropping an existing feature or product presents additional hurdles by breaking users' expectations.
If dropping one product isn’t enough, the management at Basecamp went even further: they dropped all but one, spinning off or selling the others [200]. Their motivation was that they felt a little scattered and didn’t want to grow into a larger company. An external observer may counter that the other products were not remotely as successful as their main one or had been integrated into it. The move also partially reneged on a promise to support their products forever (they added a new product again more recently, see Sec. 3.4).
They aren’t the only ones to drop products. Google has retired so many products that people are wary of adopting the ones that remain [201]. This results from an internal reward structure that favors product launches over incremental improvements rather than any frugal thinking. There is nothing frugal about today’s Google.
1.3.2 Languages
No matter how simple or complex at conception, languages tend to drift towards complexity by incorporating new features over time: generics in Java, coroutines in Python, and objects in Lisp. Therefore, maintaining simplicity in a language doesn’t seem to be a primary concern. We have already seen how the original Lisp, or some of its dialects like Scheme, are relatively small languages with few but powerful features. Yet, it isn’t clear that it was a primary goal in their design. As we have learned in Sec. 1.2.9.1, the original aim was to replace the Turing machine with a language resulting in more concise programs, not a simpler language. However, simplicity helped greatly in creating its first interpreter, for instance.
A language that should drop features is JavaScript. Facundo Olano writes [202]:
JavaScript is loaded with crap (the bad and awful parts); there’s a great functional language somewhere in there, striving to get out; sometimes, a lot of the times, less is more. […] We can (and we should) make a better language out of JavaScript by subsetting it.”
There even is a book about it: JavaScript: The Good Parts [203]. However, in practice, those who avoid the bad parts can’t prevent others from using them and, therefore, still have to deal with programs written in the entire language. Hence, a better option may be writing in JavaScript derivatives such as CoffeeScript or TypeScript, which address JavaScript’s shortcomings while maintaining a high level of compatibility.
The idea of subsetting a language isn't new. Hoare was scathing in his critique of complex languages such as PL/1 and Ada and proposed to fix them by subsetting, turning some features into extensions to be used when appropriate [204] (see also Sec. 1.5). Indeed, at least two dialects of Ada exist, SPARK and HAC, which are both strict subsets of the original language. On the other hand, he praised Pascal for its simplicity and observed how it was the starting point for several dialects for different applications. Another example of subsetting is RPython, a dialect of Python "amenable to static analysis" [205]. It doesn't have a rigorous definition: "The exact definition is 'RPython is everything that our [Pypy's development team's] translation toolchain can accept' :)." It nonetheless serves its purpose as the implementation language for the Python interpreter Pypy. Comparing it to the traditional one in C, "the clear advantage is that such a description [the one in RPython] is shorter and simpler to read, and many implementation details vanish."
The typesetting language TeX may or may not be considered frugal, but it sure isn’t going to become any more feature rich. Its version number is approaching the constant , indicating that it is reaching its final state, and new versions only incorporate bug fixes. The associated program Metafont’s version is also converging, in its case to , the base of the natural logarithms. Despite this choice, TeX remains the standard for typesetting scientific documents.
Flouting the trend towards complexity, the designers of two lesser-known languages intend to simplify them over time. The authors of Newspeak named their language after the one described in Orwell’s novel 1984 [206]. That language shrank over time to limit the expression of subversive ideas, foretelling keyword-based censorship on today’s social media. Newspeak’s authors call it “an ideal to strive for — a shrinkable language.” However, they aren’t clear about how they intend to shrink it, and, at the time of this writing, one addition and no subtractions appear to be planned. Maybe we need to write a few subversive programs in Newspeak to spring them into action.
Nim is a new statically typed compiled systems programming language. One of its main contributors, Dominik Picheta, admits that it “has bloated a little and we’re now trying to remove as many features as we can” without specifying what criteria would be followed [207]. In the same discussion, Hacker News user weberc2 talks about stripping down Python by, for instance, removing classes. However, it is unclear why classes and not some other feature. Removing features is always controversial and, for programming languages, causes the existing code base to stop working consistently. Without a clear and, ideally, automated way to convert code from the full featured language to the reduced one, shrinking isn’t an option for any programming language with an existing user base. However, as we have seen above, it is possible to define new languages from existing ones for specific purposes by removing features, with no expectation of carrying over existing codebases.
1.3.3 Libraries
We have already talked about libraries from the point of view of how they help developers write concise code. However, we haven’t talked about the libraries themselves, particularly the part visible to its users, its API. There are at least two approaches in API design: minimal and humane. According to Fowler, a minimal interface provides the capabilities of a library with a minimum of complexity, that is, a minimum number of functions, methods, or classes accessible to a user [208]. A humane interface strives to support as many reasonable and prevalent use cases as possible [209]. That means a developer writing against a minimal API will have to write more code but learn less than against a humane one. For example, imagine an array library that provides two functions, one to sort an array and one to extract its leading elements. We can combine them to find the top largest elements (let’s ignore the existence of a specialized algorithm for this problem). Should an API provide this capability or leave it to its users to implement?
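As a sketch of the trade-off, with hypothetical function names, a minimal API could expose only the two primitives and let the client compose them, whereas a humane API would also ship the composition:

def sort_desc(xs):
    return sorted(xs, reverse=True)

def take(xs, k):
    return xs[:k]

# a humane API would add this common-case helper; a minimal one leaves it to the client
def top_k(xs, k):
    return take(sort_desc(xs), k)

print(top_k([5, 1, 9, 3], 2))  # [9, 5]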
Joshua Bloch, co-author of the Java standard library, wrote this about API design [70]:
Every facet of an API should be as small as possible, but no smaller. You can always add things later, but you can't take them away.
Don't make the client do anything the library could do. Violating this rule leads to boilerplate code in the client, which is annoying and error-prone.
The two requirements seem to almost directly contradict each other:
keep the API small but make it do as much as reasonably possible. Others
have also observed this contradiction. For example, discussing the
omission of the function getLast
from the Java
List
API, Matthew Murdoch, author of a unit testing
framework for Arduino, writes [210]:
[…] design guidelines often conflict and the hardest part of an API designers [sic] job is to balance these conflicts.
Stating the trade-offs and giving guidance on finding the right balance would be more helpful than listing clear but incompatible goals.
What is more frugal than an API with a single entry, a function that does it all? Of course, we can always force unrelated functionality into the same function, but that isn’t particularly helpful. Nonetheless, I found four libraries that can get away, without strain, with a single entry point:
- Parso is a parsing89 library for Python [211];90
- Tangent is a differentiation91 library for advanced machine learning [212];
- Specter is a collection-related library for Clojure [213];
- Loguru is a logging library that strips logging down to its core: “No Handler, no Formatter, no Filter” [214].
These are four highly non-trivial libraries for meaningful tasks, which are all exposing a most minimalistic API.
Testit, a testing library for R, is close to this ideal as well, with
two calls implemented in about 30 LOCs — neither KLOCs nor MLOCs. Its
author, Yihui Xie, is explicit about its minimalism: “There is no plan
to add new features or reinvent anything in this package. It is an
intentionally tiny package” [215]. That “reinvent” could be
construed as a jab directed at a competing package, testthat, which
forces users to rewrite boolean expressions in a new language of
expectations, as in expect_equal(x, y)
instead of
x==y
, for no apparent advantage [216].
Another library for R is Tidyr. While not as minimalistic as the previous ones, it follows an approach of improvement through shrinking: “Just as reshape2 did less than reshape, tidyr does less than reshape2” [217]. These three packages officially aren’t versions of each other, but they cover similar ground, and development remained active only on the most recent at any given time. The renaming strategy is meant to allow their author to break backward compatibility and rectify any perceived design errors in previous packages while minimizing user opposition. The reduction in features, more than the result of a selection, is the outcome of eliminating duplications between this and other related packages while improving their interoperability.
What if we could take libraries that manipulate different types of
sequences — in memory, on disk, lazy,92
and so on — and make them one and the same? Just one library to
implement and to learn? That is what Rich Hickey, author of Lisp
derivative Clojure, has tried with transducers [218]. I recommend the referenced video
over other online documentation as it is the most effective in comparing
and contrasting designs with and without transducers. Transducers are
just like the familiar functional primitives map
,
filter
, and reduce
; only they abstract away
from the actual collection they work on. They aren’t specific to lists
or files or streams etc. To apply them to data, we need to call a
function, transduce
, or other transducer-aware functions,
which accept a sequence and a transducer as arguments and trigger the
computation. It is a subtle difference that provides a unifying
abstraction over a variety of sequence-like collections, which
significantly reduces the size of the API and the libraries
themselves.
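Clojure aside, the gist can be sketched in a few lines of Python; this is a toy illustration of the idea, not Clojure's actual API. A transducer is a function that transforms a reducing step, so it never mentions the collection it will eventually run over:

from functools import reduce

def mapping(f):
    return lambda step: lambda acc, x: step(acc, f(x))

def filtering(pred):
    return lambda step: lambda acc, x: step(acc, x) if pred(x) else acc

def transduce(xform, step, init, coll):
    return reduce(xform(step), coll, init)

# sum the squares of the even numbers, over any iterable source
keep_even = filtering(lambda x: x % 2 == 0)
square = mapping(lambda x: x * x)
print(transduce(lambda step: keep_even(square(step)),
                lambda acc, x: acc + x, 0, range(10)))  # 120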
How often are companies racing to endow their products with the most extensive feature list, even if the user experience suffers as a result? Fortunately, that isn’t the case with StaticFrame. Instead, it sports an impressive “negative feature list,” a list of features that it doesn’t offer, goals it doesn’t aim for, and a svelte codebase [219]:
StaticFrame is not a drop-in replacement for Pandas. […] Further, as StaticFrame does not support in-place mutation, architectures that made significant use of mutability in Pandas will require refactoring. StaticFrame is lightweight. It has few dependencies (Pandas is not a dependency). The core library is less than 10,000 lines of code, less than 5% the size of the Pandas code base. StaticFrame does not aspire to be an all-in-one framework for all aspects of data processing and visualization. […] StaticFrame does not implement its own types or numeric computation routines, relying entirely on NumPy.
This definition by negation reminds me of the closing verses of a poem by Nobel Prize winner Eugenio Montale, Non chiederci la parola (Don’t ask the word) [220]:
Codesto solo oggi possiamo dirti,
ciò che non siamo, ciò che non vogliamo.
(All we can tell you today is this, / what we are not, what we do not want). I am afraid this approach did not secure an enormous user base for StaticFrame, but it is a breath of fresh air compared to some of the cluttered do-it-all packages out there.
1.3.4 Programs
If few libraries and frameworks have frugality as a goal, even fewer
complete programs and apps do so. The Unix utility less
seems a promising candidate, at least judging from its name. Yet it
originally was a version of the utility more
for scrolling
through text files, with the additional ability to scroll backward.
Therefore, confusingly, less
has more features than
more
.
The Unix philosophy requires keeping programs small and focused (see Sec. 1.5). It hasn’t always kept faithful to this mandate in more recent versions, for instance, in the case of Systemd (also covered in Sec. 1.5). Another Unix program affected by bloating is Xterm. The people behind suckless.org condemn it in no uncertain terms [221]:
xterm is bloated and unmaintainable. […] It has over 65K lines of code and emulates obscure and obsolete terminals you will never need.
The popular alternative, rxvt, has only 32K lines of code. This is just too much for something as simple as a terminal emulator.
suckless.org provides an alternative with st
, a simple
terminal emulator that focuses on the most common terminals and avoids
feature duplication with terminal multiplexers93
like Tmux. You may wonder why st
belongs in the frugal
section, whereas dwm
, another project championed by
suckless.org (see Sec. 1.2.11),
belongs in the concise section. It is a matter of emphasis.
dwm
seems to have the essential features of a window
manager but endeavors to deliver them within a maximum of 2 KLOCs. st
’s focus is on dropping emulation of obscure
terminals and features provided by multiplexers, but it is no
heavyweight at about 6KLOCs. I will concede, though, that the line
between frugal and concise is blurred for complex software
artifacts.
Why not write your own application when nothing on the market is frugal enough? It is a rite of passage among tech bloggers: developing a custom platform. For example, Hague completed his with “269 lines of commented Perl” [23]. There is no question his codebase size is a small fraction of that of an off-the-shelf alternative like Jekyll. However, with the abundance of blogging solutions available, those 269 LOCs didn’t have to be written and will, presumably, be lightly used and tested. Nonetheless, running and maintaining your own bare-bones blogging platform may be easier than doing the same with WordPress.
The preceding one was a small selection of programs that are frugal by design, and while there are no claims of completeness anywhere in this book, its brevity reflects the paucity of examples. The industry has long moved decisively in the opposite direction. Programs compete on features, not on simplicity. Code bases have been growing for decades, with no end in sight [12].
1.3.5 User Interfaces
If APIs connect a program’s modules, user interfaces or UIs connect programs and people. It may not be surprising, then, that some of the observations that are valid for one (see Sec. 1.3.3) carry over to the other. For instance, Quip founders Bret Taylor and Kevin Gibbs wrote that “When designing a user interface, it’s much harder to remove something than to add in something new” ([222]; compare with Bloch’s guidelines for APIs in Sec. 1.2.8). Users may be more adaptable than programs, but while they can ignore a new feature, they can’t replace one that has been removed, which is an option for developers. The result is this constant pull towards complexity, which needs to be deliberately countered.
Regarding graphical UIs, Rian van der Merwe formulates principles modeled after Tufte's data-ink for statistical graphics (see Sec. 1.2.10.2). These principles mandate that a good UI needs to show data of interest to the user, do it efficiently with the pixels available and avoid wasting pixels on anything else [223]. Remember skeuomorphic UIs, which tried to show, say, a calendar as if it were made of paper and leather? Or web 2.0 sites with only 48-point and larger fonts? Just the opposite. Looking at fake leather gets old quickly, and huge fonts spread the same information over many screens. A similar approach works for both statistical data and family pictures from a UI standpoint. If the user is after either, the screen should show more of those and less of everything else. The same is true for input buttons or fields or any UI elements.
A radical approach to UI design is using real-world objects instead of graphical elements. It has been adopted by Dynamicland, the brainchild of Bret Victor [224]. This system can track physical items' positions, recognize text and other marks on their surface and project text and symbols onto them. The cameras and projectors are hidden in the ceiling. No code or data is needed to define UI elements' appearance and behavior: users stick colorful labels on those they want the system to track. The learning curve is flatter as ordinary objects are already familiar. Compare this restrained approach with Mark Zuckerberg's Meta's goal of replacing the real world with a privately-owned artificial one. Its intrusiveness is matched only by the eye-watering development costs: $15B/year most recently and $100B94 over the project's lifetime up to 2022 [225]. The only downside of ordinary objects is that they are susceptible to coffee spills, unlike their virtual counterparts.
Turner-Trauring loves proposing radical solutions, and UI is no exception. After he suggested we "drop the product" and become a "0.1x programmer" in another post (see Sec. 1.3.1), what could he recommend about the UI? Remove it [226]! He observes that user input isn't a goal and that if we can help users achieve their aims without it, we should. For instance, we could provide a default value, guess a value from other inputs or just manage without.
1.3.6 Systems
FreeRTOS is a little operating system that reportedly fits in only three files of C code [227]. Before you dismiss it as a toy project, let me add it was backed by a dedicated company that later passed its stewardship to Amazon AWS. It is used mainly for microcontrollers.95 When verifying this information, I found it somewhat outdated [228]. The whole project is over 160 KLOCs at the time of this writing, but most files seem related to porting to different hardware. If I got that distinction right, the core is below 30 KLOCs, comprising six C files and fewer than 20 header files. Still small, but not just three files.
A much more considerable expansion affected the UNIX kernel. About an early version, John Lions writes [6]:
Not least amongst the charms and virtues of the UNIX Time-sharing System is the compactness of its source code. The source code for the permanently resident “nucleus” of the system when only a small number of peripheral devices is represented, is comfortably less than 9,000 lines of code.
This quote comes from a book containing a full copy of that code: A Commentary on the Sixth Edition UNIX Operating System. It became the centerpiece of operating systems classes around the world in the late 70s. Operating systems with 10 or even 100 times more LOCs already existed then.
I lost track of what happened to this UNIX flavor, but it is easier to find information about an independent UNIX implementation, Linux. A book modeled after the Commentary but devoted to this operating system explains why they picked a fairly old version thereof [229]:
The current Linux kernel source code amount is in the number of millions of lines, the 2.6.0 version of the kernel code line is about 5.92 million lines, and the 4.18.X version of the kernel code is extremely large, and it has exceeded 25 million lines! So it is almost impossible to fully annotate and elaborate on these kernels. The 0.12 version of the kernel does not exceed 20,000 lines of code, so it can be explained and commented clearly in a book. Small but complete.
In 2021, the 5.11 release of the Linux kernel had around 30.34 million LOCs, and the expansion has accelerated since [230]. A great deal of this code inflation is due to the inclusion of device drivers in the kernel source, a unique arrangement that moves some of the burdens of driver maintenance to the kernel developers. However, it isn’t just the drivers. As we will see in Sec. 1.5, Systemd, the replacement for the init process, has passed the 1.3 MLOCs mark and has been criticized for this and other reasons. Luckily, Linux kernels are highly configurable, allowing the creation of minimalistic ones for underpowered devices, teaching purposes, or just for fun. Minimal Linux, for instance, is about 1.6 MLOCs [231].
We can find hints at frugality even in The Mythical Man-Month (see Sec. 1.1), hailing from a time when resource constraints dominated any other consideration (even if the book was published in the 90s). In it, Brooks calls for unity of design above feature completeness [5]:
It is better to have a system omit certain anomalous features and improvements, but to reflect one set of design ideas, than to have one that contains many good but independent and uncoordinated ideas.
It is a timid step toward frugality, but it is a start.
1.4 Hardware
I didn’t think about concise hardware until I ran into a keyboard described as follows [232]:
Concise and portable: Designed with ultra-thin, light weight, small size and concise outlook, [the] RK61 keyboard is handy, comfortable and portable to carry to [sic] everywhere for familiar typing in any room.
On second thought, it isn’t a particularly ascetic device but rather one subject to physical constraints. Any keyboard is a compromise between having more and larger keys and keeping them within easy reach or making the device portable, two physical constraints. Some keyboards sport dedicated media keys or an extra-large enter key; others tout their compactness and are even foldable.
A better example of ascetic hardware is the missing keyboard of the iPhone. Introduced by Steve Jobs as a single device to replace three distinct ones — a phone, a music player, and an internet-connected device — it crucially lacked the tiny keyboard of contemporary competitors in exchange for a much larger screen [233]. It was a bold step. John Markoff wrote in the New York Times [234]:
If there is a billion-dollar gamble underlying Apple’s iPhone, it lies in what this smart cellphone does not have: a mechanical keyboard.
The size and weight constraints smartphone designers operate under make this not a purely ascetic choice. Nonetheless, iPhone competitors quickly responded with devices sporting sliding keyboards and large touchscreens. Despite this, and despite the fact that on-screen typing is slower, mobile keyboards have gone extinct. Rumor had it that the iPhone’s few remaining buttons — all side buttons, with the home button long gone — were on the chopping block for the iPhone 13, but that turned out not to be the case [235]. Maybe a future iPhone will turn on when you look at it, or will simply always be on.
Another less well-known but relevant hardware example is the RISC architecture, short for Reduced Instruction Set Computer [236]. Just the name is ascetic: why would we want to reduce the instruction set when CPUs96 with rich instruction sets are plentiful? CPU designers must choose a trade-off between a rich instruction set, which allows accomplishing more computation in a single instruction, and a simplified one, which can be implemented more efficiently. With the help of compilers, interpreters, and emulators, the same programs can run on either type of CPU. From that point of view, the two architectures are equivalent, and therefore, in our jargon, we can see RISC as a concise architecture. Even before compilers were available, some liked programming with a smaller instruction set (see the IBM 650 in Sec. 1.6). When RISC was proposed in the 80s, despite evidence that it could work better, the dominant players in the industry were too invested in its opposite, CISC (for Complex Instruction Set Computer), to adopt it. RISC survived as an appropriate choice for low-power applications or where performance was more important than backward compatibility, for instance, in game consoles. One of those low-power applications was the smartphone, which quickly made RISC the most widespread CPU type. Now that Apple has chosen RISC for most of its product line, including the desktop, we may be approaching a day of reckoning for CISC. All CPUs today work under energy or power constraints, either because they are battery-powered or because of thermal requirements. In this context, the nimble RISC architecture has the potential to dominate.
1.5 Verbosity and Bloatware
While this book’s main focus so far has been on concise or frugal software, with its verbose and bloated counterpart mentioned only as a point of comparison, this section is devoted to the latter. Just as the bad guys are often more entertaining in fiction, “verboseware” and bloatware have inspired some juicy stories.
They also have some surprising supporters. Conciseness, for its part, has a bad reputation that goes back to the same Dijkstra who, as quoted in Sec. 1.1, warned about the threat represented by large programs [39]:
[…] a specific phenomenon occurs that even has a well-established name: it is called “the one-liners”. […] one programmer places a one-line program on the desk of another and […] adds the question “Can you code this in less symbols?” — as if this were of any conceptual relevance!
It is somewhat puzzling that he would not consider for a moment that there may be some relationship between reducing 10 LOCs to 1 and reducing 1 million to 100,000, even if different techniques apply at different scales. However, as we have seen in Sec. 1.2.4, his answer was to produce better-structured code, not less of it.
He isn’t the only instance of a Turing awardee defending verbosity in his Turing lecture. Recalling what he considered a youthful mistake, Hoare said [204]:
[For the successor to ALGOL60] my suggestion was to […] relax the ALGOL60 rule of compulsory declaration of variable names […] It was pointed out that the redundancy of ALGOL60 was the best protection against programming and coding errors […] The story of the Mariner space rocket to Venus, lost because of the lack of compulsory declarations in FORTRAN, wasn’t to be published until later.
Hoare seems to embrace the widespread belief that a missed hyphen in Mariner I’s control program caused its premature loss. An inquiry later established that the error was equally minute but in its specification, not in the program itself [237]. Nonetheless, the false belief stuck, as good stories have a way of doing. He continues:
I was eventually persuaded of the need to design programming notations so as to maximize the number of errors which cannot be made, or if made, can be reliably detected at compile time. Perhaps this would make the text of programs longer. Never mind! Wouldn’t you be delighted if your Fairy Godmother offered to wave her wand over your program to remove all its errors and only made the condition that you should write out and key in your whole program three times!
Unfortunately, variable declarations, let alone retyping entire programs, are a primitive way to spell-check variable names. I haven’t seen any evidence that languages requiring declarations have a lower bug density, but there is plenty of evidence that they have more lines. Modern IDEs will catch misspelled and uninitialized variables anyway; the declarative information is “vital” mostly because some compilers use it to allocate the correct amount of memory.
The warnings about the pitfalls of large programs have been around for a long time. The Unix Philosophy, as summarized by open source advocate Eric Raymond, includes this rule about large programs [238]:
Rule of Parsimony: Write a big program only when it is clear by demonstration that nothing else will do.
That hasn’t prevented modern versions of Unix from violating this principle. For instance, Systemd, a Linux program, is “an init system used to bootstrap user space and manage user processes,” which “also provides replacements for various daemons and utilities, including device management, login management, network connection management, and event logging” [239]. It exceeds 1.3 MLOCs and controversially centralizes several responsibilities previously dealt with by separate, smaller programs. Several notable Linux developers have criticized it, but major Linux distributions install it by default [240].
This trend is by no means limited to the Unix world, but in Unix’s case it is in stark contrast with the system’s original philosophy. It affects the software industry as a whole, despite its negative consequences [12]. We have seen in Sec. 1.1 that the size of a program predicts, to some degree, the number of errors to be found in it and the probability of its failure as an engineering project. Yet there are further ramifications, such as security-related ones.
For instance, consider signing onto web sites using your Facebook credentials (a feature known as single sign-on). The problem isn’t just creating a single point of failure for all of your authentication needs; it is trusting the security of Facebook authentication in the first place. Jason Polakis, an assistant professor of computer science at the University of Illinois at Chicago and an expert on single sign-on, puts it as follows [241]:
The codebase of these services is massive. You have different teams working on different components, and they can interplay in different ways, and you can have a crazy hack that no one expects.
He could have mentioned Facebook’s “move fast and break things” motto, which doesn’t bode well for security, or other reputation-busting activities that made the news. Instead, he focused on the enormous attack surface represented by its sprawling code base, a problem shared with all the other tech behemoths. Polakis was being interviewed on the occasion of a security breach at the social network, affecting 50 million users.
A similar, code-size-based explanation was proposed by Zulfikar Ramzan, CTO at a cloud security startup, for a well-publicized bug known as Heartbleed [242]:
[…] when I looked at the flaw myself I said, “Obviously, this is a pretty simple error.” This comes down to the issue that there’s so much code out there right now, and there’s so much code people are writing. There was a particular protocol called Heartbeat that did not get as much scrutiny.
To be precise, Heartbeat is an extension of TLS and DTLS, two protocols underlying secure communication on the web [243]. We can imagine a world without this feature but also free of the criminal activity enabled by its implementation errors. Indeed, before a patch became available, a recommended workaround was to turn it off. Whoever decided to add it and whoever, most likely inadvertently, introduced this bug will not foot the bill for said activity.
A company that may regret writing so much code is Boeing. It is building Starliner, a space capsule for crewed missions. After a bug was identified and corrected in its control software while in orbit, Boeing decided to review the whole code base for the project, as several additional bugs pointed to a development process failure. Unfortunately for them, there were already a million LOCs to review [244]. It would seem wise to check a company’s development process before reaching the one MLOC mark, let alone having a spaceship in orbit, but that didn’t happen in this instance. I can’t help but imagine a million LOCs written at the speed that some ascribe to this kind of project, namely, one LOC per programmer per day (see Sec. 1.2.1).
Speaking of space exploration, I ran into this code comment in Kubernetes, the container97 management software [245]:
PLEASE DO NOT ATTEMPT TO SIMPLIFY THIS CODE. KEEP THE SPACE SHUTTLE FLYING. This controller is intentionally written in a very verbose style.
The comment further dictates that this section of the program may only be added to, never simplified, precisely because it is written in this style. Reportedly, this style is used at NASA in some contexts to prevent bugs. Its rationale is that if the code lists all possible conditions in a cascade of ifs, then no condition will be left unaccounted for, even if some branches do nothing. However, I don’t know if this approach is widely adopted at NASA or if there is any evidence that it contributes to program correctness.
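For concreteness, here is a sketch of what the style looks like — my own invented example in Python, not the actual Kubernetes controller, which is written in Go:

# A sketch of the "space shuttle" style; names and conditions are invented
# for illustration only. Every combination of conditions gets its own branch,
# even the ones that require no action, so no case is left unaccounted for.
def reconcile(volume_bound: bool, claim_bound: bool) -> str:
    if volume_bound and claim_bound:
        return "nothing to do"          # branch intentionally does nothing
    elif volume_bound and not claim_bound:
        return "bind claim to volume"
    elif not volume_bound and claim_bound:
        return "bind volume to claim"
    elif not volume_bound and not claim_bound:
        return "wait for both"
    else:
        # Unreachable with booleans, but stated explicitly in this style.
        raise AssertionError("unhandled combination of conditions")

print(reconcile(True, False))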
Can large code bases be responsible for crimes, or at least provide a mitigating circumstance for the people accused of those crimes? Eddie Tipton, the confessed mastermind of the largest lottery fraud in US history, apparently tried to suggest this idea. Speaking in court about the system running the lottery, he said [246]:
[The lottery code] just kept growing and growing, […] It became spaghetti codes [sic], unfortunately.
The jury wasn’t moved and sentenced him to 10 years, eventually adding up to 25 in combination with other sentences. Rigging spaghetti code, albeit unpleasant, is still rigging.
Website operators and developers should be particularly sensitive to the size issue because web pages, with their associated media and code, have to travel over the wire to impatient users, who may move on rather than wait. Unfortunately, this is not the case. While many techniques have been adopted to accelerate page load, restraint in designing pages isn’t one of them. In an entertaining presentation, Maciej Cegłowski, a web entrepreneur and activist, details what he calls “the website obesity crisis” [247]. He compares the size of pages of popular websites to that of major works of literature. His prescription is drastic:
I want to share with you my simple two-step secret to improving the performance of any website.
Make sure that the most important elements of the page download and render first.
Stop there.
You don’t need all that other crap. Have courage in your minimalism.
He calls for less of everything other than human-readable text: fewer images and videos, less Javascript and tracking tags. His blog’s home page is over 56% text; by comparison, the corresponding figure for cnn.com is 0.21% [248].
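If you want a rough estimate of such a figure for a page of your own, here is a minimal sketch. It only looks at the HTML document itself, not the images, scripts, and other assets that dominate real page weight, so it overstates the text share; the URL is a placeholder, and the third-party packages requests and beautifulsoup4 are assumed to be installed.

# Rough proxy for "how much of this page is human-readable text":
# bytes of extracted text divided by bytes of the HTML document.
import requests
from bs4 import BeautifulSoup

def text_ratio(url: str) -> float:
    response = requests.get(url, timeout=10)
    html_bytes = len(response.content)
    text = BeautifulSoup(response.text, "html.parser").get_text()
    return len(text.encode("utf-8")) / html_bytes

print(f"{text_ratio('https://example.com/'):.0%}")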
Typically, more planning and thinking go into designing a language than a website. Language designers should know that keeping their creations small makes them more accessible. On the other hand, faced with a long list of requirements, they may try to meet them at the cost of more complexity. That might be what happened to the ADA language, designed in response to a call for proposals by the Department of Defense. None of the existing languages had been found to meet the complex requirements for a general-purpose programming language for defense-related projects. The result wasn’t encouraging [249]:
[The developers of] early Ada [sic] compilers struggled to implement the large, complex language, and both compile-time and run-time performance tended to be slow and tools primitive.
Hoare took time during his Turing Award Lecture to criticize ADA, as well as PL/1 and Algol 68, as too complex and therefore unreliable (apparently, being the awardee didn’t put him in a charitable mood) [204]. His experience with complex languages goes back to his work on designing a successor for Algol 60, during which he had to watch in disbelief as a simple, incremental proposal was sidelined in favor of a much more elaborate and, in his opinion, unfeasible design. Of this, he wrote:
[…] there are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.
It became, after many delays, the quickly abandoned Algol 68.
About ADA, which was designed to become the standard programming language for the Department of Defense, he warned:
Do not allow this language in its present state to be used in applications where reliability is critical, i.e., nuclear power stations, cruise missiles, early warning systems, anti-ballistic missile defense systems. The next rocket to go astray as a result of a programming language error may not be an exploratory space rocket on a harmless trip to Venus [see Mariner I above]: It may be a nuclear warhead exploding over one of our own cities. An unreliable programming language generating unreliable programs constitutes a far greater risk to our environment and to our society than unsafe cars, toxic pesticides, or accidents at nuclear power stations.
It is at least debatable whether this prediction has proven accurate. I can’t think of a software error that caused devastation even remotely comparable to that of the Chernobyl or Fukushima disasters, even if the MCAS-related accidents (see Sec. 1.6), for instance, resulted in more short-term casualties. However, by most accounts, they were not due to the programming language used.
ADA continues to be used in defense and other security-sensitive applications, like flight control for aviation and space. Yet, its adoption rate pales compared to another problematic language, SQL, which is “by far the most used database language” [250]. The same reference contains a well-reasoned criticism of SQL but, for our goals, the following paragraph stands out:
SQL is not a small language. At the time of writing the PostgreSQL implementation contains 469 keywords. Just part 2 (out of 14) of the SQL:2016 standard has 1732 pages.
By comparison, JavaScript has 64 reserved words in its ECMAScript 5/6 standard [251]. Other issues with SQL make queries long and hard to read and inhibit query reuse (see Sec. 1.2.9.9 and Sec. 1.2.10.1). The inventor of the relational model, Edgar F. Codd, devotes a whole chapter in one of his books to SQL’s “serious flaws” [252].
If the DRY principle (see Sec. 1.2.3) guides Python and Ruby programmers, Java should be considered a WET language — for Write Explicitly Tenfold. We have already encountered its proverbial verbosity in comparisons to Clojure (see Sec. 1.2.9.1) and Scala (see Sec. 1.2.10.1). However, the most extreme example of irreducible verbosity must be the one in java.util.Arrays, a module of the Java standard library. One of its authors is the same Joshua Bloch who preaches the gospel of minimalistic APIs (see Sec. 1.2.8). In an older version, we could find the following fatalist comment [253]:
The code for each of the seven primitive types is largely identical. C’est la vie.98
Let us consider that this is as idiomatic Java as it gets, co-authored by three principal contributors to the development of the language. Apparently, Java has no abstraction mechanism to avoid this 7-fold duplication without a performance penalty, which isn’t acceptable in a foundational array library. More recently, the comment disappeared, but the repetitive code remains.
Verbosity isn’t just an inconvenience or a maintenance burden: it can kill entire subfields of computing. At least, this is what Jonathan Dursi puts forward in his post “HPC is dying, and MPI is killing it” [254]. He accuses MPI, a dominant library for High-Performance Computing, of forcing practitioners to a lower level of abstraction than most need. Moreover, compared to modern alternatives like Spark and Chapel, he shows that MPI requires over twice as much code while leaving fault tolerance to the developer.
Anecdotally, most engineers have had to deal with awful code in their careers, particularly of the verbose variety. Hacker News user conceptjunkie comes to a sobering conclusion [255]:
I don’t know about you, but I’ve rarely worked on a codebase that wasn’t awful in some way, and I am definitely not early in my career. I’ve come to the conclusion that most programmers are simply awful at their jobs, and the developers that can write clear, concise code are a small minority. I’ve known a few, but not many.
I hold a slightly less pessimistic view. It may be that too many people have rapidly entered this field for most of them to have a talent for or a deep interest in computing. However, terrible developers don’t need to be a majority to produce a much larger number of LOCs than their concise peers. Good, concise code is hidden among the KLOCs and MLOCs of verbose horrors.
Moreover, bad code begets bad code. For example, I once needed a particular time-related function, and I thought someone at a client of mine had likely written it before. I found related functions everywhere, from database-related modules to the UI. The search had already taken me longer than writing my own implementation would have. Should I have started a refactor to bring all these functions into one module while eliminating duplications, or just followed the most direct route to meet my deadline? In most companies, an engineer is not free to make such a decision. You can’t go to a team meeting and say that you won’t meet a deadline because you started a refactor of your own accord. You would be asked immediately: is this refactor a blocker? That is, is its completion a necessary condition to perform the original task? Of course it isn’t, because you can pile up a bit more technical debt,99 let the rot spread a little and meet your deadline. Indeed, you are required to.
This experience brings up a final point: the dominant incentive structure isn’t aligned with quality. Development these days is fragmented into a sequence of self-inflicted emergencies. Invented deadlines and the jargon used by management (“backlog”, “sprint”) are meant to create pressure. It is like trying to win a marathon by going as fast as possible for the first 100 yards, then again and again. It is a technical debt factory. We know the aphorism: “quality, time or resources, pick two.” Since individual contributors don’t have access to additional engineering resources, and time is set by management or by the fastest, least scrupulous coworker, quality has to yield. It becomes a race to the bottom, doing the least for a new feature to pass equally rushed tests, product, and code reviews. Week after week, a developer will be called to work on someone else’s awful code, and someone else will mess up whatever good there is in this developer’s. It is irrational to expect good work to be performed in this environment. It is inevitable that conciseness, which is seldom even recognized as a quality factor, is one of the victims.
1.6 Code Considered Harmful
Some engineers react to the realities of horrible code by trying to write good code, including eliminating verbosity. Others take a more pessimistic position, denying the possibility of writing concise code or using features appropriately. Their approach is more radical: elimination.
Developers likely recognized, in this section’s title, a play on “Go To Statement Considered Harmful,” an essay in which Dijkstra criticized the goto statement, which causes a program to continue at a line identified by a number or a label [256]. Dijkstra endeavored to rewrite programs without the goto and reported that “In all cases tried […] the program without goto statements turned out to be shorter and more lucid” [257]. It is the prototype of all attacks on language features and the only one to succeed to a great extent. Indeed, many modern languages don’t support this construct. This omission can be worked around at the cost of a modest increase in code length or by using alternative instructions such as break or return (see the sketch below). That goto isn’t needed is also supported by the Böhm-Jacopini theorem, which shows that any program can be rewritten without goto ([258]; see also [259]). More than the theorem itself, it is its proof that is noteworthy: it builds a goto-free version of a program via incremental modifications, rather than just claiming Turing equivalence.100 In other words, ridding programs of gotos doesn’t require rewriting them from the ground up.
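Here is a minimal sketch of that workaround, on an invented example: where a C programmer might write “goto found;” to escape nested loops, a goto-less language can express the same flow with an early return (or a break plus a flag), at the cost of a few extra lines.

# Early return plays the role of "goto found" when escaping nested loops.
def find_pair(matrix, target):
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value == target:
                return i, j  # jump straight out of both loops
    return None

print(find_pair([[1, 2], [3, 4]], 3))  # (1, 0)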
The main argument against the goto is that it made programs hard to understand by allowing a combinatorial explosion of the possible execution paths through the code. Well-known voices expressing qualified support for the goto are those of Donald Knuth [260], author of The Art of Computer Programming, and Torvalds [261]. Both find that it has applications where it seems the most appropriate construct. I have seen excellent programs written with goto as the only available control statement and terrible ones written without a single goto. Indeed, it is a feature prone to abuse, but many other dangerous features are defended based on their power and under the optimistic assumption that only competent people will use them. I don’t miss the goto, but I feel like a double standard was applied to it.
On the evidence that suppressing the goto didn’t solve the ills of software development — but it may have mitigated them — the scapegoating of language features could have fallen out of favor, but it hasn’t. There is a budding effort to discourage the use of the if statement. Francesco Cirillo, the consultant who invented the Pomodoro Technique, has penned an “Anti-If Manifesto” [262]:
Have you ever wondered how IFs impact on [sic] your code? Avoid dangerous IFs and use Objects to build a code that is flexible, changeable and easily testable, and will help avoid a lot of headaches and weekends spent debugging!
I understand this is an effort to replace ifs with method dispatch,101 substituting the implicit checking of types for the explicit checking of conditions. Equivalently, it moves the responsibility for type checking from the caller to the callee, an instance of what Bloch recommends for APIs (don’t let the user do what the API can take care of, see Sec. 1.2.8). I am less clear why this is framed as an attack on the if statement rather than a pitch for object-oriented programming and, specifically, polymorphism,102 or whether Cirillo writes or foresees programs free of ifs.
While if fights for its life, the detractors of class inheritance,103 which is closely related to polymorphism, are building up their ranks [263]. As language evolution continues to be based on a popularity contest, as opposed to hard evidence, we miss opportunities to make actual progress.
If sending individual features into oblivion isn’t enough, why not reduce code to fragments? This is more or less what Robert C. Martin proposes in his best-seller Clean Code [264]:
The first rule of functions is that they should be small. The second rule of functions is that they should be smaller than that.
There is only one way to fulfill this requirement: limiting every function to one line — I considered the zero lines solution but ran into mathematical difficulties. It likely isn’t meant to be taken too literally because, elsewhere in the book, the ideal function length is set at two to four lines. This precept is absolute and independent of other considerations, including:
- whether these functions will be called more than once;
- whether four lines accomplish anything that can be described or tested;
- whether high-quality libraries follow this guideline;
- conciseness or time limits, as function definitions take their own space and effort.
You may be writing this madness off as a necessary hyperbole to make a book sell, but that would be premature. I once reviewed a 60-KLOC code base that approached this ideal closely. Function names were summaries of their own minuscule bodies, and formal and actual parameters had the same names. I expect this to be hard to imagine even for battle-hardened programmers, so let me clarify with an example (not from the actual code). A statement like table2 = join("table0", "table1") would be transformed into:
def join_table0_table1(table0, table1):
    return join(table0, table1)

table0 = "table0"
table1 = "table1"
table2 = join_table0_table1(table0, table1)
As a result, no function was invoked twice, ever. If every sequence of more than two LOCs needs to be broken down into separate functions, the reusability of these functions can’t be a priority anymore. To add insult to injury, every constant was assigned to a variable named after its value because of another well-known software engineering guideline according to which hardwired constants are to be avoided. This code was being maintained at a Fortune 500 company, not a fly-by-night operation.
Some put the blame for code verbosity squarely on developers. For example, Tyler Treat, a developer himself by training, writes [265]:
[…] you are not paid to write code. You have never been paid to write code. Indeed, code is a nasty byproduct of being a software engineer.
These statements must sound outlandish to whoever makes a living writing code and are possibly explained by Treat having reached the level of “managing partner,” several levels above the coding trenches. The decisions about whether to write code or not aren’t made by developers in most contexts. A developer not producing code is considered an idle one. For instance, news outlets reported that Twitter engineers, facing a review in preparation for massive layoffs, had to provide management with 50 pages of code produced during the last 30 days, or roughly 90 LOCs per day — a “nasty byproduct” of trying to keep their jobs (and 18 times faster than Kamp wrote Varnish, see Sec. 1.2.1)[266].
A piece of code that turned out to be harmful, even deadly, was the infamous MCAS, a patch in the control software of the new Boeing 737 Max necessitated by changes in engine size and position [267]:
After weighing many possibilities […] Boeing decided to add a new program — what engineers described as essentially some lines of code — to the aircraft’s existing flight control system to counter the destabilizing pitching forces from the new engines.
That program was M. C. A. S.
The new code, together with hardware flaws, resulted in the total loss of two airplanes, with all 346 passengers and crew. It may look like a concise addition to the existing control software — just a few LOCs. However, an ill-advised addition it was: a plane initially designed with intrinsic stability and ultimate pilot control over any electronic aid was modified into one with a computer override in a crucially unstable condition ([268], [269]). As an additional risk factor, the changes had to be downplayed to convince clients and regulatory authorities that the new airplane was essentially the same as the previous 737, sidestepping expensive retraining and certification requirements. Aircraft developed more recently, both by Boeing and its main competitor, Airbus, are designed for computer control. As a result, they didn’t incur any of the software problems of the 737 MAX.
Going back to Brooks and the early days of computing, he discusses the delicate balance between adding features and adding complexity, both for the user and the developer. A system that he is particularly fond of is the IBM 650 ([5]; see Fig. 1):
I am haunted by the memory of the ease of use of the IBM 650, even without an assembler or any other software at all.
You got it right: his favorite system was software-less, bare hardware. Later, it received an assembly language and other software, possibly spoiling the user experience.
Knuth got his computing career started working on the same model. He, too, has fond memories of it [270]:
[…] There was something special about the IBM 650, something that has provided much of the inspiration for my life’s work. Somehow this machine was powerful in spite of its severe limitations. Somehow it was friendly in spite of its primitive man-machine interface. […] The 650 had only 44 operation codes, while the 709 had more than 200 [see Sec. 1.4]; yet I never enjoyed coding for the 709, because I never seemed to be able to write short and elegant programs for it — the opcodes didn’t blend together especially well. By contrast, it was somehow quite easy and pleasant to do complex things on the 650 with very few instructions.
1.7 Methodologies
In a time of proliferating and ever more irrational software methodologies, it is likely for the better that there aren’t ascetic programming workshops, books, and expensive ascetic consultants to promise unrealistic results and to dodge responsibility when they aren’t achieved. However, on a small scale, there are initiatives related to this book’s ideas.
Atwood points us to Spartan Programming, which he finds particularly close to his programming style [271]. This methodology consists of a long list of concrete programming guidelines. These can be described as the simultaneous minimization of various code-related quantities, from the number of lines to the number and scope of variables. Minimizing all those measures may result in trade-offs between them, but the guidelines don’t explain how to find a compromise. Unfortunately, the original Spartan programming description is now a dead link, but you can automatically apply Spartan programming to Java code using Spartanizer, an online service [272].
The take of suckless.org is much more practical [273]. They develop concrete software “with a focus on simplicity, clarity and frugality.” Their philosophy can be summarized as “keeping things simple, minimal and usable.” They host the development of several projects, including Dwm (see Sec. 1.2.11), a window manager with a 2-KLOC upper bound on code length, and st (see Sec. 1.3.4), a simplified terminal emulator. They claim that “the tendency for complex, error-prone and slow software seems to be prevalent in the present-day software industry.” Surprisingly, their intended audience is “advanced and experienced computer users,” whereas it seems reasonable that their emphasis on simplicity would be most appealing to beginner users. Their notion of simplicity seems to focus on objective measures such as the number of features or LOCs:
The more code lines you have removed, the more progress you have made. As the number of lines of code in your software shrinks, the more skilled you have become and the less your software sucks.
Also related is the slow code movement, obviously inspired by the slow food movement. It has a great slogan: “Slower. Less code. But better” [274]. Its website presents an anecdote about solving a seemingly complex problem with two LOCs after an appropriately extended period of analysis, but that is about all there is. As befits a slow movement, we are still waiting for it to blossom.
2 Science and Engineering
2.1 Epistemology, Statistics and Machine Learning
2.1.1 The Parsimony Principle
Ockham’s razor is a well-known parsimony principle in philosophy [276]:
Pluralitas non est ponenda sine necessitate.
(Plurality should not be posited without necessity). William of Ockham, a medieval friar, theologian, and philosopher, used it in several of his manuscripts for different purposes, the most important of which was to conclude that “God’s existence cannot be deduced by reason alone” [277]. However, this principle, despite its name, did not originate with him. It was well-known in Scholastic philosophy and can be traced back all the way to Aristotle. He wrote: “We may assume the superiority, other things being equal, of the demonstration which derives from fewer postulates or hypotheses” [278]. In more modern times, we find it restated in several variants, for instance, by Isaac Newton: “We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances” [277].
This bias for simplicity continues in modern statistics and data analysis. For example, John Tukey, a statistician who invented the Fast Fourier Transform and the boxplot, wrote [279]:
The importance of parsimony in data analysis can hardly be overstated. By parsimony we mean both the use of few numerical constants and also the avoidance of undue complexity of form in summarizing and displaying. The need for parsimony is both esthetic and practical.
It isn’t just an experience-based statement. For instance, well-known model selection methods, such as the Akaike Information Criterion and the Bayesian Information Criterion, call for the minimization of a quantity that includes a term proportional to the number of parameters in a model. In particular, when the fit to the data is equally good, both methods will choose the model with fewer parameters.
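For reference, both criteria have simple closed forms; these are the standard textbook definitions, not something specific to the sources cited here, and lower values are better. A minimal sketch:

# AIC = 2k - 2*ln(L), BIC = k*ln(n) - 2*ln(L), where k is the number of
# parameters, n the number of observations, and L the maximized likelihood.
# Both penalize extra parameters; lower is better.
from math import log

def aic(log_likelihood: float, k: int) -> float:
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood: float, k: int, n: int) -> float:
    return k * log(n) - 2 * log_likelihood

# Two hypothetical models with an equally good fit: the one with fewer
# parameters wins under both criteria.
print(aic(-100.0, k=2), aic(-100.0, k=5))
print(bic(-100.0, k=2, n=50), bic(-100.0, k=5, n=50))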
Not everyone agrees. For instance, statistician and social scientist Andrew Gelman writes [280]:
A lot has been written in statistics about “parsimony” — that is, the desire to explain phenomena using fewer parameters — but I’ve never seen any good general justification for parsimony. (I don’t count “Occam’s Razor,” or “Ockham’s Razor,” or whatever, as a justification. You gotta do better than digging up a 700-year-old quote.)
Maybe it’s because I work in social science, but my feeling is: if you can approximate reality with just a few parameters, fine. If you can use more parameters to fold in more information, that’s even better.
In practice, I often use simple models — because they are less effort to fit and, especially, to understand. But I don’t kid myself that they’re better than more complicated efforts!
Gelman sees parsimony as specific to physics, not as a general statistical or epistemological principle [281].
These two points of view aren’t compatible. I initially thought this could be explained by the existence of two major schools of thought in statistics, frequentist and Bayesian, with Gelman an exponent of the latter [282]. Broadly speaking, in frequentist statistics, the more parameters a model has, the better it fits the data. The statistician needs to actively counter this tendency to prevent overfitting, which is when a model fits not only the signal but also the noise, undermining its use, for instance, in prediction or parameter estimation. Intuitively, such a model “memorizes” past data, as opposed to capturing its regularities. In Bayesian statistics, all parameters are assumed to be drawn from a probability distribution, the prior, either because they are stochastic or to represent our preexisting state of knowledge or ignorance about them. This fact, in and of itself, is an antidote to introducing too many parameters. Let’s see how this works with an example of Bayesian model comparison [283]. M1 models a fair coin with equal probability for heads or tails and no parameters. M2 models a biased coin, with an unknown amount and direction of its bias, represented by a parameter also assumed to be a random variable. The probability of M2, conditional on the data, is obtained by integrating over the possible values of its parameter, weighted with their prior probability. With a peanut butter analogy, M2 is “spread thinner” than M1 because all probability distributions are normalized to one. With a betting one, M1 makes a sharp bet on a single parameter value, 1/2, whereas M2 keeps its options open. In frequentist statistics, the bias parameter would be considered an unknown but not a random variable. It could be set to 1/2 if the data supported that, and there would be no way M1 could fit the data better than M2. The latter “contains” M1 as a special case. Frequentists aren’t fools and have developed ways to fix that. However, for Bayesians, the solution is built into their approach. Wikipedia concludes:
M2 is a more complex model than M1 because it has a free parameter which allows it to model the data more closely. The ability of Bayes factors to take this into account is a reason why Bayesian inference has been put forward as a theoretical justification for and generalisation of Occam’s razor […].
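To make the comparison concrete, here is a minimal sketch of the coin example, assuming a uniform prior on the bias under M2 and looking at a specific sequence of ten tosses; with a balanced outcome the simpler M1 wins, with a lopsided one M2 does.

# M1: fair coin, no parameters. M2: biased coin with a uniform prior on the
# bias. For a specific sequence of n tosses with k heads, the marginal
# likelihoods have closed forms.
from math import factorial

def likelihood_m1(n: int) -> float:
    # Fair coin: every specific sequence has probability (1/2)^n.
    return 0.5 ** n

def likelihood_m2(k: int, n: int) -> float:
    # Uniform prior on the bias p: integral of p^k (1-p)^(n-k) dp over [0, 1],
    # i.e. the Beta function B(k+1, n-k+1) = k! (n-k)! / (n+1)!.
    return factorial(k) * factorial(n - k) / factorial(n + 1)

for k in (5, 9):  # a balanced and a lopsided outcome out of 10 tosses
    bayes_factor = likelihood_m1(10) / likelihood_m2(k, 10)
    print(f"{k} heads out of 10: Bayes factor M1/M2 = {bayes_factor:.2f}")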
Therefore, while the parsimony principle may be only implicit in Bayesian statistics, it is compatible with it. Jefferys and Berger agree [284]:
Ockham’s razor, far from being merely an ad hoc principle, can under many practical situations in science be justified as a consequence of Bayesian inference.
In the coin example, though, the prior probabilities of models M1 and M2 were arbitrarily set to the same value. Additionally, we glossed over the choice of prior for the bias parameter in M2. These choices are mainly left to the statistician or researcher. This has often been cited as making Bayesian statistics inadequate for science, as conclusions are too dependent on the choice of prior — that is why its dominant variant is called subjective Bayesianism [285]. Bayesians reply with several counter-arguments, from the theoretical difficulties affecting frequentism to the existence of uninformative priors, which, at least in simple cases, lead to a convergence of Bayesian and frequentist results.
2.1.2 A Universal Prior
The question of the prior could be solved if we could find something like a universal prior. It would function as a proto-belief, the prior that predates all data collection and thus solves the Bayesian conundrum. Such a concept exists in a subfield called algorithmic probability [286]. As the name suggests, this is a research domain at the intersection of computer science and statistics. The foundational step is to consider statistical models as programs, not just because it is practical to run them on computers rather than with pencil and paper, but also because of advantages at a conceptual level. Traditionally, statistical models take the form of mathematical expressions, and it seems impossible even to contemplate the set of all possible models. Philosopher Elliot Sober states [276]:
Unfortunately, the idea that we can now enumerate the set of all possible future scientific theories is illusory.
Suppose we instead confine ourselves to computable models, that is, those that can be expressed as terminating programs. In that case, even if there is an infinite number of them, each program is a finite object; therefore, they can be enumerated, at least in theory — their number grows exponentially with size. Moreover, bounding the size of a program-as-model provides a simultaneous limit on its parameters’ number and magnitude (or precision). This limits its complexity in a way agnostic to its nature (say linear vs. non-linear). An objection to the generality of this approach may be that even a simple linear model may not be computable. It may be surprising, but most real numbers are uncomputable; that is, no computer can produce their digits to an arbitrary precision [287]. If a parameter in such a model were uncomputable, the model itself would be too. However, since all measurements result in rational numbers, it is debatable whether we will ever have to deal with such an uncomputability barrier, or if it will forever remain as a potential problem. Indeed, there are quantum-mechanical and information-theoretic arguments against the possibility of infinite-precision measurements [288].
It is possible to define a probability distribution over the set of computable models that we can use as a prior before any data becomes available. For a program of size ℓ,105 its a priori probability is proportional to 2^(-ℓ). In other words, a model that is one unit larger than another is considered half as likely. Just the fact that we are able to proceed with this definition is progress, but we shouldn’t assume that reality bends to fit mathematics. Nonetheless, for the sake of this book’s thesis, all we need to observe is that all alternative priors share the property that those represented by large programs are less likely than small ones in an aggregate sense. This result can be stated more precisely: for every ε > 0 there is a size L such that all models larger than L, in aggregate, have probability less than ε. Therefore, no matter how we tinker with the prior, there is no way to rigorously state “prefer large models over small ones.” Let’s not overstate what this theorem means: it doesn’t lead to a unique choice of prior.106 Ray Solomonoff, one of the inventors of algorithmic probability, concedes as much in a relatively recent paper where he says he has come to think of this as a strength rather than a liability [289]. However, no prior can systematically prefer larger models over smaller ones. There is no way to escape parsimony, at least in this aggregate sense.
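One standard way to make the aggregate statement precise — assuming, as algorithmic probability does, that the programs form a prefix-free set — is via the Kraft inequality: the total prior mass is bounded, so the mass of programs longer than any threshold L must vanish as L grows.

\[
P(p) \propto 2^{-\ell(p)}, \qquad \sum_{p} 2^{-\ell(p)} \le 1, \qquad \text{hence} \quad \sum_{\ell(p) > L} 2^{-\ell(p)} \to 0 \ \text{as}\ L \to \infty .
\]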
The next step is assuming that the process to be modeled is itself computational; that is, a program can simulate it. We could equivalently state that theories of the natural world can be expressed in a computable way without loss. This is a bit more of a leap of faith and I refer you to the work of Jack Copeland, Mark Sprevak and Oron Shagrir, where the question of whether the Universe is computational is addressed [290]. They examine several alternative answers: the Universe could be a computing device, albeit an enormous one; or, while non-computational, it could be effectively approximated by such a device; or, at least in some reaches we haven’t explored yet, it could be impervious to computational modeling. Either one of the first two answers allows us to proceed with our argument.
Now that we have a prior on programs which, hopefully, can effectively model Nature, deep mathematical results are possible, including the derivation of a prediction algorithm with an error bound depending only on the Kolmogorov complexity of the process to be modeled. Let’s recall that if such a process produces a certain sequence of observations x, its Kolmogorov complexity is defined as the length of the shortest program107 able to reproduce x (see Sec. 1.2.11.1). Imagine a random x generated by fair, independent coin tosses: the only way to reproduce x is storing it toss by toss. Hence its length and its Kolmogorov complexity are similar in magnitude. At the opposite end, imagine x is “0” repeated n times. It is easy to build a short program repeating 0 n times, at least if by “short” we mean on the order of log n.108 The Kolmogorov complexity of these observations is much smaller than in the previous case. It is intuitive that a prediction method would perform better on the repetitive sequence as opposed to the random one, and the above error bound formalizes that. The universal prior has other interesting properties that we can’t cover here, and its presentation has been quite informal; therefore, I encourage you to check the sources. Unfortunately, it isn’t appropriate for practical applications, or at least not yet. There are related approaches that are more so and also rely on short descriptions of data and models, as we will see in the next section.
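Off-the-shelf compression gives a rough, computable feel for this contrast. A minimal sketch, with zlib standing in (very imperfectly) for the ideal, uncomputable measure:

# A repetitive sequence compresses to almost nothing; a random one hardly
# compresses at all.
import random
import zlib

n = 100_000
repetitive = b"0" * n
random.seed(0)
noisy = bytes(random.getrandbits(8) for _ in range(n))

print(len(zlib.compress(repetitive)))  # a few hundred bytes
print(len(zlib.compress(noisy)))       # close to the original 100,000 bytes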
2.1.3 Minimum Message Length
The relationship between the size of statistical models and their performance is at the center of a model selection method known as Minimum Message Length [291]. Imagine we need to communicate a dataset over a transmission line and decide to send it in two parts: an encoding for a statistical model and then a separate encoding for the data, conditional on the choice of model, for instance, encoding only the error in the model prediction. Plug into this the Shannon-Fano code, which assigns a code length of -log2 P(x) to a code word x, and it turns out that minimizing the total code length is equivalent to maximizing the posterior probability, proportional to P(M) P(D|M), where the first factor is the probability of the model and the second is the probability of the data given the model. Up to this point, this may look like a Bayesian take on model selection.109 However, the formulation in terms of encodings means that the models have to be expressed finitely, for instance, with finite precision numbers instead of real numbers, creating a bridge with algorithmic probability (see the previous section). This relation is explored in depth by Wallace and Dowe [291] and Vitányi and Li [292]. Here, though, the class of models isn’t necessarily the class of all computable models but can be defined more flexibly in a problem-dependent fashion. Thus, we could restrict the space of models to finite precision linear models and apply Minimum Message Length to choose among competing ones, giving up on any universality claims but, at the same time, offering a more practical method. Vitányi and Li go so far as to state:
It is widely believed that the better a theory compresses the data concerning some phenomenon under investigation, the better we have learned, generalized, and the better the theory predicts unknown data. This belief is vindicated in practice and is a form of “Occam’s razor” paradigm about “simplicity” but apparently has not been rigorously proved in a general setting. Here we show that data compression is almost always the best strategy, both in hypotheses identification by using an ideal form of the minimum description length (MDL) principle and in prediction of sequences.
The Minimum Description Length principle is an independently developed relative of Minimum Message Length which tries to free itself from the dependence on the availability of a prior [293]. The “ideal form” mentioned above is a bridge to the algorithmic probability of the previous section. These different lines of research point to the discretization of statistical models and the adoption of size, of either concise programs or concise codes, as a means to control complexity and, as a consequence, ensure good statistical performance.
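As a toy illustration of “better compression means a better model,” here is a crude two-part-code sketch for the coin data from earlier, with a fixed 8-bit parameter precision rather than the optimal precision a proper MML analysis would derive, and the maximum-likelihood bias standing in for the MML estimate:

# Total message length = bits to state the model and its parameter
# + bits to encode the data given the model. Shorter total = preferred model.
from math import log2

def fair_length(heads: int, tails: int) -> float:
    # No parameters; each toss costs exactly one bit under p = 1/2.
    return float(heads + tails)

def biased_length(heads: int, tails: int, precision_bits: int = 8) -> float:
    n = heads + tails
    p = max(min(heads / n, 1 - 1e-9), 1e-9)  # avoid log2(0)
    data_bits = -(heads * log2(p) + tails * log2(1 - p))
    return precision_bits + data_bits

for heads, tails in ((50, 50), (90, 10)):
    print(heads, tails, fair_length(heads, tails),
          round(biased_length(heads, tails), 1))
# Balanced data: the fair model gives the shorter message.
# Lopsided data: paying 8 bits for the bias parameter is worth it.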
2.1.4 Deep Learning
Neural networks, a biologically inspired class of statistical models, have gone through different phases of enthusiasm followed by disillusion as researchers explore their possibilities and limitations. In the 2000s, one of the growth phases started when it became possible to train — fit in more traditional statistical jargon — substantially larger networks than before, dubbed deep neural networks because of their structure organized in many layers [294]. These new systems broke performance records on several established AI benchmarks related to vision and language tasks. There was a consensus that scaling to larger networks and data sets was the most direct path to improved performance [295] and even general intelligence [296], a level at which man and machine become indistinguishable.
After a phase of rapid progress and exuberant optimism, we are now in one of diminishing returns. Autonomous driving is perennially a few years away from fruition. Google has been going at it for 12 years, yet its cars still struggle to make left turns in ideal conditions [297]. Moreover, machine learning systems are brittle to small changes in their statistical environment and to deliberately crafted inputs that “trick” them into giving wrong answers, so-called adversarial examples [294]. The best neural nets for natural language processing, known as language models, were trained on entire crawls of the internet, vast collections of digitized books, and more. It isn’t far-fetched to say that they ingested everything there is to read, yet their performance is uneven: superhuman on some tasks, disappointing on others [298].
One of the reasons may be that researchers focused too much on statistical performance at all costs, on achieving the coveted SOTA (State Of The Art, the best performance on a benchmark). Throwing more data and neurons at the problem was a reliable way to improve performance, but the rate of progress got overlooked. After observing a decade of progress, Sun and co-authors concluded that “performance increases logarithmically based on volume of training data” [299]. Otherwise stated, an additive improvement requires order-of-magnitude larger datasets and models and two orders of magnitude more computing power, as each data point needs to be processed by a proportionally larger network. The next step is multi-modal systems that process text, images, sound and video, and are trained to perform many disparate tasks [300]. Unfortunately, few well-funded labs can pursue this research, as enormous computational resources are needed.
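A back-of-the-envelope reading of that claim, under the stated assumptions that performance grows with the logarithm of the dataset size N and that model size grows roughly in proportion to the data:

\[
P \approx a + b \log_{10} N \;\Rightarrow\; \text{a fixed gain } \Delta P \text{ requires } N \to 10^{\Delta P / b}\, N, \qquad \text{compute} \propto N \times \text{model size} \propto N^{2},
\]

so each additional order of magnitude of data costs roughly two additional orders of magnitude of compute.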
Two major innovations that led to the current renaissance of neural nets are convolutional nets and deep nets, which, for the same size, show better performance than their non-convolutional and shallow counterparts. Paradoxically, ideas at the origin of the current size race are ways to squeeze more performance out of a system of a given size. Some researchers are starting to question the consensus and finding that smaller networks (still in the 10s of billions of parameters) trained on more data under a fixed computational budget perform better than some of the previous champions [301]. Others are looking again into data efficiency [302].
Deep neural networks seem as far as possible from the concept of parsimony. They include hundreds of billions of adjustable parameters, a large number even considering the giant data sets on which they are trained. Theory and experiments show they should overfit the data but, in practice, they don’t, and researchers are trying to figure out why [303]. In one case, they created abridged versions of large networks without significantly changing their performance [304]. This wasn’t done for practical reasons, such as deployment onto mobile devices [305], but to prove that the larger networks provided good predictions, like their smaller versions. We could say that the parsimonious models were hiding inside the verbose ones.
2.1.5 Extreme Parsimony
Ockham’s razor warns against complex explanations beyond what is necessary (see Sec. 2.1.1). Different thinkers and researchers have provided different definitions for “complex” and “necessary,” resulting in various principles and methods. However, an ancient philosopher, Parmenides, stands out for simply dropping the “necessary” part of the razor and reducing his theory of everything to a bare minimum of complexity. He embraced a tautology — “whatever is is, and what is not cannot be” — to the extent that he denied the existence of anything other than an undifferentiated being and considered all perception an illusion [306]:
For this [view], that That Which Is Not exists, can never predominate. You must debar your thought from this way of search, nor let ordinary experience in its variety force you along this way, (namely, that of allowing) the eye, sightless as it is, and the ear, full of sound, and the tongue, to rule; but (you must) judge by means of the Reason (Logos) the much-contested proof which is expounded by me.
Ancient and modern philosophers provided many interpretations of Parmenides’ thought, which we know only through fragments of his writings or indirect sources [307]. Among those, the most appealing to me is the most extreme, known as strict existential monism. According to it, only one entity exists, while every distinction, every difference, and becoming itself must all be an illusion. I associate this barebones ontology with other expressions of extreme minimalism, some of which appear in this book: the empty software module (see noop2 in Sec. 1.2.2), the paintings showing a single square (see Malevich’s squares in Sec. 3.2.3), and the silent piece of music (see Cage’s 4’33” in Sec. 3.3). As for those, it is possible to suspect Parmenides’ philosophy of being a prank for intellectuals. However, his current standing as the father of ontology 2500 years later suggests the opposite.
2.2 Mathematics
Being the first to devise a proof for some important conjecture or open problem provides the most recognition and notoriety, independent of its length. Take Andrew Wiles’s proof of Fermat’s last theorem [308]. This theorem has a deceptively simple formulation. We know there are right triangles with integer-length sides, for instance, 3, 4, and 5, and, by Pythagoras’ Theorem, 3^2 + 4^2 = 5^2. Since mathematicians like to try to generalize simple concepts, the problem was expanded as follows: are there positive integer solutions to x^n + y^n = z^n? Solutions for n = 2 were known since antiquity, but none for n > 2. Fermat stated he had proved that none existed. His proof being lost or maybe only imagined, his statement stood as a conjecture to the attention of the greatest mathematicians for centuries, but progress was halting: Fermat himself proved the n = 4 case, which implies further work need only focus on prime exponents.110 Then n = 3 fell, followed by entire classes of primes. However, the general proof was considered untouchable by most mathematicians. Wiles’s solution took years to put together while also building on voluminous work by other mathematicians, who had created bridges between distant fields in mathematics. When first announced, it took a three-day workshop to explain; when published, after some errors were found and corrected, a full issue of the Annals of Mathematics. It is no lightweight proof, but the problem is solved, and prizes and honors have been bestowed on Wiles. Nonetheless, other mathematicians have already started simplifying it. This type of work is also valued, albeit not as celebrated as the first proof of an original result. For instance, reporting on the award of the Fields Medals, the highest recognition in Mathematics, to Peter Scholze, Kenneth Chang writes [309]:
Dr. Scholze gained prominence when he was still in graduate school in 2010, simplifying a complicated book-length, 288-page proof to a novella-size 37-page version.
That more svelte proof did not earn Scholze his Fields Medal, though. Instead, it was his work at the intersection of number theory and geometry, where he made several original contributions that spurred new lines of research. Novel and fertile beats concise.
If long proofs are acceptable, is there a limit? When is a proof too long to be checked accurately? One case stretching the boundary is that of the four-color theorem. It is another deceptively simple statement about coloring maps. Specifically, we are concerned with political maps and the assignment of colors to each state or country. It is required that no two bordering states be assigned the same color. Corner contact such as that between Utah and New Mexico doesn’t count. The theorem states that four colors are enough for any map, real or imagined. Kenneth Appel and Wolfgang Haken published a hybrid proof, with a section written by them and another generated by a program [310]. It wouldn’t have been feasible for humans to perform an exhaustive check on the thousands of cases to which the original statement was reduced. Not only did such a proof create a certain amount of controversy, but it also managed to conceal some errors for at least four years. The authors finally rectified these errors in a book, including a computer-generated 400-page appendix. Flawed proofs weren’t unprecedented for this problem: two previous ones, of a more traditional kind, had each withstood scrutiny for 11 years before unrecoverable errors were discovered. The controversy about the recent proof was partly because checking it was beyond human ability. Indeed, mathematicians directly verified the program that wrote the more repetitive part of the proof. Therefore, why not publish that program instead of its 400-page output? I guess that would have been an even more radical innovation. Moreover, some mathematicians find such a verbose proof doesn’t help in understanding why this theorem is correct, a somewhat vague complaint. As mathematician and pioneering computer scientist John von Neumann said: “In mathematics you don’t understand things. You just get used to them,” recognizing that modern mathematics is past a level where intuition is helpful [311].
The four-color theorem is neither unique nor the record holder for size. Wikipedia maintains a list of long proofs, with the worst offender estimated to be between 10,000 and 20,000 pages long [312]. I don’t know whether this count covers a single proof or also includes the independent results used as stepping stones.
The prolific mathematician Paul Erdős believed the most elegant and concise proofs come from a mythical book he referred to as The Book. He reportedly said: “You need not believe in God, but, as a mathematician, you should believe in The Book.” Its earthly materialization came into existence in 1998: Proofs from The Book [313]. Its authors, mathematicians Günter Ziegler and Martin Aigner, explain that there are many criteria for picking a proof for the book: its clarity, its innovative ideas, and its ability to reveal something new about the problem. However, there is one clear-cut criterion: “A proof that eats more than 10 pages cannot be a proof for our book” [314]. The authors readily admit that there could be perfect proofs that are longer than ten pages, but they aren’t sure they are within human reach.
2.3 Engineering and Design
In branches of engineering outside computing, as we have mentioned (see Sec. 1.2.1), there is built-in pressure towards lowering the part count and the materials bill, as well as assembly costs. Therefore it is harder to identify or even clearly define any ascetic tendencies. However, some designs accomplish more than the sum of their parts. For example, in the Car of the Century Award ranking, of the top five cars, three are entry-level models that made driving accessible to large swaths of the population for the first time: the Ford Model T, the BMC Mini, and the Volkswagen Beetle [315]. Notably missing from the top five is “the most intelligent application of minimalism ever to succeed as a car,” the Citroën 2CV [316]. Yet the 2CV stayed in production longer than any other model, a record 42 years, graduating from a practical vehicle for farmers to a lifestyle statement.
Likewise, in the evolution of the airplane, simpler shapes are primarily the result of aerodynamic or structural constraints. Yet Antoine de Saint-Exupéry, writer and aviator of the pioneering age (see Fig. 8), found that simplicity was a unifying theme in all engineering endeavors, one shared with natural forms. He wrote [317]:
In anything at all, perfection is finally attained not when there is no longer anything to add, but when there is no longer anything to take away, when a body has been stripped down to its nakedness.
This idea seems to have taken hold in modern industrial design, as demonstrated, for instance, in the work of Dieter Rams, the designer behind iconic small appliances and electronic devices made by consumer product company Braun in the 50s and 60s, like the T3 radio, the cylindrical table lighter or the PS2 turntable [318]. The last, but hopefully not least, of his ten design principles is that “Good design is as little design as possible” [319]. This restrained approach resulted in designs that rise above the ebbs and flows of fashion and have aged gracefully, even as technology has made them obsolete. Moreover, these principles were in the service of other goals, including durability. In a recent documentary dedicated to him, Rams expressed his condemnation of the rapid replacement rate of modern technology [320].
If designing with fewer parts is now mainstream, fixing artifacts by removing parts isn’t, as we have seen for computer programs. Psychology research supports the idea that people, facing a broken device, think of adding parts rather than removing them: a piece of tape here, a screw there [321]. The same research finds that subtractive solutions are often overlooked in various tasks, from essay writing to itinerary definition.
Even if engineering has a built-in pressure towards reducing part counts, there are spectacular exceptions. Brick houses require roughly eight bricks per square foot of habitable surface, often exceeding 10,000 bricks for a typical US house.111 Small bricks take more time and mortar to lay but, as long as unassisted workers perform the job, there are limits to the size they can use. Consider how many other aspects of manual work are assisted by power tools: bricklaying isn’t one of them, save for mortar preparation.
At least a home can be built with one or a few types of bricks. Their number may be in the five figures, but the parts are interchangeable. Quite the opposite is true for the Eiffel Tower, which is made of 18,038 metal parts described in 5,329 drawings [322]. Two factors contributed to this inflation of unique elements: a limit on the weight that could be lifted up the tower during construction and its curved profile, based on Eiffel’s calculations for optimal wind resistance [323]. While the tower’s design is simple, it did not result in a significant amount of modularity.
2.4 Biology
The question of whether biology favors conciseness is too big to handle here. Yet, when talking about genomes, size seems highly variable and not associated with essential characteristics of an organism. Sure, viruses tend to have smaller genomes than bacteria, and these, in turn, smaller ones than multicellular organisms, but there are wild variations even between closely related species. Additionally, we don’t know what the function of most of the genome is, and it is generally believed to include a lot of filler or detritus collectively dismissed as junk DNA. Nonetheless, the parts of unknown function have been slowly shrinking. Mappings for regulatory regions, small RNAs and more have been slowly added to the better-understood mapping for genes. Shrinking doesn’t mean disappearing, though. In 2004, Nobrega and co-authors showed they could remove two megabases of the mouse genome that are devoid of protein-coding genes and grow perfectly viable (“indistinguishable,” as they report) litters of mice [324]. Then, one glorious day, the vast majority of the human genome was declared functional — the opposite of junk — by the researchers of a mega-project named ENCODE112 [325]. Never mind that it later emerged that its principal investigators were debating among themselves whether to make the top-line figure 20%, 80%, or anything in between, a telltale sign of an arbitrary number not supported by evidence. The chief bioinformaticist on the project, Ewan Birney, wrote, in a strangely naive confession, that they selected the higher figure because it “best conveys the difference between a genome made mostly of dead wood and one that is alive with activity” [326]. This statement should make clear beyond doubt that these numbers aren’t estimates of anything; they are instead messaging tools in a propaganda war for more funding. And funding do they need: the ENCODE project burned through roughly 400 million dollars, a significant portion of the NIH budget that couldn’t be awarded to “normal” research, and multiple follow-ups are already planned [327]. Jumping on the ENCODE bandwagon may become the only way to keep a laboratory open for a young investigator. Nature could not sit idle in the face of this travesty and made available an incredibly concise genome, that of a little carnivorous plant, Utricularia gibba, which has mostly eliminated anything in its genome that isn’t directly related to genes [328]. These are precisely the types of genomic elements that the ENCODE project, which worked only on human samples, newly claimed as functional. If you are thinking about homo sapiens exceptionalism and hypothesize that these mysterious genomic regions may be related to higher cognitive functions or free will, I will have to disappoint you. Consider, as a comparison, the common tomato, a not-so-distant cousin of Utricularia, with which it shares a common ancestor that lived 87 million years ago. While the number of genes in the two genomes is similar, the tomato’s genome is ten times the size of Utricularia’s and is packed with the same filler that is abundant in the human genome. That something evolution can dispose of so quickly, in evolutionary time, could be of the utmost importance in both tomatoes and humans strains credulity [329]. Discovering a single plant genome lacking all protein-coding genes would revolutionize biology, but we won’t find one, because genes are essential to an organism. Concise genomes may not be advantageous for life, but they sure helped us understand it.
3 Literature, Visual Arts and More
Anything expressed just right, whether it’s being really concise or just capturing it — like anything, a really well-put-together sentence or a little doodle, a caricature that looks just like someone but only used four lines, that kind of thing …
This is how Jamie Zawinski, a programmer best known for his role in creating the Netscape Browser, explains beauty in code [14]. It provides a perfect transition from engineering and science into the arts.
3.1 Speaking and Writing
3.1.1 Concise Writing
Calls for conciseness in writing abound and come from writers of all types. For example, The Elements of Style’s author William Strunk Jr. writes [331]:
Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts.
The Elements of Style is itself a concise book. E. B. White, author of its second edition, describes it as “[…] a forty-three-page summation of the case for cleanliness, accuracy, and brevity in the use of English” [332].
Atwood joins the praise for this book [333]:
There is perhaps no greater single reference on the topic of writing than Strunk and White’s The Elements of Style. It’s one of those essential books you discover in high school or college, and then spend the rest of your life wondering why other textbooks waste your time with all those unnecessary words to get their point across.
The Star Copy Style sheet, used at The Kansas City Star newspaper, warns to “use short sentences […] short first paragraphs” and to “eliminate every superfluous word” [334]. Ernest Hemingway, who worked there as a reporter, credited it for “containing the best rules I ever learned for the business of writing.”
However, writing short sentences doesn’t necessarily produce short books. Hemingway proposed the iceberg theory of omissions [335]:
This [the real ending] was omitted on my new theory that you could omit anything if you knew that you omitted and the omitted part would strengthen the story and make people feel something more than they understood.
(The iceberg metaphor was introduced elsewhere). To buy into conciseness, you don’t need to be a Nobel Prize winner. Dru Riley, author of a popular newsletter, makes conciseness itself a goal [336]:
People don’t realize constraints breed quality, and without constraints you just have a lot of noise. Some weeks I’ll write 13,000 words and I have to distill it down to less than ten percent of that. […] The goal is for the final report to be concise, dense, and readable.
Blogger Mike Crittenden supports reducing the number of ideas or sentences down to one [337]:
There’s no law that says a blog post needs more than one idea or more than one sentence.
Software developer Bruce Eckel doesn’t consider any length too short [338]:
Less is More. The more words you have, the more work your reader must do. Shorter sentences are easier to read and understand.
I am not sure Eckel strictly follows his own rules: the post this quote is taken from has more than 2500 words and takes roughly 16 minutes to read. However, I can’t prove they weren’t necessary.
Having the right goals is already helpful. For example, Jonathan Maltz made conciseness his New Year’s resolution [339]:
Biggest one for me in 2020 was succinct written communications. I look back at stuff I’ve written in the past and I’m embarrassed by how many words I wasted.
Even computer programs encourage conciseness, and I don’t mean that of their source code, which is covered in the first part of this book. For example, the text editor Zettlr offers an assist mode to improve a writer’s style. Its “algorithm will give you text credits for short and concise sentences,” among other style criteria [340].
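This is not Zettlr’s actual algorithm, just a toy sketch of the general idea of crediting short sentences, with an arbitrary word-count threshold of my choosing:

```python
# Score each sentence by word count and "credit" those under a threshold.
# The threshold and the sentence-splitting rule are illustrative choices.
import re

def sentence_credits(text, short=15):
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return [(len(s.split()) <= short, s) for s in sentences]

sample = ("Vigorous writing is concise. This sentence, by contrast, meanders "
          "through several clauses before arriving, at long last, at its point.")
for credited, sentence in sentence_credits(sample):
    print("credit" if credited else "too long", "->", sentence)
```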
Remember Javascript: The Good Parts, the book about subsetting JavaScript to make it better (see Sec. 1.3.2)? For a double dose of conciseness, that book is only 100 pages long. Olano writes [202]:
Sometimes, a lot of the times, less is more. The book itself is the perfect example: 100 pages long, mostly filled with snippets and examples, and still it had a clearer purpose and communicated it more effectively than most computer books.
What about speeches? Let’s learn from Lord Pannick, counsel for the challengers in a United Kingdom Supreme Court hearing about an unprecedented suspension of parliament [341]:
[Lord Pannick’s] delivery was smooth and unruffled. Never using two words where one would do.
The counsel for the Government, Lord Keen, was less in a hurry:
He began by spending 20 minutes arguing there was no difference between Scottish and English law in this case. This left people scratching their heads as no one had previously said there was.
The challengers prevailed.
Four words can make a poem. The complete text of Italian poet Giuseppe Ungaretti’s Mattina roughly translates to “I shine / of immensity” [342]. I am not sure he got his point across, but I don’t think he was concerned about being intelligible. He was an exponent of the hermeticist literary movement, thus named for its use of obscure metaphors. A draft, written on a postcard sent from the war front, was found to contain five lines, later reduced to two in the published version. To put such a short poem in context — short even by his standards — we need to consider that each word was vital for him and selected after a painstaking process. In Commiato he writes [343]:
Quando trovo
in questo mio silenzio
una parola
scavata è nella mia vita
come un abisso
(When I find / in this silence of mine / a word / it is dug into my life / like an abyss). Fittingly, he read his poems at a deliberate pace [344].
3.1.2 Reducing Text
Sometimes it is possible to compare a text in long and short forms. What better opportunity to evaluate the merits of conciseness?
In a review of director Simon Godwin’s Romeo and Juliet, New York Times’s Jesse Green writes [345]:
At 90 minutes, it is even shorter than the “two hours’ traffic of our stage” promised in its first lines but rarely honored in performance. (The entire play normally takes about three hours.) Yet […] this […] version scores point after point while whizzing past, or outright cutting, the elements that can make you think it was written not by Shakespeare but by O. Henry on a bender. If the cutting merely left what remains with a much higher proportion of penetrating insight and powerful feeling, that would be enough […]. But the speed serves another function here: telling a story that’s mostly about teenagers with a teenage intensity and recklessness.
It seems that in this case, paraphrasing McLuhan, the length of the medium is the message.
Revisions of a poem or novel may provide insights into how its author sees length. Poems come in all shapes and sizes, from single-verse fragments to multi-page odes. It is thus reasonable to think that there is little external pressure towards conciseness. Yet, in the case of “The Waste Land” by T. S. Eliot, the drafts of the poem show that it lost about half of its lines before being published, partly on the advice of fellow poet Ezra Pound [346].
State of the Union speeches by American Presidents also vary considerably in length. Yet, Cody Keenan, President Obama’s chief speechwriter in his second term, stayed up all night to perform some trimming [347]:
By 5 a.m., a more succinct draft was on its way to the president.
Researchers often chafe at the strict paper length limits enforced by some journals. Not Mark Blaxter, professor of genomics, who was able to recognize that there were also advantages [348]:
I was shocked at how much of our carefully crafted text could be pruned […] to leave only the essential core of what had to be said. Many a choice turn of phrase and illustrative simile had to be sacrificed to precision, flow and brevity. This was (probably) a good thing.
Other writers even enjoy the process. For example, Pamela Erens, author of the novels The Virgins and The Understory, writes [349]:
My favorite part of writing is taking stuff out. […] when I can remove a limp adjective or superfluous sentence from a novel chapter or essay, I feel a rush that is a bit like being airborne.
3.1.3 Conciseness is Hard
When we run into verbose writings or speeches, we should cut the author some slack, because being concise is hard. “I would have written a shorter letter, but I did not have the time” wrote philosopher Blaise Pascal — this quote has also been misattributed to a half dozen other intellectuals [350].
Blogger Mike Crittenden, selecting quotes from his favorite reads for the year, can’t settle on a single one: “I couldn’t pick just one favorite quote from it, so here are two” [351]. It may sound tongue-in-cheek, but it makes sense that the decision between the top two options could be the hardest, as those are closer to the optimum than any other. A mathematical parallel is choosing an element for each of the sets in a collection. The possibility of doing so for infinite collections rests on the axiom of choice, which is independent114 of the other axioms of set theory, and was initially looked upon with suspicion by mathematicians [352]. Luckily no one has read an infinite number of books, so picking one quote from each does not rely on this axiom.
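For the curious, here is one standard formulation of the axiom of choice, included only to make the parallel concrete: for every collection X of nonempty sets there is a function f that picks an element from each of them.

```latex
\[
\forall X \,\bigl[\, \emptyset \notin X \;\Longrightarrow\;
  \exists f\colon X \to \textstyle\bigcup X \;\;
  \forall A \in X \;\; f(A) \in A \,\bigr]
\]
```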
President Obama, who asked his speechwriter to shorten his State of the Union address (see the previous section), wasn’t as good with his memoir, which exceeds 500 pages [353]. He said [354]:
[I am] painfully aware that a more gifted writer could have found a way to tell the same story with greater brevity ([…] a signed copy of the 272-word Gettysburg Address rests inside a glass case)
Even computers find conciseness hard. For example, researchers working on GPT-2, a record-setting AI system to process natural language, reported [355]:
Fine-tuning for the stylistic continuation tasks is sample efficient: 5,000 human samples suffice for strong performance according to humans. For summarization, models trained with 60,000 comparisons learn to copy whole sentences from the input while skipping irrelevant preamble.
In other words, plausibly extending a text was easier for this AI system than summarizing it.
3.1.4 The Law
The text of the law, in the United States as well as in many other countries, is ponderous, to put it mildly. The Affordable Care Act alone takes close to a thousand pages. How are we supposed to access healthcare knowing our rights and obligations? Even attempting a count of federal laws is considered futile by the experts at the Library of Congress [356]. The complexity of the law, fueled by its size, jargon, and ambiguous language, forces people to rely on lawyers to interact with the justice system and each other. This is a source of discrimination between those who can afford to hire the best and those who can’t.
In an interview, Google co-founder Larry Page said he “would dearly love all laws to be limited to 50 pages. Every time a new law is enacted an old one should be removed” [357]. It can be easily dismissed as a billionaire’s delusion, but if we accept that ignorance of the law isn’t excusable, then a limit on its length should follow as a corollary.
While there is no size limit on the text of laws, there is a law that sets some constraints on the length of forms the United States Government asks its citizens to fill out: the Paperwork Reduction Act [358]. Passed in 1980 and amended in 1995, it doesn’t impose a size limit outright. Instead, all agencies wishing to collect information must fulfill several requirements, such as demonstrating the need for such collection, submitting a plan for its use, and so on. The central authority for these checks is the Office of Information and Regulatory Affairs, which, as a consequence, can estimate the burden of information gathering on citizens: 9.78 billion hours in 2016, roughly 6 minutes per person per day.
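As a back-of-the-envelope check of that last figure (a sketch; the adult population of roughly 250 million is my assumption, not a number from the source):

```python
# 9.78 billion hours a year, spread over roughly 250 million adults,
# comes out to about 6 minutes per person per day.
hours_per_year = 9.78e9
adults = 250e6                      # assumed adult US population
minutes_per_person_per_day = hours_per_year * 60 / adults / 365
print(round(minutes_per_person_per_day, 1))  # about 6.4
```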
I couldn’t find statistics on private contracts, but, in my experience, they are always too long. Lawyered-up companies try to gain the upper hand on smaller ones or private citizens by overwhelming them with long and complex contracts or user agreements, sometimes exceeding 20,000 words [359]. Here is one refreshing exception, though. An energy startup, Voltus, offers one-page contracts and asks: “Could we make all contracts one page?” [360].
3.1.5 Counterpoint
This book was never meant to be an unbiased take on conciseness, but I would be remiss if I didn’t mention that there are texts, specifically novels, of incredible heft. According to Wikipedia, the record holder is Venmurasu for an estimated 3,680,000 words [361]. I am not laying an accusation that it is verbose: according to its author, it paints “a canvas as big as time itself,” a demanding task. The first English entry, A Chronicle of Ancient Sunlight by Henry Williamson, is a distant second with 2,436,924 words. In both cases, it may be arguable that they are cycles of novels rather than a single one. There are about ten entries over a million words. All these will likely be eclipsed by Richard Grossman’s Breeze Avenue, a collective multi-media work expected to total three million pages [362]. Not so much a novel as a “band of meaning impelled by forces analogous to those that undergird physical reality,” it is planned to “emanate,” every 36 days, “a one-of-a-kind 5,000-page evanescent novel.” Five thousand of these works are scheduled over five hundred years, each published online for a short time. The Guinness book of records must employ a tighter definition of a novel, as they list as the current champion Marcel Proust’s Remembrance of Things Past [363], totaling 1,267,069 words in 7 volumes [364]. I do know people who credibly claim to have read all of it.
It’s one thing to write unrepentantly long novels, and another thing to belittle conciseness. Cegłowski’s blog byline is “Brevity is for the weak,” without further explanation [365]. His most-read posts have an average length of about 3500 words for a reading time of 15-20 minutes. There is no contradiction, though, with his views on web design (see the “web obesity crisis” in Sec. 1.5), as his pages are only rich in text and load almost instantly.
3.2 Visual Arts
3.2.1 Architecture
We have mentioned in the Introduction that “Less is more” is an aphorism for frugality. While the phrase does not originate with architect Ludwig Mies van der Rohe, it is most commonly associated with him, and his work was characterized by “extreme clarity and simplicity” [366]. He is regarded as one of the pioneers of modernist architecture, and the same minimalistic attitude guides the whole movement. Some perhaps took it a little too seriously, as the title of Adolf Loos’ essay Ornament and Crime suggests [367]. He knew something about crime, being convicted of one later in life.
3.2.2 Photography
Alfred Stieglitz is credited with driving the acceptance of photography as an art form with his pictures as well as his work as a promoter [368]. His style was aimed at drawing as clear a distinction as possible between photography and painting:
Stieglitz described his most recent work as ‘intensely direct. … Not a trace of hand work on either negative or prints. No diffused focus. Just the straight goods. On [some things] the lens stopped down to 128.115 But everything simplified in spite of endless detail.’
In particular, he was reacting to an early photographic movement known as pictorialism, which emphasized staged subjects and effects. Stieglitz instead saw photography as fundamentally anchored in reality.
Reality can be messy, but the photographer, no matter how direct his technique is, still exercises control over when, where, and in what direction to shoot, which results in the inclusion or exclusion of some visual elements. Landscape photographer Huntington Witherill sees this selection as fundamental [369]:
Finding the strongest way of seeing is really, to my way of thinking, intellectualizing within myself what it is that attracted me to the scene in the first place. And then, doing my best to include all of that within the photograph itself and eliminate everything else out of the photograph.
Editing a shot is an available artistic choice in most genres of photography, but not in photojournalism, where it is strictly limited to keep a picture true to reality. When Steve McCurry, one of the most celebrated photojournalists of his generation, was found to have routinely and secretly edited or even staged his pictures, a scandal erupted [370]. Even the famous Afghan Girl has been published in at least two versions, one more “cleaned up” than the other. Inspection of several edited images reveals that removing distracting elements and enhancing colors are the most common transformations, a window into the thinking of a great photographer, if no longer a photojournalist, as he now fashions himself a “visual storyteller.”
The virtues of visual simplicity are evident not only to outstanding photographers but also to regular professionals. For example, real estate and landscape photographer Anton Gorlin doesn’t mince words [371]:
Take it this way — if it doesn’t add to the story, exclude it. You got it right — include as little as possible.
Indeed, to build great compositions, you must learn to exclude with no remorse or regrets. […] Always aim to include as little as possible.
Magnum Photos photojournalist Jerome Sessini goes one step further [372]: “I look to make photos better by subtraction rather than addition.” He acknowledges that his style evolved towards conciseness but that, initially, he was “introducing the maximum number of elements into the frame” following the example set by two of his role models (see Gilden in Sec. 3.2.4).
3.2.3 Painting
Compared to photography, painting affords the artist the highest degree of control and, specifically, the freedom to include any number of elements or details in an image. Georgia O’Keeffe, a painter (see Fig. 9) personally and professionally related to Stieglitz (see Sec. 3.2.2), advocated including fewer of them [373]:
Nothing is less real than realism. Details are confusing. It is only by selection, by elimination, and by emphasis that we get at the real meaning of things.
O’Keeffe’s quest isn’t just aesthetic but one for meaning. Her words suggest an analogy with statistics (see Sec. 2.1.1). Over-fit models can reproduce every minute detail in the data but, unable to distinguish the signal from the noise, fail to capture the laws governing the system under consideration — a failure seen in their inability to generalize to new data points.
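A minimal numerical sketch of this failure mode, with an invented noisy linear “law” and polynomial models of two different complexities, may make the analogy concrete: the high-degree polynomial hugs the noisy observations more closely, yet strays further from the underlying law on new points.

```python
# Fit a simple and a complex model to noisy data generated from a known law,
# then compare how well each recovers the law itself.
import numpy as np

rng = np.random.default_rng(0)

def law(x):
    return 3 * x + 1  # the simple underlying "law" generating the data

x_train = np.linspace(0, 1, 20)
y_train = law(x_train) + rng.normal(0, 0.3, x_train.size)  # noisy observations
x_new = np.linspace(0, 1, 200)                             # new data points

for degree in (1, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    fit_error = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    law_error = np.mean((np.polyval(coeffs, x_new) - law(x_new)) ** 2)
    print(f"degree {degree:2d}: error on noisy data {fit_error:.3f}, "
          f"error against the law {law_error:.3f}")
```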
“Art is the elimination of the unnecessary,” said Picasso [374]. Or did he? Despite the abundance of attributions on the web, I couldn’t find one pointing to a source, and exact matches in online searches are all dated 2000 or later. An earlier variant read: “Design is the elimination of the unnecessary.” It probably seemed like a good aphorism, just needing a prestigious, if inauthentic, voice.
Nonetheless, Picasso was familiar with self-imposed limitations [375]:
For a long time I limited myself to one color — as a form of discipline.
He is probably referring to his blue and rose periods — they are two distinct periods, one for each color — during which he used severely restricted palettes, arguably matching his dominant mood at that time.
His contemporary Henri Matisse applied minimalism not just to the color but to every element of a painting [376]:
In a picture every part will be visible and will play its appointed role, whether it be principal or secondary. Everything that is not useful in the picture is, it follows, harmful. A work of art must be harmonious in its entirety: any superfluous detail would replace some other essential detail in the mind of the spectator.
As with music (see 4’33”, Sec. 3.3), it is possible to push simplicity to its extreme consequences, and I suspect nobody represents that possibility better than Kazimir Malevich. He is known, among other works, for his trifecta of paintings depicting just one square each: Black Square, Red Square, and White on White. Peter Schjeldahl, an art critic, wrote [377]:
Malevich is monumental not for what he put into pictorial space but for what he took out: bodily experience, the fundamental theme of Western art since the Renaissance.
This statement places Malevich at the forefront of abstract art, but I think he was another few steps ahead. In correspondence about Black Square, Malevich wrote: “It is from zero, in zero, that the true movement of being begins” [378]. To me, White on White, a painting representing, unsurprisingly, a white square on an otherwise empty canvas, is even closer to zero, with its minimal amount of contrast. Malevich seems to be asking how abstract art can get and what can’t be left out of a painting. He stops right on the edge of answering: “nothing.” Like the silent piece 4’33” but unlike their less famous precursors (see below), Malevich’s squares have stood the test of time, are still displayed in exhibits around the world, and have a place in the history of art.
Hidden under the paint of Black Square, there is a reference to similar, earlier pieces by French humorist Alphonse Allais. He produced several monochrome canvases, filled edge-to-edge with the same color, and a silent funeral march, all satirical works [379]. Allais did not author the first published black square, though. As far as I know, that distinction goes to Paracelsian physician Robert Fludd in his 1617 The History of the Physical and Metaphysical Cosmos, where, among a set of 60 detailed engravings, it appears as an illustration of infinite darkness, the primordial state before creation [380]. It seems that, for Malevich’s squares, timing and context were as consequential as the artwork itself.
3.2.4 Counterpoint
I would be remiss if I didn’t reference some counterexamples to conciseness in the visual arts. Talking about the photography of Jerome Sessini (see Sec. 3.2.2), I mentioned that he moved away from a more crowded style that had been inspired by two of his role models, Mark Cohen and Bruce Gilden. After researching their work, I could find only a few examples of such a style in Gilden’s output (for instance, [381] from Coney Island [382]). His images include multiple subjects, none clearly more important than the other, filling the frame, with no background-to-foreground separation. They seem to represent a small fraction of his work [383]. We can find this style thoroughly embraced by the Flemish Renaissance masters Hieronymus Bosch (Garden of Earthly Delights; Triptych of the Haywain) and Pieter Bruegel the Elder (Carnival and Lent; Proverbs [384]) or French engraver Jean Duvet (Fall of Babylon; Apocalypse [385]). The latter’s work, in particular, can be considered an example of horror vacui, the fear of or distaste for empty space. Originally a term used in physics dating back to Aristotle, it was later associated with Victorian architecture as a pejorative. Baroque architecture also makes extensive use of ornamentation to avoid empty space. An artistic need to fill all the available space predates all of these examples, going back at least to the Geometric Age in Greek art. However, that style, like the Islamic arabesque, seems to prefer orderly patterns to Duvet’s complex, almost chaotic arrangements.
3.3 Music
Music, like literature, doesn’t seem constrained by particularly severe length limits, with compositions lasting minutes, several hours, or anything in between. Notably, classical music, with its planned repetitions and taste for variation, seems bent on keeping the music going. Particularly during the romantic period, which saw the premiere of Beethoven’s 9th Symphony — 70 minutes — or Richard Wagner’s opera Die Meistersinger von Nürnberg (The Master-Singers of Nuremberg) — five hours and fifteen minutes — it seems conciseness wasn’t an important consideration. Yet Franz Liszt, in his praise of Robert Schumann’s Carnaval, wrote [386]:
[Carnaval] will assume its natural place in the public eye alongside Beethoven’s Diabelli Variations, which in my opinion it even surpasses in melodic invention and conciseness.
Hungarian composer Béla Bartók, who devoted much of his work to the study of folk music, wrote [387]:
[Folk music] is the classical model of how to express an idea musically in the most concise form, with the greatest simplicity of means, freshness and life, briefly yet completely and properly proportioned.
Moving closer to contemporary music, the work of Anton Webern stands out for its brevity. “Webern’s compositions are concise, distilled, and select; just thirty-one of his compositions were published in his lifetime,” Wikipedia reports [388]. A comprehensive recording of his opera omnia fits on 6 CDs, and “the Six Bagatelles for string quartet (1913), for instance, last about three minutes in total.” Arnold Schoenberg, a contemporary whom Webern reciprocally influenced, described his own musical aesthetics as follows: “My music must be brief. Concise! In two notes: not built, but ‘expressed’!!” Later, though, he deviated from these more clearly Webernian ideas.
An alternative to writing short pieces of music is removing the sound. John Cage achieved this with his 4’33”, in which any ensemble performing it is instructed not to make a sound [389]. It wasn’t an original idea, being preceded by half a century by Il Silenzio: pezzo caratteristico e descrittivo (stile moderno) (Silence: characteristic and descriptive piece (modern style)), a pseudonymous work tentatively attributed to Edgardo Del Valle de Paz [390]. However, unlike its predecessors, it was meant to be taken seriously. Cage hesitated to “write it,” concerned it would be considered a prank. Whereas the function of computer programs or buildings is relatively clear, music doesn’t have an obvious one. Is 4’33” a concise piece of music? Or does it take that much time to accomplish nothing, like a computer program spinning idly (see the busy beaver numbers in Sec. 1.2.11.1 for interesting yet idle programs)? Or is the goal of 4’33” to provide some time to take in background noise or to reflect? I don’t have a definitive answer, but 4’33” has worked like a music piece. It has been recorded, performed in front of an audience, and its score is available for purchase. As with Malevich’s squares (Sec. 3.2.3), it seems like the timing and context of the work had more to do with its success than the actual “music.”
While revolutionary, the first “serious” silent piece didn’t leave room for many follow-ups, even if Cage himself wrote two of them, 0’0” and One3 = 4’33” (0’00”) + 𝄞. Don’t rush to publish your own 4’32” or 4’34” and hope to wow audiences and critics the way the original did. However, other approaches to minimalistic music don’t require sound to shrink or disappear, only the score. Terry Riley’s In C shows one possibility [391]. While this piece takes only one page of musical notation, its typical performances last 45 minutes to an hour. It specifies 53 loops or phrases to be repeated and combined according to somewhat loose instructions that allow performers a lot of discretion. These instructions are related to the canon, a traditional form prescribing that phrases be performed in the order they are written, with staggered starts for each instrument. Unlike in a canon, performers are allowed a variable number of repetitions for each phrase. The result is a repetitive but ever so slightly changing mass of sound, strangely suspended between the calm and the frantic, forever stuck in a single harmony, as the title suggests — which is one of the choices that allow so much performance freedom, like the fixed tempo. Unlike the silent piece, this approach allowed for many variations, attracted several composers, and became known as minimal music [392].
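Those somewhat loose instructions are easy to mimic in a few lines. The following is only a sketch of the mechanics, not of the music: the phrases are reduced to their numbers and the repeat limits are invented.

```python
# Simulate several performers playing 53 numbered phrases in order, each
# repeating every phrase a random number of times, in the spirit of In C's
# instructions. The maximum repeat count is an arbitrary choice.
import random

N_PHRASES = 53

def performance(seed, max_repeats=8):
    rng = random.Random(seed)
    timeline = []
    for phrase in range(1, N_PHRASES + 1):
        timeline.extend([phrase] * rng.randint(1, max_repeats))
    return timeline

players = [performance(seed) for seed in range(4)]
# At any instant, players are at nearby but not identical phrases, which is
# what produces the slowly shifting mass of sound described above.
for t in (0, 60, 120, 180):
    print(t, [p[t] if t < len(p) else None for p in players])
```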
3.4 Management
Ascetic management sounds like an oxymoron. Leadership is about getting more done, conquering new markets, vanquishing the competition, and so forth. In this context, you may think of the lean startup, but that is penny-pinching rather than asceticism [393]. Nonetheless, a few isolated voices sing a different tune. Jotform founder and CEO Aytekin Tank had caught the classic entrepreneurial busyness bug: “constantly be building, working, growing and developing the next thing” [394]. Then he started seeing that being busy and being successful aren’t equivalent, and he decided that it was time to have some downtime. “But, doing less or nothing at all is easier said than done,” he writes. He almost echoes photographer Saul Leiter, who wrote [395]: “I have a great respect for people who do nothing.” Or singer-songwriter Franco Battiato [396]: “Si salverà chi non ha voglia di far niente, non sa fare niente” (Those will be saved who don’t feel like doing anything, who don’t know how to do anything). Tank primarily faults societal pressure to look constantly busy for his inability to slow down. The remedy he adopted, following the likes of Bill Gates, is to take a block of time completely off. While Gates goes to a retreat to read books, Tank spends a week picking olives. We have no reason to doubt the healing effects of olive picking, but it still is work, and a week per year is not a lifestyle change. While it will take more for Tank to become a leader in idleness, we wish him well on his path to recovery.
Software entrepreneurs Jason Fried and David Heinemeier Hansson are well known for adopting some non-standard management practices at their company, Basecamp. In their book It Doesn’t Have to Be Crazy at Work, they propose their alternative to the culture of workism [397]:
The answer isn’t more hours, it’s less bullshit. Less waste, not more production. And far fewer things that induce distraction, always-on anxiety, and stress.
More recently, they added political discussions to their list of distractions, banning them and offering a buy-out to employees who didn’t like the change. A third of the company took the offer and left, making the organization more frugal overnight [398].
Fried is in favor of procrastination. Familiar with the mandatory “bias for action” in startup job ads? Fried doesn’t buy that [399]:
Not only can most things wait, most things should. You either can’t stop thinking about them, or those thoughts fade away. Let time do some work for you.
In other words, time acts as a filter. Downsides of an idea may manifest themselves, opposition to it may solidify, or alternatives may emerge. Sure, a project may have time-sensitive requirements that preclude the luxury of time. For example, some hustling may be warranted if a project relies on a once-in-175-years alignment of planets, as the Voyager missions did [400]. However, hurrying for the sake of it used to be called “being impulsive,” and it wasn’t a compliment. We may call this approach a bias for inaction.
Fried and Heinemeier Hansson run their company to their own tune and, in particular, aren’t pursuing growth at all costs, or, should I say, at any cost. Heinemeier Hansson writes [401]:
WHAT’S NEXT?! Which is really a question of WHAT’S MORE? What else are you going to do in addition to all the shit you’re already doing? It’s so ingrained in our entrepreneurial culture that you must always be on a conquest. Once a set of territories have been subdued, you’re honor-bound to push further north. Thanks, but no thanks. […] We’ve always been great fans of constraints, and capping the headcount in the face of growth is perhaps the biggest constraint of all.
They have indeed capped headcount after whittling their product portfolio down to one (see Sec. 1.3.1). Now, with a third of the company gone, they may have hiring on their task list again. It appears from their LinkedIn page that they have already more than doubled their stated headcount limit and launched a second product. The call of the North may be too hard to resist.
3.5 Humor
Can asceticism be applied throughout your lifestyle? It would seem so from a post by none other than Google’s director of research Peter Norvig, which touts a service called “Functional Lifestyles […] dedicated to making our clients stronger, healthier, more robust, more concise […]” [402]. On closer inspection, the date of the post is April 1. But even without the help of dedicated training, people have been integrating conciseness into their lives. For example, startup founder Jon Schlossberg chose as a byline to his blogging profile “Bacon, hummus, brevity,” elevating brevity to bacon-level concern [403].
Conclusion
What I tried to corroborate with the present book is that conciseness, frugality and parsimony in their various manifestations aren’t just stinginess, or deprivations endured for the sake of character building. Neither are they nice-to-haves, a sort of generalized decluttering that makes things look neater. Instead they go to the heart of how we understand reality, by turning the complexity of observations into simple laws. Likewise, they go to the heart of how we build artifacts, which need to be predictable and reliable, that is, to follow their own simple laws. If we agree with O’Keeffe that art is a way to get “at the real meaning of things,” the same could apply in that domain as well.
The special role that computer programs have in all of this isn’t just as engineered artifacts themselves, but also as an effective way to express these laws. Traditionally, this is the domain of mathematical models, but constraining them to be computable has distinct practical and conceptual advantages. Simplicity or elegance need no longer be in the eye of the beholder but coincide with the existence, implicit or explicit, of short computational descriptions.
The surprising discovery is that to get to the truth – a program true to its specifications, a model true to a natural phenomenon, a painting true to its subject – deploying an abundance of means is counter-productive. Instead of getting more, we get less: less correctness, less maintainability, less accuracy, less elegance and even less beauty. The activities of knowing, building and creating all rest on similar notions of concise effective descriptions.
I hope this book leaves you with a newly found appreciation for the role of asceticism in the modern world and a desire to adopt it in your own endeavors.
Bibliography
For references to on-line material I tried to follow these criteria:
- I provided a Wayback Machine URL, which is frozen in time and relies on the commitment by an organization, the Internet Archive, to never take it down [404]. Since these links are formed by a prefix followed by the original URL, the original link is also implicitly available. Unfortunately some web sites do not allow this type of archiving and, as a consequence, some “rotting” of links will be unavoidable. As a backup I attempted to save PDF images for every URL in the references, but a minority failed.
- To set a date for a reference item, whenever the content includes a date (post date, reply date, most recent commit in a software project and so on) I used that date. This doesn’t imply the item hasn’t been updated since, but that was the posted date at the time I accessed the material. Sometimes I calculated the post date from information such as “posted x years ago.” When no such date was available, I listed the date at which I accessed the material.
- As for authorship, I trusted the listed author or used the owner of a source code repository as the author. When only a username or handle was listed, I tried to follow links to find a real name, but I did not put on my detective hat to achieve this result, not only in the interest of time but because I didn’t intend to out anyone who wanted to remain anonymous. I am sure I confused group and individual authorship in some cases, and the boundary can be blurry for software; for any mistakes I offer my apologies.
This work is in the public domain, via Wikimedia Commons.↩︎
A discussion board for developers.↩︎
The smallest number of edges that can be removed from a graph to make it acyclic.↩︎
Errors in a program that cause it to fail or behave in unintended ways.↩︎
It can be found in [13] as item 158, without Steele’s change.↩︎
For team projects, speed is measured in lines written by one developer per unit of time, even if multiple developers are working at the same time.↩︎
Code that can never be executed, and can be removed without altering the behavior of a program.↩︎
A way to de-activate code by turning it into a comment.↩︎
An automated analysis that doesn’t require running the target program.↩︎
A software used by software engineers to preserve, compare and combine different versions of the same code.↩︎
The verbose developer is now a department head. The concise one is still an individual contributor.↩︎
Tests that exercise individual components.↩︎
Also known as issue, a unit of work with attached description, stored in a system known as issue tracker, in support of software development.↩︎
Python also offers the option print((message + '\n') * n, end='').↩︎
Undecidable, to be precise, that is, there can be no program that solves it.↩︎
A precursor to the computing concept of a function known from the early days of computing [40].↩︎
On different machines, real or virtual, or at least in separate processes.↩︎
Google’s codebase alone has passed the 2-billion-LOC mark, an unknown fraction thereof currently in use; I don’t think there is a way to make a reasonable industry-wide estimate.↩︎
A fundamental but also elementary algorithm.↩︎
A program to translate code written in a programming language into instructions directly executable by a machine.↩︎
A programming technique whereby a function is defined using itself; it isn’t a circular definition because at each invocation of the function its arguments become smaller or simpler.↩︎
If you are worrying for him, he is now a successful IT manager.↩︎
In object-oriented programming, classes of objects can be associated in an inheritance relation whereby the inheriting class has all the capabilities of the other and adds some of its own.↩︎
A programming paradigm whereby composing and applying functions is emphasized, whereas changes of state and side effects are minimized.↩︎
Using automatic formatting tool Black.↩︎
This implementation is for educational purposes only. Fundamental Python data structures are typically implemented in C and avoid recursion for efficiency reasons. partial is from the standard library module Functools.↩︎
A type of processing unit specialized in graphics and image processing, but also used for a variety of other numerical computations.↩︎
Based on systems that include different types of processor, typically optimized for different tasks.↩︎
Application Programming Interface or how programmers access libraries and services.↩︎
A technique whereby data is stored temporarily for performance reasons.↩︎
A data structure to represent sets with particular emphasis on memory efficiency.↩︎
Integrated Development Environment, a software used by engineers to develop applications.↩︎
Used to compare different versions of a program; text-oriented ones aren’t aware of the syntax of a program and process it as a generic sequence of characters.↩︎
A programming language feature whereby different functions can share the same name as long as their arguments’ types are different.↩︎
I hope you charge by the LOC if you are an independent contractor.↩︎
A form of documentation interleaved with code.↩︎
A theoretical model of computation.↩︎
Universal in this context means able to read a description of another Turing machine or Lisp function in input and execute it.↩︎
A language close to the machine language that computers can understand directly, but readable for people.↩︎
An abstract computer that is emulated in software; it represents an intermediate, optional step between programming languages and machine language which enhances portability, the ability of programs to run on different types of computers.↩︎
Code derived from Bååth’s is available under an Attribution 4.0 International (CC BY 4.0) license.↩︎
The minimum number of one-letter deletions, insertions and substitutions necessary to transform a word into another.↩︎
The process of creating a function from another with a larger number of arguments by setting values for a subset of them.↩︎
Code derived from Hoigaard’s is licensed under an Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.↩︎
As quoted in Hoigaard’s post. A link to the original was “dead” and no licensing information was provided.↩︎
In the context of web applications, the generation of HTML for a page.↩︎
In this context, it means “in the browser.”↩︎
Names are disambiguated where needed with the year of birth.↩︎
Java Virtual Machine.↩︎
Code that is largely repetitive but can’t be avoided.↩︎
[A type of template that accepts a variable number of arguments.]↩︎
For languages, it means that working programs written for older versions of a language remain working under the definition of the new version.↩︎
[Meaning a neural network module, not a software module.]↩︎
Traditionally the simplest program we can write is one that prints this message and terminates.↩︎
Similar to those of relational databases and inspired by the mathematical concept of relation.↩︎
For Extract Transform Load, an umbrella term for a variety of computations necessary to load a raw data set into a database.↩︎
[A Domain Specific Language; to the best of my knowledge Scalding is just a library for Scala, despite this statement.]↩︎
A system for the processing of large amounts of data.↩︎
This idiom derives from the fact that code is often organized in layers which typically only access the layer immediately underneath.↩︎
The ability of programs to run on different types of computers.↩︎
Despite attempts to extend Python with macros, there is nothing available beyond the experimental stage.↩︎
[qplot is a function in Ggplot2.]↩︎
That theme_set call is necessary to remove the questionable default gray background: not even Ggplot2 is perfect.↩︎
I am discussing frameworks together with libraries, as I don’t see the distinction as fundamental.↩︎
Data Base Management System.↩︎
A way to map programming language objects to tables in a database.↩︎
For Model View Controller, a widely adopted structure for web applications.↩︎
[The simplest possible app that performs something, like printing “Hello, World.”]↩︎
However, following the convention used throughout this book of stripping any import or equivalent statements.↩︎
Object Relational Mapping.↩︎
A component in a web application that supports the generation of HTML documents.↩︎
Like APIs, but accessed through the HTTP protocol rather than function or method calls.↩︎
Converting from and to a format suitable for storage or transmission over a network.↩︎
Data structures that support iteration.↩︎
In my testing the module-wide feature didn’t work.↩︎
The performance penalty on decorated calls has been ignored for the sake of simplicity, and can be improved by moving the creation of new vtypes to decoration time rather than call time, or by memoizing vtype.↩︎
A temporary data store that discards the Least Recently Used item to make room for new ones.↩︎
One that can be replaced with a custom definition.↩︎
A catch-all argument for otherwise undeclared keyword arguments, those specified as key=value.↩︎
Virtual Private Networks, a networking technology that improves privacy.↩︎
This code is in the public domain.↩︎
This function extends the sequence of sum, multiplication and exponentiation to ever faster growing functions.↩︎
For instance, 5 is represented as 11111.↩︎
This size needs to include the size of any input or, alternatively, no input is allowed.↩︎
This code is available under a Creative Commons Attributions ShareAlike 3.0 unported license.↩︎
The depressing name some software methodologies use for a list of planned features or other changes.↩︎
The analysis of the syntactical structure of a computer program.↩︎
To be precise, Parso has a second, much less important, function.↩︎
The process of finding the derivative of a function.↩︎
Generated when needed rather than as soon as possible (eager).↩︎
Programs to control multiple computers from a single terminal window.↩︎
A high estimate, others being as low as 1/3 of that; Meta did not separately report its VR division results until recently.↩︎
Small computers on a single chip used for specific purposes, as opposed to the more general purpose ones used in phones, laptops or data centers.↩︎
Central Processing Unit, or the most important part of a computer, responsible for reading program instructions and executing them.↩︎
A software bundle of an application with libraries and configuration it needs to work. It is meant to allow the execution of said application on a standard infrastructure.↩︎
[That’s life]↩︎
The amount of work caused by choosing a quick solution rather than a high quality one.↩︎
The property of different models of computation to describe the same set of computations as the Turing machine.↩︎
The ability present in many object-oriented languages to execute different code based on the type of an object.↩︎
An object-oriented language feature whereby objects of different types can be dealt with in a uniform way.↩︎
In object-oriented programming, a relation between types according to which an object of the “child” type can be operated on as if it were an instance of the “parent” type.↩︎
This image is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license, via Wikimedia Commons.↩︎
This is usually defined in terms of universal Turing Machines, but any Turing-equivalent model of computation leads to similar results.↩︎
Because there are multiple choices of computing models, with the only requirement that they be Turing-equivalent.↩︎
A precise definition requires using a specific type of Turing machines, but any variant thereof or reasonable programming language will result in a definition with the same properties, albeit different concrete values.↩︎
The notation means “growing at most like.”↩︎
Even if a pure Bayesian approach would dictate that the only thing that matters is the posterior distribution over models, not which one ends up winning.↩︎
Because if n = mp, with p an odd prime, then xⁿ + yⁿ = zⁿ implies (xᵐ)ᵖ + (yᵐ)ᵖ = (zᵐ)ᵖ, which is also covered by the Fermat conjecture unless n is a power of 2, in which case it is divisible by 4 and the n = 4 case applies.↩︎
Notwithstanding that brick construction isn’t the most popular in that country.↩︎
I was involved in a minor role in a preliminary phase of this project, at least 5 years before the events described here, based only on public information.↩︎
This work is in the public domain in the United States, via Wikimedia Commons.↩︎
It neither contradicts the other axioms nor is implied by them.↩︎
[This refers to a camera setting which maximizes the areas of a photograph that appear sharp.]↩︎