random dungeon generator in obfuscated C

The Ascetic Programmer

How asceticism benefits programming, science, and the arts

Antonio Piccolboni


Cover illustration based on a program by Robert Nystrom [183]. Used by permission.

Many thanks to Marco Scanu and David Bylsma for providing feedback on an early draft, Tom Lindsey and my wife Maria for patient proofreading and to Maria for her unwavering support.


© 2023 Antonio Piccolboni CC BY-NC-ND

Introduction

Ascetic programming is the writing of computer programs under restrictions unrelated to resources such as memory or power. They range from using fewer lines of code to forgoing certain programming language features. My interest in ascetic programming started with challenges among coworkers, which we undertook to keep our skills sharp, our curiosity for programming alive, and, not least, for bragging rights.

We were not the first to think of imposing restrictions on programming other than those on resources. For instance, you may be familiar with Edsger W. Dijkstra’s condemnation of the goto statement (see Sec. 1.6). However, among possible restrictions, the one on program length has caught my long-term interest. It has connections to different facets of computing, from maintainability to testing to theory. These associations are mostly positive, meaning that shorter programs are better along several dimensions: for instance, they are more readable and maintainable and contain fewer errors.

Asceticism originated as a spiritual practice millennia ago. Ascetic restrictions traditionally target worldly pleasures and promise salvation or enlightenment in return. This book will focus mainly on code length restrictions and, more generally, description length restrictions, and our rewards will be plenty and varied, but, unfortunately, won’t include salvation.

Asceticism in programming comes in two flavors, roughly described by the slogans “doing more with less” and “less is more.” On the one hand, we seek to deliver the same functionality with fewer lines of code; on the other, we aim to identify essential features and discard the rest. I will refer to the former as conciseness and to the latter as frugality. We will discuss both as they apply to programming at different scales, from snippets of code to complete systems.

Beyond programming, I discovered unexpected examples of asceticism in statistics — where the preferred jargon is parsimony — philosophy, the visual arts, music, literature, and more. Outside science and engineering, the function of an artifact is less clear or undefined, and thus the boundary between conciseness and frugality becomes blurred. The equivalent of program length may or may not be available: for instance, it is, to a degree, for statistical models but less so for the visual arts. I hope you will find these more speculative forays interesting.

This book is a collection of cases or testimonials where brevity plays a positive role and where verbosity is the villain. The opposing view that verbosity is superior has also been allotted some space, as a counterpoint, without any pretense of equal treatment. Don’t expect a grand theory or a practical method, but rather a series of interconnected anecdotes and expert opinions, interspersed with the occasional theorem and research paper, some from giants in their respective fields and some from regular people working in the trenches. Let the case for asceticism build from the totality of the book rather than a single, decisive argument.

I hope that, by the end, you will be convinced that asceticism benefits engineering, science, the arts, and more. May this book inspire you to adopt unnecessary restrictions in your endeavors.

Who should read this book?

This book covers a wide variety of subjects, from computer programming to science to the arts. It doesn’t go particularly deep in any of them, but, reflecting my own background, it provides the most detail in the section about computing, less so in the one about science and only makes a cursory pass through the arts. I hope that anyone who has some familiarity with programming and mathematics will be able to enjoy most of the book. While there are plenty of internal connections, many of which I highlighted with cross-references, they are not essential for understanding. Therefore, skipping a subsection here or there won’t prevent you from enjoying the rest.

Reading advice

On devices that allow resizing the text view or the font, I recommend choosing settings such that the display shows a full line of code — up to 88 characters — without wrapping. Where supported, long lines will scroll horizontally, and only as a last resort will they wrap. Since indentation is meaningful in computer programs, and this book often discusses line counts, wrapped lines would reduce clarity.

For the most part, footnotes contain definitions of concepts or entities that may be unfamiliar to some. If you don’t feel like you need to read a footnote to improve your understanding, it is probably safe to skip it. Other footnotes provide license information for material reproduced in this book when appropriate.

Code snippets are for illustrative purposes only, are stripped of import or equivalent statements to access libraries, and are not guaranteed to be correct even after restoring those statements.

1 Computing

Figure 1: The console of an IBM 650 automatic calculator [1],1 see Sec. 1.6.

1.1 Metrics

A first step towards espousing conciseness as a goal in software engineering is convincing ourselves that lines of code, or LOCs, are a valuable measure of program complexity. Certainly, 100,000 lines of well-structured code are simpler to understand or maintain than the same amount of so-called “spaghetti” code, which contains many interactions between its parts. However, one of the pioneering figures in computer science, Edsger W. Dijkstra, posed the problem of size as a fundamental one in software engineering [2]:

[…] widespread underestimation of the specific difficulties of size seems one of the major underlying causes of the current software failure.

Elsewhere, Dijkstra warned again about the dangers of accumulating large codebases [3]:

[…] if we wish to count lines of code, we should not regard them as “lines produced” but as “lines spent”: the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.

Dijkstra’s preferred solution, namely structured programming, is intended to make it easier to deal with large codebases rather than to reduce code size (see Sec. 1.2.4).

Microsoft co-founder Bill Gates, according to Internet folklore, said:

Measuring programming progress by lines of code is like measuring aircraft building progress by weight.

In more recent times, this sentiment is still present among developers. For example, free software advocate Chris Warburton writes on Hacker News2 [4]:

Software is a liability, not an asset. We should produce as little as needed, and delete it if possible.

In the classic software engineering book The Mythical Man-Month, Frederick P. Brooks Jr. states [5]:

Productivity seems constant in terms of elementary statements, a conclusion that is reasonable in terms of the thought a statement requires and the errors it may include.

Albeit expressed in statements rather than lines, following the usage of the time, this supports the idea that LOCs are a reasonable measure of complexity. This book follows the common practice of counting LOCs. Yet this count is partially dependent on the vagaries of code formatting (see Sec. 1.2.7). An alternative is counting tokens, the elements of the representation of a program that a computer builds to run it. It is a formatting-independent measure, which we will adopt in one specific example (see Quicksort in Sec. 1.2.6).

We often run into invariants like the one above, expressed in terms of LOCs. For instance, John Lions, the author of a well-known guide to the Unix operating system (see Sec. 1.3.6), states [6]:

It has often been suggested that 1000 lines of code represents [sic] the practical limit in size for a program which is to be understood and maintained by a single individual.

While the exact value in this statement may be debatable and its origin unclear, what is noteworthy is that it uses LOCs as its unit of measure. Code structure, programming language, or application domain appear to be secondary.

No other quantity has emerged as a viable substitute for LOCs. Herraiz and Hassan review some of the alternatives and come to the following conclusion [7]:

All the complexity metrics are highly correlated with lines of code, and therefore the more complex metrics provide no further information that could not be measured simply with lines of code.

Among the alternatives is cyclomatic complexity [8]. It focuses on the flow of control in a program by viewing it as a graph. A node represents a sequence of statements devoid of control statements, and if and while statements correspond to the subgraphs in Fig. 2.


Figure 2: Graphs corresponding to if and while statements resp.

With this graph representation, the cyclomatic complexity of a program is one more than the circuit rank3 of its associated graph, that is, E - N + 2P, where E is the number of edges, N is the number of nodes, and P is the number of connected components. If you suspected this is a long-winded way to count the number of control statements, you would be right, unless we consider multiway branching statements like switch in C. It is possible to double the size of a program without altering its control structure — say, interleaving logging statements with the original ones — and, therefore, without changing its cyclomatic complexity. However, experimental studies show that it correlates well with program size, and those that used it to predict the number of bugs4 while controlling for program size had inconsistent results. A possible conclusion is that cyclomatic complexity isn’t so valuable “since program size is not a controllable feature of commercial software” [8]. However, program size, controllable or not, predicts crucial properties of software.
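To make the formula concrete, here is a minimal sketch, not from the cited paper, that computes both quantities for the if-statement subgraph of Fig. 2, represented by its edge list:

edges = [("cond", "then"), ("cond", "join"), ("then", "join")]  # the if subgraph
nodes = {n for edge in edges for n in edge}
P = 1  # the graph is connected: one component

circuit_rank = len(edges) - len(nodes) + P                # E - N + P = 1
cyclomatic_complexity = len(edges) - len(nodes) + 2 * P   # E - N + 2P = 2

print(circuit_rank, cyclomatic_complexity)  # 1 2

A single if statement yields a complexity of 2, matching the two paths through it.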

One of them is the number of bugs. According to Capers Jones, a consultant on software methodologies, it is more than proportional to program size, falling in the range of 0-25 defects per KLOC (a thousand LOCs) for programs smaller than 2 KLOCs, but rising to 4-100 for those over 512 KLOCs ([9]; as quoted in [10]).

However, even more important than the number of bugs is the probability of project failure, defined as substantially missing a project’s goals. For example, Stack Overflow and Discourse co-founder Jeff Atwood writes [11]:

Here’s the single most important decision you can make on your software project if you want it to be successful: keep it small. Small may not accomplish much, but the odds of outright failure — a disturbingly common outcome for most software projects — is [sic] low.

W. Wayt Gibbs, a science writer for Scientific American, explains [12]:

Software is exploding in size as society comes to rely on more powerful computer systems. That faith is often rewarded by disappointment as most large software projects overrun their schedules and many fail outright — usually after most of the development money has been spent.

1.2 Concise Software

1.2.1 Programming

I actually found a way to shorten his code by 1 word and it had only taken me 20 years to do it. […] it was the only time I’d managed to reduce some of Gosper’s code.5 It felt like a real victory. And it was a beautiful piece of code.

In this passage from an interview with Guy L. Steele Jr., inventor of the Lisp dialect Scheme, we read how the quest for conciseness transcended its practical origins and mutated into an aesthetic one [14]. Back in the day when Steele and Bill Gosper, also a founding figure for symbolic computations and Lisp, left their mark on computing, computer memory was so scarce that shortening a program even by one word could have been necessary. Twenty years later, the evolution of hardware and software had been such that there could no longer be practical motivations to shave off a word; only the intellectual ones remained.

Yukihiro Matsumoto, inventor of the programming language Ruby, agrees with the aesthetic qualities of concise code but also adds the language of ethics [15]:

Brevity is one element that helps make code beautiful. […] In the vocabulary of programming, brevity is a virtue.

Dan Ingalls, a key contributor to the implementation of the language Smalltalk, recalls that, as he learned the language FORTRAN, he “tried to see how much shorter I could make programs, and stuff like that,” suggesting that, for him, it was a learning or playful activity [14].

For an experimental study, Lutz Prechelt asked several mostly inexperienced programmers to solve a simple string problem in various languages, monitoring several properties of the solutions, including length. He concludes [16]:

Interpersonal variability, that is the capability and behavior differences between programmers using the same language, tends to account for more differences between programs than a change of the programming language.

Focusing on program length, this study finds that choice of language and programmer skill or style equally contribute to variation. Yet the point that matters for us is the wide variability in program length for the same task in the same language (see Sec. 1.2.9 for language-dependent conciseness). In this study, the programmers were allowed to code according to their preferred style. Would they have performed differently had they been asked to minimize length?

They may have, but it isn’t something they will often be asked to do in their careers. We have seen that code size was a priority in the pioneering age of computing, and it might still be when developing software for underpowered devices. However, I am not aware that code size currently is an externally imposed constraint on major platforms. An exception should be web development, since program download is part of the load time of a page. Yet this hasn’t prevented web pages from ballooning in size (see the “web obesity crisis,” Sec. 1.5). Not only is minimizing program length no longer a goal: in some situations, there has even been pressure to increase it. In an interview, former Microsoft CEO Steve Ballmer recalls how IBM required billing by the KLOC and that there was no way to convince them that this system discouraged the search for more concise solutions [17]. Once a task was estimated to require, say, 20 KLOCs, that put a cap on how much money the developer could make, and it became a target to hit.

Free from external pressure, do programmers aim to keep their programs short? Some argue in favor of slowing down, which is related to program length since time is a limited resource. For example, Poul-Henning Kamp, the author of the web caching program Varnish, estimates that he wrote what he intended to be his “highest quality program — ever” at the breakneck speed of 5 LOCs per hour [18]. If that sounds slow, the equivalent estimate in the previously mentioned Mythical Man-Month is much slower: 10 LOCs per programmer-day.6 Kamp adds that for life-and-death software, where correctness is paramount, such as that developed for human space flight missions, the same estimate drops to one LOC per programmer-day. Surely such a coding speed must discourage code bloat (but see Sec. 1.5 for bloated space software).

It isn’t only rocket science. Matt Lacey, a mobile developer and consultant, in a post titled “You’ve only added two lines — why did that take two days!” lists plenty of reasons why, at times, development may slow down to such a crawl [19]. He reports fixing a bug by first understanding it in detail, from which subsystems are affected to how to reproduce it. Then, unsatisfied with just making the issue disappear, Lacey endeavors to find its root cause, to determine whether the problem represents a class of related ones, and to fix it with a high-quality solution that won’t break his program elsewhere. Finally, he spends some more time preventing this or similar problems from happening in the future. Two days to perform all these tasks, even if the solution is only two lines, don’t seem like a waste of time.

Everyone in the software industry knows that slowing down is the opposite of how it works. There are entire methodologies aimed at creating arbitrary deadlines and fostering a fictional sense of emergency. Limits on size, on the other hand, are rarely, if ever, encountered. No boss ever told my teammates or me: “you could have done this with less code,” or “let’s try to keep it under 1000 LOCs.” However, I found a counter-example (see Dwm in Sec. 1.2.11 for an additional one). An anonymous Quora user asks if he should “fire someone who unnecessarily wrote 500 lines of code” without context [20]. We don’t know what these lines were supposed to achieve or if there was a pattern of verbosity in this engineer’s work. Rejecting the rush to summary judgment, another user replies:

As a programmer, I’ve written many, many lines of unnecessary code. Why? Because sometimes you have to write 500 unnecessary lines of code to get to the 50 absolutely necessary lines of code.

(for code deletion, see Sec. 1.2.2).

Going back to Kamp and his Varnish, he further writes:

Back when I was a kid I could write 1000 lines in a single sleep-deprived session across midnight, but it didn’t take that long before I discovered that I had to throw out most of it once I woke up again.
I was 40 years old when I started Varnish and I had 22 years of professional experience, a lot of them staring at, and often fixing/improving/refactoring, other peoples [sic] source code.

Patrick Burns, the author of Tao Te Programming, agrees that experience is a factor [21]:

Novices may think that the increase in the number of lines is good. More experienced programmers will know that the good parts are when the number decreases.

He goes one step further: instead of just accruing code slowly and carefully, Burns talks about reducing it (for code deletion, see the next section). As to bug fixes, Burns recommends making them as short as possible. He writes:

Change as little as possible. A one-character change is optimal.

One character is what would have saved the Mariner I probe from destruction (but it wasn’t in its control program, see Sec. 1.5). Burns brings this line of thought to its ultimate consequences and thinks “of programming as an exercise in compression.” While others use this language (see Steele’s “Huffman encoding problem” in Sec. 1.2.9 and Armstrong’s system compressor in Sec. 1.2.12), I believe the word “compression” has a more specific meaning in computing distinct from conciseness. Compressed data aren’t directly usable: they need to undergo a process of decompression that reconstitutes the original before being, for instance, displayed as images. Concise programs are run, read, and edited as they are, and there is no way to reconstruct a verbose version from which they may have been derived. A concise program is an independent entity, which can’t be said about compressed data, tightly bound to its uncompressed version. However, we could also take the other side of the argument: that decompressing a program is one of many transformations programs may undergo before being executed, such as macro expansion or translation into intermediate or machine language. The crucial difference between a compressed program and a concise one is that the latter is human-intelligible; it is the version developers read, modify and maintain. For compressed programs, that is the uncompressed version. A compressed program isn’t intelligible to developers, at least with known compression techniques.

It is unfair to complain that engineers write unnecessarily verbose code when they are not taught or rewarded for being concise. However, one company goes against the grain: Comma.ai, an AI startup, added conciseness to its requirements for aspiring employees [22]:

We value those who can do more with less code; software engineering doesn’t have an external selection pressure for low part count and cost like EE and ME, but you pay massive costs down the line in debugging and upkeep.

In an echo of “art for art’s sake,” James Hague, a video game developer, enthusiastically declares [23]:

Writing concise to-the-purpose solutions is a primary reason for programming in the twenty-first century.

If that is the future, it isn’t “evenly distributed yet” [24].

1.2.2 Deleting Code

If code isn’t written directly in its most concise form, a solution is to delete some of it. Of course, removing random lines won’t work: we need to identify useless ones, if there are any. “I love deleting code,” declares Nathan Marz, the author of the pioneering stream processing system Storm [25]. He isn’t alone. For instance, Chromium browser developer Devlin Cronin writes in a related bug-tracking thread [26]:

It [a proposed code change] has concrete benefits […] the ability to potentially delete a bit of UI code (something that always makes me happy!).

It may sound like an odd statement for someone whose craft is writing code. Artists have destroyed or mutilated their work in fits of disappointment [27] or as a performance piece [28] or have deliberately created ephemeral works [29]. However, it isn’t a widespread practice. In other branches of engineering, there is a pressure to reduce part count, and simplifying a design may be rewarding in and of itself. However, owing to a clear distinction between design and production, unnecessary parts come off a blueprint, not a finished product. Once removed, it is no longer necessary to make, procure or assemble them, and they don’t contribute to the cost or weight of an artifact anymore. In software engineering, once created, a line of code’s further costs are in the future and harder to estimate, in the form of debugging, learning, documentation, maintenance, and, as we have seen, even project failure. In my decades of experience with several employers and many more clients, we hardly ever tried to estimate those delayed costs. Therefore, the incentives to delete are weaker. However, good programmers know how much code they have to read and that someone has to maintain each line and so on. They understand the burdensome side of a line of code.

Know-how on deleting code is sparse: nobody taught me how to delete code. A post titled Advice on Deleting Code sparked my interest [30]. At 1600 words long, it doesn’t seem to convey much more than an appeal not to leave dead code7 behind, whether commented out8 or otherwise disabled or forgotten. That alone, though, is a good suggestion, as the following anecdote exemplifies.

I once worked on a codebase where a static analysis9 marked 1/3 of the code as unreachable. We supported this evidence with an analysis based on logging in production and for an extended time. Nonetheless, we couldn’t overcome resistance to deleting code that might still be useful to someone someday. This even after giving engineers veto power on removing any code for which they knew there was a use, and even after explaining that “deleting it” just meant “preserving it forever” in version control,10 from where it could be restored on demand. The unspoken truth was that few believed dead code to be harmful enough to warrant any risk of removing active code, thus creating bugs. The business side could not believe we were spending time on a non-customer-facing issue, and, in a way, they were right because nothing came of it.

Inexperience doesn’t help. Developer Neil Kakkar writes [31]:

I was very uncomfortable deleting shit or obsolete code. I considered what was written eons ago sacred.

Under the guidance of senior engineers, he reports having become more aggressive in deleting code but still recommends removing only dead code.

Some people even advocate the appointment of “code removal leaders.” For instance, in a Hacker News discussion about career prospects for senior developers, we read [32]:

I think the most valuable engineering leader would be someone who can remove code, or at least prevent unnecessary code from being written and maintained.

That reminded me of one of the best developers with whom I have ever worked. He had discovered another coworker was generating low-quality code in copious amounts, primarily by cut and paste. To address this, he spent the last half hour of each workday reviewing the other developer’s work, reverting all of his changes if they affected foundational modules. Luckily, the functionality to replace was minimal. This arrangement seemed to work well for everyone. What did this developer get from deleting all that code? His goal was to keep the amount of code he needed to understand and maintain in check. However, that isn’t the only valid motivation.11

Joe Armstrong, creator of the language Erlang, talked about “removing” as fundamental to how he went about his job [14]:

I would write a specific algorithm removing all things that were not necessary for this program. Whenever I got the program, it became shorter as it became more specific.

Reportedly, some can go beyond that. For example, Jon Bentley, inventor of the k-d tree data structure and author of the Programming Pearls series of books, reports that he “once heard a master programmer praised with the phrase, ‘He adds function by deleting code’ ” [15].

Others would like to fix bugs by deleting code. This tongue-in-cheek exchange between Michael B. Johnson, a software architect who contributed to many Pixar films, and Pete Barber, a software engineer, proposes this approach [33]:

In order to fix that bug I had to:
- have a good night’s sleep
- take a 2 hour hike
and then write one line of code.

deleting one line to fix it would have been even better

Fine. Ruin my little victory …

According to psychological studies, fixing by removing doesn’t readily occur to regular people, not just programmers (see Sec. 2.3). Only surgeons have made an art of it, owing to the redundancy in the human body and the scarcity of spare parts for it.

In another passage from [14], Armstrong goes beyond the practical:

If you start removing things, if you get to the point where if you were to remove anything more it would not work any more — at this point it is beautiful.

We will later read a similar statement generalized to all human artifacts and even natural forms (see Sec. 2.3). A crucial property that distinguishes an engineered object from a work of art is that the former serves a specific function: a program computes something or a building must let people in and out — unless it is a prison. In the arts, authors can push conciseness to nihilism, to the silent piece of music (see Sec. 3.3) or the empty canvas (see Sec. 3.2.3). Actually, these extremes can be reached by programmers as well, on occasion. This is the description for JavaScript module Noop2 [34]:

No operation as a module™. The long awaited successor to noop. Noop2 does even less than its predecessor, and is 100% tested to guarantee nothing happens.

Its author, Yoshua Wuyts, considers it “both a joke and actually useful,” reportedly to write tests.

1.2.3 Repetition

The most obvious opportunity for deletion is repetition. With physical artifacts, it is a form of order: the pillars of a bridge, its equally sized spans, the columns of a church, and so forth. Repetition brings about economies of scale, simplifies maintenance, and provides visual rhythm. On the contrary, it is hard to justify in computer programs since it is avoidable. There are many ways to eliminate repetition: for instance, loops, to execute the same code multiple times; and functions or methods, to name a snippet of code and recall it by name. These can optionally have parameters, like their mathematical brethren (see Sec. 1.2.4). Detecting repetition doesn’t require advanced skills or creativity: it is simple and mechanical. So much so that the code analysis service Code Climate includes an analysis tool to detect duplicated code [35]. Even if, according to their website, it “uses a fairly simple algorithm,” it detected a few almost identical duplications that had snuck into my own code. Unfortunately, their removal and the testing of the resulting code are still the user’s responsibility.

Indeed, repetitions need not be identical, but that is a fuzzy statement. Where does the quasi-repetition stop, and where does the “vaguely related code” start? In a simple case, we may find a simple parametrization that unifies all the repetitions under consideration. In a complex one, the corresponding abstraction may or may not be expressible in the language of choice, or it may require highly unsafe constructs whose use is discouraged (see Sec. 1.2.9.2). Another observation is that it is easy to generate duplications even with the best of intentions. Even if we aren’t maliciously wielding the cut-and-paste function in the editor to exceed our KLOC targets (see IBM in Sec. 1.2.1), a repetition could result from the convergence of two similar ideas. We wrote the first copy too long ago to remember, or it is different enough that recognizing it as duplicative requires careful analysis. Even when identification is easy, there may not be time to eliminate repetition because of the work required. If that entails writing a new function, we may also need unit tests,12 because of coding standards, unlike copy-and-paste code. Therefore, the elimination of the repetition must wait. However, writing a ticket13 to mark some duplication we left unattended generally isn’t a career boosting move, and those tickets never get taken care of regardless. The repetition lives to see another day and we move on to more glorious endeavors, like launching a new feature. Duplications accumulate in codebases like in genomes (see Sec. 2.4), slowly drifting apart over time.

Repetitions aren’t without drawbacks, though. Burns issues the following warnings [21]:

If you want to make a change to the repetitious code, then you have to change every occurrence. And before you can change them, you have to find them. If the change is a bug fix and you miss one, then you have just created a pernicious, hard-to-find bug. (“I already fixed that bug, it has to be somewhere else.”) If the change is a modification to the behavior and you miss one, then you have just created a pernicious, hard-to-find buglet. Buglets can grow into bugs.

(Burns doesn’t define a buglet, but, if I understand it correctly, it is a software defect that hasn’t yet manifested itself, a potential bug). That is why many programmers embrace the so-called DRY principle [36]. In the words of Matsumoto [15]:

Redundancy is defined as the duplication of information. In order to eliminate redundancy, we follow the DRY principle: Don’t Repeat Yourself. The concept of DRY is the antithesis of copy-and-paste coding.

Despite Matsumoto’s strong statement, there is nothing in Ruby to enforce this principle, nor in any other language I am aware of. I can only speculate why that is the case. The problem may be deciding what repetitions are worth removing, particularly how small and after how many repetitions. Is printing a new line worth a special construct? Is println() better than print("\n")? Or is x++ better than x = x + 1? Different languages give different answers to these questions. If you had to print the same message n times, for what value of n is the following solution

print(message)
print(message)
...
print(message) # n times

better than this alternative:

for i in range(n):
  print(message)

I know which one most people would choose for n = 1 or n = 10, but, for intermediate values, it is anyone’s guess.14 Without the lens of conciseness, these choices are seen as debatable and even just a matter of style.

Another obstacle to enforcing the DRY principle may be the issue of approximate repetition. Banning only identical copies would represent a modest level of enforcement, while extending the proscription to approximate ones would face the problem of choosing which ones to remove. Moreover, if we replaced two code snippets with a function f(x), invoked with a different value of x in place of each one, we would need to prove, or at least corroborate, the equivalence of those snippets to f. The equivalence problem is, in general, extremely difficult,15 but less so when the original snippets are similar.
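For a purely illustrative example of the simple case, consider two near-duplicate filters that differ only in the attribute examined and in a threshold; a single parametrized function covers both, and the assertions are a modest way to corroborate the equivalence (the data and names are invented for this sketch):

def large_orders(orders):
    return [o for o in orders if o["total"] > 1000]

def old_orders(orders):
    return [o for o in orders if o["age_days"] > 30]

# The unified, parametrized version: one function replaces both snippets.
def filter_orders(orders, key, threshold):
    return [o for o in orders if o[key] > threshold]

orders = [{"total": 1500, "age_days": 10}, {"total": 200, "age_days": 45}]
assert filter_orders(orders, "total", 1000) == large_orders(orders)
assert filter_orders(orders, "age_days", 30) == old_orders(orders)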

However, the simplest explanation may be that code quality is considered outside the purview of compilers and interpreters and something that can only be enforced and improved by programmers. Nonetheless, there are separate tools, such as CPD, the aptly named Copy-Paste Detector, to detect duplications [37]. In my experience, their use could be more widespread in the industry.

Sometimes the obstacles preventing this process are in our skills and the complexities of the code: finding repetitions and unifying them. At other times, they are a question of priorities and goals. In an interesting anecdote, Dan Abramov recalls how he identified repetitions in an otherwise working piece of code [38]. After refactoring it, this is how he described the results:

The code is half the total size, and the duplication is gone completely! So clean. If we want to change the behavior for a particular direction or a shape, we could do it in a single place instead of updating methods all over the place.

The remarkable twist is that he had to revert his elegant solution under his boss’ orders and became convinced the messy code with the duplications was better, in what may be an engineering version of Stockholm syndrome. The reasons he gives range from the social — he didn’t ask permission — to concerns about over-engineering — the clean solution could not support some future hypothetical use cases (see YAGNI in Sec. 1.3.1). Adding and removing code aren’t treated as two equally important sides of programming: one is considered an ordinary, day-to-day activity; the other, an exceptional, dangerous one requiring broad consensus.

1.2.4 Abstraction

When talking about approximate repetitions, we saw that we could sometimes replace two blocks of code with one, possibly shorter than the sum of the original two, which performs the function of both. To achieve this, we need to unify their commonalities and reduce their differences to a small fraction of the code or, even better, parametrize them. This process is a special case of what is known as abstraction. As Dijkstra said in his Turing Award Lecture [39]:

[…] the only mental tool by means of which a very finite piece of reasoning can cover a myriad cases is called “abstraction”; as a result the effective exploitation of his powers of abstraction must be regarded as one of the most vital activities of a competent programmer.

It is natural to think that a piece of code that covers “a myriad cases” would spare us from the effort of coding each case one by one, thus reducing code length by a corresponding factor. For instance, C. A. R. Hoare, Turing award recipient and inventor of the Quicksort algorithm, declared: “The way to shorten programs is to use procedures”.16

However, that isn’t the mechanism proposed by Dijkstra. In the same speech — indeed, the same paragraph — he advocated using abstraction to reduce the “intellectual effort” required to write a program of a given length, not to reduce its length. He didn’t formulate this recommendation based on empirical data or a mathematical theory. He started instead from the common-sense observation that if all parts of a program interact, this results in many interactions that need to be understood, debugged, maintained, etc. Their number grows with the square of the size of the program. On the other hand, if each part interacts with a small number of others, this results in a much more manageable linear number of interactions.

Dijkstra appears to assume that abstract code results in programs of the latter type, but this is far from obvious. On the contrary, a more general function creates indirect dependencies between parts of a program and, thus, more of the entanglement that Dijkstra suggests reducing. Taking sorting as an example, let’s assume we have two independent programs, one that needs to sort arrays of integers and the other of strings. In the first scenario, each program uses its specific sorting function, and there is no other code-sharing between the two. In the other, they rely on a shared library function that can sort any collection. This sharing constitutes an indirect dependence between the two programs. If changes in the sorting code were to become necessary because of the needs of one program, a bug could appear that cripples the other one.

If keeping the two programs completely independent is paramount, an option is duplicating the sorting function, no matter how general, and letting each program depend on its separate copy. When involving entire libraries, developers call this forking the shared dependency (from the idea of a fork in the road). Although this results in more code to maintain, developer, blogger, and speaker Thomas Figg advocates temporarily accepting repetition, to prevent coupling or to let a better abstraction emerge from examples ([41], see also “interpolation” below).

There are other ways to ensure isolation between parts of a program besides duplication. For instance, modules or functions which are clearly defined and independently tested minimize the coupling risk. In my previous example of modifications to a sorting function, they would be unusual for a task that is so clearly specified. Because of this, such a function is simple to test, and extensive testing reduces the likelihood of introducing bugs when modifying it. Therefore, sharing a sorting function between modules doesn’t induce any coupling worth worrying about. Indeed, such a function is available in any language I can think of as a built-in or standard library function, and few developers would bother creating their own.
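As a minimal sketch of that kind of testing, not drawn from any of the cited sources, a sorting function with such a clear specification can be checked against the built-in sorted on many random inputs:

import random

def test_sort(sort, trials=1000):
    # The specification is so clear that an oracle is readily available:
    # compare against the built-in sorted on random inputs.
    for _ in range(trials):
        xx = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
        assert sort(list(xx)) == sorted(xx)

test_sort(sorted)  # trivially passes; any drop-in replacement should too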

A program built primarily out of functions as clearly specified as a sorting function behaves more like several small programs than a single large one. All its components will still run on one machine, sharing its resources and crashing all at once if only one is faulty. However, from the point of view of development, the different parts are more swappable: we can replace a buggy function with another; a slow implementation can give way to a faster one. Microservices take decoupling to the next level by running parts of a program or application in isolation.17 This way, a microservice can fail without dragging down the rest, and graceful degradation is possible. At an extreme scale, that is why the Internet works despite being based on, possibly, billions of LOCs:18 it can keep working even if at any given time parts of it aren’t — there are nonetheless occasional large-scale failures [42].

In summary, conciseness and modularity are two sides of coping with size: if size can’t decrease, by splitting code neatly into modules, we can deal with smaller chunks at once.

If writing more general code is harder and introduces more dependencies, does it offer any advantages? Well, what is better than covering “a myriad cases” (see Dijkstra’s talk above) with a single coding effort? “Wherever there is repetition, there is an opportunity for abstraction,” states Burns [21]. For example, imagine having a sorting function for each type of collection and then for each element type: one for arrays of integers, one for lists of strings, and so on. That would result in a large amount of repetitive code. Indeed we will see an example where a library (in java.util.Arrays, see Sec. 1.5) contains seven copies of essentially the same functions because the language hampers further generalization. Moreover, bugs can lurk for a very long time. For example, an implementation of binary search,19 generally accepted as correct, hid a bug for 20 years after being published [43]. Its Java standard library version contained the same error for nine years [44]. Imagine the effort to fix this bug if every project maintained its version of binary search.

Abstracting over different data types is only one instance of a general phenomenon. Often code can grow as an accumulation of special cases until we realize a general rule or solution that can replace them and covers additional ones as a bonus. Using a statistical analogy, we are replacing individual data points with a model and predicting new ones as a result (for statistical modeling, see Sec. 2.1).

Rich Skrenta has observed something similar [45]:

[…] more code often means less flexibility and functionality. This is counter-intuitive, but a lot of times a simple, elegant solution is faster and more general than the plodding mess of code produced by a programmer of lesser talent.

I had a stark demonstration of this when I was still in college. As part of standard computer science coursework, I was working with other students on implementing a toy programming language. Our instructor asked us to use a modular architecture that subdivided its compiler20 into multiple interacting programs, each assigned to a sub-team. After preparatory meetings and some development, we all met to check that all parts worked together. One of them seemed to work on the simplest programs, but it failed at the first hint of nesting — a for loop inside another for loop, for instance. The student responsible for it promised to fix it and meet us again in a couple of days. At the following meeting, his program failed again, this time on two levels of nesting. This sequence of events repeated itself a few times, with his program producing errors each time on slightly more complicated examples. Concerned about the fate of our project, we asked if we could review his source code. We found that he treated the no-, single- and double-nesting cases separately. He had written a large amount of code to cover all the examples we had provided. He had yet to use recursion21 to deal with unlimited nesting, as required by the definition of the language. It looked like he thought that, in practice, the programs to be compiled wouldn’t be too complicated and that his approach would be enough. The result was a program that had to grow in size proportionally to the complexity of the examples it was required to handle. In this respect, it had something in common with applications whose complexity is proportional to the number of screens to display; or machine learning models, whose complexity grows with the size of the training set (for more on this, see Sec. 2.1.4). Only much later in my career did I realize that my teammate22 was anticipating the future of the software industry:

  1. he was pioneering test-driven development;
  2. he focused on his customers’ requests, with his teammates being the customers;
  3. he maximized customer value — higher levels of nesting are used less often;
  4. he met his self-imposed rapid-fire deadlines;
  5. he wrote code quickly.

One obstacle to achieving the level of generality of a sorting function or a compiler is that not all abstractions emerge with the same level of clarity. For example, sometimes we create a do_stuff function that we reuse somewhere else before we understand what “stuff” exactly means. One consequence is that it keeps changing as we modify code elsewhere. It may also be harder to test in isolation from the contexts in which it runs. These are the telltale signs of a poor abstraction, but they are seldom perfect from the start. Sometimes the correct generalization emerges over time after seeing it applied in different contexts, akin to interpolation in mathematics. Marz writes [46]:

I view the development of beautiful abstractions as similar to statistical regression: you have a set of points on a graph (your use cases) and you’re looking for the simplest curve that fits those points (a set of abstractions).

The alternative, letting repetitive code proliferate and fixing it at a later time, has its drawbacks, as we have seen.

Another problem is that the connection between abstraction and conciseness isn’t considered fundamental. We have seen how Dijkstra thought abstract code reduced the intellectual effort required per line written, not the number of lines in a program. Furthermore, he didn’t look kindly at attempts to solve problems with a single line of code (see “one-liners” in Sec. 1.5). In my training, different abstraction techniques were touted for their potential to reduce coupling and not for any conciseness advantage. Without a measurable goal, the best level of abstraction becomes a matter of opinion. However, there is some agreement that it can be taken too far (see the LinkedIn example in Sec. 1.2.9.4), for instance, creating hierarchies of classes23 where most classes are never instantiated.

A few pay attention to the conciseness potential of specific programming techniques. For instance, software engineer and blogger Nikolay Grozev wrote an introduction to monads, the advanced programming technique inspired by category theory [47]. First, he develops an example that he deems “far from perfect, as we have repeated ‘glue code.’ ” Later, after applying monads, he observes:

This [using monads] avoids the shortcomings of our previous approach because the bind function implements all the glue code and we don’t have to repeat it.

The crucial point is ensuring that abstractions can be combined seamlessly, providing appropriate combinators. Only then can we reap the full benefits of creating those abstractions. An example could be modifying a module for arithmetic calculations, in which a division by 0 terminates the program, so that it instead generates a special error value, say "err". This module features four functions, add, sub, mul, and div, for the four basic arithmetical operations. Since a division by 0 can only occur during the execution of div, we start by modifying that function:

def div(num, den):
  return "err" if den==0 else num/den

The problem is that the other functions don’t know how to deal with the value "err", so we can’t just write add(1, div(1, 0)) to calculate 1 + 1/0. The intended behavior is that any operation involving "err" also results in the same value. One approach is to check every time we use div:

a = div(1, 0)
b = "err" if a == "err" else add(1, a)

This approach is verbose and error prone. Another option is to modify all four operations to deal with "err" appropriately. However, the “monadic” approach is to define a function, traditionally named bind, to modify the evaluation of the operations:

def bind(op, x, y):
  return "err" if x == "err" or y == "err" else op(x,y)

This way, we can write complex expressions as follows:

bind(add, 1, bind(div, 1, 0))
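To see it run (add isn’t defined above, so a minimal definition is assumed here; div and bind are the ones just shown):

def add(x, y):
    return x + y

print(bind(add, 1, div(4, 2)))        # prints 3.0
print(bind(add, 1, bind(div, 1, 0)))  # prints err: the error value propagates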

Another example from functional programming24 circles relates to higher-order functions, which accept as arguments or return other functions, and are available in many popular languages but aren’t used widely in practice. Software engineer and blogger Gio Carlo Cielo guides us through a complete example in which “our code verbosity has been significantly reduced through the use of the higher-order functions” [48]. A minimal illustration of their use could be applying a function to each element in a list. Without higher-order functions, this is what we would write in Python:

def f(x):
    ...  # some definition for f

result = []
for x in range(10):
    result.append(f(x))

With the standard (higher-order) function map, the same example becomes:

map(f, range(10))

This version is simultaneously more concise and more abstract. We obtained a more concise version by hiding details like creating an empty list and appending elements to it. Python offers an alternate form for this task, list comprehensions, using which our example becomes:

[f(x) for x in range(10)]

This code doesn’t eliminate the loop variable x and is limited in its generality. For instance, there is no list comprehension for the fold or reduce higher-order function typical of functional programming. The likely reason is that it is the element of functional programming that Python’s author Guido van Rossum has “always hated the most” and for which he needs to “grab pen and paper to diagram what’s actually being fed into that function before I understand what the reduce() is supposed to do” [49]. However, this issue is wider-reaching than reduce itself and van Rossum’s foibles: it is whether languages should provide the most common abstractions or the tools to build them. Higher-order functions are in the second group. For instance, extending map to a data structure such as a tree is straightforward, as we will see shortly.
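Before turning to trees, a small side-by-side, purely illustrative and not part of the cited discussion, may make the objection concrete; both versions compute a sum of squares:

from functools import reduce

# with an explicit loop
total = 0
for x in range(10):
    total += x * x

# with reduce, folding the accumulation into a single expression
total_reduce = reduce(lambda acc, x: acc + x * x, range(10), 0)

assert total == total_reduce == 285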

Let’s start with some definitions. A tree is made out of nodes. A relation exists among them such that some nodes are parents to other nodes, called children. An ancestor to a node is either parent to that node or an ancestor to its parent. An ancestor is connected to its descendant nodes through a unique path. Finally, there must be a node that is an ancestor to all other nodes, called root (see Fig. 3).


Figure 3: A tree.

In Python, let’s represent a node with the list of its children and a tree with its root node. For instance, [[],[]] represents a tree with three nodes, one the parent of the other two. Since trees are most useful when holding some data on each node, let’s say the first element in the list isn’t a child but such data, for instance [1,[2],[3]], where the node holding 1 is the root, which is parent to the other two nodes, containing 2 and 3 respectively. With these definitions and representation conventions, we can define a map for trees as follows:

def map_tree(T, f):
  return [f(T[0])] + list(map(partial(map_tree, f=f), T[1:])) if T else []

With two lines25 of code, we have extended the concept of a map function to trees.26 There is no such possibility for list comprehensions. They are a useful pre-defined abstraction but aren’t amenable to being extended to other data structures.
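As a quick usage check, assuming the omitted from functools import partial is restored, we can apply map_tree to the tree introduced above:

tree = [1, [2], [3]]  # root holds 1; its two children hold 2 and 3
print(map_tree(tree, lambda x: x * 10))  # [10, [20], [30]]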

There isn’t anything special about functional techniques other than that they are powerful tools for abstraction and, therefore, also for conciseness. Moreover, I wouldn’t want to generate the impression that abstraction is the result of applying sophisticated techniques or adopting esoteric languages, whereas, for the rest of us, there is copy-paste-modify. Sharing a complex method between two classes by introducing a common parent class also helps conciseness. As we have seen above, Hoare found procedures already fit for this task.

1.2.5 Rewrites

A rewrite is a controversial and radical step in software development: take what you have built, throw it away and start from scratch. Rewrites come in different sizes, from those targeting a complete application to those affecting just a few functions. They have been described as company-destroying mistakes [50] or a normal phase of development [5]. Opinion dominates, and I don’t have much to add to this debate but to consider what it means for conciseness. Rewriting some portion of a program isn’t guaranteed to result in less code. Famously, Netscape discarded its Navigator code base for various reasons, one being its size [51]. A prototype released to showcase the new rendering engine Gecko was about a tenth of the size of contemporary browsers and could fit on a single floppy disk — 1.44 MB. It had swollen to 17 MB by the time the 1.0 version was released, negating any size advantages.

However, it doesn’t have to end that way. Instead, we will see how switching to a higher level language or one better matched with the application can result in radical code savings, for instance, in a 60-KLOC rewrite performed at LinkedIn (see Sec. 1.2.9.4). In that case, the rewrite was management initiated. In other ones, rewrites start from the grassroots. For example, Apple developer Bill Atkinson used to report a negative number of lines contributed in a weekly form that he had to submit, including a “-2000” in the week in which he had rewritten some routines in QuickDraw to be six times faster. Eventually, he wasn’t asked to fill the form anymore [52].

Another example involves Bret Taylor, co-inventor of Google Maps [53]:

[Taylor] was frustrated that the JavaScript was really slow. It would take forever to load. So, one weekend he said, “I’m going to rewrite it.” And he rewrote the entire thing, making it 10 times as fast and a third of the size. He completely threw away the code that a team of engineers had been working on for months.

Unfortunately, we don’t know if the same result could have been achieved incrementally or by giving the right incentives to the original team.

An additional example is a cryptocurrency named XLM, initially based on a fork of a pre-existing one, XRP. Its developers were having doubts about the security of XRP and decided there were better options, eventually opting for the Stellar Consensus Protocol. However, they did not merge the new protocol into the forked code: instead, they “created an entirely new codebase designed with safety, minimalism, scalability, interoperability, and multiple implementations in mind,” elevating minimalism to a top-level concern [54].

In some instances, the impetus for a rewrite comes from a shift in hardware technology that makes an existing implementation obsolete. For example, the trend towards using GPUs27 for numerical computation prompted a complete rewrite for an application named Weather Research and Forecasting (WRF; [55]):

WRF is well over half a million lines of code, mostly rooted in Fortran. This new model is based on the same ideas but with less code. Still, the effort to get this up and running on Power9/Volta GPU systems took two and a half years.

We will see more of this as heterogeneous computing28 spreads to all types of devices, from supercomputers to smartphones, prompting partial or complete rewrites [56].

dogpile.cache is a generic API29 that can use different caching30 backends. It is a rewrite of a similar product, Beaker, by the developer of both, Mike Bayer, who explains [57]:

All the ideas of Beaker which “work” are re-implemented in dogpile.cache in a more efficient and succinct manner, and all the cruft […] relegated to the trash heap.

Here we have an example of a rewrite that doesn’t involve management taking a new direction (see the LinkedIn JavaScript rewrite in Sec. 1.2.9.4) or a programming superstar showing his mettle (Taylor, see above). Instead, we witness a developer realizing that an incremental path from his existing code to where he wanted to be was less practicable than a fresh start. It is also an example where succinctness rises to top-level concern instead of being a happy accident.

1.2.6 Algorithms

Algorithms are generally small enough that their code length isn’t an issue; length is more of a concern for libraries, applications, and systems. With algorithms, we are more interested in correctness and efficiency. However, that doesn’t mean keeping the line count low isn’t a nice extra. For example, Daniel Lemire, inventor of the xor filter, a replacement for bloom filters31 and other related algorithms, asks [58]:

Could we do even better [on speed and memory usage] while limiting the code to something you can hold in your head?

His xor filter reportedly marks an improvement on traditional measures of algorithm efficiency (speed and memory usage) while also keeping it relatively short:

The complete implementation in Go fits in less than 300 lines and I did not even try to be short. Indeed, any semi-competent Go coder can make it fit within 200 lines.

While I doubt I can memorize 300 LOCs, it is 50 lines shorter [59] than a bloom filter implementation by the same group [60].

Moving onto a more run-of-the-mill Quicksort, a mainstream sorting algorithm, there are different ways to implement it. Angel Sola Orbaiceta, a software engineer and book author, experiments with object-oriented and functional paradigms, using Java and Clojure [61]. He finds out that “functional code is much more readable and compact” for the same task and believes “the more code there is, more time and effort has to be put in maintaining the codebase and more bugs tend to appear.” Specifically, skipping empty lines or those with just a bracket or parenthesis, he ends up with 27 lines for Java and 9 for Clojure.

I tried to replicate his experiment in Python. Quicksort is an application of the divide et impera approach:

Pick an element, called the pivot; partition the remaining elements around it, smaller ones on one side and larger ones on the other; then sort each split recursively. When a split is empty or only has one element, return it as is. Some subtleties in how the partition is performed create the opportunity for some variants. Namely, the Quicksort function (abbreviated to qs in the following code blocks) should never invoke itself on its own input, or the algorithm won’t terminate (and will run out of memory in practice). We can achieve this by excluding the pivot or all elements equal to the pivot from further sorting. Another choice is either to sort in place — moving entries around in a single array — or to create new arrays as needed: the former is more memory-efficient, the latter quite a bit simpler.

The following version performs the sorting in place:

def qs(xx, lo = 0, hi = None):
    hi = len(xx) - 1 if hi is None else hi
    if (hi <= lo) or lo < 0:
        return
    p = partition(xx, lo, hi)
    qs(xx, lo, p - 1)
    qs(xx, p + 1, hi)

def partition(xx, lo, hi):
    pivot = xx[hi]
    i = lo - 1
    for j in range(lo, hi):
        if xx[j] <= pivot:
            i = i + 1
            xx[i], xx[j] = xx[j], xx[i]
    i = i + 1
    xx[i], xx[hi] = xx[hi], xx[i]
    return i

It takes 17 lines, which compares favorably with the 27 lines of Java in Orbaiceta’s blog post. However, on closer inspection, his version isn’t in-place. Removing that constraint, we can produce this version:

def qs(xx):
    if len(xx) <= 1:
        return xx
    return (
        qs([x for x in xx if x < (pivot := xx[0])])
        + [x for x in xx if x == pivot]
        + qs([x for x in xx if x > pivot])
    )

This code is just seven lines, ignoring the line with just a parenthesis, and it is also straightforward. There is no index accounting to master, no swapping of elements, just plain and simple divide et impera. However, the ascetic programmer can’t help but notice a certain repetitiveness in the three list comprehensions that perform the partition. The only element that changes is the comparison operator used. I looked unsuccessfully for a library function combining these steps, and defining our own would take too many lines. We can still factor out the filtering operation:

fo = lambda op, xx: [x for x in xx if op(x, xx[0])]

where op is one of <, >, ==. We can’t pass these operators around as parameters in their most familiar form, but there are equivalent functions in the standard package Operator: lt, gt, and eq, respectively. With this parametrization, we can write qs as follows:

qs = lambda xx: xx if len(xx) <= 1 else qs(fo(lt, xx)) + fo(eq, xx) + qs(fo(gt, xx))

I used a short name for fo to fit qs onto one line (formatted with a popular tool, Black, which uses an 88-character bound). For the same reason, I used lambda definitions instead of def. Those tricks are possible because line count is a blunt instrument. If we used token count, a measure of program complexity insensitive to the vagaries of formatting or naming (see the next section), those tricks would not pay off. Indeed, it is easy to perform a token count with the help of package Ast (for abstract syntax tree, a data structure used to compile programs):

def token_count(source):
    return len(list(ast.walk(ast.parse(source, "source", "exec"))))

According to this function, the in-place Quicksort contains 200 tokens, whereas the shorter versions are 75 and 74 tokens long, respectively. Remarkably, the two-line version wins by just one token, owing to its “density.” It is natural to ask if we can go any lower than this. It is possible, indeed, by eight tokens. To achieve this feat, we need to define the partitioning filter as a function of one instead of two arguments. This change requires its definition to be inside the scope of xx:

def qs(xx):
    if len(xx) <= 1:
        return xx
    filter_xx = lambda op: [x for x in xx if op(x, xx[0])]
    return qs(filter_xx(lt)) + filter_xx(eq) + qs(filter_xx(gt))

It makes sense to limit the arguments of a function to those that change within a scope of interest: what else could we be filtering other than xx? What may look like a silly search to shave off a few additional tokens — a pursuit that Dijkstra thought to lack “any conceptual relevance” (see Sec. 1.5) — spurred me to tighten or specialize the filtering abstraction to the needed level of generality, and nothing more than that. There may be a performance penalty for defining a function inside a recursive call, certainly in Python, but we are not trying to optimize performance in this book, and occasionally we aim for the opposite (see the Busy Beaver numbers in Sec. 1.2.11.1).

Code length is also used to evaluate the effort needed to prove the correctness of algorithms (see Sec. 2.2). For instance, Martin Kleppmann, author of Designing Data-Intensive Applications, implemented, in 60 LOCs, a distributed data structure to support group editing, but the proof of its correctness took 800 lines of math, for a proof-to-code ratio of over 13:1 [62]. Kleppmann maintains it was worthwhile not only to raise confidence in the correctness of a less-than-intuitive algorithm but also to increase its understanding. Nonetheless, the large size of the proof highlights an obstacle to applying formal methods in software engineering. Another barrier is that proofs can only show that an algorithm satisfies its stated requirements; whether those requirements reflect the real world is a separate question. Even so, formal methods are valuable in high-stakes applications such as the control software for the French railway network [12] or the S3 storage system [63].

1.2.7 Notation and Formatting

If recognizing and implementing abstractions is considered good programming, abbreviating identifiers or skipping white space is generally frowned upon. Identifiers, as we were taught in school, should be self-explanatory; white space should be used liberally to impart visual structure to the source code.

An exchange from a Hacker News forum shows that at least some developers, including database architect Andres Freund, challenge the orthodoxy [64]:

Usage of acronyms is one of the worst offenders in bad code.

Not really on board with that … It’s a balance. Brevity does have its value too.

The trade-off is memorizing the abbreviations in exchange for dealing with more compact and readable expressions. If a long variable name works as documentation, is having to read it every time you run into a variable a good idea? Whether or not a program uses acronyms, identifiers’ definitions and related documentation must be easily accessible, usually through features of an IDE.32 When a program deals with complex concepts, the idea that even long identifiers can be self-explanatory is unrealistic.

We should look at mathematics as a model. Variables are generally single letters. Mathematicians and physicists develop specialized notations to make expressions more compact. About Einstein’s general relativity, the biographer Walter Isaacson writes [65]:

Einstein’s final equations used the condensed notations of tensors to compress sprawling complexities into squiggly symbols and subscripts.

Clearly, “sprawling complexities” aren’t conducive to readability!

Charles Babbage, inventor of the first mechanical calculator, seemed to agree (as quoted in Kenneth Iverson’s Turing Award lecture [66]):

The quantity of meaning compressed into a small space by algebraic signs is another circumstance that facilitates the reasonings we are accustomed to carry on by their aid.

Even newlines should not get a free pass. In a thread on GitHub, a discussion developed between supporters of a format with shorter lines, allegedly more friendly towards text-oriented diff tools,33 and supporters of a more compact one. User tweakimp writes [67]:

I think you can call it [the choice of format] more than preference because if there is no other difference, less [sic] lines is [sic] better.

In hugovks [sic] example the second version is shorter, easier to read and should be preferred in my opinion.

Optimizing code to be read by primitive comparison tools sounds like a bad idea, independent of conciseness. In the seminal book Structure and Interpretation of Computer Programs (SICP [68]), we read that “programs must be written for people to read, and only incidentally for machines to execute.” Even more so, we shouldn’t be writing them to ease the work of primitive diff tools but switch to better ones, like those that analyze the structure of programs.

1.2.8 Documentation and Comments

When it comes to documentation, the more documentation, the better, correct? Actually, who enjoys reading it? Documentation is a cost like code is. It needs to be written and maintained, developers must read it, and it silently becomes outdated.

Michele Simionato, author of several Python packages, reported this about halving the size of the documentation for one of them [69]:

[dropping Python 2] is a huge bonus, since I could remove over 2,000 lines of duplicated documentation/doctests. Having to maintain separate docs for Python 2 and Python 3 effectively stopped any development on the module for several years.

What may be hard to understand with large teams becomes evident for a one-person volunteer project.

Can we eliminate documentation? Joshua Bloch, known for his work on Java, thinks so. About API design, he writes [70]:

APIs should be self-documenting: It should rarely require documentation to read code written to a good API. Indeed, it should rarely require documentation to write it.

Early drafts of APIs should be short, typically one page with class and method signatures and one-line descriptions. This makes it easy to restructure the API when you don’t get it right the first time.

It may look like a lofty goal, but Bloch was likely thinking about some of his work on the Java standard library. Building on fundamental concepts in software engineering, which all developers should be familiar with, alleviates the need for documentation; simple and consistent API design should do the rest. Does the Java standard library meet these goals? For instance, in module java.util.Arrays we find entries such as [71]:

static int binarySearch(byte[] a, byte key)
static byte[] copyOf(byte[] original, int newLength)
static byte[] copyOfRange(byte[] original, int from, int to)

Their behavior should be intuitive. However, why does copyOf need a newLength argument? Shouldn’t a copy be length-preserving? And what happens if newLength is larger or smaller than the original? Luckily, the API reference provides a simple one-line explanation that copyOf is more than a copy: it will happily stop after copying only newLength elements, or continue, padding the new array with 0s until newLength has been reached. The padding value isn’t configurable and depends on the type of the original array — copyOf is overloaded34 for different types. Scanning the different forms of copyOf I found

static <T> T[] copyOf(T[] original, int newLength),

an entry which uses Java generics to parametrize its parameter and return types, for which the padding value is null. So what if T equals byte? Will copyOf pad its result with zeros or ‘nulls’? The answer is that Java generics cannot be instantiated with primitive types, and therefore T can never be instantiated with byte [72]. Thus the apparent ambiguity is resolved. It gets more confusing if we consider this additional form of copyOf:

static <T,U> T[] copyOf(U[] original, int newLength,
                        Class<? extends T[]> newType)

which represents a copy with conversion from type U to type T. What happens if you need to perform the same operation from or to primitive types? I would check with a Java expert before giving up, but I am afraid you will have to write 49 more functions, one for each pair of primitive types.35 This limitation stems from the design of Java generics, as does the quantity of almost identical copyOf entries for each primitive type, and the API developer must adapt (see java.util.Arrays in Sec. 1.5). However, the documentation leaves the user helpless here, suggesting a consistency in the library that is only aspirational.

In a thread about the “best code bases,” Hacker News user taspeotis has this to say about a code base touted for its line-by-line comments:36 “Every line? Oh dear” [73]. He and other users follow up by sampling the uninformative comments, which add nothing to what is already obvious from the code:

/* Set verbose flag */
  verbose++;
   ...
/* Set advanced flag */
  advanced = 1;

User teknopaul comments:

IMHO Coding guides should ban “SBO comments” Statin [sic] the Bleedin [sic] Obvious. Let devs [sic] decide what SBO is, usually its [sic] bleedin [sic] obvious.

Which sounds agreeable, but for the circular definition of “bleedin [sic] obvious.”

Martin Fowler, author of popular software engineering books, doesn’t like comments [74]:

Comments are not bad — but there are often better options. I always try to write code that doesn’t need comments …

One such option is assert statements, which specify properties to be satisfied by a program when executing at a certain position in the code. While they are yet more code to write, understand and maintain, they can express simple properties that aren’t obvious from reading the rest of the code. These properties are checked at run-time, helping in the early detection of bugs, near their origin, rather than later as consequences snowball. However, even without running any program, they work “as documentation for other developers reading the code, who see the assert and can confidently say that its condition holds” [75]. For instance, we can express the property that a list is sorted with:

is_sorted = lambda xx: all([x <= y for (x,y) in zip(xx[:-1], xx[1:])])

We can use this in the implementation of the algorithm Quicksort (see Sec. 1.2.6) to check and document that each call has to return a sorted list:

def qs(xx):
    if len(xx) <= 1:
        return xx
    filter_op = lambda op: list(filter(lambda x: op(x, xx[0]), xx))
    yy = qs(filter_op(lt)) + filter_op(eq) + qs(filter_op(gt))
    assert is_sorted(yy)
    return yy

Another option is randomized or property-based tests, which express, in executable form, required properties of inputs and outputs of a function. The idea is that, instead of specifying a set of inputs with corresponding desired outputs for a function, we can define a process to generate inputs and properties that the input-output pair must verify. For instance, when testing a matrix multiplication function, we would pick a matrix generator to provide inputs. Tests could verify properties such as associativity, distributivity, and having the correct dimensions. Additionally, they could check the results of operations involving identity and null matrices. This approach originated in the Haskell language community and was implemented in the package QuickCheck, where it relies on the type system to provide useful default generators even for user-defined types [76]. It has been imitated in many languages to different degrees of fidelity. In Python, the package Hypothesis supports property-based testing [77]. To continue with our Quicksort example, a test that the output is sorted would look as follows:

@given(xx=lists(integers()))
def test_is_sorted(xx):
    assert is_sorted(qs(xx))

In this test, the first line is a decorator that specifies what generator to use for the argument to the test function, one that produces lists of integers. Writing a generator is also a form of documentation because it tells us that qs accepts lists of integers. qs is more general than that, which should prompt us to write generators like one_of(lists(integers()), lists(floats()), lists(text())), which will produce lists of integers, floats, or strings (not mixed in the same list). The second line contains the name and signature of the test function, following the convention used by the package Pytest. The body of the test asserts that the result of applying qs to its argument is sorted. This condition is necessary but not sufficient for qs to be correct, as it could always return the same sorted list. We also need to ensure its output is a permutation of its input, that is, that it contains the same number of occurrences of each element, a task suitable for the data structure Counter from the library Collections:

@given(xx=lists(integers()))
def test_is_permutation(xx):
    assert Counter(qs(xx)) == Counter(xx)

These two property tests are equivalent to a user-selectable number of conventional ones because a property test can repeatedly run on newly generated inputs. They can replace hundreds or thousands of lines of tests written one by one, provide a high level of confidence in the implementation, and a readable specification for qs (for theoretical work showing how much testing is sufficient, see Sec. 1.2.11.1). This flexibility also mitigates the issue of slow-running tests because, by changing a configuration parameter, we can run a more superficial but quick battery of tests or a much slower but more thorough one. Indeed I applied this two-speed strategy in developing the Rmr2 package [78], using Quickcheck for R [79]. While adding new features, I ran a quick set of tests until most bugs were fixed. Then, as we approached a release, I increased the number and size of tests to build more confidence in the correctness of the package. You may be more of a believer in quality over quantity in the writing of tests. In my experience, handwritten tests have several limitations. First, they are often trivial because the correct output has to be calculated by hand. When written in response to bug reports, they catch problems that have already been discovered. Finally, they sometimes share the same faulty assumptions as the code to be tested and, therefore, won’t probe it outside of those.
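
To make the two-speed strategy concrete in Python terms (the Rmr2 work above used Quickcheck for R), here is a minimal sketch relying on the max_examples setting of Hypothesis, which caps how many generated inputs each test runs on; the numbers below are arbitrary, and is_sorted and qs are the definitions given earlier:

from hypothesis import given, settings
from hypothesis.strategies import integers, lists

@settings(max_examples=10)        # quick battery while developing
@given(xx=lists(integers()))
def test_is_sorted_quick(xx):
    assert is_sorted(qs(xx))

@settings(max_examples=5000)      # slower, more thorough battery before a release
@given(xx=lists(integers()))
def test_is_sorted_thorough(xx):
    assert is_sorted(qs(xx))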

All these elements make property-based testing valuable, but there are a couple of catches. First, a developer must state properties that a function’s inputs and outputs must satisfy, and implement them. This task is easy for mathematical objects like matrices or well-defined algorithms like those for sorting and searching but less so in other domains. For example, how would we approach testing that a graphical routine draws a circle? Another problem is that this approach relies on referential transparency, a complicated way to say that computer functions behave like mathematical functions, that is, their outputs depend only on their inputs. For example, if a function has access to a database, its return value may depend on its contents. In this case, there are workarounds, like starting with an empty test database and testing read and write functions in combination, but the code to be tested has to be amenable to this technique. Because of these limitations, we should be wary of emulating Dijkstra’s maximalism — the view that all software should be proven correct like a theorem (see testing in Sec. 1.2.11.1). Like formal methods, property testing has a domain of applicability, and within it, it is a great asset.
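
As a sketch of the read-and-write workaround just mentioned, let a plain dictionary stand in for the empty test database; save and load are hypothetical functions under test:

from hypothesis import given
from hypothesis.strategies import integers, text

def save(db, key, value):    # hypothetical function under test
    db[key] = value

def load(db, key):           # hypothetical function under test
    return db[key]

@given(key=text(), value=integers())
def test_roundtrip(key, value):
    db = {}                  # start each run from an empty test "database"
    save(db, key, value)
    assert load(db, key) == value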

Another alternative to traditional documentation is doctests, generally written using literate programming — a technique to interleave text and code so that the latter can be executed [80]. If a chunk of code fails, the documentation related to it may need updating. One instance of this is the “Examples” section from the documentation of package Altair_recipes, which includes both the test suite and the output for each test, consisting of statistical graphics [81]. Maintaining such an extensive set of examples covering the whole library and a comprehensive test suite would have been daunting for a one-person, spare-time project. However, since they are one and the same, the cost is cut in half.
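
In Python, the standard module Doctest offers a lightweight version of the same idea: examples embedded in a docstring serve as both documentation and tests, so they cannot silently go stale. A minimal sketch, reusing qs from Sec. 1.2.6:

def qs_documented(xx):
    """Sort a list with Quicksort.

    >>> qs_documented([3, 1, 2])
    [1, 2, 3]
    >>> qs_documented([])
    []
    """
    return qs(xx)

# Running `python -m doctest this_file.py` executes the examples above and
# reports any mismatch between the documented and the actual output.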

1.2.9 Languages

Paul Graham, Lisp hacker turned venture capitalist, wrote an essay devoted to conciseness, “Succinctness is power.” He sees succinctness as a productivity multiplier and believes it to be related to other positive qualities of code, such as readability. He focuses specifically on programming languages. He makes a characteristically articulate case for it [82]:

[…] clearly succinctness is a large part of what higher-level languages are for. If it is not all they’re for, then what else are they for […]

However, his point of view, while related, is different from the one espoused in this book, as he sees conciseness as a desirable feature in languages but not necessarily in programs:

We should be clear that we are talking about the succinctness of languages, not of individual programs. It is certainly possible for individual programs to be too densely written.

While it is possible to run into cryptic code (for instance, see Sec. 1.2.11.2 and the front cover), we shouldn’t regard general programming as a substantially different activity from designing languages. The boundary between core language and standard library is often blurred regardless, more a matter of packaging and versioning than anything else. In SICP, the authors write [68]:

To appreciate this point [that language interpreters are just programs] is to change our images of ourselves as programmers. We come to see ourselves as designers of languages, rather than only users of languages designed by others.

Therefore, it is hard to agree with Graham’s point that, while higher-level languages exist primarily for the sake of program conciseness, conciseness should not be pursued when using those languages. He considers moderate succinctness in programs acceptable, but not a goal.

Any non-trivial program should define the concepts relevant to the task at hand and implement data structures and operations to represent and manipulate these concepts, and doing so extends the language. However, in many languages, some extensions aren’t possible. To pick an example in Python, let’s assume we need to increment a group of variables by one, as in the following example:

a+=1
b+=1
c+=1

It is impossible to encapsulate this snippet in a function such as inc(a, b, c, by = 1) to better express intent and promote reuse. The reason is that Python passes arguments by binding the caller’s objects to the function’s parameter names: since integers are immutable, incrementing a parameter inside the function only rebinds the local name, leaving the caller’s variable untouched. Despite this, there actually is a way to implement this function, as shown by Stack Overflow user zwer [83]. Zwer uses a built-in mechanism to access the call stack, a data structure used by the language run-time to implement function calls. By doing that, we can access the list of variables in the calling environment and modify them at will. To my knowledge, debugging is the only recommended use for these features. Indeed, zwer’s solution comes with so many caveats and warnings that the author himself can’t recommend it as a viable one and named it here_be_dragons.
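
A minimal sketch of the failure mode, before resorting to any dragons: the increment rebinds only the local parameter name, leaving the caller’s variable untouched.

def inc(x, by=1):
    x += by        # rebinds the local name x only

a = 1
inc(a)
print(a)           # still 1: the caller's binding is unchanged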

I didn’t select this example to pick on Python: languages are defined by what they allow and what they prevent. Some features are considered “dangerous” because they are believed to lead to the creation of bugs. Functional programming encourages limiting side effects, where it allows them at all. In C++, primarily an object-oriented language, parameters that allow side effects must be of reference (or pointer) type. Whether or not we agree with the specific choices made in a language, we need to operate within those limitations. Hence the lofty vision promoted in SICP of programmers as language designers must be tempered with the reality that it has to be achieved within the confines of the language one is using.

Possible ways to push beyond language limits are macros and reflection. The former enables defining abstractions that generate or modify code before execution; the latter gives the ability to operate on the structures of the language run-time from the program being run, which is what here_be_dragons uses. Lisp macros are notable for being expressed in Lisp itself, unlike, for instance, C macros, which use a dedicated macro language. This is natural in Lisp, since, in that language, expressions are also data structures. Every interpreter or compiler for any language needs to build a representation of the program, which can only be a data structure, but in Lisp the relationship between code and representation is particularly simple, as programs are represented by lists [84]. In an example, developer and author Adam Tornhill shows how to eliminate code duplication for a specific web application using a macro: “Now […] we have a concise way to express web-pages with a uniform look […]” [85]. Not only is this solution concise, but conciseness also begets consistency because the same code generates all pages. Unfortunately, he doesn’t explain why macros are necessary in this case. Graham devotes a chapter in his “On Lisp” to explaining their domain of application [86]:

Macros can do two things that functions can’t: they can control (or prevent) the evaluation of their arguments, and they are expanded right into the calling context. Any application which requires macros requires, in the end, one or both of these properties.

For instance, being expanded in its calling context, a macro would allow implementing the inc function described above. However, this possibility isn’t unique to macros: pointer and reference parameters allow modifying the calling context in C++, as do var arguments in Pascal, but in a much more controlled fashion. As to the ability to control argument evaluation, see Sec. 1.2.9.3 for an alternative mechanism.

Besides these specific applications, Graham recommends using macros only “when nothing else will do” [86] because both macros and reflection can make programs harder to understand and hide subtle bugs [87]. Nonetheless, macros have an important role in Lisp. For example, the dominant object-oriented extension to Lisp, the Common Lisp Object System, is implemented with macros, as are several iteration constructs. Graham even shows how to implement a completely different language, Prolog (see Sec. 1.2.9.6), with just macros [86].

Steele thinks that the quest for the ideal, all-powerful language is bound to be frustrated. He says [14]:

[…] it’s really hard to make a language that’s great at everything, in part just because there are only so many concise notations to go around. There’s this Huffman encoding37 problem.

If this observation is correct, it suggests three ideas. Please note that these are based on an abstract, mathematical view of programs rather than a practical one; the extent to which they apply in practice is debatable. First, a language that forbids many constructs gives up concise ways of expressing some valid computations, so developers will have to use longer programs to define them. An example is mandatory variable declarations and explicit initializations, which many advocate on a safety basis (see the Mariner I disaster, Sec. 1.5). Second, if two distinct programs define the same computation, one of them is, in a sense, wasted: some other computation is left without a program of that length and must make do with a longer one. It is reasonable to conclude that a language with many ways to express the same computation will lead to more verbose programs. For instance, it is debatable whether a language needs both a map function, which applies a function to each element in a collection and returns a collection of its return values, and a list comprehension:

map(function, a_list)
[function(x) for x in a_list]

Finally, defining more common computations with shorter programs and accepting some verbosity for the less common ones could be a good compromise but implies different languages tailored for different application domains. For instance, a sum of two vectors in R is an element-by-element sum, even if the two vectors have different lengths, a problem solved with a technique known as recycling. In Python, the sum of two lists is their concatenation. In R, the concatenation function is concisely named c(), probably owing to its widespread use. In Python, there is no built-in element-wise sum, but we can perform it with a relatively complex expression: list(map(lambda x, y: x + y, x_list, y_list)). There are vector libraries for Python, such as Numpy, whereby the sum of two vectors is defined as in R, save for the recycling rules. While we can accomplish the same results in both, the choices of each language and library reflect the decisions of their respective developers, who likely took into account what the most common use cases were expected to be and reserved the most concise forms for them.
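
A small illustration of the contrast just described, assuming the Numpy library is installed:

import numpy as np

x_list, y_list = [1, 2, 3], [10, 20, 30]

print(x_list + y_list)                          # concatenation: [1, 2, 3, 10, 20, 30]
print([x + y for x, y in zip(x_list, y_list)])  # element-wise sum: [11, 22, 33]
print(np.array(x_list) + np.array(y_list))      # element-wise, as in R: [11 22 33]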

1.2.9.1 Lisp

We started talking about Lisp in the previous section because of its macros, but it is notable in other ways. Its inventor, John McCarthy, designed it as a concise alternative to the Turing machine38 [88]:

Another way to show that Lisp was neater than Turing machines was to write a universal Lisp function and show that it is briefer and more comprehensible than the description of a universal Turing machine.39

The idea to turn it into an actual programming language came to a student of McCarthy’s, Steve Russell, who wrote its first interpreter in assembly language.40 It wasn’t a large, challenging project, unlike other languages (both ADA and Algol68 raised concerns about their implementability; see Sec. 1.5). An interpreter for the Lisp in the original paper takes only 53 lines of Common Lisp — a more complex but also practical dialect thereof [89]. This is a remarkable combination: a simpler language should also be less powerful and therefore require more code for the same tasks, including the implementation of its own compiler. However, Lisp sidesteps this trade-off to some degree. The so-called “Computer Language Shootout,” a systematic comparison of many languages focused on small, well-defined tasks, corroborates its conciseness [90]. It ranks Common Lisp first among over 50 languages (including multiple implementations of the same language).

Clojure is a modernized cousin of Lisp, particularly focused on supporting concurrency. It hasn’t lost Lisp’s expressivity while getting none of the verbosity of the Java language, with which it shares the target virtual machine.41 Daniel Lebrero compares two programs for the same task, one in each language, and asks [91]:

So given that the inherent complexity of the problem is the same, and that the Clojure version is able to express the solution in 58 lines of code while the Java version require 441 lines of code, and that the Clojure version is easier to understand, what are those extra 383 (87% of the codebase) lines of perfectly idiomatic Java code about?

Lebrero also notices how much easier the Clojure version is to review, and that size is one factor, or maybe the determining one. Sure, concise code could be cryptic, and verbose code trivial, but this isn’t the case here:

[…] the Clojure version was way simpler to understand, and […] the fact of having a single file with 58 lines of code was a very important reason for it.

He ascribes the additional LOCs to “incidental complexity”, which he thinks is caused by the use of inappropriate tools. He concludes that LOCs aren’t helpful as a measure of productivity, but they are as a measure of complexity (see Sec. 1.1).

1.2.9.2 Python

Paul Prescod, Python developer and book author, is quoted by Graham as stating that “Python’s goal is regularity and readability, not succinctness” [82]. Graham considers it a “damning thing to claim about a programming language.” However, many others think Python is concise, in practice, if not by design. For example, Nicholas Hahn, a software engineer, compared Python, Go, and Rust. He finds that “Python has the least code by far” but also that “it’s leaning heavily on features of the image library it’s using, but this is indicative of the general experience of using Python” [92]. For instance, in the Python version, the difference between two images requires a single method call, whereas it is done pixel by pixel in the other two languages. This observation doesn’t tell us much about the intrinsic qualities of each language while still being important in practice. Availability of libraries could depend on popularity, age — older languages accumulate more — and ease of development, which in turn depends on tools and resources available, as well as those intrinsic qualities. The “Computer Language Shootout” mentioned in the previous section ranks Python 19th out of 51 languages for conciseness [90].

It is less verbose than C in many cases. For example, the README file for a project by the CMU Cylab contains equivalent examples in several languages, including Python. However, it omits the one in C because “the same program in C will take too much space and will not fit into the README format” [93]. One of the advantages of concise code is that it is easier to discuss and present. It will fit into a few slides or, in this case, a README file. Verbose code is something to skim in long, painful, lonely sessions and to hide.

Python practitioners aren’t short on enthusiasm for their language of choice and may occasionally overstate the power of certain features. For example, Mahmoud Hashemi, author of package Boltons, would have us believe that “decorators are among Python’s most elegant and succinct language features.” However, all they replace is one assignment: @decorator instead of f = decorator(f) [94]. Hashemi may be referring to the use of higher-order functions, popularized by decorators, rather than their syntax (see higher order functions in Sec. 1.2.4).
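
A minimal illustration of that equivalence, with a made-up decorator named shout:

def shout(f):
    return lambda *args: f(*args).upper()

@shout
def greet(name):
    return f"hello, {name}"

# ...is equivalent to writing:
#
# def greet(name):
#     return f"hello, {name}"
# greet = shout(greet)

print(greet("world"))   # HELLO, WORLD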

1.2.9.3 R

RStudio co-founder Joe Cheng, also known for the R package Shiny, thinks R is more expressive than Python and other languages. Since he is the CTO of a company devoted to supporting this language, we should be aware of his biases. However, he is referring to specific features that are unique to R, among popular languages. In an interview, he said (best effort transcription from [95]):

[…] each one of these libraries has a bunch of features and a bunch of syntax that would not be possible certainly with Python, certainly with Java, certainly with C++ […] when you look at the interfaces of these packages users don’t necessarily know what magic we had to do to make this happen, they just know oh it’s really easy it’s really concise when I want to express my plots this way when I express my data manipulations this way […]

The “magic” he refers to includes a macro-like facility that optionally suspends the regular evaluation of expressions and replaces it with specialized variants. For instance, we can refer to columns in a tabular data structure, the data frame, in certain expressions without constantly repeating the name of the table, as in mutate(table, a = b + c), as opposed to the repetitive table$a = table$b + table$c, where a, b and c are names of columns in table. Users love these more concise forms, but having multiple evaluation rules makes programs harder to understand and hinders their parametrization and generalization. For instance, if we wanted to make the above expression general with respect to the names of the columns involved, we would need to use a different syntax (see package Dplyr in Sec. 1.2.10.1).

One task that R performs well is operating on vectors. It had better, as it doesn’t even provide scalar types, which are replaced by vectors of length one. While most languages offer extensions for vectors or other types of collections, in R, by design, all library functions need to accept vectors. R developers think in vectors, and the comparatively modest speed of iterative constructs in R is such that they have to use vector functions in day-to-day work. That often helps with conciseness. For instance, Rasmus Bååth takes a simple spelling corrector in 21 lines of Python and translates it to two lines of R without using any extensions [96]. It is two oversized lines, though, and, once automatically reformatted in the R IDE RStudio, his code swells to 12 lines: more than 2, but still a great deal shorter than the original. I took the liberty of modifying this code to eliminate one repetition and to use the pipe operator (%>%) to reorder the operations. The “classic” style of function composition in R is f(g(h(a1, a2), a3), a4), which many find counter-intuitive because functions execute in a temporal order opposite to that in which they appear, but also because arguments to the same function can end up far apart. An alternative is using a temporary variable:

x = h(a1, a2)
x = g(x, a3)
f(x, a4)

This approach adds some unnecessary steps and doesn’t convey the rigid organization of a pipeline. Using R’s macro-like facility, we can define a pipe operator similar to the ones in F# and the Unix shell and available in package Magrittr, such that the above expression is equivalent to h(a1, a2) %>% g(a3) %>% f(a4) [96]. In other words, the pipe operator turns the expression on its left side into the first argument to the function call on its right side, pushing the other arguments to the right. With these modifications, here is the concise spell corrector in R:42

readLines("http://www.norvig.com/big.txt") %>%
  paste(collapse = " ") %>%
  tolower %>%
  strsplit("[^a-z]+") %>%
  table %>%
  sort(decreasing = TRUE) %>%
  names -> sorted_words

In this step, we build a vector of words by:

  1. reading the text of a large, book-length corpus from a URL;
  2. collapsing its lines into a single string;
  3. converting it to lower case;
  4. splitting it into words on runs of non-alphabetic characters;
  5. tabulating the frequency of each word;
  6. sorting the table by decreasing frequency;
  7. extracting the words themselves, now ordered from most to least common, into sorted_words.

This sorting order will help return the most common among equally close matches for a word. Now on to the actual spell-correcting function:

correct <-
  function(word) {
    distances = adist(word, sorted_words)
    c(sorted_words[distances <= min(distances, 2)],
      word)[1] }

It works by computing the edit distance43 between a given word and our list of correct, frequency-sorted words. This code’s conciseness stems from using vector functions instead of loops and the vast number of functions considered “core” to R (about 700). Bååth observes that, in particular, the function adist, calculating the edit distance between words, performs much of the difficult work. He adds that R also provides a function aspell in the standard library, which would reduce his program to a single line. Unlike the features referred to by Cheng earlier as “magic,” I don’t think this example highlights any conciseness advantage intrinsic to the language, save for the use of the pipe operator. Instead, it shows the benefits of a vectorized programming style, whereby we can process entire vectors in one step, and the importance of being familiar with available libraries. We could likely achieve a similar result in Python, but this programming style may be less natural to practitioners of that language. It is a fine line between what a language allows and what it encourages or enforces.
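
For the curious, here is a rough Python sketch in the same spirit, not a translation: the similarity measure in the standard module Difflib differs from the edit distance computed by adist, and ties aren’t broken by frequency in quite the same way.

import re
import urllib.request
from collections import Counter
from difflib import get_close_matches

text = urllib.request.urlopen("http://www.norvig.com/big.txt").read().decode("utf-8", errors="ignore").lower()
sorted_words = [w for w, _ in Counter(re.findall("[a-z]+", text)).most_common()]

def correct(word):
    # Fall back to the word itself when no known word is close enough.
    return (get_close_matches(word, sorted_words, n=1) or [word])[0]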

1.2.9.4 JavaScript

While Python is the target of unwavering praise and R of a mix of praise and blame [97], JavaScript is subject to relentless condemnation (see Sec. 1.3.2) while, ironically, being arguably the most widely adopted language. It certainly is so when measured by the number of JavaScript programs simultaneously running, since it is the programming language used for web development on the browser side. Because of its ubiquitousness and perceived shortcomings, developers have defined new languages that translate into JavaScript, correcting some of its defects but maintaining a considerable degree of compatibility. One of these, CoffeeScript, is touted for its conciseness. For example, software engineer Simon Grondin implemented “a rate limiter in 31 lines of CoffeeScript” and thinks this is “a demonstration of its exceptional expressiveness” [98]. However, when I revisited this implementation after first becoming aware of it [99], I found that it had grown to 8 times its original size [100] and that finding the core rate limiter logic among the cruft had become difficult. You may wonder whether the initial, concise implementation was more the result of a fresh take on the subject than of the intrinsic qualities of the language. Eventually, this code seems to have succumbed to the drift toward complexity.

Not so the three-line implementation of partial application,44 also in CoffeeScript, by Smooth CoffeeScript’s author E. Hoigaard, which retained its original purity [101]. partial takes a function func of $n$ arguments and returns a function of $m \le n$ arguments by providing $n - m$ values in advance. In this specific implementation, this is accomplished by using the undefined constant as a marker for an argument that has no pre-specified value. For instance, a function rectangle(x, y, h, w) which draws rectangles centered in $(x, y)$ with height $h$ and width $w$ can be used to define a function that draws only unit squares as follows:

square = partial (rectangle, undefined, undefined, 1, 1)

Hoigaard starts from someone else’s 15-line implementation (see below), bringing it gradually down to its final svelte size, which I further trimmed down to two lines:45

partial = (func, a...) -> (b...) ->
  func (for arg in a then arg ?= b.shift())...

Even without knowing much about CoffeeScript, I could understand this implementation of partial after researching a couple of constructs. The meaning of this code is that partial is a function that accepts another function func plus a variable number of arguments collected in vector a, and returns a function of a variable number of arguments (vector b). The body of the returned function invokes func with a list of arguments assembled by a list comprehension, which copies each element of a when it is defined and otherwise picks the next value from b. Not only is it remarkably concise without being cryptic, but it also implements an important feature that other languages lack and can’t add with a simple extension. So does the 15-line version — here reduced to 14 — but its meaning must be extracted from a jungle of indexes and tests:46

partial15lines = () ->
  [func, args...] = arguments
  wrapper = () ->
    i = 0
    j = 0
    res_args = []
    while i < args.length
      if args[i] == undefined
        res_args.push arguments[j]
        j++
      else
        res_args.push args[i]
      i++
    return func.apply null, res_args

Converted to JavaScript, the first step to execute CoffeeScript code, the two-line version expands to 21 LOCs, but an automated conversion may be more verbose than a skilled programmer’s.
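
The placeholder trick isn’t exclusive to CoffeeScript. Here is a rough Python rendition; the names partial_ and HOLE are made up for this sketch, and the standard functools.partial works differently, presetting only leading positional or keyword arguments.

HOLE = object()   # plays the role of `undefined`: fill this argument in later

def partial_(func, *preset):
    def wrapper(*later):
        later = list(later)
        args = [later.pop(0) if a is HOLE else a for a in preset]
        return func(*args)
    return wrapper

def rectangle(x, y, h, w):
    return f"rectangle at ({x}, {y}), height {h}, width {w}"

square = partial_(rectangle, HOLE, HOLE, 1, 1)
print(square(0, 0))   # rectangle at (0, 0), height 1, width 1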

In its latest version, ES6, JavaScript has evolved to look and work more like CoffeeScript. However, even the unimproved JavaScript, warts and all, seems to offer the potential for significant code reductions. Kiran Prasad, a software executive, recalls how LinkedIn switched from Ruby to server-side JavaScript (node.js) to meet their scalability goals for some mobile products [102]. In the process, their code base shrank from 60 KLOCs to 1-2 KLOCs, a reduction greater than 30-fold that wasn’t even a goal for this transition. He explains this remarkable achievement in three parts: the node.js version is “framework-free,” therefore free of framework-related cruft (I assume he means web framework, see Sec. 1.2.10.3); they switched to a more functional style from an object-oriented one, which had resulted in a lot of unneeded abstraction layers; and they moved rendering47 client-side48 from server-side, which relocated, rather than eliminated, some of the code. With so many concurrent changes, it is impossible to tell which ones were instrumental to the code savings. For me the most important lesson is that a 30-fold reduction in a codebase, which presumably freed 29 out of 30 engineers from maintenance work, was achieved as a side effect of the quest for scalability rather than being an explicit goal. Had LinkedIn selected a different approach, they might still be saddled with a 60-KLOC code base.

1.2.9.5 Erlang

Maybe not so if they went with Erlang. In a video presentation, its creator, Joe Armstrong, shows how a couple of pages of C translate to one line of Erlang (near 1:20 [103]). Unfortunately, this video is too fuzzy to read the C code. Later, a co-worker cites an internal study at telecommunication firm Ericsson that estimates the ratio between C and Erlang code for the same tasks as 7-fold. A factor to consider is that, just as node.js (see previous section) supports event-driven programming, Erlang includes specialized features to build reliable systems. In particular, it aims to solve the problem of creating them despite the reality of software errors by supporting a system’s ability to recover. Erlang achieves this goal by providing the developer with lightweight processes that don’t share resources and communicate via asynchronous message passing. They are organized in hierarchies, whereby those higher up supervise those below. Rather than trying to code defensively a single process to keep it going at all costs, developers focus on recovering when something goes wrong. The goal isn’t eliminating crashes but making them ordinary, recoverable events. It is easier to understand this concept in the context of the original target application: software for telephone switches. The goal, in that case, is to keep a large number of conversations going, a task that naturally lends itself to concurrency. Other applications written in Erlang are the messaging library RabbitMQ and the messaging app WhatsApp. However, its noteworthy features aren’t limited to its concurrency and fault tolerance models. It is also a functional language with pattern matching and many other modern features. Armstrong wrote a simple side-by-side comparison of Java and Erlang programs computing the area of geometric shapes [104]. This is the Erlang version:

area({rectangle, Width, Ht}) -> Width * Ht;
area({square, X}) -> X * X;
area({circle, R}) -> 3.14159 * R * R.

And this is the Java version:

abstract class Shape {
    abstract double area();
}

class Circle extends Shape {
    final double radius;
    Circle(double radius) { this.radius = radius; }
    double area() { return Math.PI * radius*radius; }
}

class Rectangle extends Shape {
    final double ht;
    final double width;
    Rectangle(double width, double height) {
        this.ht = height;
        this.width = width;
    }
    double area() { return width * ht; }
}

class Square extends Shape {
    final double side;
    Square(double side) {
      this.side = side;
    }
    double area() { return side * side; }
}

Skipping empty or single-character lines, we have a 3:17 ratio in favor of the Erlang program. However, despite its conciseness, more than a million lines of Erlang were necessary for its first production deployment, the ATM switch AXD 301 [105].

1.2.9.6 Prolog

Armstrong supported conciseness in contexts other than his programming language. In another interview, this is what he said about Prolog [14]:

[Prolog is] not widely used. And it’s a great shame because programs are incredibly short.

While it is possible to write all types of applications in Prolog (Armstrong used it for the first version of the Erlang interpreter [106]), that is uncommon. Prolog is a logic-based language used primarily in AI, specifically the flavor known as symbolic AI. Programs consist of facts and rules that allow inferring more facts. A classic example is the implementation of a family tree. For instance, let’s consider the ancestry of Charles II of Spain going back to Philip I and Joanna of Castile in Fig. 4.


Figure 4: A portion of the ancestry of Charles II of Spain.49

We can implement this tree and some simple searches in a few lines of Prolog. First, we need to enter what would be considered data in other languages, that is, the parent-offspring relation by listing Prolog facts:

parent("Philip I of Castile", "Charles V Holy Roman Emperor").
parent("Joanna of Castile", "Charles V Holy Roman Emperor").
parent("Philip I of Castile", "Ferdinand I Holy Roman Emperor").
parent("Joanna of Castile", "Ferdinand I Holy Roman Emperor").
/* more parent-child pairs as shown in the graph */

Then we have a set of rules that allow to derive new facts from existing ones:

offspring(X,Y):- parent(Y, X).
ancestor(X, Y):- parent(X, Y).
ancestor(X, Y):- parent(X, Z), ancestor(Z, Y).
descendant(X, Y):- ancestor(Y, X).

These four lines define three relations:

  1. offspring is the inverse of parent;
  2. ancestor contains pairs (X, Y) for which either X is a parent to Y or there exists a Z such that X is a parent to Z and Z is an ancestor to Y;
  3. descendant is the inverse of ancestor.

There is nothing wrong with having two rules for ancestor: it is a way to express disjunction, and either one, applied successfully, is sufficient to establish that a specific pair is an ancestor-descendant pair. With just a handful of LOCs, we can already ask questions like “Who were the ancestors of Anna of Austria?”:

?- findall(X, ancestor(X, "Anna of Austria"), XX).
XX = ["Ferdinand I Holy Roman Emperor", "Anna of Bohemia and Hungary", "Philip I of Castile", "Joanna of Castile"].

If we want to see how a person is an ancestor of another, and not just whether that person is, we need only two more rules:

path(X, Y, []):- parent(X, Y).
path(X, Y, [Z|P]):- parent(X, Z), path(Z, Y, P).

which mimic the definition of ancestor but add a list variable P to accumulate the ancestry path. [Z|P] is the list obtained by prepending Z to P. We can paraphrase the second rule as “There is a path consisting of Z followed by P between X and Y if there is a Z that has X as a parent and there is a path P between Z and Y”.

?- path("Philip I of Castile", "Charles II of Spain", XX).
XX = ["Charles V Holy Roman Emperor", "Philip II of Spain", "Philip III of Spain", "Maria Anna of Spain", "Mariana of Austria"] .

We can combine path with findall to ask the interpreter to provide not one but all paths connecting two members of the dynasty (the output is truncated):

?- findall(XX, path("Philip I of Castile", "Charles II of Spain", XX), YY).
YY = [["Charles V Holy Roman Emperor", "Philip II of Spain", "Philip III of Spain", "Maria Anna of Spain", "Mariana of Austria"],
["Charles V Holy Roman Emperor", "Philip II of Spain", "Philip III of Spain", "Philip IV of Spain"], ["Charles V Holy Roman Emperor", "Maria of Spain", "Anna of Austria '49", "Philip III of Spain", "Maria Anna of Spain", "Mariana of Austria"],
["Charles V Holy Roman Emperor", "Maria of Spain", "Anna of Austria '49", "Philip III of Spain", "Philip IV of Spain"],
["Ferdinand I Holy Roman Emperor", "Maximilian II Holy Roman Emperor", "Anna of Austria '49", "Philip III of Spain"|...],
["Ferdinand I Holy Roman Emperor", "Maximilian II Holy Roman Emperor", "Anna of Austria '49"|...], ["Ferdinand I Holy Roman Emperor", "Anna of Austria"|...], ["Ferdinand I Holy Roman Emperor"|...], [...|...]|...].

In ordinary families, there should be a unique path linking an ancestor and a descendant, but not in European royal families of the 16th and 17th centuries, and certainly not in this one. Since political expedience dictated marriages, unions between cousins or an uncle and his niece were not unusual. The cumulative effect over several generations was to reduce the genetic diversity to unhealthy levels: Charles II of Spain was extraordinarily sickly and died at just 39; his sister only made it to 21.

Leaving royal inbreeding aside, we can add more features to this genealogy system: finding if two individuals are half-siblings, if they are related in any way, which common ancestors they share, or which one is the closest. Each feature requires one or two new rules. I can’t even think how long it would take me to implement this from first principles in a more traditional language. Behind the scenes, while most languages execute statements in the order they are written or as specified by control structures, Prolog applies a search and unification algorithm. The unification part solves the problem of matching parent(X, "Charles II of Spain") with parent("Philip IV of Spain", "Charles II of Spain") for X = "Philip IV of Spain". The search part explores all the connections between matching facts and rules. It may help to think of it as a database with a powerful query language. It looks like Prolog would free the developer from worrying about algorithms, that is, the specific steps needed to produce results. Unfortunately, it is more complicated. Sometimes the rule order has a drastic impact on the search space. In other cases, there are different sets of rules to accomplish the same task, but they can make the search process more or less efficient: for instance, see how different sorting algorithms can be implemented [107].
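
For contrast, here is a rough Python rendition of just the ancestor query. Even this single feature needs explicit recursion, answers only one kind of question (the ancestors of a given person), and gains none of the other queries for free, whereas the Prolog rules above come with the search machinery built in.

parent = {
    ("Philip I of Castile", "Charles V Holy Roman Emperor"),
    ("Joanna of Castile", "Charles V Holy Roman Emperor"),
    # more parent-child pairs as shown in the graph
}

def ancestors(person):
    # Direct parents, plus, recursively, their own ancestors.
    direct = {p for (p, c) in parent if c == person}
    return direct.union(*(ancestors(p) for p in direct))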

1.2.9.7 Kotlin

Kotlin is a language that runs on the JVM50 and is thus seen as an alternative to Java. It prominently advertises its conciseness [108]:

Concise
Reduce boilerplate code […] For example, one can now create a POJO [Plain Old Java Object] with getters, setters, equals(), hashCode(), toString() and copy() in a single line.

POJOs are just Java objects that don’t satisfy any additional specifications. It is customary to provide them with the methods listed in the quote, which used to be written by hand. I distinctly remember when, early in my career, I discovered I had to write POJOs for the simplest tasks, like accessing a database table. It was like running while submerged to the waist: satisfying a trivial curiosity required half an hour of mindless typing. I couldn’t believe it hadn’t been automated in some way. Senior colleagues reassured me that it was standard procedure and that I would improve with practice.

In Kotlin, data classes replace POJOs. An example would be:

data class Person(var name: String, var age: Int)

In Java, this would look like the following snippet, the longest included in this book:

public class Person {

  private String name;
  private int age;

  public Person() {
  }

  public Person(String name, int age) {
    this.name = name;
    this.age = age;
  }

  public String getName() {
    return name;
  }

  public void setName(String name) {
    this.name = name;
  }

  public int getAge() {
    return age;
  }

  public void setAge(int age) {
    this.age = age;
  }

  @Override
  public String toString() {
    StringBuilder sb = new StringBuilder();
    sb.append(Person.class.getName())
      .append('@')
      .append(Integer.toHexString( System.identityHashCode(this)))
      .append('[');
    sb.append("name");
    sb.append('=');
    sb.append(((this.name == null)?"<null>":this.name));
    sb.append(',');
    sb.append("age");
    sb.append('=');
    sb.append(this.age);
    sb.append(',');
    if (sb.charAt((sb.length()- 1)) == ',') {
      sb.setCharAt((sb.length()- 1), ']');
    } else {
      sb.append(']');
    }
    return sb.toString();
  }

  @Override
  public int hashCode() {
    int result = 1;
    result = ((result* 31)
             +((this.name == null)? 0 :this.name.hashCode()));
    result = ((result* 31)+ this.age);
    return result;
  }

  @Override
  public boolean equals(Object other) {
    if (other == this) {
      return true;
    }
    if ((other instanceof Person) == false) {
      return false;
    }
    Person rhs = ((Person) other);
    return (((this.name == rhs.name)
           ||((this.name!= null)
             &&this.name.equals(rhs.name)))
          &&(this.age == rhs.age));
  }
}

Java developers aren’t fools and, in this day and age, use code generators to write this boilerplate51 for them. However, there is a problem with this approach. Generated code needs to be read, understood, and maintained. Granted, it doesn’t look all that difficult, but how do we know that no one changed some detail in the equals method after it was generated? That is why code generation isn’t a suitable replacement for abstraction. Moreover, I believe POJOs tell us something about Java. What is wrong with a language that doesn’t allow abstracting this boilerplate away? For instance, one in which a developer can’t express with code the following algorithm: “To implement equality comparison for a class, for each property in that class, call the equals method on that property and return true if all of these calls return true?” It seems like one of the most straightforward and useful abstractions you could come up with. In Python, the specific issue of boilerplate methods can be solved with a library (for instance, Attrs; see Sec. 1.2.10.4), but, apparently, not in Java. Kotlin offers much more than data classes, and replacing POJOs was only one of the motivations behind the language, but one that will get the attention of POJO-aggrieved Java programmers.
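
As a point of comparison, here is a minimal sketch of the Python analogue mentioned above, using the standard-library module Dataclasses (Attrs predates it and offers more features):

from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int

# __init__, __repr__, and __eq__ are generated from the field declarations.
p = Person("Ada", 36)
print(p)                        # Person(name='Ada', age=36)
print(p == Person("Ada", 36))   # True: structural equality for free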

1.2.9.8 C++

R’s vector functions perform well because they are written in other, lower-level, but more efficient languages, typically C or C++. The most downloaded R package is indeed one that facilitates mixed R and C++ development, Rcpp. Its maintainers took advantage of new features in C++11, a relatively recent update to the C++ standard, to refactor or, should I say, cull the Rcpp code base. Software engineer Kevin Ushey reports the following [109]:

Eliminating the code bloat and replacing [it] with variadic templates52 was awesome. […] For some hard numbers, I count 31,428 lines of code in the inst/include directory of Rcpp11, while there are 92,726 total lines of code in inst/include for Rcpp.

This process resulted in a 3-fold reduction stemming from a single new language feature and without any impact on the Rcpp user, for whom it is entirely transparent. Templates enable parametrizing code by type, but until C++11, the number of type parameters had to be fixed. As a workaround, Rcpp developers, to provide the flexibility of functions with a variable number of arguments, painstakingly replicated and adapted their code for each number up to a somewhat arbitrary maximum, 20 (not unlike what my team-mate did with nesting levels in Sec. 1.2.4, but in that case a better solution was readily available). For example, searching for all the Vector create definitions in Rcpp, we find 20 instances [110]:

static Vector create(const T1& t1)
static Vector create(const T1& t1, const T2& t2)
static Vector create(const T1& t1, const T2& t2,
                     const T3& t3)
static Vector create(const T1& t1, const T2& t2,
                     const T3& t3, const T4& t4)
static Vector create(const T1& t1, const T2& t2,
                     const T3& t3, const T4& t4,
                     const T5& t5)
[14 entries omitted]
static Vector create(const T1& t1, const T2& t2,
                     const T3& t3, const T4& t4,
                     const T5& t5, const T6& t6,
                     const T7& t7, const T8& t8,
                     const T9& t9, const T10& t10,
                     const T11& t11, const T12& t12,
                     const T13& t13, const T14& t14,
                     const T15& t15, const T16& t16,
                     const T17& t17, const T18& t18,
                     const T19& t19, const T20& t20)

To add insult to injury, each definition needs a largely repetitive but slightly customized implementation — here omitted. The developers were aware of this massive amount of repetition. They prefaced this code with a comment describing an imaginary abstraction that deals with the 20 cases in a unified way, which was impossible before C++11.

In Rcpp11, instead, there is only one definition for all these cases, and a single implementation [111]:

static Vector create(Args&&... args)

The line count went from 1245 to 144. However, there is a price to pay for this code reduction: an ever more sprawling C++ standard for a language already quite complex in its first incarnation (see Sec. 1.3.2). Its evolution has been guided by an unwavering commitment to backward compatibility,53 which implies features can’t be removed, and an openness to new features. Its standard committee doesn’t seem particularly concerned about the language becoming harder to learn and read. They aren’t alone in taking this position: for instance, Python has been growing as well, adding list comprehensions, coroutines, and f-strings, among other features. Unlike C++, though, Python abandoned full backward compatibility in its 3.0 version. The two incompatible versions still coexist at the time of this writing and were officially supported in parallel for 12 years, which helps explain why language steering committees often accept backward compatibility as a constraint, no matter how burdensome.
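
For readers unfamiliar with the feature, here is a minimal sketch, unrelated to the Rcpp code base, of how a single variadic template subsumes a ladder of fixed-arity overloads like the one above:

#include <utility>
#include <vector>

// One definition accepts any number of arguments of any types; before C++11,
// each arity would have required its own overload.
template <typename T, typename... Args>
std::vector<T> create(Args&&... args) {
  return {static_cast<T>(std::forward<Args>(args))...};
}

int main() {
  auto v = create<double>(1, 2.5, 3); // a three-element vector<double>
  return v.size() == 3 ? 0 : 1;
}

The Args&&... parameter pack is the same device that appears in the Rcpp11 signature above.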

1.2.9.9 Query Languages

SQL queries can quickly grow to be unmanageable. Luckily, there are libraries to side-step this problem, generating SQL from much more concise expressions. Dplyr (see Sec. 1.2.10.1) is one such library. For instance, here is a typical R expression using Dplyr:

mtcars %>%
  group_by(cyl) %>%
  filter(mpg == min(mpg) | mpg == max(mpg)) %>%
  select(mpg, model) %>%
  arrange(mpg)

First, remember the meaning of %>%, the pipe operator (see Sec. 1.2.9.3), which takes the expression on its left hand side and makes it the first argument for the function call on its right side. This code takes the data in table mtcars, a car-related dataset, groups it by the number of cylinders, filters the rows that match the minimum and maximum fuel mileage in each group, keeps only specific columns, and sorts the rows by increasing gas mileage. The value of this expression is the following table:

cyl   mpg model
  8  10.4 Cadillac Fleetwood
  8  10.4 Lincoln Continental
  6  17.8 Merc 280C
  8  19.2 Pontiac Firebird
  6  21.4 Hornet 4 Drive
  4  21.4 Volvo 142E
  4  33.9 Toyota Corolla

Dplyr can process datasets in memory or database tables, the latter by generating SQL queries. For example, the following is generated from the above expression (assuming that the target database is Postgres, as SQL is still subject to proprietary variations, despite the existence of a standard):

SELECT `cyl`, `mpg`, `model`
FROM (SELECT `cyl`, `mpg`, `model`
FROM (SELECT `cyl`, `mpg`, `model`,
  MIN(`mpg`) OVER (PARTITION BY `cyl`) AS `q01`,
  MAX(`mpg`) OVER (PARTITION BY `cyl`) AS `q02`
FROM `mtcars`) `q01`
WHERE (`mpg` = `q01` OR `mpg` = `q02`)) `q02`
ORDER BY `mpg`

It is possible that a skilled SQL developer could come up with a more concise query, but there is no denying that the R-Dplyr expression is clearer and more concise.

Impala, an SQL variant specialized for big data, adds support for complex types. One of the advantages of this extension is that “common querying patterns can be expressed concisely” [112]. This statement reprises the concept that this isn’t possible for all patterns (see Steele’s “Huffman encoding problem” in Sec. 1.2.9). Knowing which patterns are the most valuable, it is possible to make those more concise, while allowing the less common ones to be expressed in more complex ways rather than dropping them altogether.

1.2.10 Libraries

As we have seen in Sec. 1.2.9, libraries should not be seen as totally distinct from languages. On the contrary, they are language extensions, and the line between them is sometimes blurred. It is no surprise, then, that we can look at them from the point of view of their conciseness, meaning the conciseness they enable in the programs that use them — we will talk about frugal APIs later on (see Sec. 1.3.3). It is almost a given that using a library allows us to write less code than managing without it (but see Sec. 1.2.9.4 for an exception), possibly at the cost of a steep learning curve. However, not all libraries are created equal. We will see through several examples that some libraries specifically target conciseness. Users value this property, whereas they criticize or avoid libraries that force verbose code, particularly in the form of boilerplate.

1.2.10.1 Data Analysis and Processing

A remarkable example is the library Keras for deep learning (which is covered in Sec. 2.1.4). The brainchild of François Chollet, it entered a segment populated by many strong competitors. It doesn’t intend to replace them but only to provide a better API that makes developing deep learning models easier. On the implementation side, it relies on a user-selectable preexisting deep learning library: you can see it as a thin layer of peanut butter and jelly, with the underlying bread provided by others. Therefore it isn’t any faster or more powerful than any of these libraries, but only more convenient and makes developers more productive. These properties are based, at least partly, on the conciseness it enables. Version 0.2.0 opened its README with “Keras is a minimalist, highly modular neural network library” [113]. Among the design principles explained in the same document, we find: “Minimalism. Each module54 should be kept short and simple (<100 lines of code).” Strangely this principle was dropped in later versions, but not the following one [114]:

No separate models [sic] configuration files in a declarative format. Models are described in Python code, which is compact […].

Keras avoids defining a new, separate description language, finding Python is perfectly adequate and highlighting the compactness of model descriptions, which are also modular and easy to combine. Its users agree: fast.ai co-founder Rachel Thomas calls one of the alternatives to Keras, Tensorflow, “verbose and confusing” and recommends switching [115]. Indeed Tensorflow developers gave it the strongest endorsement, by adopting it as their recommended API and incorporating it into their package distribution. A run-of-the-mill feed-forward neural network in Keras looks as straightforward as it should be:

model = Sequential()
model.add(Dense(10, input_dim=4, activation="relu"))
model.add(Dense(1))

There is a small terminology catch: what everyone else calls a feed-forward neural network here is a sequential model, which may suggest a model for sequences of variable length, such as time series. Its input has length four, as the input_dim parameter specifies. The two layers of this network are added one at a time, calling a function named with the desired connectivity pattern, Dense, and parametrized with their sizes and, optionally, activation functions. Everything else is set to reasonable defaults customizable with additional parameters. Let’s see what this looked like in Tensorflow before Keras was bundled with it:

inputs = placeholder(float32, shape=(None, 4))


# Hidden layer
hidden_size = 10
weights_hidden = Variable(random_normal([hidden_size, 4], stddev=0.01))
bias_hidden = Variable(constant(0.1, shape=(hidden_size, 1)))
hidden = relu(add(matmul(weights_hidden, transpose(inputs)), bias_hidden))


# Output layer
weights_output = Variable(random_normal([1, hidden_size], stddev=0.01), name='weights_output')
bias_output = Variable(random_normal([1, 1]))
output = add(matmul(weights_output, hidden), bias_output)

The general structure is the same: we start with an input layer, created with the opaquely named placeholder, then create a hidden layer, and finally an output layer, but each requires 3-4 lines. Crucial to creating each is the also opaquely named Variable, which creates a vector or matrix object that can be initialized and later optimized during training. You can see different initialization strategies being used. The layers’ definitions make explicit the underlying mathematical operations, such as add, matmul, and relu, and are combined by function composition instead of the add method. On the one hand, this approach enables the complex architectures used in deep learning, for which the Sequential API in Keras is inadequate. On the other, they expose lots of details practitioners are often happy to leave at defaults, resulting in three times as much code. Particularly burdensome are those details about layer dimensions that need to be compatible between layers and are all calculated automatically in Keras. Confusingly, the layer dimensions, for which there are no reasonable defaults, must be specified as parameters to the initialization strategy, for which, instead, there are defaults. It may seem a classic case of power vs. ease-of-use trade-off: after all, reducing a layer to its mathematical definition means we can swap one transformation for another and create all sorts of experimental neural modules. Moreover, composing those modules by name means great flexibility in defining complex architectures. However, Keras isn’t only for feed-forward neural networks: its Sequential API is just a teaser for newcomers or the result of recognizing this type of network is perfectly adequate in many applications. For more advanced ones, it offers a Functional API, more similar to the one native to Tensorflow, but with significant differences. The following is how the same network can be defined using this API:

inputs = keras.Input(shape=(4,))
dense = layers.Dense(10, activation="relu")
hidden = dense(inputs)
outputs = layers.Dense(1)(hidden)
model = keras.Model(inputs=inputs, outputs=outputs)

This approach adds the flexibility researchers need to define cutting edge architectures while not being substantially more verbose than its Sequential API. Like the original Tensorflow API, it uses function composition to assemble neural modules but avoids several of its pitfalls: this API is still centered around machine learning concepts of input and layer rather than the more opaque Variable and placeholder as in Tensorflow; the layer dimensions are specified directly, not through initialization; and there are reasonable defaults for initialization and numeric types. When necessary, we can always define a custom layer at the level of mathematical operations.
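
As a rough sketch of that last escape hatch, and not an excerpt from the Keras documentation, a custom layer can subclass Layer and spell out its mathematics explicitly, while everything else keeps the usual defaults:

import tensorflow as tf
from tensorflow import keras

class Affine(keras.layers.Layer):
    # A dense layer without activation, written at the level of matmul and add.
    def __init__(self, units):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        # Parameters are created once the input dimension is known.
        self.w = self.add_weight(shape=(input_shape[-1], self.units),
                                 initializer="random_normal")
        self.b = self.add_weight(shape=(self.units,), initializer="zeros")

    def call(self, inputs):
        return tf.matmul(inputs, self.w) + self.b

An Affine(10) instance can then be dropped into the Sequential or Functional API like any built-in layer.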

However, as good as Keras is, some people think there are better options: it is a very active field at the time of this writing. If Thomas vouches for Keras, fast.ai’s other co-founder Jeremy Howard has already moved on. Reporting on a comparison between Keras and PyTorch, another library for the same class of applications, he writes [116]:

We built models that are faster, more accurate, and more complex than those using Keras, yet were written with much less code.

This code wasn’t immediately available. While I don’t have direct experience with PyTorch, a quick look at plain-vanilla examples showed an API at an abstraction level somewhere between Keras’s functional API and Tensorflow’s, with additional object-oriented boilerplate layered on top, and I wasn’t impressed. However, PyTorch was built for experimentation and not for plain-vanilla models. Therefore, its true worth may not be evident unless someone works with models with the complexity Howard is interested in. Whichever the current champion, it doesn’t look like anyone is finding value in verbose machine learning code, while models are getting more complex over time, putting additional emphasis on compact descriptions. Moreover, no API designer can escape the compromise between the complexity of an API, its power, and the complexity of user code. Nonetheless, some navigate it better than others.
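
For a rough sense of what that looks like, and with the caveat that this is my own plain-vanilla sketch rather than the code Howard refers to, here is the same small network in PyTorch:

import torch
from torch import nn

class FeedForward(nn.Module):
    # The class, constructor, and forward method are the object-oriented
    # boilerplate mentioned above; the layers themselves take one line each.
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, 10)
        self.output = nn.Linear(10, 1)

    def forward(self, x):
        return self.output(torch.relu(self.hidden(x)))

model = FeedForward()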

Large deep-learning models are trained on GPUs. While most practitioners don’t have to code directly for the GPU, those writing advanced custom models often do, a task that requires highly specialized skills. Triton, a library from OpenAI, aims to change that. In their words [117]:

Triton makes it possible to reach peak hardware performance with relatively little effort; for example, it can be used to write FP16 matrix multiplication kernels that match the performance of cuBLAS — something that many GPU programmers can’t do — in under 25 lines of code. […] although a variety of systems have recently emerged to make this process easier, we have found them to be either too verbose, lack flexibility or generate code noticeably slower than our hand-tuned baselines.

As a taste of programming for the GPU, let’s consider the “Hello World”55 of this field, which is the sum of two vectors. Here is the Triton version (lightly edited from [117]):

BLOCK = 512
@jit
def add(x, y, z, n):
   pid = program_id(0)
   index = pid * BLOCK + arange(BLOCK)
   mask = index < n
   gpu_x = load(x + index, mask=mask)
   gpu_y = load(y + index, mask=mask)
   store(z + index, gpu_x + gpu_y, mask=mask)

GPUs operate with a high degree of parallelism, and individual threads are organized in blocks. Here, the program_id call provides an id for the current block as the function add runs in parallel on each block of size BLOCK. Within each block, there can be even more parallelism, but the library manages this transparently. The load and store calls are responsible for moving data to the GPU and back. mask prevents out-of-range accesses. Contrast this with the corresponding program written in C using a library called CUDA (lightly edited from [118]):

__global__ void add_kernel(int *a, int *b, int *c, int n) {
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  if (index < n)
    c[index] = a[index] + b[index];
}


#define N (2048*2048)
#define THREADS_PER_BLOCK 512
#define BLOCKS ((N + THREADS_PER_BLOCK-1) / THREADS_PER_BLOCK)
void add(int *a, int *b, int *c) {
  int *gpu_a, *gpu_b, *gpu_c;
  int size = N * sizeof(int);

  cudaMalloc((void **) &gpu_a, size);
  cudaMalloc((void **) &gpu_b, size);
  cudaMalloc((void **) &gpu_c, size);
  cudaMemcpy(gpu_a, a, size, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_b, b, size, cudaMemcpyHostToDevice);

  add_kernel<<<BLOCKS,THREADS_PER_BLOCK>>>(gpu_a, gpu_b, gpu_c, N);
  cudaMemcpy(c, gpu_c, size, cudaMemcpyDeviceToHost);
  cudaFree(gpu_a);
  cudaFree(gpu_b);
  cudaFree(gpu_c);
}

We need a pair of functions, the first of which is a so-called “kernel,” the part that runs on the GPU. It is marked by the keyword __global__, as the documentation explains:

__global__ is a CUDA C/C++ keyword meaning [function] will execute on the device [and] will be called from the host.

Here “device” means GPU. If you find __global__ a confusing choice to mark these functions, you aren’t alone since “global” is already a charged word in programming, mostly having to do with the scope of variables. Calculating which indexes to operate upon follows a logic close to the Triton example. However, in this case, instead of an index range, we have a purely scalar operation. The second function runs sequentially on the CPU and takes care of memory allocation, deallocation, and moving data in and out of the GPU. It invokes the kernel with a unique triple chevron syntax that provides information about the desired level of parallelism. While already demonstrating the conciseness of Triton, this basic example doesn’t highlight its ability to perform automatic optimizations that are far from trivial and need to be implemented manually in C and CUDA. An example where this ability is on display is the implementation of the function softmax, available in Python with Triton [117] and C with CUDA [119]. However, a complete understanding of these examples requires a more in-depth treatment of GPU programming than I can provide. Libraries such as PyTorch [120] and Numba [121] enable equally or even more concise GPU programming. Unfortunately, they also impose a significant performance penalty over Triton or hand-optimized C CUDA kernels, in a highly performance-sensitive application domain.

The quest for conciseness isn’t limited to deep learning. For example, the authors of SparkSQL, a module that extends the big data processing system Spark with relational capabilities,56 are keen on touting its conciseness, both of its internals as well as of sample programs. They do so at least seven times in the paper where it was first described [122]. SparkSQL attempts to combine relational processing with semi-structured, ETL-type57 processing. A basic example from the original article is the following:

employees
  .join(dept, employees("deptId") === dept("id"))
  .where(employees("gender") === "female")
  .groupBy(dept("id"), dept("name"))
  .agg(count("name"))

In this expression, employees and dept are two relational tables. The authors remark that this being just a library within a fully featured programming language (either Scala or Python) means that we can use control structures, variables, and all the other features of a general-purpose language together with the powerful declarative style of SQL (see also Sec. 1.2.10.1). Programming directly in SQL would not afford these possibilities. Since evaluation in SparkSQL is delayed as much as feasible, the developer can build abstractions for common subqueries without a performance penalty, because only the final, fully assembled query will be submitted to Spark and therefore optimized as a unit. As a SparkSQL user put it, it is “concise and declarative like SQL, except I can name intermediate results,” and naming is essential to reuse.
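
To make the point about naming concrete, here is a hypothetical sketch in the Python flavor of the same API, reusing the employees and dept tables from the example above; the filtered table gets a name and could be reused in other queries, yet nothing is computed until the final query runs, which Spark then optimizes as a whole:

from pyspark.sql.functions import count

# A named, reusable intermediate result; evaluation is delayed.
female_employees = employees.where(employees["gender"] == "female")

by_dept = (female_employees
    .join(dept, female_employees["deptId"] == dept["id"])
    .groupBy(dept["id"], dept["name"])
    .agg(count("name")))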

The authors of Scalding, also a Scala library, picked this byline for their twitter account [123]:

Scalding is a scala DSL58 for Cascading, running on Hadoop. It’s concise and functional, and it’s just good, clean fun.

We don’t need to take a slogan at face value. On the other hand, it is what the Scalding team wanted to advertise, as opposed to, say, that it is scalable. Since it runs on top of Hadoop,59 it inherits its scalability properties. However, since Hadoop is written in Java, that was the only language in which to develop programs running on it, and its Java API was phenomenally verbose. For example, I once counted approximately 250 lines to perform a simple relational join (that was using an early version, and I hope more concise alternatives have been developed since). Therefore, it is easy to understand why Scalding developers would like us to focus on its conciseness, which we can appreciate in the following example of a join, despite the verbiage devoted to static typing [124]:

case class Book(title: String, author: String)
case class ExtendedBook(title: String, author: String, copies: Long)

val library1: TypedPipe[Book]
val library2: TypedPipe[ExtendedBook]
// somehow load them with data
val group1: CoGroupable[String, Book] = library1.groupBy { _.title }
val group2: CoGroupable[String, ExtendedBook] = library2.groupBy { _.title }
val theJoin: CoGrouped[String, (Book, ExtendedBook)] = group1.join(group2)

Many data manipulation libraries have the data frame as their central data structure. It can be described as tabular or relational data or, in the jargon of statistics, a two-dimensional collection with one row per observation and one column per variable. There is something about this data structure that makes it so valuable. For example, Spark co-author and Databricks Chief Technologist Matei Zaharia says that “[data frames] give you a very concise way of writing expressions, doing operations” (my transcription from [125]).

Dataframes are built into R as the data structure to represent empirical observations, but how R developers work with them has changed over time. An R library, Dplyr, provides an alternative to built-in capabilities and won out with users, who praise its conciseness — its speed also does help. Take for instance this tweet thread between Vince Buffalo and J. J. Emerson, two evolutionary geneticists [126]:

I ♥ how R makes much of data analysis trivial. How good are genotype imputations by freq class? table(real_geno, imputed_geno, cut(freq, 5))
Some of my old R code (100’s [sic] of lines) could now be reduced to a 20-liner with plyr, ggplot, etc.
Haha, yea mine too. The Hadley-sphere is a black hole for wasted/unnecessary lines of code.

The Hadley-sphere mentioned here is a set of packages, including Dplyr, by the same developer, Hadley Wickham. While not all equally popular, some — Dplyr and Ggplot2 above all — have largely replaced corresponding built-ins or core libraries bundled with the language for most users, a rare feat among libraries, as the advantages of incumbency are as real as in politics (see also Requests in Sec. 1.2.10.3).

Tony Fischetti puts it in no uncertain terms in his blog post “How dplyr replaced my most common R idioms” [127]:

Perhaps more importantly, the dplyr code uses many fewer lines and assignments and is more terse and probably more readable with its neat-o “%.%” operator.

His blog post clarifies that these conciseness advantages are more important than a 2x speed-up for his use cases. The operator mentioned in the quote is a version of the pipe operator (see Sec. 1.2.9.3).

A problem with Dplyr is that its author, to facilitate interactive use, made life harder for people who want to create programs with it [128]. R is a programming language, but, even more importantly, a console for statistical analysis by non-programmers. This dual use is uncommon among programming languages — other examples are MATLAB and SAS. Nonetheless, it is better to support interactive and programming uses with the same constructs and libraries to ease the transition from one to the other. To achieve that, Wickham designed two very similar sub-APIs in the same package, one for each use case. I was involved in a discussion about this as the developer of a related package, and I resisted that option calling it “a doubling of the API footprint” [129]. However, my solution required slightly more boilerplate in programming use cases. More recent versions of Dplyr have gone down a similar path, deprecating one of the sub-APIs (for frugal APIs, see Sec. 1.3.3).

We have already touched upon some of Dplyr’s features in Sec. 1.2.9.3, as they are the best known application of R’s unique macro-like facility. When a data frame table includes columns a and b, we can easily create a new data frame with an additional column c with table = mutate(table, c = a + b), as opposed to the standard R table$c = table$a + table$b. However, what if we wanted to parametrize the columns involved in this operation? Let’s assume the names of the columns are stored, as strings, in three variables a_name, b_name, and c_name. In this case, neither of the above expressions will work. Standard R offers a different operator, [[]] for this case: table[[c_name]] = table[[a_name]] + table[[b_name]]. Dplyr provides a now deprecated sub API of functions, whose names are suffixed with a cryptic underscore character, to support this case: table = mutate_(table, c_name = paste(a_name, "+", b_name)), where paste concatenates its argument to yield the string "a + b". While this form avoids the endless repetition of the data frame’s name, it requires forming R expressions by string operations, an error-prone and hard-to-read approach. The currently recommended solution is to refer to the data frame not by its name but by a conventional name that can occur inside Dplyr expressions only as in table = mutate(table, c_name = .data[[a_name]] + .data[[b_name]]), which is the longest of all the different versions examined here, and restores the repetition of names necessary in standard R. As Steele said about languages, “there are only so many concise notations to go around,” and the same seems to hold for libraries (see Sec. 1.2.9). This is an instructive example of the subtle trade-offs between complexity in a library and in its users’ code and optimizing for different use cases.

Like Keras (see Sec. 1.2.10.1), Dplyr delegates computation to one of several backends, including relational databases. Its developer chose to add advanced SQL features known as window functions, but with limited support [130]:

[…] you can’t access rarely used window function capabilities (unless you write raw SQL), but in return, frequent operations are much more succinct.

Conciseness is difficult; therefore, as we have seen before (see Impala in Sec. 1.2.9.9), an option is making popular features and idioms more concise while allowing a more verbose expression for the less common ones. Here the pain goes further, though, since we have to “punch through the abstraction,”60 reaching for the separate, lower level language of the backend, SQL, and losing the portability61 of “pure” Dplyr expressions.

Unlike R, Python doesn’t come with data frames built-in, but the library Pandas provides equivalent capabilities. However, it is the opinion of some practitioners that it is less elegant than its R counterpart, Dplyr. The project Pandas-ply aims to provide a more Dplyr-like API for Python. Its authors write [131]:

In our opinion, this pandas-ply code is cleaner, more expressive, more readable, more concise, and less error-prone than the original pandas code.

It is still in an experimental stage, and it is far from having the impact on the Python user base that Dplyr had on R users. You can judge whether it should by looking at the following examples, from the Pandas-ply introductory material, the first of which is written using Pandas:

grouped_flights = flights.groupby(['year', 'month', 'day'])
output = pd.DataFrame()
output['arr'] = grouped_flights.arr_delay.mean()
output['dep'] = grouped_flights.dep_delay.mean()
filtered_output = output[(output.arr > 30) & (output.dep > 30)]

Pandas-ply lets you instead write:

(flights
  .groupby(['year', 'month', 'day'])
  .ply_select(
    arr = X.arr_delay.mean(),
    dep = X.dep_delay.mean())
  .ply_where(X.arr > 30, X.dep > 30))

While the line count is similar, the Pandas example is more “busy,” with temporary results stored in intermediate variables and more repeated variable names. The Pandas-ply version uses a chained method call syntax that avoids intermediate variables and uses a special object X to simulate some of the Dplyr syntax based on R’s macro-like facility, which has no equivalent in Python.62 However, even more telling is the comparison with their R closest counterpart, Dplyr. Recalling the example (see Sec. 1.2.9.9) where we sought to compute, for each group of data, the entries matching the maximum and minimum for a variable, it is instructive to try to reproduce that with Pandas and Pandas-ply. Simply applying a filter operation to grouped data doesn’t work for either package because grouped data frames don’t have the same methods as their ungrouped counterparts. In Pandas, we can use this workaround:

(mtcars.groupby("cyl")
  .apply(lambda x: x[(x["mpg"]==x["mpg"].max()) | (x["mpg"]==x["mpg"].min())])[
    ["mpg", "model"]
  ]
  .sort_values("mpg"))

The same workaround also works for Pandas-ply:

(mtcars.groupby("cyl")
  .apply(lambda x: x.ply_where((X.mpg==X.mpg.max()) | (X.mpg==X.mpg.min())))
  .ply_select("mpg", "model")
  .sort_values("mpg"))

In Pandas or Pandas-ply, a grouped data set is more like a list of groups, and we can use an approach from functional programming to apply a function to every element in this list, each a data frame. In this function, created with a lambda expression, the desired selection is performed using standard Pandas data frame operators. It is hard to ignore the numerous references to X, similar to the .data identifier in Dplyr, but for which Dplyr offers more concise, albeit less general, alternatives. On the result of this apply call, we can select variables of interest and perform a sort, alternating operators like [[]] and methods like sort_values, but in closer analogy with Dplyr. Unlike Dplyr, Pandas doesn’t provide specialized idioms for interactive sessions, as this use case is less dominant and more recent in Python. Independent of the use case, treating grouped data frames differently from ungrouped ones makes the API harder to use for no increase in power. It represents a loss of orthogonality, the ability for a set of functions or methods to operate on different types of objects with no exclusions. Here we have an example of a lost opportunity to use the more concise form:

table.group_by(...).filter(...)

instead of the more verbose:

table.group_by(...).apply(lambda x: x.filter(...))

The former is simply disallowed: an arguably intuitive line of code returns an error instead of something useful.

One of the advantages of being an incumbent, like Pandas, is third-party support. For example, using package Modin together with Pandas, we can partially parallelize a program [132]. Just by importing Modin, unmodified Pandas code will run in parallel. It is the most concise parallelization I have ever seen (compare it with MPI in Sec. 1.5).
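
As a sketch of what this looks like in practice (the file name here is hypothetical, and the columns are those of the flights example above), the documented idiom is to import Modin’s drop-in replacement for Pandas under the usual alias and change nothing else:

# Only the import changes; the rest is unmodified Pandas code,
# now executed in parallel by Modin's backend.
import modin.pandas as pd

flights = pd.read_csv("flights.csv")
print(flights.groupby(["year", "month", "day"]).arr_delay.mean())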

1.2.10.2 Statistical Graphics

In statistical graphics, the quest for conciseness starts with its design, independent of any implementation. In The Visual Display of Quantitative Information, author Edward Tufte defined the data-ink ratio, the ratio of ink spent to display data vs. the total amount used in a chart, and recommended maximizing it [133]. It is a somewhat counterintuitive definition, as it may seem to encourage the use of heavy marks, like thick lines and large dots; this outcome, though, is prevented by another one of Tufte’s principles, that of data density — the number of data points per unit of area. The result is graphics that display lots of data with just enough ink, minimize and de-emphasize everything that isn’t data, like grids and scales, and eschew ornamentation altogether.

SimplexCT, a data analysis consultancy, provides a step-by-step example of transforming a chart that Tufte wouldn’t hesitate to call chartjunk into one following his principles [134]. Its starting point is full of useless and distracting graphical elements, as in Fig. 5.