The Ascetic Programmer
How asceticism benefits programming, science, and the arts
Cover illustration based on a program by Robert Nystrom [183]. Used by permission.
Many thanks to Marco Scanu and David Bylsma for providing feedback on an early draft, and to my wife Maria for patient proofreading and unwavering support.
Introduction
Ascetic programming is the writing of computer programs under restrictions unrelated to resources such as memory or power. They range from using fewer lines of code to forgoing certain programming language features. My interest in ascetic programming started with challenges among coworkers, which we undertook to keep our skills sharp, our curiosity for programming alive, and, not least, for bragging rights.
We were not the first to think of imposing restrictions on
programming other than those on resources. For instance, you may be
familiar with Edsger W. Dijkstra’s condemnation of the goto
statement (see Sec. 1.6).
However, among possible restrictions, the one on program length has
caught my long-term interest. It has connections to different facets of
computing, from maintainability to testing to theory. These associations
are mostly positive, meaning that shorter programs are better along
several dimensions: for instance, they are more readable and
maintainable and contain fewer errors.
Asceticism originated as a spiritual practice millennia ago. Ascetic restrictions traditionally target worldly pleasures and promise salvation or enlightenment in return. This book will focus mainly on code length restrictions and, more generally, description length restrictions, and our rewards will be plenty and varied, but, unfortunately, won’t include salvation.
Asceticism in programming comes in two flavors, roughly described by the slogans “doing more with less” and “less is more.” On the one hand, we seek to deliver the same functionality with fewer lines of code; on the other, we aim to identify essential features and discard the rest. I will refer to the former as conciseness and to the latter as frugality. We will discuss both as they apply to programming at different scales, from snippets of code to complete systems.
Beyond programming, I discovered unexpected examples of asceticism in statistics — where the preferred jargon is parsimony — philosophy, the visual arts, music, literature, and more. Outside science and engineering, the function of an artifact is less clear or undefined, and thus the boundary between conciseness and frugality becomes blurred. The equivalent of program length may or may not be available: for instance, it is, to a degree, for statistical models but less so for the visual arts. I hope you will find these more speculative forays interesting.
This book is a collection of cases or testimonials where brevity plays a positive role and where verbosity is the villain. The opposing view that verbosity is superior has also been allotted some space, as a counterpoint, without any pretense of equal treatment. Don’t expect a grand theory or a practical method, but rather a series of interconnected anecdotes and expert opinions, interspersed with the occasional theorem and research paper, some from giants in their respective fields and some from regular people working in the trenches. Let the case for asceticism build from the totality of the book rather than a single, decisive argument.
I hope that, by the end, you will be convinced that asceticism benefits engineering, science, the arts, and more. May this book inspire you to adopt unnecessary restrictions in your endeavors.
Who should read this book?
This book covers a wide variety of subjects, from computer programming to science to the arts. It doesn’t go particularly deep in any of them, but, reflecting my own background, it provides the most detail in the section about computing, less so in the one about science and only makes a cursory pass through the arts. I hope that anyone who has some familiarity with programming and mathematics will be able to enjoy most of the book. While there are plenty of internal connections, many of which I highlighted with cross-references, they are not essential for understanding. Therefore, skipping a subsection here or there won’t prevent you from enjoying the rest.
Reading advice
On devices that allow resizing the text view or the font, I recommend you choose settings such that the display shows a full line of code without wrapping — up to 88 characters. Where supported, long lines will scroll horizontally, and only as a last resort will they wrap. Since, in computer programs, indentation is meaningful, and this book often discusses line counts, wrapping would reduce clarity. For the most part, footnotes contain definitions of concepts or entities that may be unfamiliar to some. If you don’t feel like you need to read a footnote to improve your understanding, it is probably safe to skip it. Other footnotes provide license information for material reproduced in this book when appropriate.
Code snippets are for illustrative purposes only, are stripped of
import
or equivalent statements to access libraries, and
are not guaranteed to be correct even after restoring those statements.
1 Computing
1.1 Metrics
A first step towards espousing conciseness as a goal in software engineering is convincing ourselves that lines of code, or LOCs, are a valuable measure of program complexity. Certainly, 100,000 lines of well-structured code are simpler to understand or maintain than the same amount of so-called “spaghetti” code, which contains many interactions between its parts. However, one of the pioneering figures in computer science, Edsger W. Dijkstra, posed the problem of size as a fundamental one in software engineering [2]:
[…] widespread underestimation of the specific difficulties of size seems one of the major underlying causes of the current software failure.
Elsewhere, Dijkstra warned again about the dangers of accumulating large codebases [3]:
[…] if we wish to count lines of code, we should not regard them as “lines produced” but as “lines spent”: the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.
Dijkstra’s preferred solution, namely structured programming, is intended to make it easier to deal with large codebases rather than to reduce code size (see Sec. 1.2.4).
Microsoft co-founder Bill Gates, according to Internet folklore, said:
Measuring programming progress by lines of code is like measuring aircraft building progress by weight.
In more recent times, this sentiment is still present among developers. For example, free software advocate Chris Warburton writes on Hacker News2 [4]:
Software is a liability, not an asset. We should produce as little as needed, and delete it if possible.
In the classic software engineering book The Mythical Man-Month, Frederick P. Brooks Jr. states [5]:
Productivity seems constant in terms of elementary statements, a conclusion that is reasonable in terms of the thought a statement requires and the errors it may include.
Albeit expressed in statements rather than lines, following the conventions of the time, this supports the idea that LOCs are a reasonable measure of complexity. This book follows the common practice of counting LOCs. Yet this count is partially dependent on the vagaries of code formatting (see Sec. 1.2.7). An alternative is counting tokens, the elements of the representation of a program that a computer builds to run it. It is a formatting-independent measure, which we will adopt in one specific example (see Quicksort in Sec. 1.2.6).
We can often run into invariants like the one referenced above expressed in terms of LOCs. For instance, John Lions, the author of a well-known guide to the Unix operating system (see Sec. 1.3.6), states [6]:
It has often been suggested that 1000 lines of code represents [sic] the practical limit in size for a program which is to be understood and maintained by a single individual.
While the exact value in this statement may be debatable and its origin unclear, what is noteworthy is that it uses LOCs as its unit of measure. Code structure, programming language, or application domain appear to be secondary.
No other quantity has emerged as a viable substitute for LOCs. Herraiz and Hassan review some of the alternatives and come to the following conclusion [7]:
All the complexity metrics are highly correlated with lines of code, and therefore the more complex metrics provide no further information that could not be measured simply with lines of code.
Among the alternatives is cyclomatic complexity [8].
It focuses on the flow of control in a program by viewing it as a graph.
A node represents a sequence of statements devoid of control statements,
and if
and while
statements correspond to the
subgraphs in Fig. 2.
Figure 2: Graphs corresponding to if and while statements, respectively.
With this graph representation, the cyclomatic complexity of a
program is one more than the circuit rank3 of
its associated graph, that is, E - N + P + 1,
where E is the number of edges, N is the number of nodes, and
P is the number of connected components. If you suspected this is a
long-winded way to count the number of control statements, you would be
right, unless we consider multiway branching statements like
switch
in C. It is possible to double the size of a program
without altering its control structure — say, interleaving logging
statements with the original ones — and, therefore, without changing its
cyclomatic complexity. However, experimental studies show that it
correlates well with program size, and those that used it to predict the
number of bugs4 while controlling for program size
had inconsistent results. A possible conclusion is that cyclomatic
complexity isn’t so valuable “since program size is not a controllable
feature of commercial software” [8]. However, program size, controllable
or not, predicts crucial properties of software.
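To make the circuit-rank formula above concrete, here is a small sketch of my own (not from the cited studies), applying it to a hand-built control-flow graph for a function containing one if and one while:

def cyclomatic_complexity(edges, nodes, components=1):
    # As defined above: one more than the circuit rank E - N + P.
    return len(edges) - len(nodes) + components + 1

# Nodes are straight-line blocks; the if and while statements contribute
# the branching and looping edges, as in Fig. 2.
nodes = ["entry", "if", "then", "join", "while", "body", "exit"]
edges = [
    ("entry", "if"), ("if", "then"), ("if", "join"), ("then", "join"),
    ("join", "while"), ("while", "body"), ("body", "while"), ("while", "exit"),
]
print(cyclomatic_complexity(edges, nodes))  # 3, i.e., one more than the two control statements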
One of these properties is the number of bugs. According to Capers Jones, a consultant on software methodologies, it is more than proportional to program size, falling in the range of 0-25 defects per KLOC (a thousand LOCs) for programs smaller than 2 KLOCs, but rising to 4-100 for those over 512 KLOCs ([9]; as quoted in [10]).
However, even more important than the number of bugs is the probability of project failure, defined as substantially missing a project’s goals. For example, Stack Overflow and Discourse co-founder Jeff Atwood writes [11]:
Here’s the single most important decision you can make on your software project if you want it to be successful: keep it small. Small may not accomplish much, but the odds of outright failure — a disturbingly common outcome for most software projects — is [sic] low.
W. Wayt Gibbs, a science writer for Scientific American, explains [12]:
Software is exploding in size as society comes to rely on more powerful computer systems. That faith is often rewarded by disappointment as most large software projects overrun their schedules and many fail outright — usually after most of the development money has been spent.
1.2 Concise Software
1.2.1 Programming
I actually found a way to shorten his code by 1 word and it had only taken me 20 years to do it. […] it was the only time I’d managed to reduce some of Gosper’s code.5 It felt like a real victory. And it was a beautiful piece of code.
In this passage from an interview with Guy L. Steele Jr., inventor of the Lisp dialect Scheme, we read how the quest for conciseness transcended its practical origins and mutated into an aesthetic one [14]. Back in the day when Steele and Bill Gosper, also a founding figure for symbolic computations and Lisp, left their mark on computing, computer memory was so scarce that shortening a program even by one word could have been necessary. Twenty years later, the evolution of hardware and software had been such that there could no longer be practical motivations to shave off a word; only the intellectual ones remained.
Yukihiro Matsumoto, inventor of the programming language Ruby, agrees with the aesthetic qualities of concise code but also adds the language of ethics [15]:
Brevity is one element that helps make code beautiful. […] In the vocabulary of programming, brevity is a virtue.
Dan Ingalls, a key contributor to the implementation of the language Smalltalk, recalls that, as he learned the language FORTRAN, he “tried to see how much shorter I could make programs, and stuff like that,” suggesting that, for him, it was a learning or playful activity [14].
For an experimental study, Lutz Prechelt asked several mostly inexperienced programmers to solve a simple string problem in various languages monitoring several properties of the solutions, including length. He concludes [16]:
Interpersonal variability, that is the capability and behavior differences between programmers using the same language, tends to account for more differences between programs than a change of the programming language.
Focusing on program length, this study finds that choice of language and programmer skill or style equally contribute to variation. Yet the point that matters for us is the wide variability in program length for the same task in the same language (see Sec. 1.2.9 for language-dependent conciseness). In this study, the programmers were allowed to code according to their preferred style. Would they have performed differently had they been asked to minimize length?
They may have, but it isn’t something they will often be asked to do in their careers. We have seen that code size was a priority in the pioneering age of computing, and it might still be when developing software for underpowered devices. However, I am not aware that code size currently is an externally imposed constraint on major platforms. One would expect web development to be an exception, since program download is part of the load time of a page. Yet, that hasn’t prevented web pages from ballooning in size (see the “web obesity crisis,” Sec. 1.5). Not only is minimizing program length no longer a goal: at least in some situations, there has been pressure to increase it. In an interview, former Microsoft CEO Steve Ballmer recalls how IBM required billing by the KLOC and that there was no way to convince them that this system discouraged the search for more concise solutions [17]. Once a task was estimated to require, say, 20 KLOCs, that put a cap on how much money the developer could make, and it became a target to hit.
Free from external pressure, do programmers aim to keep their programs short? Some argue in favor of slowing down, which is related to program length since time is a limited resource. For example, Poul-Henning Kamp, the author of the web caching program Varnish, estimates that he wrote what he intended to be his “highest quality program — ever” at the breakneck speed of 5 LOCs per hour [18]. If that sounds slow, the equivalent estimate in the previously mentioned Mythical Man-Month is much slower: 10 LOCs per programmer-day.6 Kamp adds that for life-and-death software, where correctness is paramount, such as that developed for human space flight missions, the same estimate drops to one LOC per programmer-day. Surely such a coding speed must discourage code bloat (but see Sec. 1.5 for bloated space software).
It isn’t only rocket science. Matt Lacey, a mobile developer and consultant, in a post titled “You’ve only added two lines — why did that take two days!” lists plenty of reasons why, at times, development may slow down to such a crawl [19]. He reports about fixing a bug, first understanding it in detail, from which subsystems are affected to how to reproduce it. Then, unsatisfied with just making the issue disappear, Lacey endeavors to find its root cause, if the problem may represent a class of related ones and how to fix it with a high-quality solution that won’t break his program elsewhere. Finally, he spends some more time preventing this or similar problems from happening in the future. Two days to perform all these tasks, even if the solution is only two lines, don’t seem like a waste of time.
Everyone in the software industry knows that slowing down is the opposite of how it works. There are entire methodologies aimed at creating arbitrary deadlines and fostering a fictional sense of emergency. Limits on size, on the other hand, are rarely, if ever, encountered. No boss ever told my teammates or me: “you could have done this with less code,” or “let’s try and keep it under 1000 LOCs.” However, I found a counter-example (see Dwm in Sec. 1.2.11 for an additional one). An anonymous Quora user asks if he should “fire someone who unnecessarily wrote 500 lines of code” without context [20]. We don’t know what these lines were supposed to achieve or if there was a pattern of verbosity in this engineer’s work. Rejecting the rush to summary judgment, another user replies:
As a programmer, I’ve written many, many lines of unnecessary code. Why? Because sometimes you have to write 500 unnecessary lines of code to get to the 50 absolutely necessary lines of code.
(for code deletion, see Sec. 1.2.2).
Going back to Kamp and his Varnish, he further writes:
Back when I was a kid I could write 1000 lines in a single sleep-deprived session across midnight, but it didn’t take that long before I discovered that I had to throw out most of it once I woke up again.
I was 40 years old when I started Varnish and I had 22 years of professional experience, a lot of them staring at, and often fixing/improving/refactoring, other peoples [sic] source code.
Patrick Burns, the author of Tao Te Programming, agrees that experience is a factor [21]:
Novices may think that the increase in the number of lines is good. More experienced programmers will know that the good parts are when the number decreases.
He goes one step further: instead of just accruing code slowly and carefully, Burns talks about reducing it (for code deletion, see the next section). As to bug fixes, Burns recommends making them as short as possible. He writes:
Change as little as possible. A one-character change is optimal.
One character is what would have saved the Mariner I probe from destruction (but it wasn’t in its control program, see Sec. 1.5). Burns brings this line of thought to its ultimate consequences and thinks “of programming as an exercise in compression.” While others use this language (see Steele’s “Huffman encoding problem” in Sec. 1.2.9 and Armstrong’s system compressor in Sec. 1.2.12), I believe the word “compression” has a more specific meaning in computing distinct from conciseness. Compressed data aren’t directly usable: they need to undergo a process of decompression that reconstitutes the original before being, for instance, displayed as images. Concise programs are run, read, and edited as they are, and there is no way to reconstruct a verbose version from which they may have been derived. A concise program is an independent entity, which can’t be said about compressed data, tightly bound to its uncompressed version. However, we could also take the other side of the argument: that decompressing a program is one of many transformations programs may undergo before being executed, such as macro expansion or translation into intermediate or machine language. The crucial difference between a compressed program and a concise one is that the latter is human-intelligible; it is the version developers read, modify and maintain. For compressed programs, that is the uncompressed version. A compressed program isn’t intelligible to developers, at least with known compression techniques.
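A minimal sketch of my own to illustrate the distinction: compressed source code is opaque bytes that must be reconstituted before it can run, whereas a concise program is executed exactly as written.

import zlib

source = 'print("hello")'
blob = zlib.compress(source.encode())  # opaque bytes: not readable, not runnable
exec(zlib.decompress(blob).decode())   # only the reconstituted original can run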
It is unfair to complain that engineers write unnecessarily verbose code when they are not taught or rewarded for being concise. However, one company goes against the grain. Comma.ai, an AI startup, added it to their requirements for aspiring employees [22]:
We value those who can do more with less code; software engineering doesn’t have an external selection pressure for low part count and cost like EE and ME, but you pay massive costs down the line in debugging and upkeep.
In an echo of “art for art’s sake,” James Hague, a video game developer, enthusiastically declares [23]:
Writing concise to-the-purpose solutions is a primary reason for programming in the twenty-first century.
If that is the future, it isn’t “evenly distributed yet” [24].
1.2.2 Deleting Code
If code isn’t written directly in its most concise form, a solution is to delete some of it. Of course, removing random lines won’t work: we need to identify useless ones, if there are any. “I love deleting code,” declares Nathan Marz, the author of the pioneering stream processing system Storm [25]. He isn’t alone. For instance, Chromium browser developer Devlin Cronin writes in a related bug-tracking thread [26]:
It [a proposed code change] has concrete benefits […] the ability to potentially delete a bit of UI code (something that always makes me happy!).
It may sound like an odd statement for someone whose craft is writing code. Artists have destroyed or mutilated their work in fits of disappointment [27] or as a performance piece [28] or have deliberately created ephemeral works [29]. However, it isn’t a widespread practice. In other branches of engineering, there is a pressure to reduce part count, and simplifying a design may be rewarding in and of itself. However, owing to a clear distinction between design and production, unnecessary parts come off a blueprint, not a finished product. Once removed, it is no longer necessary to make, procure or assemble them, and they don’t contribute to the cost or weight of an artifact anymore. In software engineering, once created, a line of code’s further costs are in the future and harder to estimate, in the form of debugging, learning, documentation, maintenance, and, as we have seen, even project failure. In my decades of experience with several employers and many more clients, we hardly ever tried to estimate those delayed costs. Therefore, the incentives to delete are weaker. However, good programmers know how much code they have to read and that someone has to maintain each line and so on. They understand the burdensome side of a line of code.
The know-how on deleting code is sparse. Nobody taught me how to delete code. A post titled Advice on Deleting Code sparked my interest [30]. At 1600 words long, it doesn’t seem to convey much more than an appeal not to leave dead code7 behind, whether commented out8 or otherwise disabled or forgotten. That alone, though, is a good suggestion, as the following anecdote exemplifies.
I once worked on a codebase where a static analysis9 marked 1/3 of the code as unreachable. We supported this evidence with an analysis based on logging in production and for an extended time. Nonetheless, we couldn’t overcome resistance to deleting code that might still be useful to someone someday. This even after giving engineers veto power on removing any code for which they knew there was a use, and even after explaining that “deleting it” just meant “preserving it forever” in version control,10 from where it could be restored on demand. The unspoken truth was that few believed dead code to be harmful enough to warrant any risk of removing active code, thus creating bugs. The business side could not believe we were spending time on a non-customer-facing issue, and, in a way, they were right because nothing came of it.
Inexperience doesn’t help. Developer Neil Kakkar writes [31]:
I was very uncomfortable deleting shit or obsolete code. I considered what was written eons ago sacred.
Under the guidance of senior engineers, he reports having become more aggressive in deleting code but still recommends removing only dead code.
Some people even advocate the appointment of “code removal leaders.” For instance, in a Hacker News discussion about career prospects for senior developers, we read [32]:
I think the most valuable engineering leader would be someone who can remove code, or at least prevent unnecessary code from being written and maintained.
That reminded me of one of the best developers with whom I have ever worked. He had discovered another coworker was generating low-quality code in copious amounts, primarily by cut and paste. To address this, he spent the last half hour of each workday reviewing the other developer’s work, reverting all of his changes if they affected foundational modules. Luckily, the functionality to replace was minimal. This arrangement seemed to work well for everyone. What did this developer get from deleting all that code? His goal was to keep the amount of code he needed to understand and maintain in check. However, that isn’t the only valid motivation.11
Joe Armstrong, creator of the language Erlang, talked about “removing” as fundamental to how he went about his job [14]:
I would write a specific algorithm removing all things that were not necessary for this program. Whenever I got the program, it became shorter as it became more specific.
Reportedly, some can go beyond that. For example, Jon Bentley, inventor of the k-d tree data structure and author of the Programming Pearls series of books, reports that he “once heard a master programmer praised with the phrase, ‘He adds function by deleting code’ ” [15].
Others would like to fix bugs by deleting code. This tongue-in-cheek exchange between Michael B. Johnson, a software architect who contributed to many Pixar films, and Pete Barber, a software engineer, proposes this approach [33]:
In order to fix that bug I had to:
- have a good night’s sleep
- take a 2 hour hike
and then write one line of code.
deleting one line to fix it would have been even better
Fine. Ruin my little victory …
According to psychological studies, fixing by removing doesn’t readily occur to regular people, not just programmers (see Sec. 2.3). Only surgeons have made an art of it, owing to the redundancy in the human body and the scarcity of spare parts for it.
In another passage from [14], Armstrong goes beyond the practical:
If you start removing things, if you get to the point where if you were to remove anything more it would not work any more — at this point it is beautiful.
We will later read a similar statement generalized to all human artifacts and even natural forms (see Sec. 2.3). A crucial property that distinguishes an engineered object from a work of art is that the former serves a specific function: a program computes something or a building must let people in and out — unless it is a prison. In the arts, authors can push conciseness to nihilism, to the silent piece of music (see Sec. 3.3) or the empty canvas (see Sec. 3.2.3). Actually, these extremes can be reached by programmers as well, on occasion. This is the description for JavaScript module Noop2 [34]:
No operation as a module™. The long awaited successor to noop. Noop2 does even less than its predecessor, and is 100% tested to guarantee nothing happens.
Its author, Yoshua Wuyts, considers it “both a joke and actually useful,” reportedly to write tests.
1.2.3 Repetition
The most obvious opportunity for deletion is repetition. With physical artifacts, it is a form of order: the pillars of a bridge, its equally sized spans, the columns of a church, and so forth. Repetition brings about economies of scale, simplifies maintenance, and provides visual rhythm. On the contrary, it is hard to justify in computer programs since it is avoidable. There are many ways to eliminate repetition: for instance, loops, to execute the same code multiple times; and functions or methods, to name a snippet of code and recall it by name. These can optionally have parameters, like their mathematical brethren (see Sec. 1.2.4). Detecting repetition doesn’t require advanced skills or creativity: it is simple and mechanical. So much so that the code analysis service Code Climate includes an analysis tool to detect duplicated code [35]. Even if, according to their website, it “uses a fairly simple algorithm,” it detected a few almost identical duplications that had snuck into my own code. Unfortunately, their removal and the testing of the resulting code are still the user’s responsibility.
Indeed, repetitions need not be identical, but that is a fuzzy statement. Where does the quasi-repetition stop, and where does the “vaguely related code” start? In a simple case, we may find a simple parametrization that unifies all the repetitions under consideration. In a complex one, the corresponding abstraction may or may not be expressible in the language of choice, or it may require highly unsafe constructs whose use is discouraged (see Sec. 1.2.9.2). Another observation is that it is easy to generate duplications even with the best of intentions. Even if we aren’t maliciously wielding the cut-and-paste function in the editor to exceed our KLOC targets (see IBM in Sec. 1.2.1), a repetition could result from the convergence of two similar ideas. We wrote the first copy too long ago to remember, or it is different enough that recognizing it as duplicative requires careful analysis. Even when identification is easy, there may not be time to eliminate repetition because of the work required. If that entails writing a new function, coding standards may also require unit tests,12 a requirement that copy-and-pasted code escapes. Therefore, the elimination of the repetition must wait. However, writing a ticket13 to mark some duplication we left unattended generally isn’t a career-boosting move, and those tickets never get taken care of regardless. The repetition lives to see another day and we move on to more glorious endeavors, like launching a new feature. Duplications accumulate in codebases like in genomes (see Sec. 2.4), slowly drifting apart over time.
Repetitions aren’t without drawbacks, though. Burns issues the following warnings [21]:
If you want to make a change to the repetitious code, then you have to change every occurrence. And before you can change them, you have to find them. If the change is a bug fix and you miss one, then you have just created a pernicious, hard-to-find bug. (“I already fixed that bug, it has to be somewhere else.”) If the change is a modification to the behavior and you miss one, then you have just created a pernicious, hard-to-find buglet. Buglets can grow into bugs.
(Burns doesn’t define a buglet, but, if I understand it correctly, it is a software defect that hasn’t yet manifested itself, a potential bug). That is why many programmers embrace the so-called DRY principle [36]. In the words of Matsumoto [15]:
Redundancy is defined as the duplication of information. In order to eliminate redundancy, we follow the DRY principle: Don’t Repeat Yourself. The concept of DRY is the antithesis of copy-and-paste coding.
Despite Matsumoto’s strong statement, there is nothing in Ruby to
enforce this principle, nor in any other language I am aware of. I can
only speculate why that is the case. The problem may be deciding what
repetitions are worth removing, particularly how small and after how
many repetitions. Is printing a new line worth a special construct? Is
println()
better than print("\n")
? Or is
x++
better than x = x + 1
? Different languages
give different answers to these questions. If you had to print the same
message n times, for what value of n is the following solution
print(message)
print(message)
...
print(message)  # n times
better than this alternative:
for i in range(n):
    print(message)
I know which one most people would choose for small or large values of n, but, for intermediate values, it is anyone’s guess.14 Without the lens of conciseness, these choices are seen as debatable and even just a matter of style.
Another obstacle to enforcing the DRY principle may be the issue of
approximate repetition. Banning only identical copies would only
represent a modest level of enforcement. Extending the proscription to
approximate ones would face the problem of choosing which ones to
remove. Moreover, if we replaced two code snippets with a function
f(x)
, invoked with different values of x
to
replace each one, we would need to prove or at least corroborate the
equivalence of those snippets to f
. The equivalence problem
is, in general, extremely difficult,15 but less so when the
original snippets are similar.
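As a small, invented illustration of such a parametrization, consider two near-duplicate snippets that differ only in a constant; replacing both with a parametrized function makes their equivalence to it easy to corroborate, which would not be the case for less similar snippets.

# Hypothetical near-duplicates: the only difference is the discount rate.
def price_with_member_discount(prices):
    return [p * (1 - 0.10) for p in prices]

def price_with_employee_discount(prices):
    return [p * (1 - 0.30) for p in prices]

# One parametrized replacement, playing the role of f(x) above:
def price_with_discount(prices, rate):
    return [p * (1 - rate) for p in prices]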
However, the simplest explanation may be that code quality is considered outside the purview of compilers and interpreters and something that can only be enforced and improved by programmers. Nonetheless, there are separate tools, such as CPD, the aptly named Copy-Paste Detector, to detect duplications [37]. In my experience, their use could be more widespread in the industry.
Sometimes the obstacles preventing this process are in our skills and the complexities of the code: finding repetitions and unifying them. At other times, they are a question of priorities and goals. In an interesting anecdote, Dan Abramov recalls how he identified repetitions in an otherwise working piece of code [38]. After refactoring it, this is how he described the results:
The code is half the total size, and the duplication is gone completely! So clean. If we want to change the behavior for a particular direction or a shape, we could do it in a single place instead of updating methods all over the place.
The remarkable twist is that he had to revert his elegant solution under his boss’ orders and became convinced the messy code with the duplications was better, in what may be an engineering version of Stockholm syndrome. The reasons he brings range from the social — he didn’t ask permission — to the over-engineering — the clean solution could not support some future hypothetical use cases (see YAGNI in Sec. 1.3.1). Adding and removing code aren’t just two equally important sides of programming: one is considered an ordinary, day to day activity; the other, an exceptional, dangerous one requiring a broad consensus.
1.2.4 Abstraction
When talking about approximate repetitions, we saw that we could sometimes replace two blocks of code with one, possibly shorter than the sum of the original two, which performs the function of both. To achieve this, we need to unify their commonalities and reduce their differences to a small fraction of the code or, even better, parametrize them. This process is a special case of what is known as abstraction. As Dijkstra said in his Turing Award Lecture [39]:
[…] the only mental tool by means of which a very finite piece of reasoning can cover a myriad cases is called “abstraction”; as a result the effective exploitation of his powers of abstraction must be regarded as one of the most vital activities of a competent programmer.
It is natural to think that a piece of code that covers “a myriad cases” would spare us from the effort of coding each case one by one, thus reducing code length by a corresponding factor. For instance, C. A. R. Hoare, Turing award recipient and inventor of the Quicksort algorithm, declared: “The way to shorten programs is to use procedures”.16
However, that isn’t the mechanism proposed by Dijkstra. In the same speech — indeed, the same paragraph — he advocated using abstraction to reduce the “intellectual effort” required to write a program of a given length, not to reduce its length. He didn’t formulate this recommendation based on empirical data or a mathematical theory. He started instead from the common-sense observation that if all parts of a program interact, this results in many interactions that need to be understood, debugged, maintained, etc. Their number grows with the square of the size of the program. On the other hand, if each part interacts with a small number of others, this results in a much more manageable linear number of interactions.
Dijkstra appears to assume that abstract code results in programs of the latter type, but this is far from obvious. On the contrary, a more general function creates indirect dependencies between parts of a program and, thus, more of the entanglement that Dijkstra suggests reducing. Taking sorting as an example, let’s assume we have two independent programs, one that needs to sort arrays of integers and the other of strings. In the first scenario, each program uses its specific sorting function, and there is no other code-sharing between the two. In the other, they rely on a shared library function that can sort any collection. This sharing constitutes an indirect dependence between the two programs. If changes in the sorting code were to become necessary because of the needs of one program, a bug could appear that cripples the other one.
If keeping the two programs completely independent is paramount, an option is duplicating the sorting function, no matter how general, and letting each program depend on its separate copy. When involving entire libraries, developers call this forking the shared dependency (from the idea of a fork in the road). Albeit an operation that results in more code to maintain, developer, blogger, and speaker Thomas Figg advocates for repetition to be temporarily accepted, to prevent coupling, or to let a better abstraction emerge from examples ([41], see also “interpolation” below).
There are other ways to ensure isolation between parts of a program besides duplication. For instance, modules or functions which are clearly defined and independently tested minimize the coupling risk. In my previous example of modifications to a sorting function, they would be unusual for a task that is so clearly specified. Because of this, such a function is simple to test, and extensive testing reduces the likelihood of introducing bugs when modifying it. Therefore, sharing a sorting function between modules doesn’t induce any coupling worth worrying about. Indeed, such a function is available in any language I can think of as a built-in or standard library function, and few developers would bother creating their own.
A program built primarily out of functions as clearly specified as a sorting function behaves more like several small programs than a single large one. All its components will still run on one machine, sharing its resources and crashing all at once if only one is faulty. However, from the point of view of development, the different parts are more swappable: we can replace a buggy function with another; a slow implementation can give way to a faster one. Microservices take decoupling to the next level by running parts of a program or application in isolation.17 This way, a microservice can fail without dragging down the rest, and graceful degradation is possible. At an extreme scale, that is why the Internet works despite being based on, possibly, billions of LOCs:18 it can keep working even if at any given time parts of it aren’t — there are nonetheless occasional large-scale failures [42].
In summary, conciseness and modularity are two sides of coping with size: if size can’t decrease, by splitting code neatly into modules, we can deal with smaller chunks at once.
If writing more general code is harder and introduces more
dependencies, does it offer any advantages? Well, what is better than
covering “a myriad cases” (see Dijkstra’s talk above) with a single
coding effort? “Wherever there is repetition, there is an opportunity
for abstraction,” states Burns [21]. For example, imagine having a
sorting function for each type of collection and then for each element
type: one for arrays of integers, one for lists of strings, and so on.
That would result in a large amount of repetitive code. Indeed we will
see an example where a library (in java.util.Arrays
, see
Sec. 1.5) contains seven
copies of essentially the same functions because the language hampers
further generalization. Moreover, bugs can lurk for a very long time.
For example, an implementation of binary search,19
generally accepted as correct, hid a bug for 20 years after being
published [43].
Its Java standard library version contained the same error for nine
years [44]. Imagine
the effort to fix this bug if every project maintained its version of
binary search.
Abstracting over different data types is only one instance of a general phenomenon. Often code can grow as an accumulation of special cases until we realize a general rule or solution that can replace them and covers additional ones as a bonus. Using a statistical analogy, we are replacing individual data points with a model and predicting new ones as a result (for statistical modeling, see Sec. 2.1).
Rich Skrenta has observed something similar [45]:
[…] more code often means less flexibility and functionality. This is counter-intuitive, but a lot of times a simple, elegant solution is faster and more general than the plodding mess of code produced by a programmer of lesser talent.
I had a stark demonstration of this when I was still in college. As part of standard computer science coursework, I was working with other students on implementing a toy programming language. Our instructor asked us to use a modular architecture that subdivided its compiler20 into multiple interacting programs, each assigned to a sub-team. After preparatory meetings and some development, we all met to check that all parts worked together. One of them seemed to work on the simplest programs, but it failed at the first hint of nesting — a for loop inside another for loop, for instance. The student responsible for it promised to fix it and meet us again in a couple of days. At the following meeting, his program failed again, this time on two levels of nesting. This sequence of events repeated itself a few times, with his program producing errors each time on slightly more complicated examples. Concerned about the fate of our project, we asked if we could review his source code. We found that he treated the no-, single- and double-nesting cases separately. He had written a large amount of code to cover all the examples we had provided. He had yet to use recursion21 to deal with unlimited nesting, as required by the definition of the language. It looked like he thought that, in practice, the programs to be compiled wouldn’t be too complicated and that his approach would be enough. The result was a program that had to grow in size proportionally to the complexity of the examples it was required to handle. In this respect, it had something in common with applications whose complexity is proportional to the number of screens to display; or machine learning models, whose complexity grows with the size of the training set (for more on this, see Sec. 2.1.4). Only much later in my career, I realized that my teammate22 was anticipating the future of the software industry:
- he was pioneering test-driven development;
- he focused on his customers’ requests, with his team-mates being the customers;
- he maximized customer value — higher levels of nesting are used less often;
- he met his self-imposed rapid-fire deadlines;
- he wrote code quickly.
One obstacle to achieving the level of generality of a sorting
function or a compiler is that not all abstractions emerge with the same
level of clarity. For example, sometimes we create a
do_stuff
function that we reuse somewhere else before we
understand what “stuff” exactly means. One consequence is that it keeps
changing as we modify code elsewhere. It may also be harder to test in
isolation from the contexts in which it runs. These are the telltale
signs of a poor abstraction, but they are seldom perfect from the start.
Sometimes the correct generalization emerges over time after seeing it
applied in different contexts, akin to interpolation in mathematics.
Marz writes [46]:
I view the development of beautiful abstractions as similar to statistical regression: you have a set of points on a graph (your use cases) and you’re looking for the simplest curve that fits those points (a set of abstractions).
The alternative, letting repetitive code proliferate and fixing it at a later time, has its drawbacks, as we have seen.
Another problem is that the connection between abstraction and conciseness isn’t considered fundamental. We have seen how Dijkstra thought abstract code reduced the intellectual effort required per line written, not the number of lines in a program. Furthermore, he didn’t look kindly at attempts to solve problems with a single line of code (see “one-liners” in Sec. 1.5). In my training, different abstraction techniques were touted for their potential to reduce coupling and not for any conciseness advantage. Without a measurable goal, the best level of abstraction becomes a matter of opinion. However, there is some agreement that it can be taken too far (see the LinkedIn example in Sec. 1.2.9.4), for instance, creating hierarchies of classes23 where most classes are never instantiated.
A few pay attention to the conciseness potential of specific programming techniques. For instance, software engineer and blogger Nikolay Grozev wrote an introduction to monads, the advanced programming technique inspired by category theory [47]. First, he develops an example that he deems “far from perfect, as we have repeated ‘glue code.’ ” Later, after applying monads, he observes:
This [using monads] avoids the shortcomings of our previous approach because the bind function implements all the glue code and we don’t have to repeat it.
The crucial point is ensuring that abstractions can be combined
seamlessly, providing appropriate combinators. Only then can we reap the
full benefits of creating those abstractions. An example could be
modifying a module for arithmetic calculations where a division by 0
terminates a program so that it would instead generate a special error
value, say "err"
. This module features four functions,
add
, sub
, mul
, and
div
, for the four basic arithmetical operations. Since a
division by 0 can only occur during the execution of div
,
we start by modifying that function:
def div(num, den):
    return "err" if den == 0 else num / den
The problem is that the other functions don’t know how to deal with
the value "err"
, so we can’t just write
add(1, div(1, 0)) to calculate 1 + 1/0.
The intended behavior is that any operation involving "err"
also results in the same value. One approach is to check every time we
use div
:
a = div(1, 0)
b = "err" if a == "err" else add(1, a)
This approach is verbose and error prone. Another option is to modify
all four operations to deal with "err"
appropriately.
However, the “monadic” approach is to define a function, traditionally
named bind
, to modify the evaluation of the operations:
def bind(op, x, y):
    return "err" if x == "err" or y == "err" else op(x, y)
This way, we can write complex expressions as follows:
bind(add, 1, bind(div, 1, 0))
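As a quick check of my own, combining the definitions above with add borrowed from the standard library:

from operator import add

print(bind(add, 1, bind(div, 1, 0)))  # "err": the error value propagates
print(bind(add, 1, bind(div, 4, 2)))  # 3.0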
Another example from functional programming24 circles relates to higher-order functions, which accept as arguments or return other functions, and are available in many popular languages but aren’t used widely in practice. Software engineer and blogger Gio Carlo Cielo guides us through a complete example in which “our code verbosity has been significantly reduced through the use of the higher-order functions” [48]. A minimal illustration of their use could be applying a function to each element in a list. Without higher-order functions, this is what we would write in Python:
def f(x):
    # some definition for f
    ...

result = []
for x in range(10):
    result.append(f(x))
With the standard (higher-order) function map
, the same
example becomes:
map(f, range(10))
This version is simultaneously more concise and more abstract. We obtained a more concise version by hiding details like creating an empty list and appending elements to it. Python offers an alternate form for this task, list comprehensions, using which our example becomes:
[f(x) for x in range(10)]
This code doesn’t eliminate the loop variable x
and is
limited in its generality. For instance, there is no list comprehension
for the fold
or reduce
higher-order function
typical of functional programming. The likely reason is that it is the
element of functional programming that Python’s author Guido van Rossum
has “always hated the most” and for which he needs to “grab pen and
paper to diagram what’s actually being fed into that function before I
understand what the reduce() is supposed to do” [49]. However, this issue is
wider-reaching than reduce
itself and van Rossum’s foibles:
it is whether languages should provide the most common abstractions or
the tools to build them. Higher-order functions are in the second group.
For instance, extending map
to a data structure such as a
tree is straightforward, as we will see shortly.
Let’s start with some definitions. A tree is made out of nodes. A relation exists among them such that some nodes are parents to other nodes, called children. An ancestor to a node is either parent to that node or an ancestor to its parent. An ancestor is connected to its descendant nodes through a unique path. Finally, there must be a node that is an ancestor to all other nodes, called root (see Fig. 3).
Figure 3: A tree.
In Python, let’s represent a node with the list of its children and a
tree with its root node. For instance, [[],[]]
represents a
tree with three nodes, one the parent of the other two. Since trees are
most useful when holding some data on each node, let’s say the first
element in the list isn’t a child but such data, for instance
[1,[2],[3]]
, where the node holding 1
is the
root, which is parent to the other two nodes, containing 2
and 3
respectively. With these definitions and
representation conventions, we can define a map
for trees
as follows:
def map_tree(T, f):
    return [f(T[0])] + list(map(partial(map_tree, f=f), T[1:])) if T else []
With two lines25 of code, we have extended the
concept of a map
function to trees.26
There is no such possibility for list comprehensions. They are a useful
pre-defined abstraction but aren’t amenable to being extended to other
data structures.
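As a quick check of my own (with the import that the book’s convention otherwise omits):

from functools import partial  # needed by map_tree above

print(map_tree([1, [2], [3]], lambda x: x * 10))  # [10, [20], [30]]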
There isn’t anything special about functional techniques other than that they are powerful techniques for abstraction and, therefore, also for conciseness. Moreover, I wouldn’t want to generate the impression that abstraction is the result of applying sophisticated techniques or adopting esoteric languages, whereas, for the rest of us, there is copy-paste-modify. Sharing a complex method between two classes by introducing a common parent class also helps conciseness. As we have seen above, Hoare found procedures already fit for this task.
1.2.5 Rewrites
A rewrite is a controversial and radical step in software development: take what you have built, throw it away and start from scratch. Rewrites come in different sizes, from those targeting a complete application to those affecting just a few functions. They have been described as company-destroying mistakes [50] or a normal phase of development [5]. Opinion dominates, and I don’t have much to add to this debate but to consider what it means for conciseness. Rewriting some portion of a program isn’t guaranteed to result in less code. Famously, Netscape discarded its Navigator code base for various reasons, one being its size [51]. A prototype released to showcase the new rendering engine Gecko was about a tenth of the size of contemporary browsers and could fit on a single floppy disk — 1.44 MB. It had swollen to 17 MB by the time the 1.0 version was released, negating any size advantages.
However, it doesn’t have to end that way. Instead, we will see how switching to a higher level language or one better matched with the application can result in radical code savings, for instance, in a 60-KLOC rewrite performed at LinkedIn (see Sec. 1.2.9.4). In that case, the rewrite was management initiated. In other ones, rewrites start from the grassroots. For example, Apple developer Bill Atkinson used to report a negative number of lines contributed in a weekly form that he had to submit, including a “-2000” in the week in which he had rewritten some routines in QuickDraw to be six times faster. Eventually, he wasn’t asked to fill the form anymore [52].
Another example involves Bret Taylor, co-inventor of Google Maps [53]:
[Taylor] was frustrated that the JavaScript was really slow. It would take forever to load. So, one weekend he said, “I’m going to rewrite it.” And he rewrote the entire thing, making it 10 times as fast and a third of the size. He completely threw away the code that a team of engineers had been working on for months.
Unfortunately, we don’t know if the same result could have been achieved incrementally or by giving the right incentives to the original team.
An additional example is a cryptocurrency named XLM, initially based on a fork of a pre-existing one, XRP. Its developers had doubts about the security of XRP and decided there were better options, eventually opting for the Stellar Consensus Protocol. However, they did not merge it: instead, they “created an entirely new codebase designed with safety, minimalism, scalability, interoperability, and multiple implementations in mind,” elevating minimalism to a top-level concern [54].
In some instances, the impetus for a rewrite comes from a shift in hardware technology that makes an existing implementation obsolete. For example, the trend towards using GPUs27 for numerical computation prompted a complete rewrite for an application named Weather Research and Forecasting (WRF; [55]):
WRF is well over half a million lines of code, mostly rooted in Fortran. This new model is based on the same ideas but with less code. Still, the effort to get this up and running on Power9/Volta GPU systems took two and a half years.
We will see more of this as heterogeneous computing28 spreads to all types of devices, from supercomputers to smartphones, prompting partial or complete rewrites [56].
dogpile.cache
is a generic API29
that can use different caching30 backends. It is a
rewrite of a similar product, Beaker, by the developer of both, Mike
Bayer, who explains [57]:
All the ideas of Beaker which “work” are re-implemented in dogpile.cache in a more efficient and succinct manner, and all the cruft […] relegated to the trash heap.
Here we have an example of a rewrite that doesn’t involve management taking a new direction (see the LinkedIn JavaScript rewrite in Sec. 1.2.9.4) or a programming superstar showing his mettle (Taylor, see above). Instead, we witness a developer realizing that an incremental path from his existing code to where he wanted to be was less practicable than a fresh start. It is also an example where succinctness rises to top-level concern instead of being a happy accident.
1.2.6 Algorithms
Algorithms are generally small enough that their code length isn’t an issue, instead relevant for libraries, applications, and systems. With algorithms, we are more interested in their correctness and efficiency. However, it doesn’t mean keeping the line count low isn’t a nice extra. For example, Daniel Lemire, inventor of the xor filter, a replacement for bloom filters31 and other related algorithms, asks [58]:
Could we do even better [on speed and memory usage] while limiting the code to something you can hold in your head?
His xor filter reportedly marks an improvement on traditional measures of algorithm efficiency (speed and memory usage) while also keeping it relatively short:
The complete implementation in Go fits in less than 300 lines and I did not even try to be short. Indeed, any semi-competent Go coder can make it fit within 200 lines.
While I doubt I can memorize 300 LOCs, it is 50 lines shorter [59] than a bloom filter implementation by the same group [60].
Moving onto a more run-of-the-mill Quicksort, a mainstream sorting algorithm, there are different ways to implement it. Angel Sola Orbaiceta, a software engineer and book author, experiments with object-oriented and functional paradigms, using Java and Clojure [61]. He finds out that “functional code is much more readable and compact” for the same task and believes “the more code there is, more time and effort has to be put in maintaining the codebase and more bugs tend to appear.” Specifically, skipping empty lines or those with just a bracket or parenthesis, he ends up with 27 lines for Java and 9 for Clojure.
I tried to replicate his experiment in Python. Quicksort is an application of the divide et impera approach:
- pick some threshold or pivot;
- partition the values to sort based on that threshold;
- sort the two splits separately;
- concatenate them.
When a split is empty or only has one element, return it as is. Some
subtleties in how the partition is performed create the opportunity for
some variants. Namely, the Quicksort function (abbreviated to
qs
in the following code blocks) should never invoke itself
on its own input, or the algorithm won’t terminate (and run out of
memory in practice). We can achieve this by excluding the pivot or all
elements equal to the pivot from further sorting. Another choice is
either to sort in place — moving entries around in a single array — or
to create new arrays as needed: the former is more memory-efficient, the
latter quite a bit simpler.
The following version performs the sorting in place:
def qs(xx, lo = 0, hi = None):
    hi = len(xx) - 1 if hi is None else hi
    if (hi <= lo) or lo < 0:
        return
    p = partition(xx, lo, hi)
    qs(xx, lo, p - 1)
    qs(xx, p + 1, hi)

def partition(xx, lo, hi):
    pivot = xx[hi]
    i = lo - 1
    for j in range(lo, hi):
        if xx[j] <= pivot:
            i = i + 1
            xx[i], xx[j] = xx[j], xx[i]
    i = i + 1
    xx[i], xx[hi] = xx[hi], xx[i]
    return i
It takes 17 lines, which compares favorably with the 27 lines of Java in Orbaiceta’s blog post. However, on closer inspection, his version isn’t in-place. Removing that constraint, we can produce this version:
def qs(xx):
    if len(xx) <= 1:
        return xx
    return (
        qs([x for x in xx if x < (pivot := xx[0])])
        + [x for x in xx if x == pivot]
        + qs([x for x in xx if x > pivot])
    )
This code is just seven lines, ignoring the line with just a parenthesis, and it is also straightforward. There is no index accounting to master, no swapping of elements, just plain and simple divide et impera. However, the ascetic programmer can't help but notice a certain repetitiveness in the three list comprehensions that perform the partition. The only element that changes is the comparison operator used. I looked unsuccessfully for a library function combining these steps; defining our own would take too many lines. However, we can still factor out the filtering operation:
fo = lambda op, xx: [x for x in xx if op(x, xx[0])]
where op is one of <, >, or == (the operators used in the comprehensions above). We can't pass these operators around as parameters in their most familiar form, but there are equivalent functions in the standard package Operator: lt, gt, and eq, respectively. With this parametrization, we can write qs as follows:
qs = lambda xx: xx if len(xx) <= 1 else qs(fo(lt, xx)) + fo(eq, xx) + qs(fo(gt, xx))
I used a short name for fo
to fit qs
onto
one line (formatted with a popular tool, Black, which uses an
88-character bound). For the same reason, I used lambda
definitions instead of def
. Those tricks are possible
because line count is a blunt instrument. If we used token count, a
measure of program complexity insensitive to the vagaries of formatting
or naming (see the next section), those tricks would not pay off.
Indeed, it is easy to perform a token count with the help of package Ast
(for abstract syntax tree, a data structure used to compile
programs):
import ast  # standard library module for abstract syntax trees

def token_count(source):
    return len(list(ast.walk(ast.parse(source, "source", "exec"))))
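As a usage sketch, we can feed token_count any definition passed as a source string; here, the lambda-based qs from above:

one_liner = (
    "qs = lambda xx: xx if len(xx) <= 1"
    " else qs(fo(lt, xx)) + fo(eq, xx) + qs(fo(gt, xx))"
)
print(token_count(one_liner))  # prints the number of AST nodes in this definition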
According to this function, the in-place Quicksort contains 200
tokens, whereas the shorter versions are 75 and 74 tokens long,
respectively. Remarkably, the two-line version wins by just one token,
owing to its “density.” It is natural to ask if we can go any lower than
this. It is possible, indeed, by eight tokens. To achieve this feat, we
need to define the partitioning filter as a function of one instead of
two arguments. This change requires its definition to be inside the
scope of xx
:
def qs(xx):
    if len(xx) <= 1:
        return xx
    filter_xx = lambda op: [x for x in xx if op(x, xx[0])]
    return qs(filter_xx(lt)) + filter_xx(eq) + qs(filter_xx(gt))
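For completeness, this is how one might run that last version, assuming the definition above; lt, eq, and gt come from the standard library:

from operator import lt, eq, gt  # the comparison functions passed to filter_xx

print(qs([3, 1, 4, 1, 5]))  # [1, 1, 3, 4, 5]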
It makes sense to limit the arguments of a function to those that
change within a scope of interest: what else could we be filtering other
than xx
? What may look like a silly search to shave off a
few additional tokens — a pursuit that Dijkstra thought to lack “any
conceptual relevance” (see Sec. 1.5) — spurred me to tighten or
specialize the filtering abstraction to the needed level of generality,
and nothing more than that. There may be a performance penalty for
defining a function inside a recursive call, certainly in Python, but we
are not trying to optimize performance in this book, and occasionally we
aim for the opposite (see the Busy Beaver numbers in Sec. 1.2.11.1).
Code length is also used to evaluate the effort needed to prove the correctness of algorithms (see Sec. 2.2). For instance, Martin Kleppmann, author of Designing Data-Intensive Applications, implemented, in 60 LOCs, a distributed data structure to support group editing, but the proof of its correctness took 800 lines of math, a proof-to-code ratio of over 13:1 [62]. Kleppmann maintains it was worthwhile, not only to raise confidence in the correctness of a less-than-intuitive algorithm but also to increase its understanding. Nonetheless, the large size of the proof highlights an obstacle to applying formal methods in software engineering. Another barrier is that proofs can only show that an algorithm satisfies its requirements; whether those requirements reflect the real world remains a separate question. Even so, formal methods are valuable in high-stakes applications such as the control software for the French railway network [12] or the S3 storage system [63].
1.2.7 Notation and Formatting
If recognizing and implementing abstractions is considered good programming, abbreviating identifiers or skipping white space is generally frowned upon. Identifiers, as we were taught in school, should be self-explanatory; white space should be used liberally to impart visual structure to the source code.
An exchange from a Hacker News forum shows that at least some developers, including database architect Andres Freund, challenge the orthodoxy [64]:
Usage of acronyms is one of the worst offenders in bad code.

Not really on board with that … It's a balance. Brevity does have its value too.
The trade-off is memorizing the abbreviations in exchange for dealing with more compact and readable expressions. If a long variable name works as documentation, is having to read it every time you run into a variable a good idea? Whether or not a program uses acronyms, identifiers’ definitions and related documentation must be easily accessible, usually through features of an IDE.32 When a program deals with complex concepts, the idea that even long identifiers can be self-explanatory is unrealistic.
We should look at mathematics as a model. Variables are generally single letters. Mathematicians and physicists develop specialized notations to make expressions more compact. About Einstein’s general relativity, the biographer Walter Isaacson writes [65]:
Einstein’s final equations used the condensed notations of tensors to compress sprawling complexities into squiggly symbols and subscripts.
Clearly, “sprawling complexities” aren’t conducive to readability!
Charles Babbage, inventor of the first mechanical calculator, seemed to agree (as quoted in Kenneth Iverson’s Turing Award lecture [66]):
The quantity of meaning compressed into a small space by algebraic signs is another circumstance that facilitates the reasonings we are accustomed to carry on by their aid.
Even newlines should not get a free pass. In a thread on GitHub, a discussion developed between supporters of a format with shorter lines that was allegedly more friendly towards text-oriented diff tools33 and a more compact one. User tweakimp writes [67]:
I think you can call it [the choice of format] more than preference because if there is no other difference, less [sic] lines is [sic] better.

In hugovks [sic] example the second version is shorter, easier to read and should be preferred in my opinion.
Optimizing code to be read by primitive comparison tools sounds like a bad idea, independent of conciseness. In the seminal book Structure and Interpretation of Computer Programs (SICP [68]), we read that "programs must be written for people to read, and only incidentally for machines to execute." Even more so, we shouldn't be writing them to ease the work of primitive diff tools; we should switch to better ones, like those that analyze the structure of programs.
1.2.8 Documentation and Comments
When it comes to documentation, the more documentation, the better, correct? Actually, who enjoys reading it? Documentation is a cost like code is. It needs to be written and maintained, developers must read it, and it silently becomes outdated.
Michele Simionato, author of several Python packages, reported this about halving the size of the documentation for one of them [69]:
[dropping Python 2] is a huge bonus, since I could remove over 2,000 lines of duplicated documentation/doctests. Having to maintain separate docs for Python 2 and Python 3 effectively stopped any development on the module for several years.
What may be hard to understand with large teams becomes evident for a one-person volunteer project.
Can we eliminate documentation? Joshua Bloch, known for his work on Java, thinks so. About API design, he writes [70]:
APIs should be self-documenting: It should rarely require documentation to read code written to a good API. Indeed, it should rarely require documentation to write it.

Early drafts of APIs should be short, typically one page with class and method signatures and one-line descriptions. This makes it easy to restructure the API when you don't get it right the first time.
It may look like a lofty goal, but Bloch was likely thinking about
some of his work on the Java standard library. Building on fundamental
concepts in software engineering, which all developers should be
familiar with, alleviates the need for documentation, and simple and
consistent API design should complete the task of minimizing the need
for documentation. Does the Java standard library meet these lofty
goals? For instance, in module java.util.Arrays
we find
entries such as [71]:
static int binarySearch(byte[] a, byte key)
static byte[] copyOf(byte[] original, int newLength)
static byte[] copyOfRange(byte[] original, int from, int to)
Their behavior should be intuitive. However, why does
copyOf
need a newLength
argument? Shouldn’t a
copy be length-preserving? And what happens if newLength
is
larger or smaller than the original? Luckily, the API reference provides
a simple one-line explanation that copyOf
is more than a
copy: it will happily stop after copying only newLength
elements, or continue, padding the new array with 0s until
newLength
has been reached. The padding value isn’t
configurable and depends on the type of the original
array
— copyOf
is overloaded34
for different types. Scanning the different forms of copyOf
I found
static <T> T[] copyOf(T[] original, int newLength),
an entry which uses Java generics to parametrize its parameter and
return types, for which the padding value is null
. So what
if T
equals byte
? Will copyOf
pad
its result with zeros or ‘nulls’? The answer is that Java generics
cannot be instantiated with primitive types, and therefore
T
can never be instantiated with byte
[72]. Thus the apparent ambiguity is
resolved. It gets more confusing if we consider this additional form of
copyOf
:
static <T,U> T[] copyOf(U[] original, int newLength,
Class<? extends T[]> newType)
which represents a copy with conversion from type U
to
type T
. What happens if you need to perform the same
operation from or to primitive types? I would check with a Java expert
before giving up, but I am afraid you will have to write 49 more
functions, one for each pair of primitive types.35
This limitation stems from the Java generics design, like the quantity
of almost identical copyOf
entries for each primitive type,
and the API developer must adapt (see java.util.Arrays
in
Sec. 1.5). However, the
documentation leaves the user helpless here, suggesting a consistency in
the library that is only aspirational.
In a thread about the “best code bases,” Hacker News user taspeotis has this to say about a code base touted for its line-by-line comments:36 “Every line? Oh dear” [73]. He and other users follow up by sampling the uninformative comments, which add nothing to what is already obvious from the code:
/* Set verbose flag */
verbose++;
...
/* Set advanced flag */
advanced = 1;
User teknopaul comments:
IMHO Coding guides should ban “SBO comments” Statin [sic] the Bleedin [sic] Obvious. Let devs [sic] decide what SBO is, usually its [sic] bleedin [sic] obvious.
Which sounds agreeable, but for the circular definition of “bleedin [sic] obvious.”
Martin Fowler, author of popular software engineering books, doesn’t like comments [74]:
Comments are not bad — but there are often better options. I always try to write code that doesn’t need comments …
One such option is assert
statements, which specify
properties to be satisfied by a program when executing at a certain
position in the code. While they are yet more code to write, understand
and maintain, they can express simple properties that aren’t obvious
from reading the rest of the code. These properties are checked at
run-time, helping in the early detection of bugs, near their origin,
rather than later as consequences snowball. However, even without
running any program, they work “as documentation for other developers
reading the code, who see the assert and can confidently say that its
condition holds” [75]. For
instance, we can express the property that a list is sorted with:
is_sorted = lambda xx: all([x <= y for (x, y) in zip(xx[:-1], xx[1:])])
We can use this in the implementation of the algorithm Quicksort (see Sec. 1.2.6) to check and document that each call has to return a sorted list:
def qs(xx):
    if len(xx) <= 1:
        return xx
    filter_op = lambda op: list(filter(lambda x: op(x, xx[0]), xx))
    yy = qs(filter_op(lt)) + filter_op(eq) + qs(filter_op(gt))
    assert is_sorted(yy)
    return yy
Another option is randomized or property-based tests, which express, in executable form, required properties of inputs and outputs of a function. The idea is that, instead of specifying a set of inputs with corresponding desired outputs for a function, we can define a process to generate inputs and properties that the input-output pair must verify. For instance, when testing a matrix multiplication function, we would pick a matrix generator to provide inputs. Tests could verify properties such as associativity, distributivity, and having the correct dimensions. Additionally, they could check the results of operations involving identity and null matrices. This approach originated in the Haskell language community and was implemented in the package QuickCheck, where it relies on the type system to provide useful default generators even for user-defined types [76]. It has been imitated in many languages to different degrees of fidelity. In Python, the package Hypothesis supports property-based testing [77]. To continue with our Quicksort example, a test that the output is sorted would look as follows:
@given(xx=lists(integers()))
def test_is_sorted(xx):
    assert is_sorted(qs(xx))
In this test, the first line is a decorator that specifies what
generator to use for the argument to the test function, one that
produces lists of integers. Writing a generator is also a form of
documentation because it tells us that qs
accepts lists of
integers. qs is more general than that, which should prompt us to write generators like one_of(lists(integers()), lists(floats()), lists(text())), which will produce lists of integers, floats, or strings (not mixed in the same list). The second line contains the name and signature of the test
function, following the convention used by the package Pytest. The body
of the test asserts that the result of applying qs
to its
argument is sorted. This condition is necessary but not sufficient for
qs
to be correct, as it could always return the same sorted
list. We also need to ensure its output is a permutation of its input,
that is, that it contains the same number of occurrences of each
element, a task suitable for the data structure Counter
from the library Collections:
@given(xx=lists(integers()))
def test_is_permutation(xx):
    assert Counter(qs(xx)) == Counter(xx)
These two property tests are equivalent to a user-selectable number
of conventional ones because a property test can repeatedly run on newly
generated inputs. They can replace hundreds or thousands of lines of
tests written one by one, provide a high level of confidence in the
implementation, and a readable specification for qs
(for
theoretical work showing how much testing is sufficient, see Sec. 1.2.11.1). This
flexibility also mitigates the issue of slow-running tests because, by
changing a configuration parameter, we can run a more superficial but
quick battery of tests or a much slower but more thorough one. Indeed I
applied this two-speed strategy in developing the Rmr2 package [78], using Quickcheck for R [79].
While adding new features, I ran a quick set of tests until most bugs
were fixed. Then, as we approached a release, I increased the number and
size of tests to build more confidence in the correctness of the
package. You may be more of a believer in quality over quantity in the
writing of tests. In my experience, handwritten tests have several
limitations. First, they are often trivial because the correct output
has to be calculated by hand. When written in response to bug reports,
they catch problems that have already been discovered. Finally, they
sometimes share the same faulty assumptions as the code to be tested
and, therefore, won’t probe it outside of those.
All these elements make property-based testing valuable, but there are a couple of catches. First, a developer must state properties that a function's inputs and outputs must satisfy, and implement them. This task is easy for mathematical objects like matrices or well-defined algorithms like those for sorting and searching, but less so in other domains. For example, how would one approach testing that a graphical routine draws a circle? Another problem is that this approach relies on referential transparency, a complicated way to say that computer functions behave like mathematical functions, that is, their outputs depend only on their inputs. For example, if a function has access to a database, its return value may depend on its contents. In this case there are workarounds, like starting with an empty test database and testing read and write functions in combination, but the code to be tested has to be amenable to this technique. Because of these limitations, we should be wary of emulating Dijkstra's maximalism — the view that all software should be proven correct like a theorem (see testing in Sec. 1.2.11.1). Like formal methods, property testing has a domain of applicability, and within it, it is a great asset.
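As a sketch of the database workaround just mentioned, with a plain dictionary standing in for the database (db_write and db_read are hypothetical functions under test):

from hypothesis import given
from hypothesis.strategies import integers, text

def db_write(db, key, value):  # hypothetical function under test
    db[key] = value

def db_read(db, key):          # hypothetical function under test
    return db.get(key)

@given(key=text(), value=integers())
def test_write_then_read(key, value):
    db = {}                    # start from an empty test database
    db_write(db, key, value)
    assert db_read(db, key) == value

if __name__ == "__main__":
    test_write_then_read()     # a @given-decorated test can also be run directly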
Another alternative to traditional documentation is doctests, generally written using literate programming — a technique to interleave text and code so that the latter can be executed [80]. If a chunk of code fails, the documentation related to it may need updating. One instance of this is the "Examples" section from the documentation of package Altair_recipes, which includes both the test suite and the output for each test, consisting of statistical graphics [81]. Maintaining such an extensive set of examples covering the whole library and a comprehensive test suite would have been daunting for a one-person, spare-time project. However, since they are one and the same, the cost is cut in half.
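For readers unfamiliar with the mechanism, here is a minimal doctest sketch of my own (not taken from Altair_recipes): the example in the docstring is both documentation and a test that fails if the output drifts.

def double(x):
    """Return twice the argument.

    >>> double(21)
    42
    """
    return 2 * x

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # re-runs the example above and compares the output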
1.2.9 Languages
Paul Graham, Lisp hacker turned venture capitalist, wrote an essay devoted to conciseness, “Succinctness is power.” He sees succinctness as a productivity multiplier and believes it to be related to other positive qualities of code, such as readability. He focuses specifically on programming languages. He makes a characteristically articulate case for it [82]:
[…] clearly succinctness is a large part of what higher-level languages are for. If it is not all they’re for, then what else are they for […]
However, his point of view, while related, is different from the one espoused in this book, as he sees conciseness as a desirable feature in languages but not necessarily in programs:
We should be clear that we are talking about the succinctness of languages, not of individual programs. It is certainly possible for individual programs to be too densely written.
While it is possible to run into cryptic code (for instance, see Sec. 1.2.11.2 and the front cover), we shouldn’t regard general programming as a substantially different activity from designing languages. The boundary between core language and standard library is often blurred regardless, more a matter of packaging and versioning than anything else. In SICP, the authors write [68]:
To appreciate this point [that language interpreters are just programs] is to change our images of ourselves as programmers. We come to see ourselves as designers of languages, rather than only users of languages designed by others.
Therefore it is hard to agree with the point made by Graham that, while higher-level languages exist primarily for the sake of program conciseness, this should not be pursued when using these languages. He considers moderate succinctness in programs acceptable but not a goal.
Any non-trivial program should define the concepts relevant to the task on hand and implement data structures and operations to represent and manipulate these concepts, and this extends a language. However, in many languages, some extensions aren’t possible. To pick an example in Python, let’s assume we need to increment a group of variables by one, as in the following example:
a += 1
b += 1
c += 1
It is impossible to encapsulate this snippet in a function such as
inc(a, b, c, by = 1)
to better express intent and promote
reuse. The reason is that function arguments in Python are passed by
shallow copy: the value of an integer, a pointer for a list. In the case
of an integer, modifying it inside a function would only change the copy
rather than the original. Despite this, there actually is a way to
implement this function, as shown by Stack Overflow user zwer [83]. Zwer uses a built-in mechanism to
access the call stack, a data structure used by the language run-time to
implement function calls. By doing that, we can access the list of
variables in the calling environment and modify them at will. To my
knowledge, debugging is the only recommended use for these features.
Indeed, zwer’s solution comes with so many caveats and warnings that the
author himself can’t recommend it as a viable one and named it
here_be_dragons
.
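A minimal sketch of the obstacle (not of zwer's workaround): rebinding an integer parameter inside a function only affects the local name, so the caller sees no change.

def inc(a, by=1):
    a += by  # rebinds the local name a; the caller's variable is untouched

a = 0
inc(a)
print(a)  # still 0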
I didn’t select this example to pick on Python: languages are defined by what they allow and what they prevent. Some features are considered “dangerous” because they are believed to lead to the creation of bugs. Limiting side effects is encouraged in functional programming when they are at all possible. In C++, primarily an object-oriented language, parameters that allow side effects must be of reference type. Whether or not we agree with the specific choices made in a language, we need to operate within those limitations. Hence the lofty vision promoted in SICP of programmers as language designers must be tempered with the reality that it has to be achieved within the confines of the language one is using.
Possible ways to push beyond language limits are macros and
reflection. The former enables defining abstractions that generate or
modify code before execution; the latter, gives the ability to operate
on the structures of the language run-time from the program being run,
which is used in here_be_dragons
. Lisp macros are notable
for being expressed in Lisp itself, unlike, for instance, C macros,
which use a dedicated macro language. This is natural in Lisp, since, in
that language, expressions are also data structures. Every interpreter
or compiler for any language needs to build a representation of the
program which can only be a data structure, but in Lisp the relation
between code and representation is particularly simple, as programs are
represented by lists [84]. In an example, developer and
author Adam Tornhill shows how to eliminate code duplication for a
specific web application using a macro: “Now […] we have a concise way
to express web-pages with a uniform look […]” [85]. Not only is this solution concise,
but also conciseness begets consistency because the same code generates
all pages. Unfortunately, he doesn’t explain why macros are necessary in
this case. Graham devotes a chapter in his “On Lisp” to explaining their
domain of application [86]:
Macros can do two things that functions can’t: they can control (or prevent) the evaluation of their arguments, and they are expanded right into the calling context. Any application which requires macros requires, in the end, one or both of these properties.
For instance, being expanded in its calling context, a macro would
allow implementing the inc
function described above.
However, this possibility isn’t unique to macros: pointer and reference
parameters make it possible to modify the calling context in C++, as do
var
arguments in Pascal, but in a much more controlled
fashion. As to the ability to control argument evaluation, for an
alternative mechanism, see Sec. 1.2.9.3.
Besides these specific applications, Graham recommends using macros only “when nothing else will do” [86] because both macros and reflection can make programs harder to understand and hide subtle bugs [87]. Nonetheless, macros have an important role in Lisp. For example, the dominant object-oriented extension to Lisp, the Common Lisp Object System, is implemented with macros, as are several iteration constructs. Graham even shows how to implement a completely different language, Prolog (see Sec. 1.2.9.6), with just macros [86].
Steele thinks that the quest for the ideal, all-powerful language is bound to be frustrated. He says [14]:
[…] it’s really hard to make a language that’s great at everything, in part just because there are only so many concise notations to go around. There’s this Huffman encoding37 problem.
If this observation is correct, it suggests three ideas. Please note
that these are based on an abstract, mathematical view of programs
rather than a practical one. The extent to which they apply in practice
is debatable. First, languages that forbid many constructs preclude them
from expressing valid computations. Consequently, developers will have
to use longer programs to define equivalent computations. An example is
variable declarations and explicit initializations, which many advocate
on a safety basis (see the Mariner I disaster, Sec. 1.5). Second, if two programs
define the same computation, one of the two is, in a sense, wasted: it could have been assigned to a different computation, which must now be expressed by a longer program. It is
reasonable to conclude that a language with many ways to express the
same computation will lead to more verbose programs. For instance, it is
debatable whether a language needs both a map
function,
which applies a function to each element in a collection and returns a
collection of its return values, and a list comprehension:
map(function, a_list)
[function(x) for x in a_list]
Finally, defining more common computations with shorter programs and
accepting some verbosity for the less common ones could be a good
compromise but implies different languages tailored for different
application domains. For instance, a sum of two vectors in R is an
element-by-element sum, even if the two vectors have different lengths,
a problem solved with a technique known as recycling. In
Python, the sum of two lists is their concatenation. In R, the
concatenation function is concisely named c()
, probably
owing to its widespread use. In Python, there is no element-wise sum,
but we can perform it with a relatively complex expression:
list(map(lambda x, y: x + y, x_list, y_list)). There are
vector libraries for Python, such as Numpy, whereby the sum of two
vectors is defined as in R, save for the recycling rules. While we can
accomplish the same results in both, the choices of each language and
library reflect the decisions of their respective developers, which
likely took into account what the most common use cases were expected to
be and reserved the most concise forms for them.
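To make the contrast concrete, a small Python sketch (assuming the Numpy library is installed; the lists are my own toy example):

import numpy as np

x_list, y_list = [1, 2, 3], [10, 20, 30]
print(x_list + y_list)                          # concatenation: [1, 2, 3, 10, 20, 30]
print([x + y for x, y in zip(x_list, y_list)])  # element-wise sum: [11, 22, 33]
print(np.array(x_list) + np.array(y_list))      # element-wise, as in R: [11 22 33]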
1.2.9.1 Lisp
We started talking about Lisp in the previous section because of its macros, but it is notable in other ways. Its inventor, John McCarthy, designed it as a concise alternative to the Turing machine38 [88]:
Another way to show that Lisp was neater than Turing machines was to write a universal Lisp function and show that it is briefer and more comprehensible than the description of a universal Turing machine.39
The idea to turn it into an actual programming language came to a student of McCarthy's, Steve Russell, who wrote its first interpreter in assembly language.40 It wasn't a large, challenging project, unlike the implementation of other languages (both Ada and Algol68 raised concerns about their implementability; see Sec. 1.5). An interpreter for the Lisp in the original paper takes only 53 lines of Common Lisp — a more complex but also practical dialect thereof [89]. This is a remarkable combination: a simpler language should also be less powerful and therefore require more code for the same tasks, including the implementation of its own compiler. However, Lisp sidesteps this trade-off to some degree. The so-called "Computer Language Shootout," a systematic comparison of many languages focused on small, well-defined tasks, corroborates its conciseness [90]. It ranks Common Lisp first among over 50 languages (including multiple implementations of the same language).
Clojure is a modernized cousin of Lisp, particularly focused on supporting concurrency. It hasn’t lost Lisp’s expressivity while getting none of the verbosity of the Java language, with which it shares the target virtual machine.41 Daniel Lebrero compares two programs for the same task, one in each language, and asks [91]:
So given that the inherent complexity of the problem is the same, and that the Clojure version is able to express the solution in 58 lines of code while the Java version require 441 lines of code, and that the Clojure version is easier to understand, what are those extra 383 (87% of the codebase) lines of perfectly idiomatic Java code about?
Lebrero also notices how much easier the Clojure version is to review, and that size is one determining factor, or maybe the determining factor. Sure, concise code could be cryptic, and verbose code trivial, but this isn't the case here:
[…] the Clojure version was way simpler to understand, and […] the fact of having a single file with 58 lines of code was a very important reason for it.
He ascribes the additional LOCs to "incidental complexity," which he thinks is caused by the use of inappropriate tools. He concludes that LOCs are useful not as a measure of productivity but as one of complexity (see Sec. 1.1).
1.2.9.2 Python
Paul Prescod, Python developer and book author, is quoted by Graham as stating that “Python’s goal is regularity and readability, not succinctness” [82]. Graham considers it a “damning thing to claim about a programming language.” However, many others think Python is concise, in practice, if not by design. For example, Nicholas Hahn, a software engineer, compared Python, Go, and Rust. He finds that “Python has the least code by far” but also that “it’s leaning heavily on features of the image library it’s using, but this is indicative of the general experience of using Python” [92]. For instance, in the Python version, the difference between two images requires a single method call, whereas it is done pixel by pixel in the other two languages. This observation doesn’t tell us much about the intrinsic qualities of each language while still being important in practice. Availability of libraries could depend on popularity, age — older languages accumulate more — and ease of development, which in turn depends on tools and resources available, as well as those intrinsic qualities. The “Computer Language Shootout” mentioned in the previous section ranks Python 19th out of 51 languages for conciseness [90].
It is less verbose than C in many cases. For example, the README file for a project by the CMU Cylab contains equivalent examples in several languages, including Python. However, it omits the one in C because “the same program in C will take too much space and will not fit into the README format” [93]. One of the advantages of concise code is that it is easier to discuss and present. It will fit into a few slides or, in this case, a README file. Verbose code is something to skim in long, painful, lonely sessions and to hide.
Python practitioners aren’t short on enthusiasm for their language of
choice and may occasionally overstate the power of certain features. For
example, Mahmoud Hashemi, author of package Boltons, would have us
believe that “decorators are among Python’s most elegant and succinct
language features.” However, all they replace is one assignment:
@decorator
instead of f = decorator(f)
[94].
Hashemi may be referring to the use of higher-order functions,
popularized by decorators, rather than their syntax (see higher order
functions in Sec. 1.2.4).
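A minimal sketch of that equivalence (decorator, greet, and shout are hypothetical names):

def decorator(f):
    def wrapper(*args, **kwargs):
        print("calling", f.__name__)  # the added behavior
        return f(*args, **kwargs)
    return wrapper

@decorator                 # syntactic sugar for the assignment shown below
def greet():
    return "hello"

def shout():
    return "HELLO"
shout = decorator(shout)   # exactly what @decorator does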
1.2.9.3 R
RStudio co-founder Joe Cheng, also known for the R package Shiny, thinks R is more expressive than Python and other languages. Since he is the CTO of a company devoted to supporting this language, we should be aware of his biases. However, he is referring to specific features that are unique to R among popular languages. In an interview, he said (best effort transcription from [95]):
[…] each one of these libraries has a bunch of features and a bunch of syntax that would not be possible certainly with Python, certainly with Java, certainly with C++ […] when you look at the interfaces of these packages users don’t necessarily know what magic we had to do to make this happen, they just know oh it’s really easy it’s really concise when I want to express my plots this way when I express my data manipulations this way […]
The “magic” he refers to includes a macro-like facility that
optionally suspends the regular evaluation of expressions and replaces
it with specialized variants. For instance, we can refer to columns in a
tabular data structure, the data frame, in certain expressions without
constantly repeating the name of the table, as in
mutate(table, a = b + c)
, as opposed to the repetitive
table$a = table$b + table$c
, where a
,
b
and c
are names of columns in
table
. Users love these more concise forms, but having
multiple evaluation rules makes programs harder to understand and
hinders their parametrization and generalization. For instance, if we
wanted to make the above expression general with respect to the names of
the columns involved, we would need to use a different syntax (see
package Dplyr in Sec. 1.2.10.1).
One task that R performs well is operating on vectors. It had better
do, as it doesn’t even provide scalar types, replaced by vectors of
length one. While most languages offer extensions for vectors or other
types of collections, in R, by design, all library functions need to
accept vectors. R developers think in vectors, and the comparatively
modest speed of iterative constructs in R is such that they have to use
vector functions in day-to-day work. That often helps with conciseness.
For instance, Rasmus Bååth takes a simple spelling corrector in 21 lines
of Python and translates it to two lines of R without using any
extensions [96]. It is
two oversized lines, though, and, once automatically reformatted in the
R IDE RStudio, his code swells to 12 lines: more than 2, but still a
great deal shorter than the original. I took the liberty to modify this
code to eliminate one repetition and to use the pipe operator
(%>%
) to reorder the operations. The “classic” style of
function composition in R is f(g(h(a1, a2), a3), a4)
which
many find counter-intuitive because functions execute in a temporal
order opposite to that in which they appear, but also because arguments
to the same function can end up far apart. An alternative is using a
temporary variable:
x = h(a1, a2)
x = g(x, a3)
f(x, a4)
This approach adds some unnecessary steps and doesn’t convey the
rigid organization of a pipeline. Using R’s macro-like facility, we can
define a pipe operator similar to the ones in F# and the Unix shell and
available in package Magrittr, such that the above expression is
equivalent to h(a1, a2) %>% g(a3) %>% f(a4)
[96]. In
other words, the pipe operator turns the expression on its left side
into the first argument to the function call on its right side, pushing
the other arguments to the right. With these modifications, here is the
concise spell corrector in R:42
readLines("http://www.norvig.com/big.txt") %>%
  paste(collapse = " ") %>%
  tolower %>%
  strsplit("[^a-z]+") %>%
  table %>%
  sort(decreasing = TRUE) %>%
  names -> sorted_words
In this step, we build a vector of words by:
- loading a large, presumably spell-checked text available online;
- concatenating all lines;
- turning all characters to lowercase;
- splitting the result on non-alphabetic characters to generate words;
- collecting word counts;
- sorting these words from most to least common;
- and, finally, dropping the counts.
This sorting order will help return the most common among equally close matches for a word. Now onto the actual spell-correcting function:
correct <- function(word) {
  distances = adist(word, sorted_words)
  c(sorted_words[distances <= min(distances, 2)],
    word)[1]
}
It works by computing the edit distance43
between a given word and our list of correct, frequency-sorted words.
This code’s conciseness stems from using vector functions instead of
loops and the vast number of functions considered “core” to R (about
700). Bååth observes that, in particular, the function
adist
, calculating the edit distance between words,
performs much of the difficult work. He adds that R also provides a
function aspell
in the standard library, which would reduce
his program to a single line. Unlike the features referred to by Chang
earlier as “magic,” I don’t think this example highlights any
conciseness advantage intrinsic to the language, save for the use of the
pipe operator. Instead, it shows the benefits of a vectorized
programming style, whereby we can process entire vectors in one step,
and the importance of being familiar with available libraries. We could
likely achieve a similar result in Python, but this programming style
may be less natural to practitioners of that language. It is a fine line
between what a language allows and what it encourages or enforces.
1.2.9.4 JavaScript
While Python is the target of unwavering praise and R of a mix of praise and blame [97], JavaScript is subject to relentless condemnation (see Sec. 1.3.2) while, ironically, being arguably the most widely adopted language. It certainly is so when measured by the number of JavaScript programs simultaneously running since it is the programming language used for web development on the browser side. Because of its ubiquitousness and perceived shortcomings, developers have defined new languages that translate into JavaScript, correcting some of its defects but maintaining a considerable degree of compatibility. One of these, CoffeeScript, is touted for its conciseness. For example, software engineer Simon Grondin implemented “a rate limiter in 31 lines of CoffeeScript” and thinks this is “a demonstration of its exceptional expressiveness” [98]. However, when I revisited this implementation after first becoming aware of it [99], I found that it had grown to 8 times its original size [100] and that finding the core rate limiter logic among the cruft had become difficult. You may wonder whether the initial, concise implementation was more the result of a fresh take on the subject rather than the intrinsic qualities of the language. Eventually, this code seems to have succumbed to the drift toward complexity.
Not so the three-line implementation of partial application,44 also in CoffeeScript, by Smooth CoffeeScript's author E. Hoigaard, which retained its original purity [101]. partial takes a function func of n arguments and returns a function of n − k arguments by providing k values in advance. In this specific implementation, this is accomplished by using the undefined constant as a marker for an argument that has no pre-specified value. For instance, a function rectangle(x, y, h, w), which draws rectangles centered in (x, y) with height h and width w, can be used to define a function that draws only unit squares as follows:
square = partial (rectangle, undefined, undefined, 1, 1)
Hoigaard starts from someone else’s 15-line implementation (see below), bringing it gradually down to its final svelte size, which I further trimmed down to two lines:45
partial = (func, a...) -> (b...) ->
  func (for arg in a then arg ?= b.shift())...
Even without knowing much about CoffeeScript, I could understand this
implementation of partial
after researching a couple of
constructs. The meaning of this code is that partial
is a
function that accepts another function func
plus a variable
number of arguments in vector a
and returns a function of a
variable number of arguments (vector b
). The body of this
function invokes func
with a list of arguments assembled in
a list comprehension equal to a
element by element when it
is defined, otherwise picking the next value from b
. Not
only is it remarkably concise without being cryptic but also it
implements an important feature that other languages lack and can’t add
with a simple extension. So does the 15-line version — here reduced to
14 — but its meaning must be extracted from a jungle of indexes and
tests:46
partial15lines = () ->
  [func, args...] = arguments
  wrapper = () ->
    i = 0
    j = 0
    res_args = []
    while i < args.length
      if args[i] == undefined
        res_args.push arguments[j]
        j++
      else
        res_args.push args[i]
      i++
    return func.apply null, res_args
Converted to JavaScript, the first step to execute CoffeeScript code, the two-line version expands to 21 LOCs, but an automated conversion may be more verbose than a skilled programmer’s.
In its latest version, ES6, JavaScript has evolved to look and work
more like CoffeeScript. However, even the unimproved JavaScript, warts
and all, seems to offer the potential for significant code reductions.
Kiran Prasad, a software executive, recalls how LinkedIn switched from
Ruby to server-side JavaScript (node.js
) to meet their
scalability goals for some mobile products [102]. In the process, their code base
shrunk from 60 KLOCs to 1-2 KLOCs, a reduction greater than 30-fold that
wasn’t even a goal for this transition. He explains this remarkable
achievement in three parts: the node.js
version is
“framework-free,” therefore free of framework-related cruft (I assume he
means web framework, see Sec. 1.2.10.3); they switched to a more
functional style from an object-oriented one, which had resulted in a
lot of unneeded abstraction layers; and they moved rendering47 client-side48
from server-side, which relocated, rather than eliminated, some of the
code. With so many concurrent changes, it is impossible to tell which
ones were instrumental to the code savings. For me the most important
lesson is that a 30-fold reduction in a codebase, which presumably freed
29 out of 30 engineers from maintenance work, was achieved as a side
effect of the quest for scalability rather than being an explicit goal.
Had LinkedIn selected a different approach, they might still be saddled
with a 60-KLOC code base.
1.2.9.5 Erlang
Maybe not so if they went with Erlang. In a video presentation, its
creator, Joe Armstrong, shows how a couple of pages of C translate to
one line of Erlang (near 1:20 [103]). Unfortunately, this video is too
fuzzy to read the C code. Later, a co-worker cites an internal study at
telecommunication firm Ericsson that estimates the ratio between C and
Erlang code for the same tasks as 7-fold. A factor to consider is that,
just as node.js
(see previous section) supports
event-driven programming, Erlang includes specialized features to build
reliable systems. In particular, it aims to solve the problem of
creating them despite the reality of software errors by supporting a
system’s ability to recover. Erlang achieves this goal by providing the
developer with lightweight processes that don’t share resources and
communicate via asynchronous message passing. They are organized in
hierarchies, whereby those higher up supervise those below. Rather than
trying to code defensively a single process to keep it going at all
costs, developers focus on recovering when something goes wrong. The
goal isn’t eliminating crashes but making them ordinary, recoverable
events. It is easier to understand this concept in the context of the
original target application: software for telephone switches. The goal,
in that case, is to keep a large number of conversations going, a task
that naturally lends itself to concurrency. Other applications written
in Erlang are the messaging library RabbitMQ and the messaging app
WhatsApp. However, its noteworthy features aren’t limited to its
concurrency and fault tolerance models. It is also a functional language
with pattern matching and many other modern features. Armstrong wrote a
simple side-by-side comparison of Java and Erlang programs computing the
area of geometric shapes [104]. This is the Erlang version:
area({rectangle, Width, Ht}) -> Width * Ht;
area({square, X}) -> X * X;
area({circle, R}) -> 3.14159 * R * R.
And this is the Java version:
abstract class Shape {
    abstract double area();
}
class Circle extends Shape {
    final double radius;
    Circle(double radius) { this.radius = radius; }
    double area() { return Math.PI * radius * radius; }
}
class Rectangle extends Shape {
    final double ht;
    final double width;
    Rectangle(double width, double height) {
        this.ht = height;
        this.width = width;
    }
    double area() { return width * ht; }
}
class Square extends Shape {
    final double side;
    Square(double side) {
        this.side = side;
    }
    double area() { return side * side; }
}
Skipping empty or single-character lines, we have a 3:17 ratio in favor of the Erlang program. However, despite its conciseness, more than a million lines of Erlang were necessary for its first production deployment, the ATM switch AXD 301 [105].
1.2.9.6 Prolog
Armstrong supported conciseness in contexts other than his programming language. In another interview, this is what he said about Prolog [14]:
[Prolog is] not widely used. And it’s a great shame because programs are incredibly short.
While it is possible to write all types of applications in Prolog (Armstrong used it for the first version of the Erlang interpreter [106]), that is uncommon. Prolog is a logic-based language used primarily in AI, specifically the flavor known as symbolic AI. Programs consist of facts and rules that allow inferring more facts. A classic example is the implementation of a family tree. For instance, let’s consider the ancestry of Charles II of Spain going back to Philip I and Joanna of Castile in Fig. 4.
Figure 4: A portion of the ancestry of Charles II of Spain.49
We can implement this tree and some simple searches in a few lines of Prolog. First, we need to enter what would be considered data in other languages, that is, the parent-offspring relation by listing Prolog facts:
parent("Philip I of Castile", "Charles V Holy Roman Emperor").
parent("Joanna of Castile", "Charles V Holy Roman Emperor").
parent("Philip I of Castile", "Ferdinand I Holy Roman Emperor").
parent("Joanna of Castile", "Ferdinand I Holy Roman Emperor").
/* more parent-child pairs as shown in the graph */
Then we have a set of rules that allow to derive new facts from existing ones:
offspring(X,Y):- parent(Y, X).
ancestor(X, Y):- parent(X, Y).
ancestor(X, Y):- parent(X, Z), ancestor(Z, Y).
descendant(X, Y):- ancestor(Y, X).
These four lines define three relations:
- offspring is the inverse of parent;
- ancestor contains pairs (X, Y) for which either X is a parent to Y or there exists a Z such that X is a parent to Z and Z is an ancestor to Y;
- descendant is the inverse of ancestor.
There is nothing wrong with having two rules for ancestor: it is a way to express disjunction, and either one applied successfully is sufficient to establish that a specific pair is an ancestor-descendant pair. With just a handful of LOCs, we can already ask questions like "Who were the ancestors of Anna of Austria?":
?- findall(X, ancestor(X, "Anna of Austria"), XX).
XX = ["Ferdinand I Holy Roman Emperor", "Anna of Bohemia and Hungary", "Philip I of Castile", "Joanna of Castile"].
If we want to see how a person is an ancestor of another, and not just whether that person is, we need only two more rules:
path(X, Y, []):- parent(X, Y).
path(X, Y, [Z|P]):- parent(X, Z), path(Z, Y, P).
which mimic the definition of ancestor
but add a list
variable P
to accumulate the ancestry path.
[Z|P]
is the list obtained by prepending Z
to
P
. We can paraphrase the second rule as “There is a path
consisting of Z
followed by P
between
X
and Y
if there is a Z
that has
X
as a parent and there is a path P between Z
and Y
”.
?- path("Philip I of Castile", "Charles II of Spain", XX).
XX = ["Charles V Holy Roman Emperor", "Philip II of Spain", "Philip III of Spain", "Maria Anna of Spain", "Mariana of Austria"] .
We can combine path
with findall
to ask the
interpreter and provide not one but all paths connecting two members of
the dynasty (the output is truncated):
?- findall(XX, path("Philip I of Castile", "Charles II of Spain", XX), YY).
YY = [["Charles V Holy Roman Emperor", "Philip II of Spain", "Philip III of Spain", "Maria Anna of Spain", "Mariana of Austria"],
["Charles V Holy Roman Emperor", "Philip II of Spain", "Philip III of Spain", "Philip IV of Spain"], ["Charles V Holy Roman Emperor", "Maria of Spain", "Anna of Austria '49", "Philip III of Spain", "Maria Anna of Spain", "Mariana of Austria"],
["Charles V Holy Roman Emperor", "Maria of Spain", "Anna of Austria '49", "Philip III of Spain", "Philip IV of Spain"],
["Ferdinand I Holy Roman Emperor", "Maximilian II Holy Roman Emperor", "Anna of Austria '49", "Philip III of Spain"|...],
["Ferdinand I Holy Roman Emperor", "Maximilian II Holy Roman Emperor", "Anna of Austria '49"|...], ["Ferdinand I Holy Roman Emperor", "Anna of Austria"|...], ["Ferdinand I Holy Roman Emperor"|...], [...|...]|...].
In ordinary families, there should be a unique path linking an ancestor and a descendant, but not in European royal families of the 16th and 17th centuries, and certainly not in this one. Since political expedience dictated marriages, unions between cousins or an uncle and his niece were not unusual. The cumulative effect over several generations was to reduce the genetic diversity to unhealthy levels: Charles II of Spain was extraordinarily sickly and died at just 39; his sister only made it to 21.
Leaving royal inbreeding aside, we can add more features to this
genealogy system: finding if two individuals are half-siblings, if they
are related in any way, which common ancestors they share, or which one
is the closest. Each feature requires one or two new rules. I can’t even
think how long it would take me to implement this from first principles
in a more traditional language. Behind the scenes, while most languages
execute statements in the order they are written or as specified by
control structures, Prolog applies a search and unification algorithm.
The unification part solves the problem of matching
parent(X, "Charles II of Spain")
with
parent("Philip IV of Spain", "Charles II of Spain")
for
X = "Philip IV of Spain"
. The search part explores all the
connections between matching facts and rules. It may help to think of it
as a database with a powerful query language. It looks like Prolog would
free the developer from worrying about algorithms, that is, the specific
steps needed to produce results. Unfortunately, it is more complicated.
Sometimes the rule order has a drastic impact on the search space. In
other cases, there are different sets of rules to accomplish the same
task, but they can make the search process more or less efficient: for
instance, see how different sorting algorithms can be implemented [107].
1.2.9.7 Kotlin
Kotlin is a language that runs on the JVM50 and is thus seen as an alternative to Java. It prominently advertises its conciseness [108]:
Concise
Reduce boilerplate code […] For example, one can now create a POJO [Plain Old Java Object] with getters, setters, equals(), hashCode(), toString() and copy() in a single line.
POJOs are just Java objects that don't satisfy any additional specifications. It is customary to provide them with the methods listed in the quote, which used to be written by hand. I distinctly remember when, early in my career, I discovered I had to write POJOs for the simplest tasks, like accessing a database table. It was like running while submerged to the waist: a trivial curiosity required half an hour of mindless typing. I couldn't believe it hadn't been automated in some way. Senior colleagues reassured me that it was standard procedure and that I would improve with practice.
In Kotlin, data classes replace POJOs. An example would be:
data class Person(var name: String, var age: Int)
In Java, this would look like the following snippet, the longest included in this book:
public class Person {
    private String name;
    private int age;
    public Person() {
    }
    public Person(String name, int age) {
        this.name = name;
        this.age = age;
    }
    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public int getAge() {
        return age;
    }
    public void setAge(int age) {
        this.age = age;
    }
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        sb.append(Person.class.getName())
          .append('@')
          .append(Integer.toHexString(System.identityHashCode(this)))
          .append('[');
        sb.append("name");
        sb.append('=');
        sb.append(((this.name == null) ? "<null>" : this.name));
        sb.append(',');
        sb.append("age");
        sb.append('=');
        sb.append(this.age);
        sb.append(',');
        if (sb.charAt((sb.length() - 1)) == ',') {
            sb.setCharAt((sb.length() - 1), ']');
        } else {
            sb.append(']');
        }
        return sb.toString();
    }
    @Override
    public int hashCode() {
        int result = 1;
        result = ((result * 31)
            + ((this.name == null) ? 0 : this.name.hashCode()));
        result = ((result * 31) + this.age);
        return result;
    }
    @Override
    public boolean equals(Object other) {
        if (other == this) {
            return true;
        }
        if ((other instanceof Person) == false) {
            return false;
        }
        Person rhs = ((Person) other);
        return (((this.name == rhs.name)
            || ((this.name != null)
            && this.name.equals(rhs.name)))
            && (this.age == rhs.age));
    }
}
Java developers aren’t fools and, in this day and age, use code
generators to write this boilerplate51 for them. However,
there is a problem with this approach. Generated code needs to be read,
understood, and maintained. Granted, it doesn’t look all that difficult,
but how do we know that no one changed some detail in the equals method after it was generated? That is why code generation isn't a
suitable replacement for abstraction. Moreover, I believe POJOs tell us
something about Java. What is wrong with a language that doesn’t allow
abstracting this boilerplate away? For instance, one in which a
developer can’t express with code the following algorithm: “To implement
equality comparison for a class, for each property in that class, call
the equals
method on that property and return true if all
of these calls return true?” It seems like one of the most
straightforward and useful abstractions you could come up with. In
Python, the specific issue of boilerplate methods can be solved with a
library (for instance, Attrs; see Sec. 1.2.10.4), but, apparently, not in Java.
Kotlin offers much more than data classes, and replacing POJOs was only
one of the motivations behind the language, but one that will get the
attention of POJO-aggrieved Java programmers.
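To make the Python aside concrete, here is a minimal sketch using the standard dataclasses module (Attrs, mentioned above, plays a similar role); it generates the constructor, equality, and string representation that the Java snippet spells out by hand:

from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int

p = Person("Ada", 36)
print(p)                       # Person(name='Ada', age=36)
print(p == Person("Ada", 36))  # True: __eq__ is generated field by field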
1.2.9.8 C++
R’s vector functions perform well because they are written in other, lower-level, but more efficient languages, typically C or C++. The most downloaded R package is indeed one that facilitates mixed R and C++ development, Rcpp. Its maintainers took advantage of new features in C++11, a relatively recent update to the C++ standard, to refactor or, should I say, cull the Rcpp code base. Software engineer Kevin Ushey reports the following [109]:
Eliminating the code bloat and replacing [it] with variadic templates52 was awesome. […] For some hard numbers, I count 31,428 lines of code in the inst/include directory of Rcpp11, while there are 92,726 total lines of code in inst/include for Rcpp.
This process resulted in a 3-fold reduction stemming from a single
new language feature and without any impact on the Rcpp user, for whom
it is entirely transparent. Templates enable the parametrization of
variable types, but until C++11, the number of type parameters had to be
constant. As a workaround, Rcpp developers, to provide the flexibility
of functions with a variable number of arguments, painstakingly
replicated and adapted their code for each number up to a somewhat
arbitrary maximum, 20 (not unlike what my team-mate did with nesting
levels in Sec. 1.2.4, but in that case a
better solution was readily available). For example, searching for all
the Vector create
definitions in Rcpp, we find 20 instances
[110]:
static Vector create(const T1& t1)
static Vector create(const T1& t1, const T2& t2)
static Vector create(const T1& t1, const T2& t2,
const T3& t3)
static Vector create(const T1& t1, const T2& t2,
const T3& t3, const T4& t4)
static Vector create(const T1& t1, const T2& t2,
const T3& t3, const T4& t4,
const T5& t5)
[14 entries omitted]
static Vector create(const T1& t1, const T2& t2,
const T3& t3, const T4& t4,
const T5& t5, const T6& t6,
const T7& t7, const T8& t8,
const T9& t9, const T10& t10,
const T11& t11, const T12& t12,
const T13& t13, const T14& t14,
const T15& t15, const T16& t16,
const T17& t17, const T18& t18,
const T19& t19, const T20& t20)
To add insult to injury, each definition needs a largely repetitive but slightly customized implementation — here omitted. The developers were aware of this massive amount of repetition. They prefaced this code with a comment describing an imaginary abstraction that deals with the 20 cases in a unified way, which was impossible before C++11.
In Rcpp11, instead, there is only one definition for all these cases, and a single implementation [111]:
static Vector create(Args&&... args)
The line count went from 1245 to 144. However, there is a price to pay for this code reduction: an ever more sprawling C++ standard for a language already quite complex in its first incarnation (see Sec. 1.3.2). Its evolution has been guided by an unwavering commitment to backward compatibility,53 which implies features can’t be removed, and an openness to new features. Its standard committee doesn’t seem particularly concerned about the language becoming harder to learn and read. They aren’t alone in taking this position: for instance, Python has been growing as well, adding list comprehensions, coroutines, and f-strings, among other features. Unlike C++, though, Python abandoned full backward compatibility in its 3.0 version. The two incompatible versions were officially supported in parallel for 12 years and still coexist at the time of this writing, which helps explain why language steering committees often accept backward compatibility as a constraint, no matter how burdensome.
1.2.9.9 Query Languages
SQL queries can quickly grow to be unmanageable. Luckily, there are libraries to side-step this problem, generating SQL from much more concise expressions. Dplyr (see Sec. 1.2.10.1) is one such library. For instance, here is a typical R expression using Dplyr:
mtcars %>%
  group_by(cyl) %>%
  filter(mpg == min(mpg) | mpg == max(mpg)) %>%
  select(mpg, model) %>%
  arrange(mpg)
First, remember the meaning of %>%, the pipe operator (see Sec. 1.2.9.3), which takes the expression on its left-hand side and makes it the first argument for the function call on its right side. This code takes the data in table mtcars, a car-related dataset, groups it by the number of cylinders, filters the rows that match the minimum and maximum fuel mileage in each group, keeps only specific columns, and sorts the rows by increasing gas mileage. The value of this expression is the following table:
cyl mpg model
8 10.4 Cadillac Fleetwood
8 10.4 Lincoln Continental
6 17.8 Merc 280C
8 19.2 Pontiac Firebird
6 21.4 Hornet 4 Drive
4 21.4 Volvo 142E
4 33.9 Toyota Corolla
Dplyr can process datasets in memory or database tables, the latter by generating SQL queries. For example, the following is generated from the above expression (assuming that the target database is Postgres, as SQL is still subject to proprietary variations, despite the existence of a standard):
SELECT `cyl`, `mpg`, `model`
FROM (SELECT `cyl`, `mpg`, `model`
FROM (SELECT `cyl`, `mpg`, `model`,
MIN(`mpg`) OVER (PARTITION BY `cyl`) AS `q01`,
MAX(`mpg`) OVER (PARTITION BY `cyl`) AS `q02`
FROM `mtcars`) `q01`
WHERE (`mpg` = `q01` OR `mpg` = `q02`)) `q02`
ORDER BY `mpg`
It is possible that a skilled SQL developer could come up with a more concise query, but there is no denying that the R-Dplyr expression is clearer and more concise.
Impala, an SQL variant specialized for big data, adds support for complex types. One of the advantages of this extension is that “common querying patterns can be expressed concisely” [112]. This statement reprises the concept that this isn’t possible for all patterns (see Steele’s “Huffman encoding problem” in Sec. 1.2.9). With knowledge of which are more valuable, it is possible to make those more concise, while allowing the expression of less common patterns in more complex ways rather than dropping them altogether.
1.2.10 Libraries
As we have seen in Sec. 1.2.9, libraries should not be seen as totally distinct from languages. On the contrary, they are language extensions, and the line between them is sometimes blurred. It is no surprise, then, that we can look at them from the point of view of their conciseness, meaning the conciseness they enable in the programs that use them — we will talk about frugal APIs later on (see Sec. 1.3.3). It is almost a given that using a library allows us to write less code than managing without it (but see Sec. 1.2.9.4 for an exception), possibly at the cost of a steep learning curve. However, not all libraries are created equal. We will see through several examples that some libraries specifically target conciseness. Users value this property, whereas they criticize or avoid libraries that force verbose code, particularly in the form of boilerplate.
1.2.10.1 Data Analysis and Processing
A remarkable example is the library Keras for deep learning (which is covered in Sec. 2.1.4). The brainchild of François Chollet, it entered a segment populated by many strong competitors. It doesn’t intend to replace them but only to provide a better API that makes developing deep learning models easier. On the implementation side, it relies on a user-selectable preexisting deep learning library: you can see it as a thin layer of peanut butter and jelly, with the underlying bread provided by others. Therefore it isn’t any faster or more powerful than any of these libraries, but only more convenient and makes developers more productive. These properties are based, at least partly, on the conciseness it enables. Version 0.2.0 opened its README with “Keras is a minimalist, highly modular neural network library” [113]. Among the design principles explained in the same document, we find: “Minimalism. Each module54 should be kept short and simple (<100 lines of code).” Strangely this principle was dropped in later versions, but not the following one [114]:
No separate models [sic] configuration files in a declarative format. Models are described in Python code, which is compact […].
Keras avoids defining a new, separate description language, finding Python is perfectly adequate and highlighting the compactness of model descriptions, which are also modular and easy to combine. Its users agree: fast.ai co-founder Rachel Thomas calls one of the alternatives to Keras, Tensorflow, “verbose and confusing” and recommends switching [115]. Indeed Tensorflow developers gave it the strongest endorsement, by adopting it as their recommended API and incorporating it into their package distribution. A run-of-the-mill feed-forward neural network in Keras looks as straightforward as it should be:
model = Sequential()
model.add(Dense(10, input_dim=4, activation="relu"))
model.add(Dense(1))
There is a small terminology catch: what everyone else calls a feed-forward neural network here is a sequential model, which may suggest a model for sequences of variable length, such as time series. Its input has length four, as the input_dim parameter specifies. The two layers of this network are added one at a time, calling a function named with the desired connectivity pattern, Dense, and parametrized with their sizes and, optionally, activation functions. Everything else is set to reasonable defaults customizable with additional parameters. Let’s see what this looked like in Tensorflow before Keras was bundled with it:
inputs = placeholder(float32, shape=(None, 4))

# Hidden layer
hidden_size = 10
weights_hidden = Variable(random_normal([hidden_size, 4], stddev=0.01))
bias_hidden = Variable(constant(0.1, shape=(hidden_size, 1)))
hidden = relu(add(matmul(weights_hidden, transpose(inputs)), bias_hidden))

# Output layer
weights_output = Variable(random_normal([1, hidden_size], stddev=0.01), name='weights_output')
bias_output = Variable(random_normal([1, 1]))
output = add(matmul(weights_output, hidden), bias_output)
The general structure is the same: we start with an input layer, created with the opaquely named placeholder, then create a hidden layer, and finally an output layer, but each requires 3-4 lines. Crucial to creating each is the also opaquely named Variable, which creates a vector or matrix object that can be initialized and later optimized during training. You can see different initialization strategies being used. The layers’ definitions make explicit the underlying mathematical operations, such as add, matmul, and relu, and are combined by function composition instead of the add method.
On the one hand, this approach enables the complex architectures used in deep learning, for which the Sequential API in Keras is inadequate. On the other, it exposes lots of details practitioners are often happy to leave at defaults, resulting in three times as much code.
Particularly burdensome are those details about layer dimensions that
need to be compatible between layers and are all calculated
automatically in Keras. Confusingly, the layer dimensions, for which
there are no reasonable defaults, must be specified as parameters to the
initialization strategy, for which, instead, there are defaults. It may
seem a classic case of power vs. ease-of-use trade-off: after all,
reducing a layer to its mathematical definition means we can swap one
transformation for another and create all sorts of experimental neural
modules. Moreover, composing those modules by name means great
flexibility in defining complex architectures. However, Keras isn’t only
for feed-forward neural networks: its Sequential API is just a teaser
for newcomers or the result of recognizing this type of network is
perfectly adequate in many applications. For more advanced ones, it
offers a Functional API, more similar to the one native to Tensorflow,
but with significant differences. The following is how the same network
can be defined using this API:
inputs = keras.Input(shape=(4,))
dense = layers.Dense(10, activation="relu")
hidden = dense(inputs)
outputs = layers.Dense(1)(hidden)
model = keras.Model(inputs=inputs, outputs=outputs)
This approach adds the flexibility researchers need to define cutting
edge architectures while not being substantially more verbose than its
Sequential API. Like the original Tensorflow API, it uses function
composition to assemble neural modules but avoids several of its
pitfalls: this API is still centered around machine learning concepts of
input and layer rather than the more opaque Variable
and
placeholder
as in Tensorflow; the layer dimensions are
specified directly, not through initialization; and there are reasonable
defaults for initialization and numeric types. When necessary, we can
always define a custom layer at the level of mathematical
operations.
However, as good as Keras is, some people think there are better options: it is a very active field at the time of this writing. If Thomas vouches for Keras, fast.ai’s other co-founder Jeremy Howard has already moved on. Reporting on a comparison between Keras and PyTorch, another library for the same class of applications, he writes [116]:
We built models that are faster, more accurate, and more complex than those using Keras, yet were written with much less code.
This code wasn’t immediately available. While I don’t have direct experience with PyTorch, a quick look at plain-vanilla examples showed an API at an abstraction level somewhere between Keras’s functional API and Tensorflow’s, with additional object-oriented boilerplate layered on top, and I wasn’t impressed. However, PyTorch was built for experimentation and not for plain-vanilla models. Therefore, its true worth may not be evident unless someone works with models with the complexity Howard is interested in. Whichever the current champion, it doesn’t look like anyone is finding value in verbose machine learning code, while models are getting more complex over time, putting additional emphasis on compact descriptions. Moreover, no API designer can escape the compromise between the complexity of an API, its power, and the complexity of user code. Nonetheless, some navigate it better than others.
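While I cannot vouch for what Howard’s code looked like, the following is my own sketch of a plain-vanilla PyTorch version of the 4-10-1 network used above, enough to give an idea of the object-oriented boilerplate involved; class and variable names are mine:

import torch
from torch import nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # layer dimensions are spelled out, as in the Keras functional API
        self.hidden = nn.Linear(4, 10)
        self.output = nn.Linear(10, 1)

    def forward(self, x):
        return self.output(torch.relu(self.hidden(x)))

model = Net()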
Large deep-learning models are trained on GPUs. While most practitioners don’t have to code directly for the GPU, those writing advanced custom models often do, a task that requires highly specialized skills. Triton, a library from OpenAI, aims to change that. In their words [117]:
Triton makes it possible to reach peak hardware performance with relatively little effort; for example, it can be used to write FP16 matrix multiplication kernels that match the performance of cuBLAS — something that many GPU programmers can’t do — in under 25 lines of code. […] although a variety of systems have recently emerged to make this process easier, we have found them to be either too verbose, lack flexibility or generate code noticeably slower than our hand-tuned baselines.
As a taste of programming for the GPU, let’s consider the “Hello World”55 of this field, which is the sum of two vectors. Here is the Triton version (lightly edited from [117]):
BLOCK = 512

@jit
def add(x, y, z, n):
    pid = program_id(0)
    index = pid * BLOCK + arange(BLOCK)
    mask = index < n
    gpu_x = load(x + index, mask=mask)
    gpu_y = load(y + index, mask=mask)
    store(z + index, gpu_x + gpu_y, mask=mask)
GPUs operate with a high degree of parallelism, and individual threads are organized in blocks. Here, the program_id call provides an id for the current block as the function add runs in parallel on each block of size BLOCK. Within each block, there can be even more parallelism, but the library manages this transparently. The load and store calls are responsible for moving data to the GPU and back. mask prevents out-of-range accesses. Contrast this with the corresponding program written in C using a library called CUDA (lightly edited from [118]):
__global__ void add_kernel(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

#define N (2048*2048)
#define THREADS_PER_BLOCK 512
#define BLOCKS ((N + THREADS_PER_BLOCK-1) / THREADS_PER_BLOCK)

int add(int *a, int *b, int *c) {
    int *gpu_a, *gpu_b, *gpu_c;
    int size = N * sizeof(int);
    /* allocate GPU buffers */
    cudaMalloc((void **) &gpu_a, size);
    cudaMalloc((void **) &gpu_b, size);
    cudaMalloc((void **) &gpu_c, size);
    /* copy inputs to the GPU, run the kernel, copy the result back */
    cudaMemcpy(gpu_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(gpu_b, b, size, cudaMemcpyHostToDevice);
    add_kernel<<<BLOCKS,THREADS_PER_BLOCK>>>(gpu_a, gpu_b, gpu_c, N);
    cudaMemcpy(c, gpu_c, size, cudaMemcpyDeviceToHost);
    cudaFree(gpu_a);
    cudaFree(gpu_b);
    cudaFree(gpu_c);
    return 0;
}
We need a pair of functions, the first of which is a so-called “kernel,” the part that runs on the GPU. It is marked by the keyword __global__, as the documentation explains:
__global__ is a CUDA C/C++ keyword meaning [function] will execute on the device [and] will be called from the host.
Here “device” means GPU. If you find __global__ a confusing choice to mark these functions, you aren’t alone, since “global” is already a charged word in programming, mostly having to do with the scope of variables. Calculating which indexes to operate upon follows a logic close to the Triton example. However, in this case, instead of an index range, we have a purely scalar operation. The second function runs sequentially on the CPU and takes care of memory allocation, deallocation, and moving data in and out of the GPU. It invokes the kernel with a unique triple chevron syntax that provides information about the desired level of parallelism. While already demonstrating the conciseness of Triton, this basic example doesn’t highlight its ability to perform automatic optimizations that are far from trivial and need to be implemented manually in C and CUDA. An example where this ability is on display is the implementation of the function softmax, available in Python with Triton [117] and C with CUDA [119]. However, a complete understanding of these examples requires a more in-depth treatment of GPU programming than I can provide. Libraries such as PyTorch [120] and Numba [121] enable equally or even more concise GPU programming. Unfortunately, they also impose a significant performance penalty over Triton or hand-optimized C CUDA kernels, in a highly performance-sensitive application domain.
The quest for conciseness isn’t limited to deep learning. For example, the authors of SparkSQL, a module that extends the big data processing system Spark with relational capabilities,56 are keen on touting its conciseness, both of its internals as well as of sample programs. They do so at least seven times in the paper where it was first described [122]. SparkSQL attempts to combine relational processing with semi-structured, ETL-type57 processing. A basic example from the original article is the following:
employees.join(dept, employees("deptId") === dept("id"))
.where(employees("gender") === "female")
.groupBy(dept("id"), dept("name"))
.agg(count("name"))
In this expression, employees and dept are two relational tables. The authors remark that this being just a library within a fully featured programming language (either Scala or Python) means that we can use control structures, variables, and all the other features of a general-purpose language together with the powerful declarative style of SQL (see also Sec. 1.2.10.1).
Programming directly in SQL would not afford these possibilities. Since
evaluation in SparkSQL is delayed as much as feasible, the developer can
build abstractions for common subqueries without a performance penalty,
because only the final, fully assembled query will be submitted to Spark
and therefore optimized as a unit. As a SparkSQL user put it, it is
“concise and declarative like SQL, except I can name intermediate
results,” and naming is essential to reuse.
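As an illustration of naming intermediate results, here is a small sketch of my own in SparkSQL’s Python incarnation, with a made-up employees table; thanks to delayed evaluation, each query built from the named subquery is still optimized as a unit:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
employees = spark.createDataFrame(
    [("Ann", "female", 1, 34), ("Bob", "male", 2, 45)],
    ["name", "gender", "deptId", "age"])

# A named, reusable intermediate result: nothing is computed yet.
females = employees.where(employees.gender == "female")
by_dept = females.groupBy("deptId").count()
by_age = females.groupBy("age").count()
by_dept.show()  # only now is a query assembled, optimized, and run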
The authors of Scalding, also a Scala library, picked this byline for their Twitter account [123]:
Scalding is a scala DSL58 for Cascading, running on Hadoop. It’s concise and functional, and it’s just good, clean fun.
We don’t need to take a slogan at face value. On the other hand, it is what the Scalding team wanted to advertise, as opposed to, say, that it is scalable. Since it runs on top of Hadoop,59 it inherits its scalability properties. However, since Hadoop is written in Java, that was the only language in which to develop programs running on it, and its Java API was phenomenally verbose. For example, I once counted approximately 250 lines to perform a simple relational join (that was using an early version, and I hope more concise alternatives have been developed since). Therefore, it is easy to understand why Scalding developers would like us to focus on its conciseness, which we can appreciate in the following example of a join, despite the verbiage devoted to static typing [124]:
case class Book(title: String, author: String)
case class ExtendedBook(title: String, author: String, copies: Long)
val library1: TypedPipe[Book]
val library2: TypedPipe[ExtendedBook]
// somehow load them with data
val group1: CoGroupable[String, Book] = library1.groupBy { _.title }
val group2: CoGroupable[String, ExtendedBook] = library2.groupBy { _.title }
val theJoin: CoGrouped[String, (Book, ExtendedBook)] = group1.join(group2)
Many data manipulation libraries have the data frame as their central data structure. It can be described as tabular or relational data or, in the jargon of statistics, a two-dimensional collection with one row per observation and one column per variable. There is something about this data structure that makes it so valuable. For example, Spark co-author and Databricks Chief Technologist Matei Zaharia says that “[data frames] give you a very concise way of writing expressions, doing operations” (my transcription from [125]).
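For readers who have never met one, a minimal illustration of a data frame follows, using Pandas, a Python library to which we return below; the rows and values are merely examples:

import pandas as pd

# One row per observation (a car model), one column per variable.
cars = pd.DataFrame({
    "model": ["Datsun 710", "Hornet 4 Drive", "Volvo 142E"],
    "cyl": [4, 6, 4],
    "mpg": [22.8, 21.4, 21.4],
})
print(cars)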
Data frames are built into R as the data structure to represent empirical observations, but how R developers work with them has changed over time. An R library, Dplyr, provides an alternative to the built-in capabilities and won out with users, who praise its conciseness; its speed also helps. Take, for instance, this tweet thread between Vince Buffalo and J. J. Emerson, two evolutionary geneticists [126]:
I ♥ how R makes much of data analysis trivial. How good are genotype imputations by freq class? table(real_geno, imputed_geno, cut(freq, 5))
Some of my old R code (100’s [sic] of lines) could now be reduced to a 20-liner with plyr, ggplot, etc.
Haha, yea mine too. The Hadley-sphere is a black hole for wasted/unnecessary lines of code.
The Hadley-sphere mentioned here is a set of packages, including Dplyr, by the same developer, Hadley Wickham. While not all equally popular, some — Dplyr and Ggplot2 above all — have largely replaced corresponding built-ins or core libraries bundled with the language for most users, a rare feat among libraries, as the advantages of incumbency are as real as in politics (see also Requests in Sec. 1.2.10.3).
Tony Fischetti puts it in no uncertain terms in his blog post “How dplyr replaced my most common R idioms” [127]:
Perhaps more importantly, the dplyr code uses many fewer lines and assignments and is more terse and probably more readable with its neat-o “%.%” operator.
His blog post clarifies that these conciseness advantages are more important than a 2x speed-up for his use cases. The operator mentioned in the quote is a version of the pipe operator (see Sec. 1.2.9.3).
A problem with Dplyr is that its author, to facilitate interactive use, made life harder for people who want to create programs with it [128]. R is a programming language, but, even more importantly, a console for statistical analysis by non-programmers. This dual use is uncommon among programming languages — other examples are MATLAB and SAS. Nonetheless, it is better to support interactive and programming uses with the same constructs and libraries to ease the transition from one to the other. To achieve that, Wickham designed two very similar sub-APIs in the same package, one for each use case. I was involved in a discussion about this as the developer of a related package, and I resisted that option calling it “a doubling of the API footprint” [129]. However, my solution required slightly more boilerplate in programming use cases. More recent versions of Dplyr have gone down a similar path, deprecating one of the sub-APIs (for frugal APIs, see Sec. 1.3.3).
We have already touched upon some of Dplyr’s features in Sec. 1.2.9.3, as they are the best known application of R’s unique macro-like facility. When a data frame table includes columns a and b, we can easily create a new data frame with an additional column c with table = mutate(table, c = a + b), as opposed to the standard R table$c = table$a + table$b. However, what if we wanted to parametrize the columns involved in this operation? Let’s assume the names of the columns are stored, as strings, in three variables a_name, b_name, and c_name. In this case, neither of the above expressions will work. Standard R offers a different operator, [[]], for this case: table[[c_name]] = table[[a_name]] + table[[b_name]]. Dplyr provides a now deprecated sub-API of functions, whose names are suffixed with a cryptic underscore character, to support this case: table = mutate_(table, c_name = paste(a_name, "+", b_name)), where paste concatenates its arguments to yield the string "a + b". While this form avoids the endless repetition of the data frame’s name, it requires forming R expressions by string operations, an error-prone and hard-to-read approach. The currently recommended solution is to refer to the data frame not by its name but by a conventional name that can occur only inside Dplyr expressions, as in table = mutate(table, c_name = .data[[a_name]] + .data[[b_name]]), which is the longest of all the different versions examined here, and restores the repetition of names necessary in standard R. As Steele said about languages, “there are only so many concise notations to go around,” and the same seems to hold for libraries (see Sec. 1.2.9). This is an instructive example of the subtle trade-offs between complexity in a library and in its users’ code and optimizing for different use cases.
Like Keras (see Sec. 1.2.10.1), Dplyr delegates computation to one of several backends, including relational databases. Its developer chose to add advanced SQL features known as window functions, but with limited support [130]:
[…] you can’t access rarely used window function capabilities (unless you write raw SQL), but in return, frequent operations are much more succinct.
Conciseness is difficult; therefore, as we have seen before (see Impala in Sec. 1.2.9.9), an option is making popular features and idioms more concise while allowing a more verbose expression for the less common ones. Here the pain goes further, though, since we have to “punch through the abstraction,”60 reaching for the separate, lower level language of the backend, SQL, and losing the portability61 of “pure” Dplyr expressions.
Unlike R, Python doesn’t come with data frames built-in, but the library Pandas provides equivalent capabilities. However, it is the opinion of some practitioners that it is less elegant than its R counterpart, Dplyr. The project Pandas-ply aims to provide a more Dplyr-like API for Python. Its authors write [131]:
In our opinion, this pandas-ply code is cleaner, more expressive, more readable, more concise, and less error-prone than the original pandas code.
It is still in an experimental stage, and it is far from having the impact on the Python user base that Dplyr had on R users. You can judge whether it should by looking at the following examples, from the Pandas-ply introductory material, the first of which is written using Pandas:
grouped_flights = flights.groupby(['year', 'month', 'day'])
output = pd.DataFrame()
output['arr'] = grouped_flights.arr_delay.mean()
output['dep'] = grouped_flights.dep_delay.mean()
filtered_output = output[(output.arr > 30) & (output.dep > 30)]
Pandas-ply lets you instead write:
(flights
 .groupby(['year', 'month', 'day'])
 .ply_select(
     arr = X.arr_delay.mean(),
     dep = X.dep_delay.mean())
 .ply_where(X.arr > 30, X.dep > 30))
While the line count is similar, the Pandas example is more “busy,”
with temporary results stored in intermediate variables and more
repeated variable names. The Pandas-ply version uses a chained method call syntax that avoids intermediate variables and uses a special object X to simulate some of the Dplyr syntax based on R’s macro-like facility, which has no equivalent in Python.62
However, even more telling is the comparison with their closest R counterpart, Dplyr. Recalling the example (see Sec. 1.2.9.9) where we sought to compute, for
each group of data, the entries matching the maximum and minimum for a
variable, it is instructive to try and reproduce that with Pandas and
Pandas-ply. Simply applying a filter operation to grouped data doesn’t
work for either package because grouped data frames don’t have the same
methods as their ungrouped counterparts. In Pandas, we can use this
workaround:
"cyl")
mtcars.groupby(apply(lambda x: x[(x["mpg"]==x["mpg"].max()) | (x["mpg"]==x["mpg"].min())])[
."mpg", "model"]
[
]"mpg") .sort_values(
The same workaround also works for Pandas-ply:
"cyl")
mtcars.groupby(apply(lambda x: x.ply_where((X.mpg==X.mpg.max()) | (X.mpg==X.mpg.min())))
."mpg", "model")
.ply_select("mpg") .sort_values(
In Pandas or Pandas-ply, a grouped data set is more like a list of groups, and we can use an approach from functional programming to apply a function to every element in this list, each a data frame. In this function, created with a lambda expression, the desired selection is performed using standard Pandas data frame operators. It is hard to ignore the numerous references to X, similar to the .data identifier in Dplyr, but for which Dplyr offers more concise, albeit less general, alternatives. On the result of this apply call, we can select variables of interest and perform a sort, alternating operators like [[]] and methods like sort_values, but in closer analogy with Dplyr. Unlike Dplyr, Pandas doesn’t provide
specialized idioms for interactive sessions, as this use case is less
dominant and more recent in Python. Independent of the use case,
treating grouped data frames differently from ungrouped ones makes the
API harder to use for no increase in power. It represents a loss of
orthogonality, the ability for a set of functions or methods to
operate on different types of objects with no exclusions. Here we have
an example of a lost opportunity to use the more concise form:
table.group_by(...).filter(...)
instead of the more verbose:
table.group_by(...).apply(lambda x: x.filter(...))
The former is simply disallowed: an arguably intuitive line of code returns an error instead of something useful.
One of the advantages of being an incumbent, like Pandas, is third-party support. For example, using package Modin together with Pandas, we can partially parallelize a program [132]. Just by importing Modin, unmodified Pandas code will run in parallel. It is the most concise parallelization I have ever seen (compare it with MPI in Sec. 1.5).
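A minimal sketch of the pattern follows; the file name is made up, and the only change to the Pandas program is the import line (Modin also requires one of its execution engines to be installed):

import modin.pandas as pd   # instead of: import pandas as pd

flights = pd.read_csv("flights.csv")   # unchanged Pandas-style code below
flights.groupby(["year", "month"]).mean()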
1.2.10.2 Statistical Graphics
In statistical graphics, the quest for conciseness starts with its design, independent of any implementation. In The Visual Display of Quantitative Information, author Edward Tufte defined the data-ink ratio, the ratio of ink spent to display data vs. the total amount used in a chart, and recommended maximizing it [133]. It is a somewhat counterintuitive definition, as it may seem to encourage the use of heavy marks, like thick lines and large dots; this outcome, though, is prevented by another one of Tufte’s principles, that of data density — the number of data points per unit of area. The result is graphics that display lots of data with just enough ink, minimize and de-emphasize everything that isn’t data, like grids and scales, and eschew ornamentation altogether.
SimplexCT, a data analysis consultancy, provides a step-by-step example of transforming a chart that Tufte wouldn’t hesitate to call chartjunk into one following his principles [134]. Its starting point is full of useless and distracting graphical elements, as in Fig. 5.