How to use Sphinx to give an old book new life

The Internet Archive, Project Gutenberg, and Google Books are wonderful sources of historical books, but the finished products of their digitization efforts, while thorough and functional, lack that last bit of polish. For example, one of my interests is historical cooking, specifically Georgian and Regency British cookery and the contemporary period in American cookery, but the PDF versions of the relevant cookbooks are usually just basic black and white scans with no features that aid findability or searchability. The plain text versions, while more searchable, are not aesthetically pleasing and often contain numerous optical character recognition errors.

One of my pet projects is to combine my love of historical cooking and my love of open source by leveraging open source documentation tools to produce polished, web-based versions of 18th-century cookbooks. When I am done, these books will be fully searchable and have hyperlinked tables of contents and indexes. To accomplish this, I have turned to the Sphinx documentation builder, which gives me a simple way to create, organize, and build functional HTML books with little effort. As an added bonus, Sphinx can create PDF and EPUB versions just as easily.

The first book I am converting, Hannah Glasse's "The Art of Cookery Made Plain and Easy," is not quite ready for public consumption, but in this article I share what I have learned from working on it so others can learn from my workflow. Think of it as a highly specialized Sphinx tutorial. Most of what I am going to describe is common knowledge to people working on open source documentation, but my hope is that by sharing my process, I will teach people outside the open source world who have moderate computer skills about a tool they can use to make public domain texts related to their hobbies and interests more user friendly. Granted, Sphinx does require some familiarity with the command line, but I will explain the complex parts.

Installing Python and Sphinx

Sphinx is the tool used by the Python project to produce its own documentation, and it is itself written in Python. To get started, you will need to install Python on your system. Download the Python 3.x installer for your operating system or install Python 3 using your Linux distribution's package manager (e.g., dnf install python3 on Fedora).

When installing Python, make sure pip is installed as well. If you are using an installer, check that the option to install pip is selected; if you are using a Linux distribution's package manager, install the relevant package alongside python3. On Fedora the package is called python3-pip. Once you have Python and pip installed, you can install Sphinx using the pip package manager. Open a Command Prompt on Windows, Terminal on macOS, or your terminal emulator of choice on Linux and type: pip3 install -U sphinx. If you see error messages, rerun the command as a privileged user by prefacing the command with sudo on Linux and macOS, or by using the "Run as Administrator" option to launch a Command Prompt on Windows. Depending on how your system is configured, you may be prompted for an administrator password.
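As a quick sanity check after installation, a short Python sketch like the one below can confirm that the interpreter you are running can see the sphinx package. The helper name is_installed is made up for this illustration; it relies only on importlib from the standard library.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found by this interpreter."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("sphinx"):
        print("Sphinx is available to this Python interpreter.")
    else:
        print("Sphinx was not found; try: pip3 install -U sphinx")
```

If the script reports that Sphinx was not found even though pip reported success, pip may have installed it for a different Python installation than the one running the script.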

Once the installation is successful, you will have everything needed to get started using Sphinx. There are also additional packages that you can install to provide more options.

Setting up a new Sphinx project

To create a new Sphinx project, open a Command Prompt or Terminal, use cd to navigate to the location you want to store your project in (e.g., /home/USERNAME/Projects), and then run the sphinx-quickstart command. The quickstart will prompt you to provide information about your project and make choices about various configuration options. Please note that under Windows, the quickstart's prompts will be surrounded by nonsense characters, but this is harmless and the quickstart will run just fine. The first prompt, "Root path for documentation.", lets you set the location of your project.

If you are running the sphinx-quickstart command from the directory you want to store your files in, you can press enter on your keyboard to accept the default value. If you want to store your project in a sub-directory of your current working directory, enter a name for that sub-directory at the prompt and Sphinx will create the directory and store all the project files in that location.

Next you will be prompted about using separate "source" and "build" directories. The default is no, which places the source files in the project's root directory and builds will end up in the "_build" sub-directory. Selecting the default value is perfectly fine, but if you want to keep your source and build files more separated, you can select yes. This will put the source files in the "source" sub-directory and built output will go in the "build" sub-directory.

The third prompt allows you to change the prefix for the "_static" and "_templates" sub-directories. There is no compelling reason to change this from the default option.

The next two prompts are for the project's name and the author's name. The project name can be the title of the book you are converting, but using the original book's author's name for the author field can lead to some oddities like the long dead author's name appearing next to the current year in the copyright information on page footers. It is up to you how to best handle the author field, but I set it to my name and credit the original author within the project's text.

You will then be prompted to assign a version number and a release number to the project, which in this case makes little sense, but it is a required field. Feel free to set this to whatever makes the most sense to you. It can be set to a number less than 1 for a "beta" release and incremented towards 1.0 as you progress towards a finished book, or you can set the version and release to the original book's year of publication and not have to worry about updating it.

The next setting, the project's language option, should be set to the book's original language. Setting the language option to the language the book was originally published in allows Sphinx to create output with proper language metadata in the site's headers and generate text, such as "Table of Contents" and "Quick Search," in the correct language.
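The answers to these prompts end up as ordinary Python assignments in the conf.py file that the quickstart generates, so none of them are permanent. A rough sketch of the relevant lines follows; the author name, year, and version numbers are placeholder assumptions for illustration, not values from a real project.

```python
# conf.py (excerpt): every value below is an illustrative placeholder

project = "The Art of Cookery Made Plain and Easy"
author = "Your Name"  # the person doing the conversion, not the original author
copyright = "2024, Your Name"  # appears in page footers

version = "0.1"  # short version number
release = "0.1"  # full release string

language = "en"  # language the book was originally published in
```

Editing these lines and rebuilding is enough to change what appears in the generated pages.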

For most users, the next two prompts should be set to the default. The first of the two prompts asks the user to select the file extension for project files. This defaults to .rst (for reStructuredText, the markup language used by Sphinx), so there is no reason to change it. The second prompt asks for the name of the main file used for the project. This file is where the content of the site's main page and the table of contents resides. By default, this file is called "index.rst" with the "index" portion coming from what is entered at this prompt and the ".rst" extension coming from the previous prompt. If you want to change it to "frontmatter" or something else, so that it does not conflict with using the name "index" for the book's index, you can, but there is very little reason to do so.

The next set of prompts are all related to Sphinx extensions. For a project like this, it is safe to say no to all of them. The one exception applies if you plan to host your finished site using GitHub Pages; in that case, say yes to the "create .nojekyll file" prompt.

Finally, you will be asked if you want to create a Makefile and a Windows command file that can be used to generate the HTML output without having to run sphinx-build and manually specify input and output directories. You should say yes to the option that is relevant for your system (Makefile for macOS and Linux, or the Windows command file for Windows) or say yes to both if you plan on collaborating with other people who use different operating systems. Linux and macOS users will need to make sure that make is installed on their system, but the Windows command file will run without installing anything extra.

Creating chapters and adding content

Now that the setup is done, you can leave the command line alone until you need to build your finished product. All the content creation and editing can take place in a text editor. GitHub's Atom editor is an excellent choice, but any text editor will work as long as it saves plain text files.

Start by opening the project's "index.rst" file in Atom, or your text editor of choice. This file is where you will build your table of contents by listing all the files that will be included in the book. It is also a great place to include all the other front matter, like the title page, from the book you are converting.

The first edit you should make to "index.rst" is removing the line that reads "* :ref:`modindex`" under the "Indices and tables" heading. It is completely unnecessary for this kind of project. You can then modify the text of "index.rst" to suit your project's needs. The lines with equal signs under them are headings, and their text can be changed to whatever best matches the book you are converting. For example, the "Welcome to [your project's name]'s documentation!" heading can be replaced by the actual title of the book. The line that reads "Contents:" can be changed or removed, whichever works best for your project. In fact, you can add as much text as you want; just be sure to leave the ".. toctree::" and ":maxdepth: 2" lines intact.

To build a table of contents and start structuring your project, leave one blank line after the ":maxdepth: 2" line and start listing the content you will include in your book. You need to indent the list so that it lines up with the initial colon in the ":maxdepth: 2" line, which means three spaces before the start of the text. These sections need to be in order, so an "introduction" chapter would come before "chapter1" and so forth. These lines should not have the ".rst" file extension, but when you do create the files for the different chapters, you will save them using the same name listed in "index.rst" plus the ".rst" extension. For example, you would have a line that reads "chapter1" in your "index.rst" file, but the file that contains Chapter 1's content will have the filename "chapter1.rst". For simplicity's sake, save all the files in the same directory as the "index.rst" file.
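To make the indentation and blank-line rules concrete, here is a small Python sketch that prints a toctree block in the layout described above. The function name build_toctree and the chapter names are assumptions chosen for illustration; the output is simply text you could paste into "index.rst".

```python
def build_toctree(chapters, maxdepth=2):
    """Return the text of a toctree directive listing the given chapter names."""
    lines = [
        ".. toctree::",
        f"   :maxdepth: {maxdepth}",
        "",  # one blank line separates the option from the entries
    ]
    # Entries are indented three spaces and written without the .rst extension.
    lines.extend(f"   {name}" for name in chapters)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_toctree(["introduction", "chapter1", "chapter2"]))
```

Each name in the list corresponds to a file with the same name plus ".rst" saved next to "index.rst", and the order of the list is the order of the book.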

Creating a chapter and filling it with content involves creating a new file in your text editor, adding in the content from the book in any manner you want (transcribing by hand, copying and pasting OCRed text, or whatever works best for you), and saving the file using the appropriate file name (e.g., "chapter1.rst" for the chapter listed as "chapter1" in "index.rst"). Most of the work comes from getting the formatting of the text correct, and given reStructuredText's fairly straightforward markup, that is not too laborious or complex.

When dealing with older texts, most of what you need to know is how to make a heading (a series of = under a line of text), a subheading (a series of - under a line of text), bold text (text with ** to the left and right), and italic text (text with * to the left and right). One thing to note is that the table of contents is generated using headings, so be sure to include at least one heading at the beginning of each file (it can be "Chapter X" with = signs on the line below, if the book you are working on does not contain textual headings for its chapters).
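The markup rules above are simple enough to capture in a few helper functions. The sketch below is only an illustration (the names heading, bold, and italic are invented here); it shows that the underline is made as long as the heading text, which is what reStructuredText expects.

```python
def heading(text, underline="="):
    """Return text followed by an underline of the same length."""
    return f"{text}\n{underline * len(text)}\n"


def bold(text):
    """Wrap text in double asterisks for bold."""
    return f"**{text}**"


def italic(text):
    """Wrap text in single asterisks for italics."""
    return f"*{text}*"


if __name__ == "__main__":
    print(heading("Chapter I"))            # top-level heading
    print(heading("To roast beef", "-"))   # subheading (hypothetical recipe title)
    print(f"Take a {bold('large')} piece of beef and {italic('roast')} it.")
```

In practice you will usually type this markup by hand; a script like this is mainly useful if you are converting a long OCRed text and already know where the headings fall.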

More reStructuredText formatting options can be found in the reStructuredText documentation, should you want to use more advanced formatting. While Sphinx can recreate most typographic features of older books, typing the "Long S" and various ligatures is time-consuming using a modern keyboard, so you should decide if you want to include them. You might want to go for authenticity or opt for simplicity.

One feature that you will find almost impossible to recreate is the printer's catchwords at the bottom of each page because the easiest way to convert a book into HTML is chapter based, so the pagination will obviously not be retained. If you are really adamant about keeping the catchwords, you could do the labor-intensive work of creating a file for each individual page and organizing your project that way, but the end results would still not be perfect.

Recreating a book's index

Recreating a book's index as a hyperlinked index is probably the most labor-intensive part of this process. You will be reverse engineering the index by reading through the book's index, finding the relevant part of a chapter, be it a recipe or some other kind of text, and including Sphinx's index directive at the appropriate locations. To recreate an index entry that reads "Beef, How to roast," you would find the relevant recipes and add the line ".. index:: Beef; How to roast" next to them, ideally just above each recipe so that the index link lands at its start. The index would then contain a "Beef" main entry and a "How to roast" sub-entry, which would have links to relevant recipes.
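As a rough illustration of the directive's shape, the following Python sketch builds the index line for a main entry and an optional sub-entry and prints it above a recipe heading. The function name index_directive and the recipe title are hypothetical. Placing the directive above the heading is a choice made in this sketch, following the Sphinx documentation's suggestion to put index directives before the thing they refer to.

```python
def index_directive(main, sub=None):
    """Return an index directive line for a main entry and optional sub-entry."""
    entry = f"{main}; {sub}" if sub else main
    return f".. index:: {entry}"


if __name__ == "__main__":
    title = "To roast beef"  # hypothetical recipe heading
    print(index_directive("Beef", "How to roast"))
    print()  # blank line between the directive and the heading
    print(title)
    print("-" * len(title))
```

The semicolon is what separates the main entry from the sub-entry; everything before it becomes the top-level term in the generated index.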

Building the website

Once you have created files for each chapter in the book you are converting, filled those files with all the original content, and recreated an index, you need to create the finished HTML. After all that hard work, this is the easiest step. On the command line, in your project's root directory, type make html. This will build your site and place the finished product in the "html" sub-directory of either the "build" directory or the "_build" directory, depending on whether you chose to separate the source and build directories when you ran the quickstart ("build" if you separated them, "_build" if you did not). There are even more build options, such as EPUB and PDF. Some of these options require additional software, so check the Sphinx documentation for details. In Sphinx's documentation you will also find information about a large number of advanced settings and other options, which can help if you want to further tweak your finished project.
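If make is not available, the same build can be started through Python, since Sphinx can be invoked as python -m sphinx -b BUILDER SOURCEDIR OUTPUTDIR. The sketch below only assembles and prints that command; the function name sphinx_build_command is invented for this example, and the directory names assume the quickstart defaults described earlier.

```python
import sys


def sphinx_build_command(builder="html", separated=False):
    """Return the argument list for building with the given Sphinx builder."""
    # Quickstart layouts: "source" and "build" if separated, else "." and "_build".
    source_dir = "source" if separated else "."
    build_root = "build" if separated else "_build"
    return [
        sys.executable, "-m", "sphinx",
        "-b", builder,
        source_dir,
        f"{build_root}/{builder}",  # e.g. _build/html, like "make html"
    ]


if __name__ == "__main__":
    # Printing only; pass the list to subprocess.run(..., check=True) to build.
    print(" ".join(sphinx_build_command("html")))
    print(" ".join(sphinx_build_command("epub")))
```

Run from the project's root directory, the printed html command should write to the same location that make html uses.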

With these steps, you've given an old book new life.

4 Comments

A great intro to Sphinx. I've used Sphinx for a few projects, and it really is a great compromise between the complexity of Docbook and the over-simplification of Markdown.

sphinx? For almost all your needs, Calibre will do the job with almost no effort.

I've never heard of using Calibre to generate websites before. Can it do that directly, or do you mean you'd use Calibre to generate an epub, and then dissect the epub and re-purpose the generated HTML?

Either way, I think it's good to know a variety of tools. Sphinx is a good one, as is Calibre (and pandoc, and xmlto, and a few others).

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.