I’ve just been given membership to an online research site that you might want to join, too. It’s called Forgotten Books, an online warehouse of over a million books dating back to the 1500s, and all the way up to the 1940s, all image-over-text editions, fully searchable and readable and downloadable in numerous formats. It’s not free, but it is affordable, and superior to Google Books in several respects.
I was given a free lifetime membership if I blogged about the site (no matter what I said, good or bad). But that benefit wouldn’t be worth anything if the site wasn’t worth using. And it definitely is of some use. And that’s worth knowing about. It has some defects that need fixing (and its management is working on those). But it has uses as well. I’ll summarize my thoughts on both counts.
The Basic Plus Side
Its million-plus books going all the way back to the 1500s are browsable and searchable, images and all. Continuing membership is very affordable. And with it not only can you skim and read countless old books, you can also run searches on them, and build statistical charts with the results, similar to the text-analysis features I taught and used back at Columbia University when I worked for the Digital Texts Service at the campus library, where I helped numerous researchers compile their own digital texts, search routines, and statistics. And now anyone can do it, online, for cheap. They’ve already compiled the search indexes, so building searches and statistics does not involve the long processing delays you would face doing this in the raw (for more on how they built their indexes, and their limitations and capabilities, see here). You can also search for images in these books (of which there must be millions).
You pay the site not only for the tools and services they provide, but also for their visual builds. Although the text of these books is public domain, the editions, reconstruction, formatting, and presentation are proprietary (e.g. the images are their reproductions, so they still have photography rights). So if you wanted to reuse images or page views, beyond fair use, you would still need to negotiate permissions through them. Otherwise…
You can grab and read texts as PDFs, Kindle files, and any number of other formats. If you use the online book reader at the site, it has a bookmarks and notes function, much like a Kindle. Dedicated mobile apps for the site are also in development. The number of books you can read or download per month is limited by the level of membership you subscribe to, but you can preview a large portion of every book without limit. And all the other features are unlimited use. They also sell paperback editions of most (and soon all) of their books, if you want to have a hard copy.
You can see along the left margin at the site the many categories of books they have. I was most intrigued by their massive collection of esoteric titles: afterlife and immortality, alchemy, astrology, ciphers and codes, ESP and psychic phenomena, freemasonry and secret societies, magic and witchcraft, theosophy, unexplained phenomena. Thousands of titles. Think of the kinds of data (as well as entertainment) you can reap from a collection like that.
The research ability in early American history is profound as well, so studying what people really were saying around the American Revolution or the Civil War is right there for you to search through. Other interesting subjects include books in philosophy, religion, ancient history, early science, languages (including dictionaries from other centuries, valuable for studying words as they have changed meaning; and books in languages other than English, including Spanish, Latin, Italian, French, and German), as well as fiction, and more. They also have a collection of administrative records (genealogy data, audits and surveys, minutes and reports).
You can start learning all about the Forgotten Books service and what you can do with it here.
Experimenting and Limitations
The service offers as examples of what you can do with it:
With this valuable research information, we can tell you virtually anything about anything, from the most commonly used word in fiction books published in 1765, to the book with the most images of cats in the first 20 pages. Or perhaps some more useful information, such as a list of every word in the English language in order of usage frequency.
Experts in this kind of textual analytics will want to know how clean the site’s OCR is. So I asked. They estimate an average error rate of about 2%, and that agrees with what I saw (some items, e.g. organizational minutes, are worse than others, e.g. commercially printed novels). But they actually know where most of the errors are (every failure to match their master word list is flagged) and are continually working to clean them up (so the database will continue to get more accurate over time). And because the books are presented as image-over-text, those errors will only matter for searches and analytics, not for reading the text, since you can simply see the page as it was scanned, regardless of whether the OCR got the text right.
In our correspondence about this, the creator of the site informed me that in other OCR projects like Google Books…
Older books have more OCR errors, therefore the frequency of all words will be skewed, showing for every word that its frequency increases over time. [But w]hat is really increasing is the quality of the text. Forgotten Books’ word data does not have this fundamental flaw because non-dictionary words were excluded from the calculations in order to correct for this error.
Their ongoing project to manually improve the text will improve statistical results even further.
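To make that correction concrete, here is a minimal sketch of the idea in Python. It is my own toy illustration, not Forgotten Books’ actual code: the function name, the tiny dictionary, and both token lists are invented for the example. The point is only that garbled tokens inflate the total word count, so dropping non-dictionary tokens before dividing moves the estimate for a dirty scan closer to the clean value.

```python
from collections import Counter

def relative_frequency(tokens, word, dictionary=None):
    """Share of `word` among the counted tokens.

    If a dictionary is given, tokens not found in it are dropped first,
    so OCR garbage no longer inflates the denominator.
    """
    if dictionary is not None:
        tokens = [t for t in tokens if t in dictionary]
    counts = Counter(tokens)
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

# Hypothetical toy data: the same sentence twice, once cleanly scanned
# and once with three OCR errors ("mct", "jklbany", "thc").
DICTIONARY = {"the", "synod", "met", "in", "albany"}
clean_scan = ["the", "synod", "met", "in", "albany", "the", "synod", "met"]
old_scan = ["the", "synod", "mct", "in", "jklbany", "thc", "synod", "met"]

raw_old = relative_frequency(old_scan, "the")                   # garbage counted
filtered_old = relative_frequency(old_scan, "the", DICTIONARY)  # garbage dropped
clean = relative_frequency(clean_scan, "the")

print(raw_old, filtered_old, clean)
```

On this toy input the filtered figure for the dirty scan lands nearer the clean one than the raw figure does, though it does not recover it exactly. With real texts the size of the effect would depend on how evenly the errors are spread across words.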
To test the site out I decided to experiment.
I went into “administrative records > minutes and reports” and found 6344 titles there. There is an “about” page explaining what kind of works are in that category, and I can list titles alphabetically or by popularity or relevance, but not yet by date of publication, although that feature is definitely on the way. Its absence is a significant limitation to a historian, but they assure me it won’t be absent for long. Word searches, however, can be limited to specific dates or date ranges. But many of the texts have not been date-coded properly, and so right now you will get dirty search results by date: for example, sometimes a book from 1977 will end up in your “before 1830” filtered search list. I’m assured they are fixing this as well, with manual verification and correction of every title and date (it’s a top priority), so searches will get cleaner over time.
Two search features are worthy of additional note. First, the default “book” search (i.e. without using the drop-down menu to search a specific data field like “title”) combines the results of title, author, and the 100 most common words in the book (excluding stopwords and matching plurals). A convenient feature. Second, the drop-down menu allows other options such as the “page search,” which allows you to search every page in every book in the library at once. For example, here are the results for the phrase “black magic” (I apologize if that doesn’t load for non-subscribers).
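For the curious, a default search like that could be sketched roughly as follows. This is only my guess at the general shape, under loud assumptions: the stopword list is a tiny stand-in, the plural handling is a crude “strip a trailing s” rule, and every name here (`index_terms`, `matches`, and so on) is hypothetical rather than anything the site has published.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be far longer.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "was"}

def singular(word):
    """Crude plural folding: drop a trailing 's' from words over three letters."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def tokenize(text):
    """Lowercase alphabetic words, with plurals folded to a common form."""
    return [singular(w) for w in re.findall(r"[a-z]+", text.lower())]

def index_terms(title, author, body, top_n=100):
    """Terms a default 'book' search might match: title words, author words,
    and the top_n most common non-stopword body words."""
    body_counts = Counter(w for w in tokenize(body) if w not in STOPWORDS)
    common = {w for w, _ in body_counts.most_common(top_n)}
    return set(tokenize(title)) | set(tokenize(author)) | common

def matches(query, terms):
    """True if every word of the query appears among the indexed terms."""
    return all(w in terms for w in tokenize(query))

terms = index_terms(
    "Minutes 1851",
    "Reformed Church in America",
    "The synod met. The synods of Albany and New York received copies of the minutes.",
)
print(matches("synods", terms))          # plural folds to "synod"
print(matches("church minutes", terms))  # author word plus title word
print(matches("black magic", terms))     # neither word is indexed
```

Folding plurals on both the index side and the query side is what lets “synods” find a book that mostly says “synod.” The trailing-s rule mangles words like “class,” so a real system would want a proper stemmer or a plural lookup table.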
In the “about” page for the category I chose to experiment with we learn interesting things like this:
The Woman’s Committee, United States Council of National Defence: An Interpretative Report by Emily Newell Blair is particularly interesting. This is a report on the First World War (1914-1918) and is particularly notable because the author, Emily Newell Blair, was a renowned feminist and suffragist, providing a feminine and very different perspective from all the male dominated governance of the time. She was also a fantastic writer in her own right and this free ebook of her report should be enjoyed by anyone with an interest in the history of this war a century ago.
And that’s just one gem among thousands to find here.
But when I went exploring in the six thousand entries in this category I found some limitations in the navigation. If you know what you are looking for, it’s easy to find (using the general search functions rather than the category browsing). But if you want to peruse or narrow entries within a category, it’s a bit awkward. They are aware of these issues, and improvements are on the way. But current users should be aware of this. Searches can be run by words in the title, but that doesn’t help when you want to make sure you are searching every book in a specific subject. Right now there is only category browsing, and no category search filter. But that should be an easy feature to add, and they’ve told me they are working on it.
From the first few entries of the six thousand or so, I picked one at random, Minutes 1851 by Reformed Church In America Particular Synod Of New York, and noted that the description section is very dirty OCR (but there is a warning note saying to expect that). For example, it begins “Five copies of the Minutee of the last sesfiion of the Partaculftr Synod ef Jklbany were received and laid on the table. Article VL Corrrspondbnck. Nothing oceurred. Article VII. Classical Rkports.” So you can see the character recognition wasn’t doing too well there. But when I went to read the minutes, of course, it was all image-over-text, so I had little difficulty. Still, you can see how searching the text will be of limited reliability if the underlying text is as dirty as the description paragraph.
In this case that dirtiness is to be expected, since, being the published minutes of such a long time ago, the inking of the type is a bit poor for modern scanners to handle, although the human eye does fine. You will get cleaner OCR from commercially printed books, even from the same period.
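If you wanted to put a rough number on that dirtiness yourself, the word-list flagging mentioned earlier is easy to imitate. The sketch below is an assumption-laden illustration, not the site’s tooling: `WORD_LIST` is a tiny hand-made stand-in for a real master word list, and `flag_unmatched` is a name I made up. It runs on the first sentence of the description quoted above.

```python
import re

def flag_unmatched(text, word_list):
    """Return the tokens missing from word_list and their share of all tokens."""
    tokens = re.findall(r"[A-Za-z]+", text)
    flagged = [t for t in tokens if t.lower() not in word_list]
    rate = len(flagged) / len(tokens) if tokens else 0.0
    return flagged, rate

# Hypothetical stand-in for a master word list (real ones hold many thousands of entries).
WORD_LIST = {
    "five", "copies", "of", "the", "last", "synod",
    "were", "received", "and", "laid", "on", "table",
}

sample = (
    "Five copies of the Minutee of the last sesfiion of the "
    "Partaculftr Synod ef Jklbany were received and laid on the table."
)

flagged, rate = flag_unmatched(sample, WORD_LIST)
print(flagged)
print(f"{rate:.1%} of tokens flagged")
```

On that one sentence the sketch flags five of the 22 tokens, far above the 2% average the site estimates, which fits the point that old minutes scan worse than commercially printed books. Note that a check like this only catches errors that produce non-words. An OCR slip that happens to yield a different real word would pass unflagged.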
But back to my test. The minutes I browsed contain tables of data, as well as plain news items and interesting remarks, so you can imagine the wealth of historical and cultural information you could dig up here or glean in general. For example, these minutes contained a complete breakdown of all the churches and pastors and schools under that synod (many dozens), a census of members and students for each, cash receipts and outlays, and more. Plus church trials and rulings and appeals regarding the dismissal of pastors and accusations against them. And more.
I then went into the “afterlife and immortality” section and found 180 titles there. I picked one early on, The Astral Plane: Its Scenery, Inhabitants, and Phenomena by C. W. Leadbeater, published in 1900. I found it full of confident descriptions of the astral plane and its contents (by a theosophist, of course). Tons of cultural assumptions to explore and marvel at in there. Likewise interesting alternative religious ideas. And so on. Did you know there are really seven astral planes, each higher than the last? Hence we’re told we must not call the astral plane the fourth dimension. That would be an error. Indeed.
I then went into “philosophy -> metaphysics” and found Philosophy: What Is It? by F. B. Jevons, published in 1914, which is interestingly described as (sic):
One of the branches of the Workers Educational Association expressed a desire to know what Philosophy is; thereby assuming that Philosophy is a concern of the average man and of practical life, and should not be the monopoly of the professed student. Of the truth of this j.view there can be no doubt, and this book consists of the five lectures which, were given by way of an attempt, not so much to answer their question as to bring out the meaning of the question. Hence the interrogative form of the title PhUosophy: what is it? The attempt was necessarily made, in the discussion of the question, to avoid technical terms as far as possible. Without technical terms it is impossible, it may be said, to go very far in the discussion.
Notice the cleaner OCR here (as compared with the minutes before).
Overall, Forgotten Books is superior to Google Books (which faces similar defects anyway). The ways you can employ it, the size of the collection, the variety of formats in which you can read books downloaded or viewed from it, the ability to bookmark and annotate, and the commitment to fix what flaws it has over time (which I expect will go more rapidly the more subscribers they get), all combine with its relative affordability to make this a site at least worth taking a look at.