Convert books into PDF taking photos of pages

20 Oct, 2012

Convert books into PDF by taking photos of the pages and processing them into a searchable digital backup copy that you can take anywhere.

Making a PDF from a printed book is simple: you just need to take a photo of every page and run all the photos through a couple of programs. It is faster and easier than you might think! Here’s how.

Step 1: take photos of pages.

My improvised photographing rig was just some cardboard boxes to prop up the iPad and a light source to illuminate the book and avoid dark shadows on the pages. Other times I just put the book on a chair, the iPad on a desk and the light source next to the iPad.

The ideal setup gives full-screen photos of the open book, with the pages spread flat and straight across. Dark photos or curved pages won’t work well. A good criterion is that you can easily read the text in every photo you take.

Once you have a good setup you can start taking photos. It’s just that: turn page, take photo, turn page, take photo, until you have gone through the whole book. It takes approximately 10 minutes for a 250-page book, or even less.
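The timing follows from simple arithmetic. Here is a minimal sketch, assuming each photo captures two facing pages and that turning a page and shooting takes about five seconds (an illustrative pace, not something I measured precisely):

```python
import math

def capture_minutes(pages, seconds_per_photo=5.0, pages_per_photo=2):
    """Estimate how long it takes to photograph a book.

    seconds_per_photo is an assumed pace (turn page + take photo).
    """
    photos = math.ceil(pages / pages_per_photo)
    return photos * seconds_per_photo / 60.0

# 250 pages -> 125 photos -> a bit over 10 minutes at 5 s per photo
print(round(capture_minutes(250), 1))  # prints 10.4
```

Change `seconds_per_photo` to your own pace to see how long a thicker book would take.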

Step 2: process photos to isolate content

Once you have all the photos on your computer, the content has to be isolated. To do this you can use the excellent free and open source Scan Tailor, available for Mac, Windows and Linux.

A sample page photo converted to text

The program is a full toolkit for going from a pile of scans or photos to clean content images. It is divided into six different areas which guide you through all the processing phases. Before entering the processing steps you have to select the images from your drive and specify their DPI. I usually go with 300×300 DPI for iPad photos, and apply that to all the images in the folder.
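If you would rather estimate the DPI than guess it, divide the pixel width of the photo by the physical width of what the photo shows. This is a rough sketch; the function name and the sample numbers (a 2592-pixel-wide photo covering an open book about 22 cm across) are assumptions for illustration only:

```python
CM_PER_INCH = 2.54

def estimate_dpi(photo_width_px, subject_width_cm):
    """Pixels per inch of a photo whose width spans subject_width_cm."""
    if subject_width_cm <= 0:
        raise ValueError("subject width must be positive")
    return photo_width_px / (subject_width_cm / CM_PER_INCH)

# Hypothetical example: 2592 px across a 22 cm wide spread
print(round(estimate_dpi(2592, 22.0)))  # prints 299
```

With numbers like these, 300 DPI turns out to be a reasonable value to enter.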

phase 1: fix page orientation

Scan Tailor: fix orientation

Rotate the page with the clockwise arrow (pink circle); rotate all pages in the project with the batch button (blue square)

Rotate pages 90 degrees clockwise or counterclockwise. If you took all the photos with the same orientation you can apply the same transformation in batch.
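What a 90 degree rotation does to the pixels can be sketched in a few lines. This is only an illustration on a tiny grid of numbers, not Scan Tailor's code, and `rotate_cw` and `rotate_ccw` are hypothetical helper names:

```python
def rotate_cw(grid):
    """Rotate a row-major pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def rotate_ccw(grid):
    """Rotate a row-major pixel grid 90 degrees counterclockwise."""
    return [list(row) for row in zip(*grid)][::-1]

photo = [[1, 2, 3],
         [4, 5, 6]]

# Batch mode: the same transformation applied to every photo.
batch = [rotate_cw(p) for p in [photo, photo]]
print(batch[0])  # prints [[4, 1], [5, 2], [6, 3]]
```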

phase 2: split pages

Split pages mode, with a page automatically split in two

This is the first ‘magic’ mode of the program: it detects the crease line between the book pages and generates two distinct images from one photo of an open book. Simply hit the batch button of this phase and let the program go through all the photos, then review the results in the list on the right. Most of the photos will be split correctly, and you can fix the few wrong ones manually.
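One simple way to picture crease detection: the gutter between the pages usually shows up as a dark vertical band near the middle of the photo. The sketch below picks the darkest column in the central part of a grayscale grid and splits there. It is an illustration under that assumption and is not taken from Scan Tailor; the function names and the `search_fraction` parameter are made up for the example:

```python
def find_crease_column(gray, search_fraction=0.3):
    """Return the index of the darkest column near the horizontal center.

    gray is a list of rows with brightness values 0 (black) to 255 (white).
    Only the central search_fraction of the width is examined, so dark
    page edges far from the middle are ignored.
    """
    width = len(gray[0])
    half_span = max(1, int(width * search_fraction / 2))
    center = width // 2
    lo = max(0, center - half_span)
    hi = min(width, center + half_span + 1)
    column_sums = [sum(row[x] for row in gray) for x in range(lo, hi)]
    return lo + column_sums.index(min(column_sums))

def split_at(gray, x):
    """Split every row at column x into a left page and a right page."""
    left = [row[:x] for row in gray]
    right = [row[x:] for row in gray]
    return left, right
```

A book that is not fully open or a crease far from the center would defeat this naive version, which is the kind of photo you end up fixing by hand.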

phase 3: deskew

Manually deskewing a page with Scan Tailor

Fine tune the page rotation with one of the two blue handles on the page, or by entering an angle in the box on the left

Deskewing lets you fine tune the rotation of the page halves detected in the previous step. This phase can also be performed automatically with the batch button; however, I found that it is not as precise as the previous one (about 80% correct) if the photos were taken with the book not fully open. It takes some practice to understand what the program can fix automatically and what it cannot.
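If you prefer typing an angle into the box over dragging the handles, the skew can be estimated from two points picked along one line of text. This is a minimal sketch, assuming image coordinates with y growing downward; I have not verified which sign the angle box expects, so try it on one page first:

```python
import math

def skew_angle_degrees(p1, p2):
    """Angle of the line p1 -> p2 relative to horizontal, in degrees.

    Points are (x, y) pixel coordinates with y increasing downward,
    so a positive result means the text line slopes down to the right.
    """
    dx = p2[0] - p1[0]
    dy = p2[1] - p1[1]
    return math.degrees(math.atan2(dy, dx))

# Hypothetical points: a text line that drops 35 px over 1000 px
angle = skew_angle_degrees((100, 200), (1100, 235))
print(round(angle, 1))  # prints 2.0
```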

phase 4: select content

Automated content selection with Scan Tailor may be less accurate if your finger appears in the photo over the text

Content selection tells the program which part of the image is relevant for the final output; the rest is ignored. This phase can also be run in batch and retouched manually later if some selections are imprecise. It turns out that foreign objects in the photos, for example a finger or a hand, can in some cases reduce the accuracy of the automated selection. In that case a quick selection in manual mode fixes the problem.
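Conceptually, content selection boils down to finding the box that encloses the printed area. Here is a rough sketch on a grayscale grid, assuming anything darker than a threshold counts as ink. The threshold value and function name are illustrative, and this is not Scan Tailor's algorithm. It also shows why a finger in the frame is a problem: a dark blob away from the text stretches the box.

```python
def content_box(gray, ink_threshold=128):
    """Bounding box (left, top, right, bottom) of all pixels darker than
    ink_threshold, with right and bottom exclusive. None if page is blank."""
    xs, ys = [], []
    for y, row in enumerate(gray):
        for x, value in enumerate(row):
            if value < ink_threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)

page = [[255] * 8 for _ in range(6)]
page[1][2] = 0          # some text
page[3][5] = 0          # more text
text_only = content_box(page)

page[5][7] = 60         # a "finger" in the corner of the photo
with_finger = content_box(page)
print(text_only, with_finger)  # prints (2, 1, 6, 4) (2, 1, 8, 6)
```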

phase 5: margins

Margin determination

Select some margins for the content. Just make sure that “match size with other pages” is selected.

This phase simply adds some margins around the content boxes. Just make sure that “match size with other pages” is selected, so that you get uniform output.
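The effect of matching sizes can be pictured like this: every page gets the margins you chose, and then all pages are padded to the largest resulting size. The sketch below is a simplified reading of that option, not Scan Tailor's code, and the names and sample sizes are hypothetical:

```python
def uniform_page_size(content_sizes, margin):
    """Given (width, height) of each content box and a margin applied on
    all four sides, return the single page size that fits every page."""
    widths = [w + 2 * margin for w, _ in content_sizes]
    heights = [h + 2 * margin for _, h in content_sizes]
    return max(widths), max(heights)

# Content boxes in pixels; the last page is only half full
sizes = [(1200, 1800), (1180, 1750), (1210, 900)]
print(uniform_page_size(sizes, margin=50))  # prints (1310, 1900)
```

Without the option, the half-full page would come out much shorter than the others, which looks odd in the final PDF.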

phase 6: output

Output processing in Scan Tailor: this converts all photos to clean, straight black and white text

Tweak the parameters on one photo, then let batch mode do the magic for you.

This is the final step. Tweak the parameters of this phase until you are satisfied with the results, then launch batch processing and let the computer crunch your pages for a while (a 300-page book takes 20 to 30 minutes on my Core i7 Mac). One of the most important features of the program is found on this screen (green box in the image): automatic dewarping. In essence, it detects the folds and bends of the page and flattens it out into an almost perfect rectangle of text. It works so well that it sometimes seems like magic! As usual, though, you can manually refine the result in the dewarp pane.

You can tweak all the parameters on one image and then use batch mode to produce a processed image for every page. I usually output in black and white at double the DPI I chose at the beginning. I also set the output to be thicker, but all these parameters really depend on how you took the photos, so experiment until you get the best output, then apply it to all images.
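The black and white conversion can be thought of as thresholding: each pixel becomes black or white depending on its brightness. The sketch below uses a fixed threshold plus an adjustment; raising the threshold turns more gray pixels black, which I assume is roughly what a thicker output setting does. Names and values are illustrative, and Scan Tailor's own binarization is likely more refined.

```python
def to_black_and_white(gray, threshold=128, thickness=0):
    """Binarize a grayscale grid: 0 for black, 255 for white.

    thickness shifts the threshold; positive values make strokes bolder
    because more borderline pixels are classified as ink.
    """
    cutoff = threshold + thickness
    return [[0 if value < cutoff else 255 for value in row] for row in gray]

def output_dpi(input_dpi, factor=2):
    """Output resolution as a multiple of the DPI set at import time."""
    return input_dpi * factor

row = [[30, 120, 140, 250]]
print(to_black_and_white(row))                # prints [[0, 0, 255, 255]]
print(to_black_and_white(row, thickness=20))  # prints [[0, 0, 0, 255]]
print(output_dpi(300))                        # prints 600
```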

Step 3: convert images into a searchable PDF.

Once the images have been created by Scan Tailor, you need another program to do optical character recognition (OCR) and obtain a searchable document. I personally use ABBYY FineReader Express for Mac. It has no options or tweaks: just point it to the folder of your output images and let it think for a while. It will produce a nice PDF file in which you can search, highlight, copy text snippets and so on. You can now ditch your original photos and enjoy a nice digital copy of your book.

ABBYY FineReader is a paid program. I haven’t experimented with free options, but you can also upload the processed photos to Google Docs or Evernote, and their servers will eventually do the OCR for you.
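Whichever OCR tool you pick, it needs the pages in reading order, and plain alphabetical sorting puts page_10 before page_2. Here is a small sketch of a natural sort for the output files; the folder name `out` and the `.tif` extension are assumptions about where your processed images ended up, so adjust them to your setup:

```python
import re
from pathlib import Path

def natural_key(name):
    """Split a file name into text and integer chunks so that
    'page_2' sorts before 'page_10'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]

def pages_in_order(folder="out", pattern="*.tif"):
    """Return the image paths in folder, sorted in reading order."""
    return sorted(Path(folder).glob(pattern), key=lambda p: natural_key(p.name))

names = ["page_10.tif", "page_2.tif", "page_1.tif"]
print(sorted(names, key=natural_key))
# prints ['page_1.tif', 'page_2.tif', 'page_10.tif']
```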

Now it’s time to scan your books

This guide has shown you how to convert books into PDF by taking photos of the pages. You already have everything you need to migrate your old physical library into the digital world! It is a fun little project that anyone can do, so it really is time to ditch your conventional dead-tree library in favor of the digital version.

