If you have a PDF scan of a book with two pages per sheet, such as this one (by the way, this is a manuscript dating from 1449: “Tractato de septe peccati mortali” by Frate Antonino, image copyright Houghton Library, Harvard University, Cambridge, Mass.), you can split it into one page per sheet as follows:
- Query the PDF for the number of pages and the resolution:
look at the “Pages” output of this command. Now type:
pdftoppm -gray -l 1 ugly.pdf test
then inspect the resulting test-001.pgm file with an image editor to find out its size in pixels. For the number of pages and the image size I got 223 and 1650 x 1275 pixels respectively, so these numbers are used in what follows; you should of course adapt them to your own results.
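As a rough sketch of this step, both numbers can also be obtained from the command line instead of an image editor. The two helper names below are made up for illustration. I am assuming that pdfinfo (usually shipped alongside pdftoppm) is installed for the page count, and that the PGM header has no comment lines, with the magic number on the first line and the width and height on the second.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original recipe.

# Print the page count of a PDF, assuming pdfinfo is installed
# and prints a line of the form "Pages:   223".
page_count() {
    pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}

# Print "WIDTH HEIGHT" (in pixels) of a PGM file.
# Assumes line 1 holds the magic number and line 2 holds "WIDTH HEIGHT",
# with no comment lines in between.
pgm_size() {
    { read -r pgm_magic; read -r pgm_width pgm_height; } < "$1"
    echo "$pgm_width $pgm_height"
}
```

Usage would be `page_count ugly.pdf` and `pgm_size test-001.pgm`.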
- Create a bash script to process a single page:
cat > doone.sh
#!/bin/bash
page=`printf '%03d' $1`
pagenew=`printf '%03d_' $1`
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dFirstPage=$1 -dLastPage=$1 -sOutputFile="$page.pdf" ugly.pdf
pdftoppm -gray "$page.pdf" > "$page.pgm"
convert -crop 825x1275 "$page.pgm" "$pagenew.pgm"
rm "$page.pgm"
^D
chmod u+x doone.sh
Note that for the width in the -crop geometry of the convert command, I enter half (825) of the horizontal size found above (1650); in this way each image is split into a left half and a right half. The pdftoppm command also has a -mono option to produce monochrome images, and a -r option to set the resolution in DPI (the default is 150).
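The crop width is simply half of the image width, so it can be computed rather than typed in by hand. A minimal sketch follows; crop_geometry is a name I made up, and it assumes the two book pages meet exactly in the middle of the sheet.

```shell
# Hypothetical helper: build the -crop geometry for splitting a sheet
# of WIDTH x HEIGHT pixels into a left and a right half.
crop_geometry() {
    width=$1
    height=$2
    # Integer division; with an odd width the tiling leaves a third,
    # 1-pixel-wide tile at the right edge.
    echo "$((width / 2))x${height}"
}

crop_geometry 1650 1275   # prints 825x1275
```

The result could then be passed to convert as `-crop "$(crop_geometry 1650 1275)"`.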
- Run the bash script on all pages:
seq 223 | xargs -n1 ./doone.sh
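xargs normally carries on after a page fails and only reports the problem through its own exit status at the end. If you would rather stop at the first failing page, a plain loop does it. This is only a sketch: run_pages is a hypothetical name, it takes the last page number followed by the command to run, and it only helps if the per-page script reports failure through its exit status (doone.sh as written returns the status of its final rm, so it would need something like set -e near the top).

```shell
# Hypothetical helper: run a command once per page number 1..LAST,
# stopping at the first page for which the command fails.
run_pages() {
    last=$1
    shift
    page=1
    while [ "$page" -le "$last" ]; do
        # "$@" is the command; the page number is appended as its last argument.
        "$@" "$page" || { echo "page $page failed" >&2; return 1; }
        page=$((page + 1))
    done
}
```

Usage would be `run_pages 223 ./doone.sh`.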
- Finally concatenate the pages to get hold of the converted PDF:
convert *_.pgm nice.pdf
or do that in two steps:
for i in *.pgm; do convert -compress fax $i `basename $i .pgm`.pdf; done
gs -q -sPAPERSIZE=a4 -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=nice.pdf *_.pdf
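Both variants rely on the shell expanding *_.pgm and *_.pdf in sorted order, which is why doone.sh pads the page number to three digits. Here is a small sketch of the effect; padded_name is a hypothetical helper mirroring the printf in doone.sh, and sort in the C locale stands in for the glob ordering.

```shell
# Hypothetical helper mirroring the file naming used in doone.sh.
padded_name() {
    printf '%03d_.pgm\n' "$1"
}

# Unpadded names: 10_.pgm sorts before 2_.pgm, so pages would be out of order.
printf '%s\n' 2_.pgm 10_.pgm | LC_ALL=C sort | tr '\n' ' '; echo

# Padded names: 002_.pgm sorts before 010_.pgm, as wanted.
{ padded_name 2; padded_name 10; } | LC_ALL=C sort | tr '\n' ' '; echo
```

With more than 999 pages the padding width would have to grow accordingly.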