ImageMagick Fun

The “fun” in the title should be read in your most sarcastic tone of voice… Anyway, one of my professors mailed us a PDF of a scanned document to read (and print out) for the next class. Since it was scanned in (by what appeared to be the professor literally holding it above a scanner), there was a lot of excess black in the picture.

I don’t know about you, but printing 2 large blocks of solid black, for 22 pages, doesn’t sound like a wise investment of toner. But ah! Why don’t I just crop off the excess part of each page so that just the scanned-in text is visible, and print that out? This has to be easy, right?

Unfortunately it wasn’t as easy as I’d hoped (most of the picture editors that can even handle PDFs can’t print out each layer as a separate page, and there’s no way I’m doing the exact same operation 22 times). ImageMagick looked like the thing I needed, even if it would take some trial-and-error to figure out exactly how much to crop off.

Turned out it only took a couple of runs to figure out exactly how much I could get away with cropping. But I had a worse problem than having to do trial runs: The output looked horrible.

I tried reading the man page, going to the website, and the rest, and couldn’t figure out what to do. Using the -density option seemed to be the right idea, but alas I couldn’t get it to work.

I troubleshot further, even getting to the point of running gs manually to see whether Ghostscript or ImageMagick was the problem (turned out it was me, I guess). Eventually I realized that Ghostscript was rendering the initial image for ImageMagick at a low resolution (72 DPI). Viewing the source in Okular made it obvious that much better was possible (I’d estimate 200 DPI, although I ended up using 300). So if I could figure out how to get ImageMagick to pass the right DPI to Ghostscript, I should have the problem fixed.
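For anyone who wants to repeat that check, here is a rough sketch of running gs by hand at two resolutions so the results can be compared side by side. The file names, the png16m output device, and the choice to render only the first page are assumptions for illustration, not necessarily the exact invocation used at the time. The script only prints the commands, so nothing is rendered until you run them yourself.

```shell
#!/bin/sh
# Sketch: build a Ghostscript command that rasterizes page 1 of a PDF at a given DPI.
# "input.pdf" and the output file names are hypothetical placeholders.

INPUT="input.pdf"

gs_render_cmd() {
    dpi="$1"
    # -r sets the rendering resolution; -dFirstPage/-dLastPage limit the run to one page.
    printf 'gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -r%s -dFirstPage=1 -dLastPage=1 -sOutputFile=%s %s\n' \
        "$dpi" "page1-${dpi}dpi.png" "$INPUT"
}

# Print one command per resolution; run them and compare the two PNGs.
gs_render_cmd 72
gs_render_cmd 300
```

If the 300 DPI PNG looks sharp and the 72 DPI one looks like the bad output, the resolution handed to gs is the culprit rather than the crop itself.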

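More directed Google searching revealed I’d had the right flag the whole time, -density. I just had it in the wrong spot: it has to come before the input file, so that it applies when the PDF is read in and rasterized. Something like this is right: convert -density 300x300 input.pdf -crop ... output.pdf. Instead I’d been using convert input.pdf -density 300x300 -crop ... output.pdf, which sets the density only after the pages have already been rendered at the default resolution.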
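Spelled out as a small script, the working order looks roughly like the sketch below. The crop geometry and file names are made-up placeholders (the real numbers came from trial and error and depend entirely on the scan), and +repage is an addition in this sketch that resets the page offset after cropping. The script prints the command rather than running it, so the ordering can be checked first.

```shell
#!/bin/sh
# Sketch: -density goes BEFORE the input PDF so it applies while the PDF is rasterized.
# All values below are hypothetical placeholders; adjust them for your own scan.

INPUT="input.pdf"
OUTPUT="output.pdf"
DENSITY="300x300"
CROP="1950x2700+300+300"   # WIDTHxHEIGHT+XOFFSET+YOFFSET in pixels at 300 DPI (assumed)

crop_cmd() {
    # Settings listed before the input file affect how that file is read.
    printf 'convert -density %s %s -crop %s +repage %s\n' \
        "$DENSITY" "$INPUT" "$CROP" "$OUTPUT"
}

# Print the command for inspection.
crop_cmd
```

To run it for real, paste the printed line into a terminal, or pipe it to sh once the numbers look right.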

I figured I’d put my experience out there in the great Internet Memory Machine in case others have similar troubles.

10 thoughts on “ImageMagick Fun”

  1. mpyne (Post author)

    Baxeico: I considered pdfcrop but it seemed that it’s designed more for removing the ridiculous amount of whitespace provided in the LaTeX-generated PDFs from articles and journals. If it had a feature to remove black borders from scanned documents then I must have missed it.

  2. muuloo

    Another trick would be to use pdfimages (part of poppler-utils) to extract all the embedded images 1:1 (without letting Ghostscript scale them). This will usually give you the best quality possible when you deal with PDFs that contain embedded images.
    The tool is also very useful when you want to get some other images from a pdf ;-)

  3. goffrie

    You could use unpaper to automatically crop the black borders for you, instead of using ImageMagick. (You still need to get an image as a pnm first, though.)

  4. Pingback: Links 1/3/2010: New Linux Benchmarks ARM Development Studio for Linux | Boycott Novell

  5. twitter

    I second the poppler-utils recommendation. Once you have the images out via pdfimages, you can start to hack away with ImageMagick. If your scanner was good enough to do text conversion, use pdftotext. pdftohtml is also nice.

  6. teebs

    I’ve imported documents into Inkscape and simply put white boxes/shapes over the area in question before printing. It only imports one page at a time, so it may be a touch tedious on larger documents.

  7. Pingback: Destillat #11 | duetsch.info - Open Source, Wet-, Web-, Software
