ps2pdf making very large pdf files

Michael Still mikal at stillhq.com
Tue Sep 17 14:36:43 EST 2002


On Tue, 17 Sep 2002, Martijn van Oosterhout wrote:

> On Tue, Sep 17, 2002 at 01:09:12PM +1000, John Griffiths wrote:
> > Well, I can't code worth a damn, but I can make and test.
> >
> > At 12:47 PM 9/17/02 +1000, Michael Still wrote:
> > >On Tue, 17 Sep 2002, Martijn van Oosterhout wrote:
> > >
> > >> Unfortunately I haven't seen an open-source product come close to it.
> > >
> > >How complicated is PDF to parse? We could write one using Panda!
>
> How complicated is PostScript to parse? (PDFs are just a subset of PostScript,
> IIRC.)

Sorta, but not quite (to my limited understanding of PostScript). PDF
doesn't have general execution like PS, and only has limited conditionals.
Those issues aside, the page descriptions in PDF look like PS, but the rest
of the file structure is different...

> What you would have to do is parse the PostScript in the source PS (so
> something like ghostscript). You would also have to parse the DSCs (document
> structure) to identify pages. You would also have to define and extract the
> subdocuments within it, so that embedded EPS files, fonts and function
> libraries are extracted as separate objects.

This is the bit that scares me.

> Then, once you've defined all the objects, you need to optimise. Remove the
> duplicates, remembering to catch them even if they have different names. This
> is so that if you include a dozen EPS files from Illustrator, you only get
> the special Illustrator PS code once. Note this means moving the PostScript
> code from inside the EPS file to a global scope so all EPS files can use the
> same code. But then you have to be careful about name collisions.

Ahhh. Panda already does this for you (although if you change filenames, I
assume that you're a twit and don't deserve duplicate elimination).
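
For what it's worth, matching duplicates by content rather than by name
could look something like the sketch below. It hashes each object's bytes and
treats equal digests as the same object. The function name
`dedupe_by_content` and the name-to-bytes dict are made up for illustration;
this is not Panda's API, and I'm not claiming it is how Panda does it.

```python
import hashlib


def dedupe_by_content(objects):
    """Map every object name to the first name seen with identical bytes.

    `objects` is a hypothetical dict of {name: raw bytes}, e.g. embedded
    EPS files or rasters that have already been extracted.
    """
    first_seen = {}  # content digest -> canonical name
    alias = {}       # every name -> canonical name
    for name, data in objects.items():
        digest = hashlib.sha256(data).hexdigest()
        # setdefault keeps the first name registered for this digest
        alias[name] = first_seen.setdefault(digest, name)
    return alias
```

Anything whose alias differs from its own name can then be dropped and its
references pointed at the canonical copy.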

> Also, if the user has selected downsampling to 72 dpi, check all embedded EPS
> files to see if they can be downsampled and whether that would save space.
> This applies mostly to embedded bitmaps.

Panda doesn't downsample rasters at the moment, but this is probably less
of an issue than the image reuse. It wouldn't be all that hard to code.
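
The arithmetic behind the "would it save space" check is roughly the
following. This is only a sketch of the size calculation, not a resampler,
and the function names are hypothetical. It assumes the raster's source
resolution is known.

```python
def downsampled_size(width_px, height_px, source_dpi, target_dpi=72):
    """Return (width, height) in pixels after downsampling to target_dpi.

    Rasters already at or below the target resolution are returned
    unchanged, since resampling them would not save any space.
    """
    if source_dpi <= target_dpi:
        return width_px, height_px
    # Integer arithmetic, rounded to nearest; never go below one pixel.
    new_w = max(1, (width_px * target_dpi + source_dpi // 2) // source_dpi)
    new_h = max(1, (height_px * target_dpi + source_dpi // 2) // source_dpi)
    return new_w, new_h


def pixel_savings(width_px, height_px, source_dpi, target_dpi=72):
    """Fraction of pixels removed by downsampling (0.0 means no gain)."""
    new_w, new_h = downsampled_size(width_px, height_px, source_dpi, target_dpi)
    return 1.0 - (new_w * new_h) / (width_px * height_px)
```

So a 1200x900 raster at 300 dpi would come down to 288x216, which is about
94% fewer pixels before any compression is applied.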

> Update references and cross references. Process the pdfmark commands to
> enable the special PDF features. Work out which glyphs are actually used and
> include only those. If a glyph is only used once, include it directly instead
> of leaving it in the font (but only if it saves space).

This is also a bit of work...
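
The easy half is working out which glyphs are used at all. A rough sketch,
assuming the text has already been pulled out as (font name, string) runs and
pretending that one character equals one glyph (real fonts go through an
encoding, so this is a simplification). `used_glyphs` is a made-up name. The
hard half, actually rewriting the font program, is not shown.

```python
from collections import defaultdict


def used_glyphs(text_runs):
    """Collect the set of characters drawn with each font.

    `text_runs` is a hypothetical iterable of (font_name, text) pairs.
    """
    used = defaultdict(set)
    for font, text in text_runs:
        used[font].update(text)  # adds each character of the string
    return dict(used)
```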

> Finally, output the objects again as a compressed PDF. I don't know how much
> of this ps2pdf does, but it's quite a bit of non-trivial work.

Well, only the streams of page description and rasters can be compressed,
and Panda already supports this.
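
For the curious, stream compression in PDF is the FlateDecode filter, which
is zlib/deflate. A minimal sketch of wrapping a content stream that way is
below. `flate_stream` is a hypothetical helper, not a Panda function, and it
leaves out the object numbering and the rest of the file structure.

```python
import zlib


def flate_stream(raw: bytes) -> bytes:
    """Wrap raw stream data as a FlateDecode-compressed PDF stream.

    Only the stream dictionary and body are produced; the surrounding
    "N 0 obj ... endobj" wrapper and xref entry are not handled here.
    """
    body = zlib.compress(raw)
    header = b"<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(body)
    return header + body + b"\nendstream\n"
```

Note that /Length is the length of the compressed data, not the original.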

> I have to say, Adobe has a nice product there.

Yeah, a couple of geek years to implement.

Mikal

PS: The other thing you can do in PDF is page "templates", where you can
draw the repeatedly used elements of several pages only once.
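
As far as I understand it, the mechanism for this is a form XObject: the
shared drawing is stored once as its own stream, and each page paints it with
the `Do` operator. A rough sketch that only builds the byte strings follows.
The helper names and the `/Tpl` resource name are made up for illustration,
and the default bounding box assumes a US Letter page.

```python
def form_xobject(content: bytes, bbox=(0, 0, 612, 792)) -> bytes:
    """Build the dictionary and stream for a form XObject (stored once)."""
    box = " ".join(str(v) for v in bbox).encode("ascii")
    header = (
        b"<< /Type /XObject /Subtype /Form /BBox [" + box + b"] /Length "
        + str(len(content)).encode("ascii") + b" >>\nstream\n"
    )
    return header + content + b"\nendstream\n"


def page_content(template_name: str, body: bytes) -> bytes:
    """Page content that paints the shared template, then its own body.

    The page's /Resources /XObject dictionary must map template_name to
    the form XObject; that wiring is not shown here.
    """
    # q ... Q saves and restores the graphics state around the template.
    return b"q /" + template_name.encode("ascii") + b" Do Q\n" + body
```

Each page then carries only a few bytes to reference the template instead of
a full copy of the repeated drawing.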

-- 

Michael Still (mikal at stillhq.com)     UMT+10hrs



