ps2pdf making very large pdf files

Tue Sep 17 17:34:45 EST 2002

On Tue, Sep 17, 2002 at 02:36:43PM +1000, Michael Still wrote:
> On Tue, 17 Sep 2002, Martijn van Oosterhout wrote:
> > How complicated is postscript to parse? (PDFs are just a subset of
> > postscript, IIRC.)
> 
> Sorta, but not quite (to my limited understanding of postscript). PDF
> doesn't have execution like ps, and only has limited conditionals. The
> page descriptions in PDF are like ps ignoring those issues, but the rest
> of the file structure is different...

But can you include raw bits of postscript? Some tricky postscript stuff
does seem to survive the transition to PDF.

> > What you would have to do is parse the postscript in the source PS (so
> > something like ghostscript). You would also have to parse the DSCs (document
> > structure) to identify pages. You would also have to define and extract the
> > subdocuments within it (so embedded EPS files, fonts, function libraries)
> > so that they become separate objects.
> 
> This is the bit that scares me.
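
For what it's worth, the page-identification part on its own need not be
too scary if the input really is DSC-conforming. A rough sketch in Python,
assuming well-formed %%Page:, %%Trailer and %%BeginDocument/%%EndDocument
comments (split_dsc_pages is a made-up name, and this does none of the
actual postscript interpretation):

```python
def split_dsc_pages(ps_text):
    """Split DSC-conforming PostScript into prolog, pages and trailer.

    Hypothetical helper: it only looks at DSC comments at the start of a
    line and assumes the producer wrote them correctly.
    """
    prolog, pages, trailer = [], [], []
    current = prolog
    depth = 0  # nesting level of %%BeginDocument ... %%EndDocument
    for line in ps_text.splitlines():
        if line.startswith("%%BeginDocument"):
            depth += 1
        elif line.startswith("%%EndDocument"):
            depth -= 1
        elif depth == 0 and line.startswith("%%Page:"):
            # A top-level "%%Page: label ordinal" opens a new page section.
            pages.append([])
            current = pages[-1]
        elif depth == 0 and line.startswith("%%Trailer"):
            current = trailer
        # DSC comments inside an embedded document stay with the page
        # that contains them.
        current.append(line)
    return prolog, pages, trailer
```

The depth counter is there so that a %%Page: comment inside an embedded EPS
file is not mistaken for a page of the outer document.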

Note that I may have overstated a bit here. While distiller seems to be good
at removing duplicate fonts and function packages, it didn't notice completely
duplicate files. So I ended up using the following code to force the file
into an object.

[ /BBox [0 0 0 0] /_objdef {EPS_0_follower} /BP pdfmark
... EPS code ...
[ /EP pdfmark

> > Then, once you've defined all the objects, you need to optimise. Remove the
> > duplicates, remembering to catch them even if they have different names. This
> > is so that if you include a dozen EPS files from Illustrator, you only get
> > the special Illustrator PS code once. Note this means moving the postscript
> > code from inside the EPS file to a global scope so all EPS's can use the
> > same code. But then you have to be careful about name collisions.
> 
> Ahhh. Panda already does this for you (although if you change filenames, I
> assume that you're a twit and don't deserve duplicate elimination).

That's not too bad. Panda is obviously a bit more sophisticated than I
remember :).
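
If you wanted to catch duplicates regardless of filename, comparing content
rather than names would do it. A minimal sketch in Python, assuming each
embedded object is available as a byte string (dedupe_by_content is a
hypothetical name, and I am not claiming this is how Panda or distiller
work):

```python
import hashlib

def dedupe_by_content(objects):
    """Collapse byte-identical objects that were included under different names.

    objects: dict mapping name -> bytes (e.g. embedded EPS files).
    Returns (unique, alias): unique maps digest -> bytes, and alias maps
    every original name -> the digest of the copy it should reference.
    """
    unique, alias = {}, {}
    for name, data in objects.items():
        digest = hashlib.sha256(data).hexdigest()
        unique.setdefault(digest, data)  # keep only the first copy
        alias[name] = digest
    return unique, alias
```

This only finds byte-identical copies. Two EPS files that differ in a header
comment would still count as different objects.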

> > Also, if the user has selected downsampling to 72 dpi, check all embedded EPS
> > files to see if they can be downsampled and if it would save space. This
> > applies mostly to embedded bitmaps.
> 
> Panda doesn't downsample rasters at the moment, but this is probably less
> of an issue than the image reuse. It wouldn't be all that hard to code.

Well, it helps a lot when someone thinks it's a cool idea to include a scanned
photo on the front page and, by the way, it's 24-bit 300dpi.
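
To put rough numbers on that, here is the uncompressed size of such a photo
at 300 dpi versus 72 dpi. The 6 x 4 inch print size is purely an assumption
for illustration, and raw_image_bytes is a made-up helper:

```python
def raw_image_bytes(width_in, height_in, dpi, bits_per_pixel=24):
    """Uncompressed size in bytes of a raster covering the given area."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

full = raw_image_bytes(6, 4, 300)   # 1800 x 1200 pixels -> 6480000 bytes
small = raw_image_bytes(6, 4, 72)   # 432 x 288 pixels   -> 373248 bytes
print(full, small, round(full / small, 1))  # 6480000 373248 17.4
```

The ratio is just (300/72) squared, so it holds for any print size before
compression is taken into account.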

> > Update references and cross references. Process the pdfmark commands to
> > enable the special PDF features. Work out which glyphs are actually used and
> > include only those. If a glyph is only used once, include it directly instead
> > of leaving it in the font (but only if it saves space).
> 
> This is also a bit of work...

I don't know how fonts are stored. All I know is that the program ttf2pfa
does a reasonably good job. Obviously if the font is one that Acrobat
includes by default you don't need to bother.

> PS: The other thing you can do in PDF is page "templates", where you can
> draw the repeatedly used elements of several pages only once.

I wonder if that's that BP/EP/SP pdfmark stuff I mentioned earlier.
-- 
Martijn van Oosterhout   <kleptog at svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.