IFX Mail Archive: RE: IFX> PDF/is Issue.

RE: IFX> PDF/is Issue.

From: Hastings, Tom N (hastings@cp10.es.xerox.com)
Date: Fri Mar 14 2003 - 11:50:39 EST

Next message: don@lexmark.com: "RE: IFX> FW: Meeting: {IPP FAX / PDF-is} March 14, 2003 10:00 AM America/Los_Angeles {123123}"

Previous message: Rick Seeler: "RE: IFX> FW: Meeting: {IPP FAX / PDF-is} March 14, 2003 10:00 AM America/Los_Angeles {123123}"
Maybe in reply to: Rick Seeler: "IFX> PDF/is Issue."

Kari,

Thanks for the explanation. That helps a lot for those of us not very
familiar with PDF.

Just to check our objectives for PDF/is:

We want to make sure that existing PDF readers can read PDF/is without any
modification, right?
In other words, the PDF/is specification is a subset of the full PDF
specification.

However, PDF/is writers will most likely be new or modified PDF writers,
right?
In other words, since PDF/is is a subset of PDF, the PDF/is writer has to
make sure it doesn't emit those features or representations of PDF that are
outside the PDF/is subset when creating a conforming PDF/is file.

Tom

-----Original Message-----
From: Poysa, Kari
Sent: Wednesday, March 12, 2003 06:04
To: Hastings, Tom N; 'Rick Seeler'; 'Carl Kugler'
Cc: ifx@pwg.org
Subject: RE: IFX> PDF/is Issue.

Tom, the Length being discussed here is actually the byte count of the
streams of the Image XObjects that belong to the Page. So if the Page is
comprised of more than one image (a.k.a. banding), then the sender does not
need to cache even a full page's worth of compressed data in order to be
able to write the Image XObject's stream length in the stream dictionary.

Full PDF allows the writer to enter an indirect object reference into the
required Length entry. This makes it easy to implement writers because the
separate object for the length can be written after all of the image data
has been written. The PDF files are then read in the reverse order starting
from the end of the file. This works well if one has a file system to store
the complete PDF file. So requiring the Length to be a direct value in the
stream dictionary most likely would cause existing writer SW to have to be
modified. One could not keep writing the same kind of files and claim them
PDF/is compliant.
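
A minimal sketch of the banding idea, assuming Flate-compressed 8-bit
grayscale bands. The function names, object numbers, and dictionary layout
are illustrative only and are not taken from the PDF/is draft. The point is
that only one band of compressed data is buffered before its Length is
written as a direct value in the stream dictionary.

```python
import io
import zlib


def write_band(out, obj_num, width, band_rows, raw_band):
    """Write one band as an Image XObject whose /Length is a direct value."""
    data = zlib.compress(raw_band)  # only this band is buffered
    header = (
        f"{obj_num} 0 obj\n"
        f"<< /Type /XObject /Subtype /Image /Width {width} /Height {band_rows}\n"
        f"   /ColorSpace /DeviceGray /BitsPerComponent 8\n"
        f"   /Filter /FlateDecode /Length {len(data)} >>\n"
        "stream\n"
    ).encode("ascii")
    out.write(header + data + b"\nendstream\nendobj\n")
    return len(data)


def write_page_in_bands(out, first_obj, width, height, rows_per_band, row_source):
    """Emit a page as several image bands; return each band's compressed length."""
    lengths = []
    for i, top in enumerate(range(0, height, rows_per_band)):
        rows = min(rows_per_band, height - top)
        # row_source is a hypothetical callback that yields one row of pixels.
        raw = b"".join(row_source(y, width) for y in range(top, top + rows))
        lengths.append(write_band(out, first_obj + i, width, rows, raw))
    return lengths


def gradient_row(y, width):
    """Stand-in for scanner output: a simple diagonal gradient."""
    return bytes((x + y) % 256 for x in range(width))


if __name__ == "__main__":
    sink = io.BytesIO()
    print(write_page_in_bands(sink, 10, 32, 64, 16, gradient_row))
```

The smaller the band, the less the sender has to cache, at the cost of more
objects per page.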

--- Kari ---

-----Original Message-----
From: Hastings, Tom N
Sent: Tuesday, March 11, 2003 5:49 PM
To: Poysa, Kari; 'Rick Seeler'; 'Carl Kugler'
Cc: ifx@pwg.org
Subject: RE: IFX> PDF/is Issue.

Kari,

I think you summed up the tradeoff between the Sender and the Receiver
simply when you said:

"If we require the reader to be able to cache a page's worth of uncompressed
data, surely we can require the writer to cache a page's worth of compressed
data [in order to determine the length and send that length in the stream]."

I assume that PDF has the notion of a length for each page, right? So we
require that the Sender put in a length field for each page of data at the
front of each page of data. Can that length field be sent with the data in
some manner, so that the Sender doesn't have to know the lengths of all of
the pages before sending any?

Tom

-----Original Message-----
From: Poysa, Kari [mailto:Kari.Poysa@usa.xerox.com]
Sent: Friday, March 07, 2003 15:04
To: 'Rick Seeler'; 'Carl Kugler'
Cc: ifx@pwg.org
Subject: RE: IFX> PDF/is Issue.

Rick, I bet this solution can be implemented, but it does have some problems
for the reader that unfortunately I did not see earlier. The difficulty
really is whether we want to make life easy for the streaming writer or the
reader.

If the length follows the image stream, the reader must scan the filtered
stream to find the end of the stream. This can make the reader
implementation both cumbersome and slow, especially if the stream has to be
fully decoded during the PDF file parsing, instead of simply extracting the
correct amount of binary data and passing it to a separate decompression
module. The PDF file parser would have to know details of the compressed
streams, which should really be of no interest to the PDF file parser
module, and that makes building applications from 3rd party components
harder.

In addition, if the reader attempts to decode the stream, how much data
should be cached and decoded at a time? If the end of stream is not found at
first attempt, one has to pass additional data to the decoder and continue
decoding from where previous data ended. This can delay achieving robust
implementations. The alternative, searching for the "endstream" text, is not
100% reliable (although very close) and is a wasted step since no
decompression is achieved yet.
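
For comparison, a rough sketch of the reader when the length precedes the
stream. It assumes a simplified object layout ("stream" on its own line,
terminated by a line feed) and a hypothetical helper name. The parser reads
exactly Length bytes and never inspects them, so any "endstream" look-alike
inside the data is harmless.

```python
import io
import re

# A direct integer /Length, i.e. not an indirect reference like "12 0 R".
LENGTH_RE = re.compile(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)")


def read_stream_with_direct_length(src):
    """Read one stream payload from a file-like object using a direct /Length."""
    header = b""
    while not header.endswith(b"stream\n"):
        line = src.readline()
        if not line:
            raise ValueError("no stream keyword found")
        header += line
    match = LENGTH_RE.search(header)
    if match is None:
        raise ValueError("stream dictionary has no direct /Length")
    # The payload is opaque to the parser; hand it to the decoder as-is.
    return src.read(int(match.group(1)))


if __name__ == "__main__":
    payload = b"abc\r\nendstream\nxyz"  # look-alike marker inside the data
    obj = (
        b"12 0 obj\n<< /Length %d >>\nstream\n" % len(payload)
        + payload
        + b"\nendstream\nendobj\n"
    )
    print(read_stream_with_direct_length(io.BytesIO(obj)) == payload)
```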

This issue is really at the heart of what "streamable" means, and also has a
big impact on what kind of low resource applications PDF/is can be used for.
I think we should consider it a "MUST" for the writer to prefix the stream
with its length, since the goal is to make the file format streamable
especially at a low resource reader. If we require the reader to be able to
cache a page's worth of uncompressed data, surely we can require the writer
to cache a page's worth of compressed data.

I do understand Ira McDonald's note about streaming writers (see separate
email). Possibly this issue of whether to prefix or postfix image streams
with their lengths should be a negotiable capability between the sender and
receiver?

--- Kari ---

-----Original Message-----
From: Rick Seeler [mailto:rseeler@adobe.com]
Sent: Thursday, March 06, 2003 2:37 PM
To: 'Poysa, Kari'; 'Carl Kugler'
Cc: ifx@pwg.org
Subject: RE: IFX> PDF/is Issue.

Kari,

Yes, the stream length should precede the stream, if possible (this is
allowed). But, in the case where the stream may be long, this may not be
possible for the Producer. In that case, the length should be an indirect
object reference to the length that should come immediately after the
stream.

As for your idea of scanning for an "endstream" that is followed by the size
object: this still has the same problem as scanning for "endstream" alone.
It just has more data to match and a smaller likelihood of a false match.

Given that, and what I discussed in my previous e-mail on this subject (to
Rob Buckley), I think the best approach might be to:
1) The Producer MUST always write the stream length of all 'Content Streams'
and 'ICC Profile' streams immediately in the object dictionary (before the
stream).
2) When writing image streams, the Producer MAY either write the stream
length before or after the stream, as they prefer.
3) When an image stream's length follows the stream (indirect object), the
Consumer SHOULD decode image streams to determine the stream length, when
possible. But, the Consumer MAY (at their peril) scan for the 'endstream'
marker.
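
As a sketch of what item 3 could look like for the Consumer, assuming the
image stream uses the Flate filter (other filters would need their own
decoder; the function name is made up for illustration). The data is fed to
the decoder a chunk at a time, and the decoder itself reports where the
compressed data ends.

```python
import zlib


def flate_stream_length(chunks):
    """Feed chunks to a Flate decoder until it reports end of data.

    Returns (decoded_bytes, compressed_length).
    """
    decoder = zlib.decompressobj()
    decoded = bytearray()
    consumed = 0
    for chunk in chunks:
        decoded += decoder.decompress(chunk)
        consumed += len(chunk)
        if decoder.eof:
            # Bytes after the end of the compressed data are left in
            # unused_data, so they do not count toward the stream length.
            return bytes(decoded), consumed - len(decoder.unused_data)
    raise ValueError("input ended before the Flate stream did")


if __name__ == "__main__":
    raw = b"band data " * 500
    compressed = zlib.compress(raw)
    wire = compressed + b"\nendstream\nendobj\n12 0 obj\n%d\nendobj\n" % len(compressed)
    pieces = [wire[i:i + 64] for i in range(0, len(wire), 64)]
    image, length = flate_stream_length(pieces)
    print(length == len(compressed), image == raw)
```

The length object that follows can then serve as a consistency check rather
than as the only way to find the end of the stream.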

How does this sound as a solution?

-Rick

-----Original Message-----
From: owner-ifx@pwg.org [mailto:owner-ifx@pwg.org] On Behalf Of Poysa, Kari
Sent: Thursday, March 06, 2003 7:15 AM
To: 'Carl Kugler'
Cc: ifx@pwg.org
Subject: RE: IFX> PDF/is Issue.

In my opinion the goal should be to write the stream length immediately to
the stream dictionary.

Also, the likelihood of "endstream" existing in the data is small. We
could also require that if a low resource streaming writer is not able to
add the length directly into the stream dictionary, then the PDF object for
the length MUST immediately follow the stream object. This way, the reader
can scan for "endstream" (but of course only if the length was not in the
stream dictionary) and make sure that it is the correct "endstream" by
verifying that it is immediately followed by something that looks like a
length object. Could reader implementers comment on this?
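
A rough sketch of that check, assuming the length object is a bare integer
in its own indirect object directly after the stream object. The regular
expression and function name are illustrative. Comparing the integer with
the candidate offset is an extra test not proposed above; it narrows the
odds of a false match further but remains a heuristic.

```python
import re

# EOL + "endstream", an optional "endobj", then "N G obj <integer> endobj".
TRAILER_RE = re.compile(
    rb"\r?\nendstream\s+(?:endobj\s+)?\d+\s+\d+\s+obj\s+(\d+)\s+endobj"
)


def find_stream_end(buf, start):
    """Return the payload length, or None if no plausible end is found."""
    for match in TRAILER_RE.finditer(buf, start):
        candidate_len = match.start() - start
        # Extra check (an assumption, not from the thread): the length
        # object must agree with where this candidate marker sits.
        if int(match.group(1)) == candidate_len:
            return candidate_len
    return None


if __name__ == "__main__":
    # The payload contains a fake marker followed by a fake length object.
    payload = b"AAAA\r\nendstream\n12 0 obj\n9999\nendobj\nBBBB"
    buf = (
        b"stream\n"
        + payload
        + b"\nendstream\nendobj\n12 0 obj\n%d\nendobj\n" % len(payload)
    )
    print(find_stream_end(buf, len(b"stream\n")) == len(payload))
```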

I think introducing an additional filter like ASCII85 just for spotting the
end of stream adds unnecessary complexity to both writer and reader,
increases file sizes and also requires more memory and processing as the
stream cannot be passed directly to a decompressor.

--- Kari ---

-----Original Message-----
From: Carl Kugler [mailto:kugler@us.ibm.com]
Sent: Wednesday, March 05, 2003 10:50 AM
Cc: ifx@pwg.org
Subject: RE: IFX> PDF/is Issue.

I like the chunking approach. It is efficient, reliable, and has low
overhead for reasonably sized chunks. Also fits well in a typical
implementation that writes a chunk of data at a time.

-Carl

"Zehler, Peter" <PZehler@crt.xerox.com>
Sent by: owner-ifx@pwg.org

03/05/2003 05:00 AM

        To: "'Rick Seeler'" <rseeler@adobe.com>, ifx@pwg.org
        cc:
        Subject: RE: IFX> PDF/is Issue.

Rick,
Why not just increase the size of the length field signature? Could this be
done by the addition of data or comments in the length object or by adding
another object? I don't know PDF very well. I don't think we need 0%
probability of confusion, just a statistically insignificant chance.
Pete

Peter Zehler
XEROX
Xerox Architecture Center
Email: PZehler@crt.xerox.com
Voice: (585) 265-8755
FAX: (585) 265-8871
US Mail: Peter Zehler
         Xerox Corp.
         800 Phillips Rd.
         M/S 128-30E
         Webster NY, 14580-9701

-----Original Message-----
From: Rick Seeler [mailto:rseeler@adobe.com]
Sent: Tuesday, March 04, 2003 1:29 PM
To: ifx@pwg.org
Subject: IFX> PDF/is Issue.

During prototyping of PDF/is the following problem arose:

How does the Consumer know when the end of a data stream (See section 3.2.7
of [pdf]) is reached? Normally, in a PDF, the Consumer would consult the
stream length field. The problem here is where to put the length field. If
the length were placed before the stream, the Consumer would know how long
the stream is. This requires the Producer to know the stream's length before
writing it to the Consumer. If, instead, the length were written at the end
of the stream, this would solve the Producer's problem but the Consumer
would not know how to find the length since they can't identify, 100% of the
time, where the stream ends and where the length object is.

An example will illustrate:
First, the normal case...

stream
sdljfiwefnwfubrevurewliysnhr;hgawebfz;h;uwre (lots of binary data here)....
84trhdvfyu7wgf4.nbdrgur4uaru4gb
endstream
12 0 obj
3456 <- the length of the previous stream.
endobj

But, what if the data looked like this...

stream
sdljfiwefnwfubrevurewliysnhr;hgawebfz;h;uwre (lots of binary data here)....
endstream <- the binary data could have a string of bytes that
looked like this.
84trhdvfyu7wgf4.nbdrgur4uaru4gb
endstream
12 0 obj
4567 <- the length of the previous stream.
endobj

Of course, you could look at the bytes after the appearance of the word
'endstream' to see if this is really the end of the stream; but one can
always come up with a stream that matches your parsing algorithm's
expectations (although with a decreasing probability of occurrence).
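
The failure is easy to reproduce. A small sketch, using made-up data modeled
on the example above and a hypothetical helper name: a Consumer that trusts
the first end-of-line plus "endstream" truncates the stream.

```python
EOL_ENDSTREAM = b"\r\nendstream"


def naive_stream_length(buf, start):
    """Length of the payload if the first EOL + 'endstream' is trusted."""
    end = buf.find(EOL_ENDSTREAM, start)
    if end < 0:
        raise ValueError("no endstream marker found")
    return end - start


# Binary data that happens to contain the marker in the middle.
payload = b"sdljfiwefnwf" + EOL_ENDSTREAM + b"\r\n84trhdvfyu7wgf4"
buf = (
    b"stream\r\n"
    + payload
    + EOL_ENDSTREAM
    + b"\r\n12 0 obj\r\n%d\r\nendobj\r\n" % len(payload)
)
found = naive_stream_length(buf, len(b"stream\r\n"))

# The naive scan stops at the look-alike, short of the real length.
print(found, len(payload))
```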

Possible solutions:
1) Write all data using ASCII85 encoding (See Section 3.3.2 of [pdf]). This
will increase stream lengths by 25%. ASCII85 has a stream delimiter which
would solve this problem -- the end of the stream can be known for certain
and the length field can be placed after the stream.
2) Require the Producer to write the stream length before any stream (the
streams would stay binary). The Producer can use banding to break up large
images into small enough chunks so the Producer can cache the stream before
sending.
3) Offer a combination of 1 & 2. The Producer would cache streams if
possible, but may use ASCII85, if necessary.
4) The Producer must make certain that no stream contains the series of
bytes "\0D\0Aendstream" in its stream data. This is how the spec is defined
currently -- but this may be too onerous for the Producer.

Any other ideas? I'm personally leaning toward solution #3.
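
A quick sketch of solution 1, using Python's standard base64 module as a
stand-in for a PDF ASCII85 encoder (the helper names are made up). The
encoded body uses only printable characters that exclude "~", so the "~>"
end-of-data marker cannot appear inside it, and four input bytes become five
output characters.

```python
import base64

EOD = b"~>"  # ASCII85 end-of-data marker


def encode_stream_ascii85(data):
    """ASCII85-encode a stream body and append the end-of-data marker."""
    return base64.a85encode(data) + EOD


def read_ascii85_stream(buf, start):
    """Return (decoded_bytes, offset_after_marker) by scanning for '~>'."""
    end = buf.index(EOD, start)
    return base64.a85decode(buf[start:end]), end + len(EOD)


if __name__ == "__main__":
    # 4000 bytes of binary data that include an "endstream" look-alike.
    data = b"\r\nendstream\r\n" + bytes((i * 7 + 1) % 256 for i in range(3987))
    encoded = encode_stream_ascii85(data)
    buf = b"stream\n" + encoded + b"\nendstream\n"
    decoded, after = read_ascii85_stream(buf, len(b"stream\n"))
    print(decoded == data, len(data), len(encoded) - len(EOD))
```

The Consumer finds the end with a plain byte search, but it can no longer
pass the stream straight to the decompressor.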

-Rick


This archive was generated by hypermail 2b29 : Fri Mar 14 2003 - 11:51:35 EST