mirror of
https://github.com/michaelrsweet/pdfio.git
synced 2024-12-27 05:48:20 +01:00
Merge 25834e07ef
into 9c04d1dc20
This commit is contained in:
commit
99da340c86
189
doc/pdfio.md
@ -132,6 +132,195 @@ PDFio exposes several types:
- `pdfio_stream_t`: An object stream

Understanding PDF Files
-----------------------

A PDF file provides data and commands for displaying pages of graphics and text,
and is structured so that it is displayed the same way across
multiple devices and platforms.  The following PDF shows "Hello, World!" on one page:

```
%PDF-1.0 %Header starts here
%âãÏÓ
1 0 obj %Body starts here
<<
/Kids [2 0 R]
/Count 1
/Type /Pages
>>
endobj
2 0 obj
<<
/Rotate 0
/Parent 1 0 R
/Resources 3 0 R
/MediaBox [0 0 612 792]
/Contents [4 0 R]/Type /Page
>>
endobj
3 0 obj
<<
/Font
<<
/F0
<<
/BaseFont /Times-Italic
/Subtype /Type1
/Type /Font
>>
>>
>>
endobj
4 0 obj
<<
/Length 65
>>
stream
1. 0. 0. 1. 50. 700. cm
BT
/F0 36. Tf
(Hello, World!) Tj
ET
endstream
endobj
5 0 obj
<<
/Pages 1 0 R
/Type /Catalog
>>
endobj
xref %Cross-reference table starts here
0 6
0000000000 65535 f
0000000015 00000 n
0000000074 00000 n
0000000192 00000 n
0000000291 00000 n
0000000409 00000 n
trailer %Trailer starts here
<<
/Root 5 0 R
/Size 6
>>
startxref
459
%%EOF
```

### Header

The header is the first line of a PDF file and specifies the version of the PDF
format used, for example `%PDF-1.0`.

Since PDF files almost always contain binary data, they can become corrupted if line
endings are changed (for example, if the file is transferred over FTP in text mode).
To allow legacy file transfer programs to determine that the file is binary, it is
usual to include some bytes with character codes higher than 127 in the header,
for example `%âãÏÓ`.  The percent sign starts a comment, so PDF readers ignore the
line, and the bytes that follow are arbitrary character codes above 127.  So, the
whole header in our example is:

```
%PDF-1.0
%âãÏÓ
```
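
As an illustration of the header rules above, here is a minimal C sketch that checks the first two lines of an in-memory file.  The functions `pdf_header_version` and `pdf_header_is_binary` are hypothetical helpers written for this example and are not part of the PDFio API; the version is returned as `major * 10 + minor` purely for convenience.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper (not part of PDFio): returns the version from a header
// such as "%PDF-1.0" as major * 10 + minor, or -1 if this is not a PDF header.
static int
pdf_header_version(const char *buf, size_t len)
{
  char   line[16];
  size_t n = len < sizeof(line) - 1 ? len : sizeof(line) - 1;
  int    major, minor;

  // Copy the start of the file so sscanf sees a NUL-terminated string
  memcpy(line, buf, n);
  line[n] = '\0';

  if (sscanf(line, "%%PDF-%d.%d", &major, &minor) != 2 || major < 1 || minor < 0)
    return (-1);

  return (major * 10 + minor);
}

// Hypothetical helper: reports whether the second line is a comment that
// contains at least one byte above 127, marking the file as binary.
static bool
pdf_header_is_binary(const char *buf, size_t len)
{
  size_t i = 0;

  while (i < len && buf[i] != '\n' && buf[i] != '\r')
    i ++;                               // Skip the first line
  while (i < len && (buf[i] == '\n' || buf[i] == '\r'))
    i ++;                               // Skip the line ending

  if (i >= len || buf[i] != '%')
    return (false);                     // Second line is not a comment

  for (; i < len && buf[i] != '\n' && buf[i] != '\r'; i ++)
  {
    if ((unsigned char)buf[i] > 127)
      return (true);
  }

  return (false);
}
```

The binary marker is only a convention, so a reader would normally accept files without it; the check is shown here just to make the rule concrete.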

### Body

The file body consists of a sequence of objects, each preceded by an object number,
generation number, and the `obj` keyword on one line, and followed by the `endobj` keyword
on another.  For example:

```
1 0 obj
<<
/Kids [2 0 R]
/Count 1
/Type /Pages
>>
endobj
```

Here, the object number is 1 and the generation number is 0 (it almost always is).
The content of object 1 lies between the `1 0 obj` and `endobj` lines.
In this case, it is the dictionary `<</Kids [2 0 R] /Count 1 /Type /Pages>>`.
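
To make the object framing concrete, the following C sketch recognizes a line such as `1 0 obj` and extracts the two numbers.  The function `parse_obj_header` is a hypothetical helper for this example, not a PDFio function, and it assumes the three tokens appear on a single line as in the sample file.

```c
#include <ctype.h>
#include <stdio.h>

// Hypothetical helper (not part of PDFio): parses a line such as "1 0 obj"
// and returns the object number, or -1 if the line is not an object header.
// The generation number is stored in *generation when it is not NULL.
static long
parse_obj_header(const char *line, int *generation)
{
  long number;
  int  gen;
  int  end = 0;                         // Set by %n only if "obj" matched

  if (sscanf(line, "%ld %d obj%n", &number, &gen, &end) != 2 || end == 0)
    return (-1);

  // "obj" must be a whole token, so reject things like "1 0 object"
  if (line[end] != '\0' && !isspace((unsigned char)line[end]))
    return (-1);

  if (number < 1 || gen < 0)
    return (-1);

  if (generation)
    *generation = gen;

  return (number);
}
```

Lines such as `endobj` or the reference `2 0 R` are rejected, which is what lets a simple scanner tell the start of an object apart from its content.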

### Cross-Reference Table

The cross-reference table lists the byte offset of each object in the file body.
This allows random access to objects, meaning they don't have to be read in order.
Objects that are not used are never read, so operations like counting the number
of pages in a PDF document are fast, even in large files.

Each object has an object number and a generation number:

- Generation numbers are used when a cross-reference table entry is reused.
- For simplicity, we will assume generation numbers are always zero and ignore them.

The cross-reference table consists of:

- A header line that gives the number of the first object and the number of entries.
- A special entry (the first entry) for object 0, which is never in use.
- One line for each object in the file body.

```
0 6                   %Six entries in table, starting at 0
0000000000 65535 f    %Special entry
0000000015 00000 n    %Object 1 is at byte offset 15
0000000074 00000 n    %Object 2 is at byte offset 74
0000000192 00000 n    %etc...
0000000291 00000 n
0000000409 00000 n    %Object 5 is at byte offset 409
```
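
Each entry line has a fixed layout: a 10-digit byte offset, a 5-digit generation number, and `n` (in use) or `f` (free).  The C sketch below decodes one such line.  The `xref_entry_t` type and `parse_xref_entry` function are hypothetical names chosen for this example and do not correspond to PDFio's internal structures.

```c
#include <stdbool.h>
#include <stdio.h>

// Hypothetical type for this example (not part of PDFio)
typedef struct
{
  bool      valid;                      // Was the line a well-formed entry?
  bool      in_use;                     // 'n' = in use, 'f' = free
  long long offset;                     // Byte offset of the object
  int       generation;                 // Generation number
} xref_entry_t;

// Hypothetical helper: decodes a line such as "0000000015 00000 n".
static xref_entry_t
parse_xref_entry(const char *line)
{
  xref_entry_t entry = { false, false, 0, 0 };
  long long    offset;
  int          generation;
  char         type;

  if (sscanf(line, "%10lld %5d %c", &offset, &generation, &type) == 3 &&
      (type == 'n' || type == 'f'))
  {
    entry.valid      = true;
    entry.in_use     = (type == 'n');
    entry.offset     = offset;
    entry.generation = generation;
  }

  return (entry);
}
```

For the sample table, the second line decodes to offset 15 with generation 0, which is where `1 0 obj` starts in the file.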

### Trailer

The first line of the trailer is just the `trailer` keyword.  This is followed by the
trailer dictionary, which contains at least the `/Size` entry (the number of entries in
the cross-reference table) and the `/Root` entry (a reference to the document catalog
object, which is the root element of the graph of objects in the body).

The trailer dictionary is followed by a line with just the `startxref` keyword, a line
with a single number (the byte offset of the start of the cross-reference table within
the file), and then the line `%%EOF`, which signals the end of the PDF file.

```
trailer      %Trailer keyword
<<           %The trailer dictionary
/Root 5 0 R
/Size 6
>>
startxref    %startxref keyword
459          %Byte offset of cross-reference table
%%EOF        %End-of-file marker
```
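
Because the `startxref` number sits at the very end of the file, a reader can locate the cross-reference table by scanning backward from the last byte.  The following C sketch does this for a file already loaded into memory.  The function `find_startxref` is a hypothetical helper for this example, not a PDFio function, and it assumes no comment appears between the keyword and the number (the comments in the listing above are only annotations).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper (not part of PDFio): searches backward for the last
// "startxref" keyword and returns the number after it, or -1 if not found.
static long
find_startxref(const char *data, size_t len)
{
  static const char keyword[] = "startxref";
  const size_t      klen = sizeof(keyword) - 1;
  size_t            i, j;
  long              value = 0;

  if (len < klen)
    return (-1);

  // i is one more than the candidate position so the loop can stop at 0
  for (i = len - klen + 1; i > 0; i --)
  {
    if (memcmp(data + i - 1, keyword, klen) != 0)
      continue;

    // Skip the whitespace between the keyword and the number
    j = i - 1 + klen;
    while (j < len && isspace((unsigned char)data[j]))
      j ++;

    if (j >= len || !isdigit((unsigned char)data[j]))
      return (-1);

    // Accumulate the decimal digits of the byte offset
    while (j < len && isdigit((unsigned char)data[j]))
    {
      value = value * 10 + (data[j] - '0');
      j ++;
    }

    return (value);
  }

  return (-1);
}
```

In practice a reader would only examine the last kilobyte or so of the file rather than the whole buffer; the full scan here keeps the sketch short.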

How a PDF File is Read
----------------------

To read a PDF file, converting it from a flat series of bytes into a graph of objects
in memory, the following steps typically occur:

1. Read the PDF header from the beginning of the file, checking that this is indeed
   a PDF document and retrieving its version number.
2. Find the end-of-file marker by searching backward from the end of the file.
   The trailer dictionary can now be read and the byte offset of the start of the
   cross-reference table retrieved.
3. Read the cross-reference table, which gives the location of each object in the file.
4. At this stage, all the objects can be read and parsed, or this can be deferred
   until each object is actually needed, reading it on demand.
5. Use the data: extract the pages, parse graphical content, extract metadata,
   and so on.

This is not an exhaustive description, since there are many possible complications
(encryption, linearization, and object and cross-reference streams).

How a PDF File is Written
-------------------------

Writing a PDF document to a series of bytes in a file is much simpler than
reading it, since we don't need to support all of the PDF format, just the subset
we intend to use.  Writing a PDF file is very fast, since it amounts to little
more than flattening the object graph to a series of bytes:

1. Output the header.
2. Remove any objects which are not referenced by any other object in the
   PDF.  This avoids writing objects which are no longer needed.
3. Renumber the objects so they run from 1 to n, where n is the number of
   objects in the file.
4. Output the objects one by one, starting with object number one,
   recording the byte offset of each for the cross-reference table.
5. Write the cross-reference table.
6. Write the `trailer` keyword, trailer dictionary, `startxref` offset, and
   end-of-file marker.

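The writing steps can be sketched in C with a small in-memory writer.  This is only an illustration under stated assumptions: the `pdf_writer_t` type and the `writer_*` functions are hypothetical and are not PDFio's API, the buffer has a fixed size, and steps 2 and 3 (dropping unreferenced objects and renumbering) are skipped by assuming the caller adds objects already numbered from 1.

```c
#include <stdio.h>
#include <string.h>

#define MAX_OBJS 16

// Hypothetical in-memory writer for this example (not part of PDFio)
typedef struct
{
  char   data[4096];                    // Output bytes, NUL-terminated
  size_t length;                        // Number of bytes written so far
  size_t offsets[MAX_OBJS + 1];         // Byte offset of each object number
  int    num_objs;                      // Highest object number written
} pdf_writer_t;

// Append a string; output that does not fit is dropped to keep this short
static void
writer_puts(pdf_writer_t *w, const char *s)
{
  size_t n = strlen(s);

  if (w->length + n < sizeof(w->data))
  {
    memcpy(w->data + w->length, s, n);
    w->length += n;
    w->data[w->length] = '\0';
  }
}

// Step 1: output the header, including the binary marker line
static void
writer_begin(pdf_writer_t *w)
{
  writer_puts(w, "%PDF-1.0\n%\xe2\xe3\xcf\xd3\n");
}

// Step 4: output the next object, recording its byte offset first
static void
writer_add_object(pdf_writer_t *w, const char *body)
{
  char header[32];

  if (w->num_objs >= MAX_OBJS)
    return;

  w->num_objs ++;
  w->offsets[w->num_objs] = w->length;

  snprintf(header, sizeof(header), "%d 0 obj\n", w->num_objs);
  writer_puts(w, header);
  writer_puts(w, body);
  writer_puts(w, "\nendobj\n");
}

// Steps 5 and 6: write the cross-reference table and the trailer
static void
writer_finish(pdf_writer_t *w, int root)
{
  char   line[128];
  size_t xref_offset = w->length;
  int    i;

  snprintf(line, sizeof(line), "xref\n0 %d\n", w->num_objs + 1);
  writer_puts(w, line);
  writer_puts(w, "0000000000 65535 f \n"); // Special entry for object 0

  for (i = 1; i <= w->num_objs; i ++)
  {
    // Each entry is exactly 20 bytes including its line ending
    snprintf(line, sizeof(line), "%010lu 00000 n \n", (unsigned long)w->offsets[i]);
    writer_puts(w, line);
  }

  snprintf(line, sizeof(line),
           "trailer\n<<\n/Root %d 0 R\n/Size %d\n>>\nstartxref\n%lu\n%%%%EOF\n",
           root, w->num_objs + 1, (unsigned long)xref_offset);
  writer_puts(w, line);
}
```

A caller would zero a `pdf_writer_t`, call `writer_begin`, add each object body with `writer_add_object`, and end with `writer_finish`, passing the object number of the catalog.  With the two-line header used here, the first object lands at byte offset 15, matching the sample cross-reference table.
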
Reading PDF Files
-----------------