Clean up updated docos (Issue #78)

Michael R Sweet 2024-10-25 17:32:38 -04:00
parent 21b8e3b06f
commit 21ac2b52d1
GPG Key ID: BE67C75EC81F3244

@ -118,17 +118,20 @@ that are defined in a separate header file:
```c ```c
#include <pdfio-content.h> #include <pdfio-content.h>
``` ```
Understanding PDF Files Understanding PDF Files
----------------------- -----------------------
A PDF file provides data and commands for displaying pages of graphics and text, A PDF file provides data and commands for displaying pages of graphics and text,
and is structured in a way that allows it to be displayed in the same way across and is structured in a way that allows it to be displayed in the same way across
multiple devices and platforms. multiple devices and platforms. The following is a PDF which shows "Hello,
The following is a PDF which shows "Hello, World!" on one page: World!" on one page:
``` ```
%PDF-1.0 %Header starts here %PDF-1.0 % Header starts here
%âãÏÓ %âãÏÓ
1 0 obj %Body starts here 1 0 obj % Body starts here
<< <<
/Kids [2 0 R] /Kids [2 0 R]
/Count 1 /Count 1
@ -175,7 +178,7 @@ endobj
/Type /Catalog /Type /Catalog
>> >>
endobj endobj
xref %Cross-reference table starts here xref % Cross-reference table starts here
0 6 0 6
0000000000 65535 f 0000000000 65535 f
0000000015 00000 n 0000000015 00000 n
@ -183,7 +186,7 @@ xref %Cross-reference table starts here
0000000192 00000 n 0000000192 00000 n
0000000291 00000 n 0000000291 00000 n
0000000409 00000 n 0000000409 00000 n
trailer %Trailer starts here trailer % Trailer starts here
<< <<
/Root 5 0 R /Root 5 0 R
/Size 6 /Size 6
@ -193,29 +196,40 @@ startxref
%%EOF %%EOF
``` ```
### Header
This is the first line of a PDF File. This specifies the version of PDF Format used.
For Example: '%PDF-1.0'
Since PDF files almost always contain binary data, they can become corrupted if line ### Header
endings are changed (for example, if the file is transferred over FTP in text mode).
To allow legacy file transfer programs to determine that the file is binary, it is The header is the first line of a PDF file that specifies the version of the PDF
usual to include some bytes withcharacter codes higher than 127 in the header. format that has been used, for example `%PDF-1.0`.
- For example: %âãÏÓ
- The percent sign indicates another header line, the other few bytes are arbitrary Since PDF files almost always contain binary data, they can become corrupted if
character codes in excess of 127. So, the whole header in our example is: line endings are changed. For example, if the file is transferred using FTP in
text mode or is edited in Notepad on Windows. To allow legacy file transfer
programs to determine that the file is binary, the PDF standard recommends
including some bytes with character codes higher than 127 in the header, for
example:
``` ```
%PDF-1.0
%âãÏÓ %âãÏÓ
``` ```
### Body The percent sign indicates a comment line while the other few bytes are
The file body consists of a sequence of objects, each preceded by an object number, arbitrary character codes in excess of 127. So, the whole header in our example
generation number, and the obj keyword on one line, and followed by the endobj keyword is:
on another. For Example:
``` ```
%PDF-1.0
%âãÏÓ
```
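To illustrate the header rules described above, here is a minimal sketch in C that checks the version line and looks for the high-bit comment line. The helper names `parse_pdf_header` and `has_binary_marker` are hypothetical and are not part of PDFio. The sketch assumes the whole header is already in memory and that lines end with a line feed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper: parse "%PDF-M.N" from the first line of `buf`.
static bool
parse_pdf_header(const char *buf, size_t buflen, int *major, int *minor)
{
  char   line[32];
  size_t i;

  // Copy the first line (up to CR or LF) into a NUL-terminated buffer
  for (i = 0; i < buflen && i < (sizeof(line) - 1) && buf[i] != '\n' && buf[i] != '\r'; i ++)
    line[i] = buf[i];
  line[i] = '\0';

  return (sscanf(line, "%%PDF-%d.%d", major, minor) == 2);
}

// Hypothetical helper: true if the second line is a comment that contains
// at least one byte with a character code above 127.
static bool
has_binary_marker(const char *buf, size_t buflen)
{
  const char *end = buf + buflen;
  const char *ptr = memchr(buf, '\n', buflen);

  // The second line must exist and start with '%'
  if (!ptr || ++ptr >= end || *ptr != '%')
    return (false);

  for (ptr ++; ptr < end && *ptr != '\n' && *ptr != '\r'; ptr ++)
  {
    if ((unsigned char)*ptr > 127)
      return (true);
  }

  return (false);
}
```

A reader would typically call `parse_pdf_header` first and treat a missing binary marker as a hint rather than an error, since the marker line is a recommendation.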
### Body
The file body consists of a sequence of objects, each preceded by an object
number, generation number, and the obj keyword on one line, and followed by the
endobj keyword on another. For example:
```
1 0 obj 1 0 obj
<< <<
/Kids [2 0 R] /Kids [2 0 R]
@ -225,51 +239,60 @@ on another. For Example:
endobj endobj
``` ```
Here, the object number is 1, and the generation number is 0 (it almost always is). In this example, the object number is 1 and the generation number is 0, meaning
The content for object 1 is in between the two lines 1 0 obj and endobj. it is the first version of the object. The content for object 1 is between the
In this case, its the dictionary <</Kids [2 0 R] /Count 1 /Type /Pages>> initial `1 0 obj` and trailing `endobj` lines. In this case, the content is the
dictionary `<</Kids [2 0 R] /Count 1 /Type /Pages>>`.
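The object framing described above can be sketched as a small C function that recognizes a line such as `1 0 obj` and extracts the two numbers. The name `parse_obj_header` is hypothetical and is not a PDFio function. The sketch assumes the line is NUL-terminated and does not validate what follows the `obj` keyword.

```c
#include <stdbool.h>
#include <stdio.h>

// Hypothetical helper: parse "N G obj" and return the object and
// generation numbers.
static bool
parse_obj_header(const char *line, unsigned long *number, unsigned *generation)
{
  int consumed = 0;

  // %n is only stored if the literal "obj" matched, so `consumed` stays 0
  // for lines like "2 0 R" even though two numbers were converted.
  if (sscanf(line, "%lu %u obj%n", number, generation, &consumed) != 2)
    return (false);

  return (consumed > 0);
}
```

Checking `consumed` is what separates an object definition (`1 0 obj`) from an indirect reference (`2 0 R`), which starts with the same two numbers.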
### Cross-Reference Table ### Cross-Reference Table
The cross-reference table lists the byte offset of each object in the file body. The cross-reference table lists the byte offset of each object in the file body.
This allows random access to objects, meaning they don't have to be read in order. This allows random access to objects, meaning they don't have to be read in
Objects that are not used are never read, making the process efficient. order. Objects that are not used are never read, making the process efficient.
Operations like counting the number of pages in a PDF document are fast, even in large files. Operations like counting the number of pages in a PDF document are fast, even in
Each object has an object number and a generation number. large files.
- Generation numbers are used when a cross-reference table entry is reused.
- For simplicity, we will assume generation numbers to be always zero and ignore them. Each object has an object number and a generation number. Generation numbers
The cross-reference table consists of: are used when a cross-reference table entry is reused. For simplicity, we will
- Header line that indicates the number of entries. assume generation numbers to be always zero and ignore them. The
- Special entry (the first entry). cross-reference table consists of a header line that indicates the number of
- One line for each of the object in the file body. entries, a free entry line for object 0, and a line for each of the objects in
the file body. For example:
``` ```
0 6 %Six entries in table, starting at 0 0 6 % Six entries in table, starting at 0
0000000000 65535 f %Special entry 0000000000 65535 f % Free entry for object 0
0000000015 00000 n %Object 1 is at byte offset 15 0000000015 00000 n % Object 1 is at byte offset 15
0000000074 00000 n %Object 2 is at byte offset 74 0000000074 00000 n % Object 2 is at byte offset 74
0000000192 00000 n %etc... 0000000192 00000 n % etc...
0000000291 00000 n 0000000291 00000 n
0000000409 00000 n %Object 5 is at byte offset 409 0000000409 00000 n % Object 5 is at byte offset 409
``` ```
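As a rough illustration of the entry layout above, the following C sketch parses one cross-reference entry: a 10-digit byte offset, a 5-digit generation number, and an `n` (in use) or `f` (free) flag. The `xref_entry_t` type and `parse_xref_entry` function are hypothetical names, not PDFio API. The sketch assumes the line is NUL-terminated and carries no trailing comment like the annotated example.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical type holding one decoded cross-reference entry
typedef struct
{
  unsigned long long offset;     // Byte offset of the object in the file
  unsigned           generation; // Generation number
  bool               in_use;     // true for 'n', false for 'f'
} xref_entry_t;

// Hypothetical helper: parse "oooooooooo ggggg n" or "oooooooooo ggggg f".
static bool
parse_xref_entry(const char *line, xref_entry_t *entry)
{
  int i;

  if (strlen(line) < 18)
    return (false);

  // 10-digit offset
  for (i = 0; i < 10; i ++)
  {
    if (!isdigit((unsigned char)line[i]))
      return (false);
  }

  // Space, then 5-digit generation number
  if (line[10] != ' ')
    return (false);

  for (i = 11; i < 16; i ++)
  {
    if (!isdigit((unsigned char)line[i]))
      return (false);
  }

  // Space, then the in-use/free flag
  if (line[16] != ' ' || (line[17] != 'n' && line[17] != 'f'))
    return (false);

  entry->offset     = strtoull(line, NULL, 10);
  entry->generation = (unsigned)strtoul(line + 11, NULL, 10);
  entry->in_use     = line[17] == 'n';

  return (true);
}
```

Because every entry has the same fixed width, a reader can seek directly to the entry for a given object number instead of scanning the table line by line.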
### Trailer ### Trailer
The first line of the trailer is just the trailer keyword. This is followed by the trailer dictionary,
which contains at least the /Size entry (Number of entries in the cross-reference table) The first line of the trailer is just the `trailer` keyword. This is followed
and the /Root entry (Object number of the document catalog, which is the root element by the trailer dictionary which contains at least the `/Size` entry specifying
of the graph of objects in the body). the number of entries in the cross-reference table and the `/Root` entry which
There follows a line with just the startxref keyword, a line with a single number (the byte offset of references the object for the document catalog which is the root element of the
the start of the cross-reference table within the file), and then the line %%EOF, which signals the graph of objects in the body.
end of the PDF file.
There follows a line with just the `startxref` keyword, a line with a single
number specifying the byte offset of the start of the cross-reference table
within the file, and then the line `%%EOF` which signals the end of the PDF
file.
``` ```
trailer %Trailer keyword trailer % Trailer keyword
<< %The trailer dictionary << % The trailer dictionary
/Root 5 0 R /Root 5 0 R
/Size 6 /Size 6
>> >>
startxref %startxref keyword startxref % startxref keyword
459 %Byte offset of cross-reference table 459 % Byte offset of cross-reference table
%%EOF %End-of-file marker %%EOF % End-of-file marker
``` ```
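Since the trailer sits at the end of the file, a reader usually starts there. The following C sketch searches backwards through the last bytes of a file for the `startxref` keyword and returns the number that follows it. The name `find_startxref` is hypothetical and not part of PDFio. The sketch assumes the tail of the file is already loaded into memory and that no comments appear between the keyword and the offset.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: find the last "startxref" in `buf` and return the
// byte offset of the cross-reference table that follows it.
static bool
find_startxref(const char *buf, size_t buflen, unsigned long long *offset)
{
  static const char keyword[] = "startxref";
  const size_t      klen = sizeof(keyword) - 1;
  const char        *end = buf + buflen;
  size_t            i;

  if (buflen < klen)
    return (false);

  // Scan backwards so that the last occurrence wins
  for (i = buflen - klen + 1; i > 0; i --)
  {
    const char *ptr = buf + i - 1;

    if (memcmp(ptr, keyword, klen) != 0)
      continue;

    // Skip the keyword and any whitespace/line endings after it
    for (ptr += klen; ptr < end && isspace((unsigned char)*ptr); ptr ++)
      ;

    if (ptr >= end || !isdigit((unsigned char)*ptr))
      return (false);

    // Accumulate the decimal offset without needing a NUL terminator
    for (*offset = 0; ptr < end && isdigit((unsigned char)*ptr); ptr ++)
      *offset = *offset * 10 + (unsigned long long)(*ptr - '0');

    return (true);
  }

  return (false);
}
```

With the offset in hand, a reader can seek to the cross-reference table, read the entries, and then follow `/Root` from the trailer dictionary to the document catalog.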