diff --git a/doc/pdfio.md b/doc/pdfio.md index cd8bc76..b82681a 100644 --- a/doc/pdfio.md +++ b/doc/pdfio.md @@ -132,6 +132,195 @@ PDFio exposes several types: - `pdfio_stream_t`: An object stream +Understanding PDF Files +----------------------- + +A PDF file provides data and commands for displaying pages of graphics and text, +and is structured in a way that allows it to be displayed in the same way across +multiple devices and platforms. +The following is a PDF which shows "Hello, World!" on one page: +``` +%PDF-1.0 %Header starts here +%âãÏÓ +1 0 obj %Body starts here +<< +/Kids [2 0 R] +/Count 1 +/Type /Pages +>> +endobj +2 0 obj +<< +/Rotate 0 +/Parent 1 0 R +/Resources 3 0 R +/MediaBox [0 0 612 792] +/Contents [4 0 R]/Type /Page +>> +endobj +3 0 obj +<< +/Font +<< +/F0 +<< +/BaseFont /Times-Italic +/Subtype /Type1 +/Type /Font +>> +>> +>> +endobj +4 0 obj +<< +/Length 65 +>> +stream +1. 0. 0. 1. 50. 700. cm +BT + /F0 36. Tf + (Hello, World!) Tj +ET +endstream +endobj +5 0 obj +<< +/Pages 1 0 R +/Type /Catalog +>> +endobj +xref %Cross-reference table starts here +0 6 +0000000000 65535 f +0000000015 00000 n +0000000074 00000 n +0000000192 00000 n +0000000291 00000 n +0000000409 00000 n +trailer %Trailer starts here +<< +/Root 5 0 R +/Size 6 +>> +startxref +459 +%%EOF +``` + +### Header +This is the first line of a PDF File. This specifies the version of PDF Format used. +For Example: '%PDF-1.0' + +Since PDF files almost always contain binary data, they can become corrupted if line +endings are changed (for example, if the file is transferred over FTP in text mode). +To allow legacy file transfer programs to determine that the file is binary, it is +usual to include some bytes withcharacter codes higher than 127 in the header. +- For example: %âãÏÓ +- The percent sign indicates another header line, the other few bytes are arbitrary +character codes in excess of 127. So, the whole header in our example is: + +``` +%PDF-1.0 +%âãÏÓ +``` + +### Body +The file body consists of a sequence of objects, each preceded by an object number, +generation number, and the obj keyword on one line, and followed by the endobj keyword +on another. For Example: + +``` +1 0 obj +<< +/Kids [2 0 R] +/Count 1 +/Type /Pages +>> +endobj +``` + +Here, the object number is 1, and the generation number is 0 (it almost always is). +The content for object 1 is in between the two lines 1 0 obj and endobj. +In this case, it’s the dictionary <> + +### Cross-Reference Table +The cross-reference table lists the byte offset of each object in the file body. +This allows random access to objects, meaning they don't have to be read in order. +Objects that are not used are never read, making the process efficient. +Operations like counting the number of pages in a PDF document are fast, even in large files. +Each object has an object number and a generation number. + - Generation numbers are used when a cross-reference table entry is reused. + - For simplicity, we will assume generation numbers to be always zero and ignore them. +The cross-reference table consists of: + - Header line that indicates the number of entries. + - Special entry (the first entry). + - One line for each of the object in the file body. + +``` +0 6 %Six entries in table, starting at 0 +0000000000 65535 f %Special entry +0000000015 00000 n %Object 1 is at byte offset 15 +0000000074 00000 n %Object 2 is at byte offset 74 +0000000192 00000 n %etc... +0000000291 00000 n +0000000409 00000 n %Object 5 is at byte offset 409 +``` + +### Trailer +The first line of the trailer is just the trailer keyword. This is followed by the trailer dictionary, +which contains at least the /Size entry (Number of entries in the cross-reference table) +and the /Root entry (Object number of the document catalog, which is the root element +of the graph of objects in the body). +There follows a line with just the startxref keyword, a line with a single number (the byte offset of +the start of the cross-reference table within the file), and then the line %%EOF, which signals the +end of the PDF file. + +``` +trailer %Trailer keyword +<< %The trailer dictinonary +/Root 5 0 R +/Size 6 +>> +startxref %startxref keyword +459 %Byte offset of cross-reference table +%%EOF %End-of-file marker +``` + +How a PDF File is Read +---------------------- + +To read a PDF file, converting it from a flat series of bytes into a graph of objects in memory, +the following steps might typically occur: +1. Read the PDF header from the beginning of the file, checking that this is, indeed, a PDF +document and retrieving its version number. +3. The end-of-file marker is now found, by searching backward from the end of the file. +The trailer dictionary can now be read, and the byte offset of the start of the cross-reference +table retrieved. +5. The cross-reference table can now be read. We now know where each object in the file is. +6. At this stage, all the objects can be read and parsed, or we can leave this process until each +object is actually needed, reading it on demand. +8. We can now use the data, extracting the pages, parsing graphical content, extracting metadata, +and so on. +This is not an exhaustive description, since there are many possible complications +(encryption, linearization, objects, and cross reference streams). + +How a PDF File is Written +------------------------- + +Writing a PDF document to a series of bytes in a file is much simpler than +reading it—we don’t need to support all of the PDF format, just the subset +we intend to use. Writing a PDF file is very fast, since it amounts to little +more than flattening the object graph to a series of bytes. +1. Output the header. +2. Remove any objects which are not referenced by any other object in the +PDF. This avoids writing objects which are no longer needed. +3. Renumber the objects so they run from 1 to n where n is the number of +objects in the file. +4. Output the objects one by one, starting with object number one, +recording the byte offset of each for the cross-reference table. +5. Write the cross-reference table. +6. Write the trailer, trailer dictionary, and end-of-file marker. + Reading PDF Files -----------------