diff --git a/doc/pdfio.md b/doc/pdfio.md index cd8bc76..f25d908 100644 --- a/doc/pdfio.md +++ b/doc/pdfio.md @@ -118,6 +118,159 @@ that are defined in a separate header file: ```c #include ``` +Understanding PDF Files +----------------------- + +A PDF file provides data and commands for displaying pages of graphics and text, +and is structured in a way that allows it to be displayed in the same way across +multiple devices and platforms. +The following is a PDF which shows "Hello, World!" on one page: +``` +%PDF-1.0 %Header starts here +%âãÏÓ +1 0 obj %Body starts here +<< +/Kids [2 0 R] +/Count 1 +/Type /Pages +>> +endobj +2 0 obj +<< +/Rotate 0 +/Parent 1 0 R +/Resources 3 0 R +/MediaBox [0 0 612 792] +/Contents [4 0 R]/Type /Page +>> +endobj +3 0 obj +<< +/Font +<< +/F0 +<< +/BaseFont /Times-Italic +/Subtype /Type1 +/Type /Font +>> +>> +>> +endobj +4 0 obj +<< +/Length 65 +>> +stream +1. 0. 0. 1. 50. 700. cm +BT + /F0 36. Tf + (Hello, World!) Tj +ET +endstream +endobj +5 0 obj +<< +/Pages 1 0 R +/Type /Catalog +>> +endobj +xref %Cross-reference table starts here +0 6 +0000000000 65535 f +0000000015 00000 n +0000000074 00000 n +0000000192 00000 n +0000000291 00000 n +0000000409 00000 n +trailer %Trailer starts here +<< +/Root 5 0 R +/Size 6 +>> +startxref +459 +%%EOF +``` + +### Header +This is the first line of a PDF File. This specifies the version of PDF Format used. +For Example: '%PDF-1.0' + +Since PDF files almost always contain binary data, they can become corrupted if line +endings are changed (for example, if the file is transferred over FTP in text mode). +To allow legacy file transfer programs to determine that the file is binary, it is +usual to include some bytes withcharacter codes higher than 127 in the header. +- For example: %âãÏÓ +- The percent sign indicates another header line, the other few bytes are arbitrary +character codes in excess of 127. So, the whole header in our example is: + +``` +%PDF-1.0 +%âãÏÓ +``` + +### Body +The file body consists of a sequence of objects, each preceded by an object number, +generation number, and the obj keyword on one line, and followed by the endobj keyword +on another. For Example: + +``` +1 0 obj +<< +/Kids [2 0 R] +/Count 1 +/Type /Pages +>> +endobj +``` + +Here, the object number is 1, and the generation number is 0 (it almost always is). +The content for object 1 is in between the two lines 1 0 obj and endobj. +In this case, it’s the dictionary <> + +### Cross-Reference Table +The cross-reference table lists the byte offset of each object in the file body. +This allows random access to objects, meaning they don't have to be read in order. +Objects that are not used are never read, making the process efficient. +Operations like counting the number of pages in a PDF document are fast, even in large files. +Each object has an object number and a generation number. + - Generation numbers are used when a cross-reference table entry is reused. + - For simplicity, we will assume generation numbers to be always zero and ignore them. +The cross-reference table consists of: + - Header line that indicates the number of entries. + - Special entry (the first entry). + - One line for each of the object in the file body. + +``` +0 6 %Six entries in table, starting at 0 +0000000000 65535 f %Special entry +0000000015 00000 n %Object 1 is at byte offset 15 +0000000074 00000 n %Object 2 is at byte offset 74 +0000000192 00000 n %etc... +0000000291 00000 n +0000000409 00000 n %Object 5 is at byte offset 409 +``` + +### Trailer +The first line of the trailer is just the trailer keyword. This is followed by the trailer dictionary, +which contains at least the /Size entry (Number of entries in the cross-reference table) +and the /Root entry (Object number of the document catalog, which is the root element +of the graph of objects in the body). +There follows a line with just the startxref keyword, a line with a single number (the byte offset of +the start of the cross-reference table within the file), and then the line %%EOF, which signals the +end of the PDF file. + +``` +trailer %Trailer keyword +<< %The trailer dictinonary +/Root 5 0 R +/Size 6 +>> +startxref %startxref keyword +459 %Byte offset of cross-reference table +%%EOF %End-of-file marker +``` API Overview @@ -132,6 +285,7 @@ PDFio exposes several types: - `pdfio_stream_t`: An object stream + Reading PDF Files -----------------