From 3de47ea63d00270e3414474b785642315f0389f9 Mon Sep 17 00:00:00 2001 From: ThePhatak <111195860+uddhavphatak@users.noreply.github.com> Date: Mon, 14 Oct 2024 12:43:40 +0530 Subject: [PATCH 1/8] Update pdfio.md update documentation --- doc/pdfio.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/doc/pdfio.md b/doc/pdfio.md index cd8bc76..9cdd54d 100644 --- a/doc/pdfio.md +++ b/doc/pdfio.md @@ -132,6 +132,9 @@ PDFio exposes several types: - `pdfio_stream_t`: An object stream +Understanding PDF Files +----------------- + Reading PDF Files ----------------- From eb5be57b4afb50db36b20c4b57464b80116cc0c2 Mon Sep 17 00:00:00 2001 From: ThePhatak <111195860+uddhavphatak@users.noreply.github.com> Date: Mon, 14 Oct 2024 13:06:06 +0530 Subject: [PATCH 2/8] Update pdfio.md basics of pdf file --- doc/pdfio.md | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 66 insertions(+), 1 deletion(-) diff --git a/doc/pdfio.md b/doc/pdfio.md index 9cdd54d..f93f626 100644 --- a/doc/pdfio.md +++ b/doc/pdfio.md @@ -133,7 +133,72 @@ PDFio exposes several types: Understanding PDF Files ------------------ +----------------------- + +A PDF file is structure in a way, so that it would be displayed in the same way across multiple devices and platforms. +The basic structure of PDF File is as follows: + +###Header +- This is the first line of a PDF File. This specifies the version of PDF Format used. +- Example: '%PDF-1.0' + +- Since PDF files almost always contain binary data, they can become corrupted if line endings are changed (for example, if the file is transferred over FTP in text mode). To allow legacy file transfer programs to determine that the file is binary, it is usual to include some bytes withcharacter codes higher than 127 in the header. +- For example: %âãÏÓ +- The percent sign indicates another header line, the other few bytes are arbitrary character codes in excess of 127. 
So, the whole header in our example is: + +%PDF-1.0 +%âãÏÓ + +###Body +- The file body consists of a sequence of objects, each preceded by an object number, generation number, and the obj keyword on one line, and followed by the endobj keyword on another. +- For Example +''' +1 0 obj +<< +/Kids [2 0 R] +/Count 1 +/Type /Pages +>> +endobj +''' +- Here, the object number is 1, and the generation number is 0 (it almost always is). The content for object 1 is in between the two lines 1 0 obj and endobj. In this case, it’s the dictionary <> + +###Cross-Reference Table +- The cross-reference table lists the byte offset of each object in the file body. +- This allows random access to objects, meaning they don't have to be read in order. +- Objects that are not used are never read, making the process efficient. +- Operations like counting the number of pages in a PDF document are fast, even in large files. +- Each object has an object number and a generation number. + - Generation numbers are used when a cross-reference table entry is reused. + - For simplicity, we would assume generation numbers to be always zero and ignore them. +- The cross-reference table consists of: + - Header line that indicates the number of entries. + - Special entry (the first entry). + - One line for each of the object in the file body. + + ''' +0 6 Six entries in table, starting at 0 +0000000000 65535 **f Special entry** +0000000015 00000 **n Object 1 is at byte offset 15** +0000000074 00000 **n Object 2 is at byte offset 74** +0000000192 00000 **n etc...** +0000000291 00000 **n** +0000000409 00000 **n Object 5 is at byte offset 409** + ''' + + ###Trailer +- The first line of the trailer is just the trailer keyword. 
This is followed by the trailer dictionary, which contains at least the /Size entry (which gives the number of entries in the cross-reference table) and the /Root entry (which gives the object number of the document catalog, which is the root element of the graph of objects in the body). +- There follows a line with just the startxref keyword, a line with a single number (the byte offset of the start of the cross-reference table within the file), and then the line %%EOF, which signals the end of the PDF file. +''' +trailer Trailer **keyword** +<< **The trailer dictinonary** +/Root 5 0 R +/Size 6 +>> +startxref **startxref keyword** +459 **Byte offset of cross-reference table** +%%EOF **End-of-file marker** +''' Reading PDF Files ----------------- From f5d40a305e142d05258efec5669ae0830173846a Mon Sep 17 00:00:00 2001 From: ThePhatak <111195860+uddhavphatak@users.noreply.github.com> Date: Mon, 14 Oct 2024 13:09:13 +0530 Subject: [PATCH 3/8] Update pdfio.md --- doc/pdfio.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/doc/pdfio.md b/doc/pdfio.md index f93f626..2e81039 100644 --- a/doc/pdfio.md +++ b/doc/pdfio.md @@ -135,21 +135,21 @@ PDFio exposes several types: Understanding PDF Files ----------------------- -A PDF file is structure in a way, so that it would be displayed in the same way across multiple devices and platforms. -The basic structure of PDF File is as follows: +A PDF file is structure in a way, so that it would be displayed in the same way +across multiple devices and platforms. The basic structure of PDF File is as follows: -###Header -- This is the first line of a PDF File. This specifies the version of PDF Format used. +### Header +This is the first line of a PDF File. This specifies the version of PDF Format used. 
- Example: '%PDF-1.0' -- Since PDF files almost always contain binary data, they can become corrupted if line endings are changed (for example, if the file is transferred over FTP in text mode). To allow legacy file transfer programs to determine that the file is binary, it is usual to include some bytes withcharacter codes higher than 127 in the header. +Since PDF files almost always contain binary data, they can become corrupted if line endings are changed (for example, if the file is transferred over FTP in text mode). To allow legacy file transfer programs to determine that the file is binary, it is usual to include some bytes withcharacter codes higher than 127 in the header. - For example: %âãÏÓ - The percent sign indicates another header line, the other few bytes are arbitrary character codes in excess of 127. So, the whole header in our example is: %PDF-1.0 %âãÏÓ -###Body +### Body - The file body consists of a sequence of objects, each preceded by an object number, generation number, and the obj keyword on one line, and followed by the endobj keyword on another. - For Example ''' @@ -163,7 +163,7 @@ endobj ''' - Here, the object number is 1, and the generation number is 0 (it almost always is). The content for object 1 is in between the two lines 1 0 obj and endobj. In this case, it’s the dictionary <> -###Cross-Reference Table +### Cross-Reference Table - The cross-reference table lists the byte offset of each object in the file body. - This allows random access to objects, meaning they don't have to be read in order. - Objects that are not used are never read, making the process efficient. @@ -176,21 +176,21 @@ endobj - Special entry (the first entry). - One line for each of the object in the file body. 
- ''' -0 6 Six entries in table, starting at 0 +''' +**0 6 Six entries in table, starting at 0** 0000000000 65535 **f Special entry** 0000000015 00000 **n Object 1 is at byte offset 15** 0000000074 00000 **n Object 2 is at byte offset 74** 0000000192 00000 **n etc...** 0000000291 00000 **n** 0000000409 00000 **n Object 5 is at byte offset 409** - ''' +''' - ###Trailer +### Trailer - The first line of the trailer is just the trailer keyword. This is followed by the trailer dictionary, which contains at least the /Size entry (which gives the number of entries in the cross-reference table) and the /Root entry (which gives the object number of the document catalog, which is the root element of the graph of objects in the body). - There follows a line with just the startxref keyword, a line with a single number (the byte offset of the start of the cross-reference table within the file), and then the line %%EOF, which signals the end of the PDF file. ''' -trailer Trailer **keyword** +trailer **Trailer keyword** << **The trailer dictinonary** /Root 5 0 R /Size 6 From 2cadfd8a1edd39b9b5fdd146f1e058d6a0e10b73 Mon Sep 17 00:00:00 2001 From: ThePhatak <111195860+uddhavphatak@users.noreply.github.com> Date: Mon, 14 Oct 2024 13:10:57 +0530 Subject: [PATCH 4/8] Update pdfio.md --- doc/pdfio.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/doc/pdfio.md b/doc/pdfio.md index 2e81039..f948bf4 100644 --- a/doc/pdfio.md +++ b/doc/pdfio.md @@ -146,12 +146,12 @@ Since PDF files almost always contain binary data, they can become corrupted if - For example: %âãÏÓ - The percent sign indicates another header line, the other few bytes are arbitrary character codes in excess of 127. So, the whole header in our example is: -%PDF-1.0 +%PDF-1.0 %âãÏÓ ### Body -- The file body consists of a sequence of objects, each preceded by an object number, generation number, and the obj keyword on one line, and followed by the endobj keyword on another. 
-- For Example +The file body consists of a sequence of objects, each preceded by an object number, generation number, and the obj keyword on one line, and followed by the endobj keyword on another. +- For Example: ''' 1 0 obj << @@ -164,11 +164,11 @@ endobj - Here, the object number is 1, and the generation number is 0 (it almost always is). The content for object 1 is in between the two lines 1 0 obj and endobj. In this case, it’s the dictionary <> ### Cross-Reference Table -- The cross-reference table lists the byte offset of each object in the file body. -- This allows random access to objects, meaning they don't have to be read in order. -- Objects that are not used are never read, making the process efficient. -- Operations like counting the number of pages in a PDF document are fast, even in large files. -- Each object has an object number and a generation number. +The cross-reference table lists the byte offset of each object in the file body. +This allows random access to objects, meaning they don't have to be read in order. +Objects that are not used are never read, making the process efficient. +Operations like counting the number of pages in a PDF document are fast, even in large files. +Each object has an object number and a generation number. - Generation numbers are used when a cross-reference table entry is reused. - For simplicity, we would assume generation numbers to be always zero and ignore them. 
- The cross-reference table consists of: From 853fa4fe8fd6d594e8ba85100f842c3ba40945d4 Mon Sep 17 00:00:00 2001 From: ThePhatak <111195860+uddhavphatak@users.noreply.github.com> Date: Mon, 14 Oct 2024 13:14:59 +0530 Subject: [PATCH 5/8] Update pdfio.md --- doc/pdfio.md | 48 +++++++++++++++++++++++++++--------------------- 1 file changed, 27 insertions(+), 21 deletions(-) diff --git a/doc/pdfio.md b/doc/pdfio.md index f948bf4..24d21c8 100644 --- a/doc/pdfio.md +++ b/doc/pdfio.md @@ -146,13 +146,16 @@ Since PDF files almost always contain binary data, they can become corrupted if - For example: %âãÏÓ - The percent sign indicates another header line, the other few bytes are arbitrary character codes in excess of 127. So, the whole header in our example is: +``` %PDF-1.0 %âãÏÓ +``` ### Body The file body consists of a sequence of objects, each preceded by an object number, generation number, and the obj keyword on one line, and followed by the endobj keyword on another. - For Example: -''' + +``` 1 0 obj << /Kids [2 0 R] @@ -160,8 +163,10 @@ The file body consists of a sequence of objects, each preceded by an object numb /Type /Pages >> endobj -''' -- Here, the object number is 1, and the generation number is 0 (it almost always is). The content for object 1 is in between the two lines 1 0 obj and endobj. In this case, it’s the dictionary <> +``` + +Here, the object number is 1, and the generation number is 0 (it almost always is). The content for object 1 is in between the two lines 1 0 obj and endobj. +In this case, it’s the dictionary <> ### Cross-Reference Table The cross-reference table lists the byte offset of each object in the file body. @@ -170,35 +175,36 @@ Objects that are not used are never read, making the process efficient. Operations like counting the number of pages in a PDF document are fast, even in large files. Each object has an object number and a generation number. - Generation numbers are used when a cross-reference table entry is reused. 
- - For simplicity, we would assume generation numbers to be always zero and ignore them. -- The cross-reference table consists of: + - For simplicity, we would assume generation numbers to be always zero and ignore them. +The cross-reference table consists of: - Header line that indicates the number of entries. - Special entry (the first entry). - One line for each of the object in the file body. -''' -**0 6 Six entries in table, starting at 0** -0000000000 65535 **f Special entry** -0000000015 00000 **n Object 1 is at byte offset 15** -0000000074 00000 **n Object 2 is at byte offset 74** -0000000192 00000 **n etc...** -0000000291 00000 **n** -0000000409 00000 **n Object 5 is at byte offset 409** -''' +``` +0 6 **Six entries in table, starting at 0** +0000000000 65535 **f Special entry** +0000000015 00000 **n Object 1 is at byte offset 15** +0000000074 00000 **n Object 2 is at byte offset 74** +0000000192 00000 **n etc...** +0000000291 00000 **n** +0000000409 00000 **n Object 5 is at byte offset 409** +``` ### Trailer - The first line of the trailer is just the trailer keyword. This is followed by the trailer dictionary, which contains at least the /Size entry (which gives the number of entries in the cross-reference table) and the /Root entry (which gives the object number of the document catalog, which is the root element of the graph of objects in the body). - There follows a line with just the startxref keyword, a line with a single number (the byte offset of the start of the cross-reference table within the file), and then the line %%EOF, which signals the end of the PDF file. 
-''' -trailer **Trailer keyword** -<< **The trailer dictinonary** + +``` +trailer **Trailer keyword** +<< **The trailer dictinonary** /Root 5 0 R /Size 6 >> -startxref **startxref keyword** -459 **Byte offset of cross-reference table** -%%EOF **End-of-file marker** -''' +startxref **startxref keyword** +459 **Byte offset of cross-reference table** +%%EOF **End-of-file marker** +``` Reading PDF Files ----------------- From df1064ff39db7a7fe3bc1adeb3f8abe63c1b0354 Mon Sep 17 00:00:00 2001 From: ThePhatak <111195860+uddhavphatak@users.noreply.github.com> Date: Mon, 14 Oct 2024 13:20:44 +0530 Subject: [PATCH 6/8] Update pdfio.md --- doc/pdfio.md | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 70 insertions(+) diff --git a/doc/pdfio.md b/doc/pdfio.md index 24d21c8..83b9c4e 100644 --- a/doc/pdfio.md +++ b/doc/pdfio.md @@ -138,6 +138,76 @@ Understanding PDF Files A PDF file is structure in a way, so that it would be displayed in the same way across multiple devices and platforms. The basic structure of PDF File is as follows: +### A small PDF File +The following is a PDF which says "Hello, World" on one page: +``` +%PDF-1.0 **Header starts here** +%âãÏÓ +1 0 obj **Body starts here** +<< +/Kids [2 0 R] +/Count 1 +/Type /Pages +>> +endobj +2 0 obj +<< +/Rotate 0 +/Parent 1 0 R +/Resources 3 0 R +/MediaBox [0 0 612 792] +/Contents [4 0 R]/Type /Page +>> +endobj +3 0 obj +<< +/Font +<< +/F0 +<< +/BaseFont /Times-Italic +/Subtype /Type1 +/Type /Font +>> +>> +>> +endobj +4 0 obj +<< +/Length 65 +>> +stream +1. 0. 0. 1. 50. 700. cm +BT +/F0 36. Tf +(Hello, World!) Tj +ET +endstream +endobj +5 0 obj +<< +/Pages 1 0 R +/Type /Catalog +>> +endobj +xref **Cross-reference table starts here** +0 6 +0000000000 65535 f +0000000015 00000 n +0000000074 00000 n +0000000192 00000 n +0000000291 00000 n +0000000409 00000 n +trailer **Trailer starts here** +<< +/Root 5 0 R +/Size 6 +>> +startxref +459 +%%EOF +``` + ### Header This is the first line of a PDF File. 
This specifies the version of PDF Format used. - Example: '%PDF-1.0' From 2d2a7126d2ce9863295e77ca3b6e92465ea9f6f2 Mon Sep 17 00:00:00 2001 From: ThePhatak <111195860+uddhavphatak@users.noreply.github.com> Date: Mon, 14 Oct 2024 13:34:27 +0530 Subject: [PATCH 7/8] Update pdfio.md updated doc --- doc/pdfio.md | 62 +++++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 54 insertions(+), 8 deletions(-) diff --git a/doc/pdfio.md b/doc/pdfio.md index 83b9c4e..6b64a56 100644 --- a/doc/pdfio.md +++ b/doc/pdfio.md @@ -179,8 +179,8 @@ endobj stream 1. 0. 0. 1. 50. 700. cm BT -/F0 36. Tf -(Hello, World!) Tj + /F0 36. Tf + (Hello, World!) Tj ET endstream endobj @@ -212,9 +212,13 @@ startxref This is the first line of a PDF File. This specifies the version of PDF Format used. - Example: '%PDF-1.0' -Since PDF files almost always contain binary data, they can become corrupted if line endings are changed (for example, if the file is transferred over FTP in text mode). To allow legacy file transfer programs to determine that the file is binary, it is usual to include some bytes withcharacter codes higher than 127 in the header. +Since PDF files almost always contain binary data, they can become corrupted if line +endings are changed (for example, if the file is transferred over FTP in text mode). +To allow legacy file transfer programs to determine that the file is binary, it is +usual to include some bytes with character codes higher than 127 in the header. - For example: %âãÏÓ -- The percent sign indicates another header line, the other few bytes are arbitrary character codes in excess of 127. So, the whole header in our example is: +- The percent sign indicates another header line, the other few bytes are arbitrary +character codes in excess of 127.
So, the whole header in our example is: ``` %PDF-1.0 @@ -222,7 +226,9 @@ Since PDF files almost always contain binary data, they can become corrupted if ``` ### Body -The file body consists of a sequence of objects, each preceded by an object number, generation number, and the obj keyword on one line, and followed by the endobj keyword on another. +The file body consists of a sequence of objects, each preceded by an object number, +generation number, and the obj keyword on one line, and followed by the endobj keyword +on another. - For Example: ``` @@ -235,7 +241,8 @@ The file body consists of a sequence of objects, each preceded by an object numb endobj ``` -Here, the object number is 1, and the generation number is 0 (it almost always is). The content for object 1 is in between the two lines 1 0 obj and endobj. +Here, the object number is 1, and the generation number is 0 (it almost always is). +The content for object 1 is in between the two lines 1 0 obj and endobj. In this case, it’s the dictionary <> ### Cross-Reference Table @@ -262,8 +269,13 @@ The cross-reference table consists of: ``` ### Trailer -- The first line of the trailer is just the trailer keyword. This is followed by the trailer dictionary, which contains at least the /Size entry (which gives the number of entries in the cross-reference table) and the /Root entry (which gives the object number of the document catalog, which is the root element of the graph of objects in the body). -- There follows a line with just the startxref keyword, a line with a single number (the byte offset of the start of the cross-reference table within the file), and then the line %%EOF, which signals the end of the PDF file. +The first line of the trailer is just the trailer keyword. 
This is followed by the trailer dictionary, +which contains at least the /Size entry (which gives the number of entries in the cross-reference table) +and the /Root entry (which gives the object number of the document catalog, which is the root element +of the graph of objects in the body). +There follows a line with just the startxref keyword, a line with a single number (the byte offset of +the start of the cross-reference table within the file), and then the line %%EOF, which signals the +end of the PDF file. ``` trailer **Trailer keyword** @@ -276,6 +288,40 @@ startxref **startxref keyword** %%EOF **End-of-file marker** ``` +How a PDF File is Read +---------------------- + +To read a PDF file, converting it from a flat series of bytes into a graph of objects in memory, +the following steps might typically occur: +1. Read the PDF header from the beginning of the file, checking that this is, indeed, a PDF +document and retrieving its version number. +3. The end-of-file marker is now found, by searching backward from the end of the file. +The trailer dictionary can now be read, and the byte offset of the start of the cross-reference +table retrieved. +5. The cross-reference table can now be read. We now know where each object in the file is. +6. At this stage, all the objects can be read and parsed, or we can leave this process until each +object is actually needed, reading it on demand. +8. We can now use the data, extracting the pages, parsing graphical content, extracting metadata, +and so on. This is not an exhaustive description, since there are many possible complications +(encryption, linearization, objects, and cross reference streams). + +How a PDF File is Written +------------------------- + +Writing a PDF document to a series of bytes in a file is much simpler than +reading it—we don’t need to support all of the PDF format, just the subset +we intend to use. 
Writing a PDF file is very fast, since it amounts to little +more than flattening the object graph to a series of bytes. +1. Output the header. +2. Remove any objects which are not referenced by any other object in the +PDF. This avoids writing objects which are no longer needed. +3. Renumber the objects so they run from 1 to n where n is the number of +objects in the file. +4. Output the objects one by one, starting with object number one, +recording the byte offset of each for the cross-reference table. +5. Write the cross-reference table. +6. Write the trailer, trailer dictionary, and end-of-file marker. + Reading PDF Files ----------------- From 25834e07efce811e09c2a057b98228581600f7fd Mon Sep 17 00:00:00 2001 From: ThePhatak <111195860+uddhavphatak@users.noreply.github.com> Date: Tue, 15 Oct 2024 09:38:01 +0530 Subject: [PATCH 8/8] Update pdfio.md addition of lines requested --- doc/pdfio.md | 57 ++++++++++++++++++++++++++-------------------------- 1 file changed, 28 insertions(+), 29 deletions(-) diff --git a/doc/pdfio.md b/doc/pdfio.md index 6b64a56..b82681a 100644 --- a/doc/pdfio.md +++ b/doc/pdfio.md @@ -135,15 +135,14 @@ PDFio exposes several types: Understanding PDF Files ----------------------- -A PDF file is structure in a way, so that it would be displayed in the same way -across multiple devices and platforms. The basic structure of PDF File is as follows: - -### A small PDF File -The following is a PDF which says "Hello, World" on one page: +A PDF file provides data and commands for displaying pages of graphics and text, +and is structured in a way that allows it to be displayed in the same way across +multiple devices and platforms. +The following is a PDF which shows "Hello, World!"
on one page: ``` -%PDF-1.0 **Header starts here** +%PDF-1.0 %Header starts here %âãÏÓ -1 0 obj **Body starts here** +1 0 obj %Body starts here << /Kids [2 0 R] /Count 1 @@ -190,7 +189,7 @@ endobj /Type /Catalog >> endobj -xref **Cross-reference table starts here** +xref %Cross-reference table starts here 0 6 0000000000 65535 f 0000000015 00000 n @@ -198,7 +197,7 @@ xref **Cross-reference table starts here** 0000000192 00000 n 0000000291 00000 n 0000000409 00000 n -trailer **Trailer starts here** +trailer %Trailer starts here << /Root 5 0 R /Size 6 @@ -209,8 +208,8 @@ startxref ``` ### Header -This is the first line of a PDF File. This specifies the version of PDF Format used. -- Example: '%PDF-1.0' +The header is the first line of a PDF file and specifies the version of the PDF format used. +For example: '%PDF-1.0' Since PDF files almost always contain binary data, they can become corrupted if line endings are changed (for example, if the file is transferred over FTP in text mode). @@ -228,8 +227,7 @@ character codes in excess of 127. So, the whole header in our example is: ### Body The file body consists of a sequence of objects, each preceded by an object number, generation number, and the obj keyword on one line, and followed by the endobj keyword -on another. -- For Example: +on another. For example: ``` 1 0 obj @@ -252,40 +250,40 @@ Objects that are not used are never read, making the process efficient. Operations like counting the number of pages in a PDF document are fast, even in large files. Each object has an object number and a generation number. - Generation numbers are used when a cross-reference table entry is reused. - - For simplicity, we would assume generation numbers to be always zero and ignore them. + - For simplicity, we will assume generation numbers to be always zero and ignore them. The cross-reference table consists of: - Header line that indicates the number of entries. - Special entry (the first entry).
- One line for each of the object in the file body. ``` -0 6 **Six entries in table, starting at 0** -0000000000 65535 **f Special entry** -0000000015 00000 **n Object 1 is at byte offset 15** -0000000074 00000 **n Object 2 is at byte offset 74** -0000000192 00000 **n etc...** -0000000291 00000 **n** -0000000409 00000 **n Object 5 is at byte offset 409** +0 6 %Six entries in table, starting at 0 +0000000000 65535 f %Special entry +0000000015 00000 n %Object 1 is at byte offset 15 +0000000074 00000 n %Object 2 is at byte offset 74 +0000000192 00000 n %etc... +0000000291 00000 n +0000000409 00000 n %Object 5 is at byte offset 409 ``` ### Trailer The first line of the trailer is just the trailer keyword. This is followed by the trailer dictionary, -which contains at least the /Size entry (which gives the number of entries in the cross-reference table) -and the /Root entry (which gives the object number of the document catalog, which is the root element +which contains at least the /Size entry (number of entries in the cross-reference table) +and the /Root entry (object number of the document catalog, which is the root element of the graph of objects in the body). There follows a line with just the startxref keyword, a line with a single number (the byte offset of the start of the cross-reference table within the file), and then the line %%EOF, which signals the end of the PDF file. ``` -trailer **Trailer keyword** -<< **The trailer dictinonary** +trailer %Trailer keyword +<< %The trailer dictionary /Root 5 0 R /Size 6 >> -startxref **startxref keyword** -459 **Byte offset of cross-reference table** -%%EOF **End-of-file marker** +startxref %startxref keyword +459 %Byte offset of cross-reference table +%%EOF %End-of-file marker ``` How a PDF File is Read @@ -302,7 +300,8 @@ table retrieved. 6. At this stage, all the objects can be read and parsed, or we can leave this process until each object is actually needed, reading it on demand. 8.
We can now use the data, extracting the pages, parsing graphical content, extracting metadata, -and so on. This is not an exhaustive description, since there are many possible complications +and so on. +This is not an exhaustive description, since there are many possible complications (encryption, linearization, objects, and cross reference streams). How a PDF File is Written