Update pdfio.md

ThePhatak 2024-10-14 13:14:59 +05:30 committed by GitHub
parent 2cadfd8a1e
commit 853fa4fe8f


@ -146,13 +146,16 @@ Since PDF files almost always contain binary data, they can become corrupted if
- For example: %âãÏÓ - For example: %âãÏÓ
- The percent sign indicates another header line, the other few bytes are arbitrary character codes in excess of 127. So, the whole header in our example is: - The percent sign indicates another header line, the other few bytes are arbitrary character codes in excess of 127. So, the whole header in our example is:
```
%PDF-1.0 %PDF-1.0
%âãÏÓ %âãÏÓ
```
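The header check described above can be sketched in Python. This is a minimal illustration, not part of the original document: the function name `parse_header` is hypothetical, and it assumes the header occupies the first two lines of the file.

```python
def parse_header(data: bytes):
    """Return (version, has_binary_marker) from the first two lines of a PDF.

    Hypothetical helper for illustration only.
    """
    lines = data.split(b"\n", 2)
    first = lines[0].rstrip(b"\r")
    if not first.startswith(b"%PDF-"):
        raise ValueError("missing %PDF- header line")
    version = first[len(b"%PDF-"):].decode("ascii")

    # The second header line starts with '%' and carries bytes above 127.
    second = lines[1].rstrip(b"\r") if len(lines) > 1 else b""
    has_binary_marker = second.startswith(b"%") and any(b > 127 for b in second[1:])
    return version, has_binary_marker


# The example header from the text, with %âãÏÓ encoded as Latin-1 bytes.
sample = b"%PDF-1.0\n%\xe2\xe3\xcf\xd3\n"
version, has_binary_marker = parse_header(sample)
```

Working on `bytes` rather than `str` avoids decoding errors on the high-bit characters in the second line.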
### Body ### Body
The file body consists of a sequence of objects, each preceded by an object number, generation number, and the obj keyword on one line, and followed by the endobj keyword on another. The file body consists of a sequence of objects, each preceded by an object number, generation number, and the obj keyword on one line, and followed by the endobj keyword on another.
- For example: - For example:
'''
```
1 0 obj 1 0 obj
<< <<
/Kids [2 0 R] /Kids [2 0 R]
@ -160,8 +163,10 @@ The file body consists of a sequence of objects, each preceded by an object numb
/Type /Pages /Type /Pages
>> >>
endobj endobj
''' ```
- Here, the object number is 1, and the generation number is 0 (it almost always is). The content for object 1 is in between the two lines 1 0 obj and endobj. In this case, it's the dictionary <</Kids [2 0 R] /Count 1 /Type /Pages>>
Here, the object number is 1, and the generation number is 0 (it almost always is). The content for object 1 is in between the two lines 1 0 obj and endobj.
In this case, it's the dictionary <</Kids [2 0 R] /Count 1 /Type /Pages>>
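To make the `obj` ... `endobj` framing concrete, here is a rough Python sketch that pulls the object number, generation number, and raw content out of a body. The name `find_objects` is hypothetical, and the regular expression is a deliberate simplification: it assumes no object content (such as a binary stream) contains the text `endobj`.

```python
import re

# "<object number> <generation number> obj ... endobj"
# Simplification: assumes "endobj" never appears inside the object's content.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def find_objects(data: bytes):
    """Map (object number, generation number) to the raw bytes between obj and endobj."""
    return {
        (int(m.group(1)), int(m.group(2))): m.group(3).strip()
        for m in OBJ_RE.finditer(data)
    }


body = b"1 0 obj\n<<\n/Kids [2 0 R]\n/Count 1\n/Type /Pages\n>>\nendobj\n"
objects = find_objects(body)
```

A real reader would locate objects through the cross-reference table instead of scanning the whole file; the scan here only illustrates the syntax.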
### Cross-Reference Table ### Cross-Reference Table
The cross-reference table lists the byte offset of each object in the file body. The cross-reference table lists the byte offset of each object in the file body.
@ -170,35 +175,36 @@ Objects that are not used are never read, making the process efficient.
Operations like counting the number of pages in a PDF document are fast, even in large files. Operations like counting the number of pages in a PDF document are fast, even in large files.
Each object has an object number and a generation number. Each object has an object number and a generation number.
- Generation numbers are used when a cross-reference table entry is reused. - Generation numbers are used when a cross-reference table entry is reused.
- For simplicity, we will assume generation numbers are always zero and ignore them. - For simplicity, we will assume generation numbers are always zero and ignore them.
- The cross-reference table consists of: The cross-reference table consists of:
- Header line that indicates the number of entries. - Header line that indicates the number of entries.
- Special entry (the first entry). - Special entry (the first entry).
- One line for each object in the file body. - One line for each object in the file body.
''' ```
**0 6 Six entries in table, starting at 0** 0 6 **Six entries in table, starting at 0**
0000000000 65535 **f Special entry** 0000000000 65535 **f Special entry**
0000000015 00000 **n Object 1 is at byte offset 15** 0000000015 00000 **n Object 1 is at byte offset 15**
0000000074 00000 **n Object 2 is at byte offset 74** 0000000074 00000 **n Object 2 is at byte offset 74**
0000000192 00000 **n etc...** 0000000192 00000 **n etc...**
0000000291 00000 **n** 0000000291 00000 **n**
0000000409 00000 **n Object 5 is at byte offset 409** 0000000409 00000 **n Object 5 is at byte offset 409**
''' ```
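The table layout above can be read with a few lines of Python. This is a sketch under assumptions: `parse_xref` is a hypothetical name, the table has a single section, and the explanatory annotations shown in the listing have been stripped so each entry is just `offset generation flag`.

```python
def parse_xref(lines):
    """Parse a single-section cross-reference table.

    Returns {object number: (byte offset, generation number, flag)},
    where flag is "f" for a free entry and "n" for an in-use entry.
    """
    first, count = map(int, lines[0].split())  # header line, e.g. "0 6"
    entries = {}
    for i, line in enumerate(lines[1:1 + count]):
        offset, generation, flag = line.split()
        entries[first + i] = (int(offset), int(generation), flag)
    return entries


table = """0 6
0000000000 65535 f
0000000015 00000 n
0000000074 00000 n
0000000192 00000 n
0000000291 00000 n
0000000409 00000 n""".splitlines()

xref = parse_xref(table)
```

With this mapping, reading object 5 means seeking to byte 409, so the rest of the file never has to be scanned.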
### Trailer ### Trailer
- The first line of the trailer is just the trailer keyword. This is followed by the trailer dictionary, which contains at least the /Size entry (which gives the number of entries in the cross-reference table) and the /Root entry (which gives the object number of the document catalog, which is the root element of the graph of objects in the body). - The first line of the trailer is just the trailer keyword. This is followed by the trailer dictionary, which contains at least the /Size entry (which gives the number of entries in the cross-reference table) and the /Root entry (which gives the object number of the document catalog, which is the root element of the graph of objects in the body).
- There follows a line with just the startxref keyword, a line with a single number (the byte offset of the start of the cross-reference table within the file), and then the line %%EOF, which signals the end of the PDF file. - There follows a line with just the startxref keyword, a line with a single number (the byte offset of the start of the cross-reference table within the file), and then the line %%EOF, which signals the end of the PDF file.
'''
trailer **Trailer keyword** ```
<< **The trailer dictionary** trailer **Trailer keyword**
<< **The trailer dictionary**
/Root 5 0 R /Root 5 0 R
/Size 6 /Size 6
>> >>
startxref **startxref keyword** startxref **startxref keyword**
459 **Byte offset of cross-reference table** 459 **Byte offset of cross-reference table**
%%EOF **End-of-file marker** %%EOF **End-of-file marker**
''' ```
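Because the trailer sits at the end of the file, a reader typically starts there. The following Python sketch locates the `startxref` offset; `find_startxref` is a hypothetical name, and the 1024-byte tail window is an assumption chosen for illustration.

```python
def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset of the cross-reference table.

    Looks for the last "startxref" keyword near the end of the file
    and reads the number on the line that follows it.
    """
    tail = data[-tail_size:]
    idx = tail.rfind(b"startxref")
    if idx == -1:
        raise ValueError("startxref keyword not found")
    tokens = tail[idx + len(b"startxref"):].split()
    if not tokens:
        raise ValueError("no offset after startxref")
    return int(tokens[0].decode("ascii"))


trailer = b"trailer\n<<\n/Root 5 0 R\n/Size 6\n>>\nstartxref\n459\n%%EOF\n"
offset = find_startxref(trailer)
```

The returned offset points at the cross-reference table, which in turn gives the position of every object, including the document catalog named by `/Root`.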
Reading PDF Files Reading PDF Files
----------------- -----------------