As developers we commonly use databases, even daily, but do we really know how they work? How do they store information? Which internal structure do they use? This post will cover how information is stored in PostgreSQL, from the table structure to the row’s organization at a disk level.
Getting page information
To get the detailed information of data structure we may use the pageinspect
extension. After installing this extension we will be able to get the low level detail of PostgreSQL pages.
To install it:
# create extension pageinspect;
Tables from our database
PostgreSQL has much more tables than the ones included in our schema. It stores a lot of data, for example, the tables and colums from our schemas.
There are more than one way to get the schema tables, but the easier one is the command \dt
.
# \dt eightkdata.*
The previous sentence will return a list of the relations included in the schema eightkdata.
Page structure in PostgreSQL
Once we know the tables from our schema, it is possible to analyze the page structure using some functions included in the pageinspect
extension. For example, we can get the header information in each page, like it is done in the next example for the first page from table eightkdata.httparchive
.
# SELECT * FROM page_header(get_raw_page('eightkdata.httparchive', 0)); ┌────────────┬──────────┬───────┬───────┬───────┬─────────┬──────────┬─────────┬───────────┐ │ lsn │ checksum │ flags │ lower │ upper │ special │ pagesize │ version │ prune_xid │ ├────────────┼──────────┼───────┼───────┼───────┼─────────┼──────────┼─────────┼───────────┤ │ 6/AA8DD3F0 │ 0 │ 0 │ 928 │ 960 │ 8192 │ 8192 │ 4 │ 0 │ └────────────┴──────────┴───────┴───────┴───────┴─────────┴──────────┴─────────┴───────────┘
The header of a page are the first 24 bytes and contains basic information to management future changes, like pointers to the free space or the size of the page (in PostgreSQL by default it is 8KB).
Each column in page_header
has a meaning:
- lsn: Log Sequence Number: is the address of the next byte to be used in the page xlog.
- checksum: Page checksum.
- flags: Various flag bits.
- lower: The offset where the free space starts, it will be the initial address to the next tuple created.
- upper: The offset where the free space ends.
- special: The offset where the special space starts, it is at the end of the page actually.
- pagesize: It is the size of the page, 8KB by default, but it can be configured.
- version: Version number of the page.
- prune_xid: Signals when pruning operation can be a good option to improve the system.
This data gives a nice view of the page organization that is represented in the next diagram.
There are some data in the previous diagram that doesn’t appear in the page_header
, they are part of PostgreSQL configuration, like the fill factor or the alignment padding.
Fill factor
The fill factor is a value that tells PostgreSQL when to stop storing tuples in the current page and switch to a new one. By default pages are not completely filled. This allows to store tuples updates in the same original page, increasing the system performance.
Alignment padding
To increase I/O operations efficiency PostgreSQL uses a word length that depends on the machine where it is running. In modern computers with 64 bits processors the word length is 8 bytes.
This causes that tuples are not exactly the same size they are occupying in disk, because PostgreSQL uses this extra space (alignment) to increase I/O performance.
Tuple structure in PostgreSQL
Tuples can be analyzed too using another function called heap_page_items
.
# select * from heap_page_items(get_raw_page('eightkdata.httparchive', 0)) limit 10; ┌────┬────────┬──────────┬────────┬────────┬────────┬──────────┬────────┬─────────────┬────────────┬────────┬────────┬───────┐ │ lp │ lp_off │ lp_flags │ lp_len │ t_xmin │ t_xmax │ t_field3 │ t_ctid │ t_infomask2 │ t_infomask │ t_hoff │ t_bits │ t_oid │ ├────┼────────┼──────────┼────────┼────────┼────────┼──────────┼────────┼─────────────┼────────────┼────────┼────────┼───────┤ │ 1 │ 8160 │ 1 │ 29 │ 6202 │ 0 │ 9 │ (0,1) │ 2 │ 2304 │ 24 │ ¤ │ ¤ │ │ 2 │ 8128 │ 1 │ 29 │ 6202 │ 0 │ 9 │ (0,2) │ 2 │ 2304 │ 24 │ ¤ │ ¤ │ │ 3 │ 8096 │ 1 │ 29 │ 6202 │ 0 │ 9 │ (0,3) │ 2 │ 2304 │ 24 │ ¤ │ ¤ │ │ 4 │ 8064 │ 1 │ 29 │ 6202 │ 0 │ 9 │ (0,4) │ 2 │ 2304 │ 24 │ ¤ │ ¤ │ │ 5 │ 8032 │ 1 │ 29 │ 6202 │ 0 │ 9 │ (0,5) │ 2 │ 2304 │ 24 │ ¤ │ ¤ │ │ 6 │ 8000 │ 1 │ 29 │ 6202 │ 0 │ 9 │ (0,6) │ 2 │ 2304 │ 24 │ ¤ │ ¤ │ │ 7 │ 7968 │ 1 │ 29 │ 6202 │ 0 │ 9 │ (0,7) │ 2 │ 2304 │ 24 │ ¤ │ ¤ │ │ 8 │ 7936 │ 1 │ 29 │ 6202 │ 0 │ 9 │ (0,8) │ 2 │ 2304 │ 24 │ ¤ │ ¤ │ │ 9 │ 7904 │ 1 │ 29 │ 6202 │ 0 │ 9 │ (0,9) │ 2 │ 2304 │ 24 │ ¤ │ ¤ │ │ 10 │ 7872 │ 1 │ 29 │ 6202 │ 0 │ 9 │ (0,10) │ 2 │ 2304 │ 24 │ ¤ │ ¤ │ └────┴────────┴──────────┴────────┴────────┴────────┴──────────┴────────┴─────────────┴────────────┴────────┴────────┴───────┘
For each tuple there is some information available about position inside the page, visibility or size among others.
- lp: The index of the tuple in the page.
- lp_off: Offset of the tubple inside the page.
- lp_flags: Keeps the status of the item pointer.
- lp_len: Length of the tuple.
- t_xmin: Transaction number when the tuple was created.
- t_xmax: Transaction number when the tuple was deleted.
- t_field3: It can contains one of two possible values, t_cid or t_xvac. The t_cid is the CID signature from the insert or delete. The t_xvac is the XID for the VACUMM operation when row version changes.
- t_ctid: Current TID.
- t_infomask2: Number of attributes and some flag bits.
- t_infomask: Some flag bits.
- t_hoff: Is the offset where the user data is stored inside the tuple.
TOAST (The Oversized-Attribute Storage Technique)
Looking at page size it is easy to discover that some data cannot be stored in such a small space. For these cases there is a mechanism called TOAST.
By default PostgreSQL has two variables, toast_tuple_threshold
and toast_tuple_target
with value 2K. When a tuple is being stored and is larger than 2K, the fields where it can be applied (not all of them apply to TOAST) are stored in a TOAST table.