A case of statistics corruption on TOAST chunks

Introduction

PostgreSQL stores data in pages, usually 8KB in size. When a value is too large to fit in a page (for example, a text or varchar value), PostgreSQL uses the TOAST technique, which stores the oversized data as chunks in a separate table; each chunk is stored as its own row in the TOAST table belonging to the main table. Users don't need to know whether the information lives in TOAST: PostgreSQL writes and reads the data there transparently. Of course, not all tables in a database have a TOAST table; to find out which ones do, you can use the following query:

    SELECT t1.oid, t1.relname, t1.relkind, t1.reltoastrelid, t1.reltoastrelid::regclass
    FROM pg_class t1
    INNER JOIN pg_class t2 ON t1.reltoastrelid = t2.oid
    WHERE t1.relkind = 'r' AND t2.relkind = 't';

Data corruption in PostgreSQL is not very frequent, and it generally appears after some failure: disk failures, an unexpected power loss, hardware faults, systems configured with fsync=off, or, to a lesser extent, PostgreSQL bugs. Some months ago the following error was found:

    ERROR: missing chunk number 0 for toast value XXXXXXX in pg_toast_2619

This error message indicates that there is data corruption in the TOAST table pg_toast_2619. This TOAST table belongs to the statistics table (pg_statistic, as you can verify in the source code), which the planner uses to make decisions and choose the best plan to execute a query. Corruption here can directly impact database performance, or queries may simply fail with this error. The table is populated and updated by the AUTOVACUUM process or by a manual ANALYZE command. So, how do we fix this data corruption so that our database keeps working well? We will see this in the next section.
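As a side note, the number in the TOAST table's name is the OID of its owning table, so you can confirm the mapping directly; the following queries are just an illustration of that:

    -- 2619 is the OID of the owning table:
    SELECT 2619::regclass;  -- returns pg_statistic

    -- Or map it back through pg_class:
    SELECT relname
    FROM pg_class
    WHERE reltoastrelid = 'pg_toast.pg_toast_2619'::regclass;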

Solution

Since the information stored in this table is essentially transient, recomputed and stored every time an ANALYZE (or an auto-analyze launched by AUTOVACUUM) happens, it is possible to discard its data, because it can be regenerated. So we use the following steps to try to solve this issue. Take into account that access to the table pg_statistic is restricted to superusers:

    BEGIN;
    DELETE FROM pg_statistic; --delete data from pg_statistic
    ANALYZE; --populate the statistics table again
    END;
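You can then verify that the statistics were regenerated through the pg_stats view; for example (the schema name here is just an illustration):

    SELECT tablename, attname, n_distinct, null_frac
    FROM pg_stats
    WHERE schemaname = 'public'
    LIMIT 10;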

This activity may take a while depending on the size of the database. But if we have identified which rows are corrupt and which table they belong to (a detection sketch follows the snippet below), it is possible to delete only the statistics corresponding to the damaged tables and regenerate just those, reducing the time needed to recreate the statistics. This situation can be detected because the error above is returned for only one or a few tables.

    BEGIN;
    DELETE FROM pg_statistic WHERE starelid = 'your_table'::regclass; --delete data from pg_statistic for a specific table
    ANALYZE your_table; --populate the statistics for that table again
    END;
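To find out which statistics rows are damaged in the first place, here is a minimal PL/pgSQL sketch (run as a superuser): it walks pg_statistic row by row, forces detoasting of each row by converting it to text, and reports the table and attribute of every entry that raises an error.

    DO $$
    DECLARE
        rec record;
    BEGIN
        FOR rec IN SELECT starelid, staattnum, stainherit FROM pg_statistic LOOP
            BEGIN
                -- Converting the whole row to text detoasts every column;
                -- a row with a missing chunk raises the error here
                PERFORM s::text
                FROM pg_statistic s
                WHERE s.starelid = rec.starelid
                  AND s.staattnum = rec.staattnum
                  AND s.stainherit = rec.stainherit;
            EXCEPTION WHEN OTHERS THEN
                RAISE NOTICE 'damaged statistics: table %, attribute %',
                    rec.starelid::regclass, rec.staattnum;
            END;
        END LOOP;
    END $$;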

Conclusions

As you can see, it is relatively easy to solve this data corruption problem in the statistics table, and now the planner can work well and make good decisions about execution strategies; it also allows AUTOVACUUM to do its job properly. Any temporary or reconstructible data that becomes corrupt and affects the PostgreSQL planner or executor can be fixed this way, by rebuilding the object: for example, the pg_statistic_ext_data table can be handled similarly, and a corrupt index can be rebuilt with REINDEX. But beyond this fix, you will need to check your hardware and analyze its health, or verify your fsync configuration, to avoid further data corruption.
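For example, a corrupt index can be rebuilt like this (your_index and your_table are placeholder names):

    REINDEX INDEX your_index;  -- rebuild a single index
    REINDEX TABLE your_table;  -- rebuild all indexes on a table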

Low-level data check tools

PostgreSQL has other options to discover and analyze data corruption, such as amcheck, pageinspect, and pg_visibility. In the majority of cases, you will need professional support to use them and repair the damage. But remember, data corruption is not frequent in PostgreSQL ;-)
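For instance, verifying the B-tree invariants of the TOAST table's index from the error above with amcheck could look like this (assuming PostgreSQL 10 or later with the amcheck extension available):

    CREATE EXTENSION IF NOT EXISTS amcheck;
    SELECT bt_index_check('pg_toast.pg_toast_2619_index');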