Log file compression with selective Zstandard in SQLite [part 4 + benchmark]
Watch out! You're reading a series of articles
- Part 1: Log file compression with Gzip and Zstandard + benchmark
This is the first part of a series of articles diving into log file compression techniques.
- Part 2: Log file compression with SQLite + benchmark
The second part utilizes SQLite as an embeddable storage that not only provides querying capabilities but is also able to compress the logs itself.
- Part 3: Log file compression with Zstandard VFS in SQLite + benchmark
The third part dives deeper into compression capabilities of SQLite using Zstandard compression algorithm.
- (currently reading) Part 4: Log file compression with selective Zstandard in SQLite
The fourth part explores database-level compression of specific columns within SQLite.
You can find the script that was used to seed the database with data, perform compression and run benchmarks here.
During my research I also found another, slightly older (2022) SQLite extension that uses Zstandard to provide compression capabilities. It's shipped as a native SQLite extension and, unlike the VFS approach, it does not compress the SQLite database file itself. It works inside SQLite and gives you control over which columns get compressed.
This opens up an entirely new dimension, since you can pick specific content to be compressed while retaining standard SQLite schema and index modeling. The extension also provides utility functions that let you decompress the data transparently, as well as selectively pick which data to decompress.
You can find the library in this GitHub repo or, if you're interested in the context around creating it, read the author's blog post.
Using the sqlite-zstd extension
In a nutshell, here's how you use the extension:
-- assuming the .so library file is located
-- at /usr/lib/sqlitezstd/libsqlite_zstd.so
-- load the extension
.load /usr/lib/sqlitezstd/libsqlite_zstd
-- enable the transparent (de)compression process.
-- this will create another table and a bunch of triggers
-- that will redirect your queries to a compressed table
select zstd_enable_transparent('{"table": "nginx_logs", "column": "json_log", "compression_level": 19, "dict_chooser": "''a''"}');
-- run the compression
select zstd_incremental_maintenance(null, 1);
-- vacuum the database so the space will be freed
vacuum;
-- query the database as you would normally do:
-- filter on a regular column:
select count(*) from nginx_logs where status = 200;
-- filter on the JSON column:
select count(*) from nginx_logs where json_log->'status' = 200;
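As the comments above hint, zstd_enable_transparent creates another table (it shows up as _nginx_logs_zstd, which you'll also see in a query later in this post) and a set of triggers that redirect your queries to it. A quick schema listing shows everything that got created:
-- list the objects the extension created around nginx_logs
select type, name
from sqlite_schema
where name like '%nginx_logs%';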
TIP
The extension creates a set of triggers that take care of updating the compressed data whenever it changes. That means you can still operate on the database (run inserts, updates and deletes) as you normally would and the extension will take care of updating the compressed data automatically. In addition, you can use utility functions to update the compressed data yourself.
For log compression this is not a vital feature, as the workload on logs is append-only: you only add logs and then query them.
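To make that concrete, writes still look like plain SQL; here's a quick sketch (the JSON value is made up, and it assumes json_log is the only column the insert needs to provide):
-- a made-up example row; the triggers redirect the write
-- to the underlying compressed table
insert into nginx_logs (json_log)
values ('{"status": 200, "time_local": "10/Oct/2024:13:55:36 +0000"}');
-- reads go through the same transparent layer and are decompressed on the fly
select json_log->'status' from nginx_logs limit 1;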
If you are interested in how the space in the compressed database is distributed, you can use the sqlite3_analyzer utility to analyze the database file. It will print information about all of the tables and indexes.
./sqlite3_analyzer nginx_logs_columns.db
Manual compression with the sqlite-zstd extension
As mentioned above, the extension also exposes a set of utility functions that let you choose specific columns to compress yourself. Here are the steps to do it:
-- first, you have to train a dictionary for the Zstandard algorithm
select zstd_train_dict_and_save(json_log, 1000000, 10000, 'd') from nginx_logs;
-- create a column where the compressed value will be stored
-- sidenote: you can use STRICT tables to enforce the data type
alter table nginx_logs add column compressed BLOB;
-- update the newly created column (you can use any where clause you can think of)
-- with a compressed value and specific compression level
update nginx_logs set compressed = zstd_compress(json_log, 22, 1, true);
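Note that after this update both the original json_log value and its compressed copy are stored side by side, so the file won't shrink on its own. To actually get down to the sizes shown below, the uncompressed column has to go away as well, roughly like this (only after you've extracted everything else you need from the JSON):
-- assumption: drop the now-redundant uncompressed column
-- (drop column requires SQLite 3.35+) and reclaim the freed pages
alter table nginx_logs drop column json_log;
vacuum;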
After compressing the table with a single json_log column (compressed with level 22 and a dictionary, but no indexes or additional values), the database file takes 67MB. Compared to the original (uncompressed) size, that is roughly a 6.5x reduction.
# database file without column compression
442M nginx_logs_json.db
# database file after compressing a json_log column
67M nginx_logs_json.db
I did run the queries, but the times were so slow that I abandoned the idea. The solution from the previous blog post yielded results that were orders of magnitude better. The query I used to measure the time:
select
count(*)
from `_nginx_logs_zstd`
where
-- extract data from JSON
json_extract(
-- decompress the json_log using a dict
zstd_decompress_col(`json_log`, 1, `_json_log_dict`, true),
'$.status'
) = 200;
For example, the above query took several seconds to complete. Perhaps some optimizations could be made, but I didn't want to spend time there when the out-of-the-box solution was already slow.
Indexing compressed table columns
I started to wonder how I could still benefit from the capabilities this extension provides for the purpose of log compression. I decided to give it another try and created two additional columns (status and time) along with indexes for them. Those columns were populated with data from the JSON before the compression was applied. My assumption was that if I limit the queries to those columns, I can get the speed of regular uncompressed data while still benefiting from the space reduction of the whole JSON payload.
To achieve that I only had to run a few additional commands:
-- create additional columns
alter table nginx_logs add column status text;
alter table nginx_logs add column time text;
-- populate these columns with data from JSON
update nginx_logs set status = json_log->'status';
update nginx_logs set time = json_log->'time_local';
-- create indexes
create index status_idx on nginx_logs (status);
create index time_idx on nginx_logs (time);
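Before benchmarking, it's worth a quick sanity check that the planner actually uses the new indexes; the exact output depends on your SQLite version, but it should mention status_idx:
-- ask the query planner how it would execute a filter on the extracted column;
-- expect a SEARCH using status_idx instead of a full table SCAN
explain query plan
select count(*) from nginx_logs where status = 200;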
After that operation the database size increased (json_log compressed with level 22 + 2 extra columns and their indexes):
130M nginx_logs_json.db
sqlite3_analyzer results
table size: 89M
indexes: 41M
From the above we can see that the table itself (the compressed JSON payload + the 2 extra columns) takes 89MB, while the indexes on those 2 columns take 41MB. This is a bigger price to pay in terms of the space saved, but compared to the original (442MB) it is still around a 3.6x reduction.
Running benchmark queries yielded the below results.
Benchmarking compressed data with indexes in SQLite
-- test whether decompression works
select zstd_decompress(compressed, 1, true) from nginx_logs limit 1;
select count(*), 1 from nginx_logs;
-- took 50ms
select count(*), 1 from nginx_logs where status = 200;
-- took 12ms
select status, count(*) from nginx_logs group by status;
-- took 139ms
select status, count(*) from nginx_logs where status = 200 group by status;
-- took 27ms
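As a side note, if you want to reproduce timings like these by hand (independently of the benchmark script linked at the top), the sqlite3 shell has a built-in timer:
-- print elapsed real/user/sys time after every statement
.timer on
select count(*), 1 from nginx_logs where status = 200;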
No surprise here: the query times are low and comparable with the results I got when running the same queries on the uncompressed table with extracted columns in Part 3.
Conclusion
In this exploration of SQLite compression techniques using the sqlite-zstd extension, we've uncovered both promising results and potential limitations. The extension offers a flexible approach to selectively compress specific columns within a SQLite database, particularly useful for log data with large JSON payloads. Key takeaways:
- Compression Efficiency: Using sqlite-zstd to compress the JSON log column achieved a significant size reduction, from 442MB to 67MB (about 6.5x smaller).
- Transparent Compression: The extension provides utility functions for transparent compression and decompression, allowing normal database operations without manual intervention.
- Indexing Trade-offs: Adding indexes to specific columns extracted from the JSON improved query performance but increased the overall database size to 130MB. This still represents a 3.6x reduction from the original size.
- Query Performance: Queries on indexed columns showed good performance, with response times ranging from 12ms to 139ms for various operations.
- Flexibility: The ability to selectively compress columns while maintaining normal SQL operations on others offers a balance between storage efficiency and query performance.
While the sqlite-zstd extension provides valuable compression capabilities, it's important to consider the specific requirements of your use case. For log data where append-only operations and efficient querying are priorities, this approach offers a viable solution. However, for scenarios requiring frequent updates or more complex operations, careful consideration of the performance implications is necessary.
Ultimately, the choice between this method and other compression techniques (such as those explored in earlier parts of this series) will depend on your specific balance of storage constraints, query performance needs, and operational requirements. As always in database optimization, it's crucial to benchmark with your actual data and query patterns to make the best decision for your particular use case.
Thanks for reading!
You reached the end of the four part series. In case you have any questions or comments you can reach out to me via [email protected]