If you look at PostgreSQL
internal optimization techniques it can be quite impressive
to see, how SQL can be optimized and prepared for later
execution. One feature which has been added to PostgreSQL
has caught my eye for a while and I want to share some
experience with this cool new stuff with you.
Let's take a look at CREATE FUNCTION:
bench=# \h CREATE FUNCTION
Command: CREATE FUNCTION
Description: define a new function
Syntax:
CREATE [ OR REPLACE ] FUNCTION
name ( [ [ argmode ] [ argname ] argtype [, ...] ] )
[ RETURNS rettype ]
{ LANGUAGE langname
| IMMUTABLE | STABLE | VOLATILE
| CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT | STRICT
| [ EXTERNAL ] SECURITY INVOKER | [ EXTERNAL ] SECURITY DEFINER
| COST execution_cost
| ROWS result_rows
| SET configuration_parameter { TO value | = value | FROM CURRENT }
| AS 'definition'
| AS 'obj_file', 'link_symbol'
} ...
[ WITH ( attribute [, ...] ) ]
The COST class and the ROWS class can be used nicely to teach the optimizer about the output of a function. Now, this seems quite like a detail - but in practice it isn't a detail but a powerful instrument to optimize applications.
Let's consider a simple example:
bench=# CREATE TABLE t_data (id serial, name text);
NOTICE: CREATE TABLE will create implicit sequence "t_data_id_seq" for serial column "t_data.id"
CREATE TABLE
bench=# INSERT INTO t_data (name) VALUES ('hans');
INSERT 0 1
bench=# INSERT INTO t_data (name) SELECT name FROM t_data;
INSERT 0 1
bench=# INSERT INTO t_data (name) SELECT name FROM t_data;
INSERT 0 2
...
bench=# INSERT INTO t_data (name) SELECT name FROM t_data;
INSERT 0 524288
The idea here is that we have a table containing many different ids but always the same name. This is an especially interesting from an optimization point of view.
Let's define a function returning some data:
bench=# CREATE FUNCTION x() RETURNS setof int4 AS $$
bench$# SELECT id FROM t_data GROUP BY id
bench$# $$ LANGUAGE 'sql';
CREATE FUNCTION
This will give us the following plan:
bench=# EXPLAIN SELECT * FROM x();
QUERY PLAN
------------------------------------------------------------
Function Scan on x (cost=0.00..260.00 rows=1000 width=32)
(1 row)
The planner expects the function to return 1000 rows. The reality is quite different:
bench=# EXPLAIN SELECT id FROM t_data GROUP BY id;
QUERY PLAN
----------------------------------------------------------------------------
Group (cost=145601.40..150845.22 rows=1048764 width=4)
-> Sort (cost=145601.40..148223.31 rows=1048764 width=4)
Sort Key: id
-> Seq Scan on t_data (cost=0.00..15628.64 rows=1048764 width=4)
(4 rows)
As you can see the "real" estimates are ways off. This can lead to bad performance or to a significant memory problem:
bench=# EXPLAIN SELECT x FROM x() GROUP BY x;
QUERY PLAN
-----------------------------------------------------------------
HashAggregate (cost=262.50..264.50 rows=200 width=4)
-> Function Scan on x (cost=0.00..260.00 rows=1000 width=4)
(2 rows)
In this case we got a memory consuming HashAggregate. If the function returns a lot of data this can be quite risky from a memory point of view.
Now, let's run the same query with some inlined SQL:
bench=# EXPLAIN SELECT x.*
FROM (SELECT id FROM t_data GROUP BY id) AS x GROUP BY x.id;
QUERY PLAN
----------------------------------------------------------------------------------
Group (cost=145601.40..163954.77 rows=200 width=4)
-> Group (cost=145601.40..150845.22 rows=1048764 width=4)
-> Sort (cost=145601.40..148223.31 rows=1048764 width=4)
Sort Key: t_data.id
-> Seq Scan on t_data (cost=0.00..15628.64 rows=1048764 width=4)
(5 rows)
This will give us a Group clause which needs some additional sort step.
This is less memory intense than a HashAggregate but it will give us a totally different runtime behavivor.
Using the two new clauses outlined above we can tweak the plan a little:
bench=# DROP FUNCTION x();
DROP FUNCTION
bench=# CREATE FUNCTION x() RETURNS setof int4 AS $$ SELECT id FROM t_data
GROUP BY id $$ LANGUAGE 'sql' ROWS 1000000 COST 50;
CREATE FUNCTION
bench=# EXPLAIN SELECT * FROM x();
QUERY PLAN
-----------------------------------------------------------------
Function Scan on x (cost=0.00..135000.00 rows=1000000 width=4)
(1 row)
The function scan we are performing now is giving us similar costs as the "real" call. In many cases this will give us a significantly improved and more efficient plan.
I can encourage people to take a closer look at this new feature.
In many cases queries containing function calls can be improved nicely. |