Using PostgreSQL Arrays for Faceted Search stage

PostgreSQL supports array datatypes which can be indexed to allow fast searching by overlap and intersection, this makes them ideal for small to medium scale faceted search.

Faceted search is a common feature in web shops; you select a product category and the shop displays a list of attributes that the products in the category have. You then narrow down your search by selecting which properties the products must have, until you are left with a small subset of products that will hopefully contain the product you are looking for. From a database design there are several ways of doing this and each has it's own goods, bads and uglies.

One design that we are probably all familiar with is the Entity-Attribute-Value model:

Products, attributes and a table connecting products to atributes and supplying a value for the attribute. To find which products have all the attributes you selected, you have to do a query on the eav_eav table for each attribute and it's desired value, and check if the number of matches is equal to the number of attributes you are searching for.

Let's hava go. The following will create the three tables and add products, attributes and link them in the eav_eav table. The actual data is nonsense but the principle holds.

code

-- Cleanup.
DROP TABLE IF EXISTS eav_eav;
DROP TABLE IF EXISTS eav_attributes;
DROP TABLE IF EXISTS eav_products;

-- Create the tables and stuff them full of data.
CREATE TABLE eav_products AS
SELECT g::INT AS id, 'prod_' || g AS title
FROM GENERATE_SERIES(1, 10000) AS g;
ALTER TABLE eav_products
    ADD PRIMARY KEY (id);

CREATE TABLE eav_attributes AS
SELECT g::INT AS id, 'attr_' || g AS title
FROM GENERATE_SERIES(1, 1000) AS g;
ALTER TABLE eav_attributes
    ADD PRIMARY KEY (id);

CREATE TABLE eav_eav AS
SELECT eav_products.id AS product_id, eav_attributes.id AS attribute_id, eav_products.id % 10 AS val
FROM eav_products,
     eav_attributes
WHERE (eav_products.id + eav_attributes.id) % 1000 < (eav_products.id % 10);

A query to lookup particular products would look like:

code

SELECT product_id, COUNT(*)
FROM eav_eav
WHERE (attribute_id = 992 and val = 8)
   OR (attribute_id = 993 and val = 8)
   OR (attribute_id = 994 and val = 8)
   OR (attribute_id = 995 and val = 8)
   OR (attribute_id = 996 and val = 8)
   OR (attribute_id = 997 and val = 8)
   OR (attribute_id = 998 and val = 8)
   OR (attribute_id = 999 and val = 8)
GROUP BY product_id
HAVING COUNT(*) = 8;

If you change the number of products and attributes you may find that this particular set nolonger exists in your database so here's a query that generates these queries based on actual data so you are guaranteed to find something (this is important for performance testing, a query that will not return anything can be extremely fast just because the planner could work this out because of an index)

code

SELECT product_id,
       count(*),
       'SELECT product_id, COUNT(*) FROM eav_eav WHERE ' ||
       STRING_AGG('(attribute_id=' || attribute_id || ' and val=' || val || ')', ' OR ') || ' GROUP BY product_id
HAVING COUNT(*) = ' || count(*) || ';'
FROM eav_eav
GROUP BY product_id;

This will return something like:
1	1	"SELECT product_id, COUNT(*) FROM eav_eav WHERE (attribute_id=999 and val=1) GROUP BY product_id
HAVING COUNT(*) = 1;"
2	2	"SELECT product_id, COUNT(*) FROM eav_eav WHERE (attribute_id=998 and val=2) OR (attribute_id=999 and val=2) GROUP BY product_id
HAVING COUNT(*) = 2;"
3	3	"SELECT product_id, COUNT(*) FROM eav_eav WHERE (attribute_id=997 and val=3) OR (attribute_id=998 and val=3) OR (attribute_id=999 and val=3) GROUP BY product_id
HAVING COUNT(*) = 3;"
4	4	"SELECT product_id, COUNT(*) FROM eav_eav WHERE (attribute_id=996 and val=4) OR (attribute_id=997 and val=4) OR (attribute_id=998 and val=4) OR (attribute_id=999 and val=4) GROUP BY product_id
HAVING COUNT(*) = 4;"
5	5	"SELECT product_id, COUNT(*) FROM eav_eav WHERE (attribute_id=995 and val=5) OR (attribute_id=996 and val=5) OR (attribute_id=997 and val=5) OR (attribute_id=998 and val=5) OR (attribute_id=999 and val=5) GROUP BY product_id
HAVING COUNT(*) = 5;"
6	6	"SELECT product_id, COUNT(*) FROM eav_eav WHERE (attribute_id=994 and val=6) OR (attribute_id=995 and val=6) OR (attribute_id=996 and val=6) OR (attribute_id=997 and val=6) OR (attribute_id=998 and val=6) OR (attribute_id=999 and val=6) GROUP BY product_id
HAVING COUNT(*) = 6;"

Without any indexes the queryplan is not good, and neither is the performance.

code

EXPLAIN ANALYSE
SELECT product_id, COUNT(*)
FROM eav_eav
WHERE (attribute_id = 992 and val = 8)
   OR (attribute_id = 993 and val = 8)
   OR (attribute_id = 994 and val = 8)
   OR (attribute_id = 995 and val = 8)
   OR (attribute_id = 996 and val = 8)
   OR (attribute_id = 997 and val = 8)
   OR (attribute_id = 998 and val = 8)
   OR (attribute_id = 999 and val = 8)
GROUP BY product_id
HAVING COUNT(*) = 8;


QUERY PLAN
GroupAggregate  (cost=1708.72..1710.34 rows=1 width=12) (actual time=3.635..3.645 rows=10 loops=1)
  Group Key: product_id
  Filter: (count(*) = 8)
  ->  Sort  (cost=1708.72..1708.90 rows=72 width=4) (actual time=3.627..3.631 rows=80 loops=1)
        Sort Key: product_id
        Sort Method: quicksort  Memory: 28kB
        ->  Seq Scan on eav_eav  (cost=0.00..1706.50 rows=72 width=4) (actual time=0.016..3.612 rows=80 loops=1)
              Filter: ((val = 8) AND ((attribute_id = 992) OR (attribute_id = 993) OR (attribute_id = 994) OR (attribute_id = 995) OR (attribute_id = 996) OR (attribute_id = 997) OR (attribute_id = 998) OR (attribute_id = 999)))
              Rows Removed by Filter: 44920
Planning Time: 0.070 ms
Execution Time: 3.671 ms

Notice how the series of OR statements takes up most of the time as the database plows through all records in eav_eav to find the ones that meet the OR's, and then has to aggregate the results to see if there are enough to satisfy the WHERE.

Adding an index helps a lot here:

code

CREATE INDEX idx_eav_attr_val ON eav_eav(attribute_id, val);

EXPLAIN ANALYSE
SELECT product_id, COUNT(*)
FROM eav_eav
WHERE (attribute_id = 992 and val = 8)
   OR (attribute_id = 993 and val = 8)
   OR (attribute_id = 994 and val = 8)
   OR (attribute_id = 995 and val = 8)
   OR (attribute_id = 996 and val = 8)
   OR (attribute_id = 997 and val = 8)
   OR (attribute_id = 998 and val = 8)
   OR (attribute_id = 999 and val = 8)
GROUP BY product_id
HAVING COUNT(*) = 8;


QUERY PLAN
HashAggregate  (cost=197.50..198.42 rows=1 width=12) (actual time=0.057..0.058 rows=10 loops=1)
  Group Key: product_id
  Filter: (count(*) = 8)
  ->  Bitmap Heap Scan on eav_eav  (cost=35.20..196.96 rows=73 width=4) (actual time=0.027..0.041 rows=80 loops=1)
        Recheck Cond: (((attribute_id = 992) AND (val = 8)) OR ((attribute_id = 993) AND (val = 8)) OR ((attribute_id = 994) AND (val = 8)) OR ((attribute_id = 995) AND (val = 8)) OR ((attribute_id = 996) AND (val = 8)) OR ((attribute_id = 997) AND (val = 8)) OR ((attribute_id = 998) AND (val = 8)) OR ((attribute_id = 999) AND (val = 8)))
        Heap Blocks: exact=10
        ->  BitmapOr  (cost=35.20..35.20 rows=74 width=0) (actual time=0.022..0.022 rows=0 loops=1)
              ->  Bitmap Index Scan on idx_eav_attr_val  (cost=0.00..4.37 rows=8 width=0) (actual time=0.010..0.010 rows=10 loops=1)
                    Index Cond: ((attribute_id = 992) AND (val = 8))
              ->  Bitmap Index Scan on idx_eav_attr_val  (cost=0.00..4.37 rows=8 width=0) (actual time=0.002..0.002 rows=10 loops=1)
                    Index Cond: ((attribute_id = 993) AND (val = 8))
              ->  Bitmap Index Scan on idx_eav_attr_val  (cost=0.00..4.37 rows=8 width=0) (actual time=0.001..0.001 rows=10 loops=1)
                    Index Cond: ((attribute_id = 994) AND (val = 8))
              ->  Bitmap Index Scan on idx_eav_attr_val  (cost=0.00..4.37 rows=8 width=0) (actual time=0.002..0.002 rows=10 loops=1)
                    Index Cond: ((attribute_id = 995) AND (val = 8))
              ->  Bitmap Index Scan on idx_eav_attr_val  (cost=0.00..4.37 rows=8 width=0) (actual time=0.001..0.001 rows=10 loops=1)
                    Index Cond: ((attribute_id = 996) AND (val = 8))
              ->  Bitmap Index Scan on idx_eav_attr_val  (cost=0.00..4.37 rows=8 width=0) (actual time=0.001..0.001 rows=10 loops=1)
                    Index Cond: ((attribute_id = 997) AND (val = 8))
              ->  Bitmap Index Scan on idx_eav_attr_val  (cost=0.00..4.37 rows=8 width=0) (actual time=0.001..0.001 rows=10 loops=1)
                    Index Cond: ((attribute_id = 998) AND (val = 8))
              ->  Bitmap Index Scan on idx_eav_attr_val  (cost=0.00..4.46 rows=17 width=0) (actual time=0.001..0.001 rows=10 loops=1)
                    Index Cond: ((attribute_id = 999) AND (val = 8))
Planning Time: 0.127 ms
Execution Time: 0.098 ms

That has reduced the execution time from ~3.5ms to ~0.1ms, but the planner now spends 0.127ms thinking about how to deal with this query, this is ten times as much as for the sequential scan.

Before we look at how array's perform, lets see how much diskspace this design is using up.

Note: This query uses pg_size_pretty(), which uses proper kilobytes=1024 bytes, and one Megabyte = 1024 KiloBytes = 1048576 Bytes. You *will* see the sum not adding up, but it actually is.

code

SELECT coalesce(relname, 'Total') AS "relation", pg_size_pretty(SUM(pg_relation_size(oid))) AS "size in kB"
FROM pg_class
WHERE relname IN ('eav_products', 'eav_attributes', 'eav_eav', 'idx_eav_product_attributes', 'idx_eav_attr_val')
GROUP BY ROLLUP ((relname))
ORDER BY relname;

relation	 size in kB
eav_attributes	   48 kB
eav_eav	         1952 kB
eav_products	  440 kB
idx_eav_attr_val 1000 kB
Total            3440 kB

Now for the arrays.

First modify the products table to add an attributes column, then fill that column with an array of strings that represent the attribute-value pairs from the eav_eav table.

code

ALTER TABLE eav_products ADD COLUMN attributes TEXT[];

WITH attrs AS (SELECT product_id, ARRAY_AGG(eav_eav.attribute_id || ':' || eav_eav.val) AS a
               FROM eav_products
                        INNER JOIN eav_eav ON eav_products.id = eav_eav.product_id
               GROUP BY product_id
)
UPDATE eav_products
SET attributes = attrs.a
FROM attrs
WHERE attrs.product_id = eav_products.id;

The query for getting records wil ofcourse also change, so here's another query to generate the queries:

code

SELECT id, COUNT(*), 'SELECT * FROM eav_products WHERE attributes @> ARRAY[' || STRING_AGG(quote_literal(a), ',') || '];' AS query FROM eav_products, unnest(attributes) AS a GROUP BY id;

id	count	query
-------------------------
1	1	SELECT * FROM eav_products WHERE attributes @> ARRAY['999:1'];
2	2	SELECT * FROM eav_products WHERE attributes @> ARRAY['998:2','999:2'];
3	3	SELECT * FROM eav_products WHERE attributes @> ARRAY['997:3','998:3','999:3'];

And how is the performance without indexes?

code

EXPLAIN ANALYSE
SELECT id, title
FROM eav_products WHERE attributes @> ARRAY['992:8','993:8','994:8','995:8','996:8','997:8','998:8','999:8'];

QUERY PLAN
Seq Scan on eav_products  (cost=0.00..318.00 rows=1 width=13) (actual time=0.124..2.734 rows=10 loops=1)
  Filter: (attributes @> '{992:8,993:8,994:8,995:8,996:8,997:8,998:8,999:8}'::text[])
  Rows Removed by Filter: 9990
Planning Time: 0.077 ms
Execution Time: 2.747 ms

That's ~2.8ms. It's already faster while doing a sequential scan, so lets add an index. PostgreSQL can index arrays using GIN:

code

CREATE INDEX idx_eav_product_attributes ON eav_products USING GIN(attributes);

EXPLAIN ANALYSE
SELECT id, title
FROM eav_products WHERE attributes @> ARRAY['992:8','993:8','994:8','995:8','996:8','997:8','998:8','999:8'];

QUERY PLAN
Bitmap Heap Scan on eav_products  (cost=68.00..72.01 rows=1 width=13) (actual time=0.027..0.036 rows=10 loops=1)
  Recheck Cond: (attributes @> '{992:8,993:8,994:8,995:8,996:8,997:8,998:8,999:8}'::text[])
  Heap Blocks: exact=10
  ->  Bitmap Index Scan on idx_eav_product_attributes  (cost=0.00..68.00 rows=1 width=0) (actual time=0.022..0.022 rows=10 loops=1)
        Index Cond: (attributes @> '{992:8,993:8,994:8,995:8,996:8,997:8,998:8,999:8}'::text[])
Planning Time: 0.088 ms
Execution Time: 0.063 ms

Planning time is down to ~0.09ms and execution time is ~0.06ms

How much space does all of this take up now? Ofcourse we got rid of the eav_eav table and its index, but we added a large column to the eav_product table and indexed it using a fancy indexing algorithm.

code

SELECT coalesce(relname, 'Total') AS "relation", pg_size_pretty(SUM(pg_relation_size(oid))) AS "size in kB"
FROM pg_class
WHERE relname IN ('eav_products', 'eav_attributes', 'eav_eav', 'idx_eav_product_attributes', 'idx_eav_attr_val')
GROUP BY ROLLUP ((relname))
ORDER BY relname;

relation	           size in kB
eav_attributes	             48 kB
eav_products	           1544 kB
idx_eav_product_attributes  536 kB
Total	                   2128 kB

That's 1300kB less.

So does it scale? What if there are 1M products? Just increase the 10000 in the generate_series to 1000000 and wait a few minutes, depending on how fast your machine is.

Using the same queries the results are

code

DROP TABLE IF EXISTS eav_eav;
DROP TABLE IF EXISTS eav_attributes;
DROP TABLE IF EXISTS eav_products;

CREATE TABLE eav_products AS
SELECT g::INT AS id, 'prod_' || g AS title
FROM GENERATE_SERIES(1, 1000000) AS g;
ALTER TABLE eav_products
    ADD PRIMARY KEY (id);
CREATE TABLE eav_attributes AS
SELECT g::INT AS id, 'attr_' || g AS title
FROM GENERATE_SERIES(1, 1000) AS g;
ALTER TABLE eav_attributes
    ADD PRIMARY KEY (id);
CREATE TABLE eav_eav AS
SELECT eav_products.id AS product_id, eav_attributes.id AS attribute_id, eav_products.id % 10 AS val
FROM eav_products,
     eav_attributes
WHERE (eav_products.id + eav_attributes.id) % 1000 < (eav_products.id % 10);


EXPLAIN ANALYSE
SELECT product_id, COUNT(*)
FROM eav_eav
WHERE (attribute_id = 992 and val = 8)
   OR (attribute_id = 993 and val = 8)
   OR (attribute_id = 994 and val = 8)
   OR (attribute_id = 995 and val = 8)
   OR (attribute_id = 996 and val = 8)
   OR (attribute_id = 997 and val = 8)
   OR (attribute_id = 998 and val = 8)
   OR (attribute_id = 999 and val = 8)
GROUP BY product_id
HAVING COUNT(*) = 8;

QUERY PLAN
Finalize GroupAggregate  (cost=86433.19..87301.10 rows=35 width=12) (actual time=151.333..152.099 rows=1000 loops=1)
  Group Key: product_id
  Filter: (count(*) = 8)
  ->  Gather Merge  (cost=86433.19..87168.30 rows=5920 width=12) (actual time=151.322..154.380 rows=1052 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial GroupAggregate  (cost=85433.16..85484.96 rows=2960 width=12) (actual time=134.381..134.768 rows=351 loops=3)
              Group Key: product_id
              ->  Sort  (cost=85433.16..85440.56 rows=2960 width=4) (actual time=134.374..134.492 rows=2667 loops=3)
                    Sort Key: product_id
                    Sort Method: quicksort  Memory: 216kB
                    Worker 0:  Sort Method: quicksort  Memory: 223kB
                    Worker 1:  Sort Method: quicksort  Memory: 225kB
                    ->  Parallel Seq Scan on eav_eav  (cost=0.00..85262.50 rows=2960 width=4) (actual time=0.119..133.891 rows=2667 loops=3)
                          Filter: ((val = 8) AND ((attribute_id = 992) OR (attribute_id = 993) OR (attribute_id = 994) OR (attribute_id = 995) OR (attribute_id = 996) OR (attribute_id = 997) OR (attribute_id = 998) OR (attribute_id = 999)))
                          Rows Removed by Filter: 1497333
Planning Time: 0.126 ms
Execution Time: 154.686 ms


CREATE INDEX idx_eav_attr_val ON eav_eav(attribute_id, val);

EXPLAIN ANALYSE
SELECT product_id, COUNT(*)
FROM eav_eav
WHERE (attribute_id = 992 and val = 8)
   OR (attribute_id = 993 and val = 8)
   OR (attribute_id = 994 and val = 8)
   OR (attribute_id = 995 and val = 8)
   OR (attribute_id = 996 and val = 8)
   OR (attribute_id = 997 and val = 8)
   OR (attribute_id = 998 and val = 8)
   OR (attribute_id = 999 and val = 8)
GROUP BY product_id
HAVING COUNT(*) = 8;

QUERY PLAN
HashAggregate  (cost=15920.41..16008.81 rows=35 width=12) (actual time=3.465..3.616 rows=1000 loops=1)
  Group Key: product_id
  Filter: (count(*) = 8)
  ->  Bitmap Heap Scan on eav_eav  (cost=192.98..15867.14 rows=7103 width=4) (actual time=0.643..2.246 rows=8000 loops=1)
        Recheck Cond: (((attribute_id = 992) AND (val = 8)) OR ((attribute_id = 993) AND (val = 8)) OR ((attribute_id = 994) AND (val = 8)) OR ((attribute_id = 995) AND (val = 8)) OR ((attribute_id = 996) AND (val = 8)) OR ((attribute_id = 997) AND (val = 8)) OR ((attribute_id = 998) AND (val = 8)) OR ((attribute_id = 999) AND (val = 8)))
        Heap Blocks: exact=1054
        ->  BitmapOr  (cost=192.98..192.98 rows=7130 width=0) (actual time=0.537..0.538 rows=0 loops=1)
              ->  Bitmap Index Scan on idx_eav_attr_val  (cost=0.00..20.21 rows=778 width=0) (actual time=0.170..0.170 rows=1000 loops=1)
                    Index Cond: ((attribute_id = 992) AND (val = 8))
              ->  Bitmap Index Scan on idx_eav_attr_val  (cost=0.00..20.21 rows=778 width=0) (actual time=0.075..0.075 rows=1000 loops=1)
                    Index Cond: ((attribute_id = 993) AND (val = 8))
              ->  Bitmap Index Scan on idx_eav_attr_val  (cost=0.00..20.21 rows=778 width=0) (actual time=0.082..0.082 rows=1000 loops=1)
                    Index Cond: ((attribute_id = 994) AND (val = 8))
              ->  Bitmap Index Scan on idx_eav_attr_val  (cost=0.00..20.21 rows=778 width=0) (actual time=0.053..0.053 rows=1000 loops=1)
                    Index Cond: ((attribute_id = 995) AND (val = 8))
              ->  Bitmap Index Scan on idx_eav_attr_val  (cost=0.00..20.21 rows=778 width=0) (actual time=0.039..0.039 rows=1000 loops=1)
                    Index Cond: ((attribute_id = 996) AND (val = 8))
              ->  Bitmap Index Scan on idx_eav_attr_val  (cost=0.00..20.21 rows=778 width=0) (actual time=0.039..0.039 rows=1000 loops=1)
                    Index Cond: ((attribute_id = 997) AND (val = 8))
              ->  Bitmap Index Scan on idx_eav_attr_val  (cost=0.00..37.28 rows=1685 width=0) (actual time=0.038..0.038 rows=1000 loops=1)
                    Index Cond: ((attribute_id = 998) AND (val = 8))
              ->  Bitmap Index Scan on idx_eav_attr_val  (cost=0.00..20.21 rows=778 width=0) (actual time=0.040..0.040 rows=1000 loops=1)
                    Index Cond: ((attribute_id = 999) AND (val = 8))
Planning Time: 0.118 ms
Execution Time: 3.742 ms


SELECT coalesce(relname, 'Total') AS "relation", pg_size_pretty(SUM(pg_relation_size(oid))) AS "size"
FROM pg_class
WHERE relname IN ('eav_products', 'eav_attributes', 'eav_eav', 'idx_eav_product_attributes', 'idx_eav_attr_val')
GROUP BY ROLLUP ((relname))
ORDER BY relname;

relation	    size
eav_attributes	   48 kB
eav_eav	          190 MB
eav_products	   42 MB
idx_eav_attr_val   97 MB
Total	          329 MB 



ALTER TABLE eav_products ADD COLUMN attributes TEXT[];
WITH attrs AS (SELECT product_id, ARRAY_AGG(eav_eav.attribute_id || ':' || eav_eav.val) AS a
               FROM eav_products
                        INNER JOIN eav_eav ON eav_products.id = eav_eav.product_id
               GROUP BY product_id
)
UPDATE eav_products
SET attributes = attrs.a
FROM attrs
WHERE attrs.product_id = eav_products.id;

QUERY PLAN
Gather  (cost=1000.00..39422.57 rows=1 width=15) (actual time=16.744..115.603 rows=1000 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Seq Scan on eav_products  (cost=0.00..38422.47 rows=1 width=15) (actual time=6.822..101.955 rows=333 loops=3)
        Filter: (attributes @> '{992:8,993:8,994:8,995:8,996:8,997:8,998:8,999:8}'::text[])
        Rows Removed by Filter: 333000
Planning Time: 0.068 ms
Execution Time: 115.667 ms


CREATE INDEX idx_eav_product_attributes ON eav_products USING GIN(attributes);

EXPLAIN ANALYSE
SELECT id, title
FROM eav_products WHERE attributes @> ARRAY['992:8','993:8','994:8','995:8','996:8','997:8','998:8','999:8'];

QUERY PLAN
Bitmap Heap Scan on eav_products  (cost=100.00..104.01 rows=1 width=15) (actual time=0.505..1.301 rows=1000 loops=1)
  Recheck Cond: (attributes @> '{992:8,993:8,994:8,995:8,996:8,997:8,998:8,999:8}'::text[])
  Heap Blocks: exact=1000
  ->  Bitmap Index Scan on idx_eav_product_attributes  (cost=0.00..100.00 rows=1 width=0) (actual time=0.395..0.395 rows=1000 loops=1)
        Index Cond: (attributes @> '{992:8,993:8,994:8,995:8,996:8,997:8,998:8,999:8}'::text[])
Planning Time: 0.055 ms
Execution Time: 1.349 ms


SELECT coalesce(relname, 'Total') AS "relation", pg_size_pretty(SUM(pg_relation_size(oid))) AS "size"
FROM pg_class
WHERE relname IN ('eav_products', 'eav_attributes', 'eav_eav', 'idx_eav_product_attributes', 'idx_eav_attr_val')
GROUP BY ROLLUP ((relname))
ORDER BY relname;

relation	           size
eav_attributes	            48 kB
eav_products	           153 MB
idx_eav_product_attributes  35 MB
Total	                   188 MB

Both methods scale remarkably well, but the array method is always much faster.

caveats.

Obviously the performance measurements I did are *way* too simple and do not represent a realistic workload. Your mileage may vary.

Yes I know that the time the EAV takes is quote short too, the point is that arrays are an order of magnitude faster. Anything that takes longer will generally cause more problems and create them sooner, at lower workloads.

There is no referential integrity built in; you can remove an attribute from eav_attributes and it will still be referenced in the eav_products table. If this is a problem in your setup then you can build an extremely simple trigger to clean the attributes table. No, triggers are not evil or slow, they are lovely and keep you from writing the exact same functionality in your app.