Tuesday, 30 September 2014

"Cache database queries heavily, because the database is always the biggest bottleneck in every PHP application."

Today I read another of those lovely articles that tell you what to do in your web development.

Lots of good advice, of course, and some pretty bad.

For example, the often quoted mantra: "Cache database queries heavily, because the database is always the biggest bottleneck in every PHP application."

The problem lies in the definition of "bottleneck". It is certainly true that SQL queries usually take more time to run than the rest of the PHP script put together, but that doesn't mean the database is slow. It just takes that much time; get over it. For some reason, PHP users seem to start their project with a benchmark of a script that contains nothing but <?php ?> and then want to use caching to compensate for every extra microsecond.

But the real problem with caching queries is that you can't. The old saying is that there are only two hard things in programming: naming things and cache invalidation. If you do decide to cache queries, you are not just going to run into the latter; you will soon find that there is no way to reliably invalidate your cache at all. You simply cannot tell that the database has changed without looking at the database. The best you can do is set a timeout and pray that the stale data won't be a problem... but you know it will.

So how can you make your database-related work faster? Well, that depends on what you are doing.

If the data from your query is going to be used for further mutations, then just don't cache it. Ever. The problems you can get from accidentally feeding a procedure stale data are simply not worth the performance gain. Don't cache the query that fetches a user's current credit, or his shopping cart, or the current stock for a product, or... well, anything that changes while the user is working.

If the data is meant for display only, then there is no real risk if it's stale. Still, I would never cache the queries; I'd cache the HTML that is generated from the query data. After all, why cache the source and waste time re-creating HTML that will be identical for the same source data?

So... what if your queries are just too slow?

Make them faster! Learn SQL, use indexes properly, and use the features your database has to offer, like partial indexes, stored functions, CTEs, window functions and the like.
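A partial index alone can work wonders: it covers only the rows a hot query actually touches. A minimal sketch, assuming a hypothetical orders table:

```sql
-- Hypothetical example: index only the rows the hot query cares about,
-- here the orders that have not shipped yet.
CREATE INDEX orders_unshipped_idx
    ON orders (customer_id)
    WHERE shipped_at IS NULL;

-- A query with the same predicate can use the small, cheap index:
SELECT *
  FROM orders
 WHERE customer_id = 42
   AND shipped_at IS NULL;
```

The index stays tiny because shipped orders never enter it, so it is also cheap to maintain on every write.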

OK, but what if I really can't make things fast enough and I really do need to cache query results?

Simple: don't. Have you not been listening?  Use a materialized view instead. That's a bit like a cache, but it's controlled by the only entity that *can* tell when data is stale: the database.
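A sketch of what that looks like, with made-up table and view names (materialized views are built in since PostgreSQL 9.3):

```sql
-- Hypothetical slow aggregate, frozen into a materialized view:
CREATE MATERIALIZED VIEW product_stats AS
    SELECT product_id,
           COUNT(*)    AS num_sales,
           SUM(amount) AS revenue
      FROM sales
     GROUP BY product_id;

-- Reading it is a plain, fast table scan:
SELECT * FROM product_stats WHERE product_id = 7;

-- And the database, not your PHP code, rebuilds it when you say so:
REFRESH MATERIALIZED VIEW product_stats;
```

The refresh can be triggered from the same place that modifies the source tables, so the "cache" is never invalidated by guesswork.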

But please, never cache raw query output in your application, you're just digging a big hole for yourself.

Monday, 8 September 2014

Real-world PostgreSQL: from XML to SQL+JSON

Suppose you have a data source that delivers XML, and that data has to end up in a database.

Piece of cake.


Suppose you have this little XML file:

<articles>
<article>
  <title>Dit is de titel</title>
  <body>Wow, dit is kazig!</body>
  <fotos>
    <foto width="100" height="200">kaas.png</foto>
    <foto width="100" height="200">cheese.png</foto>
  </fotos>
  </article>
<article>
  <title>Appels en peren</title>
  <body>Onvergelijkbare vruchten.</body>
  <fotos>
    <foto width="100" height="200">appel.png</foto>
    <foto width="100" height="200">peer.png</foto>
    <foto width="100" height="200">peren.png</foto>
    <foto width="100" height="200">appelmoes.png</foto>
  </fotos>
  </article>
</articles>

And it has to go into this table:

DROP TABLE IF EXISTS articles;
CREATE TABLE articles (id SERIAL, title TEXT, body TEXT, fotos JSON);


Step 1: Create a CTE that splits the XML document into chunks, one per article.

The "sourcedata" CTE is only used here to make the XML available without having to store it in a table first.

WITH sourcedata AS (SELECT '<articles>
<article>
  <title>Dit is de titel</title>
  <body>Wow, dit is kazig!</body>
  <fotos>
    <foto width="100" height="200">kaas.png</foto>
    <foto width="100" height="200">cheese.png</foto>
  </fotos>
  </article>
<article>
  <title>Appels en peren</title>
  <body>Onvergelijkbare vruchten.</body>
  <fotos>
    <foto width="100" height="200">appel.png</foto>
    <foto width="100" height="200">peer.png</foto>
    <foto width="100" height="200">peren.png</foto>
    <foto width="100" height="200">appelmoes.png</foto>
  </fotos>
  </article>
</articles>'::xml AS x)
, articles AS (SELECT UNNEST(XPATH('/articles/article', x)) AS article FROM sourcedata)
SELECT * FROM articles;





XPATH() extracts the XML fragments from the document. XPATH() returns an array of matched elements, and UNNEST() turns that array into rows.


Step 2: Process the XML fragments


WITH sourcedata AS (SELECT '<articles>
<article>
  <title>Dit is de titel</title>
  <body>Wow, dit is kazig!</body>
  <fotos>
    <foto width="100" height="200">kaas.png</foto>
    <foto width="100" height="200">cheese.png</foto>
  </fotos>
  </article>
<article>
  <title>Appels en peren</title>
  <body>Onvergelijkbare vruchten.</body>
  <fotos>
    <foto width="100" height="200">appel.png</foto>
    <foto width="100" height="200">peer.png</foto>
    <foto width="100" height="200">peren.png</foto>
    <foto width="100" height="200">appelmoes.png</foto>
  </fotos>
  </article>
</articles>'::xml AS x)
, articles AS (SELECT UNNEST(XPATH('/articles/article', x)) AS article FROM sourcedata)
, fotos    AS (SELECT article, UNNEST(xpath('fotos/foto/text()', article))::text AS foto FROM articles)
, fotojson AS (SELECT article::TEXT, JSON_AGG(foto) AS fotos_json  FROM fotos GROUP BY article::TEXT)
SELECT (xpath('title/text()', article::xml))[1]::text, (xpath('body/text()', article::xml))[1]::text, fotos_json::JSON  FROM fotojson;

After the articles have been fetched and unnested, the photos are fetched, unnested, and flattened with JSON_AGG() into one JSON array per article.
Then the title, the body and the photo array are extracted from each article.


Step 3: Stuff the data into the target table:

WITH sourcedata AS (SELECT '<articles>
<article>
  <title>Dit is de titel</title>
  <body>Wow, dit is kazig!</body>
  <fotos>
    <foto width="100" height="200">kaas.png</foto>
    <foto width="100" height="200">cheese.png</foto>
  </fotos>
  </article>
<article>
  <title>Appels en peren</title>
  <body>Onvergelijkbare vruchten.</body>
  <fotos>
    <foto width="100" height="200">appel.png</foto>
    <foto width="100" height="200">peer.png</foto>
    <foto width="100" height="200">peren.png</foto>
    <foto width="100" height="200">appelmoes.png</foto>
  </fotos>
  </article>
</articles>'::xml AS x)
, articles AS (SELECT UNNEST(XPATH('/articles/article', x)) AS article FROM sourcedata)
, fotos    AS (SELECT article, UNNEST(xpath('fotos/foto/text()', article))::text AS foto FROM articles)
, fotojson AS (SELECT article::TEXT, JSON_AGG(foto) AS fotos_json  FROM fotos GROUP BY article::TEXT)
INSERT INTO articles (title, body, fotos) 
SELECT (xpath('title/text()', article::xml))[1]::text, (xpath('body/text()', article::xml))[1]::text, fotos_json::JSON  FROM fotojson;




And now you can get up to all kinds of mischief, such as asking how many photos there are, and what the first photo is:

SELECT id, title, body, json_array_length(fotos) AS num_fotos, fotos->>0 AS eerste_foto FROM articles;

Thursday, 4 September 2014

Column order matters, and not the way you thought...

Today Mr. Schönig posted an interesting short article: http://www.cybertec.at/shrinking-the-storage-footprint-of-data/

It boils down to this: a PostgreSQL table becomes smaller and faster if you group columns of the same type together in the table definition, with the integers first.

Putting the integers first I can still explain: records are read field by field, and you search on integers far more often than on strings, so the earlier the int is found in the record, the sooner you're done.

Why the grouping itself matters... no idea, but it is very interesting for the bit-f*ckers among us.
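For the curious, you can measure the difference yourself. A sketch with two made-up tables holding the same fields in different orders (alignment padding between differently sized types is the usual suspect):

```sql
-- Hypothetical tables: identical fields, different column order.
CREATE TABLE mixed   (a INT, b TEXT, c INT, d TEXT, e INT, f TEXT);
CREATE TABLE grouped (a INT, c INT, e INT, b TEXT, d TEXT, f TEXT);

-- Store the same values in both layouts...
INSERT INTO mixed   VALUES (1, 'x', 2, 'y', 3, 'z');
INSERT INTO grouped VALUES (1, 2, 3, 'x', 'y', 'z');

-- ...and compare the on-disk size of the rows:
SELECT pg_column_size(mixed)   FROM mixed;
SELECT pg_column_size(grouped) FROM grouped;
```

On my understanding, the grouped layout should come out a few bytes smaller per row, because no padding is needed between the equally aligned integer columns.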

Why I like PostgreSQL more than MySQL: Errors on triggers.

You can use triggers to modify the contents of tables. However, if the query inside the trigger fails, MySQL gives you only the error itself, not the context. If the query fails because some field violates a NOT NULL constraint, then you will just get "Field X cannot be NULL".

Now, if the column does not appear in the original query that fired the trigger, then you may be tempted to start looking for it there. But if the column is part of the original query, and that query tries to set it to a value other than NULL, then you can only scratch your head.

Other databases tell you the exact query that failed: in this case, the query that the trigger executed. They will even tell you the name of the trigger in which the query failed.
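A sketch of what that looks like in PostgreSQL, with made-up table and function names:

```sql
-- Hypothetical setup: a trigger whose inner query violates NOT NULL.
CREATE TABLE accounts  (id SERIAL, parent_id INT);
CREATE TABLE audit_log (account_id INT NOT NULL, note TEXT);

CREATE FUNCTION log_account() RETURNS trigger AS $$
BEGIN
    -- Fails whenever NEW.parent_id happens to be NULL:
    INSERT INTO audit_log (account_id) VALUES (NEW.parent_id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_audit AFTER INSERT ON accounts
    FOR EACH ROW EXECUTE PROCEDURE log_account();

INSERT INTO accounts (parent_id) VALUES (NULL);
-- PostgreSQL reports the NOT NULL violation plus a CONTEXT section,
-- roughly like this, naming both the failing statement and the function:
--   CONTEXT:  SQL statement "INSERT INTO audit_log (account_id) ..."
--             PL/pgSQL function log_account() line 3 at SQL statement
```

With that context, you go straight to the trigger function instead of staring at the original INSERT.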