Tuesday 23 December 2014

Why I like PostgreSQL more than MySQL: REPLACE()

"UPSERT", update a record or insert it if it does not exist,  is a feature that is sorely missing from most database engines.

MySQL has implemented a fool's upsert through REPLACE, because it's not UPSERT, it's DELETE and INSERT. The statement will first delete any existing record with the same key and then create a new one.

This has a few side effects that usually don't show up until you've spent an hour or two debugging:

1. The DELETE will trigger ON DELETE CASCADE. So if you REPLACE a user record, then all records that refer to the user with an ON DELETE CASCADE will be deleted. Then the new user record is created, and what you see is: "the user record still exists, the PK is still the same, but all referencing records have disappeared. WUT?!"

2. A new record is created, which means that any fields that are not mentioned in the REPLACE query will be left at their default values. What you see is: "I did a REPLACE to change the name and now the user's date of birth is empty. WUT?!"

3. Because the DELETE will trigger the CASCADE, you will get a foreign-key violation when you try to REPLACE a record that has relations with a RESTRICT setting. What you see is: "I try to replace a record and it says that it cannot DELETE... WUT?!"
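
To see side effect 2 in action, here is a minimal sketch (hypothetical users table; the behaviour is standard MySQL REPLACE):

CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(50),
    date_of_birth DATE
);
INSERT INTO users (id, name, date_of_birth) VALUES (1, 'Alice', '1980-01-01');

-- Intending only to change the name...
REPLACE INTO users (id, name) VALUES (1, 'Alicia');

SELECT * FROM users;
-- id | name   | date_of_birth
--  1 | Alicia | NULL           <- the old row was deleted, so the birth date is gone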


How does PostgreSQL handle this? It doesn't. If you want to change a record, you use UPDATE. If you want to delete a record, use DELETE. If you want to create a record, use INSERT.

Monday 8 December 2014

Stop living with your head in the cloud!

First there was the computer. Then people began to work on computers in environments where data had to be shared, and the network was created. With the need for a central location to store data came the server, and with the first server crash came the redundant setup with multiple servers replicating data between them. As computing power requirements grew, so did the number of servers, until the only realistic name for it was a "server farm". These farms worked well; they were a group of servers working together behind a firewall and load balancer, with internal communications making the whole setup fully redundant.

But that's old technology, says marketing; today we use the cloud. The cloud is fast, safe, cheap, secure and scalable. It's perfect!

That sounds just peachy, so what kind of marvel is this new "cloud" thing exactly? Well, that's the brilliant bit, and it really is brilliant, you see: it's a group of servers working together behind a firewall and load balancer, with internal communications making the whole setup fully redundant.

No, that's not a copy/paste error, a cloud really is exactly what we used to call a "server farm".

Now I can hear you mumbling:

"But the cloud is virtualized!"
Virtual servers are still servers, with the same issues regarding crashes and security. Making processes share hardware was considered a bad thing until the word "cloud" was invented.

"But the cloud doesn't use servers, it uses instances!"
It uses instances of servers, either way. There are options to run an application as a service, but that introduces a whole new set of risks.

"The could cannot be hacked!"
It's hooked up to the internet, therefor it can be hacked.
There is no such thing as an unhackable setup, the fact that you can login means that a hacker can too. In fact the risk is greater at a cloud because they can get at you by loggin in to your cloud, or by getting into the cloud management service. It wasn't too long ago that I posted about a company that had to fold because someone got into their cloud and simply deleted the virtual servers. Boom.
At least with real servers you have to get into each one separately (unles the admin is a moron who uses the same credentials everywhere.... sadface)

"It is much more secure than a regular server!" Again: it is a regular server.

"But, are there really no advantages to a cloud over a server-farm?"
No. It *IS* a server farm.

There are certainly benefits to a server farm, but calling it a cloud doesn't change what it is or how it works.

Friday 5 December 2014

Everyday PostgreSQL: the top-N posts from a table, per group.

Every so often you run into a question like: "I have blog posts/products/photos in multiple categories, and now I want to show the newest X per category."

The standard solution is to fiddle about with loops or correlated subqueries, but in PostgreSQL (and several of the better databases) you have access to window functions, and with those it suddenly becomes an elegant solution:


DROP TABLE IF EXISTS tmp_artikelen;
CREATE TEMPORARY TABLE tmp_artikelen (id SERIAL, title VARCHAR(100), group_id INT);
INSERT INTO tmp_artikelen VALUES
(1, 'Hoe maak je een salade?', 1),
(2, 'Het water staat hoog.',2),
(3, 'Twee maten om te meten',1),
(4, 'Geniet maar drink met mate.', 1),
(5, 'Wizzel en pop gaan uit.', 1),
(6, 'Tomas de locomotief.', 2),
(7, 'Het weer is weer weerbarstig.', 3),
(8, 'Gnurkie wap wap!', 3);

WITH data_plus_rownumber AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY group_id ORDER BY id DESC) AS r
    FROM tmp_artikelen)
SELECT * FROM data_plus_rownumber WHERE r < 4;
The CTE builds a query that, alongside the data from the table, also generates a row number based on the group_id and the order of the ids within each group_id. The actual query then picks out all records with a row number below four; in other words, only the first three per group_id.

The only thing left to do is add a bit of sorting and you're done.
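
For instance, a sketch of the final query with the sorting added (newest first within each group):

WITH data_plus_rownumber AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY group_id ORDER BY id DESC) AS r
    FROM tmp_artikelen)
SELECT * FROM data_plus_rownumber
WHERE r < 4
ORDER BY group_id, id DESC;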

Tuesday 25 November 2014

Realworld problems with MySQL, #1.

I am very aware of all the shortcomings in MySQL, but they still bite me in the behind almost daily. Today was one of those days.


Case in point: some records were accidentally removed from the database. These things happen, people click the wrong button every now and again. Restore from backup is not an option, so I restored a backup to a temporary database so I could copy the records over.

Obviously I ran the insert queries from the command line tool (using the query-editor) inside a transaction to make sure that all queries would work and I would not be left with a broken set of data.

BEGIN;
INSERT ...;
INSERT ...;
INSERT ...;
INSERT ...;
ROLLBACK;


And I was right to do so, because one of the ids had changed since the backup was made. No worries; there is no COMMIT statement, so all the data is still unchanged. Just fix the query and try again. This time I got a bunch of unique constraint violations.

Wut?


I had not issued any commits yet and the block ends with a ROLLBACK, so why did the data suddenly already exist?

And then I remembered...

Yes, I put the queries in a transaction, but when a query fails in the command-line tool, it stops executing, so it never executed the ROLLBACK. And in MySQL, a BEGIN is executed as "COMMIT; BEGIN;", so when I fixed the first error and tried again, it had secretly committed the queries that did work during the first run.
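
A minimal sketch of the failure mode, assuming a hypothetical table t with a NOT NULL, unique id:

BEGIN;
INSERT INTO t (id) VALUES (1);    -- succeeds
INSERT INTO t (id) VALUES (NULL); -- fails; the command-line tool stops here,
ROLLBACK;                         -- so this ROLLBACK is never executed

-- Fix the failing statement and run the script again in the same session:
BEGIN;                            -- MySQL executes this as "COMMIT; BEGIN;",
                                  -- silently committing the INSERT of id 1
INSERT INTO t (id) VALUES (1);    -- unique constraint violation: id 1 already exists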

And now my data really was fscked.

Any other database will give you an error telling you to close the current transaction before opening a new one, saving you from destroying your data by *NOT COMMITTING*.

I don't really have words for just how stupid it is that you can commit data by *not* doing a commit... it's just... argh!

Monday 3 November 2014

A story of fail that never had to be.

Some time ago there was an SQL-injection exploit in Drupal. The issue was tracked to a combination of issues, including PDO. The problem is typical of the PHP/MySQL developers' attitude towards security, which seems to be "f*ck it".


One reason why prepared statements exist is that they split the parsing of the query syntax from the data. Parsing them separately means they cannot influence each other, and nothing you can put into the data will ever change the way the query is parsed.
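
For illustration, a sketch of a true server-side prepared statement in MySQL's own syntax (hypothetical users table); the payload stays data, no matter what it contains:

PREPARE stmt FROM 'SELECT * FROM users WHERE name = ?';
SET @name = 'O''Brien; DROP TABLE users; --';
EXECUTE stmt USING @name;   -- matches nothing, injects nothing
DEALLOCATE PREPARE stmt;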


This all must have seemed quite silly to the creators of PDO, who decided to implement their own version of prepared statements, which does not suffer from the round-trip delay or the one-query-at-a-time limit. Unfortunately, that also means that the values are not validated, and that you can still inject anything you want into the query.


The article I refer to mentions that everybody makes mistakes, which is true, but what the PDO guys did is not a mistake: they went out of their way to build a feature that did not implement any of the safety features that people use prepared statements for. And not because they thought those features were not safe enough, but because they thought they were too slow and inconvenient.

That's stupidity of a whole other level.

Monday 27 October 2014

Why I like PostgreSQL more than MySQL: Zalando!

Whenever I tell a manager about PostgreSQL, the first thing they ask is "who uses it?", as if tools are only good if someone they know uses them.

Well, how about, oh, one of the largest mail-order companies in the world: Zalando? Will that do?

http://gotocon.com/berlin-2013/presentation/Why%20Zalando%20trusts%20in%20PostgreSQL

And guess what, they do exactly what I'm always on about: let the database control the data; define a stored-procedure-based API in the database and let applications talk to that.
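
A minimal sketch of what such an API can look like (hypothetical users table; PostgreSQL 9.2+ for named parameters in SQL functions):

CREATE FUNCTION api_create_user(_name TEXT) RETURNS INTEGER AS $$
    INSERT INTO users (name) VALUES (_name) RETURNING id;
$$ LANGUAGE SQL;

-- Applications never touch the table; they only ever call:
SELECT api_create_user('alice');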

Permission to look smug sir? Granted!

Monday 20 October 2014

Why I like PostgreSQL more than MySQL: casting

MySQL supports two kinds of integers: signed and unsigned. A signed integer uses one bit to indicate whether the number is negative or positive; an unsigned integer uses that bit to double its positive range and cannot hold negative values at all, which is convenient for applications where you don't want to allow them.

Problem:

DROP TABLE IF EXISTS foo;
CREATE TEMPORARY TABLE foo (a INTEGER UNSIGNED);
INSERT INTO foo (a) VALUES (2);

(SELECT a FROM foo)
UNION ALL
(SELECT -2);

Will yield:
2
0

Notice that the '-2' has been converted to 0; MySQL issues only a warning about having done so. The result is invalid.

The datatype for the column is taken from the first result, and the rest is forced into that type, so reversing the order works:

(SELECT -2)
UNION ALL
(SELECT a FROM foo);

Will yield:
-2
2

But this is not really an option in the real world; you need something that will always work, and as we've learned over the years: MySQL doesn't have solutions that always work.

The closest we can get is to cast column 'a' to a signed integer (provided that all stored values fit) by adding CAST(a AS SIGNED INTEGER). Yes, that's SIGNED INTEGER, not INTEGER SIGNED, and the keyword INTEGER is optional, so this is also valid: CAST(a AS SIGNED).


(SELECT CAST(a AS SIGNED INTEGER) FROM foo)
UNION ALL
(SELECT -2);

Will yield:
2
-2

But now any value of a that does not fit in a signed integer will be truncated...


PostgreSQL does not do UNSIGNED, so there are no issues with casting or truncating. If you need to limit the values of a field to non-negative numbers, you use a CHECK constraint, which also allows you to set any arbitrary limit on the value: not just "not negative" but also "more than zero", "between 40 and 120", or "only odd numbers".
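
A sketch of those constraints in PostgreSQL:

CREATE TABLE foo (
    a INTEGER CHECK (a >= 0),               -- "not negative"
    b INTEGER CHECK (b > 0),                -- "more than zero"
    c INTEGER CHECK (c BETWEEN 40 AND 120), -- arbitrary range
    d INTEGER CHECK (d % 2 <> 0)            -- "only odd numbers"
);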

Thursday 2 October 2014

Why I like PostgreSQL more than MySQL: stubbornly refusing to admit errors.

MySQL has a long history of pretending that there are no such things as errors. Everywhere you turn there are features that allow you to make mistakes and never be told about them, or even worse: pretend that there actually was no error at all. If you've ever restored a database with views you'll know how MySQL treats the view as a table which is later converted into a view, just in case the target database does not support views.

The command line options are much the same; if you are adding a command line option to MySQL in a shell script you can actually tell MySQL not to throw an error if it does not support that option.

From the manual:

If an option is prefixed by --loose, a program does not exit with an error if it does not recognize the option, but instead issues only a warning:

shell> mysql --loose-no-such-option
mysql: WARNING: unknown option '--no-such-option'

So if your script needs to do something that the target database might not be able to do, then just pretend that it can, and it'll be fine. Well, what's the worst that could happen... your database gets destroyed, big deal, you've got backups... right?

Tuesday 30 September 2014

"Cache database queries heavily, because the database is always the biggest bottleneck in every PHP application."

Today I read another of those lovely articles that tell you what to do in your web development.

Lots of good advice, of course, and some pretty bad.

For example, the often quoted mantra: "Cache database queries heavily, because the database is always the biggest bottleneck in every PHP application."

The problem lies in the definition of "bottleneck". It is certainly true that SQL queries usually take more time to run than the rest of the PHP script put together, but that doesn't mean the database is slow. It just takes that much time; get over it. For some reason, PHP users seem to start their project with a benchmark of a script that just contains <?php ?> and then want to use caching to compensate for every extra microsecond.

But the real problem with caching queries is that you can't. The old saying is that there are only two hard things in programming: naming things and cache invalidation. If you do decide to cache queries, you are not just going to run into the latter, you will soon find that there is no way to reliably invalidate your cache at all. You simply cannot tell that the database has changed without looking at the database. The best you can do is set a timeout and pray that the stale data won't be a problem... but you know it will.

So how can you make your database-related work faster? Well, that depends on what you are doing.

If the data from your query is going to be used for further mutations, then just don't cache it. Ever. The problems you can get by accidentally feeding a procedure stale data are simply not worth the performance gain. Don't cache the query that fetches a user's current credit, or his shopping cart, or the current stock for a product, or... well, anything that changes while the user is working.

If the data is meant for display only, then there is no real risk involved if it's stale. Still, I would never cache the queries; I'd cache the HTML that is generated from the query data. After all: why cache the source and waste time re-creating HTML that will be identical for the same source data?

So... what if your queries are just too slow?

Make them faster! Learn SQL, use indexes properly, use the features that your database has to offer, like partial indexes, stored functions, CTEs, windowing and the like.

Ok, but what if I really can't make things fast enough and I really do need to cache query results?

Simple: don't. Have you not been listening? Use a materialized view instead. That's a bit like a cache, but it's controlled by the only entity that *can* tell when data is stale: the database.
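
A sketch of the idea, with a hypothetical orders table (PostgreSQL 9.3+):

CREATE MATERIALIZED VIEW top_products AS
    SELECT product_id, COUNT(*) AS sales
    FROM orders
    GROUP BY product_id
    ORDER BY sales DESC
    LIMIT 10;

-- Refresh whenever the underlying data has changed enough to matter:
REFRESH MATERIALIZED VIEW top_products;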

But please, never cache raw query output in your application, you're just digging a big hole for yourself.

Monday 8 September 2014

Real-world PostgreSQL: from XML to SQL+JSON

Say you have a data source that delivers XML, and that data has to end up in a database.

Piece of cake.


Say you have this little XML file:

<articles>
<article>
  <title>Dit is de titel</title>
  <body>Wow, dit is kazig!</body>
  <fotos>
    <foto width="100" height="200">kaas.png</foto>
    <foto width="100" height="200">cheese.png</foto>
  </fotos>
  </article>
<article>
  <title>Appels en peren</title>
  <body>Onvergelijkbare vruchten.</body>
  <fotos>
    <foto width="100" height="200">appel.png</foto>
    <foto width="100" height="200">peer.png</foto>
    <foto width="100" height="200">peren.png</foto>
    <foto width="100" height="200">appelmoes.png</foto>
  </fotos>
  </article>
</articles>

And it has to go into a table:

DROP TABLE IF EXISTS articles;
CREATE TABLE articles (id SERIAL, title TEXT, body TEXT, fotos JSON);


Step 1: Build a CTE that splits the XML document into chunks, one per article.

De "sourcedata" CTE is hier alleen gebruikt om de XML te kunnen gebruiken zonder hem eerst in een tabel op te hoeven slaan.

WITH sourcedata AS (SELECT '<articles>
<article>
  <title>Dit is de titel</title>
  <body>Wow, dit is kazig!</body>
  <fotos>
    <foto width="100" height="200">kaas.png</foto>
    <foto width="100" height="200">cheese.png</foto>
  </fotos>
  </article>
<article>
  <title>Appels en peren</title>
  <body>Onvergelijkbare vruchten.</body>
  <fotos>
    <foto width="100" height="200">appel.png</foto>
    <foto width="100" height="200">peer.png</foto>
    <foto width="100" height="200">peren.png</foto>
    <foto width="100" height="200">appelmoes.png</foto>
  </fotos>
  </article>
</articles>'::xml AS x)
, articles AS (SELECT UNNEST(XPATH('/articles/article', x)) AS article FROM sourcedata)
SELECT * FROM articles;





XPATH() pulls the XML fragments out of the document. XPATH() returns an array of matched elements, and UNNEST() turns those into rows.


Step 2: Process the XML fragments


WITH sourcedata AS (SELECT '<articles>
<article>
  <title>Dit is de titel</title>
  <body>Wow, dit is kazig!</body>
  <fotos>
    <foto width="100" height="200">kaas.png</foto>
    <foto width="100" height="200">cheese.png</foto>
  </fotos>
  </article>
<article>
  <title>Appels en peren</title>
  <body>Onvergelijkbare vruchten.</body>
  <fotos>
    <foto width="100" height="200">appel.png</foto>
    <foto width="100" height="200">peer.png</foto>
    <foto width="100" height="200">peren.png</foto>
    <foto width="100" height="200">appelmoes.png</foto>
  </fotos>
  </article>
</articles>'::xml AS x)
, articles AS (SELECT UNNEST(XPATH('/articles/article', x)) AS article FROM sourcedata)
, fotos    AS (SELECT article, UNNEST(xpath('fotos/foto/text()', article))::text AS foto FROM articles)
, fotojson AS (SELECT article::TEXT, JSON_AGG(foto) AS fotos_json  FROM fotos GROUP BY article::TEXT)
SELECT (xpath('title/text()', article::xml))[1]::text, (xpath('body/text()', article::xml))[1]::text, fotos_json::JSON  FROM fotojson;

After the articles have been fetched and unnested, the photos are fetched, unnested, and flattened with JSON_AGG() into one JSON array per article.
Then the title, the body and the photo array are extracted from each article.


Step 3: Stuff the data into the target table:

WITH sourcedata AS (SELECT '<articles>
<article>
  <title>Dit is de titel</title>
  <body>Wow, dit is kazig!</body>
  <fotos>
    <foto width="100" height="200">kaas.png</foto>
    <foto width="100" height="200">cheese.png</foto>
  </fotos>
  </article>
<article>
  <title>Appels en peren</title>
  <body>Onvergelijkbare vruchten.</body>
  <fotos>
    <foto width="100" height="200">appel.png</foto>
    <foto width="100" height="200">peer.png</foto>
    <foto width="100" height="200">peren.png</foto>
    <foto width="100" height="200">appelmoes.png</foto>
  </fotos>
  </article>
</articles>'::xml AS x)
, articles AS (SELECT UNNEST(XPATH('/articles/article', x)) AS article FROM sourcedata)
, fotos    AS (SELECT article, UNNEST(xpath('fotos/foto/text()', article))::text AS foto FROM articles)
, fotojson AS (SELECT article::TEXT, JSON_AGG(foto) AS fotos_json  FROM fotos GROUP BY article::TEXT)
INSERT INTO articles (title, body, fotos) 
SELECT (xpath('title/text()', article::xml))[1]::text, (xpath('body/text()', article::xml))[1]::text, fotos_json::JSON  FROM fotojson;




And now you can get up to all sorts of mischief, such as asking how many photos there are, and what the first photo is:

SELECT id, title, body, json_array_length(fotos) AS num_fotos, fotos->>0 AS eerste_foto FROM articles;
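
The expected result, give or take row order and the exact ids from the SERIAL sequence:

 id |      title      |           body            | num_fotos | eerste_foto
----+-----------------+---------------------------+-----------+-------------
  1 | Dit is de titel | Wow, dit is kazig!        |         2 | kaas.png
  2 | Appels en peren | Onvergelijkbare vruchten. |         4 | appel.png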

Thursday 4 September 2014

Column order matters, and not in the way you thought...

Today Mr. Schönig posted an interesting short article: http://www.cybertec.at/shrinking-the-storage-footprint-of-data/

It boils down to this: a PostgreSQL table becomes smaller and faster if you group columns of the same type together in the table definition, with the integers up front.
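
For example, a sketch of the difference; the same columns, once scattered and once grouped with the integers up front:

CREATE TABLE t_scattered (a INT, b TEXT, c INT, d TEXT, e INT);
CREATE TABLE t_grouped   (a INT, c INT, e INT, b TEXT, d TEXT);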

Putting the integers first I can still explain: records are read field by field, and you search on integers far more often than on strings, so the earlier the int is found in the record, the sooner you're done.

Why the grouping itself matters... no idea, but it is certainly interesting for the bit-f*ckers among us.

Why I like PostgreSQL more than MySQL: Errors on triggers.

You can use triggers to modify the contents of tables. However, if the query inside the trigger fails, MySQL gives you only the error itself, not the context. If the query fails because some field is NOT NULL, then you will just get "Field X cannot be NULL".

Now, if the column does not occur in the original query that fired the trigger, then you may be tempted to start looking for it. But if the column is part of the original query, and the query sets it to a value other than NULL, then you can only scratch your head.
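
A sketch of that head-scratcher, with hypothetical tables:

CREATE TABLE users (id INT PRIMARY KEY, name VARCHAR(50));
CREATE TABLE audit_log (user_id INT NOT NULL, note VARCHAR(100));

CREATE TRIGGER users_ai AFTER INSERT ON users
FOR EACH ROW INSERT INTO audit_log (user_id, note) VALUES (NULL, 'new user');  -- oops, NULL

INSERT INTO users (id, name) VALUES (1, 'alice');
-- ERROR 1048 (23000): Column 'user_id' cannot be null
-- ...but the INSERT you typed doesn't even mention user_id. WUT?!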

Other databases tell you the exact query that failed; in this case, the query that the trigger ran. They will even tell you the name of the trigger in which the query failed.

Sunday 24 August 2014

HSTORE, JSON and the relational data model.

Recently someone told me that data types such as HSTORE and JSON strictly speaking do not belong in a relational data model, and that those who do use them are therefore doing it wrong.

The answer I gave him, I am also posting here, with some extras.

HSTORE is not meant for data in which relations must be enforced. It is meant for data that is "schemaless". Truly schemaless data does not exist (how would you process data if you have no idea what its format is?), but there is data whose schema is so complex or so changeable that it is no longer realistic to work out the schema and store the data in a normalized data model.

That is the data HSTORE is meant for. With HSTORE you can store arbitrary data in a record as key/value pairs.

The classic example of data that belongs in an HSTORE is product attributes. Every manufacturer supplies its own list of facts about a product. Some facts are supplied by all manufacturers, others only by certain manufacturers. It is not realistic to draw up a list of all possible attributes that a manufacturer might ever want to supply, so if you want to store this data you end up with an Entity-Attribute-Value table: a table with the id of the product, the name of the attribute, and the value the manufacturer supplies for that attribute.

The advantage of an EAV table is that you can fit anything into it as a key/value pair; the disadvantage is that you cannot relationally tie the values to a table of permitted values, because you don't know under which key a manufacturer will supply an attribute, and certainly not in which format the value will arrive: does he call the colour of a pair of trousers "Blauw", "Azuur Blauw", or "Tricky Blue"?

In that respect it is exactly the same as an HSTORE: a key/value pair in which you cannot enforce anything without trickery. The HSTORE is just unbelievably much faster to read than an EAV: no subquery, JOIN or GROUP BY is involved.
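
A sketch of the product attributes example in HSTORE (requires the hstore extension):

CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE products (id SERIAL, name TEXT, attrs HSTORE);

INSERT INTO products (name, attrs)
VALUES ('trousers', 'colour => "Tricky Blue", size => "32/34"');

-- Read an attribute straight out of the record; no subquery, JOIN or GROUP BY:
SELECT name, attrs -> 'colour' AS colour FROM products;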

The JSON data type takes this a step further, because JSON data structures that you receive from, for example, a CMS can be stored directly as JSON, while the data remains indexable and searchable.

Of course you can rebuild everything HSTORE and JSON do in regular SQL, but that is more complex and the resulting solution is ultimately slower.

Wednesday 13 August 2014

Why I like PostgreSQL more than MySQL: UNSIGNED fails.

Take the following piece of SQL:


DROP TABLE IF EXISTS foo;
CREATE TEMPORARY TABLE foo (a INTEGER, b INTEGER UNSIGNED);
INSERT INTO foo (a,b) VALUES (-20, 3);
SELECT a+b FROM foo;

That just adds -20 and 3 to return -17, right?

Wrong:

SQL Error (1690): BIGINT UNSIGNED value is out of range in '(`wv`.`foo`.`a` + `wv`.`foo`.`b`)'

Something does not fit into a BIGINT UNSIGNED, and I would agree that -20 and -17 don't fit, but there is no BIGINT UNSIGNED in the entire table. Obviously, adding two INTEGER values could produce a number the size of a BIGINT, but if one of the source values is SIGNED then the output certainly must be SIGNED as well.

Solution? Use CAST() to mangle the INTEGER UNSIGNED into an INTEGER SIGNED; then it works. It cannot hold all the values you wanted anymore, but MySQL doesn't care.
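
A sketch of that workaround, using the table from above:

SELECT a + CAST(b AS SIGNED) FROM foo;  -- returns -17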

PostgreSQL doesn't do SIGNED and UNSIGNED, so it doesn't suffer from this casting crap.


Why I like PostgreSQL more than MySQL: Hamster and Chipmunk!

If you're involved with embedded electronics then you'll know about the Raspberry Pi, a single-board computer with the surface area of a credit card, that runs Linux. Such devices are often used as autonomous little brains inside clever devices that monitor, control, and play sound and video files, and they are even used as cheap computers for schools. Clever devices need clever data storage, so PostgreSQL is an obvious choice.

The good folks who develop PostgreSQL have a buildfarm that tests the functionality of PostgreSQL on many different hardware platforms and operating systems, and the Raspberry Pi is one of them. The two little critters that do the work are called Hamster and Chipmunk.

One more reason why I prefer PostgreSQL.

Thursday 7 August 2014

Please, stop calling every remote controlled thing a drone.

Dear people,

A remotely controlled toy aircraft is not a drone; it's a remotely controlled toy aircraft. Stop trying to be interesting by calling them drones.

Thanks.

Thursday 24 July 2014

To BC or not to BC?

A few weeks ago there was a vote to change the behavior of PHP.

This change has the potential to break existing code, so obviously some people are in favor and some are against.

Those who vote for the change see an opportunity to make their code more consistent and therefore faster to debug and more reliable in the long term.

Those who vote against see a lot of debugging work when the change is applied, with minimal benefits for the future.

As you may have guessed, I am a proponent of any change that makes a language more consistent. I adore Python for its attitude of "you only need one way to do something, so that's all you get", and I love languages that use strict typing, specifically because it prevents this kind of problem.
 
Do the BC protesters have a point? Absolutely, but they are making it from deep down a hole that they dug for themselves, while furiously shoveling "straight down". If your codebase grows to the size where you have to worry about changes in PHP's behaviour, then you have a responsibility to yourself to write tests. Today's development tools are such that writing tests hardly takes any extra time, and it completely voids this "Ohnoes, don't change PHP or my code will break!" argument.


So yes, please PHP, become more consistent. Incorporate HHVM while you are at it, and become a real programming language.

Tuesday 15 July 2014

Why I like JetBrains IDEs: modification history with unit-test results.

Test-driven development can greatly improve your development speed and quality. Every time you make a few changes you can simply run the tests to see if you've completed the feature you are working on, and whether you have broken anything else. When something else breaks, you have to revert, or at least review, the changes you made.


JetBrains' tools all keep a history of changes made to each file over time, which is pretty cool by itself, but if you're like me, you will have made dozens of changes since the last test run and you really can't remember exactly which changes came after the last successful test.

JetBrains to the rescue!

If you use a JetBrains IDE to run your tests, it will keep track of the test result in the history. Successful tests show up as a green line, failed tests show up as a red line. Now you can accurately review only those changes that came after the last successful test, and not worry about forgetting changes or reverting too much.

Saturday 12 July 2014

3d printing hype

Plastics have been used in medicine for decades, and nobody acts surprised when some body part is replaced by a plastic replica.

So why is it big news whenever that part is created using a 3d printer? I don't know, because the regular plastic parts are created using a CNC machine, which is in essence the same kind of machine as a printer, except it begins with a solid block of plastic and removes the bits that don't look like the final part.

The only reason why you could want to use a printer instead of CNC is that a printer can create hollow shapes directly, where CNC would have to create two halves that have to be glued together.

So please, stop mentioning that a particular part was created using a 3d printer. You don't specify any other manufacturing process, so why 3d printing, which is nothing special anyway?

Friday 11 July 2014

"You can read the code, right?"

A few weeks ago I talked to a programmer who insisted that you should not add comments inside your code to explain what you are doing. His argument was that this documentation will quickly get outdated as the code evolves, and incorrect documentation is worse than no documentation. Reading the code tells you exactly what the current code does. I asked him why he didn't just update the docs along with the code, but apparently that was just a silly question, he didn't write docs, end of story.

I've seen this kind of behaviour a few times before, and always in bad programmers. Not just bad as in "he doesn't stick to convention", but bad as in "his software does not work". And why doesn't it work? Because he has no documentation to stick to. Every time he modifies a routine he has to think back to the time when he first created it, to try to remember what it is supposed to do. This works exactly once, and after a month he really does have to read the code.

But the code does not tell him that there is another routine somewhere else in the code that requires that one IF statement so it will never return -1. So when a complaint comes in that the value -1 cannot be used, there is nothing stopping him from removing that apparently useless IF statement. A few days later it turns out that the other routine has been messing data up big time.

Yeah, let's not document routines, good idea.

Tuesday 8 July 2014

"Loosly typed languages make you a better programmer"

Loosely typed languages are sometimes thought to make you a better programmer because you have to do a lot more checking while you are writing software.

The sad truth, of course, is that you don't. We've all stared at the screen wondering why $t refuses to increase inside a loop, only to find that a few lines later, just off-screen, there is a line that assigns TRUE to $t. And we've all sighed and whispered "why doesn't it tell me I can't do that..."

There is nothing good about being loosely typed. People often say that it's "convenient", but... how? Is it really so much work to use two variables instead of one? Does your code become in any way more readable if you re-use variables for different kinds of data? Does finding bugs become easier?

Isn't it at least a little strange that in PHP it's perfectly acceptable to use ctype_digit((string) $var) to check if a variable contains an integer? How is that better than defining $var as an integer, after which it is simply guaranteed to be an integer?

Somebody, tell me what's good about loosely typed variables...

Monday 7 July 2014

Use the tools Luke! PHPStorm's conditional breakpoints.

Debugging PHP can be hell because of its hilarious variable-variables, which seem to be very attractive to programmers who want to avoid repeating themselves by creating template code and using said variable-variables instead of the regular names.

Regular debugging, by printing rude words at lines of code that seem important, just yields endless lists, because the routine is repeated over and over and over, doing the same task dozens of times, evaporating the rainforest to save the programmer a few keystrokes.

PHPStorm to the rescue!

Every PHP developer knows about xDebug (if you don't: google it NOW, and don't come back until you know what it is and have decided to use it from now on), but regular breakpoints are not much more useful than the print statements, because your script just keeps stopping at the same breakpoint over and over.
In PHPStorm you can make the breakpoint observe the current context. You can tell PHPStorm/xDebug to only break at a particular breakpoint if a certain variable in the context meets a requirement. For example, you can make one of those idiotic variable-variable routines break only when the offending variable-variable contains a particular value. Or you can disable a breakpoint inside a loop once the loop has iterated more than two times.

Even more interesting: you can make the triggering of a breakpoint dependent on whether another breakpoint has been hit yet. This means that you can put a breakpoint in a dispatcher and only make it break when the dispatcher is called after some variable-variable-fart has executed one particular variable.

These two simple features mean that you can really make your script stop where you need it to, and you don't have to press F7 a million times before accidentally pressing F8 and skipping the one bit you actually wanted to see.

Thursday 26 June 2014

Why I like PostgreSQL more than MySQL: UNSIGNED and ABS()

Let's say that you have a table with unsigned values:

CREATE TEMPORARY TABLE f (d INTEGER UNSIGNED, e INTEGER UNSIGNED);
INSERT INTO f VALUES (0,1);

You want to know the difference between the values of d and e, so you use abs()

SELECT abs(d-e) FROM f;

And you get:

ERROR 1690 (22003): BIGINT UNSIGNED value is out of range in '(`f`.`d` - `f`.`e`)'

I know that 0-1 is -1, which is negative, and yes, that would not fit in an UNSIGNED BIGINT, but where does the UNSIGNED BIGINT come from?

Apparently MySQL is smart enough to know that subtracting two INTEGER values may require a BIGINT to store the result, but for some bizarre reason it copies the UNSIGNED bit from the source values, which is odd for a function that is used specifically to work with negative values. It works if you change the column types of d and e to SIGNED, but that's not acceptable, because that changes the allowed values, both by allowing negative numbers and by halving the range of values that can be stored.

The only thing you can do about this, apparently, is to cast d and e to SIGNED when putting them into the ABS(). I will let you guess how I feel about this.
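
In other words, a sketch of the workaround looks like this:

SELECT ABS(CAST(d AS SIGNED) - CAST(e AS SIGNED)) FROM f;  -- returns 1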


PostgreSQL, of course, does not have this problem. PostgreSQL does not do UNSIGNED in the first place, but they also test their functions before publishing them.

Sunday 22 June 2014

Codespaces died an unnecessary death-by-naivety.

Some time ago a company named Codespaces had a problem. An evil person had gained access to their cloud and deleted so much data that their core business was basically erased from existence.

So why do I call this unnecessary and naive? Because what happened to them is textbook stuff; indeed, you'll find this in any book about security. Even "The Daily Worse Than Failure" has stories like this every week.

1. Never put business-critical data on servers you do not fully control.
There is a reason why firewalls use IP-based security. Only a handful of people at Codespaces needed to be able to log in; the rest of the world should not even have been able to ping the server, let alone get to the login page.
Controlling your own servers also means that you have the ability to physically disconnect them from the internet. When stuff begins to disappear, you unplug, stopping the hacker dead in his tracks. Downtime is a nuisance; losing data can, well, cost you your business.

2. Never use the same credentials for multiple servers.
Credentials get compromised, that's just how it is, and that's why you want to limit the amount of access each set of credentials grants. Putting everything behind the same set is just plain stupid. Sorry, there is no other word for it.

3. Always keep your backups separate from the server.
Backups are meant to help you recover from disasters like, say, a hacker stealing your admin credentials and erasing your data. In the 'good old days' people put data on tapes and physically took those tapes to a safe, and a copy to a safe in another building. Why? In case of a burglary, or a fire, or... heck, whatever happened that caused the data to get destroyed. Today, apparently, it is perfectly acceptable to just have a file "in the cloud" and stick it behind the same password that can erase the original data...

4. Standby servers are not a luxury.
When the hacker started to shut down servers, backup servers should have kicked in to continue the service. With different passwords and separate backups, the customers would probably not even have noticed that there was ever any problem.


For those of you who think "well, hindsight is always 20/20"... no. These points are common knowledge that any sysadmin should know by heart. Codespaces fell for the cloud hype and it cost them their company. It's sad, but any sysadmin would have told them this was a very bad setup.

Tuesday 17 June 2014

Why I like PostgreSQL more than MySQL: The query log.

MySQL can log queries, but only those that worked. Queries that fail are not logged at all. If you don't build some feature into your code that logs errors, and does so in a reliable way, then you're in for hours of debugging.

PostgreSQL logs errors to its main log, so you can find them regardless of what the application decides to do.
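
For the record, a sketch of the postgresql.conf settings involved; these are close to the shipped defaults, but distributions vary:

log_min_messages = warning        # what gets written to the server log
log_min_error_statement = error   # also log the statement that caused the error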

Monday 16 June 2014

The power of naivety: mobbing brewers to give up their secrets

Sometimes I wonder just how naive and gullible people can be.

The "foodbabe of http://foodbabe.com/  was concerned that there might be something relly nasty in beer, like gycol, so she started a petition to ask brewers to tell their customers what's in their product. Lots of people signed it  and now brewers are "giving up their secrets".

So what's my problem? It is that apparently, 43k people think multi-billion-dollar industries that live off the fact that only they know how to produce a particular drink would ever actually tell you the recipe. They don't do that, end of story. Not for good ingredients, not for bad ones, and no, not even if they put glycol in there; they will never, ever tell you.

Yes, they are putting up lists of ingredients, and guess what: it's the generic recipe for beer. Wow, I'd never have guessed.

So what has this incredibly naive woman achieved? She's made 43k people believe that they have the power to do anything by signing a petition.

Well done.

Sunday 8 June 2014

"SQL is hip again"

When NoSQL first became popular, the entire SQL community tried their best to point out that relationless databases are simply not an option for most applications. Still, people with little or no knowledge of SQL flocked to NoSQL because of how "easy" it was to add data and find data... after learning the new query syntax.

But now people seem to be coming to their senses: http://www.theregister.co.uk/2013/08/30/google_f1_deepdive/

Obviously this is in response to a move by Google (because whatever Google does must be the best thing anybody can do, right? It's not like they started using NoSQL in the first place... oh wait...), but at least this makes people re-think their choices, and perhaps saves them from destroying their data through ignorance.

Saturday 7 June 2014

Why I like PostgreSQL more than MySQL: GET LOCK works.

MySQL uses GET_LOCK('lock_name') to create an advisory lock. However, if you successfully get a lock, let's say 'A', and you then try to get lock 'B' during the same session, you will silently release lock 'A'.

Not only does this mean that you can never use more than one lock at a time, which is just silly, it also means that you cannot trust any lock you create to actually stay in place until you release it. And being able to rely on the lock is why you are using the database for this in the first place.

But perhaps most worryingly, MySQL does not even issue a warning about releasing the first lock. You have no way of ever detecting that a lock was released implicitly. You don't notice this problem until your data is corrupted to the point where users start to complain, and then you'll have a lovely time debugging it.


In PostgreSQL you can use pg_advisory_lock() as often as you like, to take any number of locks, without ever accidentally releasing any of them.
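
A sketch of the PostgreSQL behaviour:

SELECT pg_advisory_lock(1);   -- take lock A
SELECT pg_advisory_lock(2);   -- take lock B; lock A stays firmly in place
SELECT pg_advisory_unlock(1); -- locks are only ever released explicitly
SELECT pg_advisory_unlock(2);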

What you want is not always what you need.

A few days ago I read a question on the PostgreSQL mailing lists where someone asked if there is an Arduino library to connect to PostgreSQL. He had a network of hundreds of sensors worldwide, measuring weather data and sending it back to a central database.

The short answer was "No, use HTTP". The poster was not happy with this answer and explained that his MySQL solution works just fine, that MySQL is super for supporting it, and that it was a real shame that the PostgreSQL community flat out refused to create this.

The community replied with a couple of arguments:
- An Arduino has limited memory. A library that can speak native PostgreSQL uses up quite a lot of that memory, while the application itself does not require any of its functionality.
- Is it wise to have a $20 piece of hardware, located somewhere in the world, be able to log in directly to a database?
- What happens when the master database is forced to upgrade and the protocol changes? How can you upgrade all the Arduinos remotely, simultaneously, and with an option to revert if there is a problem?
- Connecting to the database directly requires that the database's port is open in all routers etc. between the node and the server, and the first thing any sysadmin will do is close all ports he doesn't think necessary, very much including 3306 and 5432.

Using HTTP solves all these problems: it uses only a handful of bytes to send the data, it uses port 80, and it connects to a regular webserver that hides the database behind a protocol that has no need to ever change.

The moral of the story is that if you ask people a question, you may not get the answer you want, but you may get the answer you need. The key is to realise that your knowledge has limits, as does your experience. You *will* encounter people who have thought things through, or who have more experience than you. It is up to you to investigate their claims and use the knowledge to your advantage.




What is a "slow query"?

A slow query is a query that takes longer than you'd like it to take.

Note the words "longer than you'd like". There is no hard limit, no number of milliseconds that defines how long "a query" should take. If it takes 6.2 seconds, then that's how long it takes, and that's NOT a problem until you have a need to do it in less than 6.2 seconds.

I have been in serious discussions (including rude words) with hosters who thought they could just pick '1 second' as the maximum runtime for a query, and kill it if it went over that limit. Their reasoning was literally that queries don't take more than a few milliseconds, so anything over a second must be a serious problem. This stems entirely from MySQL's history of not being able to cope with complex queries, and the subsequent programming style of using lots and lots of very, very small queries. Small queries complete quickly, and if you have no experience working with databases it's logical to assume that sub-second queries are the way to go.

This is of course complete and utter <rude word>

If you need to update 1.5 million rows, there is simply no way that that is going to complete in less than a second, and if you instead run 1.5 million separate update queries that each take 2 ms, the whole operation is going to take 3,000 seconds (1,500,000 × 0.002 s, or 50 minutes), which is many times as long as the single update query.

The only case where a query can take "too long" is when *you* need the task to be completed quicker.

There is no such thing as "too long"; there are only "too long for your requirements" and "longer than it needs to be".


Sunday 1 June 2014

Use the framework Luke!

Every single developer I have met in my career has written his own framework, and every single one of them was full of workarounds for several serious mistakes in the design of the framework, which he initially did not have time to fix, and now cannot afford to fix for fear of creating more errors.

I've never seen a homemade framework that implemented any kind of automated testing.

It's simply scary how many developers waste months of development time just so they can write some code themselves, knowing that what they write will not be up to scratch. It usually takes a new employee looking at the code before they realize what they did wrong, but by that time, again, it's not an option to fix it.

So, do yourself a favour: go with an existing framework. Sure, it will not work the way you think it should, but realistically: you are not a framework designer, you have not thought it through. They have, and they have already dealt with all the poop that you have probably not even seen before.

Sunday 25 May 2014

Dutch students...

The final exams have begun for high schools in the Netherlands, and as usual there are many complaints about the exams. Most of the complaints are about the conditions in the classroom and errors in the questions. This year around 200,000 students are taking part, and with four days to go the number of complaints stands at 107,000.

The number of complaints is ridiculous, but about the same as last year. This year, however, there are a few unusual and quite worrying complaints about the words used in the questions. Unusual, complex words like 'sidenote' and 'partial'.

It probably sounds snobbish, but how the hell can you live to be 16 and never encounter the words 'sidenote' or 'partial'? Ok, 'partial' does have two Dutch counterparts, 'partieel' and 'gedeeltelijk', but still: 16 years without ever encountering 'partieel'?

Is our education system failing, or are students getting more stupid? My fear is the latter, because they belong to the Google generation that uses their phones to live their lives. There is no margin to put a sidenote in, and the internet does not use long words, because they simply don't fit on the tiny screens.


So, if you want to grow up with a working brain: lose the phone!

Wednesday 21 May 2014

Out-of-the-box performance, and why you should not care.

Every now and then someone wants to compare the performance of two databases, one of which he usually doesn't know anything about. In an attempt to make it a fair fight he does two clean installs and does not tune the database he has experience with, thinking that two untuned databases will make equal use of the server's resources and thus give some sort of baseline performance indication.

The truth is that databases ship with default settings that are designed to make sure that the clean install will start up on any server it is installed on. It does not assume that it can hog 80% of RAM the way a production database can, and it does not tie itself to n-1 of the available cores. Consequently, the performance of cleanly installed, untuned databases is very bad.

However, this means exactly nothing because nobody runs untuned database servers. The comparison is between two setups that do not occur in the real world.

If you want to know how a database will perform in your situation then you need to test it in your situation; install it on a server that meets the database's preferences (yes, databases prefer certain hardware), tune it to the size of the database, design the schema to make full use of the available features, and run queries under a real-world load.


Sunday 27 April 2014

Stupid quotes

SELECT 'foo' IN ('foo'
                 'bar');
 ?column? 
----------
 f
(1 row)


So what went wrong here? Two things. The line break in the query is just presentation, and when it's left out you get:

SELECT 'foo' IN ('foo''bar');
 ?column? 
----------
 f
(1 row)

Which makes sense, because the two single quotes form an escaped single quote. The comparison is actually against "foo'bar", and of course that's not equal to "foo".

About using a phrase as a password...

Today I read an article about "wasting user's time".

In this article, the author claims: "A long password phrase is as secure as a short password with numbers and symbols yet easier to remember", showing a screenshot from this website.

This is a very dangerous thing to say.

Using a phrase means you are actually reducing the number of possible combinations, because a phrase will use only known words and follow a known grammar. Phrases are predictable.
The sentence "i love pizza" is said to take 546 years to hack, but I'd argue it's more like two minutes. Most people will feel the need to start their personal passphrase with "I", they just do. Then they will describe something about themselves, and it will be generic and simple. "I like", "I love", "I think"; follow that with any subject you can love, like, hate or enjoy, and that takes care of most of your "secure phrases".
You still have to incorporate capital letters etc. to make the words unpredictable, and then you're back where you started: "I L0ve P1zzA!?"


Thursday 17 April 2014

Reblog: NoSQL brings down three bitcoin exchanges

A story of fail, and why you should use an RDBMS when you are working with money.

Read the full article at: http://hackingdistributed.com/2014/04/06/another-one-bites-the-dust-flexcoin/?utm_content=buffer8316b&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

On creating a multilingual website.

How to create a multilingual website?

Easy: don't do it.

I mean: don't try to create a single website that can handle all content in many languages.
No matter how hard you try, you will never actually have all content in all languages at all times. That's just how it is. What's more, you will have to make exceptions for pages that simply will never exist in a particular language. Don't think you won't; everybody does, and so will you.

Also, words have different lengths in most languages; you cannot simply translate "go home" to the Dutch "ga terug naar huis" and expect it to fit in the same space. You would have to create some überflexible HTML that makes the site look good no matter how long or short the translations get.

What you will end up with is a complex system consisting mostly of exceptions, which is hard to maintain and ultimately not very pretty to look at.

Solution: create a separate copy of the website for each language. That way you don't have to make any exceptions, every site can be perfectly tailored to its language, and it will never link to pages that don't yet exist in that language.

"But then updating the site's html requires separate actions for each site!"

Yes, you will have to do the same thing several times, but not having any exceptions in the HTML/CSS/backend means you can pretty much copy/paste the changes without breaking anything.




Wednesday 16 April 2014

Why I like PostgreSQL more than MySQL: MariaDB's new dynamic columns vs the age-old HStore.

MariaDB has introduced dynamic columns, and MySQL users are amazed. They can now store key/value pairs in a column and fetch them back out by key.

Sound familiar? It should, because PostgreSQL has supported HStore for years, alongside XML, and recent versions have enabled JSON for the same purpose.

And of course PostgreSQL can use functional, partial, and GIN indexes to index the content of HStore, JSON and XML columns, making them fast to query against.
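
For example, a sketch of both kinds of index on a hypothetical products table with an HSTORE column attrs:

CREATE INDEX idx_products_colour ON products ((attrs -> 'colour'));  -- one key
CREATE INDEX idx_products_attrs  ON products USING GIN (attrs);      -- all keys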



Tuesday 15 April 2014

Why I like PostgreSQL more than MySQL: fulltext search.

Yes, I know MySQL has fulltext search too, but its functionality is limited to doing a boolean search.

In PostgreSQL you get access to a much more useful set of features, such as dictionaries that allow stemming (also finding 'swim' when you search for 'swimming'), filtering of stopwords, matching on word proximity, etc.
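
A sketch of the stemming at work:

SELECT to_tsvector('english', 'I went swimming yesterday')
       @@ to_tsquery('english', 'swim');  -- true: 'swimming' stems to 'swim'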

For a practical implementation of a so-called "good enough" search engine in pure PostgreSQL, see this blogpost: http://blog.lostpropertyhq.com/postgres-full-text-search-is-good-enough/?utm_source=dbweekly&utm_medium=email

Monday 14 April 2014

Why I like PostgreSQL more than MySQL: ENUM

Opinions vary when it comes to ENUMs, but one thing everybody agrees on is that they should work.

Everyone except MySQL, see: http://blog.endpoint.com/2014/04/sanity-thy-name-is-not-mysql.html

Would any MySQL users care to try justifying that behaviour? :-)

Sunday 13 April 2014

About using non-standard SQL features.

A common argument I hear when I mention the many features of PostgreSQL is: "but those are not standard SQL; if we use them we can't easily migrate to a different database." That sounds like a good argument at first; the less work you have to do, the better, right?

Well... no.

Custom features are custom because they either solve a common problem that is tricky to do in standard SQL, or because they are much, much faster than standard SQL. Most applications never migrate, and those that do, do it exactly once. Forcing an application to be inefficient throughout its lifetime, just so you can shave a few hours off a migration that you probably won't do anyway, is just stupid.

Perhaps more importantly, even the simplest pieces of SQL can and will contain database-specific differences. For example: MySQL is not case-sensitive in string comparisons. In short: you will have to check *EVERY* query anyway. Writing complex code to replace a custom SQL feature just makes things harder to migrate.


So what of the handful of cases that do migrate? Well, they don't really migrate, they gain support for another database. Supporting a new database takes time, and while you are writing the new code you will have to keep the production side running on the old database. You cannot remove any of the old code, because you'd risk not being able to fix urgent issues in production.

What you should do is split the database-related code into a separate section of the source. When you need to start supporting a new database, you copy that piece of the code, modify it to suit the new database, and instruct your application to use the new code.

Saturday 12 April 2014

The power of communities, and why that's not always a good thing.

Communities are great. They provide support for many products and enable you to share in the knowledge of others. Wonderful!

The only drawback is that the communities of popular tools like PHP and MySQL consist mostly of enthusiasts and beginners. They mean well and they spend hours answering questions to the best of their ability, but they simply do not have enough experience to give the "correct" answer.

What's worse, as I mentioned in my other article "don't contradict the locals", forums are ruled by a few regular users who are considered to be the experts by the rest. They usually don't care about having the right answer, they care about being right, and they will come up with the most idiotic reasoning as to why their answer is right.

Consequently, the bad knowledge is promoted and developers keep messing about with bad solutions.
PHP programmers love the intval() function, and MySQL users are quite happy to add a "GROUP BY" to every query, just in case...


Friday 11 April 2014

Active intelligence...

Sometimes I find myself wondering what the world is coming to, and just when I think it can't be that bad I see something that makes me think it's actually much worse...

Today I got a message from meetup.com, informing me about a new group that I might be interested in, called "Active intelligence". Sounds cool, no? So let's look at their intro:

"Active Intelligence Academy provides full day, intensive workshops to get you up to speed on technology topics that you're most interested in. Like us on Facebook!" 

Sigh...

Thursday 10 April 2014

GROUP BY, use it or lose it.

Dear MySQL users,

Please read the flipping manual about what GROUP BY is for, and stop sticking it in every single query you write.

Thank you.

Tuesday 8 April 2014

Why I like PostgreSQL more than MySQL: regular expressions in a string-replace.

MySQL's regexp functions are limited to matching patterns; PostgreSQL can also use them to fetch substrings and to replace substrings.

For example, to remove any whitespace from a Dutch zipcode:

SELECT regexp_replace('   134 5 AA   ', E'\\s', '', 'g');

That is: replace any whitespace character with an empty string, globally.

This way you can write a CHECK constraint, trigger or rule that accepts a zip code in the raw format, and validate and clean it before storing it, making life a little easier on the application.
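A minimal sketch of that idea (the table and constraint names are made up): enforce the cleaned format with a CHECK constraint, and strip the whitespace on the way in:

DROP TABLE IF EXISTS addresses;
CREATE TABLE addresses (
    zipcode TEXT CONSTRAINT zipcode_format_check CHECK (zipcode ~ '^[1-9][0-9]{3}[A-Z]{2}$')
);

-- Strip whitespace and uppercase before storing:
INSERT INTO addresses (zipcode)
VALUES (upper(regexp_replace('   134 5 AA   ', E'\\s', '', 'g')));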

dinsdag 1 april 2014

To the management: read this.

Stefan Koopmanschap wrote a nice article that will seem very familiar to most developers, and that most managers really ought to read.

It explains, among other things, how management often manages to actually prevent the people they manage from doing the tasks they're assigned to do.

http://leftontheweb.com/blog/2014/03/28/Why_Your_Current_Developer_Is_Terrible/

zondag 23 maart 2014

SQL Performance thingies, part 1. Don't modify a datetime to compare against NOW().

When searching for records of a particular age, people tend to do this:

SELECT * FROM table WHERE (NOW() - datefield) >= INTERVAL 5 DAY;

or

SELECT * FROM table WHERE datefield + INTERVAL 5 DAY >= NOW()

These are bad because they require the database to calculate the difference for every record, and no index can help with that.

The best way to do this is to calculate the actual date value that marks '5 days ago', i.e. NOW() - INTERVAL 5 DAY:


SELECT * FROM table WHERE datefield >= NOW() - INTERVAL 5 DAY;

This way the database will see that NOW() - INTERVAL 5 DAY is a constant within the transaction, and calculate it only once. Comparing a date field against a constant can use an index, so this is the fastest way of doing this.
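For completeness, the index that makes the fast form fast is just an ordinary B-tree index on the date column (the table name here is hypothetical, since "table" itself is a reserved word):

-- An ordinary B-tree index is all the constant comparison needs:
CREATE INDEX orders_datefield_idx ON orders (datefield);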

zondag 9 maart 2014

Everyday PostgreSQL: Counting votes.

Problem: a movie fan site allows users to vote, and rather than trying to block users from voting multiple times, they just log every vote and make up their minds later.


DROP TABLE IF EXISTS movies;
DROP TABLE IF EXISTS votes;

CREATE TABLE movies (name VARCHAR(100));
CREATE TABLE votes (userid INT, moviename VARCHAR(100));

INSERT INTO movies (name) VALUES ('Austin Powers'), ('Tank girl'), ('The fifth element');
INSERT INTO votes (userid, moviename) VALUES (1,'Tank girl'),(1,'Tank girl'),(1,'Tank girl'),(1,'The fifth element'),(2,'Austin Powers'),(2,'Tank girl'),(3,'Tank girl'),(3, 'The fifth element');


So, how do you count unique votes per user, per movie?

One way is to do exactly what the problem describes: count the number of unique votes per movie.
First, create a list of unique votes per movie, using a plain old DISTINCT:

SELECT DISTINCT userid, moviename FROM votes;

Then use that as a source of vote information, in a nice CTE:

WITH distinctvotes AS (SELECT DISTINCT userid, moviename FROM votes)
SELECT COUNT(*), moviename
FROM distinctvotes
INNER JOIN movies ON movies.name = distinctvotes.moviename
GROUP BY moviename
ORDER BY 1 DESC;

This approach also enables you to define more rules for which votes count and which don't. The CTE query could, for example, exclude all users who voted more than ten times in the same minute, as sketched below. Putting that query in a CTE leaves the actual count query clean and easy to manage.
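A hedged sketch of that idea, assuming the votes table also carried a votetime timestamp column (the demo table above doesn't have one):

WITH distinctvotes AS (
    SELECT DISTINCT userid, moviename FROM votes
    WHERE userid NOT IN (
        -- Hypothetical: drop users with more than ten votes in one minute.
        SELECT userid FROM votes
        GROUP BY userid, date_trunc('minute', votetime)
        HAVING COUNT(*) > 10
    )
)
SELECT COUNT(*), moviename
FROM distinctvotes
INNER JOIN movies ON movies.name = distinctvotes.moviename
GROUP BY moviename
ORDER BY 1 DESC;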

woensdag 5 maart 2014

Stored procedures, yes or no?

Today I stumbled onto a Stack Exchange post by a guy who was very, very clear about how extremely bad practice it is to use stored functions. His argument (in its entirety) was that his 30+ years of experience taught him that they are a maintenance nightmare.

So, is that true? No, not really, but I understand what he's on about.

When you work on scripts or source code you can update each file separately and in the case of PHP or Python you can even update separate files directly on a production server. A stored procedure can depend on other database entities, some of which may even have to be dropped before the update can proceed.

The thing is: you can't really just update separate files on a production server either, for obvious reasons, I would think. A proper update should always bring the system into a single-user mode, then do the update, test the new code, and then bring the server back into multi-user mode.

If the system is going into single-user mode anyway, it no longer matters how many changes you make; you can easily execute a single SQL file that removes and re-creates a set of dependent stored procedures. It is exactly the same operation you would do if you had to update several interdependent source files.
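A minimal sketch of such a deployment file (the function names and bodies are made up); everything is dropped and re-created in one transaction:

BEGIN;

DROP FUNCTION IF EXISTS order_total(INT);
DROP FUNCTION IF EXISTS order_discount(INT);

CREATE FUNCTION order_discount(customer_id INT) RETURNS NUMERIC AS $$
    SELECT 0.05;  -- placeholder body
$$ LANGUAGE sql;

CREATE FUNCTION order_total(customer_id INT) RETURNS NUMERIC AS $$
    SELECT 100 * (1 - order_discount(customer_id));  -- depends on order_discount
$$ LANGUAGE sql;

COMMIT;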

So no, it's not a nightmare, it's a simple matter of getting your procedures in order.

woensdag 26 februari 2014

Why I like PostgreSQL more than MySQL: Check constraints

Databases can basically only store strings and numbers. Your business, however, has a frightening number of rules and regulations about what the data should look like.

A sales price, for example, must always be more than zero. A discount must always be between zero and the sales price. A tax rate is always between 0% and 100%. The number of units per package is always one or more. A product cannot be marked as sold out if backorders are allowed. And so on...

In MySQL you can do this with triggers, which works, but is hardly elegant, let alone easy to manage.

In PostgreSQL you can use a CHECK constraint on a single field:


DROP TABLE IF EXISTS foo;
CREATE TABLE foo (id INT, price NUMERIC(10,2) CONSTRAINT legal_price_check CHECK  (price>0));
INSERT INTO foo (id, price) VALUES (1,0);

"ERROR:  new row for relation "foo" violates check constraint "legal_price_check""


Or on the table as a whole:
DROP TABLE IF EXISTS foo;
CREATE TABLE foo (id INT, price NUMERIC(10,2), weight NUMERIC(5,2), shippingcost NUMERIC(4,2),
 CONSTRAINT legal_shipping_cost CHECK  ((weight<= 100 AND shippingcost=0) OR (weight>100 and shippingcost>0)));

INSERT INTO foo (id, price, weight, shippingcost) VALUES (1,10, 90,0);
INSERT INTO foo (id, price, weight, shippingcost) VALUES (1,10, 90,10);

The first record works, the second is denied because the shipping cost should be zero if the weight is 100 or less.

Checking the progress of a MySQL ALTER TABLE query?

I've often said that MySQL users spend way too much of their time working around its shortcomings, and today I saw this post:

http://www.mysqlperformanceblog.com/2014/02/26/monitor-alter-table-progress-innodb_file_per_table/

It shows a way to see the progress of an ALTER TABLE statement in MySQL. Normally I'd say "that's a cool gadget for a special X-mas post in a blog", but he's dead serious about it: ALTER TABLE in MySQL rebuilds the entire table, which not only takes a long time on large tables, it also locks the table while doing it. In large production systems you cannot alter a large table without seriously disturbing the functionality of the system. Hence you want to be able to tell the users how long the system is going to be down: enter the progress bar.

Kudos for finding out how to do it, but seriously, shouldn't you switch to a database that does not suffer from this ridiculous issue in the first place?

dinsdag 25 februari 2014

Why I like PostgreSQL more than MySQL: PL/PgSQL

MySQL supports stored functions, true. Unfortunately it only supports the SQL-standard way, which is quite limited and not a very user-friendly language to write in. (Isn't it ironic: the one time they do stick to the standard, and they do it where they would have been better off using something custom.) One of the most annoying limitations of the SQL-standard PL language is that you cannot execute dynamic queries: queries that are composed in variables.

Dynamic queries allow you to build a query that suits the current requirements, depending on things like the content of the parameters, the content of the records and, of course, business rules.

This can be particularly useful for filtering and sorting. A function for fetching the latest products may include a sort-order parameter that controls which field should be used to determine the order, and that field will have to be added to the query in an ORDER BY clause. Obviously you could just create an ORDER BY clause that uses a large CASE statement, but query planners cannot optimize that because they can't predict which field will be used.

In PostgreSQL's PL/pgSQL you can include a command like this (from the manual):

EXECUTE 'UPDATE tbl SET '
        || quote_ident(colname)
        || ' = '
        || quote_literal(newvalue)
        || ' WHERE key = '
        || quote_literal(keyvalue);
As you can see, it concatenates a series of strings with quoted variables and calls EXECUTE on the result.
Effectively this means that your stored function can use complex logic to construct a query and execute it.

"Isn't that dangerous?"


Talking to developers, #1: what's important in databases?

From a conversation with a "developer":

"What do you think is the most important aspect of a database?" I ask.

"Performance!" The developer says.

"What is performance?" I said.

"Well it should be fast!"

"Fast at what?"

"Running queries!"

"What kind of queries?"

"All kinds, everything must be really fast."

"What about reliability?"

"What about it?"

"Well, shouldn't you be able to trust your database to save the data you tell it to save verbatim?"

"No, we do that in the application, the database just has to store the data, really quickly."

"Doesn't the application need time to make sure the data is valid?"

"Of course, but the database is much slower at that."

"Ok... so how do you do these validations?"

"We run queries on the database to see if the data we are about to send actually exists before we send it."

*FACEPALM*

zaterdag 22 februari 2014

Everyday PostgreSQL: Using XML functions for quick HTML previews.

Over the past months I've been working with OpenCV for image processing. One of the projects I'm working on requires comparing histograms. I won't go into any sort of detail about that; suffice it to say that I now have a table that holds a few million comparison results from a few thousand images that have been compared to each other.

The next step is of course to process the results and find images that meet the requirements. A regular query can return the names of the images, but I need to see the actual images to check if the query returned matching images. I could run the query, fetch the data into a language and write code that generates HTML with the names of the images in an IMG tag, or I could use PostgreSQL's XML functions to generate an HTML page, so I only have to read one field from the resultset and print it.

-- Create a table of images
DROP TABLE IF EXISTS foo;
CREATE TEMPORARY TABLE foo (id int, filename TEXT);

-- Step one: create image tags
INSERT INTO foo (id, filename) VALUES (1,'basil.jpg'),(2,'sybil.jpg'),(3,'polly.jpg');
SELECT XMLELEMENT(name img, XMLATTRIBUTES(filename AS src)) FROM foo;

-- Step 2: aggregate the image tags inside a body tag.
-- I use a CTE here to make the code more readable.
WITH imagetags AS ( SELECT XMLELEMENT(name img, XMLATTRIBUTES(filename AS src)) AS imgtag FROM foo )
SELECT XMLELEMENT(name body, XMLAGG(imgtag)) FROM imagetags;

-- The final code, added an HTML tag and a header with a CSS link.
WITH imagetags AS ( SELECT XMLELEMENT(name img, XMLATTRIBUTES(filename AS src)) AS imgtag FROM foo)
SELECT XMLELEMENT(name html, XMLELEMENT(name head, XMLELEMENT(name link, XMLATTRIBUTES('style.css' AS href, 'stylesheet' AS rel))),
XMLELEMENT(name body, XMLAGG(imgtag))) as h FROM imagetags;

Now all I have to do in the program is execute the query, fetch one record and print the value of column 'h'.

How maintainable is this? Well, you could just plonk this into a view, which makes updating it a matter of executing the view's CREATE statement once for every change, just like you'd press 'save' after updating a PHP file. In both cases you have to reload the page to see the result, so there isn't much between the two methods.
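Something like this, as a sketch (the view name is made up; note that a view on a temporary table becomes temporary itself):

CREATE OR REPLACE VIEW image_preview AS
WITH imagetags AS (
    SELECT XMLELEMENT(name img, XMLATTRIBUTES(filename AS src)) AS imgtag FROM foo
)
SELECT XMLELEMENT(name html,
           XMLELEMENT(name head,
               XMLELEMENT(name link, XMLATTRIBUTES('style.css' AS href, 'stylesheet' AS rel))),
           XMLELEMENT(name body, XMLAGG(imgtag))) AS h
FROM imagetags;

SELECT h FROM image_preview;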

vrijdag 21 februari 2014

Composer security issue, take care!

Composer has a feature that allows code from one package to overwrite code from another package, and the creators seem reluctant to fix it. Short-term fix: don't update packages you installed through Composer. Long-term fix: hope that the Composer guys have a change of heart, because their initial reaction is "it's not a big deal". Get the details here: http://blog.astrumfutura.com/2014/02/composer-downloading-random-code-is-not-a-security-vulnerability/

donderdag 20 februari 2014

Why I like PostgreSQL more than MySQL: functional indexes

Indexes enable a database to look up records quickly. They contain a sorted subset of columns from the target table, which makes them small and easy to traverse.

However, they can only help if the clause that your query uses can be solved by looking at the data that is stored in the index. For example, a query containing the clause "WHERE YEAR(datefield)=2014" will not use an index on datefield, because the index stores dates, not years.

In MySQL the only thing you can do is write a trigger that saves the output of YEAR(datefield) into a separate column or even a separate table, and index that. This is very clumsy and difficult to use in your application.

In PostgreSQL you can simply create an index on the expression EXTRACT(YEAR FROM datefield), and the database will use it whenever a clause mentions that same expression. It will of course automatically keep the index up to date, just like any other index.

CREATE INDEX datefield_year_idx ON foo (EXTRACT(YEAR FROM datefield));

This type of indexing can also be used on XML and JSON fields.
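For example, a sketch of an expression index on a key inside a JSON document (the table and key names are made up):

-- Hypothetical: documents stored as json, indexed on the 'author' key.
DROP TABLE IF EXISTS docs;
CREATE TABLE docs (payload JSON);
CREATE INDEX docs_author_idx ON docs ((payload ->> 'author'));

-- This clause can now use the index:
SELECT * FROM docs WHERE payload ->> 'author' = 'Douglas Adams';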


maandag 17 februari 2014

What is the best naming convention in databases?

The short answer is: the one you feel comfortable with.

The long answer is that every convention has its advantages and disadvantages, and both sides usually have valid points.

For example: I have always used the plural form for table names, because a table holds a collection of data from which the query will make a subselection. There is nothing singular about the entire thing. Arguments against this include that it is sometimes difficult to predict the plural form of a table name. That's sort of a valid point, if you have to guess the names of your tables, but I tend to know the names of my tables, I don't guess them.

Similarly, some people like to name their foreign key fields after the tables and fields that are involved. Personally I prefer to name the foreign key field after the role it plays in the model, because that is the information I need when I'm working with the field. If a field contains the id of the person who is the teacher for the current record's user, I name the field "teacher_id". That makes sense to me. I already know that teachers are stored in the teachers view; I don't need to put that information in every field that refers to the teachers view.

And these kinds of arguments can be given for every rule of every convention.

So, when you speak to someone who says that his method is the best, he means "I like this". And that is all it means. Listen to the reasoning and remember the things that make sense to you and that help you to write better code.

zondag 9 februari 2014

Why I like PostgreSQL more than MySQL: LIKE '%hello' can use indexes.

LIKE searches for substrings, and if the search expression starts with a wildcard, a B-tree index cannot be used: the index is sorted on the leading characters of the string, and a leading wildcard means you don't know what those are.

In PostgreSQL, however, you can create a trigram index (via the pg_trgm extension), which indexes every three-character substring of the value. So if your field contains "Hello", the index will contain "hel", "ell", "llo" and so on (yes, a fair number of trigrams is created, and care should be taken if high insert performance is required). When searching, the search string is chopped into the same trigrams, those are looked up in the index, and a result comes back very quickly indeed.
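A minimal sketch (the table and column are made up; pg_trgm ships with PostgreSQL as a contrib extension, and its LIKE support requires 9.1 or later):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (name TEXT);
CREATE INDEX foo_name_trgm_idx ON foo USING gin (name gin_trgm_ops);

-- This leading-wildcard search can now use the trigram index:
SELECT * FROM foo WHERE name LIKE '%hello%';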

dinsdag 4 februari 2014

Why I like PostgreSQL more than MySQL: triggers on foreign key events

Triggers let you... well, "trigger" business logic whenever a record's content changes. A common use for this is auditing changes, or preventing illegal values by running a check before accepting the query.

The single biggest reason to do these things in a trigger is that the trigger will fire whenever a particular action hits the target table, regardless of how that change was initiated. If your PHP script issues a query, the trigger will fire. If you run a query by hand, the trigger will fire. If you update a record which leads to a cascading update, the trigger will fire.

Except when it's MySQL, because MySQL does not fire triggers if the change is the result of an ON DELETE CASCADE or ON UPDATE CASCADE. So, if you are auditing changes to your shopping cart table, you will never see an entry for records that are removed by a cascading delete on the order table.

PostgreSQL does not suffer from this asinine exception.
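A minimal sketch of such an audit trigger (all names are made up); in PostgreSQL this fires for cascaded deletes too:

DROP TABLE IF EXISTS shoppingcart;
DROP TABLE IF EXISTS cart_audit;
CREATE TABLE shoppingcart (id INT, product TEXT);
CREATE TABLE cart_audit (deleted_at TIMESTAMP, cart_id INT, product TEXT);

CREATE OR REPLACE FUNCTION audit_cart_delete() RETURNS trigger AS $$
BEGIN
    -- Log the deleted row before letting the delete proceed.
    INSERT INTO cart_audit VALUES (now(), OLD.id, OLD.product);
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cart_delete_audit
    AFTER DELETE ON shoppingcart
    FOR EACH ROW EXECUTE PROCEDURE audit_cart_delete();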

vrijdag 31 januari 2014

Why I like PostgreSQL more than MySQL: Substring() can use regular expressions

The SUBSTRING() function accepts regular expressions and can return a matching capture group:

SELECT
SUBSTRING('<a href="http://www.thedolmen.com/">The Dolmen</a>' , 'href=\"([^\"]+)\"')

Returns:
         substring
---------------------------
 http://www.thedolmen.com/
(1 row)


Notice that the expression includes elements that are not returned in the result.
And be aware that there can be only one capture group in the regular expression; SUBSTRING() can, after all, only return one value.

donderdag 30 januari 2014

Why I like PostgreSQL more than MySQL: Windowing

When you deal with customers in a shop you will want to log changes to your stock and your orders, both to generate statistics and to have a way of tracing any problems. The simplest way to do this is through a trigger that simply creates a copy of the entire record in a separate auditing table.

MySQL can do this just as well as PostgreSQL, but analysing the data is a different story.

Whenever you look at a history of events, one of the first things you will want to know is: how much time elapsed between each row? Traditionally you would solve that by adding a self-join to fetch the previous record, and subtracting the two timestamps.

Windowing allows you to do it much faster, in a much more flexible way that is infinitely more readable:

SELECT
timestampfield - LAG(timestampfield) OVER (ORDER BY timestampfield)
FROM table;

Yes, it really is that simple. LAG() fetches a value from the previous record, and the OVER (ORDER BY ...) defines what "previous" means: the record with the highest timestamp that is still lower than the current record's timestamp.

Windowing also allows you to fetch the "second to previous" record, or "the ten previous records", and obviously you can do the same with records following the current record, using LEAD().
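A runnable sketch of that (the audit table and timestamps are made up), using LAG's optional offset argument for the "second to previous" case:

DROP TABLE IF EXISTS stock_audit;
CREATE TABLE stock_audit (event_time TIMESTAMP);
INSERT INTO stock_audit VALUES
    ('2014-01-30 10:00'), ('2014-01-30 10:05'), ('2014-01-30 10:17');

SELECT event_time,
       event_time - LAG(event_time)    OVER (ORDER BY event_time) AS since_previous,
       event_time - LAG(event_time, 2) OVER (ORDER BY event_time) AS since_second_previous
FROM stock_audit;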

There are many other things you can do with windowing, like subdividing the data into groups of equal values in a particular field, and ranking.

See also: http://www.postgresql.org/docs/9.1/static/tutorial-window.html

dinsdag 28 januari 2014

Is Oracle finally getting a grip on MySQL?

Today the MySQL blogs are talking about a change that Oracle is thinking of making in MySQL 5.7. The change: to enable ONLY_FULL_GROUP_BY by default.

Apparently this is a very big deal, even though it is just a default setting which can simply be disabled (like most things in MySQL that provide reliability).

I guess I'm pleasantly surprised that so many MySQL users actually know about this feature...

See also: http://www.tocker.ca/2014/01/24/proposal-to-enable-sql-mode-only-full-group-by-by-default.html

dinsdag 14 januari 2014

Why I like PostgreSQL more than MySQL: RETURNING

Every once in a while you get a request like: "When I run this delete query I want to see which records have been deleted."

The quick and dirty solution would be to run a SELECT ... FOR UPDATE query first, to get the data, and follow that with a DELETE query to remove it. But that requires two round trips to the database, and if the query parameters are complex, you basically make the whole thing twice as slow as it needs to be.

In PostgreSQL you just add "RETURNING *" to the DELETE query.

DROP TABLE IF EXISTS demo;
CREATE TABLE demo (id INTEGER, firstname TEXT, score INTEGER);
INSERT INTO demo VALUES (1,'Carl',24), (2,'Pete',10);

DELETE FROM demo WHERE firstname='Pete' RETURNING *;

That removes the record for 'Pete', and returns it, in one trip to the database.

Similarly, you can fetch the new record after an update:

UPDATE demo SET score=score+6 WHERE firstname='Carl' RETURNING score;

which returns just one record with score=30, the new score for Carl after adding 6 to it.
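RETURNING also works on INSERT, which is handy for fetching generated keys without a second query; a small sketch (the table is made up):

DROP TABLE IF EXISTS demo2;
CREATE TABLE demo2 (id SERIAL, firstname TEXT);

-- Returns the generated id in the same round trip:
INSERT INTO demo2 (firstname) VALUES ('Anna') RETURNING id;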


zondag 12 januari 2014

Why I like PostgreSQL more than MySQL: Combining Queries

So you have a shop that sells power washers, and your flagship product is the PW-100. About a week ago you received a shipment of nozzles that fix a known problem in the PW-100, but this problem only affects the PW-100 when it is used with the universal extender EX-900. Your customers are not ordering the nozzles as often as you think they should, so you want to send them an email telling them about this new nozzle. Of course you are very much against useless information, so you only want to send the email to people who have actually bought the PW-100 and the EX-900, and only if they have not yet ordered the new nozzle.

This sounds like a bunch of INNER and LEFT joins, but PostgreSQL has proper support for EXCEPT and INTERSECT so you can write the query like this:


SELECT customer_id FROM orders INNER JOIN orderlines ON orderlines.order_id = orders.order_id WHERE product_id='PW-100'

INTERSECT

SELECT customer_id FROM orders INNER JOIN orderlines ON orderlines.order_id = orders.order_id WHERE product_id='EX-900'

EXCEPT

SELECT customer_id FROM orders INNER JOIN orderlines ON orderlines.order_id = orders.order_id WHERE product_id='New_nozzle_for_PW-100'

The power of INTERSECT and EXCEPT becomes more apparent as the queries get more complex. This example selects only one field, but of course it works just as well when selecting a list of fields, which would be a pain to express with JOIN syntax.