# Even Friendlier SQL with DuckDB
TL;DR: DuckDB continues to push the boundaries of SQL syntax to both simplify queries and make more advanced analyses possible. Highlights include dynamic column selection, queries that start with the FROM clause, function chaining, and list comprehensions. We boldly go where no SQL engine has gone before!

Who says that SQL should stay frozen in time, chained to a 1999 version of the specification? As a comparison, do folks remember what JavaScript felt like before Promises? Those didn't launch until 2012! It's clear that innovation at the programming syntax layer can have a profoundly positive impact on an entire language ecosystem.

We believe there are many valid reasons for innovation in the SQL language, among them opportunities to simplify basic queries and to make more dynamic analyses possible. Many of these features arose from community suggestions! Please let us know your SQL pain points on Discord or GitHub and join us as we change what it feels like to write SQL!

If you have not had a chance to read the first installment in this series, please take a quick look here.

## The future is now

The first few enhancements in this list were included in the "Ideas for the Future" section of the prior post.

### Reusable column aliases

When working with incremental calculated expressions in a SELECT statement, traditional SQL dialects force you to either repeat the full expression for each column or wrap each step of the calculation in its own common table expression (CTE). Now, any column alias can be reused by subsequent columns within the same SELECT statement. Not only that, but these aliases can be used in the WHERE and ORDER BY clauses as well.

**Old way 1: Repeat yourself**

```sql
SELECT
    'These are the voyages of the starship Enterprise...' AS intro,
    instr('These are the voyages of the starship Enterprise...', 'starship') AS starship_loc,
    substr('These are the voyages of the starship Enterprise...',
           instr('These are the voyages of the starship Enterprise...', 'starship')
           + len('starship') + 1) AS trimmed_intro;
```

**Old way 2: All the CTEs**

```sql
WITH intro_cte AS (
    SELECT 'These are the voyages of the starship Enterprise...' AS intro
), starship_loc_cte AS (
    SELECT
        intro,
        instr(intro, 'starship') AS starship_loc
    FROM intro_cte
)
SELECT
    intro,
    starship_loc,
    substr(intro, starship_loc + len('starship') + 1) AS trimmed_intro
FROM starship_loc_cte;
```

**New way**

```sql
SELECT
    'These are the voyages of the starship Enterprise...' AS intro,
    instr(intro, 'starship') AS starship_loc,
    substr(intro, starship_loc + len('starship') + 1) AS trimmed_intro;
```

| intro | starship_loc | trimmed_intro |
|---|---|---|
| These are the voyages of the starship Enterprise… | 30 | Enterprise… |
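The example above only reuses aliases within the SELECT list. As a minimal sketch of the WHERE and ORDER BY cases as well (the `crew` table and its columns here are hypothetical, not from the original examples):

```sql
-- Hypothetical table for illustration: crew(first_name, last_name)
SELECT
    concat(first_name, ' ', last_name) AS full_name,
    len(full_name) AS name_length       -- reuse the alias in a later column
FROM crew
WHERE name_length > 10                  -- ...and in the WHERE clause
ORDER BY name_length DESC;              -- ...and in ORDER BY
```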
### Dynamic column selection

Databases typically prefer strictness in column definitions and flexibility in the number of rows. This helps enforce data types and record column-level metadata. However, in data science workflows and elsewhere, it is very common to generate columns dynamically (for example, during feature engineering). No longer do you need to know all of your column names up front! DuckDB can select and even modify columns based on regular expression pattern matching, EXCLUDE or REPLACE modifiers, and even lambda functions (see the section on lambda functions below for details!).

Let's take a look at some facts gathered about the first season of Star Trek. Using DuckDB's httpfs extension, we can query a CSV dataset directly from GitHub. It has several columns, so let's DESCRIBE it.

```sql
INSTALL httpfs;
LOAD httpfs;

CREATE TABLE trek_facts AS
    SELECT * FROM 'https://raw.githubusercontent.com/Alex-Monahan/example_datasets/main/Star_Trek-Season_1.csv';

DESCRIBE trek_facts;
```

| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| season_num | BIGINT | YES | NULL | NULL | NULL |
| episode_num | BIGINT | YES | NULL | NULL | NULL |
| aired_date | DATE | YES | NULL | NULL | NULL |
| cnt_kirk_hookups | BIGINT | YES | NULL | NULL | NULL |
| cnt_downed_redshirts | BIGINT | YES | NULL | NULL | NULL |
| bool_aliens_almost_took_over_planet | BIGINT | YES | NULL | NULL | NULL |
| bool_aliens_almost_took_over_enterprise | BIGINT | YES | NULL | NULL | NULL |
| cnt_vulcan_nerve_pinch | BIGINT | YES | NULL | NULL | NULL |
| cnt_warp_speed_orders | BIGINT | YES | NULL | NULL | NULL |
| highest_warp_speed_issued | BIGINT | YES | NULL | NULL | NULL |
| bool_hand_phasers_fired | BIGINT | YES | NULL | NULL | NULL |
| bool_ship_phasers_fired | BIGINT | YES | NULL | NULL | NULL |
| bool_ship_photon_torpedos_fired | BIGINT | YES | NULL | NULL | NULL |
| cnt_transporter_pax | BIGINT | YES | NULL | NULL | NULL |
| cnt_damn_it_jim_quote | BIGINT | YES | NULL | NULL | NULL |
| cnt_im_givin_her_all_shes_got_quote | BIGINT | YES | NULL | NULL | NULL |
| cnt_highly_illogical_quote | BIGINT | YES | NULL | NULL | NULL |
| bool_enterprise_saved_the_day | BIGINT | YES | NULL | NULL | NULL |

#### COLUMNS() with regular expressions

The COLUMNS expression can accept a string parameter that is a regular expression, and it returns all column names that match the pattern. How did warp usage change over the first season? Let's examine any column whose name contains the word warp:

```sql
SELECT
    episode_num,
    COLUMNS('.*warp.*')
FROM trek_facts;
```

| episode_num | cnt_warp_speed_orders | highest_warp_speed_issued |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 2 | 1 | 1 |
| 3 | 1 | 0 |
| … | … | … |
| 27 | 1 | 1 |
| 28 | 0 | 0 |
| 29 | 2 | 8 |

The COLUMNS expression can also be wrapped by other functions to apply those functions to each selected column. Let's simplify the above query to look at the maximum values across all episodes:

```sql
SELECT
    MAX(COLUMNS('.*warp.*'))
FROM trek_facts;
```

| max(trek_facts.cnt_warp_speed_orders) | max(trek_facts.highest_warp_speed_issued) |
|---|---|
| 5 | 8 |

We can also create a WHERE clause that applies across multiple columns. All columns must match the filter criteria, which is equivalent to combining the individual filters with AND. Which episodes had at least 2 warp speed orders and reached at least warp speed 2?

```sql
SELECT
    episode_num,
    COLUMNS('.*warp.*')
FROM trek_facts
WHERE
    COLUMNS('.*warp.*') >= 2;
    -- cnt_warp_speed_orders >= 2
    -- AND
    -- highest_warp_speed_issued >= 2
```

| episode_num | cnt_warp_speed_orders | highest_warp_speed_issued |
|---|---|---|
| 14 | 3 | 7 |
| 17 | 2 | 7 |
| 18 | 2 | 8 |
| 29 | 2 | 8 |

#### COLUMNS() with EXCLUDE and REPLACE

Individual columns can also be either excluded or replaced prior to applying calculations on them. For example, since our dataset only includes season 1, we do not need to find the MAX of that column. It would be highly illogical.

```sql
SELECT
    MAX(COLUMNS(* EXCLUDE (season_num)))
FROM trek_facts;
```

| max(trek_facts.episode_num) | max(trek_facts.aired_date) | max(trek_facts.cnt_kirk_hookups) | … | max(trek_facts.bool_enterprise_saved_the_day) |
|---|---|---|---|---|
| 29 | 1967-04-13 | 2 | … | 1 |

The REPLACE syntax is also useful when applied to a dynamic set of columns. In this example, we convert the dates into timestamps prior to finding the maximum value in each column. Previously this would have required an entire subquery or CTE to pre-process just that single column!

```sql
SELECT
    MAX(COLUMNS(* REPLACE (aired_date::TIMESTAMP AS aired_date)))
FROM trek_facts;
```

| max(trek_facts.season_num) | max(trek_facts.episode_num) | max(aired_date := CAST(aired_date AS TIMESTAMP)) | … | max(trek_facts.bool_enterprise_saved_the_day) |
|---|---|---|---|---|
| 1 | 29 | 1967-04-13 00:00:00 | … | 1 |
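As an aside (not shown in the original examples above), the same EXCLUDE and REPLACE modifiers also work on a bare `*` in ordinary queries, outside of COLUMNS():

```sql
-- Drop one column from the output without listing the other 17
SELECT * EXCLUDE (season_num)
FROM trek_facts;

-- Return every column, casting aired_date on the fly
SELECT * REPLACE (aired_date::TIMESTAMP AS aired_date)
FROM trek_facts;
```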
#### COLUMNS() with lambda functions

The most flexible way to query a dynamic set of columns is with a lambda function. This allows any matching criteria to be applied to the names of the columns, not just regular expressions. (See the section on lambda functions below for more details.) For example, if the LIKE syntax is more comfortable, we can select columns matching a LIKE pattern rather than a regular expression:

```sql
SELECT
    episode_num,
    COLUMNS(col -> col LIKE '%warp%')
FROM trek_facts
WHERE
    COLUMNS(col -> col LIKE '%warp%') >= 2;
```

| episode_num | cnt_warp_speed_orders | highest_warp_speed_issued |
|---|---|---|
| 14 | 3 | 7 |
| 17 | 2 | 7 |
| 18 | 2 | 8 |
| 29 | 2 | 8 |

### Automatic JSON to nested types conversion

The first installment in the series mentioned JSON dot notation references as future work. However, the team has gone even further! Instead of referring to JSON-typed columns using dot notation, JSON can now be automatically parsed into DuckDB's native types, bringing significantly faster performance and better compression, as well as that friendly dot notation!

First, install and load the httpfs and json extensions if they do not come bundled with the client you are using. Then query a remote JSON file directly as if it were a table!

```sql
INSTALL httpfs;
LOAD httpfs;
INSTALL json;
LOAD json;

SELECT starfleet[10].model AS starship
FROM 'https://raw.githubusercontent.com/vlad-saling/star-trek-ipsum/master/src/content/content.json';
```

| starship |
|---|
| USS Farragut - NCC-1647 - Ship on which James Kirk served as a phaser station operator. Attacked by the Dikironium Cloud Creature, killing half the crew. ad. |

Now for some new SQL capabilities beyond the ideas from the prior post!

## FROM first in SELECT statements

When building a query, the first thing you need to know is where your data is coming FROM. Why, then, is that the second clause in a SELECT statement? No longer! DuckDB is building SQL as it should have always been: putting the FROM clause first! This addresses one of the longest-standing complaints about SQL, and the DuckDB team implemented it in 2 days.

```sql
FROM my_table SELECT my_column;
```

Not only that, the SELECT clause can be omitted entirely, and DuckDB will assume all columns should be SELECTed. Taking a look at a table is now as simple as:

```sql
FROM my_table;
-- SELECT * FROM my_table
```

Other statements like COPY are simplified as well:

```sql
COPY (FROM trek_facts) TO 'phaser_filled_facts.parquet';
```

This has an additional benefit beyond saving keystrokes and staying in a development flow state: autocomplete has much more context when you begin to choose columns to query. Give the AI a helping hand!

Note that this syntax is completely optional, so your SELECT * FROM keyboard shortcuts are safe, even if they are obsolete… 🙂

## Function chaining

Many SQL blogs advise the use of CTEs instead of subqueries. Among other benefits, they are much more readable: operations are compartmentalized into discrete chunks, and they can be read in order from top to bottom instead of forcing the reader to work their way from the inside out. DuckDB enables the same interpretability improvement for every scalar function! Use the dot operator to chain functions together.
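As a minimal sketch of what chaining looks like (the example string is our own, not from the original post), the expression before each dot is passed as the first argument to the function after it:

```sql
-- Each function receives the expression before the dot as its first argument
SELECT
    ('engage')
        .upper()
        .concat('!') AS captains_order;
-- Equivalent to the nested form:
-- SELECT concat(upper('engage'), '!') AS captains_order;
```

Both forms return `ENGAGE!`, but the chained version reads top to bottom in the order the operations are applied.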