
A report from OpenSQLCamp


November 2, 2010

This article was contributed by Josh Berkus

What do you get when you put together 80 to 100 hard-core database geeks from ten different open source databases for a weekend? OpenSQLCamp, which was held most recently at MIT. Begun three years ago, OpenSQLCamp is a semi-annual unconference for open source database hackers to meet and collaborate on ideas and theories in the industry. It's held at various locations alternately in Europe and the United States, and organized and run by volunteers. This year's conference was organized by Sheeri Cabral, a MySQL community leader who works for PalominoDB.

This year's event included database hackers who work on MySQL, MariaDB, PostgreSQL, VoltDB, Tokutek, and Drizzle. In contrast to the popular perception that the various database systems are in a no-holds barred competition for industry supremacy, most people who develop these systems are more interested in collaborating with their peers than arguing with them. And although it's OpenSQLCamp, programmers from "NoSQL" databases were welcome and present, including MongoDB, Membase, Cassandra, and BerkeleyDB.

While the conference was mainly database engine developers, several high-end users were present, including staff from Rackspace, GoDaddy, VMWare, and WidgetBox. The conference's location meant the participation of a few MIT faculty, including conference co-chair Bradley Kuszmaul. While few of the students who registered actually turned up, attendees were able to learn informally about the software technologies which are now hot in universities (lots of work on multi-processor scaling, apparently).

Friday

The conference started with a reception at the WorkBar, a shared office space in downtown Boston. After a little drinking and socializing, participants slid immediately into discussing database and database industry topics, including speculation on what Oracle is going to do with all of its open source databases (answer: nobody knows, including the people who work there), recent releases of PostgreSQL and MySQL, and how VoltDB works. Whiteboard markers came out and several people shifted to technical discussions that continued until 11pm.

Jignesh Shah of VMWare brought up some interesting SSD testing results. In high-transaction environments, it seems that batching database writes actually reduces throughput and increases response times, completely contrary to performance on spinning disks. For example, Jignesh had experimented with asynchronous commit with large buffers, which means that the database returns a success message to the client and fsyncs the data in batches afterward. This reduced database write throughput, whereas on a standard spinning disk RAID it would have increased it up to 30%. There was a great deal of speculation as to why that was.

A second topic of discussion, which shifted to a whiteboard for comprehensibility, was how to put the "consistency" in "eventual consistency" without increasing response time. This became a session on Sunday. This problem, which is basic to distributed databases, is the question of how you can ensure that any write conflict is resolved in exactly the same way on all database nodes for a transactional database which is replicated or partitioned across multiple servers. Historical solutions have included attempting to synchronize timestamps (which is impossible), using centralized transaction counter servers (which become bottlenecks), and using vector clocks (which are insufficiently determinative on a large number of nodes). VoltDB addresses this by a two-phase commit approach in which the node accepting the writes checks modification timestamps on all nodes which could conflict. As with many approaches, this solution maintains consistency and throughput at a substantial sacrifice in response times.
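The objection to vector clocks can be made concrete with a few lines of code. The sketch below (illustrative only; the names are invented and this is not any particular database's implementation) shows why clocks alone are "insufficiently determinative": two concurrent writes produce clocks that are incomparable, so the system still needs a separate tie-breaking rule to resolve the conflict identically on every node.

```python
# A minimal vector-clock comparison. Two writes accepted on different nodes
# without seeing each other yield incomparable clocks ('||'), which is
# exactly the case where vector clocks cannot pick a winner by themselves.

def compare(a, b):
    """Return '<', '>', '=', or '||' (concurrent) for two vector clocks."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return '='
    if a_le_b:
        return '<'
    if b_le_a:
        return '>'
    return '||'  # neither happened-before the other: a genuine conflict

# Node 1 and node 2 each accept a write without seeing the other's:
w1 = {'node1': 2, 'node2': 1}
w2 = {'node1': 1, 'node2': 2}
print(compare(w1, w2))  # '||' -- some other rule must break the tie
```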

Saturday

The conference days were held at MIT, rather ironically in the William H. Gates building. For those who haven't seen Frank Gehry's sculptural architecture feat, it's as confusing on the inside as it is on the outside, so the first day started late. As usual with unconferences, the first task was to organize a schedule; participants proposed sessions and spent a long time rearranging them in an effort to avoid double-scheduling, which led to some "concurrency issues" with different versions of the schedule. Eventually we had four tracks for the four rooms, nicknamed "SELECT, INSERT, UPDATE and DELETE".

As much as I wanted to attend everything, it wasn't possible, so I'll just write up a few of the talks here. Some of the talks and discussions will also be available as videos from the conference web site later. I attended and ran mostly discussion sessions, which I find to be the most useful events of an unconference.

Monty Taylor of Drizzle talked about their current efforts to add multi-tenancy support, and discussed implementations and tradeoffs with other database developers. Multi-tenancy is another hot topic now that several companies are going into "database as a service" (DaaS); it is the concept that multiple businesses can share the same physical database while having complete logical separation of data and being unaware of each other. The primary implementation difficulty is that there is a harsh tradeoff between security and performance, since the more isolated users are from each other, the fewer physical resources they share. As a result, no single multi-tenancy implementation can be perfect.

Since it was first described in the early 1980s, many databases have implemented Multi-Version Concurrency Control (MVCC). MVCC is a set of methods which allow multiple users to read and modify the same data concurrently while minimizing conflicts and locks, supporting the "Atomicity", "Consistency", and "Isolation" in ACID transactions. While the concept is conventional wisdom at this point, implementations vary considerably. So, on request, I moderated a panel on MVCC in PostgreSQL, InnoDB, Cassandra, CouchDB and BerkeleyDB. The discussion covered the basic differences in approach as well as the issues with data garbage collection.

Jignesh Shah of VMWare and Tim Callagan of VoltDB presented on current issues in database performance in virtualized environments. The first, mostly solved issue was figuring out degrees of overcommit for virtualized databases sharing the same physical machine. Jignesh had tested with PostgreSQL and found the optimal level in benchmark tests to be around 20% overcommit, meaning five virtual machines (VMs) each entitled to 25% of the server's CPU and RAM.

One work in progress is I/O scheduling. While VMWare engineers have optimized sharing CPU and RAM among multiple VMs running databases on the same machine, sharing I/O without conflicts or severe overallocation still needs work.

The other major unsolved issue is multi-socket scaling. As it turns out, attempting to scale a single VM across multiple sockets is extremely inefficient with current software, resulting in tremendous drops in throughput as soon as the first thread migrates to a second socket. The current workaround is to give the VMs socket affinity and to run one VM per socket, but nobody is satisfied with this.

After lunch, Bradley ran a Q&A panel on indexing with developers from VoltDB, Tokutek, Cassandra, PostgreSQL, and Percona. Panelists answered questions about types of indexes, databases without indexes, performance optimizations, and whether server hardware advances would cause major changes in indexing technology in the near future. The short answer to that one is "no".

As is often the case with "camp" events, the day ended with a hacking session. However, only the Drizzle team really took advantage of it; for most attendees, it was a networking session.

Sunday

Elena Zannoni joined the conference in order to talk about the state of tracing on Linux. Several database geeks were surprised to find out that SystemTap was not going to be included in the Linux kernel, and that there was no expected schedule for release of utrace/uprobes. Many database engineers have been waiting for Linux to provide an alternative to DTrace, and it seems that we still have longer to wait.

The VoltDB folks, who are local to Boston, showed up in force and did a thorough presentation on their architecture, use case, and goals. VoltDB is a transactional, SQL-compliant distributed database with strong consistency. It's aimed at large companies building new in-house applications for which they need extremely high transaction processing rates and very high availability. VoltDB does this by requiring users to write their applications to address the database, including putting all transactions into stored procedures which are then precompiled and executed in batches on each node. It's an approach which sacrifices response times and general application portability in return for tremendous throughput, into the hundreds of thousands of transactions per second.

Some of the SQL geeks at the conference discussed how to make developers more comfortable with SQL. Currently many application developers not only don't understand SQL, but actively hate and fear it. The round-table discussed why this is and some ideas for improvement, including: teaching university classes, contributing to object-relational mappers (ORMs), explaining SQL in relation to functional languages, doing fun "SQL tricks" demos, and working on improving DBA attitudes towards developers.

In the last track of the day, I mediated a freewheeling discussion on "The Future of Databases", in which participants tried to answer "What databases will we be using and developing in 2020?" While nobody there had a crystal ball, embedded databases with offline synchronization, analytical databases which support real-time calculations, and database-as-a-service featured heavily in the discussion.

Wrap-up

While small, OpenSQLCamp was fascinating due to the caliber of attendees; I learned more about several new databases over lunch than I had in the previous year of blog reading. If you work on open-source database technology, are a high-end user, or are just very interested in databases, you should consider attending next year. Watch the OpenSQLCamp web site for videos to be posted, and for the date and location of next year's conferences in the US and Europe.


A report from OpenSQLCamp

Posted Nov 3, 2010 1:09 UTC (Wed) by ringerc (subscriber, #3071) [Link]

Standard machine-friendly representation of SQL

I'm an application developer who likes SQL - for reporting and complex queries. It's a powerful language with great features, and I'd like to use it more in my code. In reality, I still use a query builder library or even an ORM much of the time, because what I hate and fear is assembling SQL from horrid little snippets for any non-trivial dynamic query. I'd love to see a standardized machine-friendly form of SQL that's:

  • Designed for fast, simple and unambiguous parsing/processing;
  • Structured and friendly for machine processing. Use XML, use JSON, use whatever, but use something that's not free-form text;
  • Capable of being easily restructured to add and remove clauses, embed another query as a subquery, etc; and
  • Semantically identical to SQL, and translatable to text SQL queries for databases that don't accept it as an alternate wire representation

Half the reason people "hate and fear" SQL is because it's a free-form, pseudo-natural-language query system that's an absolute bastard to manipulate in code. People are driven to add-on query builder APIs or ORMs to get library interfaces that make producing dynamic queries less excruciating than battling SQL snippets.

ORM extensions to the query language and result representation

The other thing I'd really love to see in SQL would be extensions to support ORMs. They're not going away, and the pain of getting from objects to/from relational data is greatly increased by the SQL interface that must be used. SQL is great for reporting and for complex queries, but ORMs are very useful for working with the data during small, highly dynamic transaction processing.

A set of ORM support extensions to SQL that defines a hierarchical or graph representation for the returned data rather than a "flat" tabular/relational representation would massively reduce the pain involved in using ORMs with SQL databases, letting you stick to full SQL where its power is desirable and use a simplified "entity selector" when you just want to grab a bunch of records from a bunch of different relations according to some relationship. This would help avoid all that horrid left-outer-join abuse and many small selects that ORMs often do, by permitting them to request (say) "customer 1 and all customer 1's invoices and bookings" in a single query that'd return a single structured representation of the result data. For simple cases multiple tabular result sets from a query would be good enough (i.e. a 1-record result for the customer, followed by an n-record result for invoices and an m-record result for bookings). Ideally you'd want to be able to return what's essentially a graph representation from the database, so you could slurp in database-produced JSON/XML/whatever and easily turn it into native language objects. Or, for that matter, return serialized forms of the language's native objects directly from the database (python pickle, Java serialize, etc) using a language runtime embedded in the database engine.

Imagine how much nicer it'd be to use ORMs if the database still stored data in relational form (so you get all those nice constraints, foreign key relationships, and the power of SQL to query it all) - but knew how to backward-traverse foreign key relationships and produce object graphs when you want an object representation of the data not a tabular representation. Oh, bliss.

A report from OpenSQLCamp

Posted Nov 3, 2010 10:22 UTC (Wed) by dgm (subscriber, #49227) [Link]

what's so hard about doing SELECT FROM invoices WHERE customer = '1'?

A report from OpenSQLCamp

Posted Nov 3, 2010 16:16 UTC (Wed) by iabervon (subscriber, #722) [Link]

Updating the code that reads the result set to deal with the column that you added for a different feature entirely (since that gives you all of the columns in the order they were declared in the schema definition)? Dealing with the possibilities of single quotes in the string if it comes from a potential attacker? The fact that your customer column is probably actually a number?

But the real issue is that applications will generally need some sort of search, and that will often involve optional clauses, which means that a fixed SQL statement, even with bind variables, will not be sufficient to implement some of the queries the application needs to do. And once you do pretty much anything dynamic, you're into a lot of annoyance with the need to have one fewer separator than list items, and needing to understand how join order can matter. And it's not long before you have to join the same table for two different purposes, getting different rows for each, and you have to worry about what you're naming the table in the query, and getting the right names for the columns from the two copies. The problem with a lot of the convenience features of SQL is that you can't use them with the queries that most need to be comprehensible.
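The "optional clauses" pattern described here is usually handled by collecting condition fragments and bind parameters in parallel and joining them once. A minimal sketch, using Python's sqlite3 with an invented `customer` table (all names here are illustrative, not from the article):

```python
# Conditions are gathered as (sql, param) pairs; the WHERE/AND bookkeeping
# happens in one place, and user input only ever travels as bind parameters.
import sqlite3

def search_customers(db, name=None, region=None):
    clauses, params = [], []
    if name is not None:
        clauses.append("name = ?")
        params.append(name)
    if region is not None:
        clauses.append("region = ?")
        params.append(region)
    sql = "SELECT id, name FROM customer"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return db.execute(sql, params).fetchall()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id INTEGER, name TEXT, region TEXT)")
db.executemany("INSERT INTO customer VALUES (?, ?, ?)",
               [(1, "Ann", "east"), (2, "Bob", "west")])
print(search_customers(db))                 # all rows, no WHERE emitted
print(search_customers(db, region="west"))  # [(2, 'Bob')]
```

This sidesteps the quoting and punctuation problems, though it does nothing for the harder complaints above (join aliasing, column naming across two copies of a table).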

A report from OpenSQLCamp

Posted Nov 3, 2010 19:33 UTC (Wed) by nix (subscriber, #2304) [Link]

Also, as soon as joins turn up, you're into hell. WITH makes things a bit simpler (assuming your database supports it), but encoding knowledge of the frankly somewhat bizarre tree-walk you have to do in order to figure out what names your columns and tables have got in *this* particular subclause is not very pleasant.

But that's not why I hate SQL, at all. That's just icing on the bad-tasting cake.

A report from OpenSQLCamp

Posted Nov 3, 2010 16:27 UTC (Wed) by alankila (guest, #47141) [Link]

Apart from not being syntactically correct, using string quotes for field that is likely an integer, and hardcoding a variable value in query rather than using a host parameter? Nothing.

From viewpoint of someone who constructs a query dynamically based on, say a hash-map's keys, the idiocy already starts from the fact that the first condition needs to start with "WHERE" and the other conditions need to start with "AND", just to construct a query that says "WHERE col1 = :v1 AND col2 = :v2". Lazy people type code like "WHERE 1=1" just to avoid having to deal with that issue.

Some other things that I hate personally are having to deal with the NULL value specifically; "foo = :some_value", but "foo IS NULL" instead of "foo = NULL". I'm sure there is some tired reason for this, but I don't care anymore. Additionally, if you want to negate a condition generally, you have to come up with schemes like (NOT foo = x) or (NOT foo IS NULL) because otherwise you have to generate foo != x vs. foo IS NOT NULL. (At least that is possible generically, though. A small victory, if you accept the use of different operators to begin with.)

Finally, the actual value is often presenting complexities of its own. Many databases flat out refuse a comparison like id = '5' if the id is in fact numerical. This is mitigated by host variables and intelligent cooperation between a dynamic language and the underlying SQL engine so automatic type conversions become possible, and in case of a static language it is a feature that the query crashes if wrong type of parameter is bound to a host variable.

It is this type of thoughtless hostility that the GP is complaining about, I believe. I have often wished for a programmable API for query construction that wouldn't require building a complex query out of tiny bricks. When designing one, I decided that such a thing must have a SQL parser, because there is usually no better way to represent a SQL query than to spell it out in a string, and despite starting out that way you still want to add and remove conditions later on without having to care about whether the condition must start with "WHERE" or "AND", or whether the null value needs to be compared with operator = or IS, or whether the field in question needs a value that is a string or integer type, etc.
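The two mechanical gripes in this thread (WHERE vs. AND, and = vs. IS NULL) are small enough to fold into one helper. A purely illustrative sketch, not a real library, with invented table and column names:

```python
# A tiny condition generator: the caller never chooses between "=" and
# "IS NULL", or between WHERE and AND; the builder does. Column names are
# assumed trusted (they come from the program, not from users); values
# always travel as bind parameters.
import sqlite3

def where(conditions):
    """conditions: dict of column -> value (None means SQL NULL)."""
    parts, params = [], []
    for col, val in conditions.items():
        if val is None:
            parts.append(f"{col} IS NULL")
        else:
            parts.append(f"{col} = ?")
            params.append(val)
    sql = (" WHERE " + " AND ".join(parts)) if parts else ""
    return sql, params

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (a INTEGER, b TEXT)")
db.executemany("INSERT INTO t VALUES (?, ?)", [(1, "x"), (2, None)])
sql, params = where({"a": 2, "b": None})
print(db.execute("SELECT a FROM t" + sql, params).fetchall())  # [(2,)]
```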

A report from OpenSQLCamp

Posted Nov 3, 2010 17:46 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

You're in luck:

List<SomeObject> objects=sess.from(someObject)
  .where(someObject.someField.lt(10))
  .where(someObject.anotherLinkedObject().anotherField.gt(10))
  .where(someObject.name.in("Joe", "Smith", "Bubba"))
  .list(someObject);
or even:
HibernateQuery query=sess.from(someObject)
  .where(someObject.someField.lt(10))
  .where(someObject.anotherLinkedObject().anotherField.gt(10));

if (weNeedToLimitByName)
    query.where(someObject.name.in("Joe", "Smith", "Bubba"));

if (someAnotherCondition)
    query.where(someObject.id.eq(10));

query.list(someObject);
Secret sauce is QueryDSL. LINQ in C# has even more powerful capabilities.

A report from OpenSQLCamp

Posted Nov 3, 2010 22:10 UTC (Wed) by alankila (guest, #47141) [Link]

It looks reasonable on the surface, at least. I normally use the Criteria API when dealing with Hibernate on Java, but it is not especially smart or type safe. It's a pity that code generator is required for this.

A report from OpenSQLCamp

Posted Nov 4, 2010 1:02 UTC (Thu) by ringerc (subscriber, #3071) [Link]

Null handling isn't too bad if you're not dealing with brain-dead databases where you have to severely limit the used feature set. You can use "NOT (A IS DISTINCT FROM B)" to cover equality and nullity in one expression. It's a wee bit verbose, but better than hand-rolling a properly null-safe equality comparison, which is just horrid.
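The difference being worked around here is easy to demonstrate. SQLite's two-argument IS operator behaves like the standard's IS NOT DISTINCT FROM, treating two NULLs as equal rather than unknown, so it makes a convenient test bench:

```python
# Plain "=" yields NULL (unknown) when either side is NULL;
# "IS" is the null-safe comparison that returns true/false regardless.
import sqlite3

db = sqlite3.connect(":memory:")
eq = db.execute("SELECT NULL = NULL").fetchone()[0]    # None: unknown
null_safe = db.execute("SELECT NULL IS NULL").fetchone()[0]  # 1: true
print(eq, null_safe)  # None 1
```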

I'm of two minds on NULL. On one hand, it's semantically different to the null/NULL/None/undef/whatever in most programming languages, and behaves differently. On the other hand, it's a really useful concept that's intellectually appealing. It's such a PITA to actually work with that I'm not sure it's worth it, though, especially once you get to the inconsistencies in its application in various parts of the spec.

A report from OpenSQLCamp

Posted Nov 4, 2010 11:45 UTC (Thu) by alankila (guest, #47141) [Link]

Yeah, NULLs sound useful at first, but the only real place where I use them today is for indicating a missing value in a numeric field. For string fields, I find that empty strings are good substitutes for NULLs, and many dates, especially those that mark some start and end epoch can be set to year 0000 or year 9999 initially instead of NULL depending on what is the semantically most useful way to view the value.

A report from OpenSQLCamp

Posted Nov 3, 2010 17:49 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

Now suppose that you want to fetch a customer's 10 address records, last 5 coupons, and 20 recent purchases in the same request.

You'll get hit by a Cartesian product in the form of 10*5*20=1000 records. A good hierarchical query should return a tree with 35 nodes.
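The arithmetic is easy to check. With invented one-column tables and sqlite3 (illustrative only), joining all three relations against one customer in a single flat query multiplies the row counts, while fetching them separately stays at 10 + 5 + 20 = 35 rows:

```python
# One customer (id 1) with 10 addresses, 5 coupons, 20 purchases.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE address (cust INTEGER);
  CREATE TABLE coupon  (cust INTEGER);
  CREATE TABLE buy     (cust INTEGER);
""")
db.executemany("INSERT INTO address VALUES (?)", [(1,)] * 10)
db.executemany("INSERT INTO coupon VALUES (?)", [(1,)] * 5)
db.executemany("INSERT INTO buy VALUES (?)", [(1,)] * 20)

joined = db.execute("""
  SELECT COUNT(*) FROM address a
  JOIN coupon c ON c.cust = a.cust
  JOIN buy    b ON b.cust = a.cust
""").fetchone()[0]
print(joined)  # 1000 rows from the flat join, versus 35 fetched separately
```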

A report from OpenSQLCamp

Posted Nov 3, 2010 19:40 UTC (Wed) by andresfreund (subscriber, #69562) [Link]

Why would you want to?

A report from OpenSQLCamp

Posted Nov 3, 2010 19:46 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

To minimize the number of database roundtrips, as usual.

A report from OpenSQLCamp

Posted Nov 7, 2010 13:27 UTC (Sun) by zorro (subscriber, #45643) [Link]

You can also send the three individual queries to the database in a single batch.

A report from OpenSQLCamp

Posted Nov 3, 2010 16:46 UTC (Wed) by iabervon (subscriber, #722) [Link]

I think one mistake that was made with SQL that you're still making is the idea that queries should normally be constructed as a sequence of characters, like program source. Since applications will almost always have to make queries with systematic structural variations, it isn't possible to put it all in the program source, and there's not really any other particular benefit to the sequence of characters for programmatic use (of course, it can be convenient to interact with the database manually over a text terminal, but that's an operational issue rather than a development issue). What I think would be really nice is something that is to SQL what DOM is to XML: something that can be manipulated at a logical granularity and doesn't need to be unparsed and parsed at all except when it has to go between independent systems (which wouldn't happen at all in most applications). I think it might even be fine if the unparsed form was still SQL, just because it would hardly matter (of course, it would mean having a syntax that isn't entirely predictable from the parsed form's API, which would mean that extensions would have to be defined in both formats if they are to be possible to use over a text terminal).

A report from OpenSQLCamp

Posted Nov 3, 2010 20:00 UTC (Wed) by nix (subscriber, #2304) [Link]

i.e. what we want is Sexpy Query Language? I'd support such an endeavour :)

A report from OpenSQLCamp

Posted Nov 10, 2010 17:08 UTC (Wed) by jeremiah (subscriber, #1221) [Link]

My current working solution is to define queries as an XML tree. This tree defines both the query and what the resulting XML should look like. I'm not interested in storing full XML documents in the DB, but in breaking them up into atomic pieces that a subsequent query can then reconstruct across documents. I had seriously considered hacking the Postgres source to do this. For now I just have a parser that builds SQL, and one that processes result sets into XML. Based on a rather large assumption here, I would think it would be possible to create an API that would take XML-based queries and parse them into Postgres's internal query representation, and have Postgres's normal query optimization apply w/o much work. The difficult part would be processing the result sets in a hierarchical fashion w/o iterating over the redundant left outer join rows. There is a lot of wasted time there, but I'd imagine that the internal machinery that produces results is not designed in a tree-friendly way, but in a flat way. So you'd probably be making rather severe architectural changes, to the point of it not being worth it. I guess you could always filter the flat results at the DB into XML, but that would mean processing the result tree twice. Which most of us do already, but at least it would already be in memory, and avoid transport time.
</unedited train of thought>

In a nutshell, it seems like everyone wants to be able to send a query or data tree in, and get a result tree out.

A report from OpenSQLCamp

Posted Nov 10, 2010 17:33 UTC (Wed) by jeremiah (subscriber, #1221) [Link]

btw. NeoDB does an almost decent job of this, but too much of the processing and map reducing is done in code as opposed to by the DB itself. Although it's tree based, the query structure was not dynamically defined enough for my uses.

A report from OpenSQLCamp

Posted Nov 10, 2010 21:59 UTC (Wed) by nix (subscriber, #2304) [Link]

That would be lovely. Downside: XML is horribly verbose and ugly as sin.

(Has anyone written an XML -> sexp bidirectional filter? If not, I should: it wouldn't be hard and would make this stuff *so* much easier for humans to read and write. It's quite an achievement, actually, coming up with something *less* readable than Lisp's sexps... the SGML/XML people should be proud.)

A report from OpenSQLCamp

Posted Nov 10, 2010 23:40 UTC (Wed) by jeremiah (subscriber, #1221) [Link]

So I'm trying to visualize this, and failing. I don't suppose you'd toss up an example? I'm curious to see how you'd address a mix of attribute value pairs and element names and content. It's been at least 15 years since I messed with lisp.

I deal with XML every day, and agree with the verbosity and ugliness comment. The only way I can make it palatable is lots of indenting and syntax highlighting, and even then it's pretty rough sometimes. I like its flexibility and ubiquity though. I don't hate it nearly as much as I dislike the SQL language.

A report from OpenSQLCamp

Posted Nov 11, 2010 0:11 UTC (Thu) by nix (subscriber, #2304) [Link]

There are lots of possible unambiguous mappings. Instead of

<foo bar="baz">quux<wob>foo</wob>blah</foo>

you could say

(foo :bar "baz" (:content "quux" (wob (:content "foo")) (:content "blah")))

perhaps. Or, if you don't like the ":content", you could say

(foo (:attr bar "baz") "quux" (wob "foo") "blah")

(yeah, I think the latter is much easier to read. Attributes are less common than content in decent XML, so attributes should be more verbose in the sexpy syntax.)

Note that both these syntaxes render it impossible to close elements in the wrong order and impossible to do the analogue of missing a / out. Those two things on their own make it worthwhile, I think.

(I am by no means the first person to have thought about this. I'm not even the five thousandth.)
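For what it's worth, the second mapping is only a few lines of Python to produce (one direction only, no escaping of quotes inside strings; a sketch, not a robust filter):

```python
# Convert an XML element to the sexp form above: attributes become
# (:attr name "value"), text and tails become bare quoted strings.
import xml.etree.ElementTree as ET

def to_sexp(elem):
    parts = [elem.tag]
    for name, value in elem.attrib.items():
        parts.append(f'(:attr {name} "{value}")')
    if elem.text and elem.text.strip():
        parts.append(f'"{elem.text.strip()}"')
    for child in elem:
        parts.append(to_sexp(child))
        if child.tail and child.tail.strip():
            parts.append(f'"{child.tail.strip()}"')
    return "(" + " ".join(parts) + ")"

doc = ET.fromstring('<foo bar="baz">quux<wob>foo</wob>blah</foo>')
print(to_sexp(doc))  # (foo (:attr bar "baz") "quux" (wob "foo") "blah")
```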

A report from OpenSQLCamp

Posted Nov 11, 2010 1:13 UTC (Thu) by jeremiah (subscriber, #1221) [Link]

I most definitely prefer the second, and thanks for the example. I'm going to roll it around in my head for a bit. I am curious to know why you think "Attributes are less common than content in decent XML" though. Most of the XML I deal with is coming out of a DB and generally the table name corresponds to the element name, and the column names to the attribute names etc. This makes for fairly decently readable XML, when a person has to look at it. Sure you run into issues with large content that doesn't really fit with in an attribute, and should be content instead. But for the majority of data that I have to deal with I have very very few content only elements. Do you have any pointers as to why this is a bad thing?

A report from OpenSQLCamp

Posted Nov 11, 2010 2:11 UTC (Thu) by jeremiah (subscriber, #1221) [Link]

Never mind. There was some missed subtlety in my reading of your post.

A report from OpenSQLCamp

Posted Nov 11, 2010 7:17 UTC (Thu) by nix (subscriber, #2304) [Link]

I think it depends on where your data's coming from. I think for the application you were discussing, you're probably right. I tend to deal with XML as a data-interchange format, and while a few of those are slightly self-describing, with datatypes in an attribute, most of them hand you a (generally very buggy) schema and then have no attributes at all in the content.

Note: a nice thing about using a real programming language rather than XML is that you could go the whole hog if you want and eliminate redundancy, although this looks uglier than it should because all my indentation is being eaten by the comment form and because I've only got one line of XML generated, much too little to be worth using this technique:

(let ((type-string '(:attr type "string"))
      (type-int '(:attr type "int")))
  (foo type-string "quux" (num-foos type-int 6)))

is equivalent to

<foo type="string">quux<num-foos type="int">6</num-foos></foo>

If you wanted to have 'type's with parameters, let alone optional ones, you'd have to do something a little different:

(flet ((type-float (precision) `(:attr type ,(concat "float-" precision))))
  (foo (type-float 6) 14.172))

is equivalent to

<foo type="float-6">14.172</foo>

Note that the latter is horribly hard to typecheck properly in XML, but having something typecheck the Lisp notation is trivial. Also note that your generator and parser don't need to understand Lisp (although it is so easy the parser might as well understand a restricted subset): it just needs to know how to generate stereotyped let and (for parameterized attributes) flet expressions.

A report from OpenSQLCamp

Posted Nov 3, 2010 1:09 UTC (Wed) by nix (subscriber, #2304) [Link]

Most excellent summary. This is the sort of thing I pay my LWN subscription for. One quibble:
Currently many application developers not only don't understand SQL, but actively hate and fear it.
Speaking as a SQL expert (to some degree: I've been using it constantly for about twenty years and can make it dance to my tune by now), that's because it's one of the most abominably awful excuses for a language ever to have been inflicted on the programming public: it does almost everything wrong from a language design perspective, and on top of that is hard to read, inflexible, nonportable in any useful way, was appallingly badly standardized and is direly unpleasant to use. Unfortunately none of the more expressive and readable alternatives have taken off (and these days half of them seem to require Haskell, which makes them a dead loss from the start).

A report from OpenSQLCamp

Posted Nov 3, 2010 1:29 UTC (Wed) by ringerc (subscriber, #3071) [Link]

I couldn't agree more. It tries to be a "natural" language, and instead ends up being horribly unnatural for both humans and computers to understand, stuck in some twilight zone middle ground, a sort of language purgatory.

I *love* the power of SQL, but the language itself I hate with a passion made all the stronger by the fact that it's too powerful NOT to use. I liken it to being forced to write in LOLcode or leet because what you want to say can't be expressed directly in English; you may detest the languages, but to say what you want to you have to use them anyway.

I nearly said that I'd like to see the SQL standards body move to design an alternate SQL representation - but actually, I don't. I'd like to see a new query language widely adopted, possibly via a push from Sun/Oracle via the JDBC interface spec, and via Microsoft via the ODBC/ADO interface specs (and LINQ), so that it actually sees wide multi-vendor adoption. Alas, neither Oracle nor MS have a history of favoring interoperable, vendor-independent approaches to problems. We're kind of stuck with SQL, because nobody wants to agree on a replacement that doesn't suck, they all want to push their own interfaces that work with their database and no other.

A report from OpenSQLCamp

Posted Nov 4, 2010 20:24 UTC (Thu) by lakeland (guest, #1157) [Link]

Make that 'me too' :)

SQL reminds me of HyperTalk from the late 80s, designed to be almost a natural language but failing in awkward ways, and at the same time making it very hard to work with programmatically.

<RANT>

Have you used SAS? One of the interesting features of SAS is that you can manipulate your data using both PROC SQL and DATA steps. That's interesting because some operations are far more naturally expressed in DATA steps while others are far more naturally expressed in PROC SQL. I know nobody likes RBAR (row-by-agonizing-row processing) but sometimes it's the easiest way to solve a problem. Much like Microsoft have cripplingly bad string manipulation in SQL Server and say they won't fix it because string manipulation is slow. Perhaps true, but sometimes you care more about getting your job done simply than how hard the machine has to work.

Even when SQL has a natural approach I find the language unnecessarily awkward, like how in every SQL variant (except MySQL), you cannot say GROUP BY 1 but instead have to copy/paste your select clause, or how you cannot make table names variable, or how the first condition clause is prefixed with WHERE and others with AND.

Another issue is how hard it is to work with GROUP BY... Say I'd like to display the best customers in each region, you'd think I'd be able to do something like

select * from customer where customer_id in (select customer_id from customer group by region having score = MAX(score) PARTITION BY REGION)...

You can see the whole thing is turning into a mess and it is not finished yet. Yet the problem was incredibly simple to specify precisely - anything that easy to specify precisely should be easy to do.
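For what it's worth, the "best customer per region" problem the commenter sketches does have a reasonably direct answer in modern SQL via window functions. A minimal sketch, run here against SQLite (3.25 or later, which added window-function support); the `customer` table and its columns are invented for illustration:

```python
# Sketch: "best customer per region" using a standard window function,
# rather than the GROUP BY contortions shown above.
# The customer table and its columns are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER, region TEXT, score INTEGER);
    INSERT INTO customer VALUES
        (1, 'north', 10), (2, 'north', 30),
        (3, 'south', 20), (4, 'south', 5);
""")

# RANK() OVER (PARTITION BY ...) numbers customers within each region;
# keeping rank 1 yields the top scorer(s) per region in one pass.
rows = conn.execute("""
    SELECT customer_id, region, score FROM (
        SELECT customer_id, region, score,
               RANK() OVER (PARTITION BY region ORDER BY score DESC) AS rnk
        FROM customer
    ) WHERE rnk = 1
    ORDER BY region
""").fetchall()

print(rows)  # [(2, 'north', 30), (3, 'south', 20)]
```

Window functions postdate much of this complaint's vintage in most open source engines, which rather supports the commenter's point about how long such conveniences took to arrive.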

Another thing which Postgres does well but no others do is have a really elegant way of shifting to a slightly more powerful language when SQL doesn't quite do it for you.

Why is it that DB vendors seem to think supporting a feature is adequate and making it easy to use is optional? One thing I really like about SQL Server and to a lesser extent MySQL is that they go out of their way to make their nonstandard SQL extensions easy to use. For instance, ISNULL might have some dumb behaviours but it's so much better named than COALESCE, and have you looked at how awkward the spatial syntax is in MySQL, Postgres or Oracle?
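For readers who haven't met the two functions being compared: COALESCE is the standard spelling of roughly what SQL Server calls ISNULL, returning its first non-NULL argument. A quick illustration against SQLite (table and column names are invented; SQL Server's two-argument ISNULL is not exercised here):

```python
# COALESCE returns its first non-NULL argument.
import sqlite3

conn = sqlite3.connect(":memory:")
val = conn.execute("SELECT COALESCE(NULL, NULL, 'fallback')").fetchone()[0]
print(val)  # fallback

# Typical use: substitute a default for missing data.
conn.execute("CREATE TABLE t (name TEXT)")
conn.execute("INSERT INTO t VALUES (NULL), ('alice')")
names = [r[0] for r in conn.execute("SELECT COALESCE(name, 'unknown') FROM t")]
print(names)  # ['unknown', 'alice']
```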

</RANT>

A report from OpenSQLCamp

Posted Nov 14, 2010 0:49 UTC (Sun) by jberkus (guest, #55561) [Link]

All,

Thanks everyone for the rants on why you hate SQL. The above gives us SQL geeks some useful feedback on what to focus on (or what to avoid) when teaching people to live with SQL.

I happen to like SQL a lot, although I can certainly imagine an alternative, better data manipulation language (XSLT doesn't cut it, sorry). I understand that a lot of people don't, just as SGML and XML give me the hives, yet there are people who adore *them*. For that matter, I like Perl and C.

Oh, and PostgreSQL does support GROUP BY 1 (it's in the standard). I recommend against it, though, because it's a foot-gun.
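The foot-gun is easy to demonstrate: grouping by select-list position means that reordering the select list silently changes the query's meaning. A sketch against SQLite, which also resolves integer literals in GROUP BY as output-column positions; the `sales` table is invented:

```python
# Why GROUP BY 1 (grouping by select-list position) is a foot-gun:
# reordering the select list silently changes what you group by.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('north', 'ale', 10), ('north', 'ale', 5), ('south', 'ale', 7);
""")

# Grouping by position 1 = region here...
by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY 1 ORDER BY 1").fetchall()
print(by_region)  # [('north', 15), ('south', 7)]

# ...but swap the select list and the same GROUP BY 1 now groups by
# product, with no error or warning.
by_product = conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY 1 ORDER BY 1").fetchall()
print(by_product)  # [('ale', 22)]
```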

A report from OpenSQLCamp

Posted Nov 3, 2010 12:45 UTC (Wed) by clugstj (subscriber, #4020) [Link]

Agreed. "Seasoned" application developers have seen many (programming) languages in their day. Some good, some not so good. They can recognize ones that suck, and SQL is definitely in that camp. Whereas dyed-in-the-wool database weenies (no offense, some of my best friends are database weenies), who've never seen anything but SQL, think it's great.

A report from OpenSQLCamp

Posted Nov 16, 2010 20:47 UTC (Tue) by BuggyFunBunny (guest, #71272) [Link]

-- Speaking as a SQL expert (to some degree: I've been using it constantly for about twenty years and can make it dance to my tune by now), that's because it's one of the most abominably awful excuses for a language ever to have been inflicted on the programming public

There's a reason for that: SQL *was not* specified by Codd, but by Chamberlin, who has graced us with XQuery. He was not part of Codd's group, and had been with IMS (the original hierarchical datastore; XQuery now makes sense, yes?). Legend has it that Armonk was so pissed at Codd for working out a better datastore than IMS just a couple of years after IMS's release that they purposely anointed Chamberlin.

A report from OpenSQLCamp

Posted Nov 3, 2010 1:51 UTC (Wed) by kripkenstein (guest, #43281) [Link]

No SQLite people?

A report from OpenSQLCamp

Posted Nov 3, 2010 19:23 UTC (Wed) by drh (guest, #65025) [Link]

This LWN post was the first I'd heard of the latest OpenSQLCamp....

A report from OpenSQLCamp

Posted Nov 14, 2010 0:53 UTC (Sun) by jberkus (guest, #55561) [Link]

Nope. Sheeri said she e-mailed Richard, but got no response.

A report from OpenSQLCamp

Posted Nov 14, 2010 0:50 UTC (Sun) by Kamilion (subscriber, #42576) [Link]

Which is somewhat sad, as most of my database calls end up passing through SQLite in gearmand on their way out to the larger DBs. (Hi PostgreSQL!)

SQLite's a great asset when all you need from the frontend is simple sets of relational data. Ask a more powerful database to gather results asynchronously with gearmand, collect them in SQLite and process them.
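The pattern being described can be sketched in a few lines: batches of results from remote workers land in an in-memory SQLite table, and the relational post-processing happens locally. The gearman plumbing is omitted; `fetch_batches()` is a hypothetical stand-in for results arriving asynchronously from the bigger databases:

```python
# Sketch of using SQLite as a local collection layer for results
# gathered asynchronously from larger databases.
import sqlite3

def fetch_batches():
    # Hypothetical stand-in for asynchronous gearman results.
    yield [('pg-1', 100), ('pg-1', 250)]
    yield [('pg-2', 175)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (source TEXT, value INTEGER)")
for batch in fetch_batches():
    conn.executemany("INSERT INTO results VALUES (?, ?)", batch)

# Local relational post-processing once everything has landed.
totals = conn.execute(
    "SELECT source, SUM(value) FROM results GROUP BY source ORDER BY source"
).fetchall()
print(totals)  # [('pg-1', 350), ('pg-2', 175)]
```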

In particular, it's made things so much easier to migrate away from Oracle's MySQL to open PostgreSQL. Having pggearman is also much sweeter. Being able to spread my load across 6 servers and run some queries in parallel has also been a tremendous performance boost.

A report from OpenSQLCamp

Posted Nov 3, 2010 1:56 UTC (Wed) by fuhchee (guest, #40059) [Link]

"Elena Zannoni joined the conference in order to talk about the state of tracing on Linux. Several database geeks were surprised to find out that SystemTap was not going to be included in the Linux kernel"

I wish people were not misled into thinking that this ("inclusion in the linux kernel") was at all necessary to use systemtap.

"and that there was no expected schedule for release of utrace/uprobes"

I assume "merge" rather than "release". Yes, utrace and/or uprobes in the kernel would make this stuff more useful to database hackers on non-RH platforms.

A report from OpenSQLCamp

Posted Nov 3, 2010 12:57 UTC (Wed) by mjw (subscriber, #16740) [Link]

I wish people were not misled into thinking that this ("inclusion in the linux kernel") was at all necessary to use systemtap.

The same is kind of implied in another LWN article about stable kernel interfaces too:

Another way of putting it, he said, is that, if you still want people to try out an ABI, you should not be asking him to pull it. In general, it can be better if new interfaces stay out of the kernel for a while. SystemTap was given as an example here: according to Linus, time has shown that the SystemTap interface is not a good one. He's very glad he never pulled it into the kernel. The lesson is that it's a good idea to impose a certain amount of pain on people who want to create new interfaces; let them live out of the mainline for a while.
It reads as if SystemTap itself should somehow be pulled into the kernel, while SystemTap of course works fine outside it (seeing that it is mainly a user-space program anyway). It is just that the kernel could make life for SystemTap easier by integrating some sane interfaces to use. Which is actually slowly happening. If the kernel interfaces for tracing and probing improve, SystemTap will obviously just adapt to use them.

BTW. Some nice articles about using SystemTap with SQLite/PostgreSQL/MySQL:

Relational ≠ SQL

Posted Nov 3, 2010 7:10 UTC (Wed) by ldo (guest, #40946) [Link]

SQL does not have to be the only way to access relational databases. It has many shortcomings and peculiarities; for example, it does not allow tables as first-class objects. It became entrenched because for so long all the relational databases worth using were proprietary; in the modern open-source world, there is no reason why this has to continue to be so.

Look at mathematical treatments like relational algebra and relational calculus, to see where a new, more modern, query language might come from.
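To make the suggestion concrete, the core relational-algebra primitives are small enough to sketch directly. This toy rendering of selection, projection, and natural join over rows modelled as dicts is purely illustrative and not any existing library's API:

```python
# Toy relational-algebra primitives over lists of dict-rows.

def select(relation, predicate):          # sigma: keep matching rows
    return [row for row in relation if predicate(row)]

def project(relation, attrs):             # pi: keep only named attributes
    return [{a: row[a] for a in attrs} for row in relation]

def natural_join(r, s):                   # join on all shared attributes
    shared = set(r[0]) & set(s[0]) if r and s else set()
    return [{**x, **y} for x in r for y in s
            if all(x[a] == y[a] for a in shared)]

emp  = [{'name': 'ada', 'dept': 1}, {'name': 'bob', 'dept': 2}]
dept = [{'dept': 1, 'dname': 'eng'}, {'dept': 2, 'dname': 'ops'}]

joined = natural_join(emp, dept)
eng_names = project(select(joined, lambda r: r['dname'] == 'eng'), ['name'])
print(eng_names)  # [{'name': 'ada'}]
```

Composing these as plain function calls is exactly the sort of algebraic style that alternatives to SQL's sentence-like syntax tend to adopt.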

Relational ≠ SQL

Posted Nov 3, 2010 11:10 UTC (Wed) by epa (subscriber, #39769) [Link]

Postgres used to have its own language called Postquel but it was only when it switched to an SQL front-end that it took off. Unfortunately, SQL is the only relational query language the world knows. In its modern, ANSI-standardized form (as supported by Postgres) it's pretty powerful but the syntax is still yuck.

Relational ≠ SQL

Posted Nov 5, 2010 19:48 UTC (Fri) by lakeland (guest, #1157) [Link]

Wouldn't it be possible to provide multiple languages for accessing the same DB? So you could have this Postquel (which I have to admit I've never heard of) and SQL accessing the DB simultaneously, while maintaining a fully ACID-compliant DB.

A report from OpenSQLCamp

Posted Nov 3, 2010 15:47 UTC (Wed) by rbetts@gweep.net (subscriber, #21779) [Link]

Wonderful summary. Thanks for taking the time to write about the conference.

To put some (very coarse) numbers around the VoltDB latency / throughput tradeoffs - Volt typically achieves ~10ms latencies from relatively idle to 80% or so of max throughput capacity (possibly 100k's of multi-SQL-statement stored procedures per second).

Ryan.

A report from OpenSQLCamp

Posted Nov 3, 2010 17:41 UTC (Wed) by nteon (subscriber, #53899) [Link]

What are typical latencies for a traditional RDBMS like Postgres or MySQL?

A report from OpenSQLCamp

Posted Nov 4, 2010 17:27 UTC (Thu) by kjp (guest, #39639) [Link]

VoltDB is interesting. I'm wondering if two Postgres databases could instead use each other's RAM instead of a transaction log, to have very high transaction rates.

It's sync replication but only to RAM. Checkpoints can occur whenever. So you get higher availability than a single server and more speed due to transactions just hitting RAM.

A report from OpenSQLCamp

Posted Nov 6, 2010 20:38 UTC (Sat) by mgedmin (subscriber, #34497) [Link]

The other major unsolved issue is multi-socket scaling. As it turns out, attempting to scale a single VM across multiple sockets is extremely inefficient with current software, resulting in tremendous drops in throughput as soon as the first thread migrates to a second socket. The current workaround is to give the VMs socket affinity and to run one VM per socket, but nobody is satisfied with this.
What does 'socket' refer to, in this context? A network socket? A CPU socket (not to be confused with cores)? Some database-specific term I'm not familiar with?

A report from OpenSQLCamp

Posted Nov 10, 2010 16:38 UTC (Wed) by jeremiah (subscriber, #1221) [Link]

I read it as CPU/Core

A report from OpenSQLCamp

Posted Nov 14, 2010 0:51 UTC (Sun) by jberkus (guest, #55561) [Link]

It refers to a CPU socket. Basic NUMA scaling issues, really.


Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds