Three bugs in the Go MySQL driver (github.blog)
303 points by farslan on May 21, 2020 | 87 comments



While the second and third problems sound specific to Go's `database/sql` API, the first problem (server-closed connections) is an issue any MySQL client library (that implements connection pooling) has to deal with.

My .NET MySQL connector (https://github.com/mysql-net/MySqlConnector) "solves" it at the MySQL level by sending a PING packet, but as the post points out, this adds extra latency.

The TCP-based technique of performing a non-blocking read sounds like a much better approach; I'm glad the author shared it, and I want to see if it can also be implemented in MySqlConnector: https://github.com/mysql-net/MySqlConnector/issues/821.
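For the curious, the technique looks roughly like this in Go; a Unix-only sketch loosely modeled on the conncheck approach the article describes, with names and error handling simplified:

    import (
        "errors"
        "io"
        "net"
        "syscall"
    )

    var errUnexpectedRead = errors.New("unexpected read from socket")

    // connCheck probes a pooled connection with a non-blocking one-byte
    // read just before the connection is reused.
    func connCheck(conn net.Conn) error {
        sysConn, ok := conn.(syscall.Conn)
        if !ok {
            return nil // not a socket we can probe; assume it's fine
        }
        rawConn, err := sysConn.SyscallConn()
        if err != nil {
            return err
        }
        var checkErr error
        rerr := rawConn.Read(func(fd uintptr) bool {
            var buf [1]byte
            n, err := syscall.Read(int(fd), buf[:])
            switch {
            case n == 0 && err == nil:
                checkErr = io.EOF // server sent FIN: connection is dead
            case n > 0:
                checkErr = errUnexpectedRead // stray bytes: protocol desync
            case err == syscall.EAGAIN || err == syscall.EWOULDBLOCK:
                checkErr = nil // nothing to read: connection looks alive
            default:
                checkErr = err
            }
            return true // done after one attempt; never block for readiness
        })
        if rerr != nil {
            return rerr
        }
        return checkErr
    }

The check is cheap because Go's runtime already keeps the socket in non-blocking mode, so the read returns immediately either way.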


The non-blocking read is a bad hack to an ill-defined problem.

For starters, if a read would fail, so would a write. Your write path has to handle errors anyhow, so the non-blocking read is a contrivance. I understand the reason for the contrivance--it's because the typical SQL driver API requires first dequeueing a connection from a pool, passing ownership to a caller, and the caller sending a query. The proper and better API is to combine the "dequeue connection" and "send query" steps, returning a handle back for the response, which may include (as an optimization) assigning the connection to the caller for any additional requests. Additionally, any hack that prevents the majority of write failures, but not all, leaves time bombs around by not stressing the write failure paths of the caller.
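Roughly, the API shape I mean (a Go-flavored sketch; all names here are hypothetical):

    import (
        "context"
        "database/sql"
    )

    // Pool combines "dequeue connection" and "send query" into one step,
    // so a write failure surfaces to the caller immediately instead of
    // being masked by a pre-checkout liveness probe.
    type Pool interface {
        // Query picks a connection, writes the query, and returns a
        // handle for the response in a single operation.
        Query(ctx context.Context, query string, args ...interface{}) (Response, error)
    }

    type Response interface {
        Rows() (*sql.Rows, error)
        // Conn optionally pins the underlying connection to the caller
        // for follow-up requests (the optimization mentioned above).
        Conn() (*sql.Conn, error)
    }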

Secondly, IME a far more common reason in the wild for a connection disappearing is either a) the server crashing or b) NAT association timeout. In either case it's typical for the remote end to simply disappear, and unless TCP keepalive is enabled, a read won't return any error until you first attempt a write and wait for the TCP stack to figure out that the remote end is gone. In the best case that's a R/T, in the worst case some eager firewall rules in the middle or on the remote peer silently drop packets for unassociated connections and/or the ICMP response and you're stuck waiting for TCP timeout.

Thirdly, and related to #2, absent TCP keepalive or an application ping (not ICMP ping), connections are far more likely to get dropped on the floor while sitting in the pool. And an application ping is more reliable because some firewalls and application and NAT gateways drop or ignore TCP keepalive.

I've written client libraries with connection pooling for HTTP, MySQL, SMTP, RTSP and generic TCP. I learned all this the hard way.
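For what it's worth, on points two and three, Go exposes TCP keepalive directly; a minimal sketch, assuming the pool can reach the underlying net.Conn:

    import (
        "net"
        "time"
    )

    // enableKeepAlive turns on TCP keepalive so a vanished peer (server
    // crash, NAT timeout) is eventually detected even while the
    // connection sits idle in the pool.
    func enableKeepAlive(conn net.Conn) error {
        tcp, ok := conn.(*net.TCPConn)
        if !ok {
            return nil
        }
        if err := tcp.SetKeepAlive(true); err != nil {
            return err
        }
        // Probe well inside typical NAT association timeouts.
        return tcp.SetKeepAlivePeriod(30 * time.Second)
    }

Though as noted, an application-level ping survives middleboxes that TCP keepalive does not.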


TCP connections are not necessarily symmetric: a failed read doesn't necessarily mean a write will fail. There is a difference between shutdown() and close() (check a packet dump, or this Stack Overflow question: https://stackoverflow.com/questions/11436013/writing-to-a-cl...).

The non-blocking read method is fine if you wish to treat any client connection with a server-initiated FIN as unusable (since the client will never be able to read a response from the server); it cleanly distinguishes, on the client side, between a socket in CLOSE_WAIT and one that is ESTABLISHED with an empty read buffer.


"The proper and better API is to combine the "dequeue connection" and "send query" steps, returning a handle back for the response, which may include (as an optimization) assigning the connection to the caller for any additional requests."

It wouldn't be an optimization; you need it for transactions in DBs where transactions are tied to connections, which is, I believe, basically all of them.

It seems like this would still leave "I sent a request and it was completed, but I never heard that it was, so now I don't know if I need to resend it" unaffected... I find myself concluding that there's a certain amount of fundamental brokenness in trying to communicate with a database over a mutable-data-based protocol on TCP, period. I'm not sure that can be resolved in any practical manner, but there may be ways of moving the fundamental brokenness around, once it is acknowledged to exist, to make it less likely to bite, less likely to bite hard, and easier to detect when it does.


>>> a far more common reason in the wild for a connection disappearing is...

You are correct about the failure modes and mitigations in the general case; however, for MySQL the most common reason by far for connections to disappear is wait_timeout being systematically misconfigured and set way too short.

Databases configured to drop connections every few seconds do drop client connections all the time. No surprise.


The article says the non-blocking read resolved the problem they were having while only introducing five microseconds of latency per borrow from the connection pool.

For a consumer-facing, latency-sensitive application, that sounds like a much easier-to-stomach hack than introducing a network round trip at the start of every connection borrow.


It's funny you say it resolves the problem, because if you read the bug report the author opened back in 2017 (https://github.com/go-sql-driver/mysql/issues/657), one maintainer repeatedly points out that his server and application are misconfigured and should be fixed. Yet the author doesn't consider it, and instead embarks on a project to redo the connection pool (the library can certainly benefit from better error detection, but he would be better off fixing the root cause of the dropped connections).

Ergo, the issue is not fixed. If you look at the chart provided, it's still erroring every few minutes instead of every minute. That's an improvement, but a far cry from zero. https://user-images.githubusercontent.com/42793/57318792-6e3...


There's a fundamental unfixable race there: a connection can be picked from the pool while a FIN is already on its way to the client.

This is "caused" (as in, not mitigated) by the intersection of a naive protocol, as pretty much all SQL protocols are, and the nature of distributed programming.


You realize the database server was misconfigured to drop connections transparently after 30 seconds? If it weren't for that, TCP race conditions would be extremely rare.


The usual implementation of a Connection type tends to couple the persistent-socket connection domain with the authorization domain. But if you defer connection management to some outer scope, then the Connection type is left with just the authorization functionality. (To the extent that you even authorize on a per-query basis.) That's an interesting idea.


If the server is closing idle connections, the client connection pool should track idle times & close them earlier.

Checking for readability still has a race condition.

Really, this seems like an example of poor configuration.


Bug #3 (The race) was silently introduced by a semantic change in Go's database/sql in January 2018: https://github.com/golang/go/commit/651ddbdb5056ded455f47f9c...

It took until December of the same year before we got the first bug report and figured out what was going on. While the semantic change might look subtle, it was certainly not from our (the driver maintainers') perspective. We were quite disappointed that such a change was made 1) without informing the driver maintainers and 2) without making sure the necessary driver changes were in place before the change made it into a Go release.

We regularly test against Go's master (now using a Travis CI cron job), but that only helps if the existing tests fail. We don't have the time to constantly monitor all changes in the Go repo.

If there is a need to make such changes (not just in database/sql and not just in Go), PLEASE actively communicate early with the community / the direct users.


I'm sorry. This is clearly my fault and I'm sorry this wasn't communicated.

Please look into Go 1.15. It should all be backwards compatible, but where the session resetter is called has changed, and a new call has been added for verifying whether a connection is valid. I'll have the release notes updated soon for Go 1.15, but until then, look at the driver docs for tip.

I'll announce the change on the golang-sql mailing group.


A comment on the original issue[1] promises to "[open] some tickets for various drivers". Did this not happen?

1: https://github.com/golang/go/issues/23519#issuecomment-36057...


I really appreciate the work that is done to improve the database/sql package. It is still rather young and not as mature as for example JDBC. I would rather thank the people who work on it for the work they do, than to point fingers for anything that is not done. The original design of database/sql was sometimes a bit too simplistic and will probably require some more ugly changes or workarounds in the future.

This specific change was a bit unfortunate. It tried to fix another bug caused by the introduction of Context cancellation support, but it unfortunately caused some other major bugs due to the changed semantics. As the very same comment already indicates, other drivers also required changes. In retrospect, I think it would have been better to try to handle the original issue entirely in database/sql, instead of changing the "contract" with the drivers.

Go MySQL Driver was, and I believe still is, developed mostly by some random programmers in their free time (I originally started it as a side-project during high school and until now was never paid for any work on it). It probably makes anyone who worked on it proud that now some major companies, not just GitHub, employ it. But like many such projects, it is not a perfect or finished product. If you use community-driven open source software at work, try to convince your management to set some work hours aside for actually contributing back to those, like GitHub did.


Go is still undergoing intensive R&D, especially in the runtime. Its releases are quietly tested against internal Google projects, but not formally tested with widely-used open source projects.

Any org that deploys Go apps should designate someone to watch the Go issue tracker, and subscribe to all issues that might affect them.

https://github.com/golang/go/issues/


Notifying users of the code of a semantic change does not mean it's not a semantic change.

Even if the Go team can find a way to notify all of the public drivers impacted, they aren't going to be able to know about the private drivers or forks to notify those people. Nor are they going to be able to notify all the consumers who, using MVS, have the version of the driver set to one that's not fixed.

Bugs from this change will propagate and there's nothing the Go team can do to stop them all. This is why projects following semantic versioning are not supposed to make breaking changes in minor releases.

This is, unfortunately, not the first time I've seen this kind of change in Go. Previous changes have caused bugs for me a few times in the past. Not knowing which version of the compiler users would build with, I ended up needing to craft code to handle both the before and after cases of the change, along with fielding the resulting bug reports from users.


FYI: Go 1.15 will have idle timeout: https://tip.golang.org/pkg/database/sql/#DB.SetConnMaxIdleTi...

The driver interface also separates out the session resetter and the connection validator concepts. https://tip.golang.org/pkg/database/sql/driver/

Please do have drivers implement the Connector.
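For driver authors, a minimal sketch of those two hooks (interfaces as in the Go 1.15 driver docs linked above; the conn type here is hypothetical):

    import (
        "context"
        "database/sql/driver"
    )

    type conn struct {
        closed bool
        // ... driver state
    }

    // IsValid implements driver.Validator (new in Go 1.15): the pool
    // calls it before handing out a cached connection and discards the
    // connection if it returns false.
    func (c *conn) IsValid() bool {
        return !c.closed
    }

    // ResetSession implements driver.SessionResetter: called before a
    // connection is reused for a new session.
    func (c *conn) ResetSession(ctx context.Context) error {
        if c.closed {
            return driver.ErrBadConn
        }
        return nil
    }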


The workaround for the first issue might work well in practice, but due to the asymmetry of TCP streams it is still open to a race where the server idle-closes a connection while a new (non-idempotent!) query is just in flight.

A more correct solution would let the client manage the idle timeout and disconnect once it's reached. Depending a bit on how controlled the client is, this might be a good or bad idea.

Note that this problem generalizes to HTTP/1.x upstream servers. If you need to support non-idempotent requests and want persistent connections to your upstream, it's not a good idea to have the upstream manage the idle timeout (and disconnect when it's reached).

In practice, I would have the client manage an idle timeout of 30s and the server one of 40s, as extra protection against misbehaving clients.
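In Go's database/sql, that client-managed timeout is a couple of pool settings; a sketch using the 30s/40s split above (the DSN is a placeholder):

    import (
        "database/sql"
        "time"

        _ "github.com/go-sql-driver/mysql"
    )

    func openPool() (*sql.DB, error) {
        db, err := sql.Open("mysql", "user:pass@tcp(db:3306)/app")
        if err != nil {
            return nil, err
        }
        // Retire connections client-side, well before the server's
        // 40s idle timeout can kill them.
        db.SetConnMaxLifetime(30 * time.Second)
        db.SetMaxIdleConns(10)
        // Go 1.15+: also cap how long a connection may sit idle.
        db.SetConnMaxIdleTime(30 * time.Second)
        return db, nil
    }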


Some MySQL clients offer this; for example MySqlConnector has a `ConnectionIdleTimeout` setting that controls how long a connection can be alive in the pool: https://mysqlconnector.net/connection-options/

This is generally useful to "prune" the number of open connections in the connection pool after a burst of activity has happened, so I'm surprised more clients don't offer it; I couldn't find an equivalent setting on go-sql-driver/mysql or MySQL Connector/J.


>>> Keep in mind, we run our production MySQL clusters with a pretty aggressive idle timeout (30s)

Number one source of connectivity issues with MySQL. That setting alone is the source of every bug they encountered.

It's marvelous the number of applications that have a low timeout and wonder why their (idle) connections are dropped all the time.

The timeout works fantastically when software prepares a connection to the database, spends maybe 30 seconds initializing and doing other stuff, then sends the query: the carefully prepared connection is guaranteed to be broken, consistently. But the most impressive case is connection pools, whose whole purpose is to maintain idle connections, and which get butchered every 30 seconds.

MySQL should remove this fatal setting and enable TCP keepalive by default.

At least have mercy on developers: set a sane default and cap the minimum value at 5 minutes.
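You can at least check what you're running with from the client (MySQL's shipped default is 28800 seconds, i.e. 8 hours); this snippet assumes an open *sql.DB named db:

    // SHOW VARIABLES returns (Variable_name, Value) rows.
    var name, value string
    err := db.QueryRow("SHOW VARIABLES LIKE 'wait_timeout'").Scan(&name, &value)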


It looks like they don't need the aggressive idle timeout anymore, now that they have ProxySQL for connection pooling:

> ... ProxySQL, which is responsible for connection pooling ...

https://github.blog/2020-03-26-february-service-disruptions-...

More worryingly, it looks like they just rewrote the authz functionality from ruby to golang, and have both the monolith and the microservice accessing the same database?


Agreed. This is fairly egregious misconfiguration.

But it's also unsafe for the Go client/connection pool to expect idle sockets to stay open indefinitely -- any firewall or network appliance traversal would be highly likely to invalidate that.

Basic database connection pooling. Sigh.


Cancellation is always such a big issue regardless of language or environment, and whether it's due to a timeout or some other action.

In C, when using threads and thread cancellation there are a lot of subtleties involved: https://ewontfix.com/2/

Even in Haskell where there's generally a focus on well designed abstractions, asynchronous cancellation can fail due to badly written code that wants to catch all exceptions.


These bugs are not specific to MySQL. The bug where a context cancellation closes the connection, leaving a transaction without an explicitly successful abort/rollback, is a problem as well, because conn.Close gets called in a goroutine when the context expires.

We've experienced this in pq as well.
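The defensive pattern is driver-agnostic: always pair BeginTx with a deferred Rollback and check Commit, so a context-triggered connection close can't leave the transaction's outcome ambiguous in your code. A sketch:

    func doInTx(ctx context.Context, db *sql.DB, stmt string) error {
        tx, err := db.BeginTx(ctx, nil)
        if err != nil {
            return err
        }
        // Rollback after a successful Commit is a harmless no-op; if ctx
        // expires mid-transaction, this still exercises the failure path.
        defer tx.Rollback()
        if _, err := tx.ExecContext(ctx, stmt); err != nil {
            return err
        }
        return tx.Commit()
    }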


This as well; a recent fix for this scenario:

https://github.com/golang/go/issues/34775


Bug #1 doesn't seem solved. They just narrowed the window for the race condition.

I don't see how you can fix this without the client first calling db.Ping() whenever a connection is close to its estimated expiration.


> A quick Code Search for rows.Close() on GitHub shows that pretty much nobody is explicitly checking the return value of rows.Close() in their Go code.

Seems to be a pattern in all languages. Plenty of C code ignores the result of the close system call, and plenty of Java code just wraps exceptions from close calls in try/catch blocks that do nothing with the exception.
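Go makes it just as easy to drop with a bare `defer f.Close()`; checking it on the write path looks like:

    import "os"

    func writeFile(path string, data []byte) error {
        f, err := os.Create(path)
        if err != nil {
            return err
        }
        if _, err := f.Write(data); err != nil {
            f.Close() // best effort; the Write error matters more
            return err
        }
        // Close can report the error telling you the write didn't stick.
        return f.Close()
    }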


Even Rust, famous for caring about safety, does that: for instance, look at the documentation for File (https://doc.rust-lang.org/std/fs/struct.File.html) and you won't find any close() method; the file is automatically closed when the File goes out of scope (in its Drop implementation). Of course, this means that any errors reported by close() are simply ignored. The only way to get the errors from close() is to do a file.into_raw_fd() (which extracts the file descriptor and suppresses the Drop) followed by an FFI call to libc's close().

It gets even worse when you have a BufWriter (https://doc.rust-lang.org/std/io/struct.BufWriter.html) on top, since it can have buffered data which will be written when it goes out of scope (in its Drop implementation), once again ignoring all errors; you have to remember to do a flush() (or an into_inner() to recover the File, which also does a flush()) before it goes out of scope to get any write errors (but you still won't get any errors from closing the file).


Well, what should the code do there?

At that point the status is uncertain; there's potentially a problem, but IMHO there's nothing the code can do to reliably recover from it.


The code should at least log the error, because it means there is something wrong somewhere. Even though you can't do much about it, it helps in understanding other issues.

The only way to log the error is to handle it in Go.


The code should throw away the data it read from the connection just before it closed it, because there's a high chance of it being corrupt.


That's just crazy. How do you define "just before"?


The last iteration before the current check. But I'd assume the foolproof way would be to return an error if Close() returns one, erroring out the entire read. I don't know if a partial read is ever a good thing.


What if it is a write?


That complicates things, but it's no different from seeing an error either way. The point is that checking for an error at point 1 is not enough; if you want to be sure there wasn't an error, you need to check point 2 as well.

In the case of a write, if you see the success message at point 1 you might not care about a failure at point 2, because you already know the write was a success, assuming your confirmation at point 1 was explicit and well formed.

In the case of a read, say you read an email address and received 'someone@somewhere.co'. If the connection was cut, you don't know whether that address was supposed to be '.com', '.co.in' or '.co'. The only way you can tell in the current MySQL protocol is to check for an error at point 2 (the Close()). If you see an error, you know the read was incomplete, and you throw it away. If you don't check, you might have read a wrong (corrupt) value.


That sentence is very misleading. Nobody checks the Close() error because the right thing to do is to call Err(), as with text.Scanner.

I suspect the right github code search would reveal way less buggy code.

https://golang.org/pkg/database/sql/#Rows.Err
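For anyone unfamiliar, the pattern looks like this; Err() after the loop is what distinguishes "end of results" from "connection died mid-read":

    func emails(db *sql.DB) ([]string, error) {
        rows, err := db.Query("SELECT email FROM users")
        if err != nil {
            return nil, err
        }
        defer rows.Close()

        var out []string
        for rows.Next() {
            var e string
            if err := rows.Scan(&e); err != nil {
                return nil, err
            }
            out = append(out, e)
        }
        // nil means iteration ended cleanly; non-nil means the results
        // may be truncated and should be discarded.
        if err := rows.Err(); err != nil {
            return nil, err
        }
        return out, nil
    }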


And here we see why the claim that golang forces users to handle errors is simply not true. Error handling in golang is just badly implemented.


Functions with a single return value (err) are a problem, but generally Go is better at forcing users to handle errors than some languages.

It's also true that error handling in Go is generally less sophisticated than some languages.

For developers who prefer more sophisticated tooling, Go might appear badly implemented. However that's a preference thing.

I hope you can see what I'm doing here; wild generalisations just end up starting flamewars which nobody wins because everyone is using a different yardstick to measure their "correctness".


I am not at all fond of the Go aesthetic personally, but given that aesthetic they really do make the most of the trade-offs they've chosen, and all things considered, yeah, within Go's style the way error handling works is actually damn good.


> Functions with a single return (err) are a problem but generally Go is better at forcing users to handle errors than some languages.

Which languages?


It's not a preference, specifically because golang's error handling is error-prone. The things that slide in golang code don't happen in languages with exceptions.


Go has exceptions too; however, they serve a different use case.

Sometimes an error is a problem that should be handled with granular logic and sometimes an error is something that should hard terminate that function and any parent functions until it is captured.

The problem with exceptions is that some languages don't surface that a particular function or API might raise an exception, so you're often left having to capture all exceptions and write generic failure messages, hoping that the exception is descriptive (often it isn't). Or worse yet, you fail to capture an exception and end up with your application hard-terminating over something that isn't serious to the program's function.

This is why err is so widely used in Go; it enables granular logic, makes it clearer where errors are communicated and thus when they need to be handled; and enables developers to escape code or report errors in ways that are more meaningful to those who actually run the application and who might not have the ability to read the code.

By more useful error messages I mean something like this pseudo-code

    function writeLog(message string) {
        open(CREATE | APPEND, "/var/log/example.log")
        writeln(message)
        close("/var/log/example.log")
    }
If you had an exception in open(), writeln(), or close(), then how do you report that to the user? What state has the function been left in? Do you have to capture the exception from each function, or do some functions not generate an exception (close() probably wouldn't, but the others could)?

Whereas (again pseudo-code)

    function writeLog(message string) error {
        err = open(CREATE | APPEND, "/var/log/example.log")
        if err != nil {
            return errors.New("Unable to open log: %s", err)
        }

        err = writeln(message)
        if err != nil {
            return errors.New("Log open but unable to write: %s", err)
        }

        close("/var/log/example.log")
        // doesn't return an error
    }
It's clear to the end user what specifically has failed, it's clear to the developer which functions can error, and OK, the code is a little uglier for it, but no more so than many other examples of handling errors/exceptions (error handling is inherently ugly).

This is clearly a bonkers example made up on the spot, and it wouldn't matter too much if there were just one generic "unable to write to log" message. But the concept described above has proven very useful in larger applications I've written that needed to handle errors nested several layers deep in a way that's meaningful for end users (as opposed to developers).

Now I'm not saying Go got it right nor that other languages haven't solved this in better ways and I'm certainly not trying to advocate that Go's error handling is "good". However it often feels like most of the complaints leveraged against Go's error handling don't really understand the point behind Go's design in the first place or have other missing pieces of information about what can and cannot be done in Go.


Regarding an exception approach -- at a business level, the user cares that the log hasn't been written. Only one exception is needed to report that (so long as the exception includes the cause).

The causative IO exception should indicate which operation failed (open or write), and -- for those with access to source -- should indicate the line number.

Encapsulation principles suggest that both the user & calling code should care only that the log was written successfully (or not), but not how or why it succeeded or failed. Any recovery in code (if possible, usually it's not) should be encapsulated within the method.

In most error cases there is no realistic or sound recovery (occasionally a generic retry might be valid). The main purpose of error handling is to stop successfully and report the problem in a diagnosable manner.

Some context on the purpose of error handling/exceptions in this article: http://literatejava.com/exceptions/checked-exceptions-javas-...


> Regarding an exception approach -- at a business level, the user cares that the log hasn't been written. Only one exception is needed to report that (so long as the exception includes the cause).

I thought I made it clear that example was intended to be more illustrative than literal, but obviously not. However there will clearly be occasions when exceptions are too blunt of a tool. Maybe think back to one of those occasions?

> The causative IO exception should indicate which operation failed (open or write), and -- for those with access to source -- should indicate the line number.

You're making a lot of assumptions there:

- that any imported libraries have well-written error messages (sometimes they don't, but they're still the best at handling normal operation).

- that there is only one operation of a given type within a function (e.g. what if you're writing to two different logs? It would certainly help to wrap that error with which log you were writing to, rather than a generic error for the whole function)

- that the person debugging the executable has access to the source code -- or even understands the language well enough for the source to be of use. The target audience might not even be technical, so having useful error messages they can report back to support helps reduce the cost on your support team.

> Encapsulation principles suggest that both the user & calling code should care only that the log was written successfully (or not), but not how or why it succeeded or failed.

Not entirely true. Code shouldn't care, but the humans who need to rectify the fault definitely DO care. E.g. how often do people on HN comment about the usefulness of Rust's compiler messages? It's exactly the same principle you're now arguing against.

Also the log example is illustrative.

> Any recovery in code (if possible, usually it's not) should be encapsulated within the method.

The example I'd written does exactly that. So we're all good then ;)

> In most error cases there is no realistic or sound recovery (occasionally a generic retry might be valid). The main purpose of error handling is to stop successfully and report the problem in a diagnosable manner.

That's a rather large generalisation and often isn't true, e.g.:

- Some frameworks return EOF as an error state when reading past the end of a file (eg from pipes).

- If you're running a web service and someone's connection terminates while you're still writing to it, that produces an error, but you obviously need to recover from it; otherwise your whole web service crashes.

Errors and exceptions are a pretty normal way of breaking control flow and aren't just used in edge cases where the application should be terminated in a controlled manner.

> Some context on the purpose of error handling/exceptions in this article: http://literatejava.com/exceptions/checked-exceptions-javas-...

I've been writing Java since the v1 days. In fact, I have 30 years of development experience across dozens of languages. Java does a lot of things better than Go, but frankly I'd take Go's error system over Java's.

I think the biggest drawback of Go's error system is that it's manual and verbose. Ironically, that's also its strength; that's why it works (as I described in my previous post). However, from the perspective of developers looking in from the outside, err blocks just look ugly. Which I also get. Sometimes it is a chore writing err blocks, but they are also immensely useful in practice (and when isn't capturing errors tedious?)


No.

Error handling in Go is very explicit since it doesn't alter the control flow of the program through an exception. I'm hesitant to agree this is just a bad implementation, because you will have to make decisions on error paths in any case. If you don't do that, you are just doing bad programming.

On the surface, it looks like you want to use exceptions or other control operators for handling this. But after having programmed for more than 20 years, I'm not too sure there is a good solution to this problem. Programs and operations in programs can fail. And you will have to think about these failures eventually.


The point is that Go doesn't force any developer to deal with errors. Its error system isn't better than exceptions. In fact, Go doesn't really have an error system (aside from panics); Go errors are purely a convention, and no mechanism around them is actually baked into the language.

It's exactly like the "good old" return codes in C. Nothing more; this isn't an error system at all. It's primitive. It's just that Go has multiple return values to make it a bit more bearable.

> Programs and operations in programs can fail.

Yes, and using return codes absolutely does not guarantee you will avoid panics at all. In fact you can use error codes in any other language if it pleases you, because that's not a language construct.

The whole "we have exceptions but we don't have exceptions so you shouldn't use panic/recover" dance is just stupid.


Why aren’t these problems applicable to Postgres? Is the connection model different or do the libraries already do all this?


Postgres handles the network connection quite differently. In particular, it can multiplex several queries on the same connection and feed data back in smaller chunks.

You may run into bugs, of course, but they are likely to be of a different kind.


It's a MySQL-specific issue; in short, MySQL drops connections out of the box.

It's a combination of misunderstood settings, bad defaults, and bad configuration that's been copy-pasted from the internet since inception.


Why is Postgres relevant to this post at all?


Because Go uses the same provided abstraction layer for all SQL databases. If the errors were in the abstraction layer and not the driver, it would be odd for them to be MySQL-specific.


Not really that odd. The abstraction layer provides functionality common to all. Perhaps something common didn't work for mysql.


Entirely possible.

If you've ever looked at an ORM, or even a complex micro-ORM that supports multiple databases, you get an appreciation for how many 'little' differences there are between everyone's take on the SQL standard itself.

And if that's the "industry standard" (Yeah, I know it's not, but it's at least a spec/guideline) with that much deviation, I can only imagine how dissimilar each DB's nonstandard protocol might be.


I could see GP's comment as a way of exploring the problem in order to get a better grasp on the technical or architectural choices that may have caused the difference in outcome, in order to potentially learn what's important about how to design things like connection pooling.


Yeah, we use Go with Postgres a lot, so I'm wondering if we should check for all these problems on Postgres as well. And if we don't need to, why not? What's so different about the PG connection model that these problems won't happen there? The principles are all the same, and the QueryContext changes apply equally to all database adapters.


I'm sure there are a lot of reasons to hate it, but I like how Go does the different database types and the base type.


A lot of languages do this under the hood, because SQL abstraction layers need it to smooth over network and query discrepancies.


I encountered this same issue a few years ago when using the Go mysql driver, and the simple fix was calling db.Ping() before doing a query, which checks the connection and allows the query to proceed.
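Sketched below. Worth noting that database/sql may not even use the same pooled connection for the Ping and the query, which is one reason this narrows the race rather than closing it:

    func queryWithPing(ctx context.Context, db *sql.DB, q string) (*sql.Rows, error) {
        // Revalidates some connection in the pool (or dials a new one)...
        if err := db.PingContext(ctx); err != nil {
            return nil, err
        }
        // ...but the query may be served by a different connection.
        return db.QueryContext(ctx, q)
    }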


The general approach I've used in other languages is that a correctly implemented MySQL client API [1] will return a CR_SERVER_GONE_ERROR to the calling code when it encounters a closed connection. You catch that type of error, re-initialize the connection, and replay the last query/transaction.

[1] https://dev.mysql.com/doc/refman/8.0/en/gone-away.html


So you have to make a round trip to the database before making a query? Yikes.


You don't need a round-trip, you can push a ping request into the pipe immediately followed by your actual query. You don't care about the actual result of the ping unless your query fails - "if the ping also failed, this was a dead connection so I can retry the query, but if the ping was OK and the query was not, the query failed"


I think you could also accomplish the same thing with a goroutine: a struct that stores an RW mutex and a sql.DB, with a .Start() function that runs the ping in a loop. On each iteration, lock the write mutex, run the ping, reconnect if it fails, and then release the mutex. Then wrap each of the sql.DB functions so that it takes a read lock before executing and defers the release.

It's still a performance penalty, but it moves the round trip into the tail (p99) latency instead of every query.
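A rough sketch of that idea (the type is hypothetical; the reconnect branch is elided):

    import (
        "context"
        "database/sql"
        "log"
        "sync"
        "time"
    )

    type pingedDB struct {
        mu sync.RWMutex
        db *sql.DB
    }

    // Start pings on a loop under the write lock, so no query can begin
    // while a reconnect might be in progress.
    func (p *pingedDB) Start(interval time.Duration) {
        go func() {
            for range time.Tick(interval) {
                p.mu.Lock()
                if err := p.db.Ping(); err != nil {
                    log.Printf("ping failed: %v", err) // reconnect here
                }
                p.mu.Unlock()
            }
        }()
    }

    // Each sql.DB method gets wrapped the same way, under the read lock.
    func (p *pingedDB) QueryContext(ctx context.Context, q string, args ...interface{}) (*sql.Rows, error) {
        p.mu.RLock()
        defer p.mu.RUnlock()
        return p.db.QueryContext(ctx, q, args...)
    }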


I really don't understand the discussion around the first bug. Is something being oversimplified, or is it just an issue of a bad abstraction?

This issue (the other side closing the connection) is so fundamental to all networking code that I don't see how the specific use case (MySQL idle timeout) is special. The connection pool also doesn't sound relevant; the same would happen if the caller kept the same connection (socket) and used it.

For this class of issues, this is the easiest case to deal with: the other side sent a FIN and we got it! Our kernel knows the socket is closed (for this case it doesn't matter whether it's fully closed or only half-closed).

So if you did a select() (or equivalent) call on the socket, it would not return as writable (and would probably return in the exception set), and you'd reconnect.

If you don't use select, the write will fail immediately, and I would guess that in the worst case you can see that 0 bytes were written?

So the problem here was just that the wrong error was returned through the multiple layers of abstraction? Why not propagate the error correctly? Or handle the reconnection at a lower level? Trying to "Ping" at the application level is such overkill and a misuse of resources. Do we really want an extra round trip for every action we do on a DB?

Of course, as someone else pointed out, there are much harder issues to consider - silent disconnects, proxied connections where the proxy didn't propagate the error, failures on the DB level, and so on.

EDIT: I see now that another comment mentions that MySQL clients have a specific error code for a lost server.


>> Instead, make sure that every SQL operation in your app is using the QueryContext / ExecContext interfaces

I've been wondering if this is indeed a best practice or something that should be used only when necessary. Should every database query really abort if the connection to the client fails?

In addition, if you follow this advice you now need to check whether the returned error is context.Canceled, which warrants an HTTP 400 response.
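The branch in question, sketched inside a handler (the query is a placeholder, the status code follows the suggestion above, and errors.Is needs Go 1.13+):

    import (
        "context"
        "database/sql"
        "errors"
        "net/http"
    )

    func handler(w http.ResponseWriter, r *http.Request, db *sql.DB) {
        rows, err := db.QueryContext(r.Context(), "SELECT ...")
        if errors.Is(err, context.Canceled) {
            // The client went away mid-request; not a server failure.
            w.WriteHeader(http.StatusBadRequest)
            return
        }
        if err != nil {
            w.WriteHeader(http.StatusInternalServerError)
            return
        }
        defer rows.Close()
        // ... scan rows
    }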


Of course it is up to the application to decide case-by-case.

Just keep in mind that if your query is started by a user request, but does not need to terminate in the context of the user request that triggered it, you can move the query to a separate goroutine and decouple it completely from the triggering user request.

Also, aborting a query is safe: the transaction gets rolled back, removing unwanted side effects. You cannot do harm by cancelling, but you can do harm if you don't cancel.
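For example (a sketch; the statement and timeout are placeholders):

    // Decouple the write from the user's request context, so the user
    // hanging up doesn't abort it.
    go func() {
        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        if _, err := db.ExecContext(ctx, "INSERT INTO audit_log ..."); err != nil {
            log.Printf("background query failed: %v", err)
        }
    }()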


Dumb question, but you can't listen for a close socket event in Go?


A TCP half-close, with a FIN? I think that's a limitation of the socket system call API itself.


Yeah, I wasn't sure. I was looking at a Dart MySQL connection pool implementation, and that's what's done.

  socket.listen(onData, onDone: onSocketDone);

  onSocketDone() {
    _closed = true;
  }

  write(Buffer buffer) {
    if (_closed)
      throw StateError('Cannot write to socket, it is closed');

    ...write
  }

  onData(RawSocketEvent event) {
    if (event == RawSocketEvent.READ_CLOSED) {
      _closed = true;
    }

    // Other events handled as well, READ, CLOSED, WRITE
  }
But I don't know what's going on under the hood there.


[flagged]


Isn’t this clearly an explanation of a patch contributed upstream? Don’t think the connector is proprietary.


One consequence of reinventing the wheel (Go) is going through the same problems others stumbled over years ago.


As someone who is currently transitioning from Go to Java/Kotlin I agree with you to an extent. However, I was surprised that I wasn't able to find anything in the Java/Kotlin universe that is comparable to Go's web request context cancellation.


https://kotlinlang.org/docs/reference/coroutines/cancellatio...

BTW, what do you think about the transition? What do you miss from Go, and what have you found in Kotlin + the JVM library ecosystem that you would miss in Go?


Loom will have superior context cancellation and deadline handling compared to Go. https://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part1....


I didn't read all of it, but I was wondering: in a virtual Loom thread, am I allowed to call currentThread().sleep(...), or will it block the underlying OS thread? Or is there some magic like currentThread().sleepAndThen(..., ()->{})?


CompletableFuture is the closest concept in Java, to my knowledge.

java.util.concurrent has a ton of great classes. If you need to do shared memory concurrency, it probably has something there to help you.


Just curious - are any of these bugs possible in Rust?


Rust SQL drivers (when they exist) are mostly not production-ready; they just don't have the human resources. For example, the leading driver (Diesel) doesn't support elementary SQL such as GROUP BY.


I thought Diesel was just an ORM, but investigating your comment I discovered that they've implemented their own drivers in-tree. That surprises me, and also seems like a mistake. Steven Fackler's PostgreSQL driver, for instance, is widely considered to be quite solid.


The justification given here for moving to Go is very weak. Firstly, if they really cared about performance they would move to JRuby or TruffleRuby; secondly, most of the performance of a language is in its library ecosystem rather than in the language itself. Ruby has had decades of server-side optimisation that the nascent Go has to reinvent.


Go: Yet another Google language.

What's it going to be next week?


They have Dart, which is alive and well after 9 years and Go, which is alive and well after 11 years.

What point were you trying to make?


As someone running MySQL with Go in production and seeing random MySQL connection-loss exceptions, is there a summary of how they fixed this? It's an insanely long article, and I appreciate the detail and effort, but if someone has read it all and has a gist, that would be insanely helpful.


You have production issues related to the article, but you don't have the time to read the article?


Sorry if I came across as lazy. It’s a sporadic issue that’s more easily fixed with a service restart. Did not mean to come off as entitled. My intent was just to ask for some help from the community.


The article has three headings for “production tips”. TL;DR read those.


Thanks! I ended up reading the entire article and didn't really understand it all, but I think the tl;dr is: upgrade to Go MySQL driver 1.5.



