I wonder if machine learning could be used to model the probability distributions with greater capacity, and thus reduce backtracking.
One might also consider placing the algorithm in the context of a generative adversarial network (GAN) to adapt the tile-probability-modeling ML component towards output that is less distinguishable from a real city.
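A minimal sketch of the first idea (the 3x3 neighbourhood framing, the tile count, and PyTorch itself are all my own assumptions here, not anything from the original algorithm): a small network that maps a one-hot encoding of a cell's already-collapsed neighbors to a distribution over candidate tiles, which could stand in for hand-tuned frequency weights when picking the next tile.

```python
import torch
import torch.nn as nn

NUM_TILE_TYPES = 16   # assumption: 16 distinct tile types
NEIGHBORHOOD = 8      # assumption: the 8 surrounding, already-collapsed cells

class TileProbabilityModel(nn.Module):
    """Maps a one-hot encoding of a cell's neighbors to a distribution
    over candidate tile types for that cell."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NEIGHBORHOOD * NUM_TILE_TYPES, 128),
            nn.ReLU(),
            nn.Linear(128, NUM_TILE_TYPES),
        )

    def forward(self, neighbors_onehot: torch.Tensor) -> torch.Tensor:
        # Log-probabilities; exponentiate (or sample) when collapsing a cell.
        return torch.log_softmax(self.net(neighbors_onehot), dim=-1)

# Toy usage: one cell with a random neighborhood context.
context = torch.zeros(1, NEIGHBORHOOD, NUM_TILE_TYPES)
context[0, torch.arange(NEIGHBORHOOD),
        torch.randint(NUM_TILE_TYPES, (NEIGHBORHOOD,))] = 1.0
log_probs = TileProbabilityModel()(context.flatten(start_dim=1))
print(log_probs.exp())  # weights for choosing the next tile
```

Trained on crops of real city maps, something like this could supply the per-cell weights directly; in the GAN variant, a discriminator's "real city vs. generated" signal would be the training objective instead.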
Death metal is my go-to work music for when the fight with entropy feels like a losing battle. Especially as the rest of our capitalist society tells you to be happy all the time, death metal offers you the intuitive sense that you're not alone in the pain, and some solace in the fact that some people are brave enough to lament their difficulty out loud.
I think catharsis is really important and people should teach themselves to seek it the way they seek their professional goals, food and companionship.
Conventional processors will eventually run into physical limits on heat dissipation that will impede Moore's law. Reversible computers are a way of circumventing this physical limitation because they dissipate less energy as heat.
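The relevant floor here is Landauer's principle: every irreversible bit erasure dissipates at least kT·ln 2 of heat, which reversible logic can in principle avoid by never destroying information. A rough back-of-the-envelope calculation (room temperature and the erasure rate are just illustrative assumptions):

```python
import math

# Landauer limit: minimum heat dissipated per irreversible bit erasure.
BOLTZMANN = 1.380649e-23  # J/K
T = 300.0                 # assume roughly room temperature, in kelvin

energy_per_bit = BOLTZMANN * T * math.log(2)   # ~2.87e-21 J
print(f"Landauer limit at {T} K: {energy_per_bit:.3e} J per bit erased")

# Example: erasing 1e18 bits/s (an arbitrary illustrative rate) would
# dissipate at least this much power, even with otherwise perfect hardware.
print(f"Floor for 1e18 erasures/s: {energy_per_bit * 1e18 * 1e3:.3f} mW")
```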
When I think of exabyte-scale queries on a columnar datastore I think of aggregations, but then I have this question: why do we need to do exabyte-scale queries in the first place? Wouldn't statistical inference via random sampling be faster and still accurate enough?
(Granted, aggregations often happen after some filtering, at which point the relation being aggregated might be considerably smaller than exabyte scale.)
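Roughly what I have in mind, as a sketch (the data, column, and 1% sampling rate are all made up for illustration): estimate the aggregate from a uniform random sample and attach a confidence interval, instead of scanning everything.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a huge fact table: 10 million "revenue" values.
revenue = rng.lognormal(mean=3.0, sigma=1.0, size=10_000_000)

# Uniform 1% sample instead of a full scan.
sample = rng.choice(revenue, size=revenue.size // 100, replace=False)

est_mean = sample.mean()
# 95% confidence interval for the mean, computed from the sample alone.
half_width = 1.96 * sample.std(ddof=1) / np.sqrt(sample.size)

print(f"true mean   : {revenue.mean():.4f}")
print(f"sample mean : {est_mean:.4f} +/- {half_width:.4f}")
```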
Redshift is designed to fill the classic accounting data warehouse role in an organisation. Whilst I'm sure there aren't too many companies with account ledgers that large (or any), I doubt too many accountants would be happy with statistical inference of their books... ;)
This new model of processing directly on S3 is pretty much aimed specifically at eliminating the "Load" part of the ETL process. Just dump to CSV from whatever sources you originally had, and don't worry about the schema conversion and loading into a DB. The fact that it happens to scale to exabytes is just good marketing fluff.
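To illustrate the "query the CSV where it already sits" idea, here's a sketch using S3 Select via boto3 (a smaller-scale AWS feature than the Redshift-on-S3 offering, but the same no-load workflow); the bucket, key, and column names are made up.

```python
import boto3

s3 = boto3.client("s3")

# Run SQL directly against a CSV object in S3: no load step, no schema
# conversion into a warehouse first.
resp = s3.select_object_content(
    Bucket="my-example-bucket",        # hypothetical bucket
    Key="exports/orders.csv",          # hypothetical CSV dump
    ExpressionType="SQL",
    Expression=(
        "SELECT s.region, s.amount FROM s3object s "
        "WHERE CAST(s.amount AS FLOAT) > 100"
    ),
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response payload is an event stream; the matching rows arrive
# in "Records" events.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```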
Yeah, I think filtering is a big part of it. If you want to answer a statistical question about the entire dataset, then a random sample is probably good enough. If you want to drill down and do an analysis that only looks at a particular narrow slice of the data, then it's likely that the corresponding subset of your sample isn't big enough to be meaningful.
(You can pre-filter or pre-aggregate before sampling, but that assumes you know a priori what types of queries you'll want to do.)
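To put rough numbers on that (again made up for illustration): a 1% sample works fine for the whole-dataset question, but a segment covering only 0.01% of rows leaves you with a handful of sampled rows to reason from.

```python
import numpy as np

rng = np.random.default_rng(1)

n_rows = 10_000_000
# Hypothetical categorical filter that matches only 0.01% of rows.
is_rare_segment = rng.random(n_rows) < 1e-4

sample_idx = rng.choice(n_rows, size=n_rows // 100, replace=False)  # 1% sample
rows_in_slice = is_rare_segment[sample_idx].sum()

print(f"rare-segment rows in the full data : {is_rare_segment.sum()}")  # ~1000
print(f"rare-segment rows in the 1% sample : {rows_in_slice}")          # ~10
# ~10 rows is enough for existence checks, not for a meaningful estimate.
```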
It really depends on what you are doing. A large data set shouldn't be limited to longitudinal analysis. If you're storing every log record or every stock bid/ask, there may be times when you need to understand the specifics of what exactly was going on. There may be a lot of filtering on the underlying corpus for these sorts of exact-match queries, but data set sizes continue to grow.
That said, I agree that approximate functions should be part of a modern database system. Redshift has approximate count distinct (based on HyperLogLog) and approximate percentiles (based on quantile summaries).
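For intuition about the count-distinct side, here's a toy sketch of the HyperLogLog idea (not Redshift's implementation; the register count and hash choice are arbitrary): hash each item, use a few bits to pick a register, keep the maximum leading-zero count seen per register, then combine the registers into a cardinality estimate.

```python
import hashlib
import math

def hll_count_distinct(items, p=12):
    """Toy HyperLogLog estimator with 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        # 64-bit hash of the item.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (m - 1)                      # low p bits pick a register
        w = h >> p                             # remaining 64 - p bits
        rank = (64 - p) - w.bit_length() + 1   # leftmost 1-bit position (1-based)
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)           # bias correction for large m
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    # Small-range correction (linear counting) when many registers are empty.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw

# Expect roughly 100000, within a few percent, using only 4096 registers.
print(hll_count_distinct(range(100_000)))
```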