krnaveen14's comments

Reminds me of the time when I implemented a highly optimised version of the Sieve of Eratosthenes in Java, just for a Code Golf challenge.

Memory usage is greatly reduced by using a BitSet.

https://github.com/krnaveen14/PrimeNumbers/blob/master/src/m...

https://codegolf.stackexchange.com/a/66915/41984

OMG, it's been 8 years already!
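For anyone curious, here's a minimal sketch of the BitSet idea (not the golfed code from the links above; the class and method names are just illustrative): one bit per candidate number instead of a boolean or int, so sieving up to N needs roughly N/8 bytes.

    import java.util.BitSet;

    public class SieveSketch {
        // Bit i is set iff i is composite; clear bits >= 2 are the primes.
        static BitSet sieve(int limit) {
            BitSet composite = new BitSet(limit + 1);
            for (int p = 2; (long) p * p <= limit; p = composite.nextClearBit(p + 1)) {
                for (long m = (long) p * p; m <= limit; m += p) {
                    composite.set((int) m);
                }
            }
            return composite;
        }

        public static void main(String[] args) {
            int limit = 100;
            BitSet composite = sieve(limit);
            for (int i = 2; i <= limit; i++) {
                if (!composite.get(i)) System.out.print(i + " ");
            }
        }
    }

nextClearBit() also doubles as the "find the next prime" step, which is what keeps the golfed versions so short.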


The Loneliness of a pretty good developer https://news.ycombinator.com/item?id=31438426




Thanks for sharing, I enjoy reading about the GC changes every release.


Based on our experience with Apache Kafka and alternative streaming systems, Apache Pulsar natively addresses Honeycomb's needs.

- Decoupling of Broker & Storage Layer

- Tiered Storage (SSD, HDD, S3, ...)

We use both Kafka and Pulsar in our systems.

- Kafka is used for microservices communication and operational data sharing

- Pulsar is used for streaming large customer data in thousands of topics


In 2016 when we were founded, Pulsar wasn't a thing yet. Today, we'd be very likely to use Pulsar if we were starting from scratch.


Agree. We recently migrated from Kafka (Confluent Platform) to OSS Pulsar for a near-identical use case and couldn't have been happier - we've saved costs, improved uptime, and reduced errors. We found it so much easier to manage Pulsar ourselves with automation that it mostly removed the need for a managed Kafka service (we've tried solutions from AWS, Confluent, and Aiven).

The only downside with Pulsar (which is constantly improving) was that community support for teething issues was a bit slower than what we saw with Kafka.


Now that Kafka has ZK taken out and tiered storage coming in, these factors are becoming moot points. With tiered storage Kafka has two layers and hence the benefits of broker/storage decoupling. By comparison, Pulsar has broker, BookKeeper, S3, and ZK layers to contend with. This is why a second layer was never added to Kafka directly.


What's the story like for migrating from Kafka to Pulsar?


- Kafka couldn't cope if there are hundreds or a few thousand topics. High CPU load, longer startup times...

- Even an empty Kafka topic consumes 20MB of on-disk storage (that's 20GB for 1000 topics)

- Inevitable coupling of a non-partitioned topic to a particular Kafka broker, limiting storage scalability

- Tiered storage was not available previously in Kafka (is it now available in the open source version?)

- Native multi-tenant support with authentication/authorization is not available in Kafka - essential for thousands of customer-scoped namespaces and topics (see the sketch below)
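To illustrate the multi-tenancy point: in Pulsar the tenant and namespace are part of the topic name itself, and auth can be scoped to them. A hedged sketch using the standard Pulsar Java client, with made-up tenant/namespace/topic names and a placeholder broker URL and token:

    import org.apache.pulsar.client.api.AuthenticationFactory;
    import org.apache.pulsar.client.api.Producer;
    import org.apache.pulsar.client.api.PulsarClient;
    import org.apache.pulsar.client.api.Schema;

    public class TenantTopicSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder service URL and token; authorization is enforced per tenant/namespace on the broker side.
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://broker.example.com:6650")
                    .authentication(AuthenticationFactory.token("<customer-scoped-jwt>"))
                    .build();

            // Topic format: persistent://<tenant>/<namespace>/<topic>, e.g. one tenant per customer.
            Producer<String> producer = client.newProducer(Schema.STRING)
                    .topic("persistent://customer-acme/backups/change-events")
                    .create();

            producer.send("hello");
            producer.close();
            client.close();
        }
    }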


> Kafka couldn't cope if there are hundreds or a few thousand topics. High CPU load, longer startup times...

It also uses num_partitions*2 open file descriptors per topic, which can quickly surpass the default ulimit on a host. Always remember to raise the ulimit before you near 1000 topics, otherwise Kafka crashes.


Right -- the question I and others have is: can you convert an existing topic in place, or do you have to do the dual-writer/shadow-launch, then drain-the-old-one model, etc.?


We chose both Kafka and Pulsar from the start. Haven't done any migrations from one to the other.

If we were going to move from Kafka to Pulsar (operational data), the plan would be (rough sketch below):

- Publish new messages only to Pulsar

- Wait for Kafka Subscriber to process all pending messages

- Start Pulsar Subscriber (steps 2 and 3 can be done in parallel or sequentially, based on message ordering needs)

We don't need to worry about the Pulsar subscriber offset since existing messages in the Kafka topic won't be copied to the Pulsar topic, preventing duplicate messages. The whole migration would be completed in 1-2 min (depending on the step 2 wait time).

This approach is deliberately chosen so that there may be a little lag on the subscriber side, but the publisher is not affected by any means.
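A rough sketch of that cut-over, assuming hypothetical topic/group/subscription names, a reasonably recent Kafka client (for committed(Set<TopicPartition>)), and that the old Kafka subscriber commits its offsets: switch the publisher to Pulsar first, poll the Kafka group's lag until it hits zero, then start the Pulsar subscriber.

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Properties;
    import java.util.Set;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.pulsar.client.api.Consumer;
    import org.apache.pulsar.client.api.Producer;
    import org.apache.pulsar.client.api.PulsarClient;
    import org.apache.pulsar.client.api.Schema;

    public class CutoverSketch {
        public static void main(String[] args) throws Exception {
            // Step 1: the publisher now writes only to Pulsar (hypothetical topic name).
            PulsarClient pulsar = PulsarClient.builder()
                    .serviceUrl("pulsar://broker.example.com:6650")
                    .build();
            Producer<String> producer = pulsar.newProducer(Schema.STRING)
                    .topic("persistent://ops/default/orders")
                    .create();

            // Step 2: wait for the existing Kafka consumer group to drain its backlog.
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka.example.com:9092");
            props.put("group.id", "orders-service"); // the old subscriber's group
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());
            try (KafkaConsumer<String, String> probe = new KafkaConsumer<>(props)) {
                Set<TopicPartition> parts = new HashSet<>();
                probe.partitionsFor("orders")
                     .forEach(p -> parts.add(new TopicPartition("orders", p.partition())));
                while (true) {
                    Map<TopicPartition, Long> end = probe.endOffsets(parts);
                    Map<TopicPartition, OffsetAndMetadata> committed = probe.committed(parts);
                    long lag = parts.stream()
                            .mapToLong(tp -> end.get(tp)
                                    - (committed.get(tp) == null ? 0 : committed.get(tp).offset()))
                            .sum();
                    if (lag == 0) break; // old subscriber has processed everything
                    Thread.sleep(1000);
                }
            }

            // Step 3: start the Pulsar subscriber; it only ever sees messages published to Pulsar,
            // so there is no offset translation and no duplicates across the two systems.
            Consumer<String> consumer = pulsar.newConsumer(Schema.STRING)
                    .topic("persistent://ops/default/orders")
                    .subscriptionName("orders-service")
                    .subscribe();
            // ... normal consume loop from here ...
            producer.close();
            consumer.close();
            pulsar.close();
        }
    }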


What's the difference between Kafka, Pulsar, RabbitMQ and Crossbar.io? They all seem to be hubs for PUB/SUB and RPC.


Is Pulsar a good fit for microservices communication?


Yes, definitely. It's just that we used Kafka at the start for microservices communication and later chose Pulsar for large data streaming.

We have it on our roadmap to migrate the former to Pulsar, but it's not the utmost priority right now. It would also simplify infrastructure management by keeping one event streaming stack instead of two.


A few years back we were doing heavy file processing in Java, with most of the time being spent in file IO. Initially it was designed as: receive a zip file, extract it to a directory, then read and process the extracted files one by one. If the zip file is 1GB and expands to 10GB when extracted, the amount of IO being done is significant: 1GB read -> 10GB write -> 10GB read. Suppose our AWS instance type is capable of 50MB/s; we were spending a minimum of 420 sec on IO operations alone.

This limited the throughput - the number of files that could be processed before the next set of zip files arrived at a fixed interval. Since we pass through each file only once during processing, we had to eliminate the zip extraction and read the files in the zip one by one as a decompressed byte stream. This was possible with zipInputStream.getNextEntry() and reading the bytes, but it posed a major refactoring inconvenience where we now had to deal with byte[] instead of File everywhere.

Then came the NIO and FileSystem provider features (they were already available, but we came to know about their benefits only then). All we had to do was replace File with Path and new FileInputStream() with Files.newInputStream(). For zip file decompression, we simply replaced ZipInputStream and ZipEntry with FileSystems.newFileSystem(). Near instantly we reduced 21GB of IO to just 1GB, cutting the total processing time dramatically.

Based on this understanding, we implemented a similar approach for zip file creation, where files are written directly into the zip as a compressed stream, using only Path in all places (rough sketch of both directions below).
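A hedged sketch of what the read and write paths look like with the zip FileSystem provider (paths, entry names, and the process() helper are made up; the jar: URI form of FileSystems.newFileSystem is used since it works across Java versions):

    import java.io.InputStream;
    import java.net.URI;
    import java.nio.file.FileSystem;
    import java.nio.file.FileSystems;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Collections;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class ZipFsSketch {
        public static void main(String[] args) throws Exception {
            // Read side: mount the zip as a FileSystem -- no extraction to disk,
            // entries are decompressed as streams on the fly.
            Path inZip = Paths.get("/data/incoming/batch.zip"); // hypothetical path
            try (FileSystem zfs = FileSystems.newFileSystem(
                    URI.create("jar:" + inZip.toUri()), Collections.emptyMap())) {
                List<Path> entries;
                try (Stream<Path> walk = Files.walk(zfs.getPath("/"))) {
                    entries = walk.filter(Files::isRegularFile).collect(Collectors.toList());
                }
                for (Path entry : entries) {
                    try (InputStream in = Files.newInputStream(entry)) {
                        process(in); // same processing code, now fed a stream instead of an extracted File
                    }
                }
            }

            // Write side: files are written straight into a new zip as compressed entries.
            Path outZip = Paths.get("/data/outgoing/result.zip"); // hypothetical path
            try (FileSystem zfs = FileSystems.newFileSystem(
                    URI.create("jar:" + outZip.toUri()),
                    Collections.singletonMap("create", "true"))) {
                Files.write(zfs.getPath("/report.csv"), "id,value\n1,42\n".getBytes());
            }
        }

        static void process(InputStream in) { /* ... */ }
    }

The nice part is that the rest of the pipeline only ever sees Path and InputStream, so the zip handling stays invisible to it.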

Developers need a mental model not just of memory constraints but also of the volume of disk IO operations, where latency will instantly kill application performance once the page cache can no longer hold the files in memory.


I suppose many of us (including myself) are waiting for the Java 17 LTS release so that there's a single migration phase (8 -> 17) rather than doing it now (8 -> 11) and again (11 -> 17) after a few months.


I'd say that Java 8 -> 11 is going to be an order of magnitude more disruptive than 11 -> 17.

So if you have downtime now (ha ha), you might as well migrate to 11.


Why not put all versions of Java, including the master branch build, into your CI? Then you can instantly see what it's possible to upgrade to, and instantly see when incompatibilities appear.

I don't really get what migration issues people experience in practice.


Upgrading from 8 to 11 is a good step to have. You will not believe how many problems pop up. Upgrading from 8 to 17 might be a very hard exercise - libraries and all...


Our product offers a near-real-time (incremental sync every 3 mins) backup and restore solution for on-premise SQL Server data. We use PostgreSQL to store the backup data in the cloud and offer add-on services on top of the PostgreSQL data (such as reports, analytics, etc). Every customer's data is stored as a separate database.

Initially we used RDS as the PostgreSQL instance when the product was in the pilot phase. RDS cost us $548 for just 2 vCPU, 16GB RAM, 500GB SSD (db.r5.large Multi-AZ). Considering the growing active customer base and the volume of data involved, we found that RDS was very expensive and cost us more (> 100%) than the market-affordable estimated product pricing (:facepalm:). As per our performance benchmark, a db.r5.large can accommodate 250 customers and scales linearly. To reduce the RDS spend, we had to reduce two costs.

1) Reduce the RDS instance cost per customer - we aggressively optimised our sync flow, and the final benchmark showed that 500 customers can be accommodated on a db.r5.large instance (50% less RDS instance spend per customer)

2) Reduce the RDS storage cost per GB of customer data - we could not find any way to reduce the storage cost. Since the RDS instance is fully managed by AWS, there is no possibility of data compression.

When we compared the total cost based on our usage, the RDS instance cost is just 10-20% and the storage cost is 80-90%. So finally we decided to host our PostgreSQL instance on EC2 with transparent data compression. This is our current configuration and usage:

r5a.xlarge (4vCPU 32GB)

PG Master - Availability Zone 1

PG Slave - Availability Zone 2 (Streaming Replication)

----

8 x 100GB ZFS RAID0 FileSystem with LZ4 Compression (128KB recordsize)

40GB (wal files) ZFS FileSystem with LZ4 Compression (1MB recordsize)

600GB Compressed Data (3.1TB Uncompressed - x5.18 compression ratio)

----

2 x r5a.xlarge - 2 x $104.68 = $209.36

2 x 8 x 100GB - 2 x 8 x $11.40 = $182.40

2 x 40GB - 2 x $4.56 = $9.12

Total EC2 Cost = $400.88

----

If we had to use RDS

db.r5.xlarge Multi-AZ = $834.48

3.5TB Multi-AZ = $917.00

Total RDS Cost = $1751.48

----

So by using EC2 instead of RDS, we have reduced our cost from $1751.48/month to $400.88/month (greater than a x4 reduction). Best of all, we have purchased a 3-Year No Upfront Savings Plan, which further reduced our EC2 instance cost to $105 (a 50% reduction). RDS doesn't have a No Upfront Reserved plan for 3 years, and for 1 Year No Upfront we get just a 32% instance cost reduction.

Apart from the direct EC2 instance cost and storage size reduction, the major benefits we indirectly got by migrating to EC2 instances are:

- IO throughput increased by x5 due to ZFS LZ4 compression. Importing a 3GB compressed GZIP file would take around 2.5 - 3 hrs on RDS, whereas on EC2 it takes just 30 - 45 min.

- Existing Savings Plan discount automatically applied (50% reduction)

- Ability to migrate to AMD based Instances (r5a.xlarge) - 50% reduction compared to Intel based Instances (r5.xlarge) in Mumbai region. It'll take ages before AMD based Instances are available in RDS.

- Ridiculous EBS Burst Credits by using 8 Volumes in RAID0. Base IOPS - 8 x 300 (100GB) = 2400 IOPS. Burst IOPS Credits - 8 x 3000 = 24000 IOPS :D

- PG Master is used for backup sync write operations and PG Slave is used for reporting and analytics. RDS requires a Read Replica to be created from the already existing db.r5.xlarge Multi-AZ instance for read operations, which would further increase the estimated RDS cost by x1.5

- Planning to migrate to AWS Graviton 2 ARM64 instances. The AMD (r5a.xlarge) and Intel (r5.xlarge) based instances have hyperthreading enabled, which leaves us with just 2 real cores and 4 threads. But even the basic Graviton 2 instance, r6g.large, has 2 real cores. So I'm kinda estimating that the basic r6g.large (2vCPU 16GB) instance itself can support up to 1000 active customers (a further 50% EC2 instance cost reduction).


I assume you're using the ZFS filesystem for the transparent compression - what's your opinion/experience with using ZFS on cloud storage? I mean, the EBS disks are already redundantly stored by AWS, and the COW mechanism could lead to a lot of write amplification, negatively impacting the network-attached storage?

(I don't use EBS in my day job, but Azure's disk offerings don't really offer adequate performance when used with any filesystem other than EXT4, in my experience)


There is a small write amplification due to the pg page size being 8KB and the ZFS recordsize being 128KB, but considering the bulk-write nature there is not much impact. Also, the max IO size of EBS is 256KB, which helps us optimally utilise the available IOPS even if there is write amplification. Reducing the ZFS recordsize significantly reduces the compression ratio, so we kept it as it is. For an OLTP application, reducing the recordsize would improve latency, but for these bulk operations it's the most suitable.

I haven't used Azure, but based on my raw benchmarks and real-time usage, I would say the type of filesystem doesn't affect the performance of EBS volumes. Our peak IO usage is 1200 IOPS and 20 MB/s. I would say a similar RDS configuration would have x4 - x10 write amplification due to the data not being compressed.


> we have reduced our cost from $1751.48/month to $400.88/month

It doesn't look like a huge win given how much complexity you added, while RDS manages it for you.


I guess $1751 / month isn't a big deal in developed countries, but in India this is a lot. Also, if I include the RDS Read Replica in the total RDS cost, it comes to $2627 / month (~ INR 1.93 lakhs / month). Here that is equivalent to 5 junior developer salaries / month.

Based on our current customer base for the backup and restore solution with add-ons, AWS spend is about 12% - 16% of the total product revenue. Our company has a 5000+ active customer base where the core product offering is different; the backup and restore solution is itself an add-on. If we had priced it considerably higher due to the larger RDS spend, it wouldn't be surprising if we got only 10% of the current add-on customers (700+).

Plus I would say this isn't much complexity - everything is automated using Terraform and Ansible: pg installation, streaming replication setup, ZFS RAID0 setup, etc... Not a single command is executed manually on our EC2 instances. The only benefit we would get from RDS is the failover capability with minimal downtime, and for that, a x4-x5 increase in RDS cost isn't worth it for us.

We still use RDS for OLTP and service databases, but it isn't affordable for the backup offering.


Aren't you getting killed on inter-AZ bandwidth costs with streaming replication?

This was our experience when we tried it.


Our total inter-AZ bandwidth usage is about 2TB - 3TB / month, which comes to just around $20 - $30 / month. We are planning to introduce SSL with compression between the master-slave setup to further reduce inter-AZ bandwidth, but this hasn't been taken up yet.


Wow, thanks for your detailed comment. Learned a few things in here for our own EC2 deployment of PG.


>> The document data model and MongoDB Query Language, giving developers the fastest way to innovate in building transactional, operational, and analytical applications.

I'm pretty skeptical about this. It seems they kinda advertise it as a traditional OLTP database.


I agree. That sounds like a marketing-speak translation of developers saying they're familiar with it and/or have an existing code base of examples to build off of.

