Launch HN: Milk Video (YC W21) – Edit online event recordings quickly
154 points by rememberlenny on Feb 24, 2021 | 52 comments
Hello HN gang! Lenny and Ross here, working on Milk Video (https://milkvideo.com/), a browser-based tool to turn long videos into watchable clips. We speed up the workflow for marketers editing long, boring Zoom recordings and webinars into visually engaging clips with quality templates and styled captions.

Ross and I met 8 years ago in Shanghai, where we worked at an education startup and organized tech and design events. When we realized Covid was creating a tsunami of webinars, Ross noticed the growing cost of editing all the new content as B2B companies replaced their in-person marketing channels with online events.

Most registrants to online events don't end up attending. They may be interested in the content, but they won’t take time to watch an entire webinar recording. Webinar content has a short shelf life unless it is reworked into a friendlier format. Doing that with traditional video editing software is cumbersome, so it often doesn’t happen. It’s time-intensive to review videos for key moments, ask designers to create appropriate graphics and captions, and receive final approval from managers.

We started out contacting companies organizing webinars, and learned they were stuck in a vicious cycle of constantly having to focus on the next upcoming event. We started manually editing videos for them to better understand how the most engaging bits could be reworked. Doing this manually revealed a glaring problem: the technology interfacing with video has changed dramatically, but the editing software hasn’t. Video editing software is designed for filmmakers or social media, and businesses creating video content have very different needs.

Milk Video uses a transcript-to-video based interface to review long recordings and minimize the mental effort around editing. We transcribe uploaded videos, present you with the content so you can quickly clip the best parts, and allow you to use templates to compose visually interesting layouts with additional assets, like logos or static text.

We made a drag-and-drop interface for creating short video clips with styled word-by-word captions. In a world where people often don't have their audio on, the timestamp information on a machine-generated transcript is perfect for creating interesting visual elements, such as captions styled one word at a time. This also makes content accessible by default. And because most webinars or Zoom recordings are visually similar, we have the ability to recommend which video templates might be best suited for their uploaded content in the future.
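
To make the word-by-word caption idea concrete, here is a rough sketch (illustrative only, not our actual implementation) of how word-level timestamps from a speech-to-text result can be turned into per-word caption cues:

    // Illustrative sketch (not Milk Video's actual code): turn word-level
    // timestamps into cues a renderer can highlight one word at a time.
    interface Word {
      text: string;
      start: number; // milliseconds
      end: number;   // milliseconds
    }

    interface CaptionCue {
      line: string;           // full caption line shown on screen
      startMs: number;        // when this word becomes "active"
      endMs: number;
      highlightIndex: number; // which word in the line to style
    }

    function wordHighlightCues(words: Word[], wordsPerLine = 6): CaptionCue[] {
      const cues: CaptionCue[] = [];
      for (let i = 0; i < words.length; i += wordsPerLine) {
        const group = words.slice(i, i + wordsPerLine);
        const line = group.map((w) => w.text).join(" ");
        group.forEach((w, idx) =>
          cues.push({ line, startMs: w.start, endMs: w.end, highlightIndex: idx })
        );
      }
      return cues;
    }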

The frontend is a React app based on Redux Toolkit and Recoil.js. Our performant transcript interface is made possible by Slate.js. Our backend is a Ruby on Rails app that depends on a non-trivial number of serverless functions hosted on Google Cloud and AWS. Our speech-to-text provider is AssemblyAI, which we found to be cheaper, faster, and better than Google and Amazon.

We would love your feedback on the tool. We are spending a lot of time working directly with our first users, and would appreciate all of the input we can get. I’m also happy to go into detail around how any specific parts work! We’ll be in the comments and are eager to hear all your thoughts!




We use both Milk and Descript at Macro for our podcast workflows.

Milk is really good at creating short (for us it's ~1min), impactful Twitter + LinkedIn marketing collaterals based on each podcast guest (we can add our logo, the guest's background, the guest's bio + picture, transcript, etc.).

Descript is amazing at editing the entire podcast and making sure we have the overall content needed to publish.

Can't imagine doing our podcast without Milk + Descript.


Thank you! Thank you! Thank you!

Per the podcast point, last night (phew) we launched the ability to upload audio files and work with them.

We are focused on webinars/Zoom recordings, but now you can upload a podcast and create a promo tile.

These are some other links that were made in Milk Video:

- https://twitter.com/m_cieplinski/status/1356331228954292224

- https://twitter.com/rabois/status/1310644068326629376

- https://twitter.com/rememberlenny/status/1339618249575714816


Pretty cool to see a product I use on HN.

I've been using Milk to promote my podcast ( https://www.allschemesconsidered.com ) on LinkedIn. Here's some sample videos:

https://www.linkedin.com/posts/mraakashshah_cloud-aws-gcp-ug...

https://www.linkedin.com/posts/mraakashshah_heres-a-teaser-f...

All the social media companies are pushing video really hard in their algos (and stories). Recording and editing a podcast is super fun for me, but the audience-building part was a drag. Milk lets me make professional quality highlights super easily. Ironically, the viewership on these highlight videos is 100x the listenership on the podcast.

Anyway, I like the software.

Disclaimer: Ross found and pitched me on Milk, but I've been a happy user ever since.


It's been great working with you, Aakash! Thanks for the support.


I have some constructive feedback. The main video on your homepage confused me immensely. It was a guy from brex talking about retool, but they weren't very eloquent (no offense) and it rambled on forever. I thought it was going to be a before and after thing showing how bad it was before and how amazing it was after...but it was just a before? Or was that after milk? I also thought for a minute I was on the wrong video, hearing about retool.

I use brex and retool and descript and canva so I think I would be a target user but just didn't get it at all from the video.

My feedback would be to make the company "Acme" and show before or after...but definitely for a video product that first video should be a really good video.

My 2 cents.


Extremely valuable and noted! Thank you for taking the time to write this out and share.


Congrats y'all! Looks interesting. One use case I believe you're talking about that I'd love to see even more fleshed out is the tool to find the interesting clips for me? I'd love if after I just did this upload, you were like: "here's 3 clips that seem interesting to our AI." Maybe even a summarization algorithm would suffice at finding the most relevant chunks in transcript? Or maybe something more fancy if it's doable. But I'd love a best effort stab at the clips so I don't even have to think about finding them :)


I love this idea in concept.

The point of the transcript is to lower the bar on who can review video content. One layer on top of that is moving the technical work (cutting video) into an editorial role (picking the parts that are recommended).

We aren't trying to position ourselves to do the clip picking/recommendation now, but we have already done some machine-learning-based analysis to make this easier. We have a video processing task that looks for "scene changes" based on image threshold changes, so the metadata associated with when a new person joins, a slide changes, etc. is present.
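
For anyone curious, the general idea can be sketched with ffmpeg's scene-score filter (a simplified example, not necessarily our exact pipeline):

    // Simplified sketch: find rough "scene change" timestamps with ffmpeg's
    // scene-score filter; showinfo logs pts_time for each selected frame.
    import { execFile } from "node:child_process";

    function detectSceneChanges(path: string, threshold = 0.3): Promise<number[]> {
      const args = [
        "-i", path,
        "-vf", `select='gt(scene,${threshold})',showinfo`,
        "-f", "null", "-",
      ];
      return new Promise((resolve, reject) => {
        execFile("ffmpeg", args, { maxBuffer: 64 * 1024 * 1024 }, (err, _stdout, stderr) => {
          if (err) return reject(err);
          const times = [...stderr.matchAll(/pts_time:([\d.]+)/g)].map((m) => parseFloat(m[1]));
          resolve(times); // seconds into the video where the scene likely changed
        });
      });
    }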

The original thinking here is that we can recommend "templates" that correspond to certain kinds of video (i.e. multiple speakers vs. a single presenter).


So it wasn't by choice, but most of my grad school business classes are now virtual. Most have a team presentation aspect where you collaborate on a PowerPoint, charts, and spreadsheets, then "present" virtually by tacking on audio and video, and then upload the whole mess after a lot of stress and panic.

Have you considered creating a free "edu" edition that would generate a mashup of uploaded videos, ppt and media, watermarked or whatever? The students would then convert to a paid model for work? I would use it


Yes! Non-profit and academic use case is currently free.

It's worth noting that the use case you are mentioning is interesting, but it's not exactly the workflow we are building for.

We are focused on improving the workflow around a marketer/demand generation/sales person at a company who needs to use existing content to get attention on social media/blog/email.

One of the ways we are thinking about this is around increasing the "shelf-life" of quality content, which doesn't get discovered because it's long. This is very much a problem that appeared when businesses became content creators, as opposed to individuals.

That being said, if you have any questions, please email either Ross (ross@milkvideo.com) or me (lenny@milkvideo.com) and we will be happy to help!


Gotcha, I respect that you have a focused model.

I would also think that local government would be a good target audience - we have sat through some endless school board Zooms (due to COVID). I am sure "shrink and share" would work there as well.


At a high level, there are two kinds of video processing work: synthesizing/organizing and creation.

There are a number of new companies appearing in the synthesizing/organizing space, because there was an explosion in the quantity of video being produced, and the software to parse it exists (speech-to-text, object recognition, better-search, etc).

At least for companies, people intuitively want to share all the "best parts", but we found that people won't actually watch them. Instead, highlighting one specific piece that looks visually engaging is a better way to capture attention and then drive engagement.

Our thought process here is that there is only so much organizing/synthesizing most people will do, but there is an endless amount of creation someone can do.

Re: government - For what it's worth, the US Chamber of Commerce is one of our paying customers.


It would be sexier to synthesize the cyber truck unveiling over the local planning board, but then again Buffett didn't shy away from investing in garbage pickup


"The students would then conversion to a paid model for work?"

Honestly that almost never works for startups.

It never happens, or at least it never happens fast enough. I'm fairly sure that large companies like Microsoft, which popularized student licenses, are benefiting from this, but through very long cycles of adoption.

At the end of the day, if you get discounted Office for 4 or 5 years, you will very likely continue using it once you lose the discount. The secret sauce there is distribution. Office is so massively popular that you don't even need to advocate for it at work. It's the default choice.


I work in IT for engineering.

CAD software for engineering schools is crazy competitive. If you are graduating and know a CAD/simulation suite really well, it helps you get a job (and therefore it will influence where you apply, which influences the employers, etc.)


I totally believe this. But it's highly contextual to your job/area. CAD software is very specialized so it definitely makes sense that you can influence or take job decisions based on your knowledge of a particular suite.

What I believe is that if you're a startup building a generic business tool and you target students with the hope they will be your advocates or future buyers, you should adjust your expectations.

I believe that giving student discounts as a long-term adoption strategy makes sense. Unfortunately, startups have to fight for adoption in the short term. If you can afford to give student discounts you should do it. It's a good thing to do. But don't count on this as a way to drive adoption.


You mentioned your backend being a Rails app + serverless functions, what's the benefit of doing both there? What does your video processing infra look like, is that in one of those systems?


I could go pretty deep here, so let me know if I should elaborate on anything.

The backend is a Ruby on Rails application that serves the frontend app's API. It interfaces with the user tables and database, and handles all the "state" of the app.

The serverless stuff has changed over the months, but primarily it handles the stuff I don't want Rails to handle: file uploads, video processing and transcription.

First, huge props to the Mux (https://mux.com) team and product. I cannot express how easy it has been to build video (and audio) products. File uploads go to AWS/GCP (depending on a few things) and then trigger a serverless callback to Mux.com. Mux was the fastest way we found to turn an arbitrary video file (mp4/mov/etc) into HLS format for quick streaming.
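
As a rough sketch (simplified from what our callback actually does, with placeholder env vars and URLs), the "upload finished -> create a Mux asset" step is essentially one API call:

    // Simplified sketch: create a Mux asset from an already-uploaded file.
    // MUX_TOKEN_ID / MUX_TOKEN_SECRET and the file URL are placeholders.
    async function createMuxAsset(fileUrl: string): Promise<string> {
      const auth = Buffer.from(
        `${process.env.MUX_TOKEN_ID}:${process.env.MUX_TOKEN_SECRET}`
      ).toString("base64");

      const res = await fetch("https://api.mux.com/video/v1/assets", {
        method: "POST",
        headers: { Authorization: `Basic ${auth}`, "Content-Type": "application/json" },
        body: JSON.stringify({
          input: fileUrl,              // signed S3/GCS URL of the uploaded video
          playback_policy: ["public"], // returns a playback ID for HLS streaming
        }),
      });
      if (!res.ok) throw new Error(`Mux asset creation failed: ${res.status}`);
      const { data } = await res.json();
      return data.playback_ids[0].id; // stream via https://stream.mux.com/<id>.m3u8
    }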

Then once the video is uploaded, we have another serverless callback that sends the video for transcription using Assembly AI (https://assemblyai.com). There are a ton of transcription services and they vary dramatically in quality depending on the media content. I believe Google/Amazon services were largely built around the need to process phone calls, so unless you pay for their "enhanced" models, the quality is surprisingly bad (and surprisingly slow).

I *highly highly* recommend Mux and Assembly AI if you are doing any video/transcription-based work.

To get an immediate update to the end user, we actually process two transcript requests - one that is just the first 60 seconds, and then the remainder of the video. This lets us render a preview transcript in the first 15-20 seconds.
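
Roughly, that looks like submitting two jobs for the same file (a simplified sketch; the webhook URL is made up):

    // Simplified sketch: two AssemblyAI jobs for one file -- the first 60
    // seconds for a fast preview, then the remainder. audio_start_from and
    // audio_end_at are milliseconds; the webhook URL is a made-up example.
    const ASSEMBLYAI_URL = "https://api.assemblyai.com/v2/transcript";

    async function requestTranscript(audioUrl: string, opts: Record<string, unknown>) {
      const res = await fetch(ASSEMBLYAI_URL, {
        method: "POST",
        headers: {
          authorization: process.env.ASSEMBLYAI_API_KEY ?? "",
          "content-type": "application/json",
        },
        body: JSON.stringify({
          audio_url: audioUrl,
          webhook_url: "https://example.com/hooks/transcript-done", // hypothetical callback
          ...opts,
        }),
      });
      return res.json(); // contains the transcript id; completion arrives via webhook
    }

    async function transcribeWithPreview(audioUrl: string) {
      const [preview, remainder] = await Promise.all([
        requestTranscript(audioUrl, { audio_end_at: 60_000 }),     // first 60s
        requestTranscript(audioUrl, { audio_start_from: 60_000 }), // the rest
      ]);
      return { previewId: preview.id, remainderId: remainder.id };
    }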

We also have a serverless pipeline for generating the videos, but I won't go into that unless you're interested. In short, a serverless function kicks off a Docker instance running on ECS.
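
If it helps, here is a bare-bones sketch of that kickoff using the AWS SDK (the cluster, task definition, and container names are hypothetical, not our real ones):

    // Bare-bones sketch: a serverless handler launching a Fargate task that
    // renders one video job. Cluster/task/container names are hypothetical.
    import { ECSClient, RunTaskCommand } from "@aws-sdk/client-ecs";

    const ecs = new ECSClient({});

    export async function startRenderTask(jobId: string) {
      await ecs.send(
        new RunTaskCommand({
          cluster: "video-render-cluster",
          taskDefinition: "video-render-task",
          launchType: "FARGATE",
          networkConfiguration: {
            awsvpcConfiguration: {
              subnets: ["subnet-0123456789abcdef0"],
              assignPublicIp: "ENABLED",
            },
          },
          overrides: {
            containerOverrides: [
              { name: "renderer", environment: [{ name: "JOB_ID", value: jobId }] },
            ],
          },
        })
      );
    }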

The requests to the serverless apps (mostly Node) have a callback to the Rails app, which then updates the end user state using websockets (which are very easy to use with Rails' ActionCable).


Interested to hear more about your pipeline and infrastructure for processing and delivering video. I'm working with processing short videos at the moment for my current startup, though I didn't use Mux (I figured it was a core competency we needed to develop). It's just a queue using FFMPEG to convert from MP4 to HLS.
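
For anyone curious, a single-rendition version of that conversion looks roughly like this (a simplified sketch; a real pipeline would usually add multiple bitrates and error handling):

    // Simplified sketch of the MP4 -> HLS step (single rendition).
    import { execFile } from "node:child_process";
    import { promisify } from "node:util";

    const run = promisify(execFile);

    export async function mp4ToHls(input: string, outDir: string) {
      await run("ffmpeg", [
        "-i", input,
        "-c:v", "libx264", "-c:a", "aac",
        "-hls_time", "6",               // ~6-second segments
        "-hls_playlist_type", "vod",
        "-hls_segment_filename", `${outDir}/segment_%03d.ts`,
        `${outDir}/index.m3u8`,
      ]);
    }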


I have horror stories about FFMPEG that I won't go into here.

In short, I'm just one person building this - so I'm sticking to what I know best. I want video to "just work" without having to worry about some video format/extension/containers that I have no idea about.

There are a number of video processing services, but Mux really is the best. The API is simple. They have a ton of really nice helper functions that I use a lot (like timestamped thumbnails, preview gifs, and VTT storyboard generation), which I could easily spend a few days building and then countless hours maintaining.
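
To give a sense of what I mean, those helpers are mostly just URL patterns keyed off the playback ID (placeholder ID below):

    // The Mux image helpers are URL patterns keyed off a playback ID
    // (placeholder below), so generating them is mostly string formatting.
    const playbackId = "YOUR_PLAYBACK_ID";

    const thumbnailAt35s = `https://image.mux.com/${playbackId}/thumbnail.jpg?time=35`;
    const previewGif = `https://image.mux.com/${playbackId}/animated.gif?start=10&end=15`;
    const storyboardVtt = `https://image.mux.com/${playbackId}/storyboard.vtt`; // scrubber previews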

I don't doubt that building video infra is a good idea, but just as I'm not about to train my own speech-to-text model, I'm not going to build out video infra.

At least for me, I'm more worried about the end user experience, and the more I can focus on that, the better the overall product will be.


I'm in the same boat - I'm the one building it, and my focus is on the user experience, but the business model won't tolerate the amount of video on someone else's service. :(

I haven't had FFMPEG nightmares yet, but I've done relatively little with it so far.

Any video apps I should look out for? I'm also pursuing a content creation angle that I've yet to spec out, so I'm always curious as to how others have approached the problem.


Hey! Jon from Mux here. Curious about this comment:

> the business model won't tolerate the amount of video on someone else's service

Does that mean you aren't using S3/EC2 or the like, or is there something about how we've built our cloud platform that doesn't work for your business model? We've designed Mux to be a low-level primitive for video, like Twilio is for SMS, so I'd be interested if we're doing something that makes this harder for you.


Hey Jon! I've looked at Mux (mainly the careers page), and it's a great platform. It would be a great fit technologically, but I'm not sure that my business model (which is tentative admittedly) would cover infra costs for processing, hosting and consuming the amount of video I'm eventually expecting, as I'm running on a shoestring budget at the moment.

Plus, it's a good chance for me to learn the ins and outs of video. It's not reflective of the quality of your platform, just a choice I've made early in the piece for curiosity's sake.


Cool - thanks for the reply!

If some credits or a startup discount are ever helpful, we have a startup program and can help out there. Let us know.


Thanks, I'll have a look. If it appears as though it's not worth the effort to maintain my own video infra, you're the first choice.


And this is why Mux is awesome.


If you want, I’d love to chat about it!

Email: lenny@milkvideo.com


> The frontend is a React app based on Redux Toolkit and Recoil.js

Hey, great to see Redux Toolkit being used in the wild! Would love to hear your thoughts on using RTK, and I'm particularly curious about the combination of RTK + Recoil together. What use cases are you using each of those for?

Please let me know if you've got any suggestions for improving RTK! I'm usually in the Reactiflux Discord evenings US time, and always happy to chat.


Redux Toolkit is INCREDIBLE. I have the utmost respect for the developers working on it. I've worked on 6 large Redux-based applications, and they were all implemented incredibly differently. This has been the first time I really love the implementation.

I am using RTK for the overall app state and Recoil for the on-page state. I make API requests and store the results in the redux store, but the hooks/prop passing is too slow for handling video players/transcript manipulation.

I initially had everything in RTK, but noticed the render cycle for dispatching to and listening to the store was creating unusual issues.

With Recoil, I'm able to represent the video player's current time state and then listen to it in other parts of the app. Similarly, when the transcript updates the time, the React-Context-based API performs better than hooks/props.
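
A stripped-down sketch of that split, with illustrative names rather than our actual components:

    // Illustrative sketch: Redux Toolkit keeps app/server state, while a
    // Recoil atom carries the fast-changing player time so only the
    // components subscribed to it re-render.
    import React from "react";
    import { atom, useRecoilValue, useSetRecoilState } from "recoil";

    const playerTimeState = atom<number>({
      key: "playerTime", // unique atom key
      default: 0,        // seconds
    });

    // The video player writes the time on every timeupdate event.
    function VideoPlayer({ src }: { src: string }) {
      const setTime = useSetRecoilState(playerTimeState);
      return (
        <video
          src={src}
          controls
          onTimeUpdate={(e) => setTime(e.currentTarget.currentTime)}
        />
      );
    }

    // The transcript reads the time without touching the Redux store at all.
    function CurrentTimeLabel() {
      const time = useRecoilValue(playerTimeState);
      return <span>{time.toFixed(1)}s</span>;
    }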

Happy to dig more into this. I'll reach out via Twitter too!


Congrats on the launch! I plan on running/promoting webinars a lot more this year and will check this out. Really interesting what you shared about your backend/infra. Could you share more about how you chose your transcription/video APIs vs Google/AWS?


Regarding general infrastructure, we are running on Heroku, AWS, and GCP for various things.

I touched on the video APIs above, but regarding transcription - I have a lot of thoughts.

We chose to work with AssemblyAI (https://assemblyai.com) after trying AWS's transcript service, RevAI and Google's Speech-to-text API.

First, we started by doing manual transcripts through Rev, but the cost was unmanageable at scale. We were really happy with the quality, but couldn't charge $100 per video, so we needed a cost-effective automated solution.

I then found an old blog post from Descript's co-founder Andrew Mason, who talked about which speech-to-text API they decided to use. The blog post is old, so the metrics used are probably outdated, but I was impressed they decided to use Google's API.

We implemented the GCP option, and I was shocked by how slow and expensive it was. For one, the quality wasn't that great, and to use the lower-cost option (audio-only), you need to do some additional FFMPEG-based transcoding, which is very error-prone. Because we receive a range of video types from users, it was causing more problems than it was worth dealing with. Also, the time lost made the cost savings irrelevant.
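
Concretely, that extra step amounts to pulling a mono 16 kHz audio track out of whatever container the user uploaded before sending it to the API (a rough sketch, not our exact command):

    // Rough sketch of the extra transcoding step: extract a mono 16 kHz audio
    // track from whatever container the user uploaded before sending it off.
    import { execFile } from "node:child_process";
    import { promisify } from "node:util";

    const run = promisify(execFile);

    export async function extractAudioForStt(videoPath: string, outPath: string) {
      await run("ffmpeg", [
        "-i", videoPath,
        "-vn",          // drop the video stream
        "-ac", "1",     // mono
        "-ar", "16000", // 16 kHz sample rate
        outPath,        // e.g. audio.flac
      ]);
    }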

Enter - AssemblyAI.

I did some research around what other companies are using today, and saw they have great ratings on G2 (https://www.g2.com/products/assemblyai-speech-to-text-api/re...). The CEO jumped on a call on a Sunday, when I was trying to improve our transcript processing time, and after testing the API, I was shocked that the transcript quality was closer to the human-done transcripts I was getting, at a cost significantly cheaper than the Google option.

Conclusion - we needed speed, quality, low-cost and support. AssemblyAI won on all these fronts.


I’m interested to try this as my day job involves editing 2-3 webinars and 2-5 Zoom interviews per week.

Currently I use Premiere Pro with some templates I’ve created.

I haven’t found any of the transcript based editing tools to be robust enough. Descript is buggy and slow on my MacBook Pro (which has zero issues running Prem Pro & After Effects). Transcriptive has issues where it gets out of sync with the original.

What would be really helpful is detecting speaker changes, long pauses, the start and end of slide presentations (or switching to different decks), and transcription if it actually works smoothly and stays in sync across edits.


Would you email me, and we can chat? lenny@milkvideo.com

The inspiration behind making this is to replace Premiere Pro, so I'd love to understand your MVP to solve your problem.

I do think we can be very performant for you, given that all the processing is done on the cloud, and you are only ever interfacing with JavaScript/video tags. That being said, there is work to do!

We don't have speaker diarization right now, but it's just a feature flag for us. Also, the start/end detection you mention is something we don't have active yet, but it's planned for next week.


We tried out the demo last summer (which was a bit of manu-mation before the product was fully built out -- kudos to Lenny & crew for doing things that don't scale) and had a great experience! Here's an example of one of the videos we got via Milk: https://www.youtube.com/watch?v=O4jOqVqyAo8

Excited to circle back and try it out again now that it's software instead of humans doing the heavy lifting behind the scenes!


Thank you!

Context here - before we started working on the current software, we planned to do an opaque marketplace for post-production video work. To vet the idea, we reached out to companies with webinars and manually edited their videos.

In the process, one finding was that it's hard to make styled word-by-word highlighted captions. This resulted in a small utility app that turned SRT and VTT caption files into dynamically sized/styled caption videos, which later evolved into today's product.


I’m using Type Studio (https://typestudio.co). What is the difference?


We aren't focused on being a transcript editing tool. You can upload a video, get a transcript, and edit it in Milk Video, but that's not our focus.

Our focus is helping make a visual clip that is engaging, based on the transcript information.

Here are some examples I posted above:

- https://twitter.com/rememberlenny/status/1339618249575714816...

- https://twitter.com/rabois/status/1310644068326629376?s=20

- https://twitter.com/m_cieplinski/status/1356331228954292224?...


Assembly AI is at $0.50 per hour, which is not extremely reasonable these days. With open source models like Facebook RASR or Vosk you can get a self-hosted solution with even better accuracy at a cost of $0.05 per hour, 10 times cheaper.

Once any of your customers come to require a private video setup, you'll end up moving to a self-hosted solution anyway.


I'll say this clearly: Assembly AI is hands down the best speech-to-text transcription service on price, quality, speed, and support. Hands down. It more than pays for itself time and time again.

If we were free, then I might have this concern, but our average customer value is far beyond the cost to transcribe.

Actually, this is insanely cheap, especially given how much they are regularly improving. Amazon costs over $1.25/hour, RevAI costs over $2/hour, and Google Speech-to-Text is over $2.15/hour.

We have a shared Slack channel with their team and I can't convey how incredible they have been. Literally the moment we have a question, we get instant responses.

Also, our users will not tolerate poor transcripts. Every word that needs correcting is time/money lost for them, so our goal is to give them the highest quality transcripts.

We want to focus on our key value props, and transcription is not one of them. We focus on user experience, design options, and speed.


In the mobile view of https://milk.video/pricing, all I see are numbers/unlimited. I guess the row titles got scrolled out of view.


Thanks! Will fix this. The app definitely doesn't work on a small screen, but the homepage/pricing should.

FWIW - they are Webflow templates, but do a great job at making it easy to manage.


Just checked this out. I prefer Descript. Better editing, as well as overdub.


Thanks for signing up and trying it out.

We actually drive people to use Descript for most use cases that aren't relevant to us.

Think Photoshop vs Canva.

Since speech-to-text APIs have become really good (props to companies like AssemblyAI, https://www.assemblyai.com/), transcript-based interfaces are going to become much more common.

Our product goal is to solve the use case around making the visual output, when editing/correction isn't the goal. That being said, the editor should be performant and work well, so lots to improve there.

As an aside, there are a few evolving open-source libraries that consume the output of these STT services (https://github.com/bbc/react-transcript-editor) and make turnkey transcript interfaces.

The newest/most developed one I like is based on Slate, and made by a really amazing engineer at the Wall Street Journal named Pietro.

Link: https://github.com/pietrop/slate-transcript-editor


How does Milk compare to Descript?

What are the advantages/disadvantages?


Thanks for asking this. This is like comparing Photoshop and Canva.

Descript is hands down the leader in any transcript based video/audio editing. They set the standard for detailed editing and magically manipulating audio/video.

We are focused on the workflow around creating something visually appealing that uses a Zoom recording. Specifically, the transcript-based interface is only for speeding up the review process; our main focus is on visual templates to drop the video/captions into.

One way to think of it is that we took the Descript Audiogram feature and built out a workflow that creates a wider variety of outputs applicable to marketing/sales-related needs.

We are solving the problem where you need to quickly take a video recording and make something you/your team can proudly share on social media that reflects your company's brand guidelines/visual aesthetic.


I've not used the product, but I know the founders, who are both extremely talented. Super excited to know about this rocket during its early days!


Hey Lenny, congrats on the launch!!!


Thank you!


Go, Lenny!


Thank you Anne!!


Hey Lenny! Congrats on the launch, so excited to see your product come together.


Thank you Chris for your advice and support! The partnership you and Luke have is an inspiration for Ross and me.



