Launch HN: Milk Video (YC W21) – Edit online event recordings quickly
154 points by rememberlenny on Feb 24, 2021 | 52 comments
Hello HN gang! Lenny and Ross here, working on Milk Video (https://milkvideo.com/), a browser-based tool to turn long videos into watchable clips. We speed up the workflow for marketers editing long, boring Zoom recordings and webinars into visually engaging clips with quality templates and styled captions.

Ross and I met 8 years ago in Shanghai, where we worked at an education startup and organized tech and design events. When we realized Covid was creating a tsunami of webinars, Ross noticed the growing cost of editing all the new content as B2B companies replaced their in-person marketing channels with online events.

Most registrants to online events don't end up attending. They may be interested in the content, but they won’t take time to watch an entire webinar recording. Webinar content has a short shelf life unless it is reworked into a friendlier format. Doing that with traditional video editing software is cumbersome, so it often doesn’t happen. It’s time-intensive to review videos for key moments, ask designers to create appropriate graphics and captions, and receive final approval from managers.

We started out contacting companies organizing webinars, and learned they were stuck in a vicious cycle of constantly having to focus on the next upcoming event. We started manually editing videos for them to better understand how the most engaging bits could be reworked. Doing this manually revealed a glaring problem: the technology interfacing with video has changed dramatically, but the editing software hasn’t. Video editing software is designed for filmmakers or social media, and businesses creating video content have very different needs.

Milk Video uses a transcript-to-video based interface to review long recordings and minimize the mental effort around editing. We transcribe uploaded videos, present you with the content so you can quickly clip the best parts, and allow you to use templates to compose visually interesting layouts with additional assets, like logos or static text.

We made a drag-and-drop interface for creating short video clips with styled word-by-word captions. In a world where people often don't have their audio on, the timestamp information on a machine-generated transcript is perfect for creating interesting visual elements, such as captions styled one word at a time. This also makes content accessible by default. And because most webinars or Zoom recordings are visually similar, we have the ability to recommend which video templates might be best suited for their uploaded content in the future.
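
To make the word-by-word caption idea concrete, here is a rough sketch (illustrative only, not our actual implementation) of how word-level timestamps from a speech-to-text result can be turned into per-word caption cues:

    // Illustrative sketch (not Milk Video's actual code): turn word-level
    // timestamps into cues a renderer can highlight one word at a time.
    interface Word {
      text: string;
      start: number; // milliseconds
      end: number;   // milliseconds
    }

    interface CaptionCue {
      line: string;           // full caption line shown on screen
      startMs: number;        // when this word becomes "active"
      endMs: number;
      highlightIndex: number; // which word in the line to style
    }

    function wordHighlightCues(words: Word[], wordsPerLine = 6): CaptionCue[] {
      const cues: CaptionCue[] = [];
      for (let i = 0; i < words.length; i += wordsPerLine) {
        const group = words.slice(i, i + wordsPerLine);
        const line = group.map((w) => w.text).join(" ");
        group.forEach((w, idx) =>
          cues.push({ line, startMs: w.start, endMs: w.end, highlightIndex: idx })
        );
      }
      return cues;
    }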

The frontend is a React app based on Redux Toolkit and Recoil.js. Our performant transcript interface is made possible by Slate.js. Our backend is a Ruby on Rails app that depends on a non-trivial number of serverless functions hosted on Google Cloud and AWS. Our speech-to-text provider is AssemblyAI, which we found to be cheaper, faster, and better than Google and Amazon.

We would love your feedback on the tool. We are spending a lot of time working directly with our first users, and would appreciate all of the input we can get. I’m also happy to go into detail around how any specific parts work! We’ll be in the comments and are eager to hear all your thoughts!




We use both Milk and Descript at Macro for our podcast workflows.

Milk is really good at creating short (for us it's ~1min), impactful Twitter + LinkedIn marketing collaterals based on each podcast guest (we can add our logo, the guest's background, the guest's bio + picture, transcript, etc.).

Descript is amazing at editing the entire podcast and making sure we have the overall content needed to publish.

Can't imagine doing our podcast without Milk + Descript.


Thank you! Thank you! Thank you!

Per the podcast point, last night (phew) we launched the ability to upload audio files and work with them.

We are focused on webinars/Zoom recordings, but now you can upload a podcast and create a promo tile.

These are some other links that were made in Milk Video:

- https://twitter.com/m_cieplinski/status/1356331228954292224

- https://twitter.com/rabois/status/1310644068326629376

- https://twitter.com/rememberlenny/status/1339618249575714816


Pretty cool to see a product I use on HN.

I've been using Milk to promote my podcast ( https://www.allschemesconsidered.com ) on LinkedIn. Here's some sample videos:

https://www.linkedin.com/posts/mraakashshah_cloud-aws-gcp-ug...

https://www.linkedin.com/posts/mraakashshah_heres-a-teaser-f...

All the social media companies are pushing video really hard in their algos (and stories). Recording and editing a podcast is super fun for me, but the audience-building part was a drag. Milk lets me make professional quality highlights super easily. Ironically, the viewership on these highlight videos is 100x the listenership on the podcast.

Anyway, I like the software.

Disclaimer: Ross found and pitched me on Milk, but I've been a happy user ever since.


It's been great working with you, Aakash! Thanks for the support.


I have some constructive feedback. The main video on your homepage confused me immensely. It was a guy from brex talking about retool, but they weren't very eloquent (no offense) and it rambled on forever. I thought it was going to be a before and after thing showing how bad it was before and how amazing it was after...but it was just a before? Or was that after milk? I also thought for a minute I was on the wrong video, hearing about retool.

I use brex and retool and descript and canva so I think I would be a target user but just didn't get it at all from the video.

My feedback would be to make the company "Acme" and show before or after...but definitely for a video product that first video should be a really good video.

My 2 cents.


Extremely valuable and noted! Thank you for taking the time to write this out and share.


Congrats y'all! Looks interesting. One use case I believe you're talking about that I'd love to see even more fleshed out is the tool to find the interesting clips for me? I'd love if after I just did this upload, you were like: "here's 3 clips that seem interesting to our AI." Maybe even a summarization algorithm would suffice at finding the most relevant chunks in transcript? Or maybe something more fancy if it's doable. But I'd love a best effort stab at the clips so I don't even have to think about finding them :)


I love this idea in concept.

The point of the transcript is to lower the bar on who can review video content. One layer on top of that is moving the technical work (cutting video) into an editorial role (picking the parts that are recommended).

We aren't trying to position ourselves to do the clip picking/recommendation now, but we have already done some machine-learning-based analysis to make this easier. We have a video processing task that looks for "scene changes" based on image threshold changes, so the metadata associated with when a new person joins, a slide changes, etc. is present.
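
For anyone curious, the general idea can be sketched with ffmpeg's scene-score filter (a simplified example, not necessarily our exact pipeline):

    // Simplified sketch: find rough "scene change" timestamps with ffmpeg's
    // scene-score filter; showinfo logs pts_time for each selected frame.
    import { execFile } from "node:child_process";

    function detectSceneChanges(path: string, threshold = 0.3): Promise<number[]> {
      const args = [
        "-i", path,
        "-vf", `select='gt(scene,${threshold})',showinfo`,
        "-f", "null", "-",
      ];
      return new Promise((resolve, reject) => {
        execFile("ffmpeg", args, { maxBuffer: 64 * 1024 * 1024 }, (err, _stdout, stderr) => {
          if (err) return reject(err);
          const times = [...stderr.matchAll(/pts_time:([\d.]+)/g)].map((m) => parseFloat(m[1]));
          resolve(times); // seconds into the video where the scene likely changed
        });
      });
    }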

The original thinking here is that we can recommend "templates" that correspond to certain kinds of video (i.e. multiple speakers vs. a single presenter).


So it wasn't by choice, but most of my grad school business classes are now virtual. Most have a team presentation aspect where you collaborate on a PowerPoint, charts, and spreadsheets, then "present" virtually by tacking on audio and video, and then upload the whole mess after a lot of stress and panic.

Have you considered creating a free "edu" edition that would generate a mashup of uploaded videos, ppt and media, watermarked or whatever? The students would then convert to a paid model for work? I would use it


Yes! Non-profit and academic use case is currently free.

It's worth noting that the use case you are mentioning is interesting, but it's not exactly the workflow we are building for.

We are focused on improving the workflow around a marketer/demand generation/sales person at a company who needs to use existing content to get attention on social media/blog/email.

One of the ways we are thinking about this is around increasing the "shelf-life" of quality content, which doesn't get discovered because it's long. This is very much a problem that appeared when businesses became content creators, as opposed to individuals.

That being said, if you have any questions, please email either Ross (ross@milkvideo.com) or me (lenny@milkvideo.com) and we will be happy to help!


Gotcha, I respect that you have a focused model.

I would also think that local government would be a good target audience - we have sat through some endless school board Zooms (due to COVID). I am sure "shrink and share" would work there as well.


At a high level, there are two kinds of video processing work: synthesizing/organizing and creation.

There are a number of new companies appearing in the synthesizing/organizing space, because there was an explosion in the quantity of video being produced, and the software to parse it exists (speech-to-text, object recognition, better-search, etc).

At least for companies, people intuitively want to share all the "best parts", but we found that people won't actually watch them. Instead, highlighting one specific piece that looks visually engaging is a better way to capture attention and then drive engagement.

Our thought process here is that there is only so much organizing/synthesizing most people will do, but there is an endless amount of creation someone can do.

Re: government - For what it's worth, the US Chamber of Commerce is one of our paying customers.


It would be sexier to synthesize the cyber truck unveiling over the local planning board, but then again Buffett didn't shy away from investing in garbage pickup


"The students would then conversion to a paid model for work?"

Honestly that almost never works for startups.

It never happens, or at least it never happens fast enough. I'm fairly sure that large companies like Microsoft, which popularized student licenses, are benefiting from this, but through very long cycles of adoption.

At the end of the day, if you get discounted Office for 4 or 5 years, you will very likely continue using it once you lose the discount. The secret sauce there is distribution. Office is so massively popular that you don't even need to advocate for it at work. It's the default choice.


I work in IT for engineering.

CAD software for engineering schools is crazy competitive. If you are graduating and know a CAD/simulation suite really well, it helps you get a job (and therefore it will influence where you apply, which influences the employers, etc.)


I totally believe this. But it's highly contextual to your job/area. CAD software is very specialized so it definitely makes sense that you can influence or take job decisions based on your knowledge of a particular suite.

What I believe is that if you're a startup building a generic business tool and you target students with the hope they will be your advocates or future buyers, you should adjust your expectations.

I believe that giving student discounts as a long-term adoption strategy makes sense. Unfortunately, startups have to fight for adoption in the short term. If you can afford to give student discounts you should do it. It's a good thing to do. But don't count on this as a way to drive adoption.


You mentioned your backend being a Rails app + serverless functions, what's the benefit of doing both there? What does your video processing infra look like, is that in one of those systems?


I could go pretty deep here, so let me know if I should elaborate on anything.

The backend is a Ruby on Rails application that serves the frontend app's API. It interfaces with the user tables and database, and handles all the "state" of the app.

The serverless stuff has changed over the months, but primarily it handles the stuff I don't want Rails to handle: file uploads, video processing and transcription.

First, huge props to the Mux (https://mux.com) team and product. I cannot express how easy it has been to build video (and audio) products. File uploads go to AWS/GCP (depending on a few things) and then trigger a serverless callback to Mux.com. Mux was the fastest way we found to turn an arbitrary video file (mp4/mov/etc) into HLS format for quick streaming.
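
As a rough sketch (simplified from what our callback actually does, with placeholder env vars and URLs), the "upload finished -> create a Mux asset" step is essentially one API call:

    // Simplified sketch: create a Mux asset from an already-uploaded file.
    // MUX_TOKEN_ID / MUX_TOKEN_SECRET and the file URL are placeholders.
    async function createMuxAsset(fileUrl: string): Promise<string> {
      const auth = Buffer.from(
        `${process.env.MUX_TOKEN_ID}:${process.env.MUX_TOKEN_SECRET}`
      ).toString("base64");

      const res = await fetch("https://api.mux.com/video/v1/assets", {
        method: "POST",
        headers: { Authorization: `Basic ${auth}`, "Content-Type": "application/json" },
        body: JSON.stringify({
          input: fileUrl,              // signed S3/GCS URL of the uploaded video
          playback_policy: ["public"], // returns a playback ID for HLS streaming
        }),
      });
      if (!res.ok) throw new Error(`Mux asset creation failed: ${res.status}`);
      const { data } = await res.json();
      return data.playback_ids[0].id; // stream via https://stream.mux.com/<id>.m3u8
    }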

Then once the video is uploaded, we have another serverless callback that sends the video for transcription using Assembly AI (https://assemblyai.com). There are a ton of transcription services and they vary dramatically in quality depending on the media content. I believe Google/Amazon services were largely built around the need to process phone calls, so unless you pay for their "enhanced" models, the quality is surprisingly bad (and surprisingly slow).

I *highly highly* recommend Mux and Assembly AI if you are doing any video/transcription-based work.

To get an immediate update to the end user, we actually process two transcript requests - one that is just the first 60 seconds, and then the remainder of the video. This lets us render a preview transcript in the first 15-20 seconds.
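
Roughly, that looks like submitting two jobs for the same file (a simplified sketch; the webhook URL is made up):

    // Simplified sketch: two AssemblyAI jobs for one file -- the first 60
    // seconds for a fast preview, then the remainder. audio_start_from and
    // audio_end_at are milliseconds; the webhook URL is a made-up example.
    const ASSEMBLYAI_URL = "https://api.assemblyai.com/v2/transcript";

    async function requestTranscript(audioUrl: string, opts: Record<string, unknown>) {
      const res = await fetch(ASSEMBLYAI_URL, {
        method: "POST",
        headers: {
          authorization: process.env.ASSEMBLYAI_API_KEY ?? "",
          "content-type": "application/json",
        },
        body: JSON.stringify({
          audio_url: audioUrl,
          webhook_url: "https://example.com/hooks/transcript-done", // hypothetical callback
          ...opts,
        }),
      });
      return res.json(); // contains the transcript id; completion arrives via webhook
    }

    async function transcribeWithPreview(audioUrl: string) {
      const [preview, remainder] = await Promise.all([
        requestTranscript(audioUrl, { audio_end_at: 60_000 }),     // first 60s
        requestTranscript(audioUrl, { audio_start_from: 60_000 }), // the rest
      ]);
      return { previewId: preview.id, remainderId: remainder.id };
    }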

We also have a serverless pipeline for generating the videos, but I won't go into that unless you're interested. In short, a serverless function kicks off a Docker instance running on ECS.
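
If it helps, here is a bare-bones sketch of that kickoff using the AWS SDK (the cluster, task definition, and container names are hypothetical, not our real ones):

    // Bare-bones sketch: a serverless handler launching a Fargate task that
    // renders one video job. Cluster/task/container names are hypothetical.
    import { ECSClient, RunTaskCommand } from "@aws-sdk/client-ecs";

    const ecs = new ECSClient({});

    export async function startRenderTask(jobId: string) {
      await ecs.send(
        new RunTaskCommand({
          cluster: "video-render-cluster",
          taskDefinition: "video-render-task",
          launchType: "FARGATE",
          networkConfiguration: {
            awsvpcConfiguration: {
              subnets: ["subnet-0123456789abcdef0"],
              assignPublicIp: "ENABLED",
            },
          },
          overrides: {
            containerOverrides: [
              { name: "renderer", environment: [{ name: "JOB_ID", value: jobId }] },
            ],
          },
        })
      );
    }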

The requests to the serverless apps (mostly Node) have a callback to the Rails app, which then updates the end user state using websockets (which are very easy to use with Rails' ActionCable).


Interested to hear more about your pipeline and infrastructure for processing and delivering video. I'm working with processing short videos at the moment for my current startup, though I didn't use Mux (I figured it was a core competency we needed to develop). It's just a queue using FFMPEG to convert from MP4 to HLS.
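
For anyone curious, a single-rendition version of that conversion looks roughly like this (a simplified sketch; a real pipeline would usually add multiple bitrates and error handling):

    // Simplified sketch of the MP4 -> HLS step (single rendition).
    import { execFile } from "node:child_process";
    import { promisify } from "node:util";

    const run = promisify(execFile);

    export async function mp4ToHls(input: string, outDir: string) {
      await run("ffmpeg", [
        "-i", input,
        "-c:v", "libx264", "-c:a", "aac",
        "-hls_time", "6",               // ~6-second segments
        "-hls_playlist_type", "vod",
        "-hls_segment_filename", `${outDir}/segment_%03d.ts`,
        `${outDir}/index.m3u8`,
      ]);
    }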


I have horror stories about FFMPEG that I won't go into here.

In short, I'm just one person building this - so I'm sticking to what I know best. I want video to "just work" without having to worry about some video format/extension/containers that I have no idea about.

There are a number of video processing services, but Mux really is the best. The API is simple. They have a ton of really nice helper functions that I use a lot (like timestamped thumbnails, preview gifs, and VTT storyboard generation), which I could easily spend a few days building and then countless hours maintaining.
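
To give a sense of what I mean, those helpers are mostly just URL patterns keyed off the playback ID (placeholder ID below):

    // The Mux image helpers are URL patterns keyed off a playback ID
    // (placeholder below), so generating them is mostly string formatting.
    const playbackId = "YOUR_PLAYBACK_ID";

    const thumbnailAt35s = `https://image.mux.com/${playbackId}/thumbnail.jpg?time=35`;
    const previewGif = `https://image.mux.com/${playbackId}/animated.gif?start=10&end=15`;
    const storyboardVtt = `https://image.mux.com/${playbackId}/storyboard.vtt`; // scrubber previews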

I don't doubt that building video infra is a good idea, but just as I'm not about to train my own speech-to-text model, I'm not going to build out video infra.

At least for me, I'm more worried about the end user experience, and the more I can focus on that, the better the overall product will be.


I'm in the same boat - I'm the one building it, and my focus is on the user experience, but the business model won't tolerate the amount of video on someone else's service. :(

I haven't had FFMPEG nightmares yet, but I've done relatively little with it so far.

Any video apps I should look out for? I'm also pursuing a content creation angle that I've yet to spec out, so I'm always curious as to how others have approached the problem.


Hey! Jon from Mux here. Curious about this comment:

> the business model won't tolerate the amount of video on someone else's service

Does that mean you aren't using S3/EC2 or the like, or is there something about how we've built our cloud platform that doesn't work for your business model? We've designed Mux to be a low-level primitive for video, like Twilio is for SMS, so I'd be interested if we're doing something that makes this harder for you.


Hey Jon! I've looked at Mux (mainly the careers page), and it's a great platform. It would be a great fit technologically, but I'm not sure that my business model (which is tentative admittedly) would cover infra costs for processing, hosting and consuming the amount of video I'm eventually expecting, as I'm running on a shoestring budget at the moment.

Plus, it's a good chance for me to learn the ins and outs of video. It's not reflective of the quality of your platform, just a choice I've made early in the piece for curiosity's sake.


Cool - thanks for the reply!

If some credits or a startup discount are ever helpful, we have a startup program and can help out there. Let us know.


Thanks, I'll have a look. If it appears as though it's not worth the effort to maintain my own video infra, you're the first choice.


And this is why Mux is awesome.


If you want, I’d love to chat about it!

Email: lenny@milkvideo.com


> The frontend is a React app based on Redux Toolkit and Recoil.js

Hey, great to see Redux Toolkit being used in the wild! Would love to hear your thoughts on using RTK, and I'm particularly curious about the combination of RTK + Recoil together. What use cases are you using each of those for?

Please let me know if you've got any suggestions for improving RTK! I'm usually in the Reactiflux Discord evenings US time, and always happy to chat.


Redux Toolkit is INCREDIBLE. I have the utmost respect for the developers working on it. I've worked on 6 large Redux-based applications, and they were all implemented incredibly differently. This has been the first time I really love the implementation.

I am using RTK for the overall app state and Recoil for the on-page state. I make API requests and store the results in the redux store, but the hooks/prop passing is too slow for handling video players/transcript manipulation.

I initially had everything in RTK, but noticed the render cycle for dispatching to and listening to the store was creating unusual issues.

With Recoil, I'm able to represent the video player's current time state and then listen to it in other parts of the app. Similarly, when the transcript updates the time, the React-Context-based API performs better than hooks/props.
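
A stripped-down sketch of that split, with illustrative names rather than our actual components:

    // Illustrative sketch: Redux Toolkit keeps app/server state, while a
    // Recoil atom carries the fast-changing player time so only the
    // components subscribed to it re-render.
    import React from "react";
    import { atom, useRecoilValue, useSetRecoilState } from "recoil";

    const playerTimeState = atom<number>({
      key: "playerTime", // unique atom key
      default: 0,        // seconds
    });

    // The video player writes the time on every timeupdate event.
    function VideoPlayer({ src }: { src: string }) {
      const setTime = useSetRecoilState(playerTimeState);
      return (
        <video
          src={src}
          controls
          onTimeUpdate={(e) => setTime(e.currentTarget.currentTime)}
        />
      );
    }

    // The transcript reads the time without touching the Redux store at all.
    function CurrentTimeLabel() {
      const time = useRecoilValue(playerTimeState);
      return <span>{time.toFixed(1)}s</span>;
    }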

Happy to dig more into this. I'll reach out via Twitter too!


Congrats on the launch! I plan on running/promoting webinars a lot more this year and will check this out. Really interesting what you shared about your backend/infra. Could you share more about how you chose your transcription/video APIs vs Google/AWS?


Regarding general infrastructure, we are running on Heroku, AWS, and GCP for various things.

I touched on the video APIs above, but regarding transcription - I have a lot of thoughts.

We chose to work with AssemblyAI (https://assemblyai.com) after trying AWS's transcript service, RevAI and Google's Speech-to-text API.

First, we started by doing manual transcripts through Rev, but the cost was unmanageable at scale. We were really happy with the quality, but couldn't charge $100 per video, so we needed a cost-effective automated solution.

I then found an old blog post from Descript's co-founder Andrew Mason, who talked about which speech-to-text API they decided to use. The blog post is old, so the metrics used are probably outdated, but I was impressed they decided to use Google's API.

We implemented the GCP option, and I was shocked by how slow and expensive it was. For one, the quality wasn't that great, and to use the lower-cost option (audio-only), you need to do some additional FFMPEG-based transcoding, which is very error-prone. Because we receive a range of video types from users, it was causing more problems than it was worth dealing with. Also, the time lost made the cost savings irrelevant.
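
Concretely, that extra step amounts to pulling a mono 16 kHz audio track out of whatever container the user uploaded before sending it to the API (a rough sketch, not our exact command):

    // Rough sketch of the extra transcoding step: extract a mono 16 kHz audio
    // track from whatever container the user uploaded before sending it off.
    import { execFile } from "node:child_process";
    import { promisify } from "node:util";

    const run = promisify(execFile);

    export async function extractAudioForStt(videoPath: string, outPath: string) {
      await run("ffmpeg", [
        "-i", videoPath,
        "-vn",          // drop the video stream
        "-ac", "1",     // mono
        "-ar", "16000", // 16 kHz sample rate
        outPath,        // e.g. audio.flac
      ]);
    }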

Enter - AssemblyAI.

I did some research around what other companies are using today, and saw they have great ratings on G2 (https://www.g2.com/products/assemblyai-speech-to-text-api/re...). The CEO jumped on a call on a Sunday, when I was trying to improve our transcript processing time, and after testing the API, I was shocked that the transcript quality was closer to the human-done transcripts I was getting, at a cost significantly cheaper than the Google option.

Conclusion - we needed speed, quality, low-cost and support. AssemblyAI won on all these fronts.


I’m interested to try this as my day job involves editing 2-3 webinars and 2-5 Zoom interviews per week.

Currently I use Premiere Pro with some templates I’ve created.

I haven’t found any of the transcript based editing tools to be robust enough. Descript is buggy and slow on my MacBook Pro (which has zero issues running Prem Pro & After Effects). Transcriptive has issues where it gets out of sync with the original.

What would be really helpful is detecting speaker changes, long pauses, the start and end of slide presentations (or switching to different decks), and transcription if it actually works smoothly and stays in sync across edits.


Would you email me, and we can chat? lenny@milkvideo.com

The inspiration behind making this is to replace Premiere Pro, so I'd love to understand your MVP to solve your problem.

I do think we can be very performant for you, given that all the processing is done on the cloud, and you are only ever interfacing with JavaScript/video tags. That being said, there is work to do!

We don't have speaker diarization right now, but it's just a feature flag for us. Also, the start/end detection you mention is something we don't have active yet, but it's planned for next week.


We tried out the demo last summer (which was a bit of manu-mation before the product was fully built out -- kudos to Lenny & crew for doing things that don't scale) and had a great experience! Here's an example of one of the videos we got via Milk: https://www.youtube.com/watch?v=O4jOqVqyAo8

Excited to circle back and try it out again now that it's software instead of humans doing the heavy lifting behind the scenes!


Thank you!

Context here - before we started working on the current software, we planned to do an opaque marketplace for post-production video work. To vet the idea, we reached out to companies with webinars and manually edited their videos.

In the process, one finding was that it's hard to make styled word-by-word highlighted captions. This resulted in a small utility app that turned SRT and VTT caption files into dynamically sized/styled caption videos, which later evolved into today's product.


I’m using Type Studio (https://typestudio.co). What is the difference?


We aren't focused on being a transcript editing tool. You can upload a video, get a transcript, and edit it in Milk Video, but that's not our focus.

Our focus is helping make a visual clip that is engaging, based on the transcript information.

Here are some examples I posted above:

- https://twitter.com/rememberlenny/status/1339618249575714816...

- https://twitter.com/rabois/status/1310644068326629376?s=20

- https://twitter.com/m_cieplinski/status/1356331228954292224?...


Assembly AI is at $0.50 per hour, which is not extremely reasonable these days. With open source models like Facebook RASR or Vosk you can get a self-hosted solution with even better accuracy at a cost of $0.05 per hour, 10 times cheaper.

Once any of your customers come to require a private video setup, you'll end up moving to a self-hosted solution anyway.


I'll say this clearly: Assembly AI is hands down the best speech-to-text transcription service on price, quality, speed, and support. Hands down. It more than pays for itself time and time again.

If we were free, then I might have this concern, but our average customer value is far beyond the cost to transcribe.

Actually, this is insanely cheap, especially given how much they are regularly improving. Amazon costs over $1.25/hour, RevAI costs over $2/hour, and Google Speech-to-Text is over $2.15/hour.

We have a shared Slack channel with their team and I can't convey how incredible they have been. Literally the moment we have a question, we get instant responses.

Also, our users will not tolerate poor transcripts. Every word that needs correcting is time/money lost for them, so our goal is to give them the highest quality transcripts.

We want to focus on our key value props, and transcription is not one of them. We focus on user experience, design options, and speed.


In the mobile view of https://milk.video/pricing, all I see are numbers/unlimited. I guess the row titles got scrolled out of view.


Thanks! Will fix this. The app definitely doesn't work on a small screen, but the homepage/pricing should.

FWIW - they are Webflow templates, but do a great job at making it easy to manage.


Just checked this out. I prefer Descript. Better editing, as well as overdub.


Thanks for signing up and trying it out.

We actually drive people to use Descript for most use cases that aren't relevant to us.

Think Photoshop vs Canva.

Since speech-to-text APIs have become really good (props to companies like AssemblyAI, https://www.assemblyai.com/), transcript-based interfaces are going to become much more common.

Our product goal is to solve the use case around making the visual output, when editing/correction isn't the goal. That being said, the editor should be performant and work well, so lots to improve there.

As an aside, there are a few evolving open-source libraries that consume the output of these STT services (https://github.com/bbc/react-transcript-editor) and make turnkey transcript interfaces.

The newest/most developed one I like is based on Slate, and made by a really amazing engineer at the Wall Street Journal named Pietro.

Link: https://github.com/pietrop/slate-transcript-editor


How does Milk compare to Descript?

What are the advantages/disadvantages?


Thanks for asking this. This is like comparing Photoshop and Canva.

Descript is hands down the leader in any transcript based video/audio editing. They set the standard for detailed editing and magically manipulating audio/video.

We are focused on the workflow around creating something visually appealing that uses a Zoom recording. Specifically, the transcript-based interface is only for speeding up the review process; our main focus is on visual templates to drop the video/captions into.

One way to think of it is that we took the Descript Audiogram feature and built out a workflow that creates a wider variety of outputs applicable to marketing/sales-related needs.

We are solving the problem where you need to quickly take a video recording and make something you/your team can proudly share on social media that reflects your company's brand guidelines/visual aesthetic.


I've not used the product, but I know the founders, who are both extremely talented. Super excited to know about this rocket during its early days!


Hey Lenny, congrats on the launch!!!


Thank you!


Go, Lenny!


Thank you Anne!!


Hey Lenny! Congrats on the launch, so excited to see your product come together.


Thank you Chris for your advice and support! The partnership you and Luke have is an inspiration for Ross and me.



