Hacker News
Show HN: Edit videos faster by automatically removing silences (kapwing.com)
171 points by shahahmed on Feb 3, 2022 | 89 comments
Our team is filled with technologists and creators, and when we record and edit videos, 80% of the time is spent chopping up the video, removing silences, and picking the right takes. So we decided to build a tool that did that for you — or at least get you there most of the way!

Our initial implementation is somewhat naïve and uses a user configurable silence threshold that just reads in volume levels. In the future, we’d like to use a frequency-based approach that focuses on the human voice. We’re also open to ideas, so let us know if you have any!
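For anyone curious what a volume-threshold approach like this can look like, here is a minimal Python sketch. It is not Kapwing's code: the function name, the -40 dB default threshold, the 50 ms window, and the 0.3 s minimum silence length are all assumptions picked for illustration, and samples are assumed to be floats in [-1.0, 1.0].

```python
import math

def find_silences(samples, sample_rate, threshold_db=-40.0,
                  window_s=0.05, min_silence_s=0.3):
    """Return (start_s, end_s) spans whose windowed RMS level stays below
    threshold_db for at least min_silence_s. All defaults are illustrative."""
    window = max(1, int(sample_rate * window_s))
    silences = []
    run_start = None  # sample index where the current quiet run began
    for i in range(0, len(samples), window):
        chunk = samples[i:i + window]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
        if level_db < threshold_db:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # A loud window ends the quiet run; keep it only if long enough.
            if (i - run_start) / sample_rate >= min_silence_s:
                silences.append((run_start / sample_rate, i / sample_rate))
            run_start = None
    # Close a quiet run that lasts until the end of the recording.
    if run_start is not None and (len(samples) - run_start) / sample_rate >= min_silence_s:
        silences.append((run_start / sample_rate, len(samples) / sample_rate))
    return silences
```

A user-facing "sensitivity" slider would map onto `threshold_db` here, and `min_silence_s` keeps short breaths between words from being cut.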




I watched the demonstration videos on the landing page, and the effect wasn't as bad as I often see on some youtube videos, but I think that's because the subject was sitting at a desk, rather than standing, moving around much more.

Anecdata - heavily jump-cut edits on youtube instantly inspire me to find an alternative source of the same information, as the breathless, rapid-fire, jerky-video sensation is irksome. Evidently I'm in the minority, and I'm okay with that.


I think it's a gradient -- one extreme is that you have a recording with multiple takes and several long pauses as you're reading a script, in which case you want to streamline the clean up process. But I think it's a creative decision to make the cuts extremely tight and jumpy, not for everyone!


Or you have multiple camera angles and you switch between them to hide the cuts. Like they've done in movies forever. People started doing jump cuts in youtube videos and for some reason we're stuck with it.


>Or you have multiple camera angles and you switch between them to hide the cuts. Like they've done in movies forever.

Not exactly. Most movies use https://en.wikipedia.org/wiki/Single-camera_setup, so even though there might be cuts to different camera angles, each angle actually comes from different takes. The goal of different angles isn't really to cut out pauses/fillers, it's for stylistic/cinematography reasons.

>People started doing jump cuts in youtube videos and for some reason we're stuck with it.

Probably because setting up multiple cameras and lighting them is cost prohibitive for low budget videos on youtube?


> it's for stylistic/cinematography reasons.

Yes. One classic example is in a scene with two people speaking, you have two different camera angles to switch focus between them.

More generally, film makers who want focus / depth-of-field effects very often get them with two cameras / takes where the focus is already set. Changing focus in the middle of a shot is very jarring.

Incidentally, this is something that became immediately obvious to me when I started to use my DSLR for video.


for some that's viable - either have multiple cameras or record, reset, and record again with different angles, but that's a lot of overhead for certain situations. has its place for certain projects i think


Undeniably.

At the top end you've got people comfortable with one or more cameras, a clear message, good communication skills, a well honed script / presentation -- in those cases the information density is naturally going to be high, and any brief pauses are thoughtfully placed to allow the viewer to contemplate & digest. These types of videos are rare but excellent, per my taste.

As I noted, I'm definitely in the minority camp here -- the desire that a poorly prepared video should have been a web page is probably generationally akin to the comparably curmudgeonly contemplation that rambling online meetings should have been an email.


It would be neat if the application doesn't make a hard cut, but merely speeds up the silent parts.

A more complex algorithm might watch movement and tune the amount of speed up so the visual doesn't get jumpy.
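A rough Python sketch of the "speed up instead of cut" idea, assuming you already have a list of non-overlapping silent spans in seconds. The function names and the 4x default are made up for illustration; tuning the speed from on-screen movement is left out.

```python
def speed_plan(duration_s, silences, silent_speed=4.0):
    """Split the timeline into (start_s, end_s, speed) segments: normal
    speed for speech, silent_speed for each silent span.
    Assumes the silent spans do not overlap."""
    plan = []
    cursor = 0.0
    for start, end in sorted(silences):
        if start > cursor:
            plan.append((cursor, start, 1.0))
        plan.append((start, end, silent_speed))
        cursor = end
    if cursor < duration_s:
        plan.append((cursor, duration_s, 1.0))
    return plan

def output_duration(plan):
    """Length of the result once every segment plays at its own speed."""
    return sum((end - start) / speed for start, end, speed in plan)
```

For a 10 s clip with one silence from 2 s to 6 s, the plan has three segments and the output runs 7 s instead of the 6 s a hard cut would give.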


I like this idea - more sophisticated, and maybe a future version. Thank you!


Try also a fade-out-in of .25+.25 secs (or less) and I guess most of the annoyance will be gone (or make it customizable).
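One way to express that fade as a gain envelope, sketched in Python. The function name is hypothetical, `cut_times_s` is assumed to hold the positions of the joins on the edited (output) timeline, and the ramp is linear for simplicity.

```python
def fade_gain(t_s, cut_times_s, fade_s=0.25):
    """Gain in [0, 1] at output time t_s: 0 exactly at a cut, ramping
    linearly back to 1 over fade_s seconds on either side of it."""
    if not cut_times_s:
        return 1.0
    nearest = min(abs(t_s - cut) for cut in cut_times_s)
    return min(1.0, nearest / fade_s)
```

The same value could drive audio volume or video opacity, and making `fade_s` a parameter covers the "make it customizable" part.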


Very nice! As someone who wrote a native tool to do pretty much this (Recut / getrecut.com) it's super impressive to see it done in a web browser. The editor feels very fast and fluid.

Doing it natively was hard enough, and recently I've been rewriting Recut with Rust + Electron so I have an idea of how much work it was to get it working well :sweat-smile: Keep up the good work y'all!


I love your tool for trimming down interview videos from Zoom!

However, recently I updated my recording setup with a new platform. Now I have separate video tracks for each participant and recut doesn’t handle multiple videos. Will this be supported in the new release?


Multi-track support is one of the things I’m trying to bake in from the beginning this time and I’m hopeful that’ll make it into the new release.

I’ve heard it said before (I think it was on the Software Social podcast) that whenever you have a choice to support 1 or many of a thing, always err on the side of “many” (users, teams, etc.) Well it turns out that applies to video tracks and clips too! There are massively more things to consider once there’s more than one, haha.


Looking forward to it!

On 1 vs many, I think it's totally reasonable to start with 1 given you are a small team. If you hold yourself to a high standard (which takes lots of extra person-hours), then you lose the key agility you have compared with bigger players.


Hi Dave, I just found out about your tool from this post and it looks amazing! I was about to download it to try it out but I'm now hesitant to try (and possibly later purchase) if it's being rewritten in Rust + Electron rather than a native Mac app. I wouldn't want to wake up one day and find a great native app has been replaced by an Electron app after paying $100 for it. Unless you were planning on Electron for Windows and keeping the Mac app native?


Hey! The current plan is to offer both versions, at least until the Rust/Electron app is at feature parity with the Mac app. In any case, the Mac app will continue to work, and you wouldn't be forced to switch versions.

On the Rust/Electron side, all the heavy lifting is done in Rust-land, and so far, it's going pretty well. It's not all roses of course - it takes a bounce or two to start instead of launching instantly, and it seems there's not much to be done about Electron's baseline memory usage or size on disk.

(I went down a crazy rabbit hole of compiling Chromium/Electron to see if there were things that could be stripped out, because I honestly don't need much, but realized that is a very large undertaking. Possibly another day. Some kind of "Electron Lite" would sure be awesome...)

But in general I'm spending an inordinate amount of time paying attention to performance. It was my main concern with the Electron platform. So far I've got playback on par, and stuff like loading files, silence detection, and drawing waveforms is actually faster than the Mac version. It feels snappy.


Just wondering, did you consider Flutter? Thoughts?

It would help with both the disk and memory usage, easy to learn after web frameworks (both the framework, Flutter, feels familiar and Dart is super easy to learn), works well with Rust, but of course it has its downsides (no longer can rely on the html js css ecosystem).


I looked at Flutter briefly. To be honest, I was a little worried about betting on a Google thing, with the ecosystem being comparatively still pretty small, and with Dart.

IMO this project is already in debt on innovation tokens[0] given I didn't know Rust coming in, nor much at all about video. And then, hybrid web/native stuff is just not very widespread so there's not a lot of existing answers for things. Lots of digging into code and figuring stuff out. The web stack (Svelte, TypeScript) is the one part of this thing that already felt familiar, and I didn't want to throw that out.

0: from Choose Boring Technology :) http://boringtechnology.club/


Have you investigated Tauri [1]? It's a lightweight wrapper around native web views that connects to a rust backend. I've had great success with it so far.

[1] https://tauri.studio/


Yeah I looked at Tauri! It's pretty great and I'm excited to see where it goes. I didn't end up going with it because I had a hard time figuring out how to efficiently send data from Rust -> browser without making a copy, and Tauri is still fairly early on the native integration stuff. I'm definitely going to keep an eye on it though. One benefit of having most of the code in Rust is that I could feasibly port it to another platform.


Curious to know what your rationale was for choosing Electron? Targeting mobile?


It’s mainly for Windows support, with the side benefit of being able to potentially extend to browsers (with WASM). Electron doesn’t run on mobile as far as I know.

I looked at a bunch of cross-platform UI toolkits, and everything has its tradeoffs. Qt is probably the most full-featured alternative, and I’ve had experience with it in the past, but I didn’t love the idea of tackling a project this size in C++. On the Rust side, Tauri is very promising but it’s still early days.

Aside from the framework, the other big tradeoff is 1 app vs multiple - more “nativeness” for a much slower feature velocity. The big companies like Slack and Figma don’t even make that trade with millions in funding… and it seemed unwise to take that on as a solo bootstrapped developer.

In terms of stuff like ecosystem, building/packaging, updater support, and even nice little native details like progress bars on app icons, Electron has a lot going for it. The big downsides in my mind are startup time, baseline memory usage, and disk size. Once it’s up and running, IMO app performance is almost all down to the app’s code. So I’m optimizing where I can - being mindful of algorithms, avoiding memory copies, trying to keep things cache-friendly, avoiding heavy JS stuff, taking copious benchmarks, etc.


thanks for checking it out, Dave! I'm a big fan and learned react from your tutorials :)


Hey, glad to hear that! Kudos again on the editor, it's really well done.


This is basically an attention killing machine and I'm not talking about the method, the method is OK and not novel. I have found that these time cuts completely ruin my ability to concentrate and after it happens 2-3 times in a short period, I lose interest in the video because it is so annoying.


In contrast I speed videos up because of the silences or slow talking. It removes my concentration otherwise. Call it an attention disorder.

I will agree that those cuts in body movement seem odd if anything when silences are removed. Maybe put a cat picture in the middle.


Agreed. I find it extremely hard to watch videos where the sense of flow is completely disrupted by these cuts. It seems to be the norm though so I guess we're in the minority.


+1, perhaps a sizeable minority? It's probably a generational thing.

Personally I'd recommend writing a script and rehearsing it, instead of going on the fly and having to edit out the umhs, aahs and awkward silences. This might also raise the standard for excitement above the generic "super stoked".

Having said that, I think there's probably a market for this. Good luck to OP!


The simplest way would be to make it like that and then write a script from the result, then re-shoot as a single uninterrupted stream of information.


This whole concept of scripting and rehearsing is an older generation thing. Better to sound natural and real than rehearsed.


Potentially, but my favourite channels are those that don't jump cut in this way and still don't sound rehearsed. I think you're right about the generational thing, but it also speaks to a lack of effort on the creator's part. Why learn to be a good presenter when you can have the technology take care of it for you?


Honestly the constant jump cut style of editing feels so unnatural to me... I'd rather watch someone take pauses and not be taken out of the moment.... I'm sure this has applications particularly in advertising/marketing but it's not without its issues.


Most people can read 5-10x faster than they can speak. Videos are very difficult to skim compared to text. If your video is trying to inform people, and there is no emotional content that's lost by cutting, then you're playing a losing game time-wise against just publishing text. People get bored pretty quickly, so people use jump cutting on YouTube to increase information bandwidth from the default that's frankly unacceptably slow to many viewers. None of it actually ends up fast enough for me, it is rare that I will suffer through a YouTube video that I could read about instead. In order to beat reading, IMO you need to use the visual bandwidth too, like 3Blue1Brown does. A talking head YouTube video is the worst of all worlds, but that's where people's audiences are, so they're just trying to make it bearable.

Cutting also removes ums and ahs. Newsreaders take ages to get good at this at speed, so giving people these tools makes video more accessible as a publishing medium.


Reminds me of the tool presented in [1], which also shows some interesting applications. Apparently an improved version is now being sold on its own platform [2], but the original Python script is still available on Github [3].

[1] https://www.youtube.com/watch?v=DQ8orIurGxw

[2] https://jumpcutter.com

[3] https://github.com/carykh/jumpcutter


oh this is very cool - haven't seen this one. thank you for sharing!


Whoa. For years I've been seeing YouTube videos that seem to just teleport choppily around what I assumed were silences. I always assumed there was a standard tool that everyone used to do this. I can do the equivalent thing to a podcast episode in like 5 seconds in Logic. The notion that people have been doing this by hand is staggering, but kudos to you for finally coming along and filling this niche.


It's not automatic, but there is this marker tool that uses an audio track to make note of important timestamps while you are recording: https://github.com/evankale/Blipper

The author also wrote supporting scripts for Vegas to extract scenes based on the position of the blips in the audio.


Can I ask what tool / plugin you use to do this in Logic?


1. Highlight the clip and "strip silence" to split it into a bunch of separate clips that leave out the silent bits

2. Highlight those clips and "shift left within selection" (I might not have that command name exactly right) to collapse them against each other
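The same two steps can be sketched outside Logic in a few lines of Python. This is only an illustration of the idea, not how Logic implements it; the function name is made up, and `silences` is assumed to be a list of (start_s, end_s) spans.

```python
def strip_and_shift(duration_s, silences):
    """Step 1: keep only the spans between silences.
    Step 2: slide each kept clip left so it butts against the previous one.
    Returns (source_start_s, source_end_s, output_start_s) per kept clip."""
    clips = []
    cursor = 0.0     # position in the source recording
    out_pos = 0.0    # position on the collapsed output timeline
    for start, end in sorted(silences):
        if start > cursor:
            clips.append((cursor, start, out_pos))
            out_pos += start - cursor
        cursor = max(cursor, end)  # max() also tolerates overlapping spans
    if cursor < duration_s:
        clips.append((cursor, duration_s, out_pos))
    return clips
```

Because the result is just a list of time ranges, the original media is never modified, which also makes the edit easy to undo.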


thanks! yeah, we have several video editors on our team and i edit a lot of videos - it's just how it is haha. there are tools that help with this problem, but they tend to be plugins or one-off tools, but we're happy that we can go end to end in one spot all in the browser


I'm pretty sure I saw an ad for such a plugin more than 5 years ago. Is it now built into most editors such as Final Cut Pro?


if you know how to do this natively in Final Cut, I would love to know how -- would be super useful for me!


I get the use case. Not sure there's enough value here as a cloud-based SaaS product. I use a similar product called Recut (https://getrecut.com/). $99 one-time fee (no subscription). Easy to open the edited project in NLEs. macOS only. Unclear that Kapwing is a better option when Recut would be breakeven at seven months and you don't need to round-trip videos through the cloud.


Oh hey, I made that! Glad to hear you’re getting good use out of it :)

It’s Mac-only right now but I’m currently working on a Windows version.

I’m a big fan of one-time licenses and the Sketch-style “1 year of updates but use it forever if you want” kind of thing too, and Recut will probably switch to that at some point to make this whole thing more sustainable. What’s interesting is I’ve noticed that not everyone feels this way - some people genuinely would rather pay a smaller fee, but monthly. The thought has crossed my mind to offer both.

I do think there’s a large chunk of the market who don’t use NLEs and would rather avoid them (especially in the mobile crowd) and something web-based like Kapwing is probably preferable for those folks.


Agree on that last sentence. Thanks for making Recut!


That's a fair criticism - for certain users, I do think ramping up on a native NLE (which can be flat rate or SaaS) and then using a plugin/external tool like Timebolt or Recut is a lot. To some, exporting on a 64-core machine that's not your local machine (which would otherwise heat up) + having access from any browser is attractive, but obviously there are tradeoffs.


Fair point. That makes sense. May be an opportunity for better messaging for who the product is for. Good luck with the roll out. Others mentioned Descript (which is solid). There's a labs project from Adobe that's exploring the same space (https://labs.adobe.com/projects/shasta/). So lots of options. A more clear customer target is needed (zero experience, prosumer, post-production creative professional). Again, not knocking your product. Just feels like this may end up more of a feature than a product — the subscription angle will force you to add more value and better target the right audience.


Can somebody do a comparison of this vs recut vs timebolt please?


I don't know if any other apps do this but there are plenty of Japanese youtube channels that do this

This one in particular is the one where I was introduced to the style but it certainly wasn't the first to do this.

https://www.youtube.com/c/%E6%9C%89%E9%9A%A3%E5%A0%82%E3%81%...

To be clear, they aren't just cutting out silence, they are cutting it to the point that it's a style and the speech pattern, particularly of the Owl character, sounds unnatural, which I'm assuming is what they were going for. That style of cutting phrases together nearly faster than they would be naturally spoken is what I'm saying is a trend in some Japanese videos.

Note: The channel itself is run by a stationery store chain from Tokyo. The funny part is the Owl character often makes fun of what they're showing as in "why would anyone buy this?" or "That's way too expensive" which is funny for a channel run by a store selling most of the things they're showing off.


> I don't know if any other apps do this...

https://www.descript.com/ comes to mind


Congrats on the launch Shah! I can tell you stayed up late giddy for this launch :D. As another peer building for video creators, I am delighted to see more efficiency features like this released.

This approach was the one I tried first also (I also tried the frequency one fwiw, which has its own, worse drawbacks). But using loudness runs into issues if the source loudness isn't (relatively) even across the entire source media. Using a single sensitivity setting like this would be a problem if:

* recording gain is set to automatic, and there are sudden changes in noise floor like wind (if recorded in 24-bit or lower)

* crew adjusts gain partway through recording (big no-no but happens)

* talent/host moves in and out of microphone sweet spot

* talent/host adjusts themselves in a squeaky chair during silence or transition-to-silence (or coughs, or breathes loudly, or an ambulance goes by...)

If you apply the edit w/ a single sensitivity and something like the above is true, it would cut in the wrong place. Unfortunately, you would have to watch the entire show, skipping to boundaries with your full attention, to know whether it ever got a cut wrong.


The single-level approach is what Recut does too, and it tries to take a guess at a threshold with clustering but it's not always perfect. Maybe a better way to go would be a dynamic noise gate or kalman filtering or something.
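To make the "guess a threshold with clustering" idea concrete, here is a tiny Python sketch. It is not Recut's actual algorithm, just a 1-D two-means pass over per-window loudness values (in dB) that returns the midpoint between the "quiet" and "loud" cluster centers. The function name and iteration cap are assumptions, and the input list is assumed to be non-empty and free of -inf values.

```python
def guess_threshold_db(levels_db, iterations=20):
    """Split window levels into a quiet and a loud cluster (1-D 2-means)
    and return the midpoint between the two cluster centers."""
    lo, hi = min(levels_db), max(levels_db)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        quiet = [x for x in levels_db if x <= mid]
        loud = [x for x in levels_db if x > mid]
        if not quiet or not loud:
            break  # everything landed in one cluster; nothing to separate
        new_lo = sum(quiet) / len(quiet)
        new_hi = sum(loud) / len(loud)
        if (new_lo, new_hi) == (lo, hi):
            break  # converged
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2
```

This still yields one global threshold, so it inherits the problems listed above when the gain or noise floor changes mid-recording; running it per chunk of the recording would be one possible workaround.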

Vidbase is looking awesome btw! I bet it's going to be huge. It looks like you've paid an insane amount of attention to the details.


this is super useful insight for us, thank you for sharing. yeah another product we're working on is "auto audio leveling", which I hope solves some of this, but we'll see.

and yes, I was very excited, thank you for checking it out, Van!


Weird. I edit a lot of videos but this is basically not something I have ever needed. The natural cadence of speakers is more valuable to me.


I think it can depend a lot on the style you're going for, but also on what kind of video it is.

With multiple speakers at once, I think removing a lot of silence can make it sound pretty unnatural.

As a solo person speaking to a camera, or recording a screencast, when I'm not perfect out the gate (read: I repeat each line like 6 times), I've found it easiest to take a breath and leave a pause between each take. After that, the editing is pretty straightforward, just cutting out the silences and keeping the good takes.

Some people are super comfortable on camera and can just talk for a few minutes and it's done... or memorize a script and knock it out... but sadly I'm not one of those people. Even with upfront planning, I'm more like "record 45 minutes to make a 5 minute screencast" so removing silence automatically is a huge time saver.


If you watch YouTube you'll notice there are lots of folks who edit out silences and disfluencies. It really depends on what kind of video you're making :)


If it's the type of video where cutting silence automatically is viable, I wish they would just write out the transcript.

You can read what they're saying in a fraction of the time it takes to watch a video, and often internalize it better.


I think we'll display a transcript like this eventually, but I still believe the timeline should be the core driver of control


I use final cut to edit videos. There are two things I haven't figured out how to do, and that I would love an automatic solution to:

1. Cross fade the audio between clips, without crossfading the video, if the sound is available past the cut point. I think it's pretty much 90% of the time that I would want something like that.

2. I have clips from a gopro that had a selfie stick that made a loud clicky sound from rattling that peaks above regular audio. Some way to lower the volume on that clicky noise without having to manually go through and do it for each click.
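Both wishes can be sketched on raw audio samples in Python. These are illustrative helpers, not Final Cut features: the names, the linear ramp, and the ceiling/pad/gain values are all assumptions, and real click removal would usually need something smarter than a level check.

```python
def crossfade(a_tail, b_head):
    """Item 1: blend the audio that runs past the cut in clip A with the
    start of clip B, using a linear ramp. Video is untouched."""
    if len(a_tail) != len(b_head):
        raise ValueError("both overlap regions must have the same length")
    n = len(a_tail)
    out = []
    for i in range(n):
        t = (i + 0.5) / n  # 0 -> 1 across the overlap
        out.append(a_tail[i] * (1 - t) + b_head[i] * t)
    return out

def tame_clicks(samples, ceiling, pad=2, gain=0.25):
    """Item 2: wherever a sample exceeds ceiling, scale it and pad
    neighbours on each side by gain. Other samples are left alone."""
    ducked = set()
    for i, x in enumerate(samples):
        if abs(x) > ceiling:
            ducked.update(range(max(0, i - pad), min(len(samples), i + pad + 1)))
    return [x * gain if i in ducked else x for i, x in enumerate(samples)]
```

The click helper only works when the rattle really does peak above everything else, as described; if speech gets as loud as the clicks, a level check alone cannot tell them apart.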


i really like the idea of automating the first one - sort of solves for not needing room noise as a bridging mechanism. i think we'll try this!

dont know how we'll deal with the second one ...


I believe there is (or was) a product used by TV stations which would semi-automatically re-edit a movie to fit in a specific amount of time, mostly by removing the start or end of a cut where not much was happening on the audio or video. Assuming I'm not completely imagining that, anyone know what it's called?


I'm glad you guys made this option available in the browser. For those that don't know, YouTubers can spend more than 50% of their editing time just cutting/trimming down silences!

Source: I am a YouTuber.

I even tried to tackle this problem in 2021 by building Atomic Edits[0] - an electron desktop app that did the same thing! I built a working prototype, but eventually stopped working on it after realizing that lots of web-based video editors started offering this functionality. In retrospect, it was obvious that doing this in the browser with cloud saves was way better.

Anyways, nice work! I'll check it out later.

[0] https://www.github.com/SuboptimalEng/atomic-edits


Instead of cutting out the silent spots, why not speed them up? If the presenter is silent b/c they are drawing something on the board (like in a lecture), then the result of a hard cut will feel choppy.


In many cases, speeding up the entire video is even better.

I've been watching a lot of educational videos and lectures on 2x or even 3x speed, thereby saving a ton of time.

Of course, when the entire video is watched at 2x speed, the pauses can comfortably be shown at 5x speed or even faster.


IMHO, the speed up should be user controlled. Not everyone's first language is English, and not everyone can tolerate higher rates of communication.


Yes. Each type of silence requires its own method of cutting, like speeding up or putting a graphic in the middle.


that's a good idea. we'll try adding that, but i still think there are a lot of situations of "dead air," like between takes or long pauses.


Somewhere in the afterlife, Harold Pinter is cursing.


i think pinter pauses will always be a thing! it's up to the creator to decide what they're going for, certainly punchy dialogue and jump cuts have their place + i think with this UX we have, you can decide which pauses you want to keep


Oh definitely. I was just making a funny ;-)


On the about page you mentioned that you raised instead of continuing on the bootstrapped path. What made you pull the trigger?

My first impression when I checked out the landing page and product was: this is a perfect bootstrapped SaaS product idea to grow to meaningful ARR with a small team of 2-3 + some hired help for things like customer support.


[This is Julia, the cofounder/CEO] We bootstrapped for 8 months and grew to about 100k of MRR before raising money. I wrote a blog post about this decision https://www.kapwing.com/blog/why-we-decided-to-raise-money-f....

tl;dr is that 1) It's a competitive domain, so we felt we needed to grow our team quickly to stay relevant 2) There's a lot of technical challenges with video editing in the browser + cloud, so we needed to be able to hire talented engineers to scale and 3) I believe there's a huge business in empowering creative teams, so we think we can get to venture scale and needed the capital to support that.


Awesome! Thanks for sharing. I personally believe big ambitions lead to a more interesting journey, as well as having the ability to have more impact at scale. Impressive MRR!


The tool seems useful, but listening to the edited audio gives a suffocating feeling, because the speaker makes no pauses to breathe. Absolutely terrible sensation, exactly like I had earlier with commercial news radio stations that were packed with advertising and announcements.


This service seems very similar to another product I saw here months ago on Show HN called SavvyCut (https://www.savvycut.com/).

Can you comment on differences?


just tried this based on your comment. it does seem to re-process on every slider change, which I don't love. but definitely gets the job done! i prefer (biased obviously) live updates and also Kapwing lets you do general timeline edits you'd expect as well (text, overlays, add multiple tracks, etc.)


How would you compare your offering to Descript? The pricing appears similar.


I feel like Descript doesn't work well when you're editing footage with other noises or the video isn't talking based. eg. for a vlog, there are talking sections and non-talking sections where Descript wouldn't be helpful for editing


i think there is overlap with descript, however i do think descript is transcript-first, and they focus on that level of control. I think video should be edited with a timeline as the core driver. I'm sure the right answer is somewhere in the middle, but I think having a robust timeline is important for video editing


I use ffmpeg for this. It has some silence removal options described in the man page. I've never gotten it to work really well, but for the stuff I do, it helps.
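For reference, a sketch of one way to drive this from Python using ffmpeg's `silencedetect` audio filter, which only reports silent spans (the `silenceremove` filter is the one that actually drops them). The -30 dB noise level and 0.5 s minimum duration are assumed starting values, the helper names are made up, and the parser assumes the `silence_start:` / `silence_end:` lines that the filter writes to stderr.

```python
import re

def silencedetect_command(path, noise_db=-30, min_silence_s=0.5):
    """Build an ffmpeg command that analyses the audio and writes no file."""
    return ["ffmpeg", "-hide_banner", "-i", path,
            "-af", f"silencedetect=noise={noise_db}dB:d={min_silence_s}",
            "-f", "null", "-"]

_START = re.compile(r"silence_start:\s*(-?[\d.]+)")
_END = re.compile(r"silence_end:\s*(-?[\d.]+)")

def parse_silences(log_text):
    """Pair up silence_start / silence_end values from ffmpeg's log.
    A trailing start with no matching end is dropped by zip()."""
    starts = [float(m) for m in _START.findall(log_text)]
    ends = [float(m) for m in _END.findall(log_text)]
    return list(zip(starts, ends))
```

You would run the command with something like `subprocess.run(cmd, capture_output=True, text=True)` and pass its `stderr` to `parse_silences`. Tweaking `noise_db` per recording is probably where the "never gotten it to work really well" part comes from.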


Like this a lot, will save time on rough cutting. Did you use a library for volume detection?


not for volume detection specifically, but we use the Web Audio API pretty heavily for this!


love the creativity in finding new use cases for AI that ...really make sense.


Isn't this a basic function in Audacity (for audio)?


Solving this for audio has almost absolutely nothing to do with solving it for video (other than that the detection algorithm will be the same).

Editing audio to deal with "silent sections" is nothing like editing the video for the corresponding sections.


in Audacity you can use Truncate Silence, which is pretty similar. Though obviously just for audio tracks, and I find it a missed opportunity that there isn't a live preview with a slider

https://manual.audacityteam.org/man/truncate_silence.html


We have several silence-trimming operations in Ardour. There's no live preview with a slider because generalized to N tracks where N could theoretically be in the hundreds, that is hellaciously expensive. However, since you can undo/redo trivially when working with metadata instead of actual audio, it doesn't seem like much of an impediment.


I already have a free plugin that does this :(


nice! which plugin is it?


[deleted]



