Hacker News
Twilio Video – Real-time WebRTC video infrastructure (twilio.com)
211 points by timdorr on April 14, 2015 | 104 comments



I was really excited by the title but this seems a bit lacking.

I've done a lot of WebRTC work -- we currently use Google Hangouts everywhere at Stack Exchange for internal team communication, but the 15 person cap hurts some bigger all-hands calls or our weekly remote parties. I don't think the hardest part of WebRTC is transcoding or TURN servers (what Twilio seems to be offering); multiplexing is where all the complexity comes from.

It's very easy to set up a video chat interface via WebRTC small-scale, and good on Twilio for focusing on making it easier to integrate mobile clients too, but I can't wait until someone (most likely the Talky.io team seeing how their betas are going) makes it a bit easier to have large N connections (right now on Hangouts N is 15, on my own stack N is around 10).


Large conferences translate to media server, network, sometimes interop, and eventually session control requirements.

Ideally, media servers with excellent IP connections, distributed around the world, and behind GeoDNS.

It also often means some interop capabilities (signaling and/or media interworking), since many large conference use cases include some users on traditional video (SIP or H.323), Lync/Skype For Business and/or PSTN.

Finally, session control becomes important if we get into enterprise and service provider conferencing markets.

Those reqs - especially media intensive ones - are interesting in that they are mostly not Twilio's core competency. Do they build it? Partner for it? Do they care enough about those use cases?

The answers to those questions are very interesting given Twilio's tremendous developer ecosystem.


Some Ns are larger than others. :-) At Talky, we can scale up to 15-25 participants using the excellent Jitsi Videobridge as a selective forwarding unit, and the results are similar at Jitsi Meet. Even at that relatively small scale, your browser might melt down depending on your hardware. Beyond ~25 we'll need to look at something more like a broadcast scenario a la Hangouts on Air (e.g., do you really need 25+ people actively participating and sending video at the same time?). As gz5 notes, that might imply a need for session control and moderation queues and such, or a larger group of passive audience members with the ability to temporarily grant send privileges to select users as needed. Those are all fun problems to solve. :-)


Hey, my coworkers and I built https://OpenTokRTC.com to test OpenTok (also a webrtc API) video streams. We've tested it with 15 people a year ago and it worked great. I would imagine that they have improved even more at this point in terms of scalability and quality. You can also test the android/ios companion apps, they all interoperate pretty nicely. As an added bonus, they also support real time video recording and playback, which is really nice.


Twilio's step into video streaming is more interesting than video itself though, because now you can call into a webrtc video conference, which can be pretty cool.


Here at Sococo (I work there) we support 20+ clients per meeting. Audio, video, doc sharing, rich presence info, always-on but no bandwidth burden when you're not meeting. Free to try!

We use it ourselves for our Agile process, doing ad-hoc meetings all day in various bullpens; Scrum each morning with 20+ participants; all-hands meetings once a week with a shared presentation. And all day I can see who is meeting with whom in my group. It's really pretty performant.


Connections are P2P for small meetings (<=4), but an MCU switches in automatically when more people 'enter the room'. Still, if you're on the same subnet (in the same building) with some of the participants it keeps those connections P2P based on ping time.

We're distributed across 5 states and 3 countries. Everybody has an equal footing in the meetings since we're all on-line. Even people in the same office (we have several small offices) attend using Sococo - I can see other attendees in the background of some people's videos :)


Thought I'd mention jitsi (https://jitsi.org) since it seems to aim to do what you are wanting. I am not sure how many connections it practically supports currently though.


Jitsi, licode, janus ... there is a whole list of them, all open-source. Unfortunately they conflate the MCU feature with the SDP exchange feature, which may or may not suit your application. For instance, if you want to connect clients in anything but an everyone-in-the-room-hears-everyone-else model, they don't support it.


Bingo. You hit the nail on the head. To efficiently do large conferences, you need to both multiplex audio and video at the media server layer and have a concept of active speakers.


You can write your app to use one of those MCUs for large conference, and accomplish the audio/video multiplexing via those MCUs. But only in a logical full-mesh configuration. So for instance 100 people listening to 6 panel members but not one another - not easy. Not unless the clients cooperate and carefully trim the streams for audience peers (fail to subscribe to them), just leaving the panel streams. It becomes complicated very fast.
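
To put rough numbers on the panel example, here is a minimal Python sketch that counts received streams under a logical full mesh versus a plan where everyone subscribes only to the panel. The function names and the participant labels are made up for illustration; no real MCU exposes this API.

```python
def full_mesh_subscriptions(participants):
    """Every participant receives every other participant's stream."""
    return {p: [q for q in participants if q != p] for p in participants}


def panel_subscriptions(panel, audience):
    """Everyone receives only the panel's streams; audience streams are never subscribed to."""
    everyone = list(panel) + list(audience)
    return {p: [q for q in panel if q != p] for p in everyone}


def count_streams(plan):
    """Total number of received streams across all participants."""
    return sum(len(sources) for sources in plan.values())


panel = [f"panel-{i}" for i in range(6)]
audience = [f"aud-{i}" for i in range(100)]

mesh = full_mesh_subscriptions(panel + audience)
trimmed = panel_subscriptions(panel, audience)
print(count_streams(mesh))     # 106 * 105 received streams
print(count_streams(trimmed))  # 6 * 5 + 100 * 6 received streams
```

The hard part described above is not the arithmetic but getting every client to agree on the trimmed plan.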


Jitsi Videobridge has both of those.


Jitsi only does the audio portion. There's no support for layouts/video mixing.

"Jitsi Videobridge does not mix the video channels into a composite video stream, but only relays the received video channels to all call participants."


Joe, this is actually not true. At least not for Jitsi where we have things like last N and possibilities to have 1:m or n:m sessions ...

There's also absolutely no usage of SDP Offer/Answer in Jitsi Videobridge
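
For readers unfamiliar with the "last N" idea mentioned above, a rough Python sketch follows. It is not Jitsi Videobridge code; the class and method names are hypothetical, and it assumes something else (for example audio level analysis) already reports who the dominant speaker is.

```python
from collections import OrderedDict


class LastNSelector:
    """Keep the N most recent dominant speakers; only their video is forwarded."""

    def __init__(self, n):
        self.n = n
        self._recent = OrderedDict()  # endpoint id -> None, most recent speaker last

    def on_dominant_speaker(self, endpoint_id):
        # Move (or add) the speaker to the most-recent end of the ordering.
        self._recent.pop(endpoint_id, None)
        self._recent[endpoint_id] = None
        # Evict the least recent speakers once more than N are tracked.
        while len(self._recent) > self.n:
            self._recent.popitem(last=False)

    def forwarded(self):
        """Endpoint ids whose video is relayed, most recent speaker first."""
        return list(reversed(self._recent))


selector = LastNSelector(n=2)
for speaker in ["alice", "bob", "carol", "bob"]:
    selector.on_dominant_speaker(speaker)
print(selector.forwarded())
```

With N fixed, each receiver's download and decode cost stays bounded no matter how many people join.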


I think you've hit the nail on the head, here. WebRTC is really great for a handful of people. Implementation using some third party was never the difficult part, but scale was and is still a total nightmare.

I'd love to see WebRTC done by a service provider in a highly scalable way, so I can stop relying on RTMP (and typically embedding a flash player to support it) to deliver live streaming audio/video on a large scale.

As someone who has only leveraged WebRTC through third party providers (e.g., OpenTOK, etc.), I have no idea the complexity of what I'm asking for. Perhaps it's boil the ocean difficult, but it sure would be nice to have!


Yeah, I don't think that it is boil the ocean difficult but there is definitely some complexity there. I have been working on a scala WebRTC server, but I am not sure if it would really work at a large scale

Here was my basic approach:

An end user creates a PeerConnection with a Publisher node and starts sending a MediaStream using a string identifier.

Then another user can create a PeerConnection with a Subscriber node using the same identifier.

The Subscriber node then makes a request to the Publisher to make another PeerConnection to a Registry that exists on the same node as the Subscriber.

The replicated MediaStream can then be attached to the Subscriber.

Since the replicated MediaStream is in a Registry, any additional subscribers can attach the MediaStream on the same node as well.
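
The flow above can be modelled as a toy in Python. This is only an in-memory sketch, not the Scala code from the repo; the class names mirror the description, a plain string stands in for the MediaStream, and a method call stands in for each PeerConnection.

```python
class Registry:
    """Per-node cache of replicated streams, keyed by the string identifier."""

    def __init__(self):
        self.streams = {}
        self.replications = 0  # how many times the publisher had to send a copy


class PublisherNode:
    def __init__(self):
        self.streams = {}

    def publish(self, identifier, stream):
        self.streams[identifier] = stream

    def replicate_to(self, registry, identifier):
        # Stand-in for the publisher opening a PeerConnection to the registry.
        registry.streams[identifier] = self.streams[identifier]
        registry.replications += 1


class SubscriberNode:
    def __init__(self, publisher):
        self.publisher = publisher
        self.registry = Registry()

    def subscribe(self, identifier):
        # Only the first subscriber on this node triggers a replication.
        if identifier not in self.registry.streams:
            self.publisher.replicate_to(self.registry, identifier)
        return self.registry.streams[identifier]


publisher = PublisherNode()
publisher.publish("room-42", "<media stream>")
node = SubscriberNode(publisher)
viewers = [node.subscribe("room-42") for _ in range(3)]
print(node.registry.replications)
```

The point of the Registry is that the publisher's cost grows with the number of subscriber nodes, not with the number of viewers.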

The code is a huge mess but it is here (The scala server is in the media directory): https://github.com/jgrowl/livehq

There is a vagrant file that brings up the whole system at the root of the project (using docker). I have not tested it recently on anything other than ubuntu.


I've been working on exactly this. It's not horribly difficult if you've got experience building media servers, but it is tedious, because WebRTC has so many more moving parts than eg, RTMP, and the tooling is not very mature yet. Not to mention WebRTC itself is still a bit in flux (eg, ORTC).

Unfortunately, until IE and Safari support WebRTC, RTMP is really still the best way to do low-latency streaming and video chat in the browser. Additionally, most RTMP server software will scale out to hundreds of clients out of the box.


Yep, absolutely right. Although I appreciate Twilio's approach, all of this was easy before and the libraries exist in many variations... If you're doing WebRTC, though, Twilio's TURN service is very cheap and worth looking at.


Blue Jeans solves this at scale, with higher quality than WebRTC alone, while also supporting video conferencing equipment joining the same meetings.

Transcoding is tough with unique views per client, but audio processing is tougher. :)


Interesting, is the 15 person cap a new thing? I remember it was still 10 at some point.


I think it is 10 for ad-hoc Hangouts, but you can raise it to 15 if you attach the meeting to a calendar invite.


Awesome!


I did a lot of testing of a few WebRTC platforms last summer for a product idea that we had.

I came to the conclusion that on mobile phones, WebRTC video is not yet usable. Even with just audio on a cellular network, call quality started deteriorating after a few minutes on every platform that I tried. Video without a WiFi signal was hopeless.

I discussed this with a WebRTC expert, and the problems are partly with WebRTC, partly with how mobile operators shape their data traffic.

Also all platform implementations that were considered the best of the breed were bought by mobile chat companies (both SnapChat and OK Hello did acquisitions on this front)


> how mobile operators shape their data traffic

What I've been wondering is if Apple gets any QoS concessions from the carriers for Facetime, because it usually works pretty well on 4G. Not sure if Facetime's quality is due to robust error correction, or because they've been able to wrangle special QCI/bearer status for Facetime data.

> mobile phones, WebRTC video is not yet usable

Once VoLTE + Video becomes common (eg, Verizon Advanced Calling), then it'll be interesting to see the impact on video-call quality, and whether V/VoLTE sessions can be established with the other end of a SIP trunk, and hence (browser) WebRTC from there.


Yeah, Facetime video seems to work well compared to others. It was a bit hard to explain to a couple of business guys, that "no, we are unlikely to get anything close to Facetime quality with WebRTC solution, but I don't know why."


You could try writing a FaceTime data packet wrapper/unwrapper, which packs your application data into FaceTime video/audio packets, and unpacks it at the receiving end. If the carriers shape FaceTime traffic differently, it should be possible to establish whether this is the case, by emulating a FaceTime connection, as far as I can see. The content is encrypted, so I don't think they can shape traffic based on anything else than headers.


The issue is mostly due to the nature of the mobile IP, rather than inherent WebRTC variables.

However, ORTC does seek to improve with simulcast/SVC, and attributes of ORTC will merge into WebRTC as well, ultimately providing more hooks, finer-grained control and better instrumentation/visibility to the upper layers.

All that said, you can do acceptable quality WebRTC voice and video over 4G LTE today. But you do need a rock solid signal, and be prepared with a fully charged battery if you are not plugged in.


ORTC doesn't really have anything to do here, except nicer control over some configurations that you can already control with SDP munging.

Simulcast/SVC, for example, are irrelevant outside of multiway video. And you can already do simulcast with WebRTC.


ORTC is just an API.

The WebRTC protocol itself is defined in RTCWEB at the IETF and based on the existing RTP and RTCP transports, and SCTP transport layer. I don't think you'll be able to do better.


I get rather fine Skype video over 3G and 4G.


At Tuenti, a cell phone operator in Spain, we have been using webRTC very successfully for audio. In fact, we built android/iOS/web apps to make regular phone calls (to mobile or landline) using webRTC, either from WiFi or 3G/4G networks.


Interesting. Can you tell more about your stack? What did you use for signaling? Did you implement STUN and TURN to by-pass NATs and firewalls?


STUN and TURN, signaling over our XMPP infrastructure, a more or less recent version of webRTC trunk. We published a post some time ago on our blog about the details: http://corporate.tuenti.com/es/dev/blog/Building-a-VoIP-Serv...


Looking forward to hearing more about this on your website, That'll Never Work© News.


In my humble opinion, your comment is unwarranted. I never said that it will never work.

I actually spent several days last summer evaluating and developing prototypes on top of platform APIs (e.g. TokBox, Sinch) and tested the call quality in several different situations (moving from WiFi to 3G, dropping and regaining signal during the call, call behavior while moving in public transport etc.)

I described my experience and conclusion, hoping to spark informed discussion of the topic. Your comment didn't add any information to it.


Although it's great to see Twilio branching out into this space, with their recent product releases they do seem to be losing focus on their original Twilio Voice product. Compared to other providers their prices really aren't competitive, but Twiml is so much more flexible and easier to use than VoiceXML. What to do? :/


Rob from Twilio here - appreciate the feedback.

Have you had an opportunity to check out TaskRouter yet? We released it a few weeks ago - definitely my favorite recent product that works with Twilio Voice:

https://www.twilio.com/taskrouter


Interesting; however in my opinion I'd redesign that page. Having a lot of the information behind the interactive scrolling will cause it to be lost. Personally, and I know a lot of others, just scroll the page and look for the relevant information in paragraph form. Much easier to digest. This implementation forces me to stay on the page and is a bit frustrating.


I just posted the exact opposite of what you said :) Usually these "scroll hacks" are annoying. In this particular case the "flow" of calls is captured really well this way, IMHO.


Heheh, to each their own :)


That's a pretty cool use of scroll to explain the product. Any details on how you guys built that? Is it completely custom or did you use some known JS/CSS lib/tricks?


Thanks for the kind words! We used a library called ScrollMagic (https://github.com/janpaepke/ScrollMagic) to handle the scroll listening and event triggering.


I saw this a while back and then forgot to go back and check it out. This looks interesting. Thanks for posting.


FWIW the scrolling effect is all jittery (and pretty much unusable) in Safari.


Shameless plug here, but if you're interested in voice platforms and are growing weary of Twilio, I'm working on a new hosted telephony platform that uses embedded lua executing directly on our hosted servers. Check it out - https://developers.corvisa.com/


Why did you choose Lua?


It was designed to be embedded, has a very small footprint, is relatively fast ( we use luajit ), and has few opinions as a language. That said, there is nothing locking us in to lua permanently as we would just need to implement a small-ish set of language bindings and docker configs to execute the language du jour.


First thing this reminded me of was https://tokbox.com/. Hopefully the pricing is along the same lines, it's good to see some competition in the space.


Awesome. Will try it soon. What would HN readers recommend for near zero latency audio only? Our client needs to connect two devices within a LAN (no throughput issue) but near zero latency. We have tried Twilio PSTN, other webrtc technologies but they all have noticeable (half a second) delay from device to device. Any thoughts?


Try https://now.source-elements.com/#!/ Voice studios can manage to sync ProTools sessions on opposite sides of the Atlantic to within quarter of a frame, so on LAN it should manage nearer Opus' default of ~20ms.


You could try a SIP client with a low latency codec like CELT...


Thanks for your reply. Do you have any recommendations?


Yes, jitsi for desktop and if you need mobile, csipsimple is a nice option


Thanks Joeyspn. I am a noob when it comes to SIP and have some specific questions. For example, which of the jitsi products should I be using? If this is not the right place to ask such questions, can we please take this offline? I could totally use your knowledge.


Just a standard SIP connection?


How available is IPv6 handset to handset connectivity? I understand that T-Mobile offers it, and Verizon can but it's usually disabled. If you have a true peer to peer connection without NAT, more options become possible.


If you can get a peer-to-peer connection, it's irrelevant whether you're behind a NAT or not. That's what ICE is for: to get a peer-to-peer connection even if you're behind a NAT. Having publicly routable IPv6 addresses may increase the likelihood of getting a peer-to-peer connection, but it's not a prerequisite.
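
As a rough illustration of how ICE ends up preferring a direct path over a relay, here is a Python sketch of the candidate priority formula from the ICE specification, using the type preference values the RFC recommends. Real browsers may pick different preferences, and the default `local_preference` and `component_id` arguments here are assumptions for the example.

```python
# Type preferences recommended by the ICE RFC; implementations may differ.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(candidate_type, local_preference=65535, component_id=1):
    """priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)."""
    return (
        (TYPE_PREFERENCE[candidate_type] << 24)
        + (local_preference << 8)
        + (256 - component_id)
    )


candidates = ["relay", "srflx", "host"]
ranked = sorted(candidates, key=candidate_priority, reverse=True)
print(ranked)  # direct (host, then server-reflexive) candidates rank above the relay
```

A TURN relay candidate is only the fallback; a publicly routable address just makes it more likely that one of the higher-priority pairs succeeds.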


Having a publicly routable IPv6 address means you don't need some paid service just to get connected.


The support for H.264 is interesting.

It of course makes sense for iOS and devices that can support hardware acceleration, but device mix can change during a call. So, what happens if two iOS users start, but then they add a VP8 only device? How Twilio handles that use case will be instructive for understanding what use cases we can support with this service.

It isn't likely they are transcoding since this intro suggests a P2P architecture without distributed media servers, and we know H.264/VP8 is very expensive video transcoding - e.g. many media server vendors don't get more than one transcode per core at full HD.


I just added h264 to jumpchat on ios and android. Say A(iOS) talks to B(Android). They both support h264, so they negotiate to use h264. C(chrome) joins. A triangle connection is set up where A will talk to B using h264. A & B will talk to C in VP8. It's kind of taxing on mobile when this happens.
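
A small Python sketch of that triangle, assuming each link independently picks the first codec in the offerer's preference order that the other side also supports. The codec lists are taken from the scenario above and are assumptions, not what any particular app actually advertises.

```python
from itertools import combinations


def negotiate(offerer_codecs, answerer_codecs):
    """Return the first codec in the offerer's preference order that both sides support."""
    for codec in offerer_codecs:
        if codec in answerer_codecs:
            return codec
    return None


peers = {
    "A (iOS)": ["H264", "VP8"],
    "B (Android)": ["H264", "VP8"],
    "C (Chrome)": ["VP8"],
}

# One independent negotiation per pair of peers (a three-way mesh).
links = {(a, b): negotiate(peers[a], peers[b]) for a, b in combinations(peers, 2)}


def encoders_needed(peer, links):
    """Distinct codecs a peer must encode to serve all of its links."""
    return {codec for pair, codec in links.items() if peer in pair}


for pair, codec in links.items():
    print(pair, codec)
print(encoders_needed("A (iOS)", links))
```

A and B each end up running two encoders at once, which is where the extra load on mobile comes from.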


I can't speak with any real authority here but... WebRTC requires the two end points of the connection swap information for networking and things like what codecs they support. I don't believe the MediaStreams between a->b and a->c would have anything to do with each other. They only share a common VideoSource. That would lead me to believe that it would try to encode two different ways if that was the only option. More than likely you'd probably have the vendor select a preferred single codec instead of doing things differently for each user, though that would be pretty neat if it wasn't too resource intensive.


Transcoding is expensive, but with VP8, depending on the codec parameters, you can 'strip down' the stream for less-performant clients e.g. phones, while sending the full-fidelity version to other clients. Without decoding, just by dropping certain classes of packets. There is a loss of resolution of course. So anyway that's an option.
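
A minimal Python sketch of the packet-dropping idea, assuming the stream was encoded with layers and each packet is already tagged with a layer index. The `Packet` type and the 0/2/1/2 layer pattern are illustrative assumptions; with VP8 temporal layers the thing you give up is frame rate, and a real forwarder would also have to rewrite sequence numbers.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int
    temporal_layer: int  # 0 = base layer, higher = enhancement layers


def thin_stream(packets, max_layer):
    """Forward only packets at or below max_layer; nothing is decoded or re-encoded."""
    return [p for p in packets if p.temporal_layer <= max_layer]


# Hypothetical three-layer pattern repeating every four packets.
pattern = [0, 2, 1, 2]
stream = [Packet(seq=i, temporal_layer=pattern[i % 4]) for i in range(8)]

print(len(thin_stream(stream, max_layer=2)))  # full-fidelity client gets everything
print(len(thin_stream(stream, max_layer=1)))  # mid-range client
print(len(thin_stream(stream, max_layer=0)))  # phone gets the base layer only
```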


Are there any plans for Twilio to integrate this with the SIP side of things (e.g. Twilio providing a media server to enable WebRTC clients to communicate with SIP clients and also call out to traditional phone clients?)


I wrote this http://www.siptowebrtc.com/ that uses our SIP interfaces and PSTN to connect to our WebRTC client. I'm going to put the code up on Github after I clean it up but I'd be happy to talk through how it works. andrew at twilio dot com.



I'm pretty new to using WebRTC, but I was able to implement it really easily using icecomm.io's wrapper. I'd have to look into twilio some more, but icecomm has been more than effective enough.


Agreed. Out of all the WebRTC products I've tried, icecomm.io was the easiest one for me to use.


A tech question for Twilio: can you manipulate the SDP object?


Good question - one of Twilio Video's aims is to take a lot of the signaling headache away. We do expose the PeerConnection, but not the SDP object.

What would you want to use it for?


Extra authentication; new roles; address filtering (recognize VPN path etc); bundling control...


Check. Good feedback.


There goes my project. Back to the drawing board. Haha!


> Provides multi-party authentication, registration and signaling, which can orchestrate up to 4-way calling using peer to peer mesh topology, across any combination of supported devices. Calls can be video or voice only.

Peer to peer mesh topology? Anyone know of further reading on this?


"up to 4-way calling using peer to peer mesh topology" sounds like fancy names for regular WebRTC connections.


Interesting. I hadn't realized that WebRTC supported more than two clients talking to each other. My Google-fu is failing me, do you know where I can read up on how the topology of 4 connected callers is orchestrated without a central hub (i.e. how Hangouts does it)?

edit: https://www.webrtc-experiment.com/one-to-many-video-broadcas...


Rob from Twilio here - confirming this is an accurate description. Each peer in the conversation has a connection to every other peer.


Is there any chance of scaling WebRTC any better than that?


Absolutely - we have a lot of ambition around where we want to take Twilio Video in the future.

This is only the first step.


It would be useful to know which browsers are supported for javascript developers. I am using OpenTOK for Real-time WebRTC at the moment which works great, but it doesn't have safari support. Would definitely consider switching if this worked on all browsers!


Chrome and Firefox are the only browsers supported WITHOUT plugins. Look into the plugin that Temasys makes, which enables support for IE and Safari. It's relatively trivial to write a shim for said plugin.

IE will bring native WebRTC support in its next version.


WebRTC on Opera works without plugins also.


Anyone know how this compares with icecomm? (http://icecomm.io/)

I was quite impressed with how easy it was to set up icecomm .. are twilio upping the ante here in any way that I should know about?


Was gonna ask the same — had a really good experience w/ IceComm when I tested it out a few weeks ago


Agreed. It was very easy to set up and implement in my web application.


looks similar, except they provide TURN servers.


WebRTC needs p2p for group chat, and it'd be unstoppable.

Currently using Hello for one-on-one conferencing, but need something that both scales and doesn't require a centralized server. Considering Skype had this ten years ago, it's obviously doable.


WebRTC already has p2p support for group chat. appear.in is one example of an implementation.

The main limitation is that upload bandwidth scales linearly with the number of people in the group chat. But that's often OK for up to 10 or so people.


The limit is more like 3 or 4 max. Anything above that and a modern machine starts choking on the video encoding anyway :)


I think the OP is correct. The video is encoded only once for all viewers. It's the uplink bandwidth, traditionally a choke point, that holds the practical limit at 3 or 4.
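
Back-of-the-envelope version in Python, assuming every participant sends one video stream at a fixed bitrate. The 500 kbps figure is an arbitrary assumption for illustration, and the relay case stands in for any server that fans a single upload out to the other participants.

```python
def mesh_connections(participants):
    """Peer connections in a full mesh: one per pair of participants."""
    return participants * (participants - 1) // 2


def mesh_upload_kbps(participants, stream_kbps):
    """In a mesh, each peer uploads one copy of its stream to every other peer."""
    return (participants - 1) * stream_kbps


def relay_upload_kbps(participants, stream_kbps):
    """With a forwarding server, each peer uploads a single copy."""
    return stream_kbps if participants > 1 else 0


STREAM_KBPS = 500  # assumed per-stream video bitrate

for n in (2, 4, 10, 16):
    print(
        n,
        mesh_connections(n),
        mesh_upload_kbps(n, STREAM_KBPS),
        relay_upload_kbps(n, STREAM_KBPS),
    )
```

Per-peer upload grows linearly in a mesh, so where the practical limit lands depends mostly on the uplink you assume.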


"up to 4-way calling" .. If it said up to 16 that would be more of a game changer.


MVP, got to start somewhere right?


Is ORTC still being implemented in the next version of WebRTC? I've read that ORTC needs a greatly simplified implementation (if you forget about all the other legacy stuff in WebRTC, such as support for the phone networks).


ORTC and WebRTC are separate groups in the W3C. You can think of the ORTC "Community Group" doing more exploratory work on a future version of the API while the WebRTC "Working Group" is finishing the 1.0 version. The work the ORTC Community Group is doing may or may not be adopted by the WebRTC working group as a future version someday, or maybe it will adopt some parts but not all.

The currently drafted ORTC API still allows all of the power to the Javascript apps to interoperate with "legacy stuff", but it also allows the JS to bypass most of it.

(I'm a member of both the WebRTC Working Group and the ORTC Community Group)


I believe that ORTC has been added to the WebRTC 1.1 specification.


There is no WebRTC 1.1 yet. 1.0 isn't even done yet. Once 1.0 is done, then work on 1.1 will start. ORTC may be adopted in whole or in parts as 1.1, but the future isn't written yet.


Good point. You're correct on both counts.


Would be nice if they supported some video tracking capabilities. In the demo video, the arrow pointing to the slot is "asking" to be stuck to the slot independent of the camera motion.


Doesn't CliqMeet do much of this? But it can go to 100+ because it provides an MCU service (multiplexing node), plus auditorium models as well.

Is the difference that this is SaaS, and CliqMeet was an app?


CliqMeet is a video conferencing product. Twilio is offering an API to allow you to build your own products with WebRTC functionality built-in.


damn twilio. building them pipelines.


can you use this to build something like twitch.tv?


For twitch.tv clones you're better off researching about streaming media servers... (and cheap bandwidth hosting)

http://en.wikipedia.org/wiki/List_of_streaming_media_systems


is there an open source streaming media server?

could you host this on digitalocean?


We had this in VoxImplant for a few months already :)



