Sam Talks Technology


Jay LeBoeuf talks about AI Powered Podcasting Studio Descript

by Sam Sethi
April 30th 2020

Sam talks with Jay LeBoeuf about Descript’s latest version 3.4, which now includes live transcription, Overdub, and publishing to the web. This is on top of its already powerful realtime audio and t...

Hello, everyone, and welcome to Sam Talks Technology, your weekly guide to all things tech and business with Sam Sethi. Hello, and welcome to another Sam Talks Technology. I'm joined today by a very special guest. His name is Jay LeBoeuf, and he is head of business development for a wonderful product called Descript. Jay, how are you? Good, Sam, how are you doing? Very well indeed. Where do we find you today? I am in sunny San Francisco, California. And I am in rainy England. Now, Jay, you've recently joined Descript, but what is Descript? Maybe you could explain to listeners what Descript is. Absolutely. Descript is an audio and video editing platform, and it's AI-driven to make it as easy as possible for people to edit their content. Our overall premise is that you should be able to edit your content as easily as you can edit text, and that's the paradigm we use. Everything that you put into the system gets automatically transcribed, and then, just like you would in Google Docs, you select text, copy, paste, and trim, and it does all the audio and video editing behind the scenes.

For somebody like me, a podcaster who's been doing it for about two years: my old method was to use something like Audacity, but I was always having to guess where the wave started and where it ended, and I was cutting and pasting and it was never smooth. And there was a product called Otter.ai, which is a wonderful product, but it separated the text transcription from the audio. So if I changed the text, the audio never changed, so it had no value for me. Then along came Descript. One day I was listening to a podcast whose host uses it, and she described Descript, and I was like, really? So I tried it, and I've been an avid user for about two or three versions now, so thank you very much for Descript. Well, you're very welcome. I'm glad you use it.
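The edit-text-to-edit-audio paradigm described above can be sketched in a few lines. This is not Descript's actual implementation, just a minimal illustration under an assumed data model in which every transcribed word carries its start and end time; a text edit then maps to a list of audio regions to keep.

```python
# Hypothetical data model (not Descript's internals): each word is aligned to
# a time span, and a text edit is translated into audio regions to keep.

def keep_regions(words, kept_indices):
    """words: list of (word, start_sec, end_sec); kept_indices: indices of
    words that survive the text edit. Returns merged (start, end) regions."""
    regions = []
    for i in sorted(kept_indices):
        _, start, end = words[i]
        if regions and start <= regions[-1][1] + 1e-6:
            # This word is contiguous with the previous region: extend it.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions

words = [("hello", 0.0, 0.4), ("um", 0.4, 0.7), ("world", 0.7, 1.1)]
# Deleting "um" in the text leaves two regions of audio to splice together:
print(keep_regions(words, [0, 2]))  # [(0.0, 0.4), (0.7, 1.1)]
```

An editor built on this idea never shows the user a waveform at all; the audio splice is derived entirely from the surviving words.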

Now, let's go through some of the ways Descript works. Today you upload, or you can record natively into Descript, so I could start a recording session directly in Descript and it would record for me? Absolutely, and even better, something we introduced just the other week is live transcription. So if we wanted to, we could open up Descript, select your microphone as input number one, and, since we're talking over Zoom right now, select the output from Zoom as input number two. Descript will then do real-time transcription, keeping your track separate from my track but merging them together into a transcript. So the speaker labels will automatically go back and forth between Sam and Jay with each of our words. I am going to have to try that on my next interview; I'm sorry I didn't try it on this one. So that's one great way of using it. I use it for my daily podcast, but I didn't fully understand that you could use it in the way you've just described, which is brilliant.

Absolutely. So the live transcription is something that I think is going to really improve people's workflow. But a very traditional and robust way of working is that you don't have to record within Descript. You can record in Pro Tools, Audition, or GarageBand, and some of the media companies we work with are still going out with a Zoom recorder and doing remote recordings. You can bring all that tape into Descript and it will transcribe all of it, whether it's a single track or multiple tracks. Now, one of the things that I find very powerful is the ability to go through the transcription once you've uploaded it. It detected the speakers in the audio, which is brilliant, and labeled them. And the first feature I loved is the removal of all the umms and ahhs that we all make. That was such a powerful feature. It's magic, and I hope you do that to my recording as well.

It's part of this paradigm shift, Sam. If you go back in time to when Pro Tools came out, and a year or two before that to Sound Tools, I mean, we're going back 30 years. When these tools first came out, it was the first time people could visualize a waveform, and the tools created for it looked like little razor blades and little pieces of tape, because the paradigm they were trying to emulate was: hey, you know how you can take magnetic tape and cut it and paste it together and rearrange it? Well, now you can do that digitally. So 30 years ago they took a common user experience paradigm, people rushed to it because you could work 10 times faster than before, and then, with that, you could do things that could never have been done before. So what's great with Descript: around 2017, when the company was getting started, there had been a drive in the application of artificial intelligence towards creative technologies.

So we've seen speech recognition get better and faster, and we've seen transcription become more accurate. But you can also now automatically align transcripts with audio files, and the Descript team started leveraging this and saying: you know what the paradigm is? You should be able to edit your transcript just like we do now on paper or in Google Docs, and that should map to the media files. So what we're really starting to see now is: what else can you do when you have the transcript aligned automatically with the media content behind it? Well, we can do natural language processing, and we can recognize that the word 'um' and the word 'ah' are what we call, in Descript, filler words. So, like you said, with one click you can right-click and say ignore all filler words or delete all filler words, and then all of a sudden all the ums and ahhs just magically get struck out. You can uncheck some of them if you want them in there to keep things conversational.

But another thing you could try, Sam: go through your next podcast, and I'm not saying you do this, but if you have guests that have a lot of 'you knows' or a lot of 'likes'... Sadly, Jay, I do have a lot of 'you knows'. So go into find and replace, search for ', you know,' and replace it with a space. And then all of a sudden all those 'you know' filler phrases will get replaced. All we're doing behind the scenes is intelligently recognizing 'you know' in the waveform and snipping it out, putting some crossfades around it. And again, this is something that you could have done in Pro Tools or GarageBand or Audition or Logic or Audacity. You can do it in the waveform, but anybody who has tried to do it in the waveform either has to have some magical waveform-reading abilities or spends a lot of time just scrubbing over it and trying to guess what's going on where.
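The snip-and-crossfade step described here can be illustrated generically. This is a sketch, not Descript's code: it assumes a mono audio buffer as a NumPy array and applies an equal-gain linear crossfade at the cut point so the splice doesn't click.

```python
import numpy as np

def snip(audio, sr, cut_start, cut_end, fade_ms=10):
    """Remove audio[cut_start:cut_end] (times in seconds) from a mono float
    buffer sampled at sr, crossfading across the cut to avoid a click."""
    a, b = int(cut_start * sr), int(cut_end * sr)
    n = int(sr * fade_ms / 1000)
    n = min(n, a, len(audio) - b)  # clamp fade length to available samples
    head, tail = audio[:a].copy(), audio[b:]
    if n > 0:
        ramp = np.linspace(1.0, 0.0, n)
        # Equal-gain crossfade: head fades out while tail fades in.
        head[-n:] = head[-n:] * ramp + tail[:n] * ramp[::-1]
        tail = tail[n:]
    return np.concatenate([head, tail])
```

Find-and-replace on ", you know," would then just mean locating the aligned time span of that phrase and calling something like `snip` on it.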

It was a lot of guesswork; that's mainly what it turned out to be. So I love the strikeout, I love the removal of filler words, I love the ability to cut, copy and paste, the ability to move information around. As the editor of the podcast, I can get it to be sharper and tighter. Now, one of the new features that you're starting to bring out is a feature called Overdub. Maybe you can explain what Overdub is. Overdub is another great example of something that is really science fiction, and we can actually now do it. It's kind of crazy; when I first saw a demo, I couldn't believe how well it worked. Overdub is the ability for anybody to create their own voice double, or voice clone, with as little as 10 minutes of themselves speaking. So, Sam, you haven't created an Overdub voice, have you? No, not yet. I'm number 2,714 on the list. Well, we could maybe bump you up a few notches.

Basically, what would happen is: you get to the front of the list, you get an email saying congratulations, we're now ready for you to upload your audio, and you then record yourself speaking for about 10 minutes, using the same microphone in the same room where you're going to do most of your production. What Overdub will do is create a digital voice model with the same frequency response characteristics, speaking styles, quirks and idiosyncrasies that you have in your natural voice and in your natural acoustic situation. You can then use that digital voice either to generate new phrases and words, to go back and fix things that you've misspoken, or, really, you could paste in an entire paragraph of new text and it will synthesize it on the fly. Wow, I can't wait to try this. This is going to be amazing. Now, in terms of the technology, everything that you develop, the AI, the transcription, the speaker detection and now Overdub: is this all in-house technology? Is this all your own, or has any of it been licensed?

So Overdub is one of the things I described as science fiction that we're most proud of. Overdub came from a team that joined us at the end of last year. The company is called Lyrebird, brilliant AI researchers up in Montreal. They joined the Descript team in September 2019, and they had already done several years of work before that. They're really some of the early pioneers of voice cloning using generative neural networks, and part of what they're able to do, part of their breakthrough, is that before they came along you had to train with 10, 12, 20 hours of content. I think we've all read the stories about how laborious it was to create Siri's voice, for example. With some of the advancements this team came up with, really, 10 minutes is the start. Maybe if you put in 20 minutes it'll sound 10% better, but once you get above 25 minutes of content of yourself reading your favorite article or your favorite book, it gets to be good.

Will Descript use my future projects, in which it's transcribing and detecting my voice, to add to my knowledge base within Overdub? Or is that going to be totally separate? So, for example, what I'm saying is: I might do 10 minutes to train myself on Overdub, and then I do several projects. Obviously, it's detected my voice in each of those projects. Will any of that content be used to grow my Overdub library of knowledge, or do I have to keep those two separate?

Right. So right now, we haven't seen a lot of improvement from having people continually add to their voice collection. Really, that initial training set should be good enough. In the early phases, we will prompt you every now and then to give us feedback: hey, how was the quality of this correction, or how was this phrase that we generated for you? We're just looking for a thumbs up or thumbs down and a dropdown box, just for our own quality control, because we're soft-rolling this out. The reason that there are 2,500 to 3,000 people on the waitlist is that we want to learn. Something we were talking about earlier: how long I've been at Descript. It's actually only my 10th week here, but one of the things that I've just been blown away by is how customer- and feedback-obsessed the company is. And so being able to actually meet one-on-one with users, and to be able to pore over everybody's comments and feedback, is actually part of the development process.

Well, I can't wait to try Overdub. Now, some of the other newer features that have recently been added: the ability to publish my transcription to a live page, so I can share that URL with other people, third parties; maybe they can comment on it, or just watch and follow the transcription. Where do you see that developing? One of the things I'm most excited about is how the tools really allow anybody to tell their own story, and I often describe Descript as a storytelling tool, because while it's very popular and has great product-market fit within the podcaster community, anybody who's doing storytelling or audio journalism or short-form audio content, YouTubers, anybody with a narrative-driven story to tell, is using the tool. And so if we can democratize the tools and let somebody who has no previous experience using audio or video editing tools get their stories recorded, the next step is: how do we allow them to share it?

So the early days of the product were about allowing people, first off, to export, if they want to export to a tool that they're more familiar with or want to do further post-production in. So you could always export to Pro Tools or Final Cut Pro or Adobe Premiere or Audition, professional tools like that. But then we started building in the sharing features. In version one, you could export the raw audio, export the raw transcripts and export the subtitles. But then, like you were telling me before we started this call, there's a challenge of what to do with these assets: you have to manually drag them to three or four different places, and again, that's not part of the creative process; that's actually just a hassle. So we're starting to build more publishing into the app.

So, what you mentioned: there's standalone publishing support, where we'll create a Descript-hosted web page that has an embedded player that automatically aligns to the transcript. It's great for accessibility purposes, it's great for SEO, and it's also just kind of fun to look at. So you have the standalone web page, but you also have the ability to generate audiograms. This is something that has existed before, but we've just built a very simple implementation so people can get started and rapidly share the video generated from their audio content on social media.

So I have one feature request on the audiograms; well, two, actually. One is the ability to change the background imagery. Descript introduced me to Canny, which is a third-party tool that you use for doing your roadmap and getting customer feedback. I've shared that with so many different companies, because it's such a great way for users like me to feed back feature requests, and then other users like me can vote them up, which is just what users want to do: we want to feel part of the product and the process. That's actually where our product roadmap comes from. I have to encourage anybody with a feature request or idea for Descript: just go to our website and go to the feature requests page, and there you go, that's our roadmap. We don't ignore the top-voted things. That's wonderful. So one of the things was audiograms, and I've read through pretty much all of them; there are 357 feature requests in Canny right now. Not all of them are required, I hasten to add, but the top ones are very interesting, and one of those is the audiogram being customizable further. But the other one that I would add, which I haven't added to Canny, is just one of those learnings: Twitter has a two-minute audiogram limit, LinkedIn has a 10-minute audiogram limit, and Facebook has no limit. And so, whether they're called audiograms or full episodes, it's that ability to have multiple lengths.
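The per-platform limits just listed can be captured in a small lookup. The values below are only the ones mentioned in this conversation, and the helper is purely illustrative, not any Descript API:

```python
# Per-platform audiogram limits as stated in the conversation (seconds);
# None means no limit. Hypothetical helper, not part of any real API.
AUDIOGRAM_LIMIT_SEC = {"twitter": 120, "linkedin": 600, "facebook": None}

def max_clip_length(platform, desired_sec):
    """Clamp a requested audiogram length to the platform's limit."""
    limit = AUDIOGRAM_LIMIT_SEC[platform]
    return desired_sec if limit is None else min(desired_sec, limit)

print(max_clip_length("twitter", 300))   # 120
print(max_clip_length("facebook", 300))  # 300
```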

That would be really interesting. In the audiogram today, your limit is set to two minutes, but it would be nice to be able to say: actually, I'm going to do a five-minute audiogram, or I'm going to highlight this clip composition and make sure it's 10 minutes, because that's going to go into my LinkedIn stream. That's great, and there's really no technological reason for that limit. It's a great example of wanting to get something out; imposing some limit on it ensures that we can maintain great quality of service for all the rendering that goes on behind the scenes to make it happen. Something that I hope you use within Descript would be some functionality like highlighting, or there's also a function called clip to composition. This allows you to kind of go through, let's say, our 45-minute interview, and rapidly find the selects, the best-of quotes, that you maybe want to put in a two-minute audiogram.

And then, from the best of the best, you create a new composition: the 30-second highlights. Yeah, if I had to represent what Jay and I talked about in one 30-second snippet, this is it. That's how I use clip to composition. Now, there are two features that maybe I'm not quite a power user of yet and need to understand. One is called sequences, and one is called split clip at playhead. What do those two features offer me? Absolutely. So let's talk about a sequence. A sequence is really just a collection of audio tracks that are aligned to each other. The simplest sequence would be this interview right here, where you have the high-quality output from your microphone and you have the Zoom recording from my microphone. You could bring those into Descript, and the first thing you would do is drag in these two audio files and then say: create a sequence of these. That would basically be a container, like a multitrack recording in Pro Tools, that then aligns those recordings together.

Now you can transcribe the sequence. So with the Andreessen Horowitz a16z podcast, they'll have up to five people at microphones at the same time. Those are five separate MP3s or WAV files that are then aligned within a sequence as a multitrack recording. Then you press create transcript, and you now have five tracks recorded into one editable transcript with five speaker labels, and you can go back in at any time. You can quickly go in and delete coughs; or, let's say, one person on a separate track coughs while someone else is speaking, and you decide you just want to take that out. Absolutely. With the sequence, you'd go into the sequence view. You know, there's a motorcycle going down my street right now; you could go in, see where the motorcycle comes in, and just hit delete, and then the person who is actually speaking doesn't get blocked out by those noises.
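The multitrack merging described above can be sketched as sorting per-track, time-stamped segments into one speaker-labeled transcript. The data format here is hypothetical, purely for illustration of the idea:

```python
# Sketch of merging per-track transcripts into one speaker-labeled transcript,
# the way a sequence aligns separate recordings (hypothetical data format).

def merge_tracks(tracks):
    """tracks: {speaker: [(start_sec, text), ...]}. Returns utterances in
    chronological order, collapsing consecutive turns by the same speaker."""
    segments = sorted(
        (start, speaker, text)
        for speaker, segs in tracks.items()
        for start, text in segs
    )
    merged = []
    for start, speaker, text in segments:
        if merged and merged[-1][0] == speaker:
            # Same speaker continuing: extend the previous utterance.
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    return merged

tracks = {
    "Sam": [(0.0, "What is Descript?"), (9.0, "Nice.")],
    "Jay": [(2.5, "An audio and video"), (5.0, "editing platform.")],
}
print(merge_tracks(tracks))
```

Because each speaker stays on a separate track, a cough on one track can be deleted without touching the words on another, exactly the scenario described.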

And split clip at playhead: what does that do? So, when you would want to do that is when you want to create separate clips. The reason to create separate clips would be either to have a region boundary, such as: hey, we're now going into an entirely different part of the discussion, and I just want to see that visually in the waveform. Another reason to do it would be for loudness management. One of the things that we've built in is automatic clip leveling, and the intuitive way of describing it for people who aren't audio engineers is: you have Jay talking into the mic really close here, but then maybe I go get some tea and do the rest of the interview further away from the microphone. That would be the moment when you would split the clip, and then you would tell Descript: hey, auto-level this clip. It would take all of the quiet parts of my interview and bring them up to the same perceptual volume level.
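A naive version of clip leveling can be sketched as matching each clip's RMS to a common target. Descript's actual algorithm isn't public, and real perceptual loudness matching would use a LUFS measurement (ITU-R BS.1770) rather than plain RMS, but the idea is the same:

```python
import numpy as np

def auto_level(clips, target_rms=0.1):
    """Naive clip leveling (not Descript's algorithm): scale each mono clip
    so its RMS matches a shared target, approximating equal loudness."""
    leveled = []
    for clip in clips:
        rms = np.sqrt(np.mean(clip ** 2))
        gain = target_rms / rms if rms > 0 else 1.0  # avoid dividing by zero on silence
        leveled.append(clip * gain)
    return leveled
```

Splitting at the playhead is what gives the leveler its unit of work: each clip gets its own gain, so the close-mic section and the far-from-mic section end up at the same level.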

So, when I was using Audacity, you've got things like normalization and amplification as features, and I was always trying to get the second speaker up to my volume, and you'd never quite get it right; you'd still end up with a wave file that was up and down. Then one day I just decided to take the exported file out of Descript and throw it back into Audacity. I don't know why I did that, but I did, and suddenly I found this beautifully leveled wave file all the way through. It was genuinely magic. Well, that's really what we're trying to create, and I appreciate that. The goal of Descript is to do as much of the underlying, under-the-hood audio engineering work as possible, so that you can spend all of your time on the creative process. So many of the problems that we encounter when we're trying to tell our stories really are just technical limitations. Who would know that because I spent the first part of my recording close to the mic, then took a break and came back further away, I'd have to do something magical to it? So the goal of Descript, through the auto clip leveling and also the loudness normalization, is to just make it sound as loud and proud as everything else you hear.

Now, one of the things we talked about briefly before we came on air was markers. It's one of those feature requests. I learned from watching the interactive videos that you can put markers in, and I thought, how brilliant. So now I can take a long-form, hour-and-a-half podcast, go through the transcription, and put in markers, which are, I guess, like chapter headings to my thinking. I can have those, and they do appear within Descript. And I think I showed you where I thought they were not the most logical, to me anyway, and where I thought they would be more logical. But what I would really love is the ability to take the exported Descript file and basically upload it into, let's say, iTunes, which supports chapterization, and have those markers converted to ID3 tags.

Have you added that as a feature request? I have, and it's been upvoted as well. Wonderful. Well, the way this company works, Sam: once something starts gaining enough momentum, every six weeks we put out a new release, and so there's a decent chance we could have that out in six weeks if it bumps the other things out of the way. You know, we talked about publishing earlier, so that technology already exists for us within our published pages. If you have a Descript published web page, I encourage anybody to put ample markers in there, because you can click on a marker and it will automatically jump to that section of the transcript and the audio and then play it. So we know it; we have the timecode for it. It's a great suggestion. And anybody else who agrees: go to the feature requests part of Descript's web page and give it a couple of votes. I'm sure Sam would be very happy.
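For context, ID3v2 chapters are CHAP frames that carry start and end times in milliseconds plus a title. A sketch of the marker-to-chapter conversion being requested here might look like this; the field names are illustrative, and an actual MP3 would be tagged with a library such as mutagen:

```python
# Sketch of turning podcast markers into ID3v2-chapter-style data
# (CHAP frames use millisecond start/end times). Field names are
# illustrative, not a real tagging-library API.

def markers_to_chapters(markers, total_ms):
    """markers: [(ms, title), ...]; each chapter runs to the next marker,
    and the last one runs to the end of the episode."""
    markers = sorted(markers)
    chapters = []
    for i, (start, title) in enumerate(markers):
        end = markers[i + 1][0] if i + 1 < len(markers) else total_ms
        chapters.append({"element_id": f"chp{i}", "start_ms": start,
                         "end_ms": end, "title": title})
    return chapters

print(markers_to_chapters([(0, "Intro"), (120000, "Overdub")], 300000))
```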

I will be a very happy man. Now, the last couple of questions: language. English is easy; how are you faring with internationalization and other languages? So there's really no technological reason why we can't do it. The main thing we're trying to do first is roll out the experience, primarily in US English, as best we can, so we can understand all the quirks and get the workflow as great as possible, and then scale out from there. I'm glad to say you recognize British English very well, too. We do. And a big part of it is the transcription models. The thing with AI, both in transcription and also in speech generation, is that it's only as good as your training set. So one of the things that I'm really thrilled with is just how diverse and how international the Descript core team is. We have a lot of different languages spoken within the company, so it's always on our minds. On an hourly basis we get requests, which, you know, lead to heated discussions with very compelling arguments, such as somebody realizing that I like Van Halen and pointing out to me that the Van Halen brothers are Dutch, so therefore Dutch should be the next language we support. OK, great; we're going to be a little more principled than that. But no, it's something we're looking at. As soon as we get the English experience nailed down, then we can start rolling out additional languages.

Thank you so much for explaining Descript. It's a wonderful product, and I look forward to the future versions. Thanks a lot for your time. Thanks for being a user, Sam, and great talking to you. Thank you, Jay. That show was amazing. Don't forget to visit Sam Talks dot Technology to discover more great shows. See you next week, same time, same place.
