Creating the Metaverse with Volumetric Video and Virtual Production

by Jim Donnelly | November 30, 2021

The metaverse is one of those big, meaty concepts that tends to spur all kinds of debate. Sometimes practical, often theoretical. Or downright vague. And while some observers are hesitant and others gung-ho about the implications of the metaverse, many agree that the concept is a really big deal.

Metastage is a company at the forefront of it all. It provides state-of-the-art 3D volumetric video capture and photorealistic virtual humans that, when paired with virtual production, are well suited for injecting into the metaverse as representations of real people.

MASV spoke with Metastage CEO Christina Heller and Head of Production Skylar Sweetman to get their insights on how video producers can leverage the metaverse, what tools they can use to do it, and what it all means for M&E professionals and consumers.

Let’s dig in.

Jump to section:

What is the metaverse?
How will people access and consume media in the metaverse?
How will the metaverse affect media and entertainment?
How does this concept affect content creators?
What role does Metastage play in the metaverse?
What is volumetric video and how does it contribute to the metaverse?
What’s the relationship between volumetric and virtual production?
How do you see virtual production and volumetric capture evolving?

Note: The following transcript has been edited for length and clarity.

What is the metaverse?

Christina Heller: For me, the metaverse means our virtual and digital realities. And that’s obviously always evolving over time. But we’re starting to see the foundation of a digital layer on top of our existing physical spatial reality. A synthetic kind of world that’s endless, that you can explore and create, collaborate, and learn inside. It’s an entity unto itself, and also this kind of digital layer in which we can merge aspects of our computing life with our physical life. And we’re starting to see the building blocks of that with virtual reality, augmented reality, and mixed reality.

If you do a lot of gaming, for example, or you’re really active in virtual reality, or if you’re working in production on bleeding-edge AI concepts, you’re definitely starting to see the beginning of it. But it still feels pretty rudimentary compared to the Ready Player One-style sci-fi fantasy that I think we’re clearly moving towards. I just hope that the final vision of it is less dystopian than what we’ve seen in a lot of the media’s projections, and I hope we can be more mindful of the ways that technology really enhances and augments our life.

How will people access the metaverse and consume media there?

CH: To experience it in terms of spatial physical reality, at least for the foreseeable future, you’re going to need some kind of head-mounted device. I heard about contact lenses at AWE this year. There is going to be some interface that you put on your face that allows for the digital components to interact with the physical components.

So I think it’s gonna be glasses, and I think they’re gonna work in a similar way to how your watch functions with your phone at the moment. And we’ll start to see holograms, and data visualizations, and information, and entertainment, and advertising—unfortunately—peppered around our world. And when we go into the virtual reality component, then it really is more of an all-encompassing digital reality where the constraints of our physical form and physical experience are no longer there. And the implications of what’s possible in virtual reality, especially as the tech develops, are fascinating, crazy, and exciting.

How do you think it’ll affect media and entertainment?

CH: What we’ve seen in the last 20 years is that entertainment can come in a lot of different forms. What people are fighting for these days is audience attention. Once extended reality becomes more mature, I think it will just become one more way for audiences to spend their hours engaging in content. And then it’s just like anything else. It’s a battle for time. Would you rather be in VRChat, watching one of your favorite musicians do a live musical performance? Or do you want to be looking at Netflix?

Maybe there’s a Netflix show that at some point busts out of my television and now is somehow a truly immersive experience. Imagine how cool that would be? I’m watching a drug trip sequence of a TV show, and then all of a sudden, my entire bedroom becomes that drug trip for like 30 seconds. There are a lot of ways people can be creative with it. And certainly, I think the hope is that more people get excited about mixed reality experiences.

How does the concept of virtual production affect content creators?

Virtual Production Studio – Unreal Engine

CH: The thing that’s fun about virtual content production is that it has so much versatility. You can create a world you can shoot like a traditional TV show, but then you can take those same worlds and make them a virtual reality component where people can step inside, place themselves, and walk around. The more we create digital assets to make traditional or framed content, the more you can make those assets work for you as a creator and do weird, fun, experimental activations with it. Or you can sell it as an NFT. There are a million things you can do, because it’s a digital asset.

One of the things I do like about any film made before 2000 or so, is when you watch something blow up, you’re like, “Oh, my God, they actually blew that up.” In that sense, virtual production will never fully replace that sense of knowing something was done physically. But the benefit is that we won’t be disrupting people’s daily lives by shutting down streets, or creating a bunch of trash from shoots because we had to do some big crazy explosion.

Skylar Sweetman: Just to piggyback on what Christina said about the versatility of these virtual assets and spaces you’re creating for virtual production pipelines: That explosion we talked about is something that can happen over and over again. It can happen in a gallery space, or in some kind of experiential walkthrough where the public can go and feel that heat and see it as a visual, and it’s the exact thing that transpired in the film or TV show they were watching. There are so many ways to expand upon a digital asset, and it’s exciting for creators and technologists because it’s such a new and open frontier. There are a lot of areas for collaboration. And that’s always kind of been the story of Hollywood and entertainment: The creators need technologists, and the technologists need the creators to know where to push them.

What role does Metastage play, or hope to play in the future, in the metaverse?

CH: We’ve been around for a little over three years, and our main thing has been to bring authentic, real human performances into the metaverse. So instead of creating a synthetic avatar that’s puppeted, we bring live-action performance of real people into virtual and augmented reality using traditional and virtual production tools and methods.

So, we do volumetric video or volumetric capture. We’ve captured athletes, musicians, and public figures. We’ve captured Black Panthers from the 1960s and Holocaust survivors. We pride ourselves on extremely high-quality, high-resolution capture—the closest thing to the real thing of standing next to or sitting across from or observing that person. Looking to the future, there are also aspects of live volumetric video on the horizon where we’re live broadcasting musicians or performances into the metaverse as well.

We also work with a lot of experts in Lidar scanning to bring real-world environments into virtual and augmented reality spaces. I wouldn’t want the metaverse to only be synthetic people and characters: It should also allow us to appreciate our real world and physical reality. It’s a very cool niche to be in, and I think an important one.

What is volumetric video?

SS: The concept behind volumetric video is that you use an array of cameras that’s around a person or an animal, and you record that person from every single angle above them, around them, from below them. You’re capturing every single part of their body. And you can then take all that video data, process it, and make a 3D asset that’s an exact, photographic digital replica of what transpired on stage. You don’t have to use tracking dots. There’s no puppetry or virtual rigging—there are no visual effects. It’s really a true representation in 3D of what happened on stage. So in that sense, it’s a really unique and wonderful technology for ed-tech or anything that requires a real and authentic human presence in a virtual digital environment.

CH: A lot of areas where we’ve seen success already are sports, for example, by bringing famous athletes in and having them do signature moves, or just being able to see their physical presence. I’m sorry, but a synthetic avatar of Ray Allen is not nearly as cool as the actual guy who did the legendary three-point shots.

We’ve also done a lot with musicians. Musicians are always eager to be on the cutting edge. We’ve recently started doing 3D performance captures for music videos and commercials that were made with a game engine, where the volumetric video is captured at Metastage and is the hero asset inside the VFX pipeline of virtual production.

We’ve done a lot of marketing activations using 8th Wall and QR codes. You scan the QR code without having to download an app, and a little hologram pops up in front of you and you can bring this person into your living room. We did one recently with the world champion snowboarder Chloe Kim: You get your ski pass and you can scan a QR code and Chloe Kim pops into your living room, and talks to you about the upcoming ski season.

What’s the difference between 360/immersive video and volumetric video?

CH: 360 video is a totally different process, and it can actually be complementary to volumetric capture. With 360 video, you’re shooting outward with an array of cameras to capture the world around you. And then you’ve got things like Lidar scanning and photogrammetry, where you’re actually doing a high-resolution spatial scan of an environment.

Volumetric video, on the other hand, is shooting inward to capture a performance from every angle. But you can take 360 video and do a volumetric capture inside that 360 video, to create a realistic immersive video experience.

What type of gear is typically required for volumetric video shoots?

CH: It really depends on your goals and which volumetric system you’re working with. Metastage is a premium volumetric capture provider. That means we use top-of-the-line equipment: We use 106 cameras to capture these performances, and on top of that we use the Microsoft technology stack to process the data, both on a 50-blade, on-premises render farm and with Azure cloud processing for overflow. We fall into the category of a professional production studio doing the very highest quality possible. Although, I should note that we’re not as expensive as people think.

On the other side, you look at other solutions like Scatter’s Depthkit, which is more of a consumer-based volumetric product. It’s a very different system. It’s considerably fewer cameras, but it’s a system that you can put in a field, or wherever, and do volumetric capture with only one or two people for production. So there are a range of options when it comes to volumetric video production. Generally, though, more cameras equals better when it comes to capture, and you will see a qualitative difference.

Depthkit – Scatter

SS: It’s interesting, because a person with several iPhones could technically set up something like a volumetric capture rig. Which is kind of cool. I mean, it’s kind of exciting to think about it on that level.

How about resolutions and frame rates? What would you typically use and does it differ by project?

CH: We typically shoot at 30 frames per second. To give you an idea of how much data we’re capturing: when shooting at 12 megapixels, which is what we typically shoot at for the highest quality capture, we’re capturing 30 gigabytes of raw data per second. If we shoot at four megapixels, we’re doing 10 gigabytes of raw data per second. We can shoot at 60 frames per second, but then we’re doubling the amount of data, and it just takes that much more time and money and energy to process. I think we’ve done maybe one shoot at 60, where we were doing an athlete with a particularly fast move. And it was a very short capture.
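The figures quoted here lend themselves to a quick back-of-the-envelope calculation. The sketch below assumes raw data volume scales linearly with both megapixels and frame rate from the quoted baseline of 30 gigabytes per second at 12 megapixels and 30 frames per second. The function names, the linear model, and the example shoot (three 20-second takes) are illustrative assumptions, not Metastage's actual tooling or workflow.

```python
# Baseline quoted in the interview: 30 GB of raw data per second
# when shooting at 12 megapixels and 30 frames per second.
BASELINE_GB_PER_SECOND = 30.0
BASELINE_MEGAPIXELS = 12.0
BASELINE_FPS = 30.0


def raw_gb_per_second(megapixels: float, fps: float) -> float:
    """Estimate the raw capture rate in GB/s.

    Assumes the rate scales linearly with resolution and frame rate,
    which is an illustrative simplification.
    """
    rate = BASELINE_GB_PER_SECOND * megapixels / BASELINE_MEGAPIXELS
    return rate * fps / BASELINE_FPS


def raw_gb_for_shoot(seconds: float, takes: int, megapixels: float, fps: float) -> float:
    """Estimate total raw data in GB for several takes of equal length."""
    return raw_gb_per_second(megapixels, fps) * seconds * takes


if __name__ == "__main__":
    print(raw_gb_per_second(12, 30))  # the quoted baseline
    print(raw_gb_per_second(4, 30))   # lower-resolution capture
    print(raw_gb_per_second(12, 60))  # doubling the frame rate
    # Hypothetical shoot: three 20-second takes at full quality.
    print(raw_gb_for_shoot(seconds=20, takes=3, megapixels=12, fps=30))
```

Under this model, the four-megapixel case comes out at the 10 gigabytes per second mentioned above, and moving from 30 to 60 frames per second doubles the rate.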

Also, depending on where we’re going to be executing the final experience, we will adjust our poly count, which is the number of polygons in the mesh of the holographic capture. If it’s going to be a web AR experience, we’re going to want it as small as possible in terms of poly count and resolution because it’s going to stream through the web. In that case, I think the smallest we can go is 10,000 polygons at a 1.5K resolution.

We just did a recent shoot with Snapchat, and had to work really hard to get the poly count and resolution to a size that could be on Snapchat. Conversely, when we did a shoot for Unity for their new Metacast UFC proof of concept, they were going to put it through their own compression systems so we did it at our maximum resolution: 60,000 polygons at almost 4K resolution.

We’re always evaluating how big the file will be, along with what kind of quality you need. But the compression of our technology stack is amazing: If we need to get those 30 gigabytes of raw data per second down to, you know, 70 megabytes for a minute, we do that on a regular basis. Making the content as streamable and as small as possible is something we’ve been doing from the get-go.
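To put that compression in perspective, the two quoted figures can be turned into a rough ratio. This is a minimal sketch that assumes decimal units (1 GB = 1,000 MB) and a sustained 30 gigabytes per second for the full minute. The function name is hypothetical, and the result is only as precise as the round numbers given in the interview.

```python
RAW_GB_PER_SECOND = 30.0        # quoted raw capture rate
DELIVERED_MB_PER_MINUTE = 70.0  # quoted size of one compressed minute
MB_PER_GB = 1000.0              # decimal units (an assumption)


def compression_ratio(raw_gb_per_second: float, delivered_mb_per_minute: float) -> float:
    """Return how many times smaller the delivered minute is than the raw minute."""
    raw_mb_per_minute = raw_gb_per_second * MB_PER_GB * 60
    return raw_mb_per_minute / delivered_mb_per_minute


if __name__ == "__main__":
    ratio = compression_ratio(RAW_GB_PER_SECOND, DELIVERED_MB_PER_MINUTE)
    print(f"Roughly {ratio:,.0f}:1")
```

With these inputs, one minute of raw capture is about 1.8 terabytes, so the delivered file is on the order of 25,000 times smaller.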

Do you need or use any specialty camera lenses or camera types?

VOLUCAM 89B64CV Replay Camera

CH: We use IOI cameras. They’re machine vision cameras. Fifty-three of them are RGB cameras that capture the visual and texture data of the subjects, and the other 53 are infrared cameras shooting out a point cloud from which a mesh is derived. There’s a UV map, and then all that visual and texture data is applied to the UV map on the mesh, and that’s how we get the hologram.

SS: There are a variety of lenses you could use as well. And there’s no hard number on that. We use 16 mm lenses, but we’ve seen wider and we’ve seen narrower. It just depends on the tech stack and application.

You mentioned large file sizes; what’s the average file size of a volumetric video?

SS: Like Christina said, if we’re capturing at 12 megapixels, it’s 30 gigabytes of data per second at 30 frames per second. And then it might take a performer a few takes to get the shot. And so data management is definitely a part of our world—we think about it a lot. We tend to use Azure, both for processing overflow and also for storage.

But when you’re talking about the final asset, the process of taking all 106 cameras’ worth of video data and making a small streamable file…you start with this giant point cloud, with millions of points. And that turns into a closed watertight mesh with giant textures. And you iterate and iterate and iterate upon that until it’s down to less than 100 megabytes for a full minute of volumetric, totally 3D performance.

Can you give an example of the biggest projects you’ve stored on Azure?

CH: I’m gonna say 60 to 70 terabytes. And that’s including a lot of the source data and stuff like that—we’re saving it in case the client wants to eventually reprocess it or something along those lines. And so that’s unprocessed. We haven’t put it through the pipeline. That’s just the raw data. It’s a feat of modern technology, because before cloud storage I don’t even know how you would do this kind of workflow.
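For a sense of scale, the storage figure can be related back to the capture rate quoted earlier. The sketch below assumes, purely for illustration, that all of the stored data is raw capture at 30 gigabytes per second and that 1 TB = 1,000 GB. The interview does not say how the stored data breaks down, and the function name is hypothetical.

```python
RAW_GB_PER_SECOND = 30.0  # quoted raw rate at 12 megapixels, 30 fps
GB_PER_TB = 1000.0        # decimal units (an assumption)


def capture_minutes(storage_tb: float, raw_gb_per_second: float = RAW_GB_PER_SECOND) -> float:
    """Return how many minutes of raw capture would fill the given storage."""
    seconds = storage_tb * GB_PER_TB / raw_gb_per_second
    return seconds / 60


if __name__ == "__main__":
    for tb in (60, 70):
        print(f"{tb} TB holds about {capture_minutes(tb):.1f} minutes of raw capture")
```

Under those assumptions, 60 to 70 terabytes corresponds to a little over half an hour of full-quality raw footage.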

How about audio? How do you go about recording or creating 3D sound?

CH: We work with an amazing audio team—Echo VR is their name. We put a lot of energy into taking audio as seriously as we do the video. For a lot of our shoots we use lavaliers to get a good clean audio capture, like you would on any professional production shoot. We also have Sennheiser shotgun mics we use as a reliable and clean audio source: If the costumes don’t allow a lavalier to be integrated, let’s say we’re doing a swimsuit or something like that, then we can use the shotgun mics. And then we put everything through a typical Pro Tools sound cleanup and mastering process, and it’s attached to the final deliverable.

Is volumetric video a must-have in the future metaverse?

CH: You can have a metaverse experience that doesn’t have volumetric capture. I do it in VR all the time: I go into AltspaceVR and VRChat. We’re all avatars. We’re exploring synthetic worlds. It’s super fun, and there’s no volumetric video necessary. But for the metaverse to really live up to its true potential, we need to be able to port real humans into it. And that’s what we are uniquely qualified to do.

What’s the relationship between volumetric and virtual production? Do they intersect?

CH: They do. For instance, in one commercial we did with R/GA in New York, they built all the worlds using VFX software: They used Maya, Houdini, and Nuke. I think they used Maya for most of the world creation, and they used Houdini and Nuke for some additional VFX on the capture itself. And so the volumetric video is the hero performance asset, and the world was created using VFX software, and it’s where they did the virtual camera moves to frame all of the shots for the final commercial.

If you capture the actor or the performer using high-quality volumetric video, now you have the same flexibility with that asset that you have with any 3D asset. And if you’re creating everything inside the game engine, it allows real people to be a part of that virtual production scenario, as opposed to a virtual production scenario where a person is standing in front of an LED wall and you’re filming it that way. It allows you to do everything in-engine.

SS: And if you think about what R/GA’s team was able to do, they could do these wild, huge camera moves that would have been really expensive to do practically, and would have required an enormous budget and way more staff and more time. And they would have had to do the shots several times. But since they had this hero asset, everything else after that is just flexible.

How do you see virtual production and volumetric capture evolving in the future?

CH: I can see it allowing for more flexibility when it comes to creating content. Ideally, that will serve not only the directors and visionaries, but the performers as well. Maybe they won’t have to do as many takes, because the director can frame their shots in post.

We want to thank Christina Heller and Skylar Sweetman from Metastage for educating us on the technical nuances of creating a metaverse experience and how it will influence media & entertainment. To learn more about how Metastage uses volumetric video and virtual production to capture authentic human performances for a digital medium, you can visit metastage.com.

The metaverse promises to blur the lines between our digital and physical realities. While that concept is ambitious, thanks to advancements in gaming and media technology, it’s not unattainable. Volumetric video and virtual production are two ways creative professionals are integrating humans into digital worlds. By capturing photorealistic performances from every angle and pairing them with VFX world-building and state-of-the-art rendering, creatives can produce content that speaks to us on a more intimate level.

To bring the metaverse, or any digital-first production, to life, filmmakers and developers need to manage the massive file sizes that come from high-resolution video captures and data-heavy production files. Whether it’s moving footage from the volumetric capture stage to the VFX team or sending proxies to clients for review, massive files call for MASV.

MASV is a file sharing service that lets media professionals quickly transfer terabytes of data to anyone in the world over the cloud. We consistently deliver files in record time thanks to our network of 150 global data centers and zero throttling on transfer speeds. All files shared through MASV are encrypted and backed by a Trusted Partner Network assessment, the industry standard for content protection. Sign up today for MASV and get 20 GB free towards your next transfer.
