• March 13, 2020, 05:48
  • Source: ARVR科技报道


Football matches land on your table thanks to augmented reality

It's World Cup season, so that means that even articles about machine learning have to have a football angle. Today's concession to the beautiful game is a system that takes 2D videos of matches and recreates them in 3D so you can watch them on your coffee table (assuming you have some kind of augmented reality setup, which you almost certainly don't). It's not as good as being there, but it might be better than watching it on TV.



The "Soccer On Your Tabletop" system takes as its input a video of a match and watches it carefully, tracking each player and their movements individually. The images of the players are then mapped onto 3D models "extracted from soccer video games," and placed on a 3D representation of the field. Basically they cross FIFA 18 with real life and produce a sort of miniature hybrid.


Considering the source data — two-dimensional, low-resolution, and in motion — it's a pretty serious accomplishment to reliably reconstruct a realistic and reasonably accurate 3D pose for each player.


Now, it's far from perfect. One might even say it's a bit useless. The characters' positions are estimated, so they jump around a bit, and the ball doesn't really appear much, so everyone appears to just be dancing around on a field. (That's on the to-do list.)


But the idea is great, and this is a working if highly limited first shot at it. Assuming the system could ingest a whole game based on multiple angles (it could source the footage directly from the networks), you could have a 3D replay available just minutes after the actual match concluded.


Not only that, but wouldn't it be cool to be able to gather round a central location and watch the game from multiple angles on it? I've always thought one of the worst things about watching sports on TVs is everyone is sitting there staring in one direction, seeing the exact same thing. Letting people spread out, pick sides, see things from different angles to analyze strategies — that would be fantastic.


All we need is for someone to invent a perfect, affordable holographic display that works from all angles and we're set.


The research is being presented at the Computer Vision and Pattern Recognition conference in Salt Lake City, and it's a collaboration between Facebook, Google, and the University of Washington.



Soccer On Your Tabletop

We present a system that transforms a monocular video of a soccer game into a moving 3D reconstruction, in which the players and field can be rendered interactively with a 3D viewer or through an Augmented Reality device. At the heart of our paper is an approach to estimate the depth map of each player, using a CNN that is trained on 3D player data extracted from soccer video games. We compare with state of the art body pose and depth estimation techniques, and show results on both synthetic ground truth benchmarks, and real YouTube soccer footage.



Imagine watching a 3D hologram of a live soccer game on your living room table; you can walk around with an Augmented Reality device, watch the players from different viewpoints, and lean in to see the action up close.


One way to create such an experience is to equip the soccer field with many cameras, synchronize the cameras, and then reconstruct the field and players in 3D using multi-view geometry techniques. Approaches of that spirit were previously proposed in the literature [14, 13, 19] and even commercialized as Replay’s FreeD, and others [1]. The results of multi-view methods are impressive; however, the requirement of physically instrumenting the field with many synchronized cameras limits their generality. What if, instead, we could reconstruct any soccer game just from a single YouTube video? This is the goal of this paper.


There are numerous challenges in monocular reconstruction of a soccer game. We must estimate the camera pose relative to the field, detect and track each of the players, reconstruct their body shapes and poses, and render the combined reconstruction.



We present the first end-to-end system (Fig. 2) that accomplishes this goal (short of reconstructing the ball, which remains future work). In addition to the system, a key technical contribution of our paper is a novel method for player body depth map estimation from a single frame. Our approach is trained on meshes extracted from FIFA video games. Based on this data, a neural network estimates per-pixel depth values of any new soccer player, comparing favorably to other state-of-the-art body depth and pose estimation techniques.


We present results on 10 YouTube games of different teams. Our results can be rendered using any 3D viewer, enabling free-viewpoint navigation from the side of the field recorded by the game camera. We also implemented “holographic” Augmented Reality viewing with HoloLens, projected onto a tabletop. See the supplementary material for the AR video results and the 3D model of the game.


2. Related Work

Sports Analysis. Sports game analysis has been extensively investigated from the perspectives of image processing, computer vision, and computer graphics [32], both for academic research and for industry applications. Understanding a sports game involves several steps, from field localization to player detection, tracking, segmentation, etc. Most sports have a predefined area where the action is happening; therefore, it is essential to localize that area w.r.t. the camera. This can be done with manual correspondences and calibration based on, e.g., edges [5], or fully automatically [21]. In this work, we follow a field localization approach similar to [5].


Sports reconstruction can be achieved using multiple cameras or specialized equipment, an approach that has been applied to free viewpoint navigation and 3D replays of games. Products such as Intel FreeD [1] produce new viewing experiences by incorporating data from multiple cameras. Similarly, having a multi-camera setup allows multi-view stereo methods [18, 19] for free viewpoint navigation [17, 47, 16], view interpolation based on player triangulation [14] or view interpolation by representing players as billboards [13]. In this paper, we show that reliable reconstruction from monocular video is now becoming possible due to recent advances in people detection [38, 7], tracking [31], pose estimation [49, 37], segmentation [20], and deep learning networks. In our framework, the input is broadcast video of a game, readily available on YouTube and other online media sites.


Human Analysis. Recently, there has been enormous improvement in people analysis using deep learning. Person detection [38, 7] and pose estimation [49, 37] provide robust building blocks for further analysis of images and video. Similarly, semantic segmentation can provide pixel-level predictions for a large number of classes [51, 27]. In our work, we use such predictions (bounding boxes from [38], pose keypoints [49], and people segmentation [51]) as input steps towards a full system where the input is a single video sequence, and the output is a 3D model of the scene.


Analysis and reconstruction of people from depth sensors is an active area of research [44, 3], but the use of depth sensors in outdoor scenarios is limited because of the interference with abundant natural light. An alternative would be to use synthetic data [48, 22, 46, 43], but these virtual worlds are far from our soccer scenario. There is extensive work on depth estimation from images/videos of indoor [10] and road [15] scenes, but not explicitly for humans. Recently, the work of [48] proposes a human part and depth estimation method trained on synthetic data. They fit a parametric human model [29] to motion capture data and use cloth textures to model appearance variability for arbitrary subjects and poses when constructing their dataset. In contrast, our approach takes advantage of the restricted soccer scenario for which we construct a dataset of depth map / image pairs of players in typical soccer clothing and body poses extracted from a high quality video game. Another approach that can indirectly infer depth for humans from 2D images is [4]. This work estimates the pose and shape parameters of a 3D parametric shape model in order to fit the observed 2D pose estimation. However, the method relies on robust 2D poses, and the reconstructed shape does not fit to the players’ clothing. We compare to both of these methods in the Experiments section.


Multi-camera rigs are required for many motion capture and reconstruction methods [8, 45]. [33] uses a CNN person segmentation per camera and fuses the estimations in 3D. Body pose estimation from multiple cameras is used for outdoor motion capture in [40, 11]. In the case of a single camera, motion capture can be obtained using 3D pose estimators [35, 36, 30]. However, these methods provide the 3D position only for skeleton joints; estimating full human depth would require additional steps such as parametric shape fitting. We require only a single camera.



3. Soccer player depth map estimation

A key component of our system is a method for estimating a depth map for a soccer player given only a single image of the player. In this section, we describe how we train a deep network to perform this task.


3.1. Training data from FIFA video games

State-of-the-art datasets for human shape modeling mostly focus on general representation of human bodies and aim at diversity of body shape and clothing [29, 48]. Instead, to optimize for accuracy and performance in our problem, we want a training dataset that focuses solely on soccer, where clothing, players’ poses, camera views, and positions on the field are very constrained. Since our goal is to estimate a depth map given a single photo of a soccer player, the ideal training data would be image and depth map pairs of soccer players in various body poses and clothing, viewed from a typical soccer game camera.


The question is: how do we acquire such ideal data? It turns out that while playing Electronic Arts FIFA games and intercepting the calls between the game engine and the GPU [42, 41], it is possible to extract depth maps from video game frames.


In particular, we use RenderDoc [2] to intercept the calls between the game engine and the GPU. FIFA, similar to most games, uses deferred shading during game play. Having access to the GPU calls enables capture of the depth and color buffers per frame. Once depth and color are captured for a given frame, we process it to extract the players.


The extracted color buffer is an RGB screenshot of the game, without the score and time counter overlays and the in-game indicators. The extracted depth buffer is in Normalized Device Coordinates (NDC), with values between 0 and 1. To get the world coordinates of the underlying scene we require the OpenGL camera matrices that were used for rendering. In our case, these matrices were not directly accessible in RenderDoc, so we estimated them (see Appendix A in supplementary material).
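As a rough illustration of why the camera matrices matter, the sketch below converts a single depth-buffer sample in [0, 1] into a metric distance along the view axis. It assumes a standard OpenGL perspective projection with the default depth range; the function name and the `near` and `far` parameters are hypothetical and not taken from the paper, which estimates its camera matrices separately.

```python
def depth_buffer_to_eye_depth(d, near, far):
    """Convert a depth-buffer sample d in [0, 1] to positive eye-space depth.

    Assumes a standard OpenGL perspective projection with the default
    depth range, which may differ from the game's actual setup.
    """
    z_ndc = 2.0 * d - 1.0  # map [0, 1] to the NDC range [-1, 1]
    return 2.0 * near * far / (far + near - z_ndc * (far - near))
```

Under these assumptions a sample of 0 maps to the near plane and a sample of 1 to the far plane, with a strongly non-linear distribution in between, which is why the raw buffer cannot be used directly as metric depth.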


Given the game camera parameters, we can convert the z-buffer from the NDC to 3D points in world coordinates. The result is a point cloud that includes the players, the ground, and portions of the stadium when it is visible. The field lies in the plane y = 0. To keep only the players, we remove everything that is outside of the soccer field boundaries and all points on the field (i.e., points with y = 0). To separate the players from each other we use DBSCAN clustering [12] on their 3D locations. Finally, we project each player’s 3D cluster to the image and recalculate the depth buffer with metric depth. Cropping the image and the depth buffer around the projected points gives us the image-depth pairs (we extracted 12000 of them) for training a depth estimation network (Fig. 3). Note that we use a player-centric depth estimation because we get more training data by breaking down each frame into 10-20 players, and it is easier for the network to learn an individual player’s configuration rather than whole-scene arrangements.
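The filtering and clustering steps can be sketched as follows. This is a minimal, unoptimized illustration, not the authors' code: it assumes the field is centered at the origin with x along its length, z along its width and y pointing up, and the names `keep_player_points`, `half_length`, `half_width` and `ground_eps` are hypothetical. The DBSCAN here is a plain quadratic-time version written out for clarity.

```python
import math

def keep_player_points(points, half_length, half_width, ground_eps=0.05):
    """Drop points outside the pitch and points lying on the ground plane y = 0."""
    return [
        (x, y, z) for (x, y, z) in points
        if abs(x) <= half_length and abs(z) <= half_width and y > ground_eps
    ]

def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point previously marked as noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(nj)
    return labels
```

Each resulting label then identifies the points of one player, which can be projected back into the image to crop an image-depth pair.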


3.2. Depth Estimation Neural Network

Given the depth-image pairs extracted from the video game, we train a neural network to estimate depth for any new image of a soccer player. Our approach follows the hourglass network model [34, 48]: the input is processed by a sequence of hourglass modules (a series of residual blocks that lower the input resolution and then upscale it) and the output is depth estimates.


Specifically, the input of the network is a 256×256 RGB image cropped around a player together with a segmentation mask for the player, resulting in a 4-channel input. We experimented with training on no masks, ground truth masks, and estimated masks. Using masks noticeably improved results. In addition, we found that using estimated masks yielded better results than ground truth masks. With estimated masks, the network learns the noise that occurs in player segmentation during testing, where no ground truth masks are available. To calculate the player’s mask, we apply the person segmentation network of [51], refined with a CRF [25]. Note that our network is single-player-centric: if there are overlapping players in the input image, it will try to estimate the depth of the center one (that originally generated the cropped image) and assign the other players’ pixels to the background.


The input is processed by a series of 8 hourglass modules and the output of the network is a 64×64×50 volume, representing 49 quantized depths (as discrete classes) and 1 background class. The network was trained with cross entropy loss with batch size of 6 for 300 epochs with learning rate 0.0001 using the Adam [24] solver (see details of the architecture in supplementary material).


The depth parameterization is performed as follows: first, we estimate a virtual vertical plane passing through the middle of the player and calculate its depth w.r.t. the camera. Then, we find the distance in depth values between a player’s point and the plane. The distance is quantized into 49 bins (1 bin at the plane, 24 bins in front, 24 bins behind) at a spacing of 0.02 meters, roughly covering 0.5 meters in front and in back of the plane (1 meter depth span). In this way, all of our training images have a common reference point. Later, during testing, we can apply these distance offsets to a player’s bounding box after lifting it into 3D (see Sec. 4.4).
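The quantization can be made concrete with a small sketch. The bin count and spacing come from the text; the class ordering, the index chosen for the background class, the sign convention and the clamping of out-of-range offsets are assumptions made for illustration.

```python
NUM_BINS = 49           # depth classes, from the text
BIN_SIZE = 0.02         # meters between bins, from the text
BACKGROUND = 49         # assumed index of the background class
CENTER = NUM_BINS // 2  # bin 24 sits on the virtual plane

def quantize_offset(offset_m):
    """Map a signed distance to the plane (meters) to a class index 0..48."""
    k = round(offset_m / BIN_SIZE)
    k = max(-CENTER, min(CENTER, k))  # clamp to 24 bins on either side
    return k + CENTER

def dequantize(class_index):
    """Map a class index back to a signed offset in meters (None for background)."""
    if class_index == BACKGROUND:
        return None
    return (class_index - CENTER) * BIN_SIZE
```

With 24 bins of 0.02 meters on each side, the representable range is ±0.48 meters, which matches the "roughly 0.5 meters" stated above.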


4. Reconstructing the Game

In this section we describe our full pipeline for 3D reconstruction from a soccer video clip.


4.1. Camera Pose Estimation

The first step is to estimate the per-frame parameters of the real game camera. Because soccer fields have specific dimensions and structure according to the rules of FIFA, we can estimate the camera parameters by aligning the image with a synthetic planar field template.


4.2. Player Detection and Tracking

The first step of the video analysis is to detect the players in every frame. While detecting soccer players may seem straightforward due to the relatively uniform background, most state-of-the-art person detectors still have difficulty when, e.g., players from the same team occlude each other or the players are too small.


We start with a set of bounding boxes obtained with [39]. Next, we refine the initial bounding boxes based on pose information using the detected keypoints/skeletons from [49]. We observed that the estimated poses can better separate the players than just the bounding boxes, and the pose keypoints can be effectively used for tracking the players across frames.


Finally, we generate tracks over the sequence based on the refined bounding boxes. Every track has a starting and ending location in the video sequence. The distance between two tracks A and B is defined as the 2D Euclidean distance between the ending location of track A and starting location of track B, assuming track B starts at a later frame than track A and their frame difference is smaller than a threshold (detailed parameters are described in supplementary material). We follow a greedy merging strategy. We start by considering all detected neck keypoints (we found this keypoint to be the most reliable to associate with a particular player) from all frames as separate tracks and we calculate their pairwise distances. Two tracks are merged if their distance is below a threshold, and we continue until there are no tracks to merge. This step associates every player with a set of bounding boxes and poses across frames. This information is essential for the later processing of the players, namely the temporal segmentation, depth estimation and better placement in 3D. Fig. 2 shows the steps of detection, pose estimation, and tracking.
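The greedy merging described above can be sketched as follows. The thresholds `max_dist` and `max_gap` stand in for the parameters the paper defers to its supplementary material, and merging the closest admissible pair first is an assumption; the text only says that merging is greedy. Detections are hypothetical `(frame, x, y)` tuples of neck keypoints.

```python
import math

def track_distance(a, b, max_gap):
    """Distance from the end of track a to the start of track b, or None if b
    does not start after a ends within max_gap frames."""
    (fa, xa, ya), (fb, xb, yb) = a[-1], b[0]
    if not 0 < fb - fa <= max_gap:
        return None
    return math.hypot(xb - xa, yb - ya)

def merge_tracks(detections, max_dist, max_gap):
    """Greedy merging: every (frame, x, y) neck detection starts as its own
    track; the closest admissible pair is merged until none remains."""
    tracks = [[d] for d in sorted(detections)]
    while True:
        best = None
        for i, a in enumerate(tracks):
            for j, b in enumerate(tracks):
                if i == j:
                    continue
                d = track_distance(a, b, max_gap)
                if d is not None and d < max_dist and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return tracks
        _, i, j = best
        tracks[i] = tracks[i] + tracks[j]  # append the later track to the earlier one
        del tracks[j]
```

The exhaustive pair search is kept for readability; a real pipeline over hundreds of frames would want a spatial index or a per-frame candidate list.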


4.3. Temporal Instance Segmentation

For every tracked player we need to estimate its segmentation mask to be used in the depth estimation network. A straightforward approach is to apply at each frame a person segmentation method [51], refined with a dense CRF [25] as we did for training. This can work well for the unoccluded players, but in the case of overlap, the network estimates are confused. Although there are training samples with occlusion, their number is not sufficient for the network to estimate the depth of one player (e.g. the one closer to the center) and assign the rest to the background. For this reason, we “help” the depth estimation network by providing a segmentation mask where the tracked player is the foreground and the field, stadium and other players are background (this is similar to the instance segmentation problem [20, 50], but in a 1-vs-all scenario).



4.4. Mesh Generation

The foreground mask from the previous step, together with the original cropped image, is fed to the network described in Sec. 3.2. The output of the network is per-pixel, quantized signed distances between the player’s surface and a virtual plane w.r.t. the camera. To obtain a metric depth map we first lift the bounding box of the player into 3D, creating a billboard (we assume that the bottom pixel of the player lies on the ground). We then apply the distance offsets output by the network to the 3D billboard to obtain the desired depth map.
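The ground-contact assumption is what makes the lifting step well-posed: the viewing ray through the player's bottom pixel meets the field plane y = 0 at exactly one point. A minimal sketch of that intersection is shown below; the function name is hypothetical, and the ray direction is assumed to have been computed elsewhere from the estimated camera parameters.

```python
def intersect_ground(cam_center, ray_dir):
    """Intersect a viewing ray with the ground plane y = 0.

    cam_center and ray_dir are (x, y, z) tuples in world coordinates.
    Returns None if the ray never reaches the ground in front of the camera.
    """
    cx, cy, cz = cam_center
    dx, dy, dz = ray_dir
    if dy == 0:
        return None  # ray is parallel to the ground
    t = -cy / dy
    if t <= 0:
        return None  # intersection lies behind the camera
    return (cx + t * dx, 0.0, cz + t * dz)
```

The depth of this point along the camera axis gives the depth of the billboard, to which the per-pixel offsets from the network are then added.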


The depth map is then unprojected to world coordinates using the camera parameters, generating the player’s point cloud in 3D. Each pixel corresponds to a 3D point and we use pixel connectivity to establish faces. We texture-map the mesh with the input image. Depending on the application, the mesh can be further simplified with mesh decimation to reduce the file size for deployment in an AR device.
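One common way to turn pixel connectivity into faces is to emit two triangles for every 2×2 block of foreground pixels. The paper does not spell out its triangulation, so the sketch below is an assumption, and `grid_faces` is a hypothetical helper name.

```python
def grid_faces(mask):
    """Build triangle faces from a 2D foreground mask (list of rows of bools).

    Each foreground pixel becomes one vertex; every 2x2 block of foreground
    pixels contributes two triangles. Returns (vertex_index, faces), where
    vertex_index maps (row, col) to a vertex id.
    """
    vertex_index = {}
    for r, row in enumerate(mask):
        for c, fg in enumerate(row):
            if fg:
                vertex_index[(r, c)] = len(vertex_index)
    faces = []
    for (r, c), v00 in vertex_index.items():
        corners = [(r, c + 1), (r + 1, c), (r + 1, c + 1)]
        if all(p in vertex_index for p in corners):
            v01, v10, v11 = (vertex_index[p] for p in corners)
            faces.append((v00, v10, v01))
            faces.append((v01, v10, v11))
    return vertex_index, faces
```

Because vertices correspond one-to-one to pixels, the pixel coordinates double as texture coordinates for the input image.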


4.5. Trajectories in 3D

Due to imprecise camera calibration and bounding box localization, the 3D placement of players can “jitter” from frame to frame. To address this problem, we smooth the 3D trajectories of the players by optimizing an objective with two terms. The first term of the objective ensures that the estimated trajectory will be close to the original detections, and the second term encourages second order temporal smoothness.
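An objective consistent with this description is a sum of squared deviations from the detections plus a weighted sum of squared second differences. The sketch below minimizes such an objective by plain gradient descent. The weight `lam`, the iteration count and the choice of solver are assumptions for illustration; the exact formulation used by the authors is not reproduced in this text.

```python
def smooth_trajectory(detections, lam=10.0, iters=2000):
    """Minimize sum |x_t - d_t|^2 + lam * sum |x_{t-1} - 2 x_t + x_{t+1}|^2
    by gradient descent. detections is a list of (x, y, z) tuples."""
    n, dims = len(detections), len(detections[0])
    x = [list(p) for p in detections]
    # Step size 1/L, where L = 2 + 32 * lam bounds the gradient's Lipschitz constant.
    step = 1.0 / (2.0 + 32.0 * lam)
    for _ in range(iters):
        # Gradient of the data term.
        grad = [[2.0 * (x[t][k] - detections[t][k]) for k in range(dims)]
                for t in range(n)]
        # Gradient of the second-order smoothness term.
        for t in range(1, n - 1):
            for k in range(dims):
                s = x[t - 1][k] - 2.0 * x[t][k] + x[t + 1][k]
                grad[t - 1][k] += 2.0 * lam * s
                grad[t][k] -= 4.0 * lam * s
                grad[t + 1][k] += 2.0 * lam * s
        for t in range(n):
            for k in range(dims):
                x[t][k] -= step * grad[t][k]
    return [tuple(p) for p in x]
```

A trajectory that already moves at constant velocity has zero second differences and is left unchanged, while frame-to-frame jitter is damped in proportion to `lam`.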


5. Experiments

All videos were processed in a single desktop with an i7 processor, 32 GB of RAM and a GTX 1080 with 6GB of memory. The full (unoptimized) pipeline takes approximately 15 seconds for a typical 4K frame with 15 players.



Synthetic Evaluation. We quantitatively evaluate our approach and several others using a held-out dataset from FIFA video game captures. The dataset was created in the same way as the training data (Sec. 3) and contains 32 RGB-depth pairs of images, containing 450 players. We use the scale invariant root mean square error (st-RMSE) [48, 10] to measure the deviation of the estimated depth values of foreground pixels from the ground truth. In this way we compensate for any scale/translation ambiguity along the camera’s z-axis. We additionally report segmentation accuracy results using the intersection-over-union (IoU) metric.
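Both metrics are easy to sketch. IoU is standard. For st-RMSE, the version below fits the least-squares scale and shift that best align the prediction to the ground truth before taking the RMSE; this is one plausible reading of a scale/translation-invariant error and is not necessarily the exact formula of [48, 10]. Inputs are assumed to be flat Python lists over the pixels being compared.

```python
import math

def iou(pred_mask, gt_mask):
    """Intersection-over-union of two flat lists of booleans."""
    inter = sum(1 for p, g in zip(pred_mask, gt_mask) if p and g)
    union = sum(1 for p, g in zip(pred_mask, gt_mask) if p or g)
    return inter / union if union else 1.0

def st_rmse(pred, gt):
    """RMSE after the least-squares scale s and shift t aligning pred to gt.

    pred and gt are flat lists of foreground depths of equal length.
    """
    n = len(pred)
    mp, mg = sum(pred) / n, sum(gt) / n
    var = sum((p - mp) ** 2 for p in pred)
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gt))
    s = cov / var if var else 0.0
    t = mg - s * mp
    return math.sqrt(sum((s * p + t - g) ** 2 for p, g in zip(pred, gt)) / n)
```

A prediction that differs from the ground truth only by a global scale and offset scores zero under this definition, which is the ambiguity along the camera's z-axis that the metric is meant to discount.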


We compare with three different approaches: a) non-human-specific depth estimation [6], b) human-specific depth estimation [48], and c) fitting a parametric human shape model to 2D pose estimations [4]. For all of these methods, we use their publicly available code.


The inputs for all methods are cropped images containing soccer players. We apply the person detection and pose estimation steps, as described in Sec. 4, to the original video game images in order to find the same set of players for all methods (resulting in 432 player-depth pairs). For each detection, we crop the area around the player to use as a test image, and we get its corresponding ground truth depth for evaluation. In addition, we lift its bounding box in 3D to get the location of the player in the field and to use it for our depth estimation method (note that the bounding box is not always tight around the player, resulting in some displacement across the camera’s z-axis).


The cropped images come from a larger frame with known camera parameters; therefore, the depth estimates can be placed back in the original camera’s (initially empty) depth buffer. Since the depth estimates from the different methods depend on the camera settings that each method used during training, it is necessary to use scale/translation invariance metrics. In addition, we transform the output of [48] into world units by multiplying by their quantization factor (0.045m). Note that our estimates are also in world units, since we use the exact dimensions of the field for camera calibration. For [4], we modify their code to use the same 2D pose estimates used in our pipeline [49] and we provide the camera parameters and the estimated 3D location of the player. Table 1 summarizes the quantitative results for depth estimation and player segmentation. Our method outperforms the alternatives both in terms of depth error and player coverage. This result highlights the benefit of having a training set tailored to a specific scenario.


The method of [48] assigned a large number of foreground pixels to the background. One reason is that their training data aims to capture general human appearance against cluttered backgrounds, unlike what is found in typical soccer images. Moreover, the parametric shape model [29] that is used in [48, 4] is based on scans of humans with shapes and poses not necessarily observed in soccer games. Trying to fit such a model to soccer data may result in shapes/poses that are not representative of soccer players. In addition, the parametric shape model is trained on subjects wearing little clothing, resulting in “naked” reconstructions.


YouTube videos. We evaluate our approach on a collection of soccer videos downloaded from YouTube with 4K resolution. The initial sequences were trimmed to 10 video clips shot from the main game camera. Each sequence is 150-300 frames and contains various game highlights (e.g., passing, shooting, etc.) for different teams and with varying numbers of players per clip. The videos also contain typical imaging artifacts, such as chromatic aberration and motion blur, and compression artifacts.



Fig. 6 shows the depth maps of different methods on real examples. Similar to the synthetic experiment, the non-human and non-soccer methods perform poorly. The method of [4] correctly places the projections of the pose keypoints in 2D, but the estimated 3D pose and shape are often different from what is seen in the images. Moreover, the projected parametric shape does not always correctly cover the player pixels (also due to the lack of clothing), leading to incorrect texturing (Fig. 7). With our method, while we do not obtain full 3D models as in [4], the visible surfaces are modeled properly (e.g. the player’s shorts). Also, after correctly texturing our 3D model, the quantization artifacts from the depth estimation are no longer evident. In principle, the full 3D models produced by [4] could enable viewing a player from a wide range of viewpoints (unlike our depth maps); however, they will lack correct texture for unseen portions in a given frame, a problem that would require substantial additional work to address.



Depth Estimation Consistency. Our network is trained on players from individual frames without explicitly enforcing any temporal or viewpoint coherence. Ideally, the network should give compatible depth maps for a specific player seen at the same time from different viewpoints. In Fig. 8, we illustrate the estimated meshes on the KTH multi-view soccer dataset [23], with a player captured from three different, synced cameras. Since we do not have the location of the player on the field, we use a mock-up camera to estimate the 3D bounding box of the player. The meshes were roughly aligned with manual correspondences.



In addition, for slight changes in body configuration from frame to frame, we expect the depth map to change accordingly. Fig. 9 shows reconstructed meshes for four consecutive frames, illustrating 3D temporal coherence despite frame-by-frame reconstruction.



Experiencing Soccer in 3D. The textured meshes and field we reconstruct can be used to visualize soccer content in 3D. Fig. 10 illustrates novel views for three input YouTube frames, where the reconstructed players are placed in a virtual stadium. The 3D video content can also be viewed in an AR device such as a HoloLens (Fig. 1), enabling the experience of watching soccer on your tabletop.



See supplemental video.

Limitations. Our pipeline consists of several steps and each one can introduce errors. Missed detections lead to players not appearing in the final reconstruction. Errors in the pose estimation can result in incorrect trajectories and segmentation masks (e.g. missing body parts). While our method can handle occlusions to a certain degree, in many cases the players overlap considerably, causing inaccurate depth estimations. We do not model jumping players since we assume that they always step on the ground. Finally, strong motion blur and low image quality can adversely affect the performance of the depth estimation network.


6. Discussion

We have presented a system to reconstruct a soccer game in 3D from a single YouTube video, and a deployment that enables viewing the game holographically on your tabletop using a HoloLens or other Augmented Reality device. The key contributions of the paper are the end-to-end system and a new state-of-the-art framework for player depth estimation from monocular video.


Going forward there are a number of important directions for future work. First, only a depth map is reconstructed per player currently, which provides a satisfactory viewing experience from only one side of the field. Further, occluded portions of players are not reconstructed. Hallucinating the opposite sides (geometry and texture) and occluded portions of players would enable viewing from any angle. Second, further improvements in player detection, tracking, and depth estimation will help reduce occasional artifacts and reconstructing the ball in the field will enable a more satisfactory viewing of an entire game. In addition, video game data could provide additional information to learn from, e.g., temporal evolution of a player’s mesh (if real-time capture is possible using a different capture engine) and jumping poses that could be detected from depth discontinuities between the player and the field.


Finally, to watch a full, live game in a HoloLens, we need both a real-time reconstruction method and a method for efficient data compression and streaming.

Acknowledgements. This work is supported by NSF/Intel Visual and Experimental Computing Award #1538618 and the UW Reality Lab.
