A trip down the transformation pipeline in Mapbox GL-JS

In this article I will briefly explain how GL-JS renders an encoded point from a vector tile onto the screen.

A few notes

I assume you already know what vector tiles are, so here are a few things to know about Mapbox vector tiles:

  • They are agnostic of geographic information.
  • Mapbox vector tiles have a resolution of 512 × 512 pixels.
  • They are efficiently encoded (using Google Protocol Buffers) for fast transfer over the web.

Let’s go!

How to encode a point in a tile

OK, so naturally the first question is: how do we encode a geographic location in a tile? Here is how I like to think about it:

We start with a point in WGS84 coordinates (latitude, longitude in degrees).
Next we convert this point to Web Mercator coordinates:

  1. An intermediate step is to calculate \(x'\) and \(y'\), which are unitless and in the \([0, 1]\) range¹:
    $$ x' = {lon + \pi \over 2\pi} $$ $$ y' = {\pi - \ln[\tan({\pi \over 4} + {lat \over 2})] \over 2\pi} $$

  2. Next we decide how big our world is going to be. This is a world in pixel units and its size depends on the tile size (in pixels) and the number of tiles (which in turn depends on the zoom level). In short: $$ worldSize = tileSize * 2 ^ {zoom} $$ Note that this represents the number of pixels in each dimension.

  3. We are now ready to compute the pixel coordinates that tell us which pixel the original point ends up on²: $$ px = \lfloor x' * worldSize \rfloor $$ $$ py = \lfloor y' * worldSize \rfloor $$

  4. Then we compute the tile coordinates that tell us which tile this pixel belongs to: $$ x = \lfloor {px \over tileSize} \rfloor $$ $$ y = \lfloor {py \over tileSize} \rfloor $$ $$ z = zoom $$ Combined, these give us the ID of the tile, represented as z/x/y.

  5. The last piece gives us the coordinates of the pixel inside the tile (i.e. the in-Tile coordinates):
    $$ ix = fract({px \over tileSize}) * tileExtent $$ $$ iy = fract({py \over tileSize}) * tileExtent $$ where \(tileExtent\) is basically the “resolution” used to encode the in-Tile coordinates, for instance 4096 or 8192.
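Here is a minimal TypeScript sketch of the five steps above. The function name `encodePoint` and its signature are mine, purely for illustration; this is not GL-JS code.

```typescript
// Encode a WGS84 point into a tile ID plus in-tile coordinates.
// `zoom` is assumed to be an integer tile zoom here.
function encodePoint(
  lonDeg: number,
  latDeg: number,
  zoom: number,
  tileSize = 512,
  tileExtent = 4096
) {
  const lon = (lonDeg * Math.PI) / 180;
  const lat = (latDeg * Math.PI) / 180;

  // Step 1: unitless Web Mercator coordinates in the [0, 1] range.
  const xPrime = (lon + Math.PI) / (2 * Math.PI);
  const yPrime =
    (Math.PI - Math.log(Math.tan(Math.PI / 4 + lat / 2))) / (2 * Math.PI);

  // Step 2: size of the world in pixel units.
  const worldSize = tileSize * Math.pow(2, zoom);

  // Step 3: pixel coordinates.
  const px = Math.floor(xPrime * worldSize);
  const py = Math.floor(yPrime * worldSize);

  // Step 4: tile ID (z/x/y).
  const x = Math.floor(px / tileSize);
  const y = Math.floor(py / tileSize);

  // Step 5: in-tile coordinates at the chosen tile extent.
  const fract = (v: number) => v - Math.floor(v);
  const ix = fract(px / tileSize) * tileExtent;
  const iy = fract(py / tileSize) * tileExtent;

  return { tileId: { z: zoom, x, y }, ix, iy };
}
```

For instance, `encodePoint(-122.42, 37.77, 11)` would give us the z/x/y tile a point in San Francisco lands in, and where it sits inside that tile.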

We have all the information we need, so we can encode this pixel using the Mapbox Vector Tile specification and save the tile under the name z/x/y.mvt! Now let’s see how this gets rendered in gl-js!

How to transform a point in a tile

So now that we have our tiles and know what they contain, we can transform them so we can display them on the screen!
Recall the in-Tile coordinates and how they depend on the tile extent. In order to simplify things Mapbox normalizes the in-Tile coordinates so that they end up in the \([0, 8192]\) range. This gives us what I’d refer to as normalized in-Tile coordinates, i.e.: $$ nx = ix * {8192 \over tileExtent} $$ $$ ny = iy * {8192 \over tileExtent} $$
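In code this is a one-liner; `normalizeInTile` below is a hypothetical helper name of mine, not a GL-JS function.

```typescript
// Rescale an in-tile coordinate from its native extent to the fixed 8192 range.
const EXTENT = 8192;

function normalizeInTile(i: number, tileExtent: number): number {
  return i * (EXTENT / tileExtent);
}

// e.g. with tileExtent = 4096, an in-tile coordinate of 1024 becomes 2048.
```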

That’s all we need to do to prepare our tile geometry for rendering (i.e. vertex buffers)³.

Next we’ll set up our transformation matrices that basically take us from normalized in-Tile coordinates to screen coordinates!
What’s important and should not be missed are the units and the input/output of each transformation matrix. I will describe these next.

Enter tile matrix

The first matrix is the tile matrix which takes us from normalized in-Tile coordinates to pixel/world coordinates.
This is basically the same coordinate space I described in step 3 of the previous section. One thing to note is that we have only talked about integer zooms, but we actually need to support real values too. Imagine that we’re at zoom level 1.5 but there are no tiles at that zoom level, so we fetch tiles at, say, zoom level 1 and compensate by scaling them up. Here is everything the tile matrix does:

  1. Divides the input by \(EXTENT = 8192\) to bring all the coordinates to the \([0, 1]\) range.
  2. Scales up the result: $$ scale = tileSize * {2 ^ {zoom} \over 2 ^ {\lfloor zoom\rfloor}} $$ If \(tileSize = 512\) and \(zoom = 11.1\) this would roughly be \(scale = 548.74\).
    Multiplying by this value effectively makes each tile span \(548.74\) pixel units.
    Another way to interpret this is that each pixel unit would occupy \(S \times S\) units, where \(S\) is in the \([1, 2)\) range⁴.
  3. Translates the result from the previous step by:
    $$ trx = id_x * scale $$ $$ try = id_y * scale $$ where \(id\) is the ID of the tile to which our original point belongs.

Putting it all together gives us the following matrix:

$$ \begin{pmatrix} scale / EXTENT & 0 & trx \\ 0 & scale / EXTENT & try \\ 0 & 0 & 1 \end{pmatrix} $$

Note that the actual implementation uses a \(4\times 4\) matrix⁵.
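Here is a sketch of this matrix as a plain row-major \(3\times 3\) array, assuming the conventions above. The function name `tileMatrix` is mine; again, the actual gl-js implementation builds a \(4\times 4\) matrix.

```typescript
const EXTENT = 8192;

// Build the matrix that maps normalized in-tile coordinates to
// pixel/world coordinates for the tile (floor(zoom), tileX, tileY).
function tileMatrix(
  zoom: number,
  tileX: number,
  tileY: number,
  tileSize = 512
): number[] {
  // Each tile spans this many pixel units; the 2^(zoom - floor(zoom))
  // factor compensates for fractional zoom levels.
  const scale = tileSize * Math.pow(2, zoom - Math.floor(zoom));
  const trX = tileX * scale;
  const trY = tileY * scale;
  return [
    scale / EXTENT, 0,              trX,
    0,              scale / EXTENT, trY,
    0,              0,              1,
  ];
}
```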

Where is my camera?

Next we need to specify where the camera (viewpoint) is.
The camera position is defined in pixel/world coordinates and is determined by three factors:

  1. \(center\) aka the map center, defined in pixel/world coordinates, is the point the camera is looking at.
  2. The camera’s forward vector, which is determined by its orientation⁶ and defined as follows: $$ forward = M * Z_p $$ where \(M\) is the orientation matrix and \(Z_p = (0, 0, 1)\).
  3. \(fd\), denoting the distance between the map center and the camera position⁷.

In short: $$ cameraPosition = center - fd * forward $$

Which is a point in pixel/world coordinates⁸.
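As a quick sketch (with a hypothetical `Vec3` type, purely for illustration):

```typescript
type Vec3 = [number, number, number];

// Camera position in pixel/world coordinates: walk backwards from the
// map center along the forward vector by the focal distance `fd`.
function cameraPosition(center: Vec3, forward: Vec3, fd: number): Vec3 {
  return [
    center[0] - fd * forward[0],
    center[1] - fd * forward[1],
    center[2] - fd * forward[2],
  ];
}
```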

World to camera coordinates

Before we proceed I would like to mention that our pixel/world space is defined in a left-handed coordinate system⁹.
\(+X\) points from the left of the screen to the right, \(+Y\) from the top of the screen to the bottom, and \(+Z\) is perpendicular to the screen, pointing outwards.
The camera however has a right-handed coordinate space where \(+Y\) is flipped and points upwards¹⁰.
Now that we know the orientation and the location of the camera in pixel/world coordinates, we can construct a matrix that takes us from pixel/world coordinates to camera coordinates¹⁰:

  1. The \(z\) coordinate of the pixel/world coordinates is actually in meters¹¹. We would like it to be in pixel units so we multiply it by a scale factor that specifies the number of pixels per meter.
  2. Multiply by the camera’s location and rotation matrices mentioned earlier.
  3. Finally we flip the \(y\) axis.

In short: $$ p_{camera} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} * R_{camera}^{-1} * \begin{pmatrix} 1 & 0 & 0 & -c_x \\ 0 & 1 & 0 & -c_y \\ 0 & 0 & 1 & -c_z \\ 0 & 0 & 0 & 1 \end{pmatrix} * \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & ppm & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} * p_{world} $$ where \(ppm\) denotes pixels per meter, \(R_{camera}^{-1}\) denotes the inverse of the camera rotation and \((c_x, c_y, c_z)\) is the location of the camera in pixel/world coordinates discussed earlier.
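Here is a sketch of this composition using gl-matrix (which GL-JS itself uses). The function name and argument layout are mine; see footnote 10 for where the real implementation lives.

```typescript
import { mat4, vec3 } from 'gl-matrix';

// Build the world-to-camera matrix: scale z from meters to pixels,
// undo the camera translation and rotation, then flip the y axis.
function worldToCamera(
  cameraRotation: mat4, // the orientation matrix M
  cameraPos: vec3,      // camera location in pixel/world coordinates
  pixelsPerMeter: number
): mat4 {
  const m = mat4.create();
  // Flip the y axis (left-handed world -> right-handed camera).
  mat4.scale(m, m, [1, -1, 1]);
  // Inverse of the camera rotation.
  const invRotation = mat4.create();
  mat4.invert(invRotation, cameraRotation);
  mat4.multiply(m, m, invRotation);
  // Inverse of the camera translation.
  mat4.translate(m, m, [-cameraPos[0], -cameraPos[1], -cameraPos[2]]);
  // Bring the z coordinate from meters to pixel units.
  mat4.scale(m, m, [1, 1, pixelsPerMeter]);
  return m;
}
```

Note that each call post-multiplies, so the factors end up in exactly the left-to-right order of the equation above.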

Camera to clip coordinates

Finally we construct a perspective matrix which takes us from camera coordinates to clip coordinates! What we need here is the field of view, the viewport’s aspect ratio, and the near and far plane distances.
The far plane distance calculation is particularly important here: we want to ensure that everything on the map that needs to be displayed stays within the camera frustum bounds and doesn’t get culled, especially when the camera has a high pitch.
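With gl-matrix this is a one-liner; the far plane is deliberately left as an input in this sketch since its calculation deserves its own discussion.

```typescript
import { mat4 } from 'gl-matrix';

// Camera-to-clip matrix from the vertical field of view (radians),
// viewport size (pixels) and near/far plane distances.
function cameraToClip(
  fovy: number,
  width: number,
  height: number,
  near: number,
  far: number
): mat4 {
  const proj = mat4.create();
  mat4.perspective(proj, fovy, width / height, near, far);
  return proj;
}
```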

Sum it up

So here is a summary of what’s happening:

$$ p_{pixel/world} = M_{tile} * p_{normalized-in-tile} $$ $$ p_{camera} = M_{camera} * p_{pixel/world} $$ $$ p_{clip} = M_{perspective} * p_{camera} $$
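Putting the three matrices together, a point travels through the pipeline like this (a sketch, assuming the matrices from the previous sections are all built as \(4\times 4\) gl-matrix values):

```typescript
import { mat4, vec4 } from 'gl-matrix';

// Take a normalized in-tile point all the way to clip coordinates.
function projectPoint(
  tileMatrix: mat4,
  worldToCamera: mat4,
  cameraToClip: mat4,
  pNormalizedInTile: vec4
): vec4 {
  // Compose right-to-left: tile -> world -> camera -> clip.
  const m = mat4.create();
  mat4.multiply(m, cameraToClip, worldToCamera);
  mat4.multiply(m, m, tileMatrix);
  const pClip = vec4.create();
  vec4.transformMat4(pClip, pNormalizedInTile, m);
  return pClip;
}
```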

Next up

In this article I didn’t mention anything about terrain or how we render everything on a globe instead of a 2D mercator plane. I also didn’t discuss how the far plane for the perspective camera gets calculated and how we deal with rotations.
But fear not, these will be described in follow-up posts, so stay tuned! That’s all folks! Thank you!

Footnotes


  1. \(lon\) is in \([-\pi, \pi]\) and \(lat\) is in \([-\phi_{max}, \phi_{max}]\) where \(\phi_{max} = 2 \arctan(e^\pi) - {\pi \over 2}\).
    Plugging these bounds into \(lon\) and \(lat\) respectively ensures that \(x'\) and \(y'\) end up in the \([0, 1]\) range. ↩︎

  2. Note that in this coordinate space each tile corresponds to \(tileSize \times tileSize\) pixel units and each pixel occupies a \(1\times1\) unit. But we can also think of it slightly differently in terms of what I would like to call 1x1 tile coordinates:
    $$ wx = x' * numTiles $$ $$ wy = y' * numTiles $$ where instead each tile corresponds to a \(1\times1\) unit. ↩︎

  3. But what’s that magic 8192? We would like to describe the normalized in-Tile coordinates with as much precision as possible.
    We use 16-bit integers for our vertex buffers, but we also use some bits for other purposes, which at the end of the day leaves us with 13 bits, and you guessed it: \(8192 = 2 ^ {13}\). ↩︎

  4. $$ \lim\limits_{(zoom - \lfloor zoom \rfloor) \to 1}\frac{2 ^ {zoom}}{2 ^ {\lfloor zoom\rfloor}} = 2 $$ ↩︎

  5. You can take a look at the implementation of this matrix here. ↩︎

  6. Using pitch and bearing which for now we assume as given. ↩︎

  7. This is basically the so-called focal distance of the camera, which depends only on the field of view and the viewport resolution. The unit here is pixels. ↩︎

  8. Another way to think about this is to just normalize everything to be in the mercator plane (i.e. divide everything by \(worldSize\)). Just imagine an XY-plane where the origin \((0, 0)\) is at the top-left corner and the furthest point \((1, 1)\) is at the bottom-right. The camera is then located somewhere above this plane (depending on its orientation) and at an exact distance of \(cameraFocalDistance / worldSize\) from the map center. ↩︎

  9. You may recall that \((0, 0)\) is located at top-left corner and \((worldSize, worldSize)\) is at bottom-right corner of our world plane. ↩︎

  10. You can see the implementation spread across a few places: transform._updateCameraState, transform._calcMatrices and FreeCamera.getWorldToCamera. ↩︎ ↩︎

  11. When using the globe projection we use pixel units, but that’s for another post. ↩︎
