How Google Earth (Really) Works
One of the original creators of Google Earth explains how it works
I originally posted a version of this story in 2007 and have added a few updates for 2020. If you are technically inclined, you may want to read the patents that protect these ideas: Asynchronous Multilevel Texture Pipeline and Server for geospatially organized flat file data. [Note: Michael Jones, Chris Tanner, Phil Keslin, David Kornmann, John Hanke, and more contributed to Google Earth in different ways and currently work at Niantic (the makers of Pokemon Go).]
We’re going to proceed in reverse, strange as it may seem, from the instant the 3D Earth is drawn on your screen and later trace back to the time the data is served. I believe this will help explain why things are done as they are and why some other approaches don’t work nearly as well.
Part 1, The Result: Drawing a 3D Virtual Globe
There are two principal differences between Google Maps and Google Earth that inform how things should ideally work under the hood. The first is the difference between fixed-view (often top-down) 2D and free-perspective 3D rendering. The second is between real-time and prerendered graphics. These two distinctions are fading away as the products improve and converge. As of today, you can jump between 2D and 3D in the same webpage with just a click.
What both have in common is that they begin with traditional digital photography — lots of it: basically one giant high-resolution (or multiresolution) picture of the Earth. How they differ is largely in how they render that data.
Consider: The Earth is approximately 40,000 km around the equator. Whoever says it’s a small world is being cute. If you stored only one pixel of color data for every square kilometer of surface, a whole-earth image (flattened out in, say, a Mercator projection) would be about 40,000 pixels wide and roughly half as tall. That’s far more than most 3D graphics hardware can handle today. We’re talking about an image of 800 megapixels and 2.4 gigabytes at least. PCs in 2000 had 56k modems and were only beginning to commonly have GPUs for 3D rendering. Even in 2020, only high-end gaming PCs could handle this amount of information at once without some clever optimizations — and this is only for the “base” map. Once you zoom in, you’re talking about up to terabytes of information to manage.
This “base” map is your basic run-of-the-mill one-kilometer-per-pixel whole-earth image. The smallest feature you could resolve with such an image is about 2 kilometers wide (thank you, Mr. Nyquist) — no buildings, rivers, roads, or people would be apparent. But for most major U.S. cities, Google Earth deals in resolutions that can resolve objects as small as half a meter or less, at least 4,000 times denser, or 16 million times more storage than the above example.
We’re talking about images that would (and do) literally take many terabytes to store. There is no way that such a thing could ever be drawn on today’s PCs, especially not in real-time.
And yet it happens every time you run Google Earth.
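The back-of-the-envelope numbers above are easy to check. Here is a minimal sketch, assuming uncompressed 24-bit color (3 bytes per pixel) and a flat 2:1 image; the function name and constants are mine, for illustration only.

```python
EQUATOR_KM = 40_000      # approximate circumference, as quoted in the text
BYTES_PER_PIXEL = 3      # assumption: uncompressed 24-bit RGB


def whole_earth_image(meters_per_pixel):
    """Return (width, height, pixel_count, byte_count) for a 2:1 whole-earth image."""
    width = round(EQUATOR_KM * 1000 / meters_per_pixel)
    height = width // 2
    pixels = width * height
    return width, height, pixels, pixels * BYTES_PER_PIXEL


base = whole_earth_image(1000)    # one pixel per kilometer
fine = whole_earth_image(0.25)    # two samples per half-meter feature (Nyquist)

linear_ratio = fine[0] // base[0]     # how many times denser along one axis
storage_ratio = fine[3] // base[3]    # how many times more storage

print(f"base map: {base[0]} x {base[1]} = {base[2] / 1e6:.0f} megapixels, "
      f"{base[3] / 1e9:.1f} GB")
print(f"half-meter detail: {linear_ratio}x denser, {storage_ratio}x the storage")
```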
Consider: In a true 3D virtual globe, you can arbitrarily tilt and rotate your view to pretty much look anywhere (except perhaps underground — and even that would be possible if we had the data). In all 3D globes, there exists some source data, typically a high-resolution image of the whole Earth’s surface, or at least the parts for which the company bought data. That source data needs to be delivered to your monitor, mapped onto some virtual sphere or ideally onto small 3D surfaces (triangles, etc.) that mimic the real terrain, mountains, rivers, and so on.
If you, as a software designer, decide not to allow your view of the Earth to ever tilt or rotate, then congrats, you’ve simplified the engineering problem and can take some time off. But then you don’t have Google Earth.
Various schemes exist to allow one to “roam” part of this ridiculously large texture. Other mapping applications solve this in their own way and often with significant limitations or visual artifacts. Most of them simply cut their huge Earth up into small regular tiles, perhaps arranged in a quadtree, and draw a certain number of those tiles on your screen at any given time, either in 2D (like Google Maps) or in 3D, like Microsoft’s Virtual Earth apparently does.
But the way Google Earth solved the problem was truly novel and worthy of a software patent (and I am generally opposed to software patents). To explain it, we’ll have to build up a few core concepts. A background in digital signal theory and computer graphics never hurts, but I hope this will be easy enough that that won’t be necessary.
I’m not going to explain how 3D rendering works — that’s covered elsewhere. But I am going to focus on texture mapping, and texture filtering in particular, because the details are vital to making this work. The progression from basic concepts to the more advanced texture filtering will also help you understand why things work this way and just how amazing this technology really is. If you have the patience, here’s a quick lesson in texture filtering.
The problem of scaling, rotating, and warping basic 2D images was solved a long time ago. The most common solution is called bilinear filtering. All that really means is that for each new (rotated, scaled, etc.) pixel you want to compute, you take the four “best” pixels from your source image and blend them together. It’s bilinear because it linearly blends two pixels at a time (along one axis) and then linearly blends those two results (along the other axis) for the final answer.
[A “linear blend,” in case it’s not clear, is simple: take 40% of color A and 60% of color B and add them together. The 40/60 split is variable, depending on how “important” each contributor is, as long as the total adds up to 100%.]
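In code, the two ideas look roughly like this. It is a sketch on a tiny grayscale image stored as a list of rows; real hardware does the same arithmetic per color channel, and the function names here are my own.

```python
import math


def lerp(a, b, t):
    """Linear blend: (1 - t) of a plus t of b, with t between 0 and 1."""
    return a * (1.0 - t) + b * t


def bilinear_sample(image, x, y):
    """Sample a grayscale image (list of rows) at fractional pixel coordinates."""
    height, width = len(image), len(image[0])
    # Clamp so that samples at or beyond the border reuse the edge pixels.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    tx, ty = x - x0, y - y0
    top = lerp(image[y0][x0], image[y0][x1], tx)       # blend along x, upper row
    bottom = lerp(image[y1][x0], image[y1][x1], tx)    # blend along x, lower row
    return lerp(top, bottom, ty)                       # blend the two along y


tiny = [[0, 100],
        [100, 200]]
center = bilinear_sample(tiny, 0.5, 0.5)   # equal parts of all four pixels
```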
That functionality is built into your 3D graphics hardware such that your computer can do literally billions of these calculations per second. Don’t ask me why your favorite paint program is so slow.
The problem being addressed can be visualized pretty easily — that’s what I love about computer graphics. It turns out, whenever we map some source pixels onto different (rotated, scaled, tilted, etc.) output pixels, visual information is lost.
The problem is called “aliasing,” and it occurs because we digitally sampled the original image one way, at some given frequency (aka resolution), and now we’re re-sampling that digital data in some other way that doesn’t quite match up.
1. A simple low-res (11×11 pixel) image is about to be rotated. (The grid lines are merely to delineate pixels.)
2. Close up of one output pixel. Bilinear interpolation averages the “best” four source pixels for each new destination pixel (shown as a black border with white dots) based on their relative importance (ideally: fractional area).
3. Each pixel in the destination grid overlaps multiple pixels from the rotating original.
4. After bilinear interpolation, the resulting rotated image has some clear (or rather blurry) issues.
Another source of aliasing and scintillation is that we may ask for linear blending while our texture maps are often encoded as images in a “gamma-corrected” color space, intended to match monitors better. Do not blend in gamma color space; it gives poor results. Blend in linear color space and apply gamma correction only at the end.
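To make the difference concrete, here is a small sketch that blends black and white 50/50 both ways. It assumes the texture is encoded with the standard sRGB transfer curve, which the article does not specify; values are normalized to the 0..1 range.

```python
def srgb_to_linear(c):
    """Decode an sRGB-encoded value (0..1) to linear light."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def linear_to_srgb(c):
    """Encode a linear-light value (0..1) back to sRGB."""
    if c <= 0.0031308:
        return 12.92 * c
    return 1.055 * c ** (1 / 2.4) - 0.055


def blend_in_gamma_space(a, b, t):
    """The shortcut: blend the encoded values directly."""
    return a * (1.0 - t) + b * t


def blend_in_linear_space(a, b, t):
    """Decode, blend in linear light, then re-encode at the end."""
    mixed = srgb_to_linear(a) * (1.0 - t) + srgb_to_linear(b) * t
    return linear_to_srgb(mixed)


naive = blend_in_gamma_space(0.0, 1.0, 0.5)
correct = blend_in_linear_space(0.0, 1.0, 0.5)
print(f"gamma-space blend: {naive:.3f}, linear-space blend: {correct:.3f}")
```

The gamma-space result comes out noticeably darker than the linear-space one, which is why thin bright details tend to dim or flicker when they are filtered the wrong way.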
Now, when we talk about output pixels and destinations, it doesn’t much matter if the destination is a bitmap in a paint program or the 3D application window that shows the Earth. Aliasing happens whenever the output pixels do not line up with the sampling interval (frequency, resolution) of the source image. And aliasing makes for poor visual results. Dealing with aliasing is about half of what texture mapping is all about. The rest is mostly memory management. And the constraints of both inform how Google Earth works.
The mission then is to minimize aliasing through cleverness and good design. The best way to do this is to get as close as possible to a 1:1 correspondence between input and output pixels, or at least to generate so many extra pixels that we can safely down-sample the output to minimize aliasing (also known as “anti-aliasing”). We often do both.
Consider: Resizing images only makes it worse. Each pixel in your destination image might correspond to hundreds of pixels of source imagery or vice versa. Bilinear interpolation, remember, will only pick the best four source pixels and ignore the rest, so it can skip right over important pixels, like edges, shadows, or highlights. If some such pixel is picked for blending during one frame and skipped over in the next, you’ll get an ugly “pixel-popping” or scintillation effect. I’m sure you’ve seen it in some video games. Now you know why.
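A one-dimensional toy shows the popping effect. This sketch shrinks a row of alternating black and white pixels by half, first by skipping samples (what happens when a filter ignores most of the source) and then by averaging everything under each output pixel. The helper names are illustrative.

```python
def point_sample(signal, step, phase):
    """Keep every `step`-th sample starting at `phase`; everything else is skipped."""
    return signal[phase::step]


def box_filter(signal, step):
    """Average each run of `step` samples, so that no source pixel is ignored."""
    return [sum(signal[i:i + step]) / step
            for i in range(0, len(signal) - step + 1, step)]


stripes = [0, 255] * 8                    # alternating black and white pixels

frame_a = point_sample(stripes, 2, 0)     # happens to land on black pixels only
frame_b = point_sample(stripes, 2, 1)     # a one-pixel shift lands on white only
steady = box_filter(stripes, 2)           # mid-gray regardless of the shift
```

A one-pixel shift flips the skipped-sample result from solid black to solid white, which is the flicker you see on screen. The averaged version stays gray.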
Tilting images (or any 3D transformation) is even more problematic because now we have not only elements of scaling and rotation but also a great variation in pixel density across rendered surfaces. For example, in the “near” part of a scene, your nice high-res ground image might be scaled up such that the pixels look blurry. In the “far” part of the scene, your image might appear scintillated (as above) because simple 2×2 bilinear interpolation is necessarily skipping important visual details from time to time.
Here’s an example of where a certain kind of texture filtering causes poor results. The text labels are hardly readable. (Why they’re painted into the terrain image at all is another issue.)
Better Filtering, Revealed
Most consumer 3D hardware already supports what’s called “tri-linear” filtering. With tri-linear and a closely coupled technique called mip-mapping, the hardware computes and stores a series of lower-resolution versions of your source image or texture map. Each mip-map is automatically down-sampled by a factor of 2, repeatedly, until we reach a 1×1 pixel image whose color is the average of all source image pixels.
So, for example, if you provided the hardware with a nice 512×512 source image, it would compute and store 9 extra mip-levels for you (256, 128, 64, 32, 16, 8, 4, 2, and 1-pixel square). If you stacked those vertically, you might more easily visualize the “mip-stack” as an upside-down pyramid, where each mip-level (each horizontal slice) is always 1/2 the width of the one above.
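The chain of levels is simple to enumerate. A quick sketch, assuming a square power-of-two texture:

```python
def mip_chain(size):
    """List the widths of the extra mip-levels below a square texture of `size`."""
    levels = []
    while size > 1:
        size //= 2
        levels.append(size)
    return levels


chain = mip_chain(512)
print(len(chain), "extra levels:", chain)
```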
During 3D rendering, mip-mapping and tri-linear filtering take each destination pixel, pick the two most appropriate mip-levels, essentially do a bilinear blend on both, and then blend those two results again (linearly) for the final tri-linear answer.
Say the next pixel would have no aliasing if only the source image had a resolution of 47.5 pixels across. The system has stored the power-of-two mip-maps (16, 32, 64…). So the hardware will cleverly use the 64×64 and 32×32 pixel versions, the two closest to the desired sampling of 47.5, compute a bilinear (4-sample) result for each, and then take those two results and blend them a third time.
That’s tri-linear filtering in a nutshell, and along with mip-mapping, it goes a great distance to minimizing aliasing for many common cases of 3D transformations.
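Here is a sketch of the level-selection step for the 47.5-pixel example. It assumes a power-of-two base texture and the common convention of measuring level of detail on a log2 scale, with the fractional part used as the blend weight. Treat that weighting as an assumption about typical hardware, not a statement about any specific GPU.

```python
import math


def pick_mip_levels(base_size, desired_size):
    """Return (finer_width, coarser_width, weight_toward_coarser)."""
    max_lod = int(math.log2(base_size))
    lod = math.log2(base_size / desired_size)
    lod = min(max(lod, 0.0), float(max_lod))     # stay inside the pyramid
    fine = int(math.floor(lod))
    coarse = min(fine + 1, max_lod)
    return base_size >> fine, base_size >> coarse, lod - fine


def trilinear(sample_fine, sample_coarse, weight):
    """Third blend: mix the two bilinear results from the chosen levels."""
    return sample_fine * (1.0 - weight) + sample_coarse * weight


fine_size, coarse_size, weight = pick_mip_levels(512, 47.5)
print(fine_size, coarse_size, round(weight, 2))
```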
So far, we’ve been talking about nice, small images, like 512×512 pixels. Our whole-earth image will need to be millions of pixels across. So one might consider making a giant mip-map of our whole earth image, at say 1 meter resolution. No problem, right? But you’ll realize fairly soon that it would require a mip-map pyramid 26 levels deep, where the highest resolution mip-level is some 67 million pixels across. That simply won’t fit on any 3D video card on the market, at least not in this decade.
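The depth of that pyramid follows from rounding the Earth’s circumference in meters up to the next power of two. A quick check, assuming the 40,000 km figure used earlier:

```python
EQUATOR_M = 40_000 * 1000     # about 40 million one-meter pixels across

# Smallest exponent n such that 2**n >= EQUATOR_M.
levels = (EQUATOR_M - 1).bit_length()
top_width = 2 ** levels

print(f"2^{levels} = {top_width:,} pixels across at the finest level")
```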
I’m guessing Microsoft’s Virtual Earth gets around this limit by cutting their giant earth texture into many smaller distinct tiles of, say, 256 pixels square, where each gets mip-mapped individually. That approach would work to an extent, but it would be relatively slow and give some of the visual artifacts, like the blurring we see above, and a popping in and out of large square areas as you zoom in and out.
There’s one last concept about mip-maps to understand before we move on to the meat of the issue. Imagine for a moment that the pixels in the mip-map pyramid are actually color coded as I’ve indicated above, with an entire layer colored red, another yellow, etc. Drawing this on a tilted plane (like the Earth’s ground “plane”) would then seem to “slice through” the pyramid at an interesting angle, using only those parts of the pyramid that are needed for this view.
It’s this characteristic of mip-mapping that allows Google Earth to exist, as we’ll see in a minute.
The example on the left shows a normal 3D scene from Google Earth, as well as a rough diagram showing from where in the mip-stack a 3D hardware system might find the best source pixels, if they were so colorized.
The nearer area gets filled from the highest-resolution mip-level (red), dropping off to lower and lower resolutions as we get farther from the virtual point of view. This helps avoid the scintillation and other aliasing problems we talked about earlier and looks quite nice. We get as close as possible to a 1:1 correspondence between source and destination, pixel for pixel, so aliasing is minimized.
Even better still, tri-linear filtering 3D graphics hardware has been improved with something called anisotropic filtering (a simple preference option in Google Earth), which is essentially the same core idea as the previous examples but using non-square filters beyond the basic 2×2. This is important for visual quality because even with fancy mip-mapping, if you tilt a textured polygon to an oblique angle, the hardware must choose a low-resolution mip-level to avoid scintillation on the narrow axis. And that means the whole polygon is sampled at too low a resolution, when it’s only one direction that needed to dip down to the low-res stuff. Suffice it to say, if your hardware supports anisotropic filtering, turn it on for the best results. It’s worth every penny.
Now, to the meat of the issue
We still have to solve the problem of how to mip-map a texture with millions of pixels in either dimension. “Universal Texture” (in the Google Earth patent) solves the problem while still providing high-quality texture filtering. It creates one giant multi-terabyte whole-earth virtual texture in an extremely clever way. I can say that since I didn’t actually invent it. Chris Tanner figured out a way to do on your PC what had only ever been done on expensive graphics supercomputers with custom circuitry, called clip mapping (see SGI’s paper, also by Chris, Michael, et al., for a lot more depth on the original hardware implementation). That technology is essentially what made Google Earth possible. And my first job on this project was making that work over an internet connection way back when.
So how does it actually work?
Well, instead of loading and drawing that giant whole-earth texture all at once — which is impossible on most current hardware — and instead of chopping it up into millions of tiles and thereby losing the better filtering and efficiency we want, recall from just above that we typically only ever use a narrow slice or column of our full mip-map pyramid at any given time. The angle and height of this virtual column changes quite a bit depending on our current 3D perspective. And this usage pattern is fairly straightforward for a clever algorithm to compute or infer, knowing where you are and what the application is trying to draw.
Universal Texture is both a mip-map and a software emulated clip-stack, meaning it can mimic a mip-map of many more levels and greater ultimate resolution than can fit in any real hardware.
Note: Though this diagram doesn’t depict it as precisely as the paper, the clip stack’s “angle” shifts around to best keep the column centered.
So this clever algorithm figures out which sections of the larger virtual texture it needs at any given time and pages only those from system memory to your graphics card’s dedicated texture memory, where it can be drawn efficiently, even in real time.
The main modification to basic mip-mapping, from a conceptual point of view, is that the upside-down pyramid is no longer just a pyramid but is now much, much taller, containing a clipped stack of textures—called, oddly enough, a “clip stack”—perhaps 16 to 30+ levels high. Conceptually, it’s as if you had a giant mip-map pyramid that’s 16–30 levels deep and millions to billions of pixels wide, but you clipped off the sides — i.e., the parts you don’t need right now.
Imagine the Washington Monument, upside down, and you’ll get the idea. In fact, imagine that tower leaning this way or that, like the one in Pisa, and you’ll be even closer. The tower leans in such a way that the pixels inside the tower are what you need for rendering right now. The rest is ignored.
Each clip-level is still twice the resolution of the one “below” it, like all mip-maps, and nice quality filtering still works as before. But since the clip stack is limited to a fixed but roaming footprint, say 512×512 pixels wide (another preference in Google Earth), that means that each clip-level is both twice the effective resolution and half the coverage area of the previous one. That’s exactly what we want. We get all the benefits of a giant mip-map, with only the parts relevant to any given view.
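To illustrate the roaming footprint, here is a sketch that computes which 512×512 window of each level would be resident for a given focal point. It is not the patented implementation. It assumes a square virtual texture whose level `n` is `2**n` texels wide, and the function names are hypothetical.

```python
CLIP_SIZE = 512     # footprint width in texels, as in the example above


def clip_window(level, u, v):
    """Top-left texel of the resident window for a focal point (u, v) in 0..1."""
    full = 2 ** level
    if full <= CLIP_SIZE:
        return (0, 0)            # ordinary mip-level: it fits entirely

    def origin(t):
        center = int(t * full)
        # Center the window on the focal point, but keep it inside the level.
        return min(max(center - CLIP_SIZE // 2, 0), full - CLIP_SIZE)

    return origin(u), origin(v)


def coverage_fraction(level):
    """Fraction of the level's full width that the window covers."""
    return min(1.0, CLIP_SIZE / 2 ** level)


for level in (9, 10, 11, 12):
    print(level, clip_window(level, 0.5, 0.5), coverage_fraction(level))
```

Each step up doubles the effective resolution and halves the covered width, while the memory footprint per level stays fixed.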
Put another way, Google Earth cleverly and progressively loads high-res information for what’s at the focal “center” of your view (the red part above), and resolution drops off by powers of two from there. As you tilt and fly and watch the land run toward the horizon, Universal Texture is optimally sending only the best and most useful levels of detail to the hardware at any given time. What isn’t needed isn’t even touched. That’s one thing that makes it ultra efficient.
It’s also memory efficient. The total texture memory for an earth-sized texture is now (assuming this 512 wide base mip-map, and say 20 extra clip-levels of data) only about 17 megabytes, not the dozens to hundreds of terabytes threatened before. It’s actually doable and worked in 1999 on 3D hardware that had only 32 MB or less. Other techniques are only now becoming possible with bigger and bigger 3D cards.
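That figure can be reproduced with a few lines. The sketch assumes uncompressed 24-bit texels (3 bytes each), which is my assumption; actual formats and compression would change the total.

```python
CLIP_SIZE = 512         # width of the base mip-map and of every clip-level
CLIP_LEVELS = 20        # extra clip-levels stacked on top of the base pyramid
BYTES_PER_TEXEL = 3     # assumption: uncompressed 24-bit RGB


def pyramid_texels(size):
    """Texels in a full square mip pyramid from `size` down to 1x1."""
    total = 0
    while size >= 1:
        total += size * size
        size //= 2
    return total


base_bytes = pyramid_texels(CLIP_SIZE) * BYTES_PER_TEXEL
clip_bytes = CLIP_LEVELS * CLIP_SIZE * CLIP_SIZE * BYTES_PER_TEXEL
total_bytes = base_bytes + clip_bytes

print(f"{total_bytes / 1e6:.1f} MB of texture memory")
```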
In fact, with only 20 clip-levels (plus 9 mip-levels for the base pyramid), we see that 2^29 yields a virtual texture capable of up to 536 million pixels in either dimension. Halving that vertically gives a virtual image of roughly 144 petapixels in area, or enough excess capacity to represent features as small as 0.15 meters (about 6 inches), wherever the data is available. And that’s not the actual limit. I simply picked 20 clip-levels as a reasonable number. And you thought the race for more megapixels on digital cameras was challenging. Multiply that by a million and you’re in the planetary ballpark.
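The arithmetic in this paragraph can be checked directly. The sketch assumes the 40,000 km circumference from earlier, a 2:1 image, and the Nyquist rule of thumb that the smallest resolvable feature spans two pixels.

```python
EQUATOR_M = 40_000_000      # approximate circumference in meters

levels = 20 + 9             # clip-levels plus base mip-levels
width = 2 ** levels         # virtual texels across
height = width // 2         # half as tall for a 2:1 whole-earth image
area_px = width * height

meters_per_pixel = EQUATOR_M / width
smallest_feature_m = 2 * meters_per_pixel     # Nyquist: two samples per feature

print(f"width: {width:,} px, area: {area_px / 1e15:.0f} petapixels")
print(f"{meters_per_pixel:.3f} m per pixel, features down to "
      f"{smallest_feature_m:.2f} m ({smallest_feature_m / 0.0254:.1f} in)")
```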
Fortunately, for now, Google only really has to store a few dozen terapixels of imagery. The other beauty of the system is that the highest levels of resolution need not exist everywhere for this to work. Wherever the resolution is more limited, wherever there are gaps, missing data, etc., the system only draws what it has. If there is higher resolution data available, it is fetched and drawn too. If not, the system uses the next lower resolution version of that data (see mip-mapping above) rather than drawing a blank. That’s exactly why you can zoom into some areas and see only a big blur, where other areas are nice and crisp. It’s all about data availability, not any hard limit on the 3D rendering. If the data were available, you could see centimeter resolution in the middle of the ocean.
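The fallback behavior can be sketched as a walk toward coarser levels until something is found. This is only an illustration of the idea. The tile addressing scheme (level, x, y, with each coarser level halving the coordinates) and the in-memory set are hypothetical, not Google Earth’s actual data layout.

```python
def best_available(tiles, level, x, y):
    """Return the finest stored tile covering (level, x, y), or None."""
    while level >= 0:
        if (level, x, y) in tiles:
            return (level, x, y)
        # Not stored: step to the next lower resolution covering the same spot.
        level, x, y = level - 1, x // 2, y // 2
    return None


# A made-up store: whole-earth coverage at level 0, detail in a few places.
stored = {(0, 0, 0), (1, 1, 0), (2, 3, 1)}

crisp = best_available(stored, 3, 7, 2)      # detail exists one level up
blurry = best_available(stored, 3, 0, 7)     # only the coarsest level covers this
```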
The key then to making this all work is that, as you roam around 3D Earth, the system can efficiently page new texture data from your local disk cache and system memory into your graphics texture memory. (We’ll cover some of how stuff gets into your local cache next time.) You’ve literally been watching that texture uploading happen without necessarily realizing it. Hopefully, you will now appreciate all the hard work that went into making it run so smoothly, like feeding an entire planet piecewise through a straw.
Finally, there’s one other item of interest before we move on. The reason this patent emphasizes asynchronous behavior is that these texture bits take some small but cumulative time to upload to your 3D hardware continuously, and that’s time taken away from drawing 3D images in a smooth, jitter-free fashion or handling user input responsively, not to mention that the hardware is typically busy with its own demanding schedule.
To achieve a steady 60 frames per second on most hardware, the texture uploading is divided into small, thin slices that quickly update graphics video memory with the source data for whatever area you’re viewing, hopefully just before you need it but at worst just after. What’s really clever is that the system only uploads the smallest parts of these textures that are needed and it does it without making anyone wait. That means rendering can be smooth and the user interface can be as fluid as possible. Without this asynchronicity, forget about those nice parabolic arcs from coast to coast.
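The slicing idea can be sketched as a queue that is drained a little on every frame. This toy version budgets by row count rather than by measured time and uses a plain list in place of GPU memory; both are simplifications of mine, and the class is hypothetical.

```python
from collections import deque


class SlicedUploader:
    """Spread texture uploads across frames so that no single frame stalls."""

    def __init__(self, rows_per_frame):
        self.rows_per_frame = rows_per_frame
        self.pending = deque()     # thin slices waiting to be uploaded
        self.uploaded = []         # stands in for GPU texture memory

    def request(self, texture_id, rows):
        """Queue a texture update as individual row-sized slices."""
        for row in range(rows):
            self.pending.append((texture_id, row))

    def tick(self):
        """Call once per frame: upload at most the per-frame budget."""
        count = min(self.rows_per_frame, len(self.pending))
        for _ in range(count):
            self.uploaded.append(self.pending.popleft())
        return count


uploader = SlicedUploader(rows_per_frame=64)
uploader.request("clip-level-12", rows=512)

frames = 0
while uploader.pending:
    uploader.tick()     # rendering and input handling would run here too
    frames += 1
```

With this budget, a full 512-row update is spread over eight frames instead of blocking one of them.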
Now, other virtual globes can also virtualize the whole-earth texture; perhaps they cut it into tiles and even use multiple power-of-two resolutions as GE does. But without the Universal Texturing component or something better, they’ll either be limited to 2D top-down rendering, or they’ll do 3D rendering with unsatisfying results: blurring, scintillation, and not nearly as good performance for streaming the data from the cache into texture memory for rendering.
And that’s probably more than you ever wanted to know about how the whole Earth is drawn on your screen.