Paul Tolstoi
Supervisor: Prof. Gudrun Klinker

Dipl.-Inf. Univ. David A. Plecher, M.A.

Submission Date: March 15th, 2019


There are many approaches to using Augmented Reality for historical applications. However, these are mostly limited to small rooms or objects (museums, sculptures, etc.). These applications usually use Augmented Reality only to show information in the form of text or virtual content. Missing parts of objects, especially buildings, are rarely superimposed onto the real world. This master's thesis aims to create a framework for presenting location-dependent Augmented Reality content in outdoor environments. The ultimate goal is a generic Augmented Reality application that is independent of environmental factors such as lighting or weather conditions and is not tailored to particular building structures or traits.


I tried two different approaches for this master's thesis: a neural network and pose estimation as a service.

Neural Network for Pose Estimation: PoseNet

PoseNet is an approach presented by Alex Kendall, Matthew Grimes, and Roberto Cipolla from the University of Cambridge. It is a convolutional neural network that takes an image as input and outputs a camera pose (position and rotation).

I trained the network on 1398 images of the Siegestor in Munich, taken from different angles and under different lighting and weather conditions.

Examples of different lighting and weather conditions

Position of all images

The network performed quite well. However, it was quite sensitive to different lighting conditions and especially to different cameras: for example, a simple (digital) zoom resulted in a position estimate ~10 meters away from the ground truth.
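An error such as the ~10 meters above can be quantified by comparing the estimated pose with the ground truth. The following is a minimal sketch of two common metrics, not the evaluation code of the thesis. It assumes positions are given in meters and rotations as quaternions in (w, x, y, z) order; the function names are illustrative.

```python
import math


def position_error_m(estimated, ground_truth):
    """Euclidean distance between two 3D positions (same unit as the inputs)."""
    return math.dist(estimated, ground_truth)


def rotation_error_deg(q_est, q_gt):
    """Smallest angle in degrees between two orientations.

    Both arguments are quaternions in (w, x, y, z) order (an assumption of
    this sketch). They do not have to be normalized.
    """
    norm_est = math.sqrt(sum(c * c for c in q_est))
    norm_gt = math.sqrt(sum(c * c for c in q_gt))
    dot = sum(a * b for a, b in zip(q_est, q_gt)) / (norm_est * norm_gt)
    # q and -q encode the same rotation, so use |dot|; clamp against rounding.
    dot = min(1.0, abs(dot))
    return math.degrees(2.0 * math.acos(dot))
```

Taking the absolute value of the dot product keeps the rotation error in the range of 0 to 180 degrees, regardless of which of the two equivalent quaternion signs the network outputs.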

Pose Estimation as a Service: ColMap with a webservice

The second approach was to use a Structure from Motion (SfM) application to estimate the camera position. To that end, I added a webservice to ColMap, an SfM application for sparse and dense reconstruction. The application exposes an endpoint that expects an image and either a location name or a GPS coordinate, and returns the estimated camera pose.
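ColMap's reconstruction output stores each image pose as a world-to-camera rotation quaternion (QW, QX, QY, QZ) and a translation (TX, TY, TZ), so the camera position in world coordinates is C = -R^T t. The response format of the webservice is not specified here, so the following is only a sketch, assuming the endpoint passes that convention through unchanged. The function names are illustrative.

```python
def quat_to_rotation(qw, qx, qy, qz):
    """3x3 rotation matrix (nested lists, row-major) from a unit quaternion."""
    return [
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qz * qw), 2 * (qx * qz + qy * qw)],
        [2 * (qx * qy + qz * qw), 1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qx * qw)],
        [2 * (qx * qz - qy * qw), 2 * (qy * qz + qx * qw), 1 - 2 * (qx * qx + qy * qy)],
    ]


def camera_center(qvec, tvec):
    """Camera position in world coordinates for a world-to-camera pose.

    qvec is (qw, qx, qy, qz) and tvec is (tx, ty, tz), so that a world point
    x maps to camera coordinates as R x + t. The center is C = -R^T t.
    """
    r = quat_to_rotation(*qvec)
    # Column `col` of R is row `col` of R^T.
    return [-sum(r[row][col] * tvec[row] for row in range(3)) for col in range(3)]
```

A client that needs the camera position rather than the projection parameters would apply this conversion before placing content.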

Mobile Application

The mobile application uses Unity together with AR Foundation (and the ARKit and ARCore plugins). The AR content is authored inside Unity. A helper connects to the backend and pulls the point cloud of a site so that the content for that location can be authored at the correct scale.

After launching the application, ARCore/ARKit starts tracking the world around the user. When the user hits the "Locate" button, an image is uploaded to the backend, and a pose estimate is returned. The corresponding content is then instantiated at that estimated pose.
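Placing the content comes down to aligning two coordinate frames: the frame of the reconstructed site, in which the backend estimates the camera pose, and the frame of the AR session, in which ARCore/ARKit tracks the same camera. The following is a minimal sketch of that alignment, not the code of the application. It assumes both camera poses are camera-to-frame 4x4 rigid transforms taken at the moment the image was captured, and that both frames share handedness and metric scale. All names are illustrative.

```python
def matmul(a, b):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t):
    """Invert a rigid transform [R | p]; the inverse is [R^T | -R^T p]."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(rt[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [rt[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def session_from_site(session_from_camera, site_from_camera):
    """Transform that maps site coordinates into AR session coordinates.

    session_from_camera: camera pose tracked by the AR session.
    site_from_camera: camera pose estimated by the backend for the same image.
    """
    return matmul(session_from_camera, invert_rigid(site_from_camera))


def transform_point(t, point):
    """Apply a 4x4 rigid transform to a 3D point."""
    return [sum(t[i][k] * point[k] for k in range(3)) + t[i][3] for i in range(3)]
```

Content authored at a position in site coordinates would then be placed at `transform_point(session_from_site(...), position)`. Because the alignment is computed once per localization, any later tracking drift in the AR session moves the content with it.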

Image being uploaded to the backend

Image overlay of the destroyed Siegestor (back side)

Image overlay of the destroyed Siegestor (front side)

However, since ARCore/ARKit is primarily used for indoor AR applications, the AR content sometimes starts drifting.


Nowadays most people in the western world have a smartphone capable of Augmented Reality (AR). To take advantage of this fact, the goal of this thesis was to create a framework for showing location-based AR content in an outdoor environment. The tracking and augmentation should withstand the difficult lighting and weather conditions caused by the outdoor setting.

I investigated two different approaches to create a prototype. The first attempt was to use a deep convolutional neural network that accepts a single image and outputs a camera pose estimate. However, this method was not reliable enough, since it was susceptible to different weather or lighting conditions as well as to different cameras. The second attempt was to use Structure from Motion (SfM) for camera pose estimation. Because SfM involves very heavy computation, it could not be done on the mobile device alone. For this reason, I extended ColMap, an application that uses SfM, with a webservice that the mobile app can use to localize the user. This approach was then tested in a small user study.

As the testing during implementation and the user study showed, the overall approach seems quite promising. The participants liked the application and found it easy to use. Furthermore, some of them wanted to see even more content. However, there are still some teething problems to be fixed. The localization provided by ColMap is relatively stable, yet the tracking provided by ARCore/ARKit sometimes becomes unstable, resulting in displaced AR content. Additionally, the backend can only process a single pose estimation at a time, which prevents the application from properly supporting multiple simultaneous users.

