search engines amsterdam (May 24)

published on 2019-05-25

Search Engines Amsterdam is a monthly meetup for search engine technology and related subjects. Two presentations are given at every event, one by someone from industry and one academic. Even though they were not strictly on search technology this time, the talks presented on Friday the 24th of May were so interesting that we present a brief recap here.

neural video compression

The first talk was given by Ties van Rozendaal, who works on media compression at Qualcomm in Amsterdam. Their team has been working on a video compression strategy as an alternative to techniques currently employed in popular codecs like VP9, HEVC and H264, which are already magic. These codecs mostly rely on frequency domain transforms, chroma subsampling and motion prediction. In a nutshell, they get most of their compression from throwing away information about finer details in texture (quantization in the frequency domain) and color resolution which humans don't easily pick up on. Motion prediction exploits the fact that subsequent frames are often similar, further reducing the required information to reconstruct a frame.

The approach used by Rozendaal's team is wholly different. They use so called autoencoders to compress the information in a frame to a vector of numbers that is significantly smaller than the input. Autoencoders consist of two neural networks, an encoder network and a decoder network. They are trained in tandem by feeding the network frames on the encoder side, passing the resulting code to the decoder and comparing the decoded results to the original frames. By training both these networks on a large corpus of video, they learn what our world looks like, to varying degrees. Simple autoencoders may only learn about simple textures and edges. More complex autoencoders may learn what faces look like, or how shadows work. The output, or code, produced by such a autoencoders can then be thought of as a description of the scene.

It's clear how this can be used for compression. Let's consider the following description of a scene.

adorable tabby kitten napping on black keyboard

It is only 47 bytes long and you could probably reconstruct a scene from this short description alone. The traditionally compressed image of said kitten on the other hand is about 70,000 bytes long. That is a compression ratio of approximately 1500! Your reconstruction based on the description may not be quite as detailed, but it's not bad for such a short code. The point here is to illustrate that we can use knowledge about the world to create efficient codes.

adorable tabby kitten napping on black keyboard

To make the compression even more efficient, they used so-called variational autoencoders. Such systems don't simply learn how to transform images into codes and vice-versa, but how to encode them as probability distributions. By enforcing similarity to a known probability distribution on the latent space, these values can be further compressed using arithmetic codes. Such codes exploit the fact that some values are more likely to occur than others and hence shorter code words can be used for them. Morse code uses a very similar strategy; it uses only a single dot for the letter E which occurs frequently, while the more rarely used numerical digit 0 requires a whopping five dashes.

The architecture in its entirety looks like this.

000111010100101010000111010100101010EncoderDecoderOutputInputLatent spaceArithmetic Code

Their models can take advantage of temporal information in two ways. One is to simply batch multiple frames together and process them as three dimensional data. An other is using optical flow, which contains information about the patterns of motion between frames. Combined with previous frames, a frame can be reconstructed much more efficiently.

The performance of such models is already competitive with popular codes like h264, but not quite at the level of the state of the art yet. They definitely show promise and their performance will likely improve rapidly given the current pace of innovation in the deep learning space. One possible future direction for improvement that was briefly discussed during Q&A is to create specialized models for a specific (set of) video and transfer these as part of the payload.

hyperspherical prototype networks

Pascal Mettes presented the second talk on Hyperspherical Prototype networks. Mettes is an assistant professor for the computer vision group at the University of Amsterdam. In a recent paper on the subject, he and his coauthors propose a unified approach to regression and classification networks which it makes it easier to train networks that do both.

One of the main problems with training a network for both regression and classification tasks is that the most commonly used loss functions for these tasks, mean squared errors and cross entropy loss respectively, work on completely different scales. It has been an onerous task to combine the two loss functions in a way that makes networks do well on both tasks.

By projecting the output space to hyperspheres, the same loss function for both tasks can be used and training for both becomes trivial. A hypersphere is a generalization of circles and spheres to arbitrary dimensions. The defining property of hyperspheres is that all points have a fixed distance to the origin. So a hypersphere of dimension nn is the set of all vectors (a1,,an)(a_1, \dots, a_n) such that i=1nai2=1\sum_{i = 1}^n a_i^2 = 1. For n=2n = 2, that's the unit circle. In three dimensions it forms the unit sphere. Beyond that, our earthly imagination fails, but the mathematics ends up working just the same.

Training networks for classification is then done by considering the angle between the inferred projection on the hypersphere and the location of its label. Below is an example on a two dimensional hypersphere where some sample xx should be assigned to class DD. The loss is then defined by the cosine of the angle α\alpha between xx and DD.


Regression is done very similarly. We pick an (unused!) axis on the hypersphere and interpolate between the extreme values on that axis. This does require one to know the bounds of the output variable, but since out-of-bound values don't work very well on any network this is not a large drawback. The loss function is again a cosine, this time of the difference between the expected angle and the inferred angle.

What remains for classification is to choose the location of the labels, or prototypes, on the hypersphere. Ideally, we'd do this in a way that maximizes the margin between prototypes, or, in other words, evenly spreads the class prototypes over the surface of a hypersphere. The problem is: doing this is an open problem in mathematics! Even on a regular sphere, it gets quite intricate.

So aren't they simply replacing one intractable problem for another then? Not exactly. Engineering a good trade-off between classification loss and regression loss on regular networks requires a lot of hand tweaking for every new problem. The prototype selection problem is more universal; we can solve it once and apply it for almost every network.

The authors propose the following approach: find an approximate solution to the optimization problem that maximizes the minimum distance between any two points. Apparently starting by sampling an nn dimensional standard normal distribution and normalizing them is a good start since it produces an uniform distribution over the surface. This is a non trivial result as the connection between normal distributions and (hyper)spheres is not immediately evident. A good subject for another post, perhaps!

Other than being able to easily train a network for both classification and regression, Mettes' approach has other advantages. The classes are well separated by construction, which helps in the case of uneven data or few shot learning. Further, it enables the use of prior knowledge since we can place prototypes of classes that are semantically similar closer to each other on the hypersphere. This can improve the performance of the network significantly on smaller datasets since it doesn't have to learn these relations first.