Using Rosetta with UTM and Docker

For a recent robotics project I had to run a ROS/Gazebo pipeline that was based on a pre-built x86/amd64 docker image.

I do most of my local development work on an M1 Mac. The M1 is an ARM based processor, so the question was how to get the stack to run. Unfortunately Docker for Mac was not an option, because it wouldn’t run Gazebo properly. Therefore I knew that I had to go with Linux and somewhere between the OS and the Docker I needed something (qemu or similar) to go from the x86 to the ARM instruction set.

The real breakthrough was the realization that if you use Apple’s virtualization framework to run a Linux VM you can use Rosetta (Apple’s own processor translation layer) within that VM and this this blogpost helped tremendously.

Maybe I will do a follow-up on how exactly I made it work, but the basic setup is: Debian12 in a VM (utilizing Apple Virtualization in UTM https://getutm.app) and then execute the x86/amd64 docker image with Rosetta and I am still amazed that this works.

How Natural Interaction can Bootstrap Self-Supervised Learning

This has been in the works for quite some time now. If you find the ideas teased in the video interested take a look at the accompanying paper [1], to be presented at this years ICLR.

 

[1] A. Aubret*, M. R. Ernst*, C. Teuliere, and J. Triesch, “Time to augment self-supervised visual representation learning” in International Conference on Learning Representations (ICLR) (2023).

* Equal contribution

Uncertainty Estimation with Gaussian Processes

Gaussian Procress Regression (GPR) is a powerful tool to do uncertainty estimation for all kinds of problems including state estimation in robotics or machine learning of geospatial data. It enables you to get an uncertainty estimation given new incoming data and is related to the concept of Kalman filtering that was briefly mentioned on this blog before. In this post I show the basic building blocks of this method and how to code a GPR yourself using Python.

Background

Formally, a Gaussian process is a stochastic process and can be understood as a distribution over functions. The premise of this approach is that the function values themselves are random variables. When modeling a function as a Gaussian process, one makes an assumption that any finite number of sampled points form a multivariate normal distribution.

Thus, f(x) \sim GP(\mu, k) means that for any finite batch of function values \mathbf{f} = f(\mathbf{X}), where \mathbf{f} = \left[ f(\mathbf{x_1}, ..., f(\mathbf{x_n}))\right] \sim \mathcal{N}(\mu, K) holds.

One might ask why this assumption is useful. The main benefit of using Gaussian distributions is that they are incredibly nice to work with. The Gaussian family is self-conjugate and has a number of nice properties such as being closed under marginalization and closed under conditioning. This makes it a natural choice for Bayesian machine learning, where you have to specify a prior belief  and then marginalize over parameters to obtain a posterior predictive distribution.

Also, the multivariate normal distribution is entirely characterized by a mean vector and a covariance matrix.

Inference

So how does one make predictions based on observed (training) data? It is basically based on this formulation:

(1)    \begin{equation*} \left[ {\begin{array}{c} f \\f^* \end{array} } \right] = \mathcal{N}\left( \mu, \left[ {\begin{array}{cc} K + \sigma^2 \mathbb{I}_n & K_* \\K_*^T & K_{**} \\ \end{array} } \right]  \right). \end{equation*}

The equation above reflects the previously mentioned assumption of normality, but here we also differentiate training and testing points (the latter is marked with an asterisk). There’s also an additional assumption that training labels are corrupted by Gaussian noise, which is the \sigma part at the top left in the kernel-matrix above.

Now, it is possible to compute the conditional distribution of the test data:

(2)    \begin{eqnarray*} Cov(f^*) = K_{**} - K_*^T(K+\sigma^2 \mathbb{I}_n)^{-1}K_* \\ \mu(f^*) = K_*^T(K+\sigma^2 \mathbb{I}_n)^{-1}f. \end{eqnarray*}

These equations give us a formal way of calculating the predictive mean and covariances at the test points conditioned on the training data.

 

GP Kernels

To actually use the equations above, one has to find a way to model the covariances and cross-covariances between all combinations of training and testing data. To do this we assume a certain kernel function for modeling the covariances. There are multiple possible kernels and choosing the right kernel function is critical for achieving good model fit, because they determine the shape of the resulting function. In this post I choose a squared exponential kernel: k_{SqE}(x_1,x_2):= \sigma^2 \cdot \exp(-\|x_1-x_2\|^2_2)/(2 \cdot l^2)) with l>0 the lengthscale and \sigma^2>0 the signal variance. But feel free to try out another kernels, like Brownian k_B(x_1,x_2) := min(x_1,x_2) for example.

Once the hyperparameters are optimized, the kernel matrix can be plugged into the predictive equations for the mean and covariance of the test data to obtain concrete results.

Regression

One of the key features of GP regression is that we are modeling a function where we only have a limited number of observations. Yet, we are interested in the values (and uncertainties) at unobserved locations. This can be useful for weather prediction, as rainfall and temperatures are only measured at specific locations or for sensor readings that only occur every so often. Once you found a suitable kernel function for modeling the similarities between such data points distributed in space or time and you optimized the kernel function parameters using a training set, you can make predictions on the test set using the predictive equations.

Do it yourself with Python

Recall that I chose a squared exponential kernel, which itself is defined by a mean value M and a variance \sigma. We can quickly visualize this kernel to see what these two parameters do to the kernel.

 

Here is the kernel for our Python implementation:

 

class SquaredExponentialKernel:
    def __init__(self, sigma_f=1, length=1):
        self.sigma_f = sigma_f
        self.length = length

    def __call__(self, argument_1, argument_2):
        return float(self.sigma_f * np.exp(-(np.linalg.norm(argument_1 - argument_2)**2) / (2 * self.length**2)))

 

Let us shortly recall the formula:
Given training points x_1,...,x_n\in \mathbb{R}^m with values y_1,...,y_n\in \mathbb{R}, y = (y_i)\in \mathbb{R}^n with noise in each point \mathcal{N}_{0,\sigma} and points x_{n+1},...,x_k\in \mathbb{R}^m for which we want to predict the output, adapting our probability distribution leads to:

\mathcal{N}(K_*^T(K+\sigma^2 \mathbb{I}_n)^{-1}f, K_{**}-K_*^T(K+\sigma^2 \mathbb{I}_n)^{-1}K_*), with

K= (k(x_i,x_j))_{i,j\leq n}

K_*= (k(x_i,x_j))_{n+1\leq i, j\leq n}

K_{**}= (k(x_i,x_j))_{n+1\leq i,j}

 

And here we go with our main GPR class and a little visualization function:

 

# Helper function to calculate the respective covariance matrices
def cov_matrix(x1, x2, cov_function):
    return np.array([[cov_function(a, b) for a in x1] for b in x2])


class GPR:
    def __init__(self, data_x, data_y,
                 covariance_function=SquaredExponentialKernel(),
                 white_noise_sigma=0.0):
        self.noise = white_noise_sigma
        self.data_x = data_x
        self.data_y = data_y
        self.covariance_function = covariance_function

        # Store the inverse of covariance matrix of input (+ machine epsilon on diagonal) since it is needed for every prediction
        self._inverse_of_covariance_matrix_of_input = np.linalg.inv(
            cov_matrix(data_x, data_x, covariance_function) +
            (3e-7 + self.noise) * np.identity(len(self.data_x)))

        self._memory = None

    # function to predict output at new input values. Store the mean and covariance matrix in memory.

    def predict(self, at_values: np.array) -> np.array:
        k_lower_left = cov_matrix(self.data_x, at_values,
                                  self.covariance_function)
        k_lower_right = cov_matrix(at_values, at_values,
                                   self.covariance_function)

        # Mean.
        mean_at_values = np.dot(
            k_lower_left,
            np.dot(self.data_y,
                   self._inverse_of_covariance_matrix_of_input.T).T).flatten()

        # Covariance.
        cov_at_values = k_lower_right - \
            np.dot(k_lower_left, np.dot(
                self._inverse_of_covariance_matrix_of_input, k_lower_left.T))

        # Adding value larger than machine epsilon to ensure positive semi definite
        cov_at_values = cov_at_values + 3e-7 * np.ones(
            np.shape(cov_at_values)[0])

        var_at_values = np.diag(cov_at_values)

        self._memory = {
            'mean': mean_at_values,
            'covariance_matrix': cov_at_values,
            'variance': var_at_values
        }
        return mean_at_values


def plot_GPR(data_x, data_y, model, x, ax, color_index=1):

    mean = model.predict(x)
    std = np.sqrt(model._memory['variance'])

    for i in range(1, 4):
        ax.fill_between(x, y1=mean + i * std, y2=mean - i * std,
                        color=sns.color_palette('bright')[color_index], alpha=.3,)
        # label=f"mean plus/minus {i}*standard deviation")

    ax.plot(x, mean, label='mean', color=sns.color_palette(
        'bright')[color_index], linewidth=2)
    ax.scatter(data_x, data_y, label='data-points',
               color=sns.color_palette('bright')[color_index], s=30, zorder=100)
    
    pass

Now this is cool because now we can take a look at our GPR at work. The shaded areas represent 1,2,3 and 4 times the standard deviation. You can clearly see that uncertainty is low in the area where we have more data and it grows where we have no observations.

 

x_values = np.array([0, 0.3, 1, 3.1, 4.7])
y_values = np.array([1, 0, 1.4, 0, -0.9])
xlim = [-1, 7]
x = np.arange(xlim[0], xlim[1], 0.1)


fig, axes = plt.subplots(6, 1, figsize=(7, 9))
for i, ax in enumerate(axes.flatten()):
    x_val = x_values[:i]
    y_val = y_values[:i]
    model = GPR(x_val, y_val)
    plot_GPR(data_x=x_val, data_y=y_val, x=x,
             model=model, ax=ax, color_index=i)
    ax.set_xlim(xlim)
    ax.set_ylim([-6, 6])
plt.tight_layout()
plt.show()

 

Of course, we can also look at the different functions drawn from the distribution that make up our variance:

Another interesting aspect is that we of course still have the hyperparameters \sigma,l and the noise that can influence how we model our data. Varying these parameters looks like this:

Sigma mainly influences the uncertainty envelope, allowing for functions that more or less deviate from the mean. The length scale influences how smooth the curve becomes, and how curvy and bendy the functions get. And depending on the amount of noise we assume in the data the predicted functions are more or less tied to the observation points.

Conclusion

A GP model is a non-parametric model which is flexible and can fit many kinds of data, including geospatial and time series data. At the core of GP regression is the specification of a suitable kernel function, or measure of similarity between data points. The tractability of the Gaussian distribution means that one can find closed-form expressions for the posterior predictive distribution conditioned on training data. For a deep dive, feel free to look at the book Gaussian Processes for Machine Learning by Rasmussen and Williams.

Using Blender for 3D Animation and Modelling

The first long-term support of the 3.x series is here!
Discover Blender 3.3 LTS

So this is great news. Blender has been a staple of Open Source software for as long as I can remember and you can basically do everything with it. It has a steep learning curve, so you need to put in some time and effort but it really is an amazing piece of software. See their blog-post on how Blender was used to create the effects of the recent Bollywood blockbuster “RRR”.

A gatekeeper for me was always GPU acceleration. I never really had a personal computer that is that powerful or in recent time even has had a dedicated GPU, but with Blender 3 they offer GPU acceleration using Apple’s Metal framework and this is a Gamechanger even if you are just on a M1 Macbook Air, as a I am.

So to dive right in I chose the now famous Blender donut tutorial series to practice my skills and oh boy this is fun. With all the scripting support it also is an incredible tool to do datavisualization and I can’t wait to spend more time with it.

As for donuts, I just find it amazing that I basically created this out of thin air. Here’s my take:

Pytorch on Apple Silicon

In collaboration with the Metal engineering team at Apple, we are excited to announce support for GPU-accelerated PyTorch training on Mac.

It’s about time. It has been 18 months since the first M1 chips shipped and finally we got some support for the advanced GPUs on these chips. Being able to leverage your own local machine for simple Pytorch tasks and even just doing local testing of bigger projects is of tremendous value. One thing I am still confused about is that all of this technology still just uses the ‘standard’ GPU cores and there is still no way to access the custom Neural Engine of the chip. This would make inference on device even better. But it’s up to Apple to give access to their internal APIs.