On What is Wrong with Personal Objectives at Work

We have all had the experience: setting your personal objectives and defining your annual goals according to the business plan for that year, scratching your head and trying to figure out what the heck you are going to write in there.

Here is what I think is the problem and maybe the solution.

The Grand Plan

Your company often has a set of macro goals expressed in an annual business plan (e.g. generate X amount of money, grow the user base by X amount, …). This plan is usually just a wishlist and useless as a set of concrete objectives: the goals are vague and don’t tell you how you are going to achieve them.

So we break them down into more tangible micro goals in an annual roadmap. This large roadmap then gets broken down further into quarterly roadmaps, and then into smaller chunks of sprint plans.

Plans are Guesses

The reality is that plans are guesses and at best are intentions.

That’s why you revise your plans frequently (quarterly, every other month, every two weeks/sprint): to make sure they are still what you really intend to do, now that you’ve progressed forward in time and know more about the state of the world.

The Dilemma

Let’s walk through a common scenario:

At the beginning of the year, you are advised to pick goals that are aligned with the annual plan. Let’s say you pick a project from the list of annual guesses and write: “by the end of the year, develop the super idea called project A”.

What happens when, in the next quarterly roadmap, someone decides that project A is no longer a good project, or no longer a priority? What happens to your goal?

There are two options: either you say, OK, I will revise my annual goals along with the roadmap updates; or you say, screw the new plan, I have to work on project A.

The Path of Change

If you choose to change your goals because the roadmap is now different, what’s the point of annual goals anyway? Why can’t you just say that your goal for this year is to work on the roadmap and finish the projects there? Isn’t that why the company hired you in the first place?

The Path of Resistance

You can choose not to work on the roadmap, and say that you’re going to work on project A because eight months ago you agreed it with your boss and there’s no way to change it. Unless this is an explicit process in your organisation (more on this below), you’re probably going to have to spend your own time on that project, as you will also be working on the revised roadmap.

What now?

So how do you set your annual goals?

Depending on your organisation, you’re more likely to be in the “path of change” bucket. For this category, I believe growing as a better individual, contributor, thinker, and teammate is what really matters. Hence, your personal goals should reflect the technical and personal skills you want to improve on, as these kinds of goals are unlikely to change and are essential no matter where you are, which project you’re working on, or which team you’re part of.

The more interesting scenario of the “path of resistance” will work if your organisation is drastically different from the norm and explicitly acknowledges autonomy in selecting and implementing goals that an individual believes in, even if those goals differ from what the organisation currently guesses to be the right ones to pursue.


So the premise here was that you pick your annual goals from a list of possible concrete options the company has set out to explore in the roadmap.

You might say: I will pick something that’s not there. But then you’re squarely in the “path of resistance” domain: how are you going to balance working on this against the day-to-day sprint work?

You might say: I will generalise, and write something like “develop 3 new features” or “fix 20% more bugs”. But again, these are as good as just saying “I will work on the roadmap”, since all three of those new features are in the roadmap, and more bug fixes is a goal somewhere in there too.

The same is true if the roadmap doesn’t change and you actually finish project A. Even then, you could have just said “I will work on the roadmap” and it would have been fine.

I’d love to hear your experiences around this topic and what has worked for you in the past, as this is the best I’ve got, so feel free to comment.

These are solely my personal views. They are based on my experiences of working in technology startups, so your mileage will vary based on your past experiences.

A Practical Approach to Machine Learning Projects

The topic of how to approach Machine Learning (ML) based projects has come up many times over the past several years in conversations with people in the field and industry. One of the fundamental aspects of ML projects is that they often carry larger risks and unknowns compared to the other projects being undertaken within your company’s engineering department.

While we have made tremendous progress in ML technologies, modelling algorithms, platforms, and libraries, we are still very far from having what I call “textbook algorithms” that would reduce the risk of implementing a project.

As a result, I’ve created and adapted over time a rough approach for handling these kinds of projects that I’ve found to lead to good results in practice. Today this topic came up again twice at the new company I joined last week, so I thought I’d write a blog post about it.

To start, imagine you are starting a new ML project. Let’s say you are intending to design an API that automatically extracts the title and the name of the authors of a book from a given input webpage. Here are the stages and steps I would take.

Stage 1: Implement your evaluation pipeline

Contrary to many other approaches I’ve seen, which do this part as the last stage, building the evaluation pipeline is, to me, the first and one of the most important steps you will have to take. Why? Let me take you through the individual steps, and the reason will hopefully become clear.

1.1. Gather and create a benchmark dataset

You need a dataset to form a benchmark (i.e. a test set) on which your metrics and algorithms will be evaluated.

Doing this step makes sure that you have access to a dataset that is large enough and covers the cases for which you would like to compute performance. Unless you have access to a good, large in-house dataset, this is going to be one of the crucial steps, and you will have to spend a good amount of time on it.

Be prepared to check the web for public and open datasets, use crowdsourcing platforms for human labelling, adapt an existing knowledge base to give you annotations, check your application logs and user feedback, add feedback mechanisms to your other apps to capture datasets, and so on.

Any time and money investment here will pay off big, so take this very, very seriously. And don’t opt for partial datasets (e.g. a list of webpages with only book titles labelled in the example above): it will not take you very far and you will not be able to answer simple questions such as what the overall performance is.

Also be sure to version your datasets and benchmarks. You should know which dataset version a model was trained on and which test dataset version it was evaluated on. Note that it’s better for your benchmark dataset to change at a slower pace than your training dataset; this makes it easier to understand the impact of a new training dataset, a new model, a new bug fix, and so on.

1.2. Identify your main evaluation metric

No matter what smartness you are going to pull off later on, you need to be able to evaluate all your models on metrics that matter and are closest to what users would experience if they used your API.

As a rule of thumb, it’s good to have multiple metrics to cover how models behave, but choose one overall metric whose improvement will be evident to your users. That’s the one you need to ace!

In the example above, there are many choices: you might consider classical metrics such as accuracy, precision, recall, AUC, AUPRC. But remember that you are dealing with a structured output: a book entry will have a title and a list of authors. How would you compute a metric for this? Would you evaluate model outputs independently or in conjunction? Would you weight them differently? Does a weighted average of these metrics make sense? How would you decide on the weights? If you improved 10% on your metric of choice, would you be able to translate that into what the new experience will be for your typical (possibly non-technical) user?

The answers will depend on your particular application, but tackling these questions head-on will make you aware of the kinds of modelling approaches you will need to take later on.
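To make this concrete for the book example, here is one possible combined metric. The exact-match and Jaccard choices, and the 50/50 weights, are assumptions for illustration, not a recommendation:

```python
# A hedged sketch of a combined metric for the book-extraction example.
# Field names, scoring functions, and weights are all illustrative choices.

def title_score(pred, gold):
    """Exact-match score for the title field (case-insensitive)."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def author_score(pred, gold):
    """Jaccard overlap between predicted and gold author sets."""
    p, g = {a.lower() for a in pred}, {a.lower() for a in gold}
    if not p and not g:
        return 1.0
    return len(p & g) / len(p | g)

def book_score(pred, gold, w_title=0.5, w_authors=0.5):
    """Weighted combination; how to weight the fields is application-specific."""
    return (w_title * title_score(pred["title"], gold["title"])
            + w_authors * author_score(pred["authors"], gold["authors"]))
```

Even this simple version forces the structured-output questions into the open: a half-right prediction (correct title, one missing author) gets partial credit instead of being collapsed into a binary hit/miss.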

1.3. Implement a dummy random algorithm

This will be your first model! Pure random predictions!

While it might not be intuitive, what doing this step gives you is the first trivial algorithm to evaluate. Another side-effect is that it provides you with a rough sketch of the prediction API. No matter the algorithm underneath, you need to figure out what are the necessary inputs and outputs, i.e. your run-time API.
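A minimal sketch of such a random model for the book example; the dict shape with `title` and `authors` fields is my invention for this illustration, and pinning it down is exactly the API-design side effect mentioned above:

```python
import random

# A purely random "model" whose only real job is to fix the prediction API:
# what goes in (a webpage) and what comes out (a book entry).

def predict(html: str) -> dict:
    """Given a webpage, return a book entry; here, random guesses from the text."""
    words = html.split() or ["unknown"]
    return {
        "title": random.choice(words),
        "authors": [random.choice(words)],
    }
```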

1.4. Evaluate your trivial algorithms

Pass the test dataset you’ve made through the random algorithm and measure the performance. While you shouldn’t expect miracles to happen here, be prepared to see skewed results showing up, indicating peculiarities of your test set. Make multiple runs and analyse how the performance varies.

Now that you’ve tested your first model, test other trivial models, such as predicting a fixed class for all test instances, predicting nothing at all, etc. Many of these will tell you about characteristics of your dataset as well as your metrics of choice. If your dataset is unbalanced (95% negative vs 5% positive) and you’re measuring accuracy, these trivial algorithms will tell you that you might want to switch to something that takes the imbalance into account, or that applies different penalties for different types of mistakes, and so on.
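The 95/5 example can be checked in a few lines: a model that always predicts the majority class looks great on accuracy and useless on recall:

```python
# Why accuracy misleads on an unbalanced set: a constant "negative" model
# scores 95% accuracy but 0 recall. The 95/5 split matches the example above.

labels = [0] * 95 + [1] * 5          # 95% negative, 5% positive
preds  = [0] * 100                   # trivial model: always predict negative

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
recall = sum(p == 1 and y == 1 for p, y in zip(preds, labels)) / sum(labels)
```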

1.5. Bonus steps

If you have the time and means to do this without a lot of effort, consider:

  • Setting up a repository for collecting and displaying evaluation results of different models (this could be as simple as a backed up CSV file with a new line appended to it)
  • Triggering automatic evaluation runs on commits to the main code repository, or as regular nightly jobs if the evaluation takes a long time
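The “backed up CSV file” option really can be this small; the file name and columns below are assumptions:

```python
import csv
import datetime
import pathlib

# A minimal results repository: one CSV file, one appended row per evaluation
# run. The file name and column layout are illustrative choices.

RESULTS = pathlib.Path("eval_results.csv")

def log_result(model_name, dataset_version, metric):
    """Append one evaluation run to the shared results file."""
    is_new = not RESULTS.exists()
    with RESULTS.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["timestamp", "model", "dataset", "metric"])
        now = datetime.datetime.now(datetime.timezone.utc)
        writer.writerow([now.isoformat(), model_name, dataset_version, metric])
```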

Stage 2: Implement your baseline API

This is also a step that many might skip when they start a new project. There’s a lot of enthusiasm about the latest arXiv paper you read a few days ago, so you are dreaming of just getting started on that super awesomeness.

Document all these ideas and papers; they will be very important down the line. But consider establishing a quick baseline API first.

2.1. Implement a simple baseline

This should be the quickest model you can build to establish a baseline. Consider using off-the-shelf ML algorithms that don’t require huge amounts of time for fine-tuning or training, and build a model using the training dataset you have. Many ML and Deep Learning libraries come with tutorials and example demos bolted on, so don’t shy away from writing wrappers around them and using those here.

Any heuristically built model is also a good baseline: wear your engineering hat and come up with hacky solutions that would convert inputs to predictions.
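For the book example, such a hacky baseline might be as simple as grabbing the page’s `<title>` tag and splitting on a common “Title - Author” pattern. The pattern is an assumption that will fail on many pages, which is fine for a baseline:

```python
import re

# A heuristic, no-ML baseline for book extraction: take the <title> tag and
# split on " - ". Both the tag choice and the separator are rough assumptions.

def heuristic_predict(html: str) -> dict:
    m = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    raw = m.group(1).strip() if m else ""
    if " - " in raw:
        title, author = raw.split(" - ", 1)
        return {"title": title.strip(), "authors": [author.strip()]}
    return {"title": raw, "authors": []}
```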

And don’t overthink it! You are preparing for your big guns down the road! Just implement a sensible baseline as quickly as possible.

These baselines are there to give you a sense of how difficult the problem is: if your hacky off-the-shelf algorithm is already doing pretty well, is the problem an easy one? Or maybe you don’t have a good benchmark dataset? Or maybe, again, something is wrong with your metrics?

If the results are poor, start analysing the error cases. Is there a repeating pattern in errors? Do these errors feel genuinely difficult? Would another approach have a better chance?

All of these will give you important tips and directions when you’re building more complex models.

2.2. Demo your baseline API

Now that you have a model that is hopefully doing better than random predictions, why not put all the bits together and release your first API? Depending on how good the baseline algorithm is, you might well be looking at your v0.1 release of the API here.

Demo this to your colleagues and see what they say; many of them are smart people, so you have a good chance of gathering invaluable feedback here. Slap a simple front-end on it if that makes it easier to show what’s happening. Visualise intermediate outputs from your model to help you debug and interpret what is happening while people are using the demo app.

Stage 3: Let’s Rock

Now is the time to bring in the big guns and get fancy with the new model or architecture you thought about implementing. This is a very exciting phase: unless you’ve been dealing with a simple problem, there’s a lot of ground to be covered here with respect to your baselines.

This stage is often a cycle of a few repeating steps:

3.1. Research a new ML algorithm

Pick good papers and read them with colleagues. Brainstorm on possible scenarios and approaches. Check and study open-source implementations if they exist.

Early on, choose algorithms with a good track record over a newly published architecture with marginal performance gains. You will later be able to come up with novel methods that achieve similar or much better improvements, so from a benchmarking point of view it’s better to cover more established models first.

3.2. Develop a prototype

Try implementing quick, albeit dirty, prototypes first. The goal here is to get to the end of training and evaluation as fast as possible.

Don’t spend a lot of time on the cleanest code, the best infrastructure for your model, etc., unless you genuinely know you will be reusing these and they will make your subsequent work much smoother. To me it’s always better to have a model proven to perform on a difficult problem, even with complicated, sub-standard code, than beautifully designed code following all modern software practices that does only slightly better than the simple baseline above. The first is a great problem to have compared to the second!

3.3. Do the engineering work and release

Once you’re happy with how your model performs and you see a considerable jump in performance, take on the engineering work: think about the best architecture, how you would scale the training and run-time, how you would deploy this as an API, how you would do the DevOps side of it, write tests and documentation, explain API contracts, and so on.

Remember to version your models (as well as the datasets mentioned above). Also add instrumentation to your API/UI to capture user feedback, create a consistent logging system for your models, and create dashboards on how your models are being used in production.

3.4. Repeat

If there is still good room for improvement, consider repeating the R&D cycle. Spend time analysing the error cases again, this time from your live API; check what users have been trying and see if your models perform well on those inputs.

Before embarking on a completely new model, remember that having a bigger/better dataset might actually make a lot of difference. So in parallel to R&D consider expanding the size and diversity of the dataset.

If you’re lucky enough to have a good user base and you’ve built mechanisms to capture their feedback, you might end up with a dataset that grows over time. If this is the case, consider automating and scheduling the training of your models to benefit from the new and hopefully better dataset. You can also use the benchmark evaluation as an additional criterion for the automatic publication of your newly trained models!
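If you go down the automation route, the benchmark gate can be as simple as a comparison. `should_publish` and its margin are hypothetical names and values, not an established practice:

```python
# A sketch of using the benchmark as a publication gate: only promote the
# newly trained model if it clearly beats the live one on the fixed benchmark.
# The margin value is an arbitrary illustration.

def should_publish(new_score, current_score, margin=0.01):
    """Promote only on a clear improvement over the current model."""
    return new_score >= current_score + margin
```

The margin guards against promoting models whose gains are within benchmark noise; how large it should be depends on how stable your benchmark scores are between runs.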

Closing comments

The whole point of the approach outlined above is to have a method that ensures you are taking solid steps to reduce the risks associated with ML projects. It also emphasises speed and continuous delivery, as opposed to far-apart pure R&D cycles with occasional releases. If you’re lucky with the dataset situation and you have a couple of ML engineers/scientists, there is a good chance you will be demoing your v0.1 API (step 2.2) in a sprint or two, and your v0.2 with a more sophisticated model might be another couple of months or so away (don’t take this timeline as definite; it will depend heavily on the nature of your problem, how familiar the team is with it, whether you need special infrastructure, and so on. Think about your problem and adapt!).

This often sets you on a good path: you’ve covered important metrics, you’ve a reliable training dataset and a test set you can trust, you’ve discovered your APIs, you will be gathering invaluable feedback from alpha/beta users, and have solid baselines to compare to.

While I tried to cover as much as possible, I might have missed a few points here and there. I also don’t believe this is the only way of approaching ML projects, so I would love to hear your experience on this.

An Introduction to Deep Learning for Generative Models

Back in October, Aida (@aidamash) and I (@amirsaffari) released a Deep Learning based Twitter music bot called “LnH: The Band” (@lnh_ai), which is capable of composing new music on demand in a few genres when you simply tweet at it. It has so far composed more than 700 new songs. Here is where you can see the instructions on how to tweet at it.

I have been interested for many years in the intersection of art, creativity, and technology, and the recent advances in Deep Learning are enabling us to make rapid progress in bridging those disciplines. Using algorithms to impersonate or assist an artist and create artefacts is not new, and currently falls under fields such as Computational Creativity and Creative Coding.

This is a two-part article in which I discuss how recent advances in supervised learning can be used for generative purposes (this article) and how one can use these models to create music (next episode). I will try to keep the article as high-level as possible; however, some of the concepts are best understood with simple mathematical notation.

Machine Learning, Deep Learning, and Generative Models

Recent advances in Machine Learning, and particularly, Deep Learning have resulted in algorithms and architectures that are able to model complex structured data types such as images, sounds, and text.

These advancements have been mainly focused on supervised learning algorithms which try to learn a statistical model for estimating a function called the posterior probability p(\mathbf{y} | \mathbf{x}) from an input sample \mathbf{x} to an output sample \mathbf{y}. You can imagine \mathbf{x} to be an image and \mathbf{y} be the kind of object that is in the image (e.g. a cat).

The probability written as p(\mathbf{y} | \mathbf{x}) tells us how much the model believes that there is a cat given an input image compared to all possibilities it knows about (e.g. other animals). Algorithms which try to model this probability map directly are often referred to as Discriminative Models or Predictive Models.

Generative Models on the other hand try to learn a related function called the joint probability p(\mathbf{y} , \mathbf{x}). You could read this as how much the model believes that \mathbf{x} is an image and there is a cat \mathbf{y} in it at the same time.

These two probabilities are of course related and can be written as p(\mathbf{y} , \mathbf{x}) = p(\mathbf{x}) p(\mathbf{y} | \mathbf{x}), with p(\mathbf{x}) being how likely it is that the input \mathbf{x} is an image. The p(\mathbf{x}) probability is usually called a density function in the literature.
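As a quick sanity check of this factorisation, here is a tiny discrete example; the two “images”, the labels, and all the numbers are invented:

```python
# A tiny discrete illustration of p(y, x) = p(x) * p(y | x).
# Two invented "images" and two labels; all probabilities are made up.

p_x = {"img_a": 0.6, "img_b": 0.4}            # density p(x)
p_y_given_x = {                                # posterior p(y | x)
    "img_a": {"cat": 0.9, "dog": 0.1},
    "img_b": {"cat": 0.2, "dog": 0.8},
}

def joint(x, y):
    """Joint probability p(y, x) via the factorisation above."""
    return p_x[x] * p_y_given_x[x][y]
```

Note that summing the joint over every (x, y) pair gives 1, as a proper joint distribution must.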

The main reason to call these algorithms generative relates to the fact that the model has access to the probability of both input and output at the same time. Using this, one can, for example, generate images of animals by sampling animal kinds \mathbf{y} and new images \mathbf{x} from p(\mathbf{y} , \mathbf{x}).

One can take a step further and learn only the density function p(\mathbf{x}), which depends only on the input space. These algorithms are considered Unsupervised Generative Models (as there’s no access to data on the kind of input). They are also generative, as one can again sample from the distribution captured by the model.

The remainder of this article will focus on these unsupervised generative models, and I will use generative and unsupervised generative interchangeably.

An appropriate application of these generative models to spaces that are considered of artistic value (such as music, painting, …) would result in algorithms that are capable of generating new artefacts on-demand. Next, we will look into a few classes of these models.


From prediction to generation

As mentioned above, discriminative modelling has been at the forefront of the recent success and progress in the field of machine learning. These models are capable of making predictions that depend on a given input. However, on their own they are not directly able to generate new samples.

The basic idea behind much of the recent progress in generative modelling is simply to convert the generation problem into a prediction one, and use the repertoire of deep learning algorithms to learn it. Modern deep learning algorithms are capable of modelling very complex mappings and offer the flexibility of defining problems in terms of computational graphs that can be optimised by variants of the back-propagation algorithm on fast hardware such as GPUs.

The following sections will review the three major categories of these approaches.

Auto-Encoder (AE) models

The simplest form of converting a generative problem to a discriminative one would be to learn a direct mapping from the input space to itself. Using the previous example of images, suppose we wanted to learn an identity map that for each image \mathbf{x} would ideally predict exactly the same image, i.e. \mathbf{x} = f(\mathbf{x}) for f being the predictive model.

On its own such a model would not be of any use, but as we will see by using a specific architecture with certain constraints, we can create a generative model.

The basic idea here is that we can create a model composed of two components: an encoder model q_e(\mathbf{h} | \mathbf{x}) that maps the input to another space, often referred to as hidden or latent space represented by \mathbf{h}, and a decoder model q_d(\mathbf{x} | \mathbf{h}) that learns the inverse mapping from the latent to input space.

These two components can be connected together to create an end-to-end trainable model and it is often the case that we impose a constraint on the latent space \mathbf{h}. The most common constraint is to create a bottleneck such that \mathbf{h} has a lower dimension compared to \mathbf{x}, forcing the model to learn a lower dimensional representation of the input space (as can be seen from the following figure – courtesy of deeplearning4j project, the left hand-side shows the encoder network and the right hand-side shows the decoder).


This way, the encoder can be seen as a compression algorithm and the decoder as a decompression or reconstruction algorithm. In practice, both the encoder and decoder models are deep neural networks of varying architectures (e.g. MLPs, ConvNets, RNNs, AttentionNets) chosen to get the desired outcomes.

Once such a model is learnt, we can unplug the decoder from the encoder and use them independently. For example, in order to generate a new sample, one could first generate a sample from the latent space (by let’s say combining the latent vectors of two inputs or directly sampling from the latent space) and then present that to the decoder to create a new sample from the output space.
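The encoder/decoder split and the “unplugging” can be sketched in a few lines. Here, linear maps with a 2-dimensional bottleneck stand in for trained deep networks; the weights are random and untrained, so the point is only the shapes and the decoupling, not reconstruction quality:

```python
import numpy as np

# A bare-bones sketch of an auto-encoder's structure: an 8-d input space,
# a 2-d latent bottleneck, and a decoder usable on its own for generation.
# Random, untrained weights stand in for learned networks.

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(2, 8))   # encoder: 8-d input  -> 2-d latent h
W_dec = rng.normal(size=(8, 2))   # decoder: 2-d latent -> 8-d reconstruction

def encode(x):
    return W_enc @ x

def decode(h):
    return W_dec @ h

# "Unplug" the decoder: generate a new sample from a latent draw alone.
h_new = rng.normal(size=2)
x_new = decode(h_new)
```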

To see these kinds of models in action, I would suggest having a look at the online demo of Digit Fantasies by a Deep Generative Model. You can play with changing the latent space and generating new images of handwritten digits (an example is displayed below).


Other examples could be the following two approaches to generation of natural images by DRAW on the left and a more recent version of Variational Auto-Encoders on the right.




Generative Adversarial (GAN) models

As we saw from the architecture of Auto-Encoders, one can imagine a general concept of creating modular networks that have a special relationship with each other; training such models in an end-to-end fashion can help us learn latent spaces that lead to the generation of new samples.

Another version of this concept is the Generative Adversarial Models framework, where we have a generator model q_g(\mathbf{x} | \mathbf{h}) mapping a low-dimensional latent space \mathbf{h} (often modelled as noise sampled from a simple distribution) to the input space \mathbf{x}. One can interpret this as having a similar role to the decoder in AEs. So far, not much new here!

The trick is now to introduce a discriminative model p_d(\mathbf{y} | \mathbf{x}), which tries to associate an input instance \mathbf{x} with a yes/no binary answer \mathbf{y}: was the input generated by the generator model, or was it a genuine sample from the dataset we are training on?

Let’s use the same image example as before. Imagine the generator model creates a new image, and we also have a real image from our dataset. If our generator is good, the discriminator model will not be able to distinguish between the two images easily. However, if our generator is poor, it will be very easy to tell which one is fake and which one is real.

When these two models are coupled, one can train them end-to-end (often in a stage-wise fashion) by ensuring that the generator gets better over time at fooling the discriminator, while the discriminator is trained on the harder and harder problem of detecting fakes. Ideally, we want to end up with a generator whose outputs are indistinguishable from the real training data from the discriminator model’s point of view.

During the initial parts of training, the discriminator can easily detect samples coming from the dataset vs the synthetic ones generated by the generator, which is just starting to learn. However, as the generator gets better at modelling the dataset, we start seeing more and more generated samples that look similar to the dataset. An example of this can be seen in the following image, which depicts the generated images of a GAN model learning over time (courtesy of OpenAI).

Recent versions of these models have tried to focus on improving the stability of training, using special architectures more suitable for image generation such as DCGAN and LapGAN, adding class information to the input space to generate images from a specific class (CGAN), unsupervised latent code discovery for interpretable semantic attributes of a dataset (InfoGAN), and combining AEs with GANs for Adversarial Auto-encoders.

You can check this demo for a simple GAN training simulation and this demo for a variant of VAEs+GANs for abstract image generation (example image below – courtesy of Otoro.net).


Sequence models

If the data we are trying to model is characterised as a sequence over some dimensions (for example, 1-d for time, 2-d for space, 3-d for spatio-temporal), then we can use special algorithms called Sequence Models. These models are able to learn probabilities of the form p(\mathbf{y} | \mathbf{x}_{n}, \dots,\mathbf{x}_{1}), where i is an index signifying the location in the sequence and \mathbf{x}_{i} is the i-th input sample.

An example of this would be written text: each word is a sequence of characters, each sentence is a sequence of words, each paragraph is a sequence of sentences, and so on. Our output \mathbf{y} could be, for example, whether the sentence has a positive or a negative sentiment associated with it.

Using a similar trick from AEs, one can decide to replace \mathbf{y} with the next item in the sequence, i.e. \mathbf{y} =\mathbf{x}_{n + 1}, allowing the model to learn p(\mathbf{x}_{n + 1} | \mathbf{x}_{n}, \dots,\mathbf{x}_{1}).

In other words, with these models we can use the past history of a sequence of input data to make a prediction on what is likely to follow. Also note that one can use the chain rule of probability to estimate the probability of the overall sequence via a recursive operation as p(\mathbf{x}_{n + 1}, \mathbf{x}_{n}, \dots,\mathbf{x}_{1}) = p(\mathbf{x}_{1}) \prod_{i = 1}^{n} p(\mathbf{x}_{i + 1} | \mathbf{x}_{i}, \dots,\mathbf{x}_{1}). Remember that, from our definition of unsupervised generative models, this probability expresses how much the model believes the sequence to be a real one (i.e. coming from the same distribution as the dataset).
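To make the chain rule concrete, here is a toy first-order character model; the transition table and its probabilities are invented stand-ins for a learned model:

```python
# The chain rule computed for a toy first-order character model: the
# probability of a whole sequence is the product of next-symbol probabilities.
# The transition table is invented; a real model would learn these values.

COND = {
    "<START>": {"a": 0.5, "b": 0.5},
    "a": {"b": 1.0},
    "b": {"a": 0.5, "b": 0.5},
}

def sequence_prob(seq):
    """p(x_1, ..., x_n) as a running product of conditionals."""
    prob, prev = 1.0, "<START>"
    for sym in seq:
        prob *= COND[prev].get(sym, 0.0)
        prev = sym
    return prob
```

A sequence the model has never seen a transition for ("aa" here) gets probability zero, which is exactly the “how much does the model believe this is real” reading above.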

A special branch of neural networks called Recurrent Neural Networks is specially suited to these tasks, as these networks are able to keep a summary of the past inputs (often called the (hidden) state \mathbf{h}_{n}) in memory and simplify the model to a two-stage operation:

  • Given a new input from the sequence \mathbf{x}_{n} and the old state \mathbf{h}_{n - 1} compute a new state \mathbf{h}_{n} by the encoder function q_e(\mathbf{h}_{n} |\mathbf{h}_{n - 1},\mathbf{x}_{n}).
  • Use the new state to compute how likely it is that the next input in the sequence is \mathbf{x}_{n + 1} by the decoder function p_d(\mathbf{x}_{n + 1} |\mathbf{h}_{n}).

As you can see, there is massive overlap between these generative sequence models and AEs we discussed previously.

A very popular example of sequence modelling is their application to NLP or text modelling. The generation procedure often is a recursive process of:

  • Choose a symbol (e.g. character) from the decoder using the probability map p_d(\mathbf{x}_{n + 1} |\mathbf{h}_{n})
  • Append that symbol to the generated sequence
  • Use the new symbol as the input for the next step of the algorithm to update the state
  • Repeat until a stop event is generated.
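The four steps above can be sketched with a toy stand-in decoder: a fixed transition table plays the role of p_d, and the “state” is just the last symbol. A real model would use an RNN, and all the probabilities here are invented:

```python
import random

# The recursive generation loop: sample a symbol from the "decoder", append
# it, feed it back as input, and stop on <END>. The table stands in for a
# learned p_d; its entries are made up for the "hello"-style example.

TABLE = {
    "<START>": {"h": 1.0},
    "h": {"e": 1.0},
    "e": {"l": 0.5, "<END>": 0.5},
    "l": {"l": 0.5, "o": 0.5},
    "o": {"<END>": 1.0},
}

def generate(seed=0, max_len=20):
    rng = random.Random(seed)
    state, out = "<START>", []
    while len(out) < max_len:
        symbols, probs = zip(*TABLE[state].items())
        nxt = rng.choices(symbols, probs)[0]   # sample from the "decoder"
        if nxt == "<END>":                     # stop event
            break
        out.append(nxt)
        state = nxt                            # new symbol becomes the input
    return "".join(out)
```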

The following diagram (adapted from here) shows an example of this process where the <START> symbol leads to generation of character “h” which then gets used as input to generate “e” and so on until the model generates an <END> symbol.


Depending on the type of problem (similar to AEs and GANs), researchers have come up with specialisations either in models for the encoder (such as LSTMs, GRUs, ConvRNNs, Bidirectional RNNs, Recursive Tree Models, Hierarchical AEs, Attention Networks, spatial inputs with PixelRNNs or PixelCNNs) or in models for the decoder (such as simple MLPs, LSTMs for Sequence to Sequence Learning, Mixture Density Models). Note that while most sequence-to-sequence models (such as those used in Seq2Seq and Neural Language Translation) often use output spaces that are different from their input space, they still fit nicely as extensions of generative models derived from only one space.

To check a few demos on how these models work, you can visit this demo on text generation and this one on hand-writing image generation.


Final words:

So far we have seen a brief overview of recent developments in generative models and their ability to create new samples and artefacts. The most interesting observation is that using a few simple tricks one can cast a generative modelling problem to a prediction problem which opens up many possibilities in terms of types of input/output spaces, architectures, and training algorithms.

Furthermore, while many of these algorithms differ significantly in their implementation details and objectives, they have a striking similarity which is reflected in their dual architecture where either two competing (GANs), inverting (AEs), or completing (Seq Models) networks work in tandem to create a generative model.

Another interesting aspect of these models is that they are often agnostic with respect to the data domain they are trained on. Researchers have applied them to a diverse range of domains such as images, videos, text (including poems, lyrics, books, news), audio, and music. But there is really nothing holding these models back from being applied to any other domain: as long as you can describe your input data either symbolically or numerically, you can test these models.

In the following article, we will look into music specifically and explore generative models used in that space. See you soon!

PS: I’ve purposefully omitted algorithms for Neural Style Transfer as they are currently limited to images and do not fit into the generative models description as nicely as the three frameworks mentioned above do.