Detecting Brand-Unsafe Content Through Computer Vision

Victoria Seoane
February 25, 2024

In the realm of digital advertising, ensuring brand safety is a paramount concern. It involves identifying and safeguarding ad placements within content that aligns with a brand’s values and objectives and, sometimes more importantly, detecting content that doesn’t align with a brand’s message. In 2019, the Global Alliance for Responsible Media (GARM) released its brand safety and suitability standard, which helps brands communicate their particular needs in a shared language across the industry.

Business challenge

With this context in mind, we want to present a challenge brought to us by one of our clients: how do we identify brand-unsafe content in a large corpus of videos in a cost-effective way? Video analysis was cost-prohibitive, so we designed a solution that leveraged image and audio analysis. After all, videos are a large set of images overlaid with sound, right?

In this article, we’ll focus on the computer vision component of the analysis: how do we find, among other things, video scenes related to crime, weapons, terrorism, alcohol, drugs or pornography?

First step – designing the solution

The first thing we decided was to base our solution on a sample of video stills taken from the entire video. Even though this increases the chance of error, image analysis is less expensive than video analysis, so we could sample a larger number of images. For illustrative purposes, content moderation with AWS Rekognition video analysis costs $0.10 per minute of video, while image analysis costs $0.0008 per image, so we can sample 100 images per minute of video and still stay below the cost of video analysis.
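To make that trade-off concrete, here is a quick back-of-the-envelope calculation. The prices are the ones quoted above; adjust them, and the sampling rate, to your own setup:

import_cost_per_minute = 0.10    # video content moderation, USD per minute of video
image_cost_per_image = 0.0008    # image content moderation, USD per image

images_per_minute = 100          # our sampling budget per minute of video
sampled_cost_per_minute = images_per_minute * image_cost_per_image

print(f"Video analysis:   ${import_cost_per_minute:.4f} per minute of video")
print(f"Sampled images:   ${sampled_cost_per_minute:.4f} per minute of video")  # $0.0800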

The number of images to sample is a parameter to tune in order to balance model accuracy and overall cost. To sample these images, we wrote an OpenCV script that loads a video and extracts a subset of stills from it.

import cv2
import os


def sample_images(path, file_name, directory_to_store_at="/tmp"):
    """
    Capture frames from a video and save sampled images at specified intervals.

    Args:
        path (str): Path to the input video file.
        file_name (str): Base name for the output image files.
        directory_to_store_at (str, optional): Directory to save the sampled images.
            Defaults to "/tmp".

    Returns:
        List[str]: List of file paths to the sampled images.
    """
    images = []
    cam = cv2.VideoCapture(path)
    total_frames = int(cam.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = int(cam.get(cv2.CAP_PROP_FPS))
    video_duration = total_frames / fps

    # Calculate the interval (in minutes) to sample at
    # according to the video duration (adjust to business logic)
    if video_duration < 60:
        interval_minutes = 0.1
    else:
        interval_minutes = 1
    frame_interval = int(fps * 60 * interval_minutes)

    if not os.path.exists(directory_to_store_at):
        os.makedirs(directory_to_store_at)

    # Start sampling at frame 15 to avoid the initial black frames
    current_frame = 15
    while True:
        current_frame = current_frame + frame_interval
        cam.set(cv2.CAP_PROP_POS_FRAMES, current_frame)
        ret, frame = cam.read()  # read the frame at the current position
        if ret:
            # If there is video left, keep creating images
            file_id = file_name + 'frame' + str(current_frame) + '.jpg'
            name = os.path.join(directory_to_store_at, file_id)
            image_string = cv2.imencode('.jpg', frame)[1]
            with open(name, 'wb') as file:
                file.write(image_string.tobytes())
            images.append(name)
        else:
            break

    # Release all resources and windows once done
    cam.release()
    cv2.destroyAllWindows()
    return images

Once we have our images sourced from stills, how do we want to classify them? What model should we use? Our first thought was to leverage an idea similar to the one we presented in our previous blog post: classify images into all the different categories and add a “catch-all” category for safe content. However, this had several problems. For starters, an image could belong to several categories at once, making multi-class classification insufficient. Also, what would happen with images that belonged to none of the categories? The “safe” category would potentially be huge, and the amount of data we would need to train the models on everything that is actually safe would have been massive. On top of that, we would risk class imbalance and the problems it entails; for example, the model may become biased towards the majority class, leading to poor performance on the minority class.

To solve the multiple-label issue, we considered multi-label models, but the problem of how many examples of “safe” content we would need remained.

Because of this, we decided to tackle the problem from a new perspective. Instead of building one model to classify them all (sorry, easy joke), we built a set of one-class classification models, one for each of the classes we want to detect. One-class classification, closely related to anomaly detection, is a technique in which a model learns the distribution of a single class and flags inputs that deviate from it. Autoencoders, a type of neural network architecture, are particularly well suited for this task.

Let’s take a look into Autoencoders!

Autoencoders are a type of artificial neural network used in unsupervised machine learning. They are primarily designed for dimensionality reduction, feature learning, and data compression. Autoencoders have applications in a variety of domains, including image and signal processing, anomaly detection, and even natural language processing.

An autoencoder consists of two main parts: an encoder and a decoder. Here’s how they work:

Encoder: The encoder takes an input (which can be an image, a sequence of data, or any other form of structured data) and maps it to a lower-dimensional representation, often referred to as a “latent space” or “encoding”. This process involves a series of transformations and layers that capture essential features or patterns in the input data. The encoder’s goal is to reduce the dimensionality while preserving the most important information.

Latent Space: The latent space is a compressed representation of the input data, and it typically has a lower dimension than the original data. It serves as a bottleneck in the autoencoder architecture, forcing the model to capture the most critical features in a compact form.

Decoder: The decoder takes the latent space representation and reconstructs the input data from it. It consists of layers that perform the reverse operation of the encoder, attempting to generate an output that closely resembles the original input. The quality of the reconstruction is a measure of how well the autoencoder has learned to capture the essential features of the data.

Autoencoders are trained using a reconstruction loss, which measures the difference between the input and the output. The objective during training is to minimize this loss, effectively teaching the autoencoder to reproduce the input data as accurately as possible. In the case of one-class classification, if the reconstruction loss is high for a particular input, it suggests that the input does not fit the learned patterns, indicating that the image doesn’t belong to the class the model was trained on. Let’s use an example: imagine we train two autoencoders, one for detecting weapons and another for detecting drugs. We can then submit an arbitrary image to each autoencoder and compare the reconstruction errors. If the drugs autoencoder has a low reconstruction error, it is likely that the image is related to drugs.
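As a minimal sketch of that routing idea, the helper below scores an image against each per-category autoencoder and returns every category whose reconstruction error falls under its threshold. The model names and threshold values in the commented usage are hypothetical placeholders; the actual preprocessing, training and threshold tuning are covered in the following sections.

import numpy as np

def reconstruction_mse(autoencoder, image_array):
    """Mean squared error between an image and its reconstruction."""
    flat = image_array.reshape(1, -1)  # the Dense autoencoders built later in this post expect flattened input
    reconstructed = autoencoder.predict(flat)
    return float(np.mean(np.square(flat - reconstructed)))

def detect_categories(image_array, category_models, category_thresholds):
    """Return every category whose autoencoder reconstructs the image well.

    category_models: dict mapping category name -> trained autoencoder
    category_thresholds: dict mapping category name -> per-class MSE threshold
    """
    return [
        category
        for category, model in category_models.items()
        if reconstruction_mse(model, image_array) < category_thresholds[category]
    ]

# Hypothetical usage, once per-category models have been trained:
# detect_categories(image_array,
#                   {"weapons": autoencoder_weapons, "drugs": autoencoder_drugs},
#                   {"weapons": 0.05, "drugs": 0.05})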

Sourcing the necessary training data

It’s hard to find open source datasets on many of these topics, especially given their sensitivity, so it was crucial to create our own. We created eight different datasets – crime, arms, military scenes, terrorism, explicit content, sexy content, smoking images and drinking images – each with between 200 and 500 images. To do this, we used images licensed for commercial use from varied sources, such as Flickr, Google and Wikimedia. These were subject to a manual review by our team, to ensure that the source images were good representatives of what a brand-unsafe video could contain. For instance, an image of a soldier posing in uniform is related to military topics, but it is not what the GARM standard refers to as unsafe content. A personal aside: the data labeling step is a task best done as a team and in small doses, and if possible, looking at videos of adorable puppies in between.

Preparing the data for training

We’ll place all of our images in a folder, defined here as image_folder. We’ll also define an image height and width, which the autoencoder models require. It’s important to use an image size that preserves image quality without being too large, since larger images create larger models and require more memory to run. In our case, 256×256 pixels is a reasonable shape.

import os

import numpy as np
from keras.layers import Input, Dense
from keras.models import Model
from keras.preprocessing.image import load_img, img_to_array

# Set the path to the folder containing your images
image_folder = './training_images_military'

# Define image dimensions
target_height = 256
target_width = 256

# Initialize an empty list to store the preprocessed images
preprocessed_images = []

# Loop through the images in the folder
for image_filename in os.listdir(image_folder):
    # Construct the full path to the image file
    image_path = os.path.join(image_folder, image_filename)

    # Load the image using Keras' load_img function
    image = load_img(image_path, target_size=(target_height, target_width))

    # Convert the image to a NumPy array and normalize pixel values
    image_array = img_to_array(image) / 255.0

    # Append the preprocessed image to the list
    preprocessed_images.append(image_array)

# Convert the list of preprocessed images to a NumPy array
X_train = np.array(preprocessed_images)

Model training

We’ll now create a function to train our autoencoder.

In this case, we chose to use Dense layers, but other layer types are available, among them Conv2D, MaxPooling2D and UpSampling2D (a convolutional sketch is shown after the training call below). The architecture chosen was a combination of study and experimentation – figuring out which architecture yielded the best results while keeping model complexity reasonably low. The batch size and number of epochs were chosen to balance accuracy against overall model size – we wanted good models that could still run on machines with 256GB of memory.

def train_and_evaluate(epochs, batch_size, loss, training_data):
    """
    Train and evaluate an autoencoder model on the provided training data.

    Args:
        epochs (int): The number of training epochs.
        batch_size (int): The batch size for training.
        loss (str): The loss function to be used for training.
        training_data (numpy.ndarray): The training data, shaped
            (n_images, target_height, target_width, 3).

    Returns:
        tensorflow.keras.models.Model: Trained autoencoder model.
    """
    # The Dense autoencoder works on flattened images
    dim = target_height * target_width * 3
    training_data = training_data.reshape((len(training_data), dim))

    input_img = Input(shape=(dim,))
    encoded = Dense(units=256, activation='relu')(input_img)
    encoded = Dense(units=128, activation='relu')(encoded)
    encoded = Dense(units=64, activation='relu')(encoded)
    decoded = Dense(units=128, activation='relu')(encoded)
    decoded = Dense(units=256, activation='relu')(decoded)
    decoded = Dense(units=dim, activation='sigmoid')(decoded)

    autoencoder = Model(input_img, decoded)
    autoencoder.compile(optimizer='adam', loss=loss)

    # Train the autoencoder to reconstruct its own input
    validation_split = 0.2
    autoencoder.fit(training_data, training_data,
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_split=validation_split,
                    verbose=0)
    return autoencoder

With that function defined, we trained the model:

autoencoder = train_and_evaluate(50, 16, 'mse', X_train)
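For comparison, the same one-class setup can be built with the convolutional layers mentioned above, which often capture spatial structure in images more efficiently than fully connected layers. The following is only a sketch of what that could look like for 256×256×3 inputs, not the architecture we settled on; it works directly on the unflattened X_train:

from keras.layers import Conv2D, MaxPooling2D, UpSampling2D

def build_conv_autoencoder(height=256, width=256, channels=3):
    """A small convolutional autoencoder sketch for one-class classification."""
    input_img = Input(shape=(height, width, channels))

    # Encoder: progressively downsample the image
    x = Conv2D(32, (3, 3), activation='relu', padding='same')(input_img)
    x = MaxPooling2D((2, 2), padding='same')(x)
    x = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
    encoded = MaxPooling2D((2, 2), padding='same')(x)

    # Decoder: upsample back to the original resolution
    x = Conv2D(16, (3, 3), activation='relu', padding='same')(encoded)
    x = UpSampling2D((2, 2))(x)
    x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
    x = UpSampling2D((2, 2))(x)
    decoded = Conv2D(channels, (3, 3), activation='sigmoid', padding='same')(x)

    conv_autoencoder = Model(input_img, decoded)
    conv_autoencoder.compile(optimizer='adam', loss='mse')
    return conv_autoencoder

# conv_autoencoder = build_conv_autoencoder()
# conv_autoencoder.fit(X_train, X_train, epochs=50, batch_size=16, validation_split=0.2)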

Testing the model

Now, here comes the fun part. We’ll use the following function to classify our images!

def classify_image(image_path, autoencoder, threshold=0.001,
                   target_height=256, target_width=256):
    """
    Classify an image based on its reconstruction error using an autoencoder.

    Args:
        image_path (str): Path to the image file to be classified.
        autoencoder (tensorflow.keras.models.Model): Trained autoencoder model
            used for image reconstruction.
        threshold (float, optional): Threshold for classification. Images with a
            Mean Squared Error (MSE) below this threshold are considered to belong
            to the class. Defaults to 0.001.
        target_height (int, optional): The target height for resizing the image.
            Defaults to 256.
        target_width (int, optional): The target width for resizing the image.
            Defaults to 256.

    Returns:
        tuple: A tuple containing:
            - mse (float): The Mean Squared Error (MSE) between the original and
              reconstructed image.
            - belongs (bool): True if the image belongs to the class based on the
              MSE, False otherwise.
    """
    image = load_img(image_path, target_size=(target_height, target_width))
    image_array = img_to_array(image) / 255.0

    # Reshape the image array to match the input shape of the autoencoder
    image_array = image_array.reshape(1, -1)

    # Use the autoencoder to predict the reconstructed image
    reconstructed_image = autoencoder.predict(image_array)

    # Calculate the Mean Squared Error (MSE) as the reconstruction error
    mse = np.mean(np.square(image_array - reconstructed_image))

    if mse < threshold:
        # Image belongs to the class
        belongs = True
    else:
        # Image does not belong to the class
        belongs = False

    return (mse, belongs)

And now, for the moment of truth – we test it on our data:

threshold = 0.05  # chosen after inspecting the MSE distributions (see the next section)

for image in os.listdir(image_folder):
    mse, belongs = classify_image(os.path.join(image_folder, image), autoencoder, threshold)
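To get from frame-level flags back to a video-level decision, one simple option – a sketch, not necessarily the production logic – is to flag a video whenever more than a small fraction of its sampled stills match the unsafe class:

def video_is_unsafe(frame_paths, autoencoder, threshold=0.05, max_flagged_fraction=0.1):
    """Flag a video if too many of its sampled frames match the unsafe class."""
    flags = [classify_image(path, autoencoder, threshold)[1] for path in frame_paths]
    return sum(flags) / len(flags) > max_flagged_fraction

# frames = sample_images('video.mp4', 'my_video_')   # hypothetical input video
# print(video_is_unsafe(frames, autoencoder))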

Tuning the threshold

If you were paying attention, you’ll have noticed a “threshold” mentioned very casually in the code above. What is that? Remember that a few sections ago we talked about how autoencoders can be used for one-class classification: we ask the model to encode and reconstruct an image, and then measure the reconstruction error. In this case, we use the Mean Squared Error (MSE) – the mean of the squared differences between the real image and the reconstructed image. If the error is large, the image likely doesn’t belong to the set of images we trained on, because the model wasn’t able to recreate it accurately. This is the concept we use to classify our images. But what counts as a “large error”? This is something we need to investigate: we want a threshold that is consistent with the data, keeping True Positives and True Negatives high and False Positives and False Negatives low. Of course, the right choice also depends on your business use case and its sensitivity to Type I and Type II errors.
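The snippet below shows how the two distributions can be inspected. It is a sketch: in_class_paths and out_of_class_paths are stand-ins for your own labeled evaluation images, and the candidate threshold line is just the value we discuss below.

import matplotlib.pyplot as plt

def mse_distribution(image_paths, autoencoder):
    """Reconstruction MSE for every image in a list of paths."""
    return [classify_image(path, autoencoder)[0] for path in image_paths]

# in_class_paths / out_of_class_paths: your own labeled evaluation image paths
in_class_mses = mse_distribution(in_class_paths, autoencoder)
out_of_class_mses = mse_distribution(out_of_class_paths, autoencoder)

plt.hist(in_class_mses, bins=50, alpha=0.5, label='belongs to class')
plt.hist(out_of_class_mses, bins=50, alpha=0.5, label='does not belong')
plt.axvline(0.05, color='red', linestyle='--', label='candidate threshold')
plt.xlabel('Reconstruction MSE')
plt.legend()
plt.show()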

This is the distribution of the MSEs for the images that belonged to the class:

And this is the distribution of the MSEs for the images that did not belong to the class:

As we can see, choosing a threshold is both a science and an art. A lower threshold will mark many images that do belong to the class as not belonging, and we risk not catching brand-unsafe content. On the other hand, a higher threshold will make us flag more inventory than strictly necessary. A threshold of 0.05 was a good choice in this case, capturing most of the images that did belong to the class while misclassifying very few images that didn’t.

An interesting exercise is to analyze the misclassified images. In our case, since many of the training images had a desert background, owing to the military campaigns in the Middle East, images of the desert were classified as having military content. To address this, we improved the training data by adding images from other contexts, such as the Vietnam War.

Conclusion

Brand safety in advertising demands meticulous attention to the context in which ads are displayed. Our challenge was to detect the context of potential ad placements in a cost-effective manner.

We can address this by employing Keras autoencoders for one-class classification of video frames. This approach significantly reduces data labeling requirements, since multi-label and multi-class models would require a wide variety of “brand safe” data.

Training an autoencoder for each sensitive category and establishing a threshold on its reconstruction error allows us to accurately identify content that doesn’t conform to brand values.

This solution offers significant potential for streamlining brand safety efforts in advertising, ensuring that ads are presented in alignment with a brand’s identity and mission.
