Detecting Brand-Unsafe Content Through Computer Vision

Victoria Seoane
February 25, 2024

In the realm of digital advertising, ensuring brand safety is a paramount concern. It involves identifying and safeguarding ad placements within content that aligns with a brand’s values and objectives and, sometimes more importantly, detecting content that doesn’t align with a brand’s message. In 2019, the Global Alliance for Responsible Media (GARM) released its brand safety and suitability standard, which helps brands communicate their particular needs in a shared language across the industry.

Business challenge

With this context in mind, we want to present a challenge brought to us by one of our clients: how do we identify brand-unsafe content in a large corpus of videos in a cost-effective way? Video analysis was cost-prohibitive, so we designed a solution that leveraged image and audio analysis. After all, videos are a large set of images overlaid with sound, right?

In this article, we’ll focus on the computer vision component of the analysis: how do we find, among other things, video scenes related to crime, weapons, terrorism, alcohol, drugs or pornography?

First step – designing the solution

The first thing we decided was to base our solution on a sample of video stills taken from the entire video. Even though this increases the chance of error, image analysis is less expensive than video analysis, so we could sample a larger number of images. For illustrative purposes, content moderation with AWS Rekognition video analysis costs $0.10 per minute of video, while image analysis costs $0.0008 per image, so we can sample 100 images per minute of video and still stay below the cost of video analysis.
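To make that trade-off concrete, here is a quick back-of-the-envelope calculation. The prices are the ones quoted above; adjust them, and the sampling rate, to your own setup:

import_cost_per_minute = 0.10    # video content moderation, USD per minute of video
image_cost_per_image = 0.0008    # image content moderation, USD per image

images_per_minute = 100          # our sampling budget per minute of video
sampled_cost_per_minute = images_per_minute * image_cost_per_image

print(f"Video analysis:   ${import_cost_per_minute:.4f} per minute of video")
print(f"Sampled images:   ${sampled_cost_per_minute:.4f} per minute of video")  # $0.0800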

The number of images to sample is a parameter to tune in order to balance model accuracy and overall cost. To sample these images, we wrote an OpenCV script that loads a video and extracts a subset of stills from it.

import cv2
import os


def sample_images(path, file_name, directory_to_store_at="/tmp"):
    """
    Capture frames from a video and save sampled images at specified intervals.

    Args:
        path (str): Path to the input video file.
        file_name (str): Base name for the output image files.
        directory_to_store_at (str, optional): Directory to save the sampled images.
            Defaults to "/tmp".

    Returns:
        List[str]: List of file paths to the sampled images.
    """
    images = []
    cam = cv2.VideoCapture(path)
    total_frames = int(cam.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = int(cam.get(cv2.CAP_PROP_FPS))
    video_duration = total_frames / fps

    # Calculate the interval (in minutes) to sample at
    # according to the video duration (adjust to business logic)
    if video_duration < 60:
        interval_minutes = 0.1
    else:
        interval_minutes = 1
    frame_interval = int(fps * 60 * interval_minutes)

    if not os.path.exists(directory_to_store_at):
        os.makedirs(directory_to_store_at)

    # Start sampling at frame 15 to avoid the initial black frames
    current_frame = 15
    while True:
        current_frame = current_frame + frame_interval
        cam.set(cv2.CAP_PROP_POS_FRAMES, current_frame)
        ret, frame = cam.read()  # read the frame at the current position
        if ret:
            # If there is video left, keep creating images
            file_id = file_name + 'frame' + str(current_frame) + '.jpg'
            name = os.path.join(directory_to_store_at, file_id)
            image_string = cv2.imencode('.jpg', frame)[1]
            with open(name, 'wb') as file:
                file.write(image_string.tobytes())
            images.append(name)
        else:
            break

    # Release all resources and windows once done
    cam.release()
    cv2.destroyAllWindows()
    return images

Once we have our images sourced from stills, how do we want to classify them? What model should we use? Our first thought was to leverage an idea similar to the one we presented in our previous blog post: classify images into all the different categories and add a “catch-all” category for safe content. However, this had several problems. For starters, an image could belong to several categories at once, making multi-class classification insufficient. Also, what would happen with images that belonged to none of the categories? The “safe” category would potentially be huge, and the amount of data we would need to train the models on everything that is actually safe would have been massive. On top of that, we would risk class imbalance and the problems it entails; for example, the model may become biased towards the majority class, leading to poor performance on the minority class.

To solve the multiple-label issue, we considered multi-label models, but the problem of how many examples of “safe” content we would need remained.

Because of this, we decided to tackle the problem from a new perspective. Instead of building one model to classify them all (sorry, easy joke), we built a set of one-class classification models, one for each of the classes we want to detect. One-class classification, closely related to anomaly detection, is a technique in which a model learns the distribution of a single class and flags inputs that deviate from it. Autoencoders, a type of neural network architecture, are particularly well suited for this task.

Let’s take a look into Autoencoders!

Autoencoders are a type of artificial neural network used in unsupervised machine learning. They are primarily designed for dimensionality reduction, feature learning, and data compression. Autoencoders have applications in a variety of domains, including image and signal processing, anomaly detection, and even natural language processing.

An autoencoder consists of two main parts: an encoder and a decoder. Here’s how they work:

Encoder: The encoder takes an input (which can be an image, a sequence of data, or any other form of structured data) and maps it to a lower-dimensional representation, often referred to as a “latent space” or “encoding”. This process involves a series of transformations and layers that capture essential features or patterns in the input data. The encoder’s goal is to reduce the dimensionality while preserving the most important information.

Latent Space: The latent space is a compressed representation of the input data, and it typically has a lower dimension than the original data. It serves as a bottleneck in the autoencoder architecture, forcing the model to capture the most critical features in a compact form.

Decoder: The decoder takes the latent space representation and reconstructs the input data from it. It consists of layers that perform the reverse operation of the encoder, attempting to generate an output that closely resembles the original input. The quality of the reconstruction is a measure of how well the autoencoder has learned to capture the essential features of the data.

Autoencoders are trained using a reconstruction loss, which measures the difference between the input and the output. The objective during training is to minimize this loss, effectively teaching the autoencoder to reproduce the input data as accurately as possible. In the case of one-class classification, if the reconstruction loss is high for a particular input, it suggests that the input does not fit the learned patterns, indicating that the image doesn’t belong to the class the model was trained on. Let’s use an example: imagine we train two autoencoders, one for detecting weapons and another for detecting drugs. We can then submit an arbitrary image to each autoencoder and compare the reconstruction errors. If the drugs autoencoder has a low reconstruction error, it is likely that the image is related to drugs.
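As a minimal sketch of that routing idea, the helper below scores an image against each per-category autoencoder and returns every category whose reconstruction error falls under its threshold. The model names and threshold values in the commented usage are hypothetical placeholders; the actual preprocessing, training and threshold tuning are covered in the following sections.

import numpy as np

def reconstruction_mse(autoencoder, image_array):
    """Mean squared error between an image and its reconstruction."""
    flat = image_array.reshape(1, -1)  # the Dense autoencoders built later in this post expect flattened input
    reconstructed = autoencoder.predict(flat)
    return float(np.mean(np.square(flat - reconstructed)))

def detect_categories(image_array, category_models, category_thresholds):
    """Return every category whose autoencoder reconstructs the image well.

    category_models: dict mapping category name -> trained autoencoder
    category_thresholds: dict mapping category name -> per-class MSE threshold
    """
    return [
        category
        for category, model in category_models.items()
        if reconstruction_mse(model, image_array) < category_thresholds[category]
    ]

# Hypothetical usage, once per-category models have been trained:
# detect_categories(image_array,
#                   {"weapons": autoencoder_weapons, "drugs": autoencoder_drugs},
#                   {"weapons": 0.05, "drugs": 0.05})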

Sourcing the necessary training data

It’s hard to find open source datasets on many of these topics, especially given their sensitivity, so it was crucial to create our own. We created eight different datasets – crime, arms, military scenes, terrorism, explicit content, sexy content, smoking images and drinking images – each with between 200 and 500 images. To do this, we used images licensed for commercial use from varied sources, such as Flickr, Google and Wikimedia. These were subject to a manual review by our team, to ensure that the source images were good representatives of what a brand-unsafe video could contain. For instance, an image of a soldier posing in uniform is related to military topics, but it is not what the GARM standard refers to as unsafe content. A personal aside: the data labeling step is a task best done as a team and in small doses, and if possible, looking at videos of adorable puppies in between.

Preparing the data for training

We’ll place all of our images in a folder, defined here as image_folder. We’ll also define an image height and width, which the autoencoder models require. It’s important to use an image size that preserves image quality without being too large, since larger images create larger models and require more memory to run. In our case, 256×256 pixels is a reasonable shape.

import os

import numpy as np
from keras.layers import Input, Dense
from keras.models import Model
from keras.preprocessing.image import load_img, img_to_array

# Set the path to the folder containing your images
image_folder = './training_images_military'

# Define image dimensions
target_height = 256
target_width = 256

# Initialize an empty list to store the preprocessed images
preprocessed_images = []

# Loop through the images in the folder
for image_filename in os.listdir(image_folder):
    # Construct the full path to the image file
    image_path = os.path.join(image_folder, image_filename)

    # Load the image using Keras' load_img function
    image = load_img(image_path, target_size=(target_height, target_width))

    # Convert the image to a NumPy array and normalize pixel values
    image_array = img_to_array(image) / 255.0

    # Append the preprocessed image to the list
    preprocessed_images.append(image_array)

# Convert the list of preprocessed images to a NumPy array
X_train = np.array(preprocessed_images)

Model training

We’ll now create a function to train our autoencoder.

In this case, we chose to use Dense layers, but other layer types are available, among them Conv2D, MaxPooling2D and UpSampling2D (a convolutional sketch is shown after the training call below). The architecture chosen was a combination of study and experimentation – figuring out which architecture yielded the best results while keeping model complexity reasonably low. The batch size and number of epochs were chosen to balance accuracy against overall model size – we wanted good models that could still run on machines with 256GB of memory.

def train_and_evaluate(epochs, batch_size, loss, training_data):
    """
    Train and evaluate an autoencoder model on the provided training data.

    Args:
        epochs (int): The number of training epochs.
        batch_size (int): The batch size for training.
        loss (str): The loss function to be used for training.
        training_data (numpy.ndarray): The training data, shaped
            (n_images, target_height, target_width, 3).

    Returns:
        tensorflow.keras.models.Model: Trained autoencoder model.
    """
    # The Dense autoencoder works on flattened images
    dim = target_height * target_width * 3
    training_data = training_data.reshape((len(training_data), dim))

    input_img = Input(shape=(dim,))
    encoded = Dense(units=256, activation='relu')(input_img)
    encoded = Dense(units=128, activation='relu')(encoded)
    encoded = Dense(units=64, activation='relu')(encoded)
    decoded = Dense(units=128, activation='relu')(encoded)
    decoded = Dense(units=256, activation='relu')(decoded)
    decoded = Dense(units=dim, activation='sigmoid')(decoded)

    autoencoder = Model(input_img, decoded)
    autoencoder.compile(optimizer='adam', loss=loss)

    # Train the autoencoder to reconstruct its own input
    validation_split = 0.2
    autoencoder.fit(training_data, training_data,
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_split=validation_split,
                    verbose=0)
    return autoencoder

With that function defined, we trained the model:

autoencoder = train_and_evaluate(50, 16, 'mse', X_train)
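For comparison, the same one-class setup can be built with the convolutional layers mentioned above, which often capture spatial structure in images more efficiently than fully connected layers. The following is only a sketch of what that could look like for 256×256×3 inputs, not the architecture we settled on; it works directly on the unflattened X_train:

from keras.layers import Conv2D, MaxPooling2D, UpSampling2D

def build_conv_autoencoder(height=256, width=256, channels=3):
    """A small convolutional autoencoder sketch for one-class classification."""
    input_img = Input(shape=(height, width, channels))

    # Encoder: progressively downsample the image
    x = Conv2D(32, (3, 3), activation='relu', padding='same')(input_img)
    x = MaxPooling2D((2, 2), padding='same')(x)
    x = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
    encoded = MaxPooling2D((2, 2), padding='same')(x)

    # Decoder: upsample back to the original resolution
    x = Conv2D(16, (3, 3), activation='relu', padding='same')(encoded)
    x = UpSampling2D((2, 2))(x)
    x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
    x = UpSampling2D((2, 2))(x)
    decoded = Conv2D(channels, (3, 3), activation='sigmoid', padding='same')(x)

    conv_autoencoder = Model(input_img, decoded)
    conv_autoencoder.compile(optimizer='adam', loss='mse')
    return conv_autoencoder

# conv_autoencoder = build_conv_autoencoder()
# conv_autoencoder.fit(X_train, X_train, epochs=50, batch_size=16, validation_split=0.2)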

Testing the model

Now, here comes the fun part. We’ll use the following function to classify our images!

def classify_image(image_path, autoencoder, threshold=0.001,
                   target_height=256, target_width=256):
    """
    Classify an image based on its reconstruction error using an autoencoder.

    Args:
        image_path (str): Path to the image file to be classified.
        autoencoder (tensorflow.keras.models.Model): Trained autoencoder model
            used for image reconstruction.
        threshold (float, optional): Threshold for classification. Images with a
            Mean Squared Error (MSE) below this threshold are considered to belong
            to the class. Defaults to 0.001.
        target_height (int, optional): The target height for resizing the image.
            Defaults to 256.
        target_width (int, optional): The target width for resizing the image.
            Defaults to 256.

    Returns:
        tuple: A tuple containing:
            - mse (float): The Mean Squared Error (MSE) between the original and
              reconstructed image.
            - belongs (bool): True if the image belongs to the class based on the
              MSE, False otherwise.
    """
    image = load_img(image_path, target_size=(target_height, target_width))
    image_array = img_to_array(image) / 255.0

    # Reshape the image array to match the input shape of the autoencoder
    image_array = image_array.reshape(1, -1)

    # Use the autoencoder to predict the reconstructed image
    reconstructed_image = autoencoder.predict(image_array)

    # Calculate the Mean Squared Error (MSE) as the reconstruction error
    mse = np.mean(np.square(image_array - reconstructed_image))

    if mse < threshold:
        # Image belongs to the class
        belongs = True
    else:
        # Image does not belong to the class
        belongs = False

    return (mse, belongs)

And now, for the moment of truth – we test it on our data:

threshold = 0.05  # chosen after inspecting the MSE distributions (see the next section)

for image in os.listdir(image_folder):
    mse, belongs = classify_image(os.path.join(image_folder, image), autoencoder, threshold)
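To get from frame-level flags back to a video-level decision, one simple option – a sketch, not necessarily the production logic – is to flag a video whenever more than a small fraction of its sampled stills match the unsafe class:

def video_is_unsafe(frame_paths, autoencoder, threshold=0.05, max_flagged_fraction=0.1):
    """Flag a video if too many of its sampled frames match the unsafe class."""
    flags = [classify_image(path, autoencoder, threshold)[1] for path in frame_paths]
    return sum(flags) / len(flags) > max_flagged_fraction

# frames = sample_images('video.mp4', 'my_video_')   # hypothetical input video
# print(video_is_unsafe(frames, autoencoder))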

Tuning the threshold

If you were paying attention, you’ll have noticed a “threshold” mentioned very casually in the code above. What is that? Remember that a few sections ago we talked about how autoencoders can be used for one-class classification: we ask the model to encode and reconstruct an image, and then measure the reconstruction error. In this case, we use the Mean Squared Error (MSE) – the mean of the squared differences between the real image and the reconstructed image. If the error is large, the image likely doesn’t belong to the set of images we trained on, because the model wasn’t able to recreate it accurately. This is the concept we use to classify our images. But what counts as a “large error”? This is something we need to investigate: we want a threshold that is consistent with the data, keeping True Positives and True Negatives high and False Positives and False Negatives low. Of course, the right choice also depends on your business use case and its sensitivity to Type I and Type II errors.
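The snippet below shows how the two distributions can be inspected. It is a sketch: in_class_paths and out_of_class_paths are stand-ins for your own labeled evaluation images, and the candidate threshold line is just the value we discuss below.

import matplotlib.pyplot as plt

def mse_distribution(image_paths, autoencoder):
    """Reconstruction MSE for every image in a list of paths."""
    return [classify_image(path, autoencoder)[0] for path in image_paths]

# in_class_paths / out_of_class_paths: your own labeled evaluation image paths
in_class_mses = mse_distribution(in_class_paths, autoencoder)
out_of_class_mses = mse_distribution(out_of_class_paths, autoencoder)

plt.hist(in_class_mses, bins=50, alpha=0.5, label='belongs to class')
plt.hist(out_of_class_mses, bins=50, alpha=0.5, label='does not belong')
plt.axvline(0.05, color='red', linestyle='--', label='candidate threshold')
plt.xlabel('Reconstruction MSE')
plt.legend()
plt.show()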

This is the distribution of the MSEs for the images that belonged to the class:

And this is the distribution of the MSEs for the images that did not belong to the class:

As we can see, choosing a threshold is both a science and an art. A lower threshold will mark many images that do belong to the class as not belonging, and we risk not catching brand-unsafe content. On the other hand, a higher threshold will make us flag more inventory than strictly necessary. A threshold of 0.05 was a good choice in this case, capturing most of the images that did belong to the class while misclassifying very few images that didn’t.

An interesting exercise is to analyze the misclassified images. In our case, since many of the training images had a desert background, owing to the military campaigns in the Middle East, images of the desert were classified as having military content. To address this, we improved the training data by adding images from other contexts, such as the Vietnam War.

Conclusion

Brand safety in advertising demands meticulous attention to the context in which ads are displayed. Our challenge was to detect the context of potential ad placements in a cost-effective manner.

We can address this by employing Keras autoencoders for one-class classification of video frames. This approach significantly reduces data labeling requirements, since multi-label and multi-class models would require a wide variety of “brand safe” data.

Training an autoencoder for each sensitive category and establishing a threshold on its reconstruction error allows us to accurately identify content that doesn’t conform to brand values.

This solution offers significant potential for streamlining brand safety efforts in advertising, ensuring that ads are presented in alignment with a brand’s identity and mission.
