New for Amazon Comprehend – Toxicity Detection

November 10, 2023

With Amazon Comprehend, you can extract insights from text without being a machine learning expert. Using its built-in models, Comprehend can analyze the syntax of your input documents and find entities, events, key phrases, personally identifiable information (PII), and the overall sentiment or sentiments associated with specific entities (such as brands or products).

Today, we are adding the capability to detect toxic content. This new capability helps you build safer environments for your end users. For example, you can use toxicity detection to improve the safety of applications open to external contributions such as comments. When using generative AI, toxicity detection can be used to check the input prompts and the output responses from large language models (LLMs).

You can use toxicity detection with the AWS Command Line Interface (AWS CLI) and AWS SDKs. Let’s see how this works in practice with a few examples using the AWS CLI, an AWS SDK, and to check the use of an LLM.

Using Amazon Comprehend Toxicity Detection with AWS CLI
The new detect-toxic-content subcommand in the AWS CLI detects toxicity in text. The output contains a list of labels, one for each text segment in input. For each text segment, a list is provided with the labels and a score (between 0 and 1).

Amazon Comprehend toxicity detection API

For example, this AWS CLI command analyzes one text segment and returns one Labels section and an overall Toxicity score for the segment between o and 1:

aws comprehend detect-toxic-content --language-code en --text-segments Text="'Good morning, it's a beautiful day.'"
{
    "ResultList": [
        {
            "Labels": [
                {
                    "Name": "PROFANITY",
                    "Score": 0.00039999998989515007
                },
                {
                    "Name": "HATE_SPEECH",
                    "Score": 0.01510000042617321
                },
                {
                    "Name": "INSULT",
                    "Score": 0.004699999932199717
                },
                {
                    "Name": "GRAPHIC",
                    "Score": 9.999999747378752e-05
                },
                {
                    "Name": "HARASSMENT_OR_ABUSE",
                    "Score": 0.0006000000284984708
                },
                {
                    "Name": "SEXUAL",
                    "Score": 0.03889999911189079
                },
                {
                    "Name": "VIOLENCE_OR_THREAT",
                    "Score": 0.016899999231100082
                }
            ],
            "Toxicity": 0.012299999594688416
        }
    ]
}

As expected, all scores are close to zero, and no toxicity was detected in this text.

To pass input as a file, I first use the AWS CLI --generate-cli-skeleton option to generate a skeleton of the JSON syntax used by the detect-toxic-content command:

aws comprehend detect-toxic-content --generate-cli-skeleton
{
    "TextSegments": [
        {
            "Text": ""
        }
    ],
    "LanguageCode": "en"
}

I write the output to a file and add three text segments (I will not show here the text used to show what happens with toxic content). This time, different levels of toxicity content has been found. Each Labels section is related to the corresponding input text segment.

aws comprehend detect-toxic-content --cli-input-json file://input.json
{
    "ResultList": [
        {
            "Labels": [
                {
                    "Name": "PROFANITY",
                    "Score": 0.03020000085234642
                },
                {
                    "Name": "HATE_SPEECH",
                    "Score": 0.12549999356269836
                },
                {
                    "Name": "INSULT",
                    "Score": 0.0738999992609024
                },
                {
                    "Name": "GRAPHIC",
                    "Score": 0.024399999529123306
                },
                {
                    "Name": "HARASSMENT_OR_ABUSE",
                    "Score": 0.09510000050067902
                },
                {
                    "Name": "SEXUAL",
                    "Score": 0.023900000378489494
                },
                {
                    "Name": "VIOLENCE_OR_THREAT",
                    "Score": 0.15549999475479126
                }
            ],
            "Toxicity": 0.06650000065565109
        },
        {
            "Labels": [
                {
                    "Name": "PROFANITY",
                    "Score": 0.03400000184774399
                },
                {
                    "Name": "HATE_SPEECH",
                    "Score": 0.2676999866962433
                },
                {
                    "Name": "INSULT",
                    "Score": 0.1981000006198883
                },
                {
                    "Name": "GRAPHIC",
                    "Score": 0.03139999881386757
                },
                {
                    "Name": "HARASSMENT_OR_ABUSE",
                    "Score": 0.1777999997138977
                },
                {
                    "Name": "SEXUAL",
                    "Score": 0.013000000268220901
                },
                {
                    "Name": "VIOLENCE_OR_THREAT",
                    "Score": 0.8395000100135803
                }
            ],
            "Toxicity": 0.41280001401901245
        },
        {
            "Labels": [
                {
                    "Name": "PROFANITY",
                    "Score": 0.9997000098228455
                },
                {
                    "Name": "HATE_SPEECH",
                    "Score": 0.39469999074935913
                },
                {
                    "Name": "INSULT",
                    "Score": 0.9265999794006348
                },
                {
                    "Name": "GRAPHIC",
                    "Score": 0.04650000110268593
                },
                {
                    "Name": "HARASSMENT_OR_ABUSE",
                    "Score": 0.4203999936580658
                },
                {
                    "Name": "SEXUAL",
                    "Score": 0.3353999853134155
                },
                {
                    "Name": "VIOLENCE_OR_THREAT",
                    "Score": 0.12409999966621399
                }
            ],
            "Toxicity": 0.8180999755859375
        }
    ]
}

Using Amazon Comprehend Toxicity Detection with AWS SDKs
Similar to what I did with the AWS CLI, I can use an AWS SDK to programmatically detect toxicity in my applications. The following Python script uses the AWS SDK for Python (Boto3) to detect toxicity in the text segments and print the labels if the score is greater than a specified threshold. In the code, I redacted the content of the second and third text segments and replaced it with ***.

import boto3

comprehend = boto3.client('comprehend')

THRESHOLD = 0.2
response = comprehend.detect_toxic_content(
    TextSegments=[
        {
            "Text": "You can go through the door go, he's waiting for you on the right."
        },
        {
            "Text": "***"
        },
        {
            "Text": "***"
        }
    ],
    LanguageCode='en'
)

result_list = response['ResultList']

for i, result in enumerate(result_list):
    labels = result['Labels']
    detected = [ l for l in labels if l['Score'] > THRESHOLD ]
    if len(detected) > 0:
        print("Text segment {}".format(i + 1))
        for d in detected:
            print("{} score {:.2f}".format(d['Name'], d['Score']))

I run the Python script. The output contains the labels and the scores detected in the second and third text segments. No toxicity is detected in the first text segment.

Text segment 2
HATE_SPEECH score 0.27
VIOLENCE_OR_THREAT score 0.84
Text segment 3
PROFANITY score 1.00
HATE_SPEECH score 0.39
INSULT score 0.93
HARASSMENT_OR_ABUSE score 0.42
SEXUAL score 0.34

Using Amazon Comprehend Toxicity Detection with LLMs
I deployed the Mistral 7B model using Amazon SageMaker JumpStart as described in this blog post.

To avoid toxicity in the responses of the model, I built a Python script with three functions:

  • query_endpoint invokes the Mistral 7B model using the endpoint deployed by SageMaker JumpStart.
  • check_toxicity uses Comprehend to detect toxicity in a text and return a list of the detected labels.
  • avoid_toxicity takes in input a list of the detected labels and returns a message describing what to do to avoid toxicity.

The query to the LLM goes through only if no toxicity is detected in the input prompt. Then, the response from the LLM is printed only if no toxicity is detected in output. In case toxicity is detected, the script provides suggestions on how to fix the input prompt.

Here’s the code of the Python script:

import json
import boto3

comprehend = boto3.client('comprehend')
sagemaker_runtime = boto3.client("runtime.sagemaker")

ENDPOINT_NAME = "<REPLACE_WITH_YOUR_SAGEMAKER_JUMPSTART_ENDPOINT>"
THRESHOLD = 0.2


def query_endpoint(prompt):
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 68,
            "no_repeat_ngram_size": 3,
        },
    }
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME, ContentType="application/json", Body=json.dumps(payload).encode("utf-8")
    )
    model_predictions = json.loads(response["Body"].read())
    generated_text = model_predictions[0]["generated_text"]
    return generated_text


def check_toxicity(text):
    response = comprehend.detect_toxic_content(
        TextSegments=[
            {
                "Text":  text
            }
        ],
        LanguageCode='en'
    )

    labels = response['ResultList'][0]['Labels']
    detected = [ l['Name'] for l in labels if l['Score'] > THRESHOLD ]

    return detected


def avoid_toxicity(detected):
    formatted = [ d.lower().replace("_", " ") for d in detected ]
    message = (
        "Avoid content that is toxic and is " +
        ", ".join(formatted) + ".n"
    )
    return message


prompt = "Building a website can be done in 10 simple steps:"

detected_labels = check_toxicity(prompt)

if len(detected_labels) > 0:
    # Toxicity detected in the input prompt
    print("Please fix the prompt.")
    print(avoid_toxicity(detected_labels))
else:
    response = query_endpoint(prompt)

    detected_labels = check_toxicity(response)

    if len(detected_labels) > 0:
        # Toxicity detected in the output response
        print("Here's an improved prompt:")
        prompt = avoid_toxicity(detected_labels) + prompt
        print(prompt)
    else:
        print(response)

You’ll not get a toxic response with the sample prompt in the script, but it’s safe to know that you can set up an automatic process to check and mitigate if that happens.

Availability and Pricing
Toxicity detection for Amazon Comprehend is available today in the following AWS Regions: US East (N. Virginia), US West (Oregon), Europe (Ireland), and Asia Pacific (Sydney).

When using toxicity detection, there are no long-term commitments, and you pay based on the number of input characters in units of 100 characters (1 unit = 100 characters), with a minimum charge of 3 units (300 character) per request. For more information, see Amazon Comprehend pricing.

Improve the safety of your online communities and simplify the adoption of LLMs in your applications with toxicity detection.

Danilo