Automate PII Redaction from Audio Files Using Node.js and AssemblyAI

Terrill Dicki  Jun 13, 2024 09:41  UTC 01:41

In the age of data privacy, redacting Personally Identifiable Information (PII) from audio and video files is a crucial task for many applications. A recent tutorial by AssemblyAI outlines how to automate this process using Node.js and the AssemblyAI API.

Understanding PII and Its Importance

PII includes any data that can be used to identify an individual, such as names, phone numbers, and email addresses. Handling this information is governed by regulations like HIPAA, GDPR, and CCPA. Redaction matters wherever recordings capture such details, for example a recorded phone conversation between a doctor and a patient.

Setting Up the Development Environment

To begin, ensure you have Node.js 18 or higher installed. Create a new project folder, navigate to it, and initialize a Node.js project:

mkdir pii-redaction
cd pii-redaction
npm init -y

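Modify the package.json file to use ES Module syntax by adding the top-level field "type": "module". The script below relies on this, since it uses import statements and top-level await. Next, install the AssemblyAI JavaScript SDK: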

npm install --save assemblyai

You'll need an AssemblyAI API key, which can be obtained from the AssemblyAI dashboard. Set this key as an environment variable on your system:

# Mac/Linux:
export ASSEMBLYAI_API_KEY=<YOUR_KEY>

# Windows (Command Prompt):
set ASSEMBLYAI_API_KEY=<YOUR_KEY>

# Windows (PowerShell):
$env:ASSEMBLYAI_API_KEY="<YOUR_KEY>"
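
A missing or empty key tends to surface only later, as an authentication error from the API, so it can help to check for it before creating the client. The following is a minimal sketch of such a check; requireEnv is a hypothetical helper name and is not part of the AssemblyAI SDK.

```javascript
// Hypothetical helper (not part of the AssemblyAI SDK): return the value of an
// environment variable, or throw a descriptive error if it is missing or blank.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (typeof value !== "string" || value.trim() === "") {
    throw new Error(`Environment variable ${name} is not set`);
  }
  return value;
}

// Possible usage in index.js:
// const client = new AssemblyAI({ apiKey: requireEnv("ASSEMBLYAI_API_KEY") });
```

Passing the environment as a second argument is only a convenience for testing; in normal use the default of process.env applies.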

Transcribing Audio with PII Redaction

With the environment set up, you can start transcribing audio files. Create a file named index.js and add the following code:

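import { AssemblyAI } from 'assemblyai';

// Read the API key from the environment variable set earlier.
const client = new AssemblyAI({ apiKey: process.env.ASSEMBLYAI_API_KEY });

// Submit the audio for transcription and wait for the result.
const transcript = await client.transcripts.transcribe({
  audio: "https://storage.googleapis.com/aai-web-samples/architecture-call.mp3",
  redact_pii: true,              // enable PII redaction in the transcript
  redact_pii_policies: [         // categories of PII to redact
    "person_name",
    "phone_number",
  ],
  redact_pii_sub: "hash",        // replace redacted text with # characters
});

// A failed transcription is reported via the status field.
if (transcript.status === "error") {
  throw new Error(transcript.error);
}

console.log(transcript.text);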

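This script transcribes an audio file while redacting the specified PII categories, here names and phone numbers. With redact_pii_sub set to "hash", each redacted span is replaced with hash characters (#), not with a cryptographic hash of the original text.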
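
If the same redaction settings are needed for several files, the options object can be built by a small function. The sketch below is one way to do that; buildRedactionParams is a hypothetical name, the field names mirror those in the script above, and the list of accepted substitution values is an assumption that should be checked against the AssemblyAI API reference.

```javascript
// Hypothetical helper: build the options object passed to
// client.transcripts.transcribe() for a text-redaction request.
function buildRedactionParams(audioUrl, policies, substitution = "hash") {
  if (!Array.isArray(policies) || policies.length === 0) {
    throw new Error("At least one PII policy is required");
  }
  // Assumption: "hash" and "entity_name" are the accepted substitution
  // values; verify against the AssemblyAI API reference before relying on it.
  const allowed = ["hash", "entity_name"];
  if (!allowed.includes(substitution)) {
    throw new Error(`Unsupported substitution: ${substitution}`);
  }
  return {
    audio: audioUrl,
    redact_pii: true,
    redact_pii_policies: [...new Set(policies)], // drop duplicate policies
    redact_pii_sub: substitution,
  };
}
```

The result could then be passed directly to the SDK, for example client.transcripts.transcribe(buildRedactionParams(url, ["person_name", "phone_number"])).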

Retrieving the Redacted Audio

To obtain the redacted audio, modify the code to include audio redaction settings:

import { AssemblyAI } from 'assemblyai';

const client = new AssemblyAI({ apiKey: process.env.ASSEMBLYAI_API_KEY });

const transcript = await client.transcripts.transcribe({
  audio: "https://storage.googleapis.com/aai-web-samples/architecture-call.mp3",
  redact_pii: true,
  redact_pii_policies: [
    "person_name",
    "phone_number",
  ],
  redact_pii_sub: "hash",
  // Also produce a redacted copy of the audio, encoded as MP3.
  redact_pii_audio: true,
  redact_pii_audio_quality: "mp3",
});

if (transcript.status === "error") {
  throw new Error(transcript.error);
}

console.log(transcript.text);

This configuration requests a redacted copy of the audio in MP3 format. To download that file, add the following code to the end of index.js (by convention, the import line goes at the top of the file with the existing import):

import { writeFile } from "fs/promises";

// Look up the URL of the redacted audio for this transcript.
const { redacted_audio_url } = await client.transcripts.redactions(transcript.id);

// Download the redacted audio and write it to disk.
const redactedFileResponse = await fetch(redacted_audio_url);
await writeFile("./redacted-audio.mp3", redactedFileResponse.body);
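
The download code above writes whatever the server returns, including an error page if the request fails. A more defensive variant might check the HTTP status first. The following is a sketch under that assumption; saveResponseToFile is a hypothetical helper name, and it relies only on the standard fetch Response object and Node.js file APIs.

```javascript
// Hypothetical helper: write the body of a fetch() Response to disk,
// refusing to save anything if the server returned a non-2xx status.
async function saveResponseToFile(response, filePath) {
  if (!response.ok) {
    throw new Error(`Download failed with HTTP ${response.status}`);
  }
  // The dynamic import keeps this sketch self-contained; in index.js the
  // static import of writeFile shown above works just as well.
  const { writeFile } = await import("node:fs/promises");
  // Buffers the whole file in memory, which is fine for short recordings.
  const bytes = Buffer.from(await response.arrayBuffer());
  await writeFile(filePath, bytes);
  return bytes.length; // number of bytes written
}
```

In index.js this could replace the final writeFile call, for example await saveResponseToFile(redactedFileResponse, "./redacted-audio.mp3").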

Executing the Script

Run the script in your shell:

node index.js

If successful, the console will display the redacted transcript, and the redacted audio will be saved as redacted-audio.mp3 in the project folder. The tutorial also provides an example of an unredacted transcript for comparison.

Conclusion

By following this tutorial, developers can efficiently redact PII from audio and video files using AssemblyAI and Node.js. For more details, visit the AssemblyAI blog.


