IBM Research has unveiled new techniques for scaling the data processing pipeline used in enterprise AI training. The work is designed to speed up the creation of AI models, such as IBM’s Granite models, by drawing on the abundant capacity of CPUs.
Optimizing Data Preparation
Before training AI models, vast amounts of data must be prepared. This data often comes from diverse sources like websites, PDFs, and news articles, and must undergo several preprocessing steps. These steps include filtering out irrelevant HTML code, removing duplicates, and screening for abusive content. These tasks are critical, but because they can run on CPUs, they are not constrained by the availability of GPUs.
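The per-document steps described above can be illustrated with a minimal sketch, using only the Python standard library. The names `strip_html`, `is_acceptable`, `prepare_document`, and the one-word `BLOCKLIST` are hypothetical and chosen for illustration; they are not taken from IBM's pipeline, and a production screen for abusive content would be far more sophisticated than a word list. Deduplication is left out here because it has to compare documents against each other.

```python
from html.parser import HTMLParser
from typing import Optional


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(raw: str) -> str:
    """Remove markup and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Hypothetical placeholder; a real filter would use a curated list or a classifier.
BLOCKLIST = frozenset({"badword"})


def is_acceptable(text: str) -> bool:
    """Reject empty documents and documents containing a blocklisted word."""
    words = {w.strip(".,!?;:").lower() for w in text.split()}
    return bool(words) and words.isdisjoint(BLOCKLIST)


def prepare_document(raw: str) -> Optional[str]:
    """Return cleaned text, or None if the document should be dropped."""
    text = strip_html(raw)
    return text if is_acceptable(text) else None
```

Each call to `prepare_document` depends only on its own input, so the function can be applied to any number of documents independently.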
Petros Zerfos, IBM Research’s principal research scientist for watsonx data engineering, emphasized the importance of efficient data processing. “A large part of the time and effort that goes into training these models is preparing the data for these models,” Zerfos said. His team has been developing methods to enhance the efficiency of data processing pipelines, drawing expertise from various domains including natural language processing, distributed computing, and storage systems.
Leveraging CPU Capacity
Many steps in the data processing pipeline involve “embarrassingly parallel” computations, allowing each document to be processed independently. This parallel processing can significantly speed up data preparation by distributing tasks across numerous CPUs. However, some steps, such as removing duplicate documents, require access to the entire dataset and therefore cannot be split into independent per-document tasks in the same way.
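The contrast between the two kinds of steps can be sketched as follows. This is a simplified illustration rather than IBM's implementation: the function names are hypothetical, exact-match hashing stands in for whatever deduplication method the team uses, and a standard-library thread pool stands in for a real CPU cluster (for CPU-bound work in Python, a process pool or a distributed framework would take its place).

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def clean(doc: str) -> str:
    """Per-document step: normalize whitespace and case."""
    return " ".join(doc.split()).lower()


def clean_and_fingerprint(doc: str):
    """Per-document step: needs no knowledge of any other document."""
    cleaned = clean(doc)
    digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
    return digest, cleaned


def run_pipeline(docs, max_workers=4):
    # Embarrassingly parallel phase: every document is handled independently,
    # so the work can be spread over as many workers as are available.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(clean_and_fingerprint, docs))

    # Global phase: deciding whether a document is a duplicate requires
    # the fingerprints of all documents seen so far.
    seen = set()
    unique = []
    for digest, cleaned in results:
        if digest not in seen:
            seen.add(digest)
            unique.append(cleaned)
    return unique
```

In this sketch the expensive work (cleaning and hashing) stays in the parallel phase, and only the lightweight fingerprint comparison needs a view of the whole dataset.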
To accelerate IBM’s Granite model development, the team has developed processes to rapidly provision and utilize tens of thousands of CPUs. This approach involves marshalling idle CPU capacity across IBM’s Cloud datacenter network and ensuring high communication bandwidth between CPUs and data storage. Traditional object storage systems are often too slow to keep CPUs supplied with data, leaving them idle; to avoid this, the team used IBM’s high-performance Storage Scale file system to cache active data efficiently.
Scaling Up AI Training
Over the past year, IBM has scaled up to 100,000 vCPUs in the IBM Cloud, processing 14 petabytes of raw data to produce 40 trillion tokens for AI model training. The team has automated these data pipelines using Kubeflow on IBM Cloud. Their methods have processed data from Common Crawl 24 times faster than previous techniques.
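To put the reported figures in proportion, the following back-of-the-envelope calculation relates raw data volume to token count. It assumes decimal units (1 petabyte = 10^15 bytes) and that the 40 trillion tokens were produced from the full 14 petabytes; the article does not break the numbers down further.

```python
# Figures reported in the article; unit interpretation is an assumption.
RAW_BYTES = 14 * 10**15       # 14 petabytes, decimal
TOKENS = 40 * 10**12          # 40 trillion tokens

# Average amount of raw input consumed per token of training data.
raw_bytes_per_token = RAW_BYTES / TOKENS
print(f"{raw_bytes_per_token:.0f} raw bytes per token")
```

On these assumptions, roughly 350 bytes of raw input went into each token, which gives a sense of how much of the raw material is discarded or condensed by filtering, deduplication, and tokenization.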
All of IBM’s open-sourced Granite code and language models have been trained using data prepared through these optimized pipelines. IBM has also developed the Data Prep Kit, a toolkit hosted on GitHub. The kit streamlines data preparation for large language model applications, supporting pre-training, fine-tuning, and retrieval-augmented generation (RAG) use cases. Built on distributed processing frameworks like Spark and Ray, the kit allows developers to build scalable custom modules.
For more information, visit the official IBM Research blog.