In recent years, generative AI has attracted widespread attention, and its development and use are progressing around the world. As a result, more and more people are asking what kind of environment they should prepare in order to start developing generative AI quickly and accelerate their team's efforts. In this article, we will introduce the best platform for generative AI, covering DevOps in the age of generative AI, how to build your own platform, and NVIDIA's ecosystem.
*This article is based on a lecture given at the Macnica Data・AI Forum 2024 Winter held in February 2024.
Generative AI and LLMOps
Generative AI is defined as a machine learning approach that generates new data and content by feeding existing data such as text, audio, and images into a foundation model. In recent years, generative AI, as exemplified by LLMs, has begun to bring innovation to business as a new way of solving problems.
LLM is an abbreviation for "Large Language Model." The diagram below shows how an LLM works: when a sentence is input, it goes through steps such as tokenization and context understanding, and ultimately returns a natural, human-like response. Examples of its use include chatbots, search engines, and text generation, but customization is essential for business use, which is one of the challenges.
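As a rough illustration of this tokenize-then-generate flow, here is a minimal Python sketch using the Hugging Face transformers library. The small, publicly available gpt2 checkpoint is used purely as an example model, not a recommendation:

```python
# A minimal sketch of the LLM flow described above: tokenize the input,
# let the model generate a continuation, and decode it back into text.
# Assumes the Hugging Face "transformers" library; "gpt2" is used only
# because it is a small, publicly available example model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tokenization: the input sentence becomes a sequence of token IDs.
inputs = tokenizer("Generative AI is", return_tensors="pt")

# Context understanding + generation: the model extends the sequence.
outputs = model.generate(**inputs, max_new_tokens=20)

# Decoding: token IDs are converted back into natural-language text.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```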
Broadly speaking, there are three paths to actually using generative AI.
The first is to use existing, widely available services built on LLMs, such as ChatGPT and Google Bard.
The second and third paths both involve customizing an LLM, but at different scales. The second is small-scale customization, such as extending the LLM with in-house data (for example, PDF documents) through RAG (retrieval-augmented generation), or adjusting its behavior with prompt engineering; a minimal RAG sketch follows below. The third is large-scale customization, such as fine-tuning or developing your own foundation model.
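The Python sketch below shows the basic RAG idea behind the second path: retrieve the most relevant in-house document for a question and prepend it to the prompt. TF-IDF retrieval from scikit-learn stands in purely for illustration; a production system would typically use an embedding model and a vector database, and the send_to_llm call mentioned at the end is a hypothetical placeholder.

```python
# A minimal RAG sketch: retrieve the most relevant internal document
# and prepend it to the prompt before calling the LLM.
# TF-IDF (scikit-learn) stands in for a real embedding model plus a
# vector database; send_to_llm() is a hypothetical placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Expense reports must be submitted by the 5th of each month.",
    "VPN access requires approval from the security department.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def build_prompt(question: str) -> str:
    # Retrieve: rank the documents by similarity to the question.
    q_vec = vectorizer.transform([question])
    best = cosine_similarity(q_vec, doc_vectors).argmax()
    # Augment: prepend the retrieved context to the prompt.
    return f"Context: {documents[best]}\n\nQuestion: {question}"

prompt = build_prompt("When are expense reports due?")
print(prompt)
# The resulting prompt would then go to the LLM, e.g. send_to_llm(prompt).
```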
Achieving these goals can take anywhere from several months to years, so if you want to move forward with the development and use of generative AI, it's important to consider what method is most suitable.
The general AI development and operation lifecycle begins with data collection and preprocessing, followed by model development (selection), training, evaluation, and deployment, and ends with operation and monitoring. It is also important to take steps to improve accuracy, such as iterating the process from data collection through model evaluation, and continuing to monitor accuracy and retrain even after operation and monitoring have begun.
Meanwhile, the LLM development and operations lifecycle (LLMOps) starts with setting a development policy: deciding what needs to be done and at what cost. Development then begins from whichever phase best suits the company, broadly divided into three: building a foundation model, fine-tuning for specific tasks, and integrating knowledge from proprietary data. Trial and error is repeated until sufficient accuracy is achieved. Once development is complete, the project moves to the production phase and the service is provided.
The diagram below shows the elements necessary to realize LLMOps, broadly divided into two phases: the "development phase" and the "production phase."
The goal in the development phase is to achieve the expected responses and accuracy even when using your own company's data. This requires management techniques for improving accuracy through experimentation, such as visualizing and inspecting the LLM execution flow and analyzing LLM inputs and outputs.
The production phase, on the other hand, calls for stable service operation and feature expansion through model updates, which in turn require reliable operation and management methods such as model and service monitoring, data analysis, and anomaly detection.
There is a lot to consider in each phase, but if all phases can be managed from a common foundation, effective LLMOps can be achieved.
The diagram below shows a use case in which we implemented LLMOps in our own environment. Here, we performed experiment management for prompt engineering, feeding prompts into the base model and adjusting it while examining the output. We used a tool called Weights & Biases because it lets us trace the LLM chain and covers the elements in the upper right of the diagram.
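As an illustration of this kind of prompt-engineering experiment management, the sketch below logs prompt/response pairs to Weights & Biases using its standard Python API. The project name and the call_llm helper are hypothetical stand-ins, not part of our actual setup:

```python
# A minimal sketch of logging prompt-engineering experiments to
# Weights & Biases. The project name is hypothetical, and call_llm()
# is a placeholder for whatever base model is being tested.
import wandb

def call_llm(prompt: str) -> str:
    return "..."  # placeholder: query the base model here

run = wandb.init(project="prompt-engineering-trials")  # hypothetical project
table = wandb.Table(columns=["prompt", "response"])

for prompt in ["Summarize our Q3 report.", "List key risks in the report."]:
    table.add_data(prompt, call_llm(prompt))

# Logged tables make each prompt/output pair inspectable in the W&B UI.
run.log({"prompt_experiments": table})
run.finish()
```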
We believe that even if you start with experiment management, as in this use case, having a foundation on which you can also plan management methods for the shift to production leads to consistent LLMOps. In fact, many aspects of this experiment management can be reused in operations as well.
Generative AI development platform
From here on, I would like to talk about the generative AI development platform, including the infrastructure and frameworks required for actual development. We see this as more than just infrastructure (hardware and software): it extends to the platform and application layers that can run LLMOps and carry generative AI from development through to operation.
There are two points to consider when building a generative AI platform: "Does it have infrastructure that matches what you want to achieve?" and "Does it provide an environment for realizing generative AI?" Ultimately, the ideal is a platform that can effectively run the development lifecycle.
When it comes to infrastructure, a common question is: "Is on-premises or cloud better?"
On-premises is a good choice when, for example, you want to secure large-scale resources for development or cannot send data outside the company. The cloud, on the other hand, can be used on demand, making it suitable for cases such as using existing LLM services or running mainly experiments.
However, the availability of GPU resources has become a major issue in recent years. For example, it can take a long time to obtain GPU resources in the cloud, or fewer may be available than expected. In such cases, on-premises may be the better option.
Above all, when talking about platforms for LLM development, the starting point is an environment that can actually run an LLM. We therefore prepared such an environment and conducted an experiment running an LLM base model.
The server used for this experiment was a DGX-2 (V100 x16) running the 7-billion-parameter Llama-2 model. The perceived performance varied greatly with the number of GPUs: inference felt quite heavy with 4 to 8 GPUs, improved somewhat with 12, but still felt slow. From this we learned how important it is to prepare, in advance, an environment in which the base model runs smoothly.
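For reference, here is a minimal sketch of loading a Llama-2-class model across multiple GPUs with Hugging Face transformers. It assumes the gated meta-llama/Llama-2-7b-hf checkpoint (which requires accepting Meta's license) and the accelerate library; it is a sketch, not the exact setup used in our experiment:

```python
# A minimal sketch of loading a 7B-parameter Llama-2 model across the
# available GPUs. Assumes the "transformers" and "accelerate" libraries
# and access to the gated meta-llama/Llama-2-7b-hf checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" shards the weights across whatever GPUs are visible,
# so the same script runs on 4, 8, 12, or 16 GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision to fit on V100 memory
    device_map="auto",
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```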
Next, we move on to the actual platform. First, consider Kubernetes, which enables a cloud-native approach even on-premises. It has recently begun to be used in the generative AI field because it can manage resources centrally, scale, and distribute resources to users by orchestrating containers. Although Kubernetes has many advantages, it also has issues, shown in the bottom right of the figure, that prevent it from using GPU resources efficiently on its own; a sketch of a basic GPU request follows below.
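The following sketch shows how GPU resources are handed out to containers on Kubernetes, using the official kubernetes Python client. It assumes the NVIDIA device plugin is installed (it exposes the nvidia.com/gpu resource); the pod and image names are hypothetical examples:

```python
# A minimal sketch of requesting a GPU for a container on Kubernetes,
# using the official "kubernetes" Python client. Assumes the NVIDIA
# device plugin is installed; pod and image names are hypothetical.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-worker"),  # hypothetical name
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="nvcr.io/nvidia/pytorch:24.01-py3",  # example image
                resources=client.V1ResourceRequirements(
                    # Plain Kubernetes allocates GPUs only in whole units,
                    # which is one source of the inefficiency noted above.
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Because nvidia.com/gpu can only take integer values here, a container that needs a fraction of a GPU still occupies a whole one; this is exactly the gap that GPU-sharing solutions address.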
Run:ai is a solution for using GPU resources efficiently, whether on-premises or in the cloud. Run:ai slices a single GPU so that multiple containers can share its resources, eliminating the waste that occurs when a single workload cannot fully utilize a GPU.
Also, even when idle GPUs are visible, managing them becomes a bottleneck, and as a result resources can still go to waste. Run:ai's scheduler, which can hand out idle GPUs as surplus capacity, is effective here. Furthermore, because it provides functions for visualizing and managing resources, it can accelerate the development of LLMs that consume GPU resources at scale.
Another option for developing LLMs is NeMo, which NVIDIA describes as a "complete solution for building large-scale language models." NeMo offers capabilities such as natural language processing, automatic speech recognition, and speech synthesis, and is characterized by covering the full series of steps from data creation through model development to experiments for deployment. In addition, when developing and operating an LLM, boundaries must be set according to the use case; NeMo Guardrails can be used to operate more safely, as the sketch below illustrates.
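Here is a minimal sketch of wrapping an LLM with the nemoguardrails Python package. The ./config directory containing the rails definition (a config.yml plus Colang flows) is assumed to exist and is not shown:

```python
# A minimal sketch of applying NeMo Guardrails around an LLM.
# Assumes the "nemoguardrails" package and a ./config directory
# containing the rails definition (config.yml and Colang flows).
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")  # assumed rails definition
rails = LLMRails(config)

# The guardrails layer checks the exchange against the configured
# boundaries before and after the underlying LLM is called.
response = rails.generate(messages=[
    {"role": "user", "content": "Tell me about our product lineup."}
])
print(response["content"])
```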
Summary
There are two key points to realizing generative AI. The first is to effectively run the development lifecycle of generative AI. The second is to build a platform suitable for developing generative AI.
Macnica offers solutions that allow customers to realize the ideal generative AI in an environment that suits their company. We hope to be of service to you through these solutions, so if you are interested, please feel free to contact us.