To keep pace with the growing scale of its development work in recent years, Yokogawa Electric Corporation (hereinafter "Yokogawa Electric") decided to analyze a variety of data, including development management data generated during the development process, quality control data for released products, and SaaS usage logs. This article follows the company's development team as they considered introducing an integrated data utilization platform, built a data pipeline to bring together data scattered throughout the company, and worked toward an environment in which every member can make use of data.
*This article is based on a lecture given at the Macnica Data・AI Forum 2024 Winter held in February 2024.




Databricks dramatically changed our relationship with data
Yamamoto: Today we're going to ask Yokogawa Electric, which successfully implemented centralized data management through the introduction of a data infrastructure, about the factors that led to this success. First, Mr. Kuwata, please explain this initiative to us.
Kuwata: We had four objectives behind introducing a data infrastructure. The first was to consolidate data managed on a project-by-project basis. The second was to reduce the burden of data processing for analysis and visualization. The third was to automate data updates, which were often done manually. The fourth was to visualize necessary data on a dashboard and use it for decision-making at the development site.
Before introducing the platform, four issues had been raised regarding the data sources to be analyzed. Data was managed by individuals or by project, so storage locations and file formats varied widely, and multiple similar copies of the same data existed. In addition, tool integration was required to access the storage locations, but approval had to be sought from the information systems department each time, and some requests were rejected over internal security concerns. To resolve this dispersion of data and complicated management, the company began planning to build a data pipeline.
This is the schedule of activities leading up to building our data pipeline.
We started our activities in April 2022, and until October we investigated data connectors and data collection methods for building a data pipeline, but struggled to make progress. The turning point came in November, when Macnica introduced us to Databricks, a SaaS-based data platform that covers everything from data collection to visualization. After a PoC of about one month to verify the benefits, such as whether our existing analysis results could be reproduced, we officially adopted Databricks in January 2023.
After the implementation, we continued to receive support from Macnica and started by building a development workspace. It was eventually decided to use it in a new project, where we analyzed and visualized development data, created a dashboard, and provided it to management. From April 2023, we also began working with existing projects and took on the challenge of cross-departmental analysis and visualization.
In order to stimulate data analysis within the organization, we have also provided training and hands-on sessions on the basics of data analysis and how to use Databricks. In this initiative, we built a workspace for training and established operational rules that users must follow. We asked members who participated in the PoC and Macnica to provide hands-on training.
When building the data pipeline, we struggled with connecting data sources and aggregating data, problems we had faced since the beginning of our activities. We tried to implement this with a commercially available tool, but were blocked by our company's security requirements and had to give up.
On the other hand, one factor that went well was that many of the team members were software engineers, so we were able to change course from using commercial tools to developing our own, and we were able to quickly respond to various issues using Databricks. I think it was also a positive factor that we had personnel in the company with some knowledge of AWS, SQL, and Python.
We are currently using the data pipeline we created for both new and existing development projects. For new projects, we can turn the development progress into a dashboard to understand delays and the amount of work allocated to each task. For existing projects, we use failure data to analyze and visualize the response status.
As a result, we were able to obtain various benefits from building a data pipeline. Since a built pipeline is meaningless if it is not used, we plan to continue our activities in the future to make data analysis and visualization more familiar within the organization.
The journey of building a data pipeline
Yamamoto: From here on, we will divide the issues and solutions related to data into three themes and dig deeper in the form of a dialogue. I would like to hear more about the initial issues and improvements that have been made, including what Mr. Kuwata just mentioned, as well as future operations and training, and about organizational building.
Issues to be overcome before data analysis can be implemented
Yamamoto: First, I would like to talk about the difficulties you faced in going from the data source to being able to analyze it. Mr. Fujiwara, please.
Fujiwara: There were few issues with log data produced by systems. With data created by people, however, we had to ask the relevant parties where the necessary data was located, and we could not predict how long it would take to check the contents. Since each person had created the data for their own purposes, it was certainly valuable, but the challenge was how to organize it so that its value could be realized.
Yamamoto: So the problem was that once people intervened, unique rules and ways of handling data emerged, and it took a long time to find the data you needed. In Yokogawa Electric Corporation's case, this was perhaps even more pronounced because a huge number of projects were running in parallel. And even after finding the necessary data, there were challenges in cleansing and processing it. Mr. Kuwata, as the person in charge of implementation, what are your thoughts on this?
Kuwata: For cleansing, different tools were used depending on the file format, so it took time to get started, and the number of people who could handle it was limited, which also complicated matters. On top of that, processing the ever-increasing volume of data could take a whole day.
Fujiwara: This is what we call the personalization of work. It was particularly problematic when Excel files had been built with VBA by someone who had since left the company.
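The cleansing problem Kuwata describes, with different tools for different file formats, essentially comes down to normalizing heterogeneous sources into one shared schema. Below is a minimal Python sketch of that idea; the file contents, delimiters, and column names are hypothetical, not Yokogawa Electric's actual data.

```python
import csv
import io

# Hypothetical excerpts of two project files: one comma-separated, one
# tab-separated, with differently named columns for the same information.
CSV_DATA = "id,defect,status\n1,leak,open\n2,crack,closed\n"
TSV_DATA = "ID\tIssue\tState\n3\tnoise\topen\n"

# Map every source-specific column name onto one shared schema.
COLUMN_MAP = {
    "id": "id", "defect": "issue", "status": "status",
    "ID": "id", "Issue": "issue", "State": "status",
}

def normalize(text: str, delimiter: str) -> list:
    """Parse delimited text and rename columns to the shared schema."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [{COLUMN_MAP[k]: v for k, v in row.items()} for row in reader]

records = normalize(CSV_DATA, ",") + normalize(TSV_DATA, "\t")
print(records)  # all three rows now share the keys: id, issue, status
```

Once every source lands in one shape like this, a single downstream process can handle analysis and visualization, instead of one tool per format.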
Introducing a data platform
Yamamoto: So to solve this situation, you worked on building a data pipeline and unifying its management. The second theme, the introduction of a data infrastructure, is something Macnica assisted with, so I'll ask Mr. Ota to explain it here.
Ota: This slide shows a comparison between a traditional data infrastructure and a SaaS-based data infrastructure. The one Yokogawa Electric Corporation introduced this time is the latter.
First, with a conventional data infrastructure, it was necessary to estimate the amount of data to be accumulated, calculate the machine specs and number of machines needed to handle it, and draw up a construction plan. Furthermore, an application for the environment had to be submitted to the information systems department, various components installed, and multiple connection tests conducted, so it took several weeks before the data infrastructure could actually be used.
The SaaS platform, by contrast, was very simple: we prepared a user with administrator privileges and a user for creating the environment in the AWS cloud, and then had them create workspaces and compute environments through the GUI. This kind of GUI-driven construction is possible because a mechanism based on AWS CloudFormation runs in the background and automatically creates the necessary resources.
This time, we provided a procedure manual and asked Yokogawa Electric Corporation's staff to build the system based on it, and since the staff had knowledge of the cloud, the process went very smoothly. The construction itself was completed in a day, so we were able to focus on the original purpose: building the data pipeline and performing analysis.
Yamamoto: Yokogawa Electric Corporation had a number of challenges, but what were the benefits of actually implementing the SaaS platform?
Kuwata: The problem of taking time to find the data we needed was solved by the data catalog function built into the platform, and Python and SQL, which previously required separate dedicated environments, can now both be used in a single notebook. Regarding data collection, we were introduced to a feature called Auto Loader, which automatically loads only newly arrived data and makes it available for subsequent analysis. Doing this normally requires custom Python programming, so being able to do it so easily was a great help.
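Auto Loader itself runs inside Databricks (it is configured via `spark.readStream.format("cloudFiles")` with a checkpoint location), but the idea Kuwata highlights, ingesting only files that have not been seen before, can be sketched in plain Python. The checkpoint format and directory layout below are illustrative, not the real Auto Loader internals.

```python
import json
import pathlib
import tempfile

def load_new_files(src: pathlib.Path, checkpoint: pathlib.Path) -> list:
    """Return only the CSV files that have appeared since the last run,
    recording what has been seen in a simple JSON checkpoint file."""
    seen = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    new = sorted(p.name for p in src.glob("*.csv") if p.name not in seen)
    checkpoint.write_text(json.dumps(sorted(seen | set(new))))
    return new

# Demo in a temporary directory standing in for a landing zone such as S3.
src = pathlib.Path(tempfile.mkdtemp())
ckpt = src / "checkpoint.json"
(src / "day1.csv").write_text("a,b\n1,2\n")
first = load_new_files(src, ckpt)   # first run picks up day1.csv
(src / "day2.csv").write_text("a,b\n3,4\n")
second = load_new_files(src, ckpt)  # second run loads only the new file
print(first, second)
```

The second call returns only the file added after the first run, which is the "load only the differences" behavior described above; Auto Loader does the same bookkeeping automatically and at scale.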
Yamamoto: So the introduction of the platform led to the development of a catalog, and data that was previously scattered across projects and individuals is now described in a common language, creating a clean analytical environment. As the team supporting our customers, we are very pleased about this.
To operate the data infrastructure
Yamamoto: In order for the data infrastructure to demonstrate its true value, it is essential to establish subsequent operations and analysis, but I think many customers have difficulty with this. So, in the third topic, we will talk about important elements in terms of operations after implementation. One of these is the "catalog management" function mentioned earlier. Mr. Ota, please explain it.
Ota: One important element in maintaining and operating a data infrastructure is to think through appropriate data management in advance. For example, we issue new accounts to people who request access to the data infrastructure, and we establish rules up front for how they use the analysis environment, such as which areas they should work in.
More specifically, when using sample data for ad-hoc analysis or to build an AI model, users work in the learning catalog, which holds only data that anyone is allowed to see.
There are also concepts such as development and production catalogs. The former stores data generated during the development process or during investigations of problems that occur in production, while the latter stores data generated by jobs that are actually running. In some cases, sensitive data that can only be accessed by specific members is also stored in this production catalog. In other words, it is important to strengthen access control and implement security measures for the production catalog.
The administrator catalog at the bottom of the slide stores server logs, application logs, audit logs, and other data generated by using the data infrastructure. By dividing the data infrastructure by purpose in this way, it will be easier to use even if the number of users increases, and we believe this will lead to expanded data utilization.
Yamamoto: So, in order for anyone to use it easily, it is necessary to decide which catalog to use depending on the characteristics of the data and the department and position of the user who will access it. I think Yokogawa Electric Corporation is already putting this into practice to some extent, but what do you think?
Kuwata: Databricks is a SaaS platform, so it has become easier to provide users with an analysis environment. Since there is no need to issue licenses, as would normally be required, it is a big advantage that the environment can be provided immediately to anyone who wants to use it. Currently, most of our activity is in the development catalog, but we also have a catalog for learning purposes. Since this is used by a variety of people, we have established certain rules, such as a fixed naming convention.
There are no particular rules for the development catalog, but since it is used on a project-by-project basis, we operate it in a way that always includes the project name. We have not yet prepared a production catalog, but we would like to implement it in the future.
Yamamoto: As data utilization advances, the number of users will also increase, so there is a risk that the rules will become complicated. In order to mitigate this, I think that the catalog function for each purpose will become very important.
So far we have talked mainly about mechanisms and functions, but it is also very important to develop people who can use them in the field. For example, since each department uses different tools, such as infrastructure and cloud in the information systems department and programming in the DX promotion department, it is not rare for individuals to acquire different skills. Yokogawa Electric, on the other hand, seems to be making good progress in building a system for this. Could you tell us the key points?
Fujiwara: We started gradually building up knowledge of the cloud, data analysis, and programming about four years ago. To build cloud knowledge, we implemented a wide range of initiatives across the organization, both paid and free, such as workshops and training. People who were good at it, or who wanted to try, raised their hands, and we asked those highly motivated and proactive people to become members of the cloud CoE (Center of Excellence).
In programming and data analysis, we had people learn Python and SQL, making good use of online learning courses. We are being featured here as a success story, but this takes time, and it did not go smoothly without effort. As Kuwata explained, I strongly feel that the smooth construction of the data infrastructure was possible because the CoE had been established and because there were people who had been steadily studying Python and SQL. I also think leaders like us need to take the initiative in learning ourselves, rather than just giving instructions to our members.
Yamamoto: Were there people in various departments who said they wanted to do this?
Fujiwara: Yes, that's right. We were also conducting technological exploration activities as an organization.
Yamamoto: Rather than concentrating technical experts in a specific organization, you were able to instill a culture of data utilization across a wide range of departments over a long period of four years. I think this owes much to the company's underlying culture.
Fujiwara: It was a small start, but once we started, people who had heard about it started coming together. I think it was a good thing that we were able to provide a forum for that.
Summary
Yamamoto: This time we discussed three themes: challenges on the way to data analysis, introducing a data infrastructure, and operating the data infrastructure. In particular, Mr. Fujiwara's point that human resource development takes time, but is ultimately essential to advancing data analysis and utilization, stands out. I think data collection and the other activities at Yokogawa Electric went so smoothly because of this great advantage in human resources. Mr. Ota, could you tell us about the technology used in this initiative?
Ota: This time, Yokogawa Electric adopted Databricks and built it on the AWS cloud. This allowed data to be retrieved from various sources via file copies and REST APIs, and imported into Databricks by synchronizing with S3 buckets.
Next, by building a data pipeline, we were able to automate the collection, processing, storage, and visualization of data. In doing so, we were able to utilize our programming skills in Python and SQL. We believe that the fact that the company had trained personnel with the necessary skills to use tools and technology in this way led to this successful experience.
Yamamoto: Finally, Mr. Fujiwara, could you tell us about your future prospects?
Fujiwara: We've put a lot of effort into creating this data infrastructure, so I'd like to see more and more users use it to their full potential. To that end, we want to create better guideline policies, but rather than restricting users, we want to create rules for safe use. We're also trying to create many dashboards in our development projects. To make the most of these, we want everyone in each role, including project managers, leaders, and members, to look at the dashboards every day, so that they can be useful in making decisions.