Table of Contents

  1. Utilising ChatGPT
  2. ChatGPT’s Limitations
  3. Closing Thoughts

It seems that we are living in a time that will one day be recognised as the beginning of the AI explosion. AI tools are becoming more capable and more accessible. With the release of ChatGPT, concerns have been raised again about AI outperforming humans in our jobs, but I believe we have many years ahead of us before AI can fully design and develop our data products from start to finish.

Source: unsplash.com

What are we going to cover today?

  • Data engineers can use AI to generate boilerplate code and speed up development.
  • AI can write SQL and Python code, but it cannot fully comprehend how data development translates to the real world, which is a task only humans can perform.
  • AI does not understand the semantics of data.

Utilising ChatGPT

AI technology can help data engineers with a range of tasks, including diagnosing problems quickly, answering questions and performing data analysis. Asking it to find trends and patterns can also help data engineers understand their data better, surfacing things they might not see at first glance.

ChatGPT can be used to generate synthetic data for augmenting existing datasets. For example, a data engineer could generate realistic but fictitious customer data to supplement an existing dataset that is limited in size or scope. This could be particularly useful for testing purposes, such as testing new algorithms or models on a large volume of data, or when real data is limited, such as in cases where privacy concerns prevent the use of real data.
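To make the idea concrete, here is a minimal sketch of what fictitious customer data might look like when produced in code rather than by a prompt. It uses only the Python standard library. The field names, value pools and the rough link between age and income are all assumptions made for illustration, not anything ChatGPT produced.

```python
import random

# Hypothetical value pools, chosen only for illustration.
FIRST_NAMES = ["Alice", "Ben", "Chloe", "Dev", "Elena", "Farid"]
CITIES = ["London", "Leeds", "Bristol", "Glasgow"]


def generate_customers(n, seed=42):
    """Return n fictitious customer records as a list of dicts."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    customers = []
    for customer_id in range(1, n + 1):
        age = rng.randint(18, 80)
        # Assumed relationship: income loosely rises with age, plus noise.
        income = round(20000 + age * 500 + rng.gauss(0, 5000), 2)
        customers.append(
            {
                "customer_id": customer_id,
                "name": rng.choice(FIRST_NAMES),
                "age": age,
                "city": rng.choice(CITIES),
                "income": max(income, 0.0),  # never report a negative income
            }
        )
    return customers


if __name__ == "__main__":
    for row in generate_customers(5):
        print(row)
```

Seeding the generator keeps test fixtures stable between runs, which matters when the data feeds automated tests.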

By simply prompting ChatGPT to act as an analyst, you can give it some data and ask it to produce some analysis.

Here is an example where ChatGPT is asked to generate some fake customer data and to give some quick insights:

Fake customer data generated by ChatGPT

It then generated the code below as a good place to start with some exploratory data analysis and even gave detailed explanations of what each part of the code does. I then took some of the code to verify the results and visualise them.

ChatGPT generated code for exploratory data analysis
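The generated code itself is not reproduced here, so the following is only a rough sketch of the kind of starting point it describes: summary statistics plus a correlation check between age and income. The sample records are made up, and the helper names `summarise` and `pearson` are my own assumptions rather than ChatGPT's output.

```python
import statistics

# Hypothetical sample records; the values are made up for illustration.
customers = [
    {"age": 23, "income": 28000},
    {"age": 35, "income": 42000},
    {"age": 41, "income": 51000},
    {"age": 52, "income": 58000},
    {"age": 64, "income": 47000},
]


def summarise(values):
    """Basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    # Sum of the products of deviations from each mean.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cross / (spread_x * spread_y)


ages = [c["age"] for c in customers]
incomes = [c["income"] for c in customers]

print("age:", summarise(ages))
print("income:", summarise(incomes))
print("age vs income correlation:", round(pearson(ages, incomes), 3))
```

In practice you would more likely reach for a dataframe library and a plotting package, but the checks are the same: look at the distribution of each column, then at how the columns relate.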

This ultimately saves a lot of time spent checking package documentation to remind oneself about the package’s syntax. It is also a good starting point for exploring data. While the example given is quite basic, ChatGPT is a large language model capable of performing text-related tasks. For instance, it could label customer feedback as positive or negative based on the text’s sentiment. This would be useful for speeding up development as part of machine learning tasks.

Generating boilerplate code is one of the most common ways that ChatGPT will boost a data engineer’s productivity in starting development work. Let’s try asking ChatGPT how we can begin designing an API using an OpenAPI specification as someone new to API development.

Scatterplot of age vs income from ChatGPT generated code

Before getting to the code, it does an incredible job of providing some background reading on how to get started designing an API, which saves you from needing to find sources on the internet.

Now let’s ask it for some boilerplate code to work from.

ChatGPT generated OpenAPI specification

It provides a YAML file example of an OpenAPI specification and explains that it has a single GET endpoint, along with more information about the schema.
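For readers who have not seen one, a specification of that shape might look roughly like the sketch below. ChatGPT's answer was in YAML. Here the same structure is built as a Python dictionary and printed as JSON, which OpenAPI also accepts. The `/customers` path, the `Customer` schema and its fields are hypothetical and are not taken from the generated output.

```python
import json

# Hypothetical API: the path, schema name and fields are illustrative only.
spec = {
    "openapi": "3.0.3",
    "info": {"title": "Customer API", "version": "1.0.0"},
    "paths": {
        "/customers": {
            # The single GET endpoint.
            "get": {
                "summary": "List customers",
                "responses": {
                    "200": {
                        "description": "A list of customers",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "type": "array",
                                    "items": {
                                        "$ref": "#/components/schemas/Customer"
                                    },
                                }
                            }
                        },
                    }
                },
            }
        }
    },
    "components": {
        "schemas": {
            # Reusable description of one customer record.
            "Customer": {
                "type": "object",
                "required": ["id", "name"],
                "properties": {
                    "id": {"type": "integer"},
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
            }
        }
    },
}

print(json.dumps(spec, indent=2))
```

Keeping the schema under `components` and pointing to it with `$ref` means further endpoints can reuse the same definition.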


ChatGPT’s Limitations

ChatGPT can take in large amounts of data and respond intelligently to questions related to its content. Despite this impressive capability, it is still limited by its lack of creativity and human expertise, so decisions about complex data-dependent problems still require human involvement. Data engineering is also more than developing and maintaining systems that process data for downstream use cases: it includes data privacy, data security, data management and data architecture, which AI is not yet capable of handling by itself.

As every business has its unique way of storing and processing data, a data engineer would need to be familiar with the data model used, how the business works, and how to bring them together. The engineer has the responsibility to understand what the data means, how it should be structured, and how to map the meaning to the expected outcome.

ChatGPT essentially works by predicting the next word in a sentence. It can propose aggregations and optimise queries, but it cannot fully comprehend the meaning behind the data. The accuracy of its responses depends on how good its training datasets are, and there is a chance it might give you some unexpected (or completely false) results. It may even misclassify text data or generate incorrect insights based on patterns in the data, so it’s important to evaluate the results generated and validate them against real-world data.
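One simple way to do that validation, sketched below, is to compare the model's labels with a small sample that a person has labelled by hand. The feedback texts and both sets of labels are invented for illustration, and the functions `agreement_rate` and `disagreements` are hypothetical helpers rather than part of any library.

```python
def agreement_rate(predicted, expected):
    """Share of items where the model's label matches the human label."""
    if len(predicted) != len(expected):
        raise ValueError("label lists must be the same length")
    if not expected:
        raise ValueError("need at least one labelled example")
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


def disagreements(texts, predicted, expected):
    """Items a person should review: (text, model label, human label)."""
    return [
        (text, p, e)
        for text, p, e in zip(texts, predicted, expected)
        if p != e
    ]


# Hypothetical sample: invented feedback with invented labels.
feedback = [
    "Delivery was fast and the product works well",
    "The app keeps crashing on login",
    "Great, another update that breaks everything",
    "Support resolved my issue within minutes",
]
model_labels = ["positive", "negative", "positive", "positive"]
human_labels = ["positive", "negative", "negative", "positive"]

print("agreement:", agreement_rate(model_labels, human_labels))
for text, model, human in disagreements(feedback, model_labels, human_labels):
    print(f"review: {text!r} (model={model}, human={human})")
```

The sarcastic third comment is the kind of case where a label assigned from surface wording can go wrong, which is why the mismatches are listed for review rather than only counted.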


Closing Thoughts

AI chatbots would need an understanding of the real world to grasp how data turns into something meaningful, and this won’t be possible until we have artificial general intelligence. What AI will do in the meantime is improve a data engineer’s productivity and development speed, particularly in areas like data modelling and data analysis, and in getting a head start when programmer’s block strikes.
