
Not There Yet: ChatGPT’s Data Analysis Tool (A Railroad Safety Data Example)

Large Language Models (LLMs) have historically struggled with numerical data, primarily because they are, at their core, language models. But OpenAI has aspirations to move into the business intelligence and analytics spaces. One route is ChatGPT's Advanced Data Analysis tool.

We continually monitor updates to generative AI models to see how they progress. For this test, we gave the data analysis tool some very basic safety incident data from the Federal Railroad Administration. We have worked extensively with railroads to improve safety for more than a decade, so we are very familiar with this data.

We uploaded annual data for reportable injury rates and highway-rail incident rates for each of the Class I (major) railroads. We then interrogated the tool in a conversational manner. The text of the chat, including its descriptions of the graphs it produced, can be seen below.

 

<Start of chat session>
You
Spreadsheet uploaded: Reportable Injuries Class I.xlsx
This is data from the Federal Railroad Administration (FRA) on reportable injury rates for the Class I (major) railroads.
GPT
Data Analysis

Great! You’ve uploaded a file containing data on reportable injury rates for Class I railroads from the Federal Railroad Administration. Let’s start by opening the file to understand its structure and contents. I’ll then be able to assist you further with any analysis or questions you might have.

The dataset contains the following columns:

  1. Year: The year of the data.
  2. Road: The railroad company.
  3. Reportable Injury Rate: The rate of reportable injuries.
  4. Highway-Rail Incident Rate: The rate of incidents at highway-rail crossings.

Each row seems to represent the annual data for a specific railroad company.

How would you like to proceed with this data? Do you have any specific analysis or questions in mind?

 

You
Can you comment on any trends you see? And can you extrapolate those trends into future years?
GPT
Data Analysis

To analyze the trends in the data, I’ll first examine the overall trend for the “Reportable Injury Rate” and the “Highway-Rail Incident Rate” across all railroad companies over the years. This will involve:

  1. Aggregating the data by year to see the overall trend.
  2. Plotting these trends to visualize changes over time.
  3. Commenting on any noticeable patterns or changes.

After that, I can attempt to extrapolate these trends into future years using a simple forecasting model. Let’s start with the trend analysis.

The plots above show the trends for both the “Reportable Injury Rate” and the “Highway-Rail Incident Rate” from the available data. Here are some observations:

  1. Reportable Injury Rate: There appears to be a general downward trend in the reportable injury rate over the years. This suggests improvements in safety or reporting practices within the railroad industry.
  2. Highway-Rail Incident Rate: The trend for the highway-rail incident rate is less clear. While there are fluctuations, there isn’t a strong consistent upward or downward trend over the years.

For extrapolation into future years, I’ll use a simple linear regression model. This model will be based on the assumption that the past trend continues into the future. However, it’s important to note that such extrapolations are speculative and may not account for unforeseen changes or non-linear trends.

I’ll extrapolate for the next 5 years. Let’s proceed with this analysis.

Based on the linear regression model, here are the extrapolated trends for the next five years:

Predicted Trends (2023 – 2027)

  1. Reportable Injury Rate:
    • The model predicts a continued decrease in the reportable injury rate. By 2027, it’s estimated to reach approximately 1.14, suggesting ongoing improvements in safety measures or reporting accuracy.
  2. Highway-Rail Incident Rate:
    • The model forecasts a slight increase in the highway-rail incident rate over the next five years. By 2027, it’s estimated to be around 4.05, indicating a potential rise in incidents at highway-rail crossings.

Important Considerations

  • These predictions are based on the assumption that past trends will continue in a linear fashion. Real-world factors, such as changes in industry practices, regulations, or unforeseen events, could significantly alter these trends.
  • The model does not account for potential non-linearities or complex patterns in the data. Thus, while useful for a general outlook, these predictions should be interpreted with caution.

Would you like to delve into any specific aspects of this analysis further, or is there anything else you’d like to explore with this data?

 

You
Provide a graph of the predicted trends along with the historical data.
GPT
Data Analysis

Here is the graph depicting both the historical data and the predicted trends for the next five years:

  1. Left Plot (Reportable Injury Rate): Shows a clear downward trend in the historical data, which the model expects to continue into the future.
  2. Right Plot (Highway-Rail Incident Rate): Illustrates the more fluctuating nature of the historical data. The model predicts a slight upward trend in the coming years.

These visualizations combine the actual past data with the linearly extrapolated future predictions, offering a comprehensive view of the trends over time.

If you need further analysis or have any other questions, feel free to ask!

 

You
Provide an analysis for each railroad company.
GPT
Data Analysis

To analyze the data for each railroad company, I will:

  1. Examine Trends: Look at the trends in the “Reportable Injury Rate” and the “Highway-Rail Incident Rate” for each railroad company over the years.
  2. Visualize Data: Create plots to visualize these trends.
  3. Comment on Differences: Note any significant differences or unique patterns among the different railroad companies.

This approach will help us understand how each company’s safety performance has changed over time and how they compare to each other. Let’s start with this analysis.

The plots above show the trends in the “Reportable Injury Rate” and the “Highway-Rail Incident Rate” for each railroad company over the years. Here are some observations:

  1. Variability Among Companies: There is noticeable variability in both the reportable injury rates and highway-rail incident rates among different railroad companies. This suggests differences in operational practices, safety measures, or geographic factors that could influence these rates.
  2. General Trends: Most companies show a general downward trend in reportable injury rates over the years, aligning with industry-wide improvements in safety. However, the highway-rail incident rates are more varied, with some companies showing increasing trends while others are more stable or decreasing.
  3. Company-Specific Patterns: Each company exhibits its own pattern over the years. These differences could be due to a variety of factors, including company size, the extent of their rail network, and the specific safety initiatives they have implemented.

This analysis provides a detailed view of how each company has performed in terms of safety over the years. If you need further analysis or have specific questions about any of these companies, feel free to ask!

You
Which railroad is a poor performer in terms of recent and future trends?
GPT
Data Analysis

To determine which railroad company is a poor performer in terms of both recent and future trends, we need to consider:

  1. Recent Performance: Examine the latest available data for each company to see who has the highest rates in both reportable injuries and highway-rail incidents.
  2. Future Trends: Utilize the linear regression model to predict future rates for each railroad company and identify which company is expected to have worsening or high rates in the coming years.

Let’s first identify the company with the highest rates in the most recent year from the dataset, and then proceed to predict the future trends for each company.

Based on the analysis:

Recent Performance (2022 Data)

  • Highest Reportable Injury Rate: Kansas City Southern (KCS) had the highest reportable injury rate in 2022, with a rate of 2.091.
  • Highest Highway-Rail Incident Rate: Canadian National Railway (CN) had the highest highway-rail incident rate in 2022, with a rate of 5.820.

Future Predictions (2027 Projection)

  • Future Reportable Injury Rate: Kansas City Southern (KCS) is predicted to still have a high reportable injury rate by 2027, with a rate of approximately 2.040.
  • Future Highway-Rail Incident Rate: CSX Transportation (CSX) is projected to have the highest highway-rail incident rate by 2027, with a rate of approximately 6.099.

Conclusion

  • Kansas City Southern (KCS) appears to be a poor performer both in terms of recent data and future projections for the reportable injury rate.
  • For the highway-rail incident rate, Canadian National Railway (CN) was the poorest performer in the recent data, but CSX Transportation (CSX) is projected to have the highest rate by 2027.

It’s important to consider that these are projections based on linear trends and actual future performance may be influenced by various factors not captured in this analysis.

If you need more detailed insights or have other questions, feel free to ask!

<end of chat session>

Observations

 

For basic summarization, the tool is adequate but elementary. It is thorough in that it tells you the steps it is going to take.

It got more interesting when we asked it to comment by railroad, and when we specifically asked it to name a poor performer. It picked KCS, which has since combined with CP, but that happened in 2023 and the data runs only through 2022.

As for the math behind the analysis, it appears that all it has at its disposal is linear regression, and that is not good enough in this situation.  Relationships are not always linear, and regression is not the best tool for understanding patterns over time.  We ourselves have done better with integrated time series / regression models or vector autoregression.  To its credit, it acknowledges at the very end that the model is simple.
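To make the point about linearity concrete, here is a minimal Python sketch. The rates are made up for illustration and are not FRA data, and the log-linear fit is only one simple alternative to a straight line, not the time series or vector autoregression models mentioned above. The helper `fit_line` is our own hypothetical function, not part of any library.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical injury rates (NOT FRA data) that fall quickly, then level off.
years = list(range(2013, 2023))
rates = [3.0, 2.4, 2.0, 1.7, 1.5, 1.35, 1.25, 1.18, 1.12, 1.08]
future_years = list(range(2023, 2033))

# Straight-line trend, similar in spirit to what the tool described doing.
intercept, slope = fit_line(years, rates)
linear = [intercept + slope * y for y in future_years]

# Straight line fitted to log(rate): the forecast decays but stays positive.
log_intercept, log_slope = fit_line(years, [math.log(r) for r in rates])
log_linear = [math.exp(log_intercept + log_slope * y) for y in future_years]

for year, lin, log_lin in zip(future_years, linear, log_linear):
    print(f"{year}: linear {lin:6.2f}   log-linear {log_lin:6.2f}")
```

With this made-up series, the straight line keeps falling at its historical average pace and eventually projects a negative injury rate, which is impossible. The log-linear fit at least stays above zero. Neither is a substitute for a proper time series model, but the contrast shows how much a forecast depends on the assumed shape of the trend.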

Furthermore, it seems unaware of the nuances of aggregating data. A simple average across railroads is not always what you want, since the railroads differ greatly in size. A naïve user may fall into a trap by relying on the model to perform all the steps an experienced analyst would.
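A small sketch shows why the choice of average matters. All figures below are hypothetical, not FRA data. We assume the injury rate is expressed per 200,000 employee-hours and that employee-hours are available as a measure of exposure. The spreadsheet described in the chat contained only the rates, so an analyst would need to bring in that exposure data separately.

```python
# Hypothetical figures for three railroads in a single year (NOT FRA data).
# Rates are assumed to be injuries per 200,000 employee-hours.
injury_rate = {"Railroad A": 1.0, "Railroad B": 1.2, "Railroad C": 3.0}
employee_hours = {"Railroad A": 80e6, "Railroad B": 60e6, "Railroad C": 5e6}

# Simple average: every railroad counts equally, regardless of size.
simple_mean = sum(injury_rate.values()) / len(injury_rate)

# Exposure-weighted average: equivalent to pooling all injuries and all hours.
total_hours = sum(employee_hours.values())
weighted_mean = sum(
    injury_rate[name] * employee_hours[name] for name in injury_rate
) / total_hours

print(f"simple mean:   {simple_mean:.2f}")
print(f"weighted mean: {weighted_mean:.2f}")
```

Here the smallest railroad has the worst rate, so the simple average overstates the industry-wide rate relative to the exposure-weighted figure. Which one is appropriate depends on the question: the simple average describes a typical railroad, while the weighted average describes the industry as a whole.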

So while the commentary is good, and even potentially insightful, it may not be based on the best numerical analysis.

That said, we feel this is a technology that will be very interesting to watch, especially the commentary.  And we recognize that better responses to this simple example could be obtained through better prompt engineering.

We are already working with some of our clients to integrate generative AI into their BI tools. These are small steps, proofs of concept, given the nascent and rapidly evolving nature of these technologies. Please check back here often for updates, or contact us to see how you might apply these technologies in your company.
