Goal

In this blog post, I will create some intriguing and colorful data graphics based on the NOAA climate data, including interactive graphs, boxplots, and scatterplots.

§1 Create a Database

Import the Data

First of all, I want to import three data sets that I am going to use for this blog post. Those three data sets are now in separate csv files, which are called "temps.csv", "station-metadata.csv", and "countries.csv". Pandas package is a powerful package for analyzing dataframes in Python that I want to use when importing those csv files.

# import pandas package
import pandas as pd
# import temps.csv into a DataFrame called temperature_data
temperature_data = pd.read_csv("temps.csv")
# import station-metadata.csv into a DataFrame called station_data
station_data = pd.read_csv("station-metadata.csv")
# import countries.csv into a DataFrame called country_data
country_data = pd.read_csv("countries.csv")

After importing the data sets, I want to check the first five rows of each DataFrame to be familiar with the information in it, especially paying attention to the column names. I am going to use df.head( ) to check the first five rows of each of the DataFrame.

temperature_data.head()
ID Year VALUE1 VALUE2 VALUE3 VALUE4 VALUE5 VALUE6 VALUE7 VALUE8 VALUE9 VALUE10 VALUE11 VALUE12
0 ACW00011604 1961 -89.0 236.0 472.0 773.0 1128.0 1599.0 1570.0 1481.0 1413.0 1174.0 510.0 -39.0
1 ACW00011604 1962 113.0 85.0 -154.0 635.0 908.0 1381.0 1510.0 1393.0 1163.0 994.0 323.0 -126.0
2 ACW00011604 1963 -713.0 -553.0 -99.0 541.0 1224.0 1627.0 1620.0 1596.0 1332.0 940.0 566.0 -108.0
3 ACW00011604 1964 62.0 -85.0 55.0 738.0 1219.0 1442.0 1506.0 1557.0 1221.0 788.0 546.0 112.0
4 ACW00011604 1965 44.0 -105.0 38.0 590.0 987.0 1500.0 1487.0 1477.0 1377.0 974.0 31.0 -178.0

Based on the partial DataFrame above, I notice that there are several columns inside temperature_data,
ID: the ID number of a station in the world
year: the year that a measurement was done
VALUE1 to VALUE12: the temperature of each month

station_data.head()
ID LATITUDE LONGITUDE STNELEV NAME
0 ACW00011604 57.7667 11.8667 18.0 SAVE
1 AE000041196 25.3330 55.5170 34.0 SHARJAH_INTER_AIRP
2 AEM00041184 25.6170 55.9330 31.0 RAS_AL_KHAIMAH_INTE
3 AEM00041194 25.2550 55.3640 10.4 DUBAI_INTL
4 AEM00041216 24.4300 54.4700 3.0 ABU_DHABI_BATEEN_AIR

Add DataFrame to the Database

I have all the data sets after importing them from the csv files. Then, I want to prepare them before adding them to a database.

Firstly, I want to modify temperature_data mainly by stacking and renaming since the temperature values of each month are in separate columns, but I want to have a column of each month with the temperature value.

Next, I want to also make connections between DataFrames so that at least there is one similar column between two DataFrames. It is obvious that both temperature_data and station_data have a column called "ID". Thus, "ID" column can be used as a connection between temperature_data and station_data. Since the first two letters in the value of the "ID" column in temperature_data contains the value of "FIPS 10-4" in country_data, I want to add a new "FIPS_code" column in temperature_data, similar information as "FIPS 10-4" in country_data but different column name, based on the first two letters of each value in "ID" column in temperature_data in order to build a connection between temperature_data and country_data.

In order to get the updated temperature_data I want, I want to define a function, called prepare_df, below.

def prepare_df(df):
	"""
	Prepare and update the DataFrame to have the proper the information with a few steps about stacking and renaming.

	Parameter
	----------
	df: the input DataFrame

	Return 
	----------
	df: a DataFrame that is updated and fully-prepared
	"""

	df = df.set_index(keys=["ID", "Year"]) # convert redundant columns into stack into a multi-index for the DataFrame
	df = df.stack() # stack all of the data values on top of each other
	df = df.reset_index() # get the ID, Year, and Name columns back and name the new columns as "level_2" and 0
	df = df.rename(columns = {"level_2"  : "Month" , 0 : "Temp"}) # rename the new columns
	df["Month"] = df["Month"].str[5:].astype(int) # update "Month" column with integer values
	df["Temp"]  = df["Temp"] / 100 # get units of degrees Celsius for the "Temp" column
	df["FIPS_code"] = df["ID"].str[0:2] # add a new "FIPS_code" column based on the first two letters of "ID"

	return(df)

Then, it is ready to use the prepare_df function and add all of the three DataFrames into a database. In order to create a new database, I want to import sqlite3 module that involves several uses of databases.

import sqlite3
conn = sqlite3.connect("tables.db") # create a database in current directory called "tables.db"

Now, there is a "tables.db" file in the current directory, waiting to be added those three DataFrames above. Next, I want to use "chunksize" in pdf.read_csv to return an iterator with parts of the data size equal to the value of the chunksize when querying it. By using a for loop and df.to_sql( ), each piece of "temps.csv" after using the prepare_df function will be added into the tables.db, having the name of "temperatures", without repetitions by using "if_exists" inside df.to_sql( ).

df_iter = pd.read_csv("temps.csv", chunksize = 100000)
for df in df_iter:
	df = prepare_df(df)
	df.to_sql("temperatures", conn, if_exists = "append", index = False)

By using the df.to_sql( ) again, station_data can also be added into "tables.db" with the name of "stations". Since station_data is small, I directly use df.to_sql( ) to add station_data, instead of adding it by chunks.

station_data.to_sql("stations", conn, if_exists = "replace", index = False)

Before adding country_data, I want to rename the column of "FIPS 10-4" to "FIPS_code", "ISO 3166" to "ISO_code", and the column of "Name" to "Country" to make the process of adding the dataframe into a database easier and make the later plotting easier.

country_data = country_data.rename({"FIPS 10-4": "FIPS_code", "ISO 3166": "ISO_code", "Name" : "Country"}, axis = 1) # rename the columns
									country_data.head()
FIPS_code ISO_code Country
0 AF AF Afghanistan
1 AX - Akrotiri
2 AL AL Albania
3 AG DZ Algeria
4 AQ AS American Samoa

After having the updated country_data I want, I am also going to directly add the small country_data with the name of "countries" into "tables.db" by using df.to_sql( ).

country_data.to_sql("countries", conn, if_exists = "replace", index = False)

Finally, I have finished adding those three DataFrames into "tables.db". I can check whether the database, "tables.db", contains three DataFrames below by using cursor to execute SQL commands.

cursor = conn.cursor() # interact with the database
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'") # execute a SQL command about showing the entries in the name column of the sqlite_master table such that the entry in the type column of sqlite_master is equal to table
print(cursor.fetchall()) # print the returned list of all the items by the query
[('temperatures',), ('stations',), ('countries',)]

Based on the above output, we can see that there are exactly three tables called "temperatures", "stations", and "countries". Next, I want to use another SQL command to show the column names of each table.

# execute a SQL command with a for loop for cursor.fetchall() to get the column names in all of the tables in the database
cursor.execute("SELECT sql FROM sqlite_master WHERE type='table';")
for result in cursor.fetchall():
	print(result[0])
CREATE TABLE "temperatures" (
"ID" TEXT,
	"Year" INTEGER,
	"Month" INTEGER,
	"Temp" REAL,
	"FIPS_code" TEXT
)
CREATE TABLE "stations" (
"ID" TEXT,
	"LATITUDE" REAL,
	"LONGITUDE" REAL,
	"STNELEV" REAL,
	"NAME" TEXT
)
CREATE TABLE "countries" (
"FIPS_code" TEXT,
	"ISO_code" TEXT,
	"Country" TEXT
)

Overall, we can find out that all of the three DataFrames with the correct columns have been successfully added into the database called "tables.db". Finally, I want to close my database connection.

conn.close() # close the database connection

§2. Write a Query Function

I want to write a query function called query_climate_database that is based on the created database, "tables.db", above. Before writing the query function, I want to explain the brief steps to write that functions:

1. To make the command more readable and clearer, we are going to write the command into multiple lines and make the stations table with an alias S, the temperatures table with an alias T, and the countries table with an alias C.

2. We need to make connections among the tables. Since T and C both have the "FIPS_code" column and T and S have the "ID" column, we will use a LEFT JOIN command to make that common column of T corresponds to the same column of C and use a LEFT JOIN command to make the "ID" column of T corresponds to the same column of S.

3. At the beginning of the command, we need to use a SELECT to choose the columns we want for the returned DataFrame. At the end of the command, we need to use a WHERE to add filtering conditions for the data.

According to the steps above, I am going to write the query_climate_database function with an example of calling the function below.

def query_climate_database(country, year_begin, year_end, month):
	"""
	Create a Pandas DataFrame that contains the average temperature at each station between the input year interval with the station's name in the input country and with the station's latitude and longitude.

	Parameters
	----------
	country: the input string about the name of a country for which should be returned
	year_begin: an input integer about the earliest year of the time interval for which should be returned
	year_end: an input integer about the latest year of the time interval for which should be returned
	month: an input integer about the month of the year for which should be returned

	Return 
	----------
	df: a Pandas DataFrame has the temperature readings with the columns of the station name, station's latitude, station's longitude, country name, year, month, and temperature
	"""

	conn = sqlite3.connect("tables.db") # open the database connection

	# S is the abbreviation of stations in tables.db
	# C is the abbreviation of countries in tables.db
	# T is the abbreviation of temperatures in tables.db
	# S.name, S.latitude, S.longitude, C.country, T.year, T.month, T.temp are the columns that the returned dataframe should contain
	# left join C with the "FIPS_code" column in both T and C and left join S with the "ID" column in both T and S to make connections among three tables
	# make conditions based on the input parameters about the value of "Country" column in C, "Year" in T, and "Month" in T
	cmd = \
	"""
	SELECT S.name, S.latitude, S.longitude, C.country, T.year, T.month, T.temp
	FROM temperatures T
	LEFT JOIN countries C on T."fips_code" = C."fips_code"
	LEFT JOIN stations S on T.id = S.id
	WHERE C.country == ?
	AND T.year BETWEEN ? AND ?
	AND T.month == ?
	"""

	# add the values of the parameters by params and make connections between the command cmd and the database connection conn by using pd.read_sql_query( )
	df = pd.read_sql_query(cmd, conn, params = [country, year_begin, year_end, month])

	conn.close() # close the database connection

	return df
query_climate_database(country = "India",
year_begin = 1980,
year_end = 2020,
month = 1)
NAME LATITUDE LONGITUDE Country Year Month Temp
0 PBO_ANANTAPUR 14.583 77.633 India 1980 1 23.48
1 PBO_ANANTAPUR 14.583 77.633 India 1981 1 24.57
2 PBO_ANANTAPUR 14.583 77.633 India 1982 1 24.19
3 PBO_ANANTAPUR 14.583 77.633 India 1983 1 23.51
4 PBO_ANANTAPUR 14.583 77.633 India 1984 1 24.81
... ... ... ... ... ... ... ...
3147 DARJEELING 27.050 88.270 India 1983 1 5.10
3148 DARJEELING 27.050 88.270 India 1986 1 6.90
3149 DARJEELING 27.050 88.270 India 1994 1 8.10
3150 DARJEELING 27.050 88.270 India 1995 1 5.60
3151 DARJEELING 27.050 88.270 India 1997 1 5.70

3152 rows × 7 columns

The output by calling the query_climate_database function looks correct, containing the columns we want in a Pandas DataFrame.

§3. Write a Geographic Scatter Function for Yearly Temperature Increases

In this section, I am going to write a geographic scatter function, called temperature_coefficient_plot, to make visualizations based on the created database in section one and the query_climate_database function in section two to answer the following question:

How does the average yearly change in temperature vary within a given country?

I want to import the packages I am going to use in the temperature_coefficient_plot function first.

# import the packages I will use in this part
from sklearn.linear_model import LinearRegression # find the average annually change
from plotly import express as px # plot interactive graphs
import calendar # transform number into words about month

# import the packges for saving the interactive output file from the plotly tutorial
import plotly.io as pio
import plotly.graph_objects as go
from plotly.io import write_html # save the output interactive graphs by plotly as html files

Since I want to find the average yearly change in temperature, I can write a function by using the Linear Regression Model; by fitting each pair of data of "Year" and "Temp" of a station into a new linear regression model with the help of df.groupby( ).apply( ), I can find the average yearly change in temperature by computing the first coefficient of a linear regression model output.

def coef(data_group):
"""
Compute the first coefficient of a linear regression model by using "Year" as the predictor variable and "Temp" as the target variable.

Parameter
----------
data_group: a pair of data consisting of the values of "Year" and "Temp"

Return 
----------
LR.coef_[0]: the first coefficient of a linear regression model
"""

x = data_group[["Year"]] # predictor variable
y = data_group["Temp"] # target variable
LR = LinearRegression() # a new linear regression model
LR.fit(x, y) # fit the linear regression model
return LR.coef_[0] # get the only first coefficient of the linear regression model

Noticing that the input value of the month and the values of the "Month" column in the output DataFrame by the query_climate_database function are in integers. I want to automatically transform those integers, representing months, into words for the title of my interactive graph. Thus, I want to use the calendar module to transform. I am going to use the examples below to show how to use calendar package to transform.

# transform integer, 1, to January
calendar.month_name[1]
'January'
# transform integer, 5, to May
calendar.month_name[5]
'May'

Then, I am ready to write the temperature_coefficient_plot function below.

def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
	"""
	Create an interactive geographic scatterplot. By using Plotly Express, there should be a point for each station on the scatterplot, where color of the point reflects an estimate of the yearly change in temperature during the specified month and time period at that station.

	Parameters
	----------
	country: the input string about the name of a country
	year_begin: an input integer about the earliest year of the time interval
	year_end: an input integer about the latest year of the time interval
	month: an input integer about the month of the year
	min_obs: the minimum required number of years of data for any given station
	**kwargs: additional keyword arguments passed to px.scatter_mapbox() which can be used to control the colormap used, the mapbox style, etc

	Return 
	----------
	fig: an interactive geographic scatterplot by Plotly Express with a point for each station
	"""

	df = query_climate_database(country, year_begin, year_end, month) # use the query_climate_database function created in section two

	eligible_stations = df.groupby(["NAME"])["Year"].transform(len) # use df.groupby().transform() to find the number of years each station has done temperature measurements in the specified month
	df = df[eligible_stations >= min_obs] # keep data for stations with at least min_obs years worth of data in the specified month and filter out the others

	df_update = df.groupby(["NAME", "LATITUDE", "LONGITUDE"]).apply(coef) # find the estimate of the yearly change in temperature
	df_update = df_update.reset_index()
	df_update = df_update.rename({0: "Estimated Yearly Increase (°C)"}, axis = 1) # rename columns
	df_update["Estimated Yearly Increase (°C)"] = df_update["Estimated Yearly Increase (°C)"].round(decimals = 4) # round values of estimate of the yearly change in temperature

	# create an interactive geographic scatterplot by Plotly Express with the information of "LATITUDE", "LONGITUDE", "NAME", and "Estimated Yearly Increase (°C)" in estimates
	fig = px.scatter_mapbox(df_update, 
							lat = "LATITUDE",
							lon = "LONGITUDE",
							hover_name = "NAME",
							color = "Estimated Yearly Increase (°C)",
							color_continuous_midpoint = 0,
							title = "Estimates of yearly increase in temperature in 
" + calendar.month_name[month] + f" for stations in {country} years {year_begin} - {year_end}", **kwargs) # add title for the colorbar fig.update_layout(coloraxis_colorbar = dict(title = "Estimated Yearly
Increase (°C)")) return fig
color_map = px.colors.diverging.RdGy_r # choose a colormap

fig = temperature_coefficient_plot("India", 1980, 2020, 1, 
					min_obs = 10,
					zoom = 2,
					mapbox_style="carto-positron",
					color_continuous_scale=color_map)

fig.show() # show the figure
write_html(fig, "India_Yearly_Temperature_Increase.html") # save the output in a html file

Thus, we have successfully created an interactive geographic scatterplot for stations in India in January from 1980 to 2020. Based on the points on the graph, there are more light red points than grey points, especially for the stations near the Indian Ocean. From the colorbar, we can see that there are many stations that have some yearly increases in temperature in January in India from 1980 to 2020. There are also some stations, which are mostly located in the middle of India, experience a negative estimated yearly increase. There are even a few very dark points, such as the LUDHIANA station. All of these points show that the temperature increases per year from 1980 to 2020, which may be the negative effect of global warming.

Besides, we can also try to call the temperature_coefficient_plot function with different inputs and different colormap, shown below.

color_map = px.colors.diverging.PuOr # choose a colormap

fig = temperature_coefficient_plot("Brazil", 1980, 2000, 7, 
					min_obs = 10,
					zoom = 2,
					mapbox_style="carto-positron",
					color_continuous_scale=color_map)

fig.show() # show the figure
write_html(fig, "Brazil_Yearly_Temperature_Increase.html") # save the output in a html file

Similarly, we can easily observe that there are many colorful points shown on the interactive geographic scatterplot about the stations in Brazil in July from 1980 to 2020. The purple points shown on the plot reveal the yearly increase in temperature. And there are also few points in very dark purple, which represent the large increase in estimated yearly temperature. Similarly, most of the dark purple points are near the Atlantic Ocean.

Overall, we use the temperature_coefficient_plot function and found out that most of the stations in different countries are likely to experience a yearly increase in temperature in these years. The stations that are more close to the ocean are more likely to have a greater yearly increase in temperature. It is possible that global warming is the main factor that causes such an increase in temperature per year globally.

§4. Create Two More Interesting Figures

In this section, I want to use Plotly Express to create another two interesting interactive figures for this section to address two more questions, respectively.

First Question With Box Plots

The first question is

How does the average temperature vary between two given countries in each month of a specified year interval?

In order to answer this question, I want to make a comparison of the average temperature of all stations in two countries in each month in a given year interval by making a box plot. By using the box plot, I will be able to get many numerical results, such as the mean values and the max values of the average temperatures from all the stations in a country.

def query_climate_two_countries(country1, country2, year_begin, year_end):
	"""
	Create a Pandas DataFrame that contains the average temperature at each station in a specified year interval in two input countries.

	Parameters
	----------
	country1: the input string about the name of the first country for which should be returned
	country2: the input string about the name of the second country for which should be returned
	year_begin: an input integer about the earliest year of the time interval
	year_end: an input integer about the latest year of the time interval

	Return 
	----------
	df: a Pandas DataFrame contains the names of the stations, the names of the countries, year, month, and the temperature
	"""

	conn = sqlite3.connect("tables.db") # open the database connection

	# S is the abbreviation of stations in tables.db
	# C is the abbreviation of countries in tables.db
	# T is the abbreviation of temperatures in tables.db
	# S.name, C.country, T.year, T.month, T.temp are the columns that the returned dataframe should contain
	# left join C with the "FIPS_code" column in both T and C and left join S with the "ID" column in both T and S to make connections among three tables
	# make conditions based on the input parameters about the value of "Country" column in C and the value of "Year" in T
	cmd = \
	"""
	SELECT S.name, C.country, T.year, T.month, T.temp
	FROM temperatures T
	LEFT JOIN countries C on T."fips_code" = C."fips_code"
	LEFT JOIN stations S on T.id = S.id
	WHERE (C.country == ? OR C.country == ?)
	AND T.year BETWEEN ? AND ?
	"""

	# add the values of the parameters by params and make connections between the command cmd and the database connection conn by using pd.read_sql_query( )
	df = pd.read_sql_query(cmd, conn, params = [country1, country2, year_begin, year_end])

	conn.close() # close the database connection

	return df
query_climate_two_countries("Kenya", "Iceland", 2000, 2020)
NAME Country Year Month Temp
0 GUFUSKALAR Iceland 2000 1 1.96
1 GUFUSKALAR Iceland 2000 2 1.73
2 GUFUSKALAR Iceland 2000 3 2.81
3 GUFUSKALAR Iceland 2000 4 1.94
4 GUFUSKALAR Iceland 2000 5 5.51
... ... ... ... ... ...
5274 KAKAMEGA Kenya 2004 7 19.80
5275 KAKAMEGA Kenya 2004 8 20.65
5276 KAKAMEGA Kenya 2004 9 20.60
5277 KAKAMEGA Kenya 2004 11 20.80
5278 KAKAMEGA Kenya 2004 12 21.30

5279 rows × 5 columns

query_climate_two_countries("Kenya", "Brazil", 2000, 2020)
NAME Country Year Month Temp
0 CRUZEIRO_DO_SUL Brazil 2000 1 26.21
1 CRUZEIRO_DO_SUL Brazil 2000 2 26.09
2 CRUZEIRO_DO_SUL Brazil 2000 3 26.39
3 CRUZEIRO_DO_SUL Brazil 2000 4 25.93
4 CRUZEIRO_DO_SUL Brazil 2000 5 25.63
... ... ... ... ... ...
45996 KAKAMEGA Kenya 2004 7 19.80
45997 KAKAMEGA Kenya 2004 8 20.65
45998 KAKAMEGA Kenya 2004 9 20.60
45999 KAKAMEGA Kenya 2004 11 20.80
46000 KAKAMEGA Kenya 2004 12 21.30

46001 rows × 5 columns

Based on the columns in the above outputs, we can have the correct dataframe we want by using the query_climate_two_countries function. Similar to section three, we also need to write a function to plot the interactive figure by using Plotly Express.

def temperature_coefficient_box(country1, country2, year_begin, year_end, min_obs, **kwargs):
	"""
	Create an interactive box plot by using Plotly Express; there should be two box plots where each box plot is about the average temperature of all the stations in all 12 months of a specified year interval in each country.

	Parameters
	----------
	country1: the input string about the name of the first country
	country2: the input string about the name of the second country
	year_begin: an input integer about the earliest year of the time interval
	year_end: an input integer about the latest year of the time interval
	min_obs: the minimum required number of years of data for any given station
	**kwargs: additional keyword arguments passed to px.box()

	Return 
	----------
	fig: an interactive box plot by Plotly Express consisting of a box plot for each country
	"""

	df = query_climate_two_countries(country1, country2, year_begin, year_end) # use the query_climate_two_countries function created above

	eligible_stations = df.groupby(["NAME"])["Month"].transform(len) # use df.groupby().transform() to find the number of months each station has done temperature measurements
	df = df[eligible_stations >= min_obs] # keep data for stations with at least min_obs months worth of data in the specified year and filter out the others

	df_update = df.groupby(["NAME", "Country", "Month"])["Temp"].mean() # calculate the mean temperature of each station during the year interval in each month
	df_update = df_update.reset_index()
	df_update["Temp"] = df_update["Temp"].round(decimals = 4) # round the decimals of "Temp"
	df_update["Month"] = df_update["Month"].apply(lambda t: calendar.month_name[t]) # convert month in integers into month in words

	# make a box plot by using Plotly Express with the information of "Month", "Temp", and "Country" in df_update
	fig = px.box(df_update,
					x = "Month",
					y = "Temp",
					color = "Country",
					hover_name = "Country",
					hover_data = {"Month"},
					labels = {"Temp": "Temperature (°C)"},
					title = f"Monthly Average Temperature Comparison Between {country1} and {country2}
in years {year_begin} - {year_end}", **kwargs) return fig
pio.templates.default = "seaborn" # choose a theme

# remove the outliers
fig = temperature_coefficient_box("Kenya", "Iceland", 2000, 2020, 
									min_obs = 5,
									points = False)

fig.show()
write_html(fig, "Kenya_and_Iceland_Comparison.html") # save the output in a html file

The above interactive figure shows that Iceland and Kenya have very different average temperatures throughout the years 2000 to 2020 since the boxes of those two countries are at different levels. Overall, Iceland has a very lower temperature than Kenya on average.

pio.templates.default = "seaborn" # choose a theme

# remove the outliers
fig = temperature_coefficient_box("Kenya", "Brazil", 2000, 2020,
									min_obs = 5,
									points = False)

fig.show()
write_html(fig, "Kenya_and_Brazil_Comparison.html") # save the output in a html file

The above interactive figure shows that the average temperatures in Brazil and Kenya are similar throughout the years 2000 to 2020 since the boxes are nearly at the same levels.

The interactive figures created above are reasonable. Based on the first figure between Iceland and Kenya, since these two countries are far away from each other, the average temperatures for each month in a year are very different from each other. Differently, the second figure about Brazil and Kenya shows that the average temperatures for each month in a year are very close to each other because Brazil and Kenya are very close geographically. Overall, the interactive figures above show that how the average temperature varies between two given countries in each month of a specified year interval may depend on the locations of the two countries. If two countries are close geographically, their average temperatures in the same month of a year interval might be similar to each other. Oppositely, the average temperatures of the two countries in the same month of a year interval are likely to be different if their geographic locations are not close.

Second Question With a 3D Scatterplot

The second question is

How does the temperature variation vary among three given countries in the same month in several years?

In order to answer this question, I want to make a comparison of the standard deviation of temperatures of all stations in three countries in the same month in several years by making a 3D scatterplot since the standard deviation of temperatures can be used to show the temperature variation.

Similar to the process in the previous parts, I want to first define a query function to create a relevant DataFrame for making the interactive figures.

def query_climate_three_countries(country1, country2, country3, year_begin, year_end, month):
	"""
	Create a Pandas DataFrame that contains the average temperature at each station between the input year interval with the station's name in the input country and with the station's latitude and longitude.

	Parameters
	----------
	country1: the input string about the name of the first country for which should be returned
	country2: the input string about the name of the second country for which should be returned
	country3: the input string about the name of the third country for which should be returned
	year_begin: an input integer about the earliest year of the time interval for which should be returned
	year_end: an input integer about the latest year of the time interval for which should be returned
	month: an input integer about the month of the year for which should be returned

	Return 
	----------
	df: a Pandas DataFrame has the temperature readings with the columns of the station name, station's latitude, station's longitude, names of the three countries, year, month, and temperature
	"""

	conn = sqlite3.connect("tables.db") # open the database connection

	# S is the abbreviation of stations in tables.db
	# C is the abbreviation of countries in tables.db
	# T is the abbreviation of temperatures in tables.db
	# S.name, S.latitude, S.longitude, C.country, T.year, T.month, T.temp are the columns that the returned dataframe should contain
	# left join C with the "FIPS_code" column in both T and C and left join S with the "ID" column in both T and S to make connections among three tables
	# make conditions based on the input parameters about the value of "Country" column in C, "Year" in T, and "Month" in T
	cmd = \
	"""
	SELECT S.name, S.latitude, S.longitude, C.country, T.year, T.month, T.temp
	FROM temperatures T
	LEFT JOIN countries C on T."fips_code" = C."fips_code"
	LEFT JOIN stations S on T.id = S.id
	WHERE (C.country == ? OR C.country == ? OR C.country == ?)
	AND T.year BETWEEN ? AND ?
	AND T.month == ?
	"""

	# add the values of the parameters by params and make connections between the command cmd and the database connection conn by using pd.read_sql_query( )
	df = pd.read_sql_query(cmd, conn, params = [country1, country2, country3, year_begin, year_end, month])

	conn.close() # close the database connection

	return df
query_climate_three_countries("Norway", "Russia", "Finland", 2000, 2020, 1)
NAME LATITUDE LONGITUDE Country Year Month Temp
0 HELSINKI_KAISANIEMI_AWS 60.1667 24.950 Finland 2000 1 -2.35
1 HELSINKI_KAISANIEMI_AWS 60.1667 24.950 Finland 2001 1 -1.02
2 HELSINKI_KAISANIEMI_AWS 60.1667 24.950 Finland 2002 1 -2.90
3 HELSINKI_KAISANIEMI_AWS 60.1667 24.950 Finland 2003 1 -8.49
4 HELSINKI_KAISANIEMI_AWS 60.1667 24.950 Finland 2004 1 -5.63
... ... ... ... ... ... ... ...
22324 TEGJULTJE_TERDE 60.1670 130.199 Russia 2009 1 -37.05
22325 TEGJULTJE_TERDE 60.1670 130.199 Russia 2010 1 -36.87
22326 TEGJULTJE_TERDE 60.1670 130.199 Russia 2011 1 -33.90
22327 TEGJULTJE_TERDE 60.1670 130.199 Russia 2012 1 -34.90
22328 TEGJULTJE_TERDE 60.1670 130.199 Russia 2013 1 -42.12

22329 rows × 7 columns

query_climate_three_countries("Indonesia", "Brazil", "Colombia", 2000, 2020, 1)
NAME LATITUDE LONGITUDE Country Year Month Temp
0 CRUZEIRO_DO_SUL -7.633 -72.667 Brazil 2000 1 26.21
1 CRUZEIRO_DO_SUL -7.633 -72.667 Brazil 2001 1 26.12
2 CRUZEIRO_DO_SUL -7.633 -72.667 Brazil 2003 1 26.32
3 CRUZEIRO_DO_SUL -7.633 -72.667 Brazil 2004 1 26.92
4 CRUZEIRO_DO_SUL -7.633 -72.667 Brazil 2005 1 26.57
... ... ... ... ... ... ... ...
5954 CILACAP_INDONESIA -7.700 109.000 Indonesia 2008 1 27.60
5955 CILACAP_INDONESIA -7.700 109.000 Indonesia 2013 1 27.80
5956 CILACAP_INDONESIA -7.700 109.000 Indonesia 2014 1 27.10
5957 CILACAP_INDONESIA -7.700 109.000 Indonesia 2015 1 27.50
5958 JAMBI_PAALMERAH -1.600 103.700 Indonesia 2015 1 26.60

5959 rows × 7 columns

The DataFrames above contain the correct columns we want. Then, I want to define a temperature_coefficient_scatter function to plot an interactive 3D scatterplot.

import numpy as np
def temperature_coefficient_scatter(country1, country2, country3, year_begin, year_end, month, min_obs, **kwargs):
	"""
	Create an interactive 3D scatterplot by using Plotly Express; there should be a 3D scatterplot for all the three input countries where the point of a certain color is the standard deviation of the temperatures of all the stations in a specified month in several years.
	
	Parameters
	----------
	country1: the input string about the name of the first country
	country2: the input string about the name of the second country
	country3: the input string about the name of the third country
	year_begin: an input integer about the earliest year of the time interval
	year_end: an input integer about the latest year of the time interval
	month: an input integer about the month of the year
	min_obs: the minimum required number of years of data for any given station
	**kwargs: additional keyword arguments passed to px.scatter()
	
	Return 
	----------
	fig: an interactive 3D scatterplot by Plotly Express of three countries in different colors
	"""
	
	df = query_climate_three_countries(country1, country2, country3, year_begin, year_end, month) # use the query_climate_three_countries function created above
	
	eligible_stations = df.groupby(["NAME"])["Year"].transform(len) # use df.groupby().transform() to find the number of years each station has done temperature measurements
	df = df[eligible_stations >= min_obs] # keep data for stations with at least min_obs years worth of data in the specified month and filter out the others
	
	df_update = df.groupby(["NAME", "Country", "LATITUDE", "LONGITUDE"])["Temp"].std() # calculate standard deviation
	df_update = df_update.reset_index()
	df_update["Temp"] = df_update["Temp"].round(decimals = 4) # round the decimals of "Temp"
	df_update = df_update.rename({"NAME": "Station Name"}, axis = 1) # rename the column
	
	# make a 3D scatterplot by using Plotly Express with the information of "LATITUDE", "Temp", "Country", "Station Name", and "LONGITUDE" in df_update
	fig = px.scatter_3d(df_update, 
						x = "LATITUDE", 
						y = "LONGITUDE",
						z = "Temp",
						color = "Country",
						hover_name = "Station Name",
						hover_data = ["LONGITUDE"],
						labels = {"Temp": "Standard Deviation", "LATITUDE": "Latitude", "LONGITUDE": "Longitude"},
						title = f"Standard Deviation of Temperatures in " + calendar.month_name[month] + f" in years {year_begin} - {year_end} 
in {country1}, {country2}, and {country3}", **kwargs) fig.update_traces(marker = dict(size = 3)) return fig
pio.templates.default = "plotly_white" # choose a theme

fig = temperature_coefficient_scatter("Norway", "Russia", "Finland", 2000, 2020, 1, 10,
								width = 1000,
								height = 500,
								opacity = 0.3)

fig.update_layout(margin={"r":50,"t":50,"l":50,"b":50}) # set the margin
fig.update_layout(legend = dict(yanchor = "top", y = 0.7, xanchor = "left", x = 0.01)) # set the position of the legend
fig.show()

write_html(fig, "Norway_Russia_Finland_Comparison.html") # save the output in a html file

The above 3D scatterplot shows that there are many stations in Russia, Norway, and Finland since there are many colorful points. Besides that, the points of each country are spreading out and have large standard deviations. These observation results show that the temperature variation is large in Russia, Norway, and Finland, which means that there is a large change in temperature in each country in years 2000 to 2020.

pio.templates.default = "plotly_white" # choose a theme

fig = temperature_coefficient_scatter("Indonesia", "Brazil", "Colombia", 2000, 2020, 1, 10,
								width = 1000,
								height = 500,
								opacity = 0.3)

fig.update_layout(margin={"r":50,"t":50,"l":50,"b":50}) # set the margin
fig.update_layout(legend = dict(yanchor = "top", y = 0.7, xanchor = "left", x = 0.01)) # set the position of the legend
fig.show()

write_html(fig, "Indonesia_Brazil_Colombia_Comparison.html") # save the output in a html file

The above interactive 3D scatterplot shows that there are fewer stations in Brazil, Indonesia, and Colombia, comparing to the first interactive 3D scatterplot since the points are fewer. In addition, the standard deviations of temperatures are quite small, representing a small change in temperature in the years 2000 to 2020 in these countries.

In general, the first plot shows that stations in Russia, Norway, and Finland have quite large temperature variations due to the large standard deviations. Differently, the second plot shows that stations in Brazil, Indonesia, and Colombia have quite small temperature variations due to the small standard deviations. The above two plots also show that the countries in similar latitude usually may have a similar standard deviation of temperatures, no matter what their longitudes are. The above information read from the plots shows that countries near the equator usually have smaller temperature variation, while the countries far away from the equator usually have larger temperature variation. This is likely to be true since the temperature of the countries near the equator tends to be higher, and therefore, the temperature will be less likely to increase, which may be caused by global warming since the temperature in those countries has already been quite high.