So far we have done scripts using data we have coded ourselves, like creating a list in Visual Studio and putting some elements in it. However, this is neither practical nor realistic in the real world where databases and files exist. In a later post we will explore databases and how to connect to them using Python but for now let’s use real world data from online sources.
We will explore a dataset from the United Nations on imports/exports by Country. Open this link and search for “trade of goods”. A window will open with the results of the search, now click on:
Click on download, you want value separated, comma
This will download the file in a zip format. Unzip it to a custom folder in your C drive, call the folder “tradeDataSet”. Once the file is in there rename it to just trade, it should look like this:
Now that we have the file in our computer, it is time to create a new Visual Studio Project, call the project “importDataFromCSV”. Check this post if you don’t remember how to create a new project. Also remember to change the environment to honeyBadger.
All right, since we want to do some data analysis on this data, we can use a library called Pandas. Using the instructions in this post (go to the Anaconda Navigator section), install this library using Anaconda Navigator. You want just that: Pandas.
Your Visual Studio environment should look like this:
Now let’s import this library into our project
import pandas as pd
Now every time we need something from this library we can use the alias pd instead of the full name of pandas.
Pandas, has an incredibly easy way to read a CSV file, you can read about it here.
So in our code we can just do the following:
import pandas as pd file_name='C:\\tradeDataSet\\trade.csv' trade=pd.read_csv(file_name) #the head method prints the first 5 rows of the dataframe print(trade.head())
The dataset consists of 9 columns and we can see what the data is about, all commodities either import or exports and their value in US dollars. That’s interesting so far but what else can we learn about the data?
We can use the describe method to get some insights on the numerical columns of our dataframe
import pandas as pd file_name='C:\\tradeDataSet\\trade.csv' trade=pd.read_csv(file_name) #the head method prints the first 5 rows of the dataframe #print(trade.head()) print(trade.describe())
Describe gives us some basic statistics of the numerical columns, we can however see that applying this to Year is useless and there seems to be way too many zeros in Weight and Quantity so these 2 columns are little value for statistical analysis.
Let’s find out which countries are the biggest importers. This means we want to slice the dataframe by the Flow column on the value Import. However, hold on. We see in the top 5 rows that for this field there are 2 values: Import and Export. Are there any other values?
Let’s explore that
import pandas as pd file_name='C:\\tradeDataSet\\trade.csv' trade=pd.read_csv(file_name) #the head method prints the first 5 rows of the dataframe #print(trade.head()) #print(trade.describe()) print (trade.groupby(['Flow']).describe())
We now see that this field has other values: Re-Import and Re-Export
And according to the glossary of the UN DB where we obtained the data, Re-Imports should be accounted for in statistics of Importing goods. So let’s get the Imports and Re-Imports:
import pandas as pd file_name='C:\\tradeDataSet\\trade.csv' trade=pd.read_csv(file_name) #the head method prints the first 5 rows of the dataframe #print(trade.head()) #print(trade.describe()) #print (trade.groupby(['Flow']).describe()) valuesWeWant=['Import','Re-Import'] tradeImports=trade[trade['Flow'].isin(valuesWeWant)] print (tradeImports.groupby(['Flow']).describe())
Line 8: Here we created a list with the row values we want from the dataset in the column Flow.
Line 9: Here we create a new dataframe called tradeImports. Then in this statement
].isin(valuesWeWant) we “slice” the original trade dataframe in the column Flow using the valuesWeWant list.
Line 10: We print our new dataframe but first we use the method groupby. This aggregates the dataframe by the column Flow.
This produces only our new data frame with just Import and Re-Import.
Let’s check our new dataset tradeImports getting the first 10 records.
Now, how can we get the top countries that have the most Trade (USD)?
print(tradeImports.groupby('Country or Area').sum().sort_values(by=['Trade (USD)'], ascending=False))
Whoa! So what is this statement?? Let’s dissect it.
Pandas allows you to concatenate operations on a dataframe, in this case the first operation is our groupby that we already used but now we use the Country or Area column. Then we sum the results of this and finally we sort the results by the Trade (USD) column in descending order.
This displays the top countries by trade which is what we wanted. We should note however that this is the total summation of the dataset, it does not account for changes over the year (the year column has been added as well).