Why would anyone use Python pandas?

Data Science in Python - Pandas (Part 3)

After my colleague Marvin introduced the NumPy library in his article, this STATWORX blog post will be about the pandas library. Pandas is based to a large extent on NumPy, but offers a simple way of reading and manipulating data in Python, especially for a newcomer to the data science field. If you understand how NumPy works, you won't have any problems with pandas.

Overview of the data structures in pandas

Before we go into the different ways to work with pandas, we would like to give a brief introduction to the different structures that our data can have in pandas. There are three different options:

  • Series (1-dimensional)
  • DataFrame (2-dimensional)
  • Panel (3-dimensional)

If, for example, time series data are available, it can be advantageous to use only a single Series. In practice we are mostly concerned with DataFrames because they correspond to a typical table or matrix. DataFrames unite several Series, so that we can represent more than just one data strand. The size of a DataFrame can be changed, although all columns must of course be of the same length. We will primarily work with DataFrames in this article.

Panels add another dimension. You can think of a Panel as a collection of 2D DataFrames, although you cannot really represent it visually in this arrangement. Note that Panel has been deprecated and is no longer available in newer pandas releases.
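The relationship between a Series and a DataFrame can be sketched as follows. This is only a minimal illustration; the variable names and the values are made up for this example and are not taken from the Titanic data.

```python
import pandas as pd

# A Series is a one-dimensional, labelled array (toy values for illustration).
ages = pd.Series([22.0, 38.0, 26.0], name="age")
fares = pd.Series([7.25, 71.28, 7.93], name="fare")

# A DataFrame combines several Series of equal length as its columns.
people = pd.concat([ages, fares], axis=1)

print(ages.ndim)     # number of dimensions of the Series
print(people.ndim)   # number of dimensions of the DataFrame
print(people.shape)  # (rows, columns)
```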

Installation and creation of a DataFrame

But enough of the theory! Pandas can be installed via pip or conda like any other Python library. It is then usually imported under the abbreviation pd. This abbreviation is very common and immediately tells every data scientist that the respective script is working with pandas.

Before we go into how you can read data formats with pandas, let's briefly show how you can create a DataFrame yourself. A DataFrame consists of two building blocks:

  1. Data for the respective columns
  2. Column names

The easiest way to create a DataFrame is to pass the data in the form of an array and specify the column names via the columns parameter.
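A minimal sketch of this approach, assuming a small NumPy array with made-up numbers and the hypothetical column names a, b and c:

```python
import numpy as np
import pandas as pd

# Building block 1: the data, here a 2x3 array of toy values.
data = np.array([[1, 2, 3], [4, 5, 6]])

# Building block 2: the column names, passed via the columns parameter.
df_from_array = pd.DataFrame(data, columns=["a", "b", "c"])

print(df_from_array)
```

Each row of the array becomes one row of the DataFrame, so the number of column names has to match the number of array columns.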

Alternatively, it is also possible to create a DataFrame from a dictionary. In this case, the column names are automatically taken from the keys of the dictionary.
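The dictionary variant might look like this. Again, the keys and values are toy data chosen for illustration only:

```python
import pandas as pd

# Each key becomes a column name, each list becomes the column's values.
df_from_dict = pd.DataFrame({"a": [1, 4], "b": [2, 5], "c": [3, 6]})

print(df_from_dict)
```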

Loading data into pandas

Since nobody wants to enter data manually in Python, pandas provides ready-made functions for most popular data formats (*.csv, *.xlsx, *.dta, ...) and for databases (SQL). These are summarized in the IO tools, for which the pandas documentation gives an overview. With these tools, data can not only be read in but also saved in the appropriate format. The functions all follow a very similar pattern and begin with read_, followed by the respective format: for example, read_excel for Excel files or read_stata for Stata files. To demonstrate the various functions and ways of working with a DataFrame, we use the Titanic data set from the Kaggle website. The data set can also be downloaded from Stanford University. Loading the data then takes a single function call.
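As a hedged sketch of the read_ pattern, the following example reads CSV content with read_csv. To keep it self-contained, an in-memory string stands in for a file on disk; its rows are made up for illustration and are not the real Titanic records, and the file name in the comment is a hypothetical placeholder.

```python
import io

import pandas as pd

# Stand-in for a CSV file; the three rows are toy data, not the real data set.
csv_text = """survived,pclass,sex,age
0,3,male,22
1,1,female,38
1,3,female,
"""

# With a file on disk this would be e.g. pd.read_csv("titanic.csv"),
# where "titanic.csv" is a hypothetical local path.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)

# Saving mirrors reading: the to_* methods write the data back out.
out = io.StringIO()
df.to_csv(out, index=False)
```

The empty age field in the last row is read as a missing value, which is the situation discussed later in the article.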

How is my data structured?

If you work with initially unknown data, pandas lets you get a quick overview by accessing both attributes and functions of the DataFrame. The attributes and functions described below serve as a small checklist.

With shape we query the structure of the DataFrame, i.e. how many rows and columns it has. In this case there are 15 columns and 891 rows, or observations, in the data set, so it is comparatively small. With columns you can display the column names. Now we know what the columns are called, but not what the data looks like. You can easily change that with the head or tail function. By default, these functions return the first five or last five rows of the data set. More or fewer rows can be returned using the function parameter n. For the sake of clarity, this article only considers the first few columns of the data set; if you call the functions yourself, significantly more columns are displayed.
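The checklist could look like the following sketch. It assumes a small toy DataFrame named df with made-up rows, so the numbers printed differ from those of the real Titanic data.

```python
import pandas as pd

# Toy data for illustration only (seven made-up observations).
df = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0, 0, 1],
    "pclass": [3, 1, 3, 1, 3, 3, 2],
    "sex": ["male", "female", "female", "female", "male", "male", "female"],
    "age": [22.0, 38.0, 26.0, 35.0, 35.0, None, 54.0],
})

print(df.shape)      # attribute: (number of rows, number of columns)
print(df.columns)    # attribute: the column names
print(df.head())     # function: first five rows by default
print(df.tail(n=2))  # function: last rows, here limited to two via n
```

Note that shape and columns are attributes and are accessed without parentheses, while head and tail are functions.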

This first overview shows that the data contain different scales. The last function that we would like to introduce for getting a first impression of the data is describe. It returns typical statistical metrics, such as the mean and the median, for those columns that have a metric scale. Columns such as the gender column (sex) are not taken into account by this function.

In addition to the various metrics for the columns, the output of describe for the Titanic data shows that the number of observations (count) in the age column differs from the total number of observations. To get to the bottom of this, we need to take a closer look at the age column.
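A minimal sketch of describe on a toy DataFrame. The rows are made up for illustration, with one age deliberately left missing so that the count effect becomes visible.

```python
import pandas as pd

# Toy data for illustration only; one age value is deliberately missing.
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, None, 35.0],
    "fare": [7.25, 71.28, 7.93, 8.05],
})

# describe() summarizes numeric columns; the text column "sex" is skipped.
summary = df.describe()
print(summary)

# The median is reported in the row labelled "50%".
print(summary.loc["50%", "fare"])
```

The count row only includes non-missing values, which is why it can be lower for one column than for the others.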

Data selection

There are two options for selecting a column with pandas: bracket notation, with the column name passed as a string, and attribute (dot) notation, with the column name appended to the DataFrame.

The first option, bracket notation, always works. This is not guaranteed with the second option, since column names can of course also contain spaces or special characters. If you execute either command, you get a Series back as a result, i.e. a one-dimensional vector. For the age column you can see that some observations are missing and are marked with so-called NaN values. We'll show you how to correct them at the end of the article. First we want to go into how you can select multiple columns or a specific area.
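Both options can be sketched as follows. The DataFrame is toy data, and the column name "ticket class" is a hypothetical name introduced only to show a case where attribute notation is not possible.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "age": [22.0, None, 26.0],
    "ticket class": [3, 1, 3],  # hypothetical name containing a space
})

age_brackets = df["age"]   # option 1: bracket notation, always works
age_attribute = df.age     # option 2: attribute notation

# A name with a space can only be selected with brackets.
ticket_class = df["ticket class"]

print(type(age_brackets))         # the result is a Series
print(age_brackets.isna().sum())  # number of missing (NaN) values
```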

Selecting several columns is just as easy; the important point is that a list of the column names to be selected is passed. To select a section of the DataFrame, one uses the functions iloc or loc. The former is intended for position-based selection, i.e. one does not pass column names, but the respective position of a column or a range of rows and columns. With loc, in contrast, you work with index labels, lists of column names and individual column names. With iloc, for example, the slice :3 for both rows and columns selects all rows up to index 2 and the first three columns, because the end of a positional slice is excluded. With loc, the same row slice :3 selects the rows up to and including index 3, in combination with the column names that are passed. These functions provide an introduction to selecting different subsets of the data; the examples only scratch the surface of what is possible.
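The difference between the selection variants can be sketched like this. The DataFrame is toy data with a default integer index, and the chosen column names are assumptions made for the example.

```python
import pandas as pd

# Toy data for illustration only, with the default index 0..4.
df = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "pclass": [3, 1, 3, 1, 3],
    "sex": ["male", "female", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0, None],
})

# Several columns: pass a list of column names.
subset = df[["sex", "age"]]

# iloc is position-based: rows 0..2 and the first three columns.
by_position = df.iloc[:3, :3]

# loc is label-based: rows with labels 0..3 (end included) and named columns.
by_label = df.loc[:3, ["sex", "age"]]

print(by_position.shape)
print(by_label.shape)
```

The differing row counts come from the slice semantics: iloc excludes the end position, while loc includes the end label.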

Data manipulation

Finding perfectly prepared data in practice would be a great thing; unfortunately, reality looks a little different. We already saw a first example above with the age column. In order to use our data later for evaluations or visualizations, we now want to show a way of correcting the NaN values. The function fillna can be used for this purpose. It automatically selects the NaN values and fills them with a value that you specify. In our case, we simply fill in the average age. This way we do not add any outliers to the age column, and the average does not change.

After calling the function, we of course have to add the new age column to our data set, otherwise the transformation will not be saved. The assignment works very simply: we pass the new column name as if we were selecting an existing column. You can then access the new values in the usual way. Alternatively, you could overwrite the previous age column, but this would mean that the raw data would no longer be visible.
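A minimal sketch of this step, using toy ages and the hypothetical new column name age_filled (the name of the new column is not given in the text above):

```python
import pandas as pd

# Toy data for illustration only; one age is missing.
df = pd.DataFrame({"age": [22.0, None, 26.0, 36.0]})

# mean() skips NaN values, so this is the average of the known ages.
mean_age = df["age"].mean()

# fillna() returns a new Series; assign it to keep the result.
# "age_filled" is a hypothetical column name chosen for this sketch.
df["age_filled"] = df["age"].fillna(mean_age)

print(df)
```

Because the result is stored in a new column, the original age column with its missing value stays available for comparison.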

Summary

At the end of the article we want to briefly summarize important information about getting started with pandas:

  • Data are read in with the read_ functions, e.g. read_excel or read_stata
  • Data are usually stored in two-dimensional DataFrames
  • With the functions head, tail, and describe you get an initial overview of the data
  • A single column is selected with its name, several columns with a list of names
  • Areas of a DataFrame are selected either with iloc or with loc
  • The function fillna can be used to replace missing observations

As I said, this is only a first introduction to the pandas library.

As a little preview: in our next article in the series Data Science with Python, we will look at the visualization library Matplotlib.

About the author

Moritz Gnisia

I was an intern in the STATWORX data science team for half a year in 2018 and enjoyed writing the Python for data science introduction. Besides this, I am passionate about aviation and especially like gliding.


STATWORX is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. If you have questions or suggestions, please write us an e-mail addressed to blog (at) statworx.com.