Data Analysis and Visualisations using R (2024)

Step by Step Guide for Beginners

Are you starting your journey in the field of Data Science? Do you need to know how to get started with R? Are you intrigued by Data Visualisations? If yes, then this tutorial is meant for you!

In this article, we'll learn how to do basic exploratory analysis on a data set, create visualisations and draw inferences.

What we'll be covering

  1. Getting Started with R
  2. Understanding your Data Set
  3. Analysing & Building Visualisations

1.1 Download and Install R | R Studio

R offers a set of built-in libraries that help build flexible visualisations with minimal code.

You can download R from the R Project website. While downloading, you will need to choose a CRAN mirror. Then pick the version of R for your operating system: Windows, macOS or Linux.

It is super easy to install R. Just follow the basic installation steps and you'll be good to go.

For an easy way to write scripts, I recommend using RStudio. It is an open source development environment known for its simplicity and efficiency.

1.2 Install R packages

Packages are the fundamental units of reproducible R code, and many are created by the community. They include reusable R functions, documentation that describes how to use them, and sample data.

The directory where packages are stored is called the library. R comes with a standard set of packages. Others are available for download and installation. Once installed, they have to be loaded into the session to be used.

To install a package in R, we simply use the command

install.packages("Name of the Desired Package")
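
As a rough sketch of how the library and a session fit together, the snippet below lists where packages are stored, checks whether a package is installed, and loads it only if it is. Picking ggplot2 as the example is an assumption based on the ggplot() calls later in this tutorial.

```r
# Folders that make up the library, i.e. where installed packages are stored.
lib_dirs <- .libPaths()
print(lib_dirs)

# Check whether a package is installed (this does not attach it to the session).
# ggplot2 is only an example here, chosen because the plots later on use it.
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (has_ggplot2) {
  library(ggplot2)  # load the package into the current session
} else {
  message('ggplot2 is not installed yet; run install.packages("ggplot2") first')
}
```

Installing is needed only once per machine, while library() has to be called in every new session.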

1.3 Loading the Data set

Some data sets come pre-installed with R. Here, we shall be using the Titanic passenger data set, which is available in R through the titanic package (as the titanic_train data frame).

When using an external data source, we can use R's read functions, such as read.csv() and read.table(), to load the files. Other formats such as Excel or HTML need additional packages.

This data set is also available on Kaggle. You may download both the train and test files. In this tutorial, we'll use only the train data set.

titanic <- read.csv("C:/Users/Desktop/titanic.csv", header = TRUE, sep = ",")

The above code reads the file titanic.csv into a data frame named titanic. With header=TRUE we specify that the data includes a header row (column names), and sep="," specifies that the values are comma separated.
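
If you do not have the file at hand yet, here is a minimal sketch of the same read.csv() call that runs anywhere. It writes a tiny CSV to a temporary file first; the three rows and the name sample_df are made up purely for illustration.

```r
# Write a tiny made-up CSV to a temporary file so the example runs anywhere.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "PassengerId,Survived,Pclass,Sex,Age",
  "1,0,3,male,22",
  "2,1,1,female,38",
  "3,1,3,female,"
), csv_path)

# header = TRUE: the first line holds the column names.
# sep = ",": the values on each line are separated by commas.
sample_df <- read.csv(csv_path, header = TRUE, sep = ",")

print(sample_df)
print(is.na(sample_df$Age))  # the empty Age field is read in as NA
```

Notice that the blank Age in the last row becomes NA, which is how R marks missing values.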

We have used the Titanic data set, which contains historical records of passengers who boarded the Titanic. Below is a brief description of the 12 variables in the data set:

  • PassengerId: Serial number
  • Survived: Binary values 0 and 1. 0 means the passenger did not survive, 1 means the passenger survived.
  • Pclass: Ticket class (1st, 2nd or 3rd class)
  • Name: Name of the passenger
  • Sex: Male or female
  • Age: Age in years
  • SibSp: Number of siblings/spouses aboard (brothers, sisters and/or husband/wife)
  • Parch: Number of parents/children aboard (mother/father and/or daughter/son)
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of embarkation (C: Cherbourg, Q: Queenstown, S: Southampton)

2.1 Peek at your Data

Before we begin working on the dataset, let’s have a good look at the raw data.

View(titanic)

This opens the data set in a spreadsheet-style viewer and helps us familiarise ourselves with the data. Note the capital V: R is case sensitive.

head(titanic,n) | tail(titanic,n)

To have a quick look at the first or last few rows of the data, we often use head() and tail().

If we do not explicitly pass a value for n, it takes the default value of 6 and displays 6 rows.
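
Here is a small sketch of head() and tail() with and without n. The data frame toy and its values are made up for illustration, so you can run it without the Titanic file.

```r
# A made-up 10-row data frame standing in for the real data set.
toy <- data.frame(
  PassengerId = 1:10,
  Age = c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14)
)

first_rows <- head(toy)         # n is omitted, so the default is used
last_rows  <- tail(toy, n = 3)  # ask for the last 3 rows explicitly

print(nrow(first_rows))
print(last_rows)
```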

names(titanic)

This helps us in checking out all the variables in the data set.

str(titanic)

This helps in understanding the structure of the data set, data type of each attribute and number of rows and columns present in the data.
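
To see what names() and str() report, here is a short sketch on a made-up data frame (toy and its contents are illustrative assumptions). It also uses dim() and sapply(), which are not part of the original walkthrough, to get the same information as plain values.

```r
# Made-up stand-in for the real data set.
toy <- data.frame(
  PassengerId = 1:4,
  Name = c("Passenger A", "Passenger B", "Passenger C", "Passenger D"),
  Age = c(22, 38, NA, 35),
  stringsAsFactors = FALSE
)

column_names <- names(toy)          # all the variable names
str(toy)                            # prints rows, columns and the type of each column

dims <- dim(toy)                    # c(number of rows, number of columns)
column_types <- sapply(toy, class)  # the class of every column as a named vector
print(column_types)
```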

summary(titanic)

summary() is one of the most important functions for summarising each attribute in the data set. It gives a set of descriptive statistics, depending on the type of variable:

  • In case of a Numerical Variable -> Gives the Minimum, 1st Quartile, Median, Mean, 3rd Quartile and Maximum.
  • In case of a Factor Variable -> Gives a table with the frequencies.
  • In case of Factor + Numerical Variables -> Also gives the number of missing values (NA's), if there are any.
  • In case of character variables -> Gives the length, the class and the mode (storage type).

In case we just need the summary statistics for a particular variable in the data set, we can use summary(datasetName$VariableName). For example:

summary(titanic$Pclass)
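
To make the differences between variable types concrete, here is a small sketch that calls summary() on a numeric, a factor and a character vector. All three vectors are made up for illustration.

```r
# Made-up vectors, one per variable type.
age  <- c(22, 38, 26, NA, 35)                                    # numeric, one missing value
sex  <- factor(c("male", "female", "female", "male", "male"))    # factor
name <- c("Passenger A", "Passenger B", "Passenger C",
          "Passenger D", "Passenger E")                          # character

age_summary  <- summary(age)   # min, quartiles, median, mean, max and the NA count
sex_summary  <- summary(sex)   # frequency of each level
name_summary <- summary(name)  # length, class and mode

print(age_summary)
print(sex_summary)
print(name_summary)
```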

as.factor(dataset$ColumnName)

There are times when some of the variables in the data set are factors but get interpreted as numeric. For example, Pclass (Passenger Class) takes the values 1, 2 and 3; however, we know that these are not to be treated as numbers, as they are just levels. To have such variables treated as factors, we need to explicitly convert them using the function as.factor() and assign the result back to the column, e.g. titanic$Pclass <- as.factor(titanic$Pclass).
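
Here is a minimal sketch of the conversion and its effect on summary(). The data frame toy and its values are made up for illustration.

```r
# Made-up ticket classes, read in as plain numbers.
toy <- data.frame(Pclass = c(3, 1, 3, 1, 2, 3))

print(summary(toy$Pclass))  # numeric: quartiles and a mean, not very meaningful here

# as.factor() returns a new vector, so assign it back to the column.
toy$Pclass <- as.factor(toy$Pclass)

class_levels <- levels(toy$Pclass)
class_counts <- summary(toy$Pclass)  # factor: number of passengers per class
print(class_counts)
```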

Data Visualisation is the art of turning data into insights that can be easily interpreted. In this tutorial, we'll analyse the survival patterns and look for the factors that affected them.

Points to think about

Now that we have an understanding of the data set and its variables, we need to identify the variables of interest. Domain knowledge and the correlation between variables help in choosing these variables. To keep it simple, we have chosen only 3 such variables, namely Age, Gender (the Sex column) and Pclass.

What was the survival rate?

When talking about the Titanic data set, the first question that comes up is “How many people survived?”. Let's use a simple bar graph to show this.

library(ggplot2)
ggplot(titanic, aes(x = Survived)) + geom_bar()

On the X-axis we have the Survived variable, with 0 representing the passengers who did not survive and 1 representing the passengers who survived. The Y-axis represents the number of passengers. Here we see that around 550 passengers did not survive and ~340 passengers survived.

Let's make it clearer by checking the percentages.

prop.table(table(titanic$Survived))

Only 38.38% of the passengers in the data set survived.
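
To see where this figure comes from, here is a small sketch of the arithmetic. It assumes the 891-row Kaggle train file, in which 342 passengers are marked as survived and 549 as not survived; the vector survived is rebuilt from those counts rather than read from the file.

```r
# Rebuild the 0/1 column from the counts (assumed: 549 zeros and 342 ones).
survived <- rep(c(0, 1), times = c(549, 342))

counts <- table(survived)      # how many 0s and how many 1s
share  <- prop.table(counts)   # each count divided by the total (891)

pct_survived <- round(100 * as.numeric(share["1"]), 2)
pct_died     <- round(100 * as.numeric(share["0"]), 2)

print(counts)
print(c(survived = pct_survived, died = pct_died))
```

prop.table() simply divides every cell by the sum of the table, so 342 / 891 gives the survival share.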

Survival rate basis Gender

It is believed that during rescue operations in disasters, women's safety is prioritised. Did the same happen back then?

We see that the survival rate amongst women was significantly higher than amongst men. The survival rate amongst women was around 75%, whereas for men it was less than 20%.
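
One possible way to draw such a comparison is sketched below. It is not necessarily the code behind the original chart, and the data frame toy is made up, so its rates will not match the real figures. With your own data, swap toy for titanic.

```r
library(ggplot2)

# Made-up data: 4 women and 6 men.
toy <- data.frame(
  Sex = c("female", "female", "female", "female",
          "male", "male", "male", "male", "male", "male"),
  Survived = c(1, 1, 1, 0, 0, 0, 0, 0, 1, 0)
)

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate.
rate_by_sex <- tapply(toy$Survived, toy$Sex, mean)
print(round(rate_by_sex, 2))

# position = "fill" stretches every bar to height 1,
# so the colours show proportions instead of counts.
p <- ggplot(toy, aes(x = Sex, fill = factor(Survived))) +
  geom_bar(position = "fill") +
  labs(y = "Proportion", fill = "Survived")
print(p)
```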

Survival Rate basis Class of tickets (Pclass)

There were 3 segments of passengers, depending upon the class they were travelling in, namely, 1st class, 2nd class and 3rd class. We see that over 50% of the passengers were travelling in the 3rd class.

1st and 2nd Class passengers disproportionately survived, with over 60% survival rate of the 1st class passengers, around 45–50% of 2nd class, and less than 25% survival rate of those travelling in 3rd class.
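
Both kinds of percentages, the share of passengers per class and the survival rate within each class, can be computed with prop.table(). The sketch below uses a made-up data frame toy, so the numbers are illustrative only.

```r
# Made-up data: 3 first class, 2 second class and 5 third class passengers.
toy <- data.frame(
  Pclass   = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Survived = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0)
)

# Share of passengers travelling in each class.
class_share <- prop.table(table(toy$Pclass))
print(class_share)

# Cross-tabulate class against survival.
# margin = 1 makes every row (class) sum to 1, giving the rate within each class.
rate_by_class <- prop.table(
  table(Pclass = toy$Pclass, Survived = toy$Survived),
  margin = 1
)
print(round(rate_by_class, 2))
```

Without margin = 1 the proportions would be taken over the whole table, which answers a different question.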

I'll leave you with a thought: was it because of preferential treatment of the passengers travelling in the elite classes, or because of proximity, as the 3rd class compartments were on the lower decks?

Survival Rate basis Class of tickets (Pclass) and Gender

We see that the females in the 1st and 2nd class had a very high survival rate. The survival rate for females travelling in 1st and 2nd class was 96% and 92% respectively, compared with 37% and 16% for men. The survival rate for men travelling 3rd class was less than 15%.
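
A sketch of one way to break the rates down by two variables at once follows. The use of aggregate() and facet_wrap() is my assumption, not necessarily what produced the original chart, and the data frame toy is made up.

```r
library(ggplot2)

# Made-up data: 2 women and 2 men in each of the three classes.
toy <- data.frame(
  Pclass   = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  Sex      = rep(c("female", "female", "male", "male"), times = 3),
  Survived = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0)
)

# Survival rate for every combination of class and gender.
rates <- aggregate(Survived ~ Pclass + Sex, data = toy, FUN = mean)
print(rates)

# One panel per class, with a filled bar per gender inside each panel.
p <- ggplot(toy, aes(x = Sex, fill = factor(Survived))) +
  geom_bar(position = "fill") +
  facet_wrap(~ Pclass) +
  labs(y = "Proportion", fill = "Survived")
print(p)
```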

So far, it is evident that Gender and Passenger class had a significant impact on the survival rates. Let's now check the impact of a passenger's Age on the survival rate.

Survival rates basis age

Looking at the age < 10 years section of the graph, we see that the survival rate is high. Beyond the age of 45, the survival rate is low and keeps dropping.

Here we have used a bin width of 5. You may try out different values and see how the graph changes.
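
The sketch below shows one way to build an age histogram with a bin width of 5, plus a quick check of the rates per age band using cut(). The data frame toy, the band limits of 10 and 45, and the step of dropping missing ages are illustrative assumptions.

```r
library(ggplot2)

# Made-up ages, including one missing value.
toy <- data.frame(
  Age      = c(2, 4, 8, 19, 22, 24, 28, 31, 35, 38, 47, 52, 61, NA),
  Survived = c(1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0)
)

# Keep only the rows where the age is known.
known_age <- toy[!is.na(toy$Age), ]

# binwidth = 5 groups the ages into 5-year bins; try 2 or 10 to compare.
p <- ggplot(known_age, aes(x = Age, fill = factor(Survived))) +
  geom_histogram(binwidth = 5) +
  labs(y = "Number of passengers", fill = "Survived")
print(p)

# Survival rate per age band: under 10, 10 to 44, and 45 or older.
age_band <- cut(known_age$Age, breaks = c(0, 10, 45, Inf), right = FALSE)
rate_by_band <- tapply(known_age$Survived, age_band, mean)
print(round(rate_by_band, 2))
```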

Survival Rate basis Age, Gender and Class of tickets

This graph helps identify the survival patterns considering all the three variables.

The top 3 sections depict the female survival patterns across the three classes, while the bottom 3 represent the male survival patterns across 3 classes. On the x-axis we have the Age.

It is evident that the survival rate of children across 1st and 2nd class was the highest. Except for 1 girl child, all children travelling 1st and 2nd class survived. The survival rates were lowest for men travelling 3rd class.
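
A grid like the one described, with a row of panels per gender and a column per class, could be sketched with facet_grid(). This is one possible approach, not necessarily the original code, and the data frame toy is made up.

```r
library(ggplot2)

# Made-up data covering every combination of gender and class.
toy <- data.frame(
  Age      = c(4, 30, 45, 8, 27, 52, 3, 22, 40, 6, 35, 60),
  Sex      = rep(c("female", "male"), each = 6),
  Pclass   = rep(c(1, 1, 2, 2, 3, 3), times = 2),
  Survived = c(1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0)
)

# Sex ~ Pclass: one row of panels per gender, one column per class.
p <- ggplot(toy, aes(x = Age, fill = factor(Survived))) +
  geom_histogram(binwidth = 5) +
  facet_grid(Sex ~ Pclass) +
  labs(y = "Number of passengers", fill = "Survived")
print(p)

# The distinct gender and class pairs, one per panel.
panels <- unique(toy[, c("Sex", "Pclass")])
print(nrow(panels))
```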

I hope you found this article helpful. Keep learning, keep growing!

Author: Francesca Jacobs Ret