R Programming
What if I told you that R, a GNU project, is partly written in R itself? Its core is written primarily in C and Fortran, and many of its modules are written in R. It’s an open source programming environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and analyzing data. Its ease of use and extensibility have greatly increased its popularity in recent years. In addition to data mining, it provides statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time series analysis, classification, clustering, and more.
The R language is Python’s main competitor among people who work in statistics and data analysis. It’s used in the social and economic sciences to look for cause-and-effect relationships, compare samples, and create visual reports and charts.
Statisticians in the Department of Statistics at the University of Auckland developed the language. At first it was an internal tool, but then they made it available to everyone, and it became very successful.
This is an important point: R is designed by statisticians for statisticians. It comes with popular statistical tests, data analysis methods, and handy graphing tools built in. Not every popular general-purpose language offers these features out of the box.
And the specialized R language is steadily gaining ground: from 18th place in the TIOBE ranking in 2016, it rose to 8th place in January 2021. The interpreter and desktop environment can be installed on any modern operating system: macOS, Linux, or Windows.
What’s under the hood
R is an interpreted, object-oriented programming language. What does that mean? Everything you work with, including functions and tables, is an object that belongs to a certain class (data type), and the program is executed immediately, line by line. You don’t have to compile the code into an executable file before running it.
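As a rough illustration of the idea that functions and tables are objects with a class, here is a minimal sketch in base R. The `scores` table and its columns are made up for the example.

```r
# Every value in R is an object, and class() reports what kind it is.
scores <- data.frame(name = c("Ann", "Bob"), points = c(12, 9))  # a small table

class(scores)   # "data.frame"
class(mean)     # "function"
class(3.14)     # "numeric"

# Statements run one at a time, with no separate compile step.
total <- sum(scores$points)
print(total)    # prints [1] 21
```

Typing these lines one by one into the R console gives the same result as running them as a script.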
R’s syntax is simple, and its set of basic data types is small: the most commonly used ones are character, numeric, logical and complex. These basic types are combined into more complex structures. For example, a vector is an ordered collection of values of the same type (such as numbers or strings), while a list can hold values of different types. Numeric variables can also take special values: NaN (not a number), Inf (infinity) and NA (not available).
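The types and special values mentioned above can be tried out with a few lines of base R. This is only a sketch, and the variable names are illustrative.

```r
# Atomic vectors hold elements of a single type.
ages   <- c(31, 45, 27)                      # numeric
people <- c("Ann", "Bob", "Cy")              # character
is_new <- c(TRUE, FALSE, TRUE)               # logical
z      <- complex(real = 1, imaginary = 2)   # complex

# Mixing types in c() coerces everything to one common type.
mixed <- c(1, "a", TRUE)
class(mixed)                                 # "character"

# A list can keep values of different types side by side.
record <- list(name = "Ann", age = 31, is_new = TRUE)

# Special values.
0 / 0                    # NaN
1 / 0                    # Inf
x <- c(4, NA, 6)
mean(x)                  # NA, because one value is missing
mean(x, na.rm = TRUE)    # 5, after dropping the missing value
```

Most summary functions accept `na.rm = TRUE`, which is often the simplest way to deal with missing values in a first pass over the data.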
One of the most frequently used commands in R is reading a file, because you have to open and examine datasets all the time.
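Reading a file might look like the sketch below. To stay self-contained it first writes a tiny CSV to a temporary file; the column names and numbers are invented for the example.

```r
# Write a tiny CSV to a temporary file so the example runs anywhere.
path <- tempfile(fileext = ".csv")
writeLines(c("month,downloads", "June,120", "July,150", "August,135"), path)

# read.csv() loads the file into a data frame.
downloads <- read.csv(path)

head(downloads)               # first rows of the table
str(downloads)                # column names and types
summary(downloads$downloads)  # basic statistics for one column
```

With a real dataset, `path` would simply be the location of the file on disk.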
What you can do with R
- Process, clean up, and transform data for exploration. For example, you want to see how many users on average downloaded your mobile app in each summer and fall month. R lets you exclude the winter and spring months and group the remaining data by month for further calculations.
- Run statistical tests. Suppose you want to know whether the average life expectancy of men and women differs. To find out, you can run a t-test; its results will show whether the difference between the two groups is statistically significant.
- Run an exploratory analysis. The data need to be checked for normality because many statistical methods (e.g., the same t-test) assume a normal distribution in the raw data. A normal distribution implies that most of the data are clustered around the mean value, and values far from the mean are much rarer. Such a distribution is often found in life: people of average height are the most numerous in the world, and very tall and very short people are few. R has tools for checking normality with graphs and tests.
- Work with tables of different formats. This is useful for analysts: for example, to combine data from several .csv and .xlsx tables and process them as a single file.
- Draw an interactive chart and adjust its parameters, such as the axis values.
- Create an interactive application. The result is a nice-looking web page with a graph, filters and sortable data. You can send it to your colleagues or publish it as part of your article. One example of such an application tracks the incidence of coronavirus around the world (its code is open and available on GitHub).
- Analyze regression models. Regression analysis is a technique that reveals the relationship between a dependent variable and one or more independent variables. For example, an analyst wants to understand why some stores in a chain have higher sales than others. The dependent variable would be the volume of sales, and the independent variables would be the income and age of neighborhood residents and the distance from the store to the nearest bus stop. As a result, it is possible to find out which of these factors affect store sales more than others.
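A few of the tasks from the list above can be sketched in base R as follows. All of the data here is simulated, and the data frames (`downloads`, `stores`) and their columns are hypothetical names chosen for the example. It assumes Northern Hemisphere seasons, with summer and fall running from June through November.

```r
set.seed(42)  # make the simulated data reproducible

# Filter and group: average downloads per summer and fall month.
downloads <- data.frame(
  month = rep(month.name, times = 2),             # two years of monthly data
  count = round(runif(24, min = 100, max = 500))  # simulated download counts
)
keep <- month.name[6:11]                          # June through November
summer_fall <- downloads[downloads$month %in% keep, ]
avg_by_month <- aggregate(count ~ month, data = summer_fall, FUN = mean)

# t-test: do two simulated groups differ in their mean?
group_a <- rnorm(50, mean = 70, sd = 8)
group_b <- rnorm(50, mean = 75, sd = 8)
tt <- t.test(group_a, group_b)
tt$p.value            # a small p-value suggests the means differ

# Normality check with the Shapiro-Wilk test.
sw <- shapiro.test(group_a)
sw$p.value            # a small p-value suggests the data are not normal

# Regression: sales explained by income, age and distance.
stores <- data.frame(
  income   = rnorm(40, mean = 50, sd = 10),
  age      = rnorm(40, mean = 40, sd = 5),
  distance = runif(40, min = 0.1, max = 2)
)
stores$sales <- 20 + 3 * stores$income - 15 * stores$distance +
  rnorm(40, sd = 10)
fit <- lm(sales ~ income + age + distance, data = stores)
summary(fit)$coefficients   # estimates, standard errors and p-values
```

For a graphical normality check, `qqnorm(group_a)` followed by `qqline(group_a)` draws a Q-Q plot of the same sample. The interactive charts and applications from the list rely on add-on packages and are not covered by this sketch.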