A Beginner’s Guide to Learning How to Code in R

By Savannah Novencido
Marrow Undergraduate Intern and Research Assistant
Getty Conservation Institute
2021 August

Introduction

Coding or programming is a vital component of data science because it allows for the automation of data processing and analysis. Because environmental data is often collected for conservation and preservation purposes, its analysis can be optimized using programming. R is commonly used for statistical computation, and acts as a great starting point for understanding how coding can be used to analyze and visualize environmental data.

 

As an undergraduate student majoring in materials science, I wanted to explore the applications of science in the cultural heritage field. This led me to an internship at the Getty Conservation Institute (GCI), where I had the opportunity to work with Associate Scientist Vincent Laudato Beltran from the Preventive Conservation Science Group and the Managing Collection Environments (MCE) Initiative.

 

One of my projects focused on the adaptation of the GCI Excel Tools using R. This set of environmental analysis Excel modules was developed by Vincent as a teaching tool for the MCE workshop, Preserving Collections in the Age of Sustainability. Using Excel, he created a suite of data calculations (dew point and humidity ratio, fluctuations, moving average, percentiles) and visualizations (time series, cumulative relative frequency, box plots, psychrometric chart) to improve understanding of temperature and relative humidity data (Cosaert and Beltran 2021). However, Excel posed complications for the user, including manual manipulation of data in worksheets or features on plots, limitations on the breadth and customization of available graphs, and the need to have access to the Excel program. While workarounds were found for a number of these issues, Vincent suggested I attempt to recreate the GCI Excel Tools using the open-source platform and coding language of R.
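To give a flavor of what such calculations can look like in R, here is a minimal sketch. It is not the GCI Excel Tools or my adaptation of them. The temperature and relative humidity values are invented for illustration, the variable names are hypothetical, and the dew point uses the Magnus approximation with one common set of coefficients, which is an assumption and not necessarily the formula used in the Excel modules.

```r
# Illustrative hourly readings (invented values, not real monitoring data)
temp_c <- c(20.1, 20.4, 21.0, 21.6, 22.3, 22.0, 21.2, 20.6)  # temperature, deg C
rh_pct <- c(52, 51, 49, 47, 45, 46, 48, 50)                  # relative humidity, %

# Dew point via the Magnus approximation (assumed coefficients a and b)
a <- 17.62
b <- 243.12
gamma <- log(rh_pct / 100) + (a * temp_c) / (b + temp_c)
dew_point_c <- (b * gamma) / (a - gamma)

# Centered 3-point moving average of temperature; the first and last
# values are NA because the window does not fit at the ends
temp_ma3 <- stats::filter(temp_c, rep(1 / 3, 3), sides = 2)

# 5th, 50th, and 95th percentiles of relative humidity
rh_percentiles <- quantile(rh_pct, probs = c(0.05, 0.50, 0.95))

print(round(dew_point_c, 1))
print(round(as.numeric(temp_ma3), 2))
print(rh_percentiles)
```

Each step is a single line of code that can be rerun on a new dataset without the manual worksheet manipulation described above.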

 

Because I had no formal background in coding, Vincent directed me towards a set of materials for learning R, but allowed for flexibility in exploring other resources. Based on my immediate work goals and background as a student, I wanted to explore available learning options without a huge commitment in terms of cost or time. Because my goal was to recreate an existing analysis tool package, I looked for examples that were similar to what I hoped to accomplish. Through this search, I discovered several comprehensive and free or inexpensive resources suitable for beginner coders.

 

While my learning style was guided by the goals and timeline of the project, others may choose to invest in paid courses as they can offer a more interactive and well-defined trajectory. Also, the act of paying for a course may increase accountability and follow-through. In some cases, these courses can give you access to a professional instructor and reliable sample code that you can work with as a part of the course curriculum.

 

Though there are many ways to learn, choosing resources best suited to your needs allows you to tailor your own coding journey. Access to multiple resources can also provide you with complementary perspectives and a range of coding examples that can be helpful. Table 1 briefly lists various recommended resources for learning R, with more detailed descriptions appearing in the Resources section at the end of this document. While these resources were selected as a means of getting started in coding, many will serve as references even after learning the basics.

[Table 1. Recommended resources for learning R (image: Picture1.jpg)]

R and other programming languages

Programming has become a ubiquitous skill in many fields, and there exist over 700 programming languages. These languages allow users to communicate with the computer through written instructions, and many are designed with specific applications in mind. For example, programmers might choose C++ for performance-critical applications that demand fast processing (e.g., game or app development), while data scientists may opt for Python because of its simpler syntax. While I chose R for my project, other common languages used among data scientists include Scala, SQL, JavaScript, and MATLAB.

One phenomenon often observed while learning to code is the Dunning-Kruger effect, which describes an overestimation of newly gained skills because the novice coder does not recognize the depth of knowledge needed to become proficient (Kruger and Dunning 1999). This effect can be visualized through a competence versus confidence curve, shown in Figure 1. At the beginning of your coding journey, you may experience the “Coding Honeymoon”, where confidence in your abilities skyrockets despite a lack of experience. Then comes a period of “Initial Confusion”, when your confidence may drop sharply even as you gain coding experience. As an intermediate coder, you may reach a valley of “Overwhelming Complexity”, as you begin to understand just how deep and complex the material actually is, making you doubt your capability. However, it is worth persevering through this stage, as you will eventually regain confidence in your coding skills with further expertise. Note that your level of confidence may plateau below that of a beginner coder; at this later stage, an expert coder will have a more realistic assessment of their skill and capability.

R was first developed in 1993 as a programming language intended for statistical computing and data visualizations (Hornik 2017). In its simplest form, R can be run using its built-in command-line interface (CLI), which is a text-based interface where the user inputs text to execute functions. In programming, a function is a section of code that executes a specific task.
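As a small illustration of a function, the sketch below defines one and calls it, exactly as it could be typed at the R command line. The function name `c_to_f` and the input values are hypothetical examples and not part of any package.

```r
# Define a function that converts degrees Celsius to degrees Fahrenheit
c_to_f <- function(temp_c) {
  temp_c * 9 / 5 + 32
}

# Call the function on a single value and on a vector of values
c_to_f(20)
c_to_f(c(0, 25, 100))

# Built-in functions are called the same way
mean(c(18.5, 20.0, 21.5))
```

Because most R functions work on whole vectors, the same function handles one reading or a full column of readings without any changes.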
 

[Figure 1. Competence versus confidence curve (image: Picture2.jpg)]

However, users can opt to use third-party graphical user interfaces (GUI). GUIs are programs that use symbols, graphic elements, and pointing devices to communicate with a computer. A popular GUI for R is RStudio (https://www.rstudio.com/), which is an integrated development environment (IDE) that allows you to import and edit code through a user-friendly interface. An IDE consolidates developer tools into a single GUI, allowing you to edit, execute, and debug code in the same environment. 

Packages are another important feature of R that allows for the use of pre-programmed functions to perform statistical tests, analyze and visualize data, and create predictive models. A standard set of packages is included as a part of the installation of R, but many user-created packages can be accessed in various repositories. A repository is a centralized location for data storage and organization; packages can be installed from repositories. The Comprehensive R Archive Network (CRAN, https://cran.r-project.org/) is a highly accessed repository with over 14,000 packages. ggplot2 and tidyverse are examples of commonly used packages.
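In practice, installing and loading a package takes only a line or two. The following is a minimal sketch: the installation line is commented out so that nothing is downloaded by accident, and ggplot2 is used only because it is named above.

```r
# Install a package from CRAN (needs an internet connection; run once).
# Left commented out so the script does not download anything by accident.
# install.packages("ggplot2")

# Load an installed package so its functions are available in this session.
# The stats package ships with R, so this line works on any installation.
library(stats)

# Check whether a package is installed before trying to load it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  message("ggplot2 is installed and loaded")
} else {
  message("ggplot2 is not installed; see install.packages() above")
}

# List the packages currently attached to the session
search()
```

Installation is needed only once per computer, while `library()` must be called in every new R session that uses the package.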

RStudio Interface
One’s first encounter with RStudio can be overwhelming due to the numerous windows and tabs. The following provides a brief summary of the RStudio interface.

 

[Image: Picture3.png]

Figure 2. RStudio interface containing the editor (1), console (2), environment (3), and file (4) panels.

The editor window (Figure 2-Area 1) is where we can create, view, and edit code. Coding in this window is preferable because we can save the script. Specific blocks of code can be run by highlighting the text and clicking the “Run” button. Comments can also be added by placing a ‘#’ at the start of a line or after a piece of code; this symbol signals that the rest of the line will not be run by the interpreter.
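A short script typed into the editor might look like the sketch below, which shows the two common ways of using comments. The variable name and values are made up for illustration.

```r
# This whole line is a comment and is ignored when the script runs
temps <- c(19.8, 20.5, 21.1)  # text after '#' on a code line is ignored too

# Commenting out a line is a quick way to disable it temporarily
# temps <- temps + 100

mean(temps)
```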

The console (Figure 2-Area 2) shows the commands that have been executed, along with their output and any resulting warning or error messages. Commands can also be typed directly into the console or run from a source file, which is a text file containing instructions written in a programming language. However, working directly in this traditional command-line style interpreter is atypical, as the previously described editor window provides more flexibility and convenience for coding.

 

Objects in the workspace and previously run commands, including those saved from prior sessions, are listed in the Environment and History tabs (Figure 2-Area 3). An object is any data structure within R (e.g., vectors, lists, matrices, arrays). Clicking an object opens its data in another tab. You can also view specific properties (e.g., type or dimensions) of the loaded objects or datasets.
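The same properties can be inspected with code. The sketch below creates one object of each kind mentioned above and queries its type and dimensions; the object names and values are hypothetical.

```r
# Create one object of each kind
v <- c(45, 50, 55)                     # numeric vector
m <- matrix(1:6, nrow = 2, ncol = 3)   # matrix with 2 rows and 3 columns
a <- array(1:24, dim = c(2, 3, 4))     # three-dimensional array
l <- list(site = "Gallery A", rh = v)  # list holding different types

# Inspect their properties
class(v)  # type of the vector
dim(m)    # dimensions of the matrix
dim(a)    # dimensions of the array
str(l)    # compact summary of the list's structure

# List the names of all objects in the current workspace
ls()
```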

Figure 2-Area 4 contains tabs that display the files in your working directory, plots, and installed packages. Plots can be saved in different image formats. The Help tab displays help files for packages and functions.
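These panel actions also have code equivalents. The following is a minimal sketch that checks the working directory and saves a plot to a file. It writes a PDF to a temporary folder as an assumption for portability; base R also provides `png()` and `jpeg()` devices for other formats. The file name and the plotted values are made up.

```r
# Show the current working directory and the files it contains
getwd()
list.files()

# Save a simple plot to a PDF file in a temporary folder
out_file <- file.path(tempdir(), "example_plot.pdf")
pdf(out_file, width = 7, height = 5)  # open a file-based graphics device
plot(1:10, (1:10)^2, type = "b",
     xlab = "x", ylab = "x squared", main = "Example plot")
invisible(dev.off())                  # close the device to finish writing

# Confirm that the file was created
file.exists(out_file)

# Open the help file for a function (shown in the Help tab in RStudio)
# help("plot")
```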
 

 

Utilizing ConCode
Although I was able to create select graphics in R, I struggled to communicate on larger platforms when trying to resolve coding issues. StackOverflow (https://stackoverflow.com/questions), a public online forum for asking programming questions and troubleshooting, has proven to be a powerful tool, but its community members often use coding jargon that can make an issue confusing for a beginner. Further, posting a question that meets the expectations of StackOverflow is challenging, so I recommend familiarizing yourself with the posting guidelines first. Reviewing existing posts on StackOverflow can also be helpful.

As I delved deeper into my GCI project, I continued to grapple with several outstanding coding issues. After reaching out to Bhavesh Shah, a data scientist at the V&A, I learned about ConCode (http://concodeworkspace.slack.com/), which is a Slack workspace that allows coders in cultural heritage to collaborate and offer advice. He suggested having a Q&A session with members of ConCode, where I could go through my code and highlight problem areas. An outline of the objectives for the meeting was distributed beforehand.

Overall, I found that ConCode was one of the best resources for answering more immediate questions. The coding issues that I’ve encountered have largely already been experienced by others, and the solutions are relevant since this specific coding community focuses on the cultural heritage field.

That being said, ConCode is probably best used to get functioning code, whereas StackOverflow coders may urge you to optimize your code. While efficient code is definitely desirable, I find that most of these solutions demand a more rigorous study of R, which can be limiting. 

Reflecting on my ConCode meeting, I found that having real-time conversations where you can ask several questions is the easiest way to find what works and what doesn’t. This meeting was essential in helping me get the ball rolling for my project, and as a beginner coder, the advice and encouragement from more experienced coders kept me motivated. It is hoped that recorded ConCode coding sessions will provide a useful resource for new and expert coders alike.

Resources
Below is a collection of R learning resources (introduced in Table 1) that I’ve used throughout my project, as well as recommendations from other coding colleagues.

If this is your first experience with coding, a structured class such as the edX or DataCamp courses is great for building a well-rounded foundation for future programming endeavors. The courses familiarize you with the basics of coding and integrate workable examples.

Alternatively, the free R for Data Science online book is an excellent guide to coding in R and using it for data analysis, while still giving you the option to test code on your own. In the end, using a blend of these resources can round out your skills and help you become familiar with the programming language.

Note that most of the resources listed below are oriented towards data analysis using R as the main language, but many also provide information on other programming languages such as MATLAB, Python, and C++.

Online Courses
edX Courses (https://www.edx.org/)

  • Cost: Free to $150 (Verified Certificate)  

  • Description: edX hosts free university-level online courses for a variety of disciplines, including computer science, statistics, and data analysis. edX also offers programs (both free and paid), which combine related courses into a single curriculum. Paid edX programs may offer a Professional Certificate, which serves as a way to verify your completion of the program. Many of the instructors for the courses are professors from universities like Harvard, Stanford, and MIT, and both self-paced and live courses are available.

  • Examples (Free)

    • HarvardX: Data Science: R Basics, Statistics and R

    • StanfordOnline: R Programming Fundamentals

    • UCx: Statistical Analysis 

  • Comments

    • Some of the professional programs can be costly, but the individual courses are completely free. 

    • A verified certificate proving proficiency is only given after both completing a program and successfully passing the assessments. 

DataCamp (https://www.datacamp.com/)

  • Cost: Free (Introduction only) to $33/month

  • Description: While all users can see the first chapter of over 350 online courses on data science using different programming languages, access to the full courses requires users to subscribe individually or under a business license. Courses are organized by programming language, topic, and level of experience. The format of each course ranges from tutorial-style videos from coding experts to written exercises that can be completed with a built-in live script. Users can also access assessments that test their knowledge of a specific topic, as well as professionally guided example projects.

  • Examples

    • Data Science: Data Science for Everyone

    • R: Introduction to R, Introduction to the Tidyverse

    • Python: Introduction to Python, Introduction to Data Science in Python

    • SQL: Introduction to SQL, Joining Data in SQL

  • Comments

    • DataCamp subscriptions offer access to all courses in their entirety, while edX professional courses must be paid for individually.

    • While you can pay for structured courses in R, I found that I was able to do a lot with the blogs and mini tutorials that are offered outside of their program.

UCLA’s IDRE R Seminar (https://stats.idre.ucla.edu/r/seminars/intro/)

  • Cost: Free

  • Description: Created by UCLA’s Institute for Digital Research and Education, there are several classes, examples, book lists, and troubleshooting tips for R and other programming languages (Stata, SAS, SPSS, Mplus). Directions for installation and running each program are also available.

  • Examples

    • R Data Management

    • R Markdown Basics

  • Comments

    • Unlike the DataCamp and edX courses, there is no interactive portion built into the classes. 

Books
R for Data Science (Garrett Grolemund and Hadley Wickham, https://r4ds.had.co.nz/index.html)

  • Cost: Free (online), $30 (paperback)

  • Description: R for Data Science (R4DS) is an online and physical book that introduces R for data analysis. The chapters cover important concepts like basic data manipulation, data cleaning, and model building. R4DS is also a useful reference for correct grammar and syntax when writing code.

  • Comments

    • This book is structured so that the beginning chapters are a gradual introduction to R as a language, eventually diving into more advanced topics in later sections. The incremental nature of the curriculum makes it highly recommended for beginner coders.

    • The free online version of the book is preferable as it is continually updated.

 

YouTube
R Tutorials (https://www.youtube.com/channel/UCGxEunCWRMK9z3NRBdC_pyg)

  • Cost: Free

  • Description: The R Tutorials YouTube channel has short videos, each focused on a specific topic related to R, including relevant tutorials and exercises.

  • Comments

    • R Tutorials developed workable data sets that can be downloaded through their website (http://r-tutorials.com/). The step-by-step exercises supplement the videos available on the channel, allowing you to put your knowledge into practice.

Online Forums and Communities
#RStats on Twitter (https://twitter.com/search?q=%23RStats&src=typed_query)

  • Cost: Free

  • Description: #RStats is a Twitter hashtag where the R community posts projects, tips, questions, and extra resources. Expert coders sometimes spend time on #RStats answering questions, and the hashtag can also be used to stay up to date on the latest R news. While #RStats can be used for browsing purposes, it is most helpful in situations where you need a question to be quickly answered. Using the Advanced Search function in Twitter will allow you to narrow the results to find the most relevant information.

  • Comments

    • #RStats can be hard to navigate for those who are unfamiliar with Twitter. 

    • While an account is not needed to view the tweets on the hashtag page, one is needed to make posts. 

 

StackOverflow (https://stackoverflow.com/questions)

  • Cost: Free

  • Description: StackOverflow is a public platform where community members can ask questions or troubleshoot problems for all kinds of programming projects. Coders of all languages and expertise use this platform to exchange ideas and get help. Tags can be used to refine your search.

  • Comments

    • When posting a question on StackOverflow, it is important to follow the posting guidelines, which ask you to outline the specifics of your project and include any code or data needed.

    • StackOverflow is a great option for getting general advice on your projects; however, the format of the website makes it difficult to discuss and ask follow-up questions on the same post.

 

ConCode (http://concodeworkspace.slack.com/)

  • Cost: Free

  • Description: ConCode is a Slack workspace dedicated to coding in the context of cultural heritage. As a community, ConCode was intended as a means for conservators and colleagues in allied fields to communicate and collaborate on coding projects and answer coding questions. Members of ConCode also join together for monthly meetings and seminars on popular topics.

  • Comments

    • Since its inception in October 2020, the ConCode community has grown substantially. Having a separate resource for coders in cultural heritage allows members to discuss current issues within the field and ways to use programming to optimize data analysis.

    • Because members communicate through a Slack workspace, it is faster and easier to ask follow-up questions on the same thread. 

 

GitHub
TidyTuesday (https://github.com/rfordatascience/tidytuesday)

  • Cost: Free

  • Description: TidyTuesday offers weekly projects to challenge data cleaning skills, with an emphasis on using different tools from the tidyverse package to analyze data and create visualizations. The projects and data sets can be found on GitHub, a popular code-hosting platform for collaboration and version control. As a project of the R4DS community, TidyTuesday presents real-world datasets each week that can be used for practice.

  • Comments

    • The data provided isn’t necessarily “tidy” so that you can practice both your data cleaning and visualization skills. 

    • Examples of community-made visualizations can be found on #TidyTuesday. 

Acknowledgements 
I would like to express my gratitude to my supervisor, Vincent Laudato Beltran, for inspiring and guiding this project. This work could not have been accomplished without his continued advice and support. 

I would also like to thank ConCode organizers Bhavesh Shah, Melissa King, and Annelies Cosaert for helping me develop and edit my code. Finally, I would like to thank the Getty Conservation Institute and Marrow Undergraduate Intern Program for the opportunity to work on this project.

References 
Cosaert, Annelies and Vincent Laudato Beltran. 2021. Comparison of temperature and relative humidity analysis tools to address practitioner needs and improve decision-making. In Transcending Boundaries: Integrated Approaches to Conservation. ICOM-CC 19th Triennial Conference Preprints, Beijing, 17–21 May 2021, ed. J. Bridgland. Paris: International Council of Museums.

Hornik, Kurt. 2017. R FAQ. 2.1 What is R? The Comprehensive R Archive Network. https://cran.r-project.org/doc/FAQ/R-FAQ.html. Accessed 5 August 2021.

Kruger, Justin and David Dunning. 1999. Unskilled and unaware of it: How difficulties in recognizing one's own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology, 77(6): 1121–1134. https://doi.org/10.1037/0022-3514.77.6.1121