Report, 1st draft

Description

Exploratory data analysis is a critical step in any statistical analysis. It allows us to investigate the data that we collected, to identify important patterns or trends, to spot any unexpected anomalies or issues, and to check the validity of the assumptions we will need for future statistical inference.

Format

Your final exploratory data analysis document should be submitted as a PDF file, rendered from a Quarto file (*.qmd).

Use the proposal you submitted previously as your starting point.

Content

Add a References section and attach a BibTeX database. Note that you will have to set the bibliography field to the correct path to your BibTeX file. See SDS 100 Lab 12 for more information.

Tip

Use bibliographic references as detailed in SDS 100. Please note the inclusion of a refs.bib file in the YAML header above!
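As a rough illustration of what a BibTeX database contains, the sketch below writes a single entry (the citation for R itself) to a .bib file using base R. The citation key R-base and the object names are hypothetical, and the file is written to a temporary directory here only so that the example is self-contained; in your project you would typically save refs.bib next to your .qmd file.

```r
# Assumption: the YAML header of your .qmd file contains a line such as
#   bibliography: refs.bib
# pointing at the file created below.

# Hypothetical location; in a real project this might simply be "refs.bib".
bib_path <- file.path(tempdir(), "refs.bib")

# Base R can produce a BibTeX entry for R itself.
entry <- toBibtex(citation())

# The generated entry has an empty citation key ("@Manual{,"),
# so fill in a key of our own choosing (hypothetical: R-base).
entry[1] <- sub("\\{,$", "{R-base,", entry[1])

# Write the entry to the BibTeX file.
writeLines(unclass(entry), bib_path)
```

With an entry like this in your BibTeX file, writing [@R-base] in the body of the document should produce a citation and an item in the References section.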

In the Data section

Edit the Data section of your proposal such that it contains the following three subsections.

Descriptive Statistics and Univariate Summaries

Provide meaningful visual and numerical summaries of your chosen dataset. Report the following components:

  1. The number of rows in your dataset (i.e., the total number of observations that you plan to use in your analysis).
  2. For each of your response variable(s) and your most important explanatory variables (including at least one categorical variable and at least one numerical variable):
    1. The number of observations with missing values, if any
    2. For categorical variables: a bar plot visualizing the observed data distribution
    3. For numerical variables: the maximum and minimum values in your dataset and a histogram visualizing the observed data distribution. Briefly comment on the skew and modality of the distribution.
  3. A Table 1 summarizing the center and spread for your response variable(s) and your most important explanatory variables. Table 1 is a common feature of research articles, particularly in clinical research contexts. It summarizes the key features and characteristics of the sample. See SDS 100: Lab 9 for detail about how to create tables, figures, and references.
    1. For all numerical variables, you should report an appropriate measure of center (either the mean or the median) and an appropriate measure of spread (either the standard deviation or the interquartile range) in Table 1.
    2. For all categorical variables, you should report the number and percent of observations falling in each level in Table 1.

As you construct these summaries, always check that what you see in the data makes sense to you. When doing so, it may be helpful to refer back to your project proposal, in particular to either (a) the range of values you anticipated seeing in the data (for numerical variables) or (b) the possible values you thought the data might take on (for categorical variables).
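The components listed above can be sketched in base R as follows. This is a minimal illustration, assuming the built-in airquality dataset as a stand-in for your project data; the choice of variables and the object names (dat, n_missing, num_summary, cat_summary) are hypothetical, and your own analysis may call for different summaries or plotting tools.

```r
# Hypothetical stand-in for your project data: the built-in airquality dataset.
dat <- airquality
dat$Month <- factor(dat$Month, labels = month.abb[5:9])  # treat Month as categorical

# 1. Number of observations
n_rows <- nrow(dat)

# 2a. Number of missing values for each variable of interest
vars <- c("Ozone", "Temp", "Month")
n_missing <- colSums(is.na(dat[vars]))

# 2b. Categorical variable: bar plot of the observed counts
month_counts <- table(dat$Month, useNA = "ifany")
barplot(month_counts, xlab = "Month", ylab = "Number of observations")

# 2c. Numerical variable: minimum, maximum, and a histogram
ozone_range <- range(dat$Ozone, na.rm = TRUE)
hist(dat$Ozone, main = "Distribution of ozone", xlab = "Ozone (ppb)")

# 3. Ingredients for Table 1
# Numerical variables: median and IQR are used here because Ozone is
# right-skewed in this example; mean and SD may suit symmetric variables better.
num_summary <- data.frame(
  variable = c("Ozone", "Temp"),
  median   = c(median(dat$Ozone, na.rm = TRUE), median(dat$Temp)),
  iqr      = c(IQR(dat$Ozone, na.rm = TRUE), IQR(dat$Temp))
)

# Categorical variables: number and percent of observations in each level
cat_summary <- data.frame(
  level   = names(month_counts),
  n       = as.vector(month_counts),
  percent = round(100 * as.vector(month_counts) / n_rows, 1)
)
```

Comparing n_missing and ozone_range against what you anticipated in your proposal is one quick way to carry out the sanity check described above.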

Bivariate Relationships

In the second section of your exploratory data analysis, you should summarize one or two important bivariate relationships. For your primary hypothesis or hypotheses only:

  1. Create an informative graphical summary of the bivariate relationship of interest: this may look like a scatterplot or side-by-side histograms.
  2. Fit a simple linear regression line to summarize this relationship and interpret each coefficient of that model in context.
  3. Produce a residual plot and comment on the reasonableness of using a linear regression line to model your relationship of interest.
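The steps above might look like the following in base R. This is only a sketch, assuming the built-in mtcars dataset with mpg as the response and wt as the explanatory variable; these variables and the object names (dat, fit, coefs) are placeholders for your own.

```r
# Hypothetical stand-in for your project data: the built-in mtcars dataset.
dat <- mtcars

# Scatterplot of the bivariate relationship
plot(mpg ~ wt, data = dat,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Simple linear regression of mpg on wt, with the fitted line added to the plot
fit <- lm(mpg ~ wt, data = dat)
abline(fit)

# Estimated intercept and slope, rounded for reporting
coefs <- round(coef(fit), 2)

# Residual plot: residuals against fitted values, with a reference line at zero
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

In this example the slope would be interpreted as the estimated change in average miles per gallon associated with a 1000 lb increase in weight, and the intercept as the estimated average miles per gallon for a car weighing 0 lbs (which is not meaningful here and is worth saying so). A curved or fanning pattern in the residual plot would suggest that a straight line is not a reasonable summary.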

Pressing Data Cleaning Issues

In this final section, please provide a bulleted list of the issues with your data that you identified while doing your exploratory data analysis and that you will still need to address. For example:

  • Do you need to create new transformed variables to better address your research question?
  • Did you find any anomalous observations (e.g., potential outliers)?
  • Do you need to clean up the coding of missing data (e.g., switching from 999 coding to NA coding) and/or determine an appropriate strategy for addressing missing data?
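To make these issues concrete, here is a small sketch of how each might be handled in base R. The survey data frame below is invented purely for illustration, and the 999 missing-data code, the log transformation, and the 1.5 × IQR outlier rule are assumptions; the right choices depend on your own data and codebook.

```r
# Invented example data in which 999 is used as a missing-data code.
survey <- data.frame(
  income = c(42000, 51000, 999, 38000, 250000, 999, 47000),
  age    = c(34, 45, 29, 999, 52, 41, 38)
)

# Recode 999 to NA, one named column at a time, so that a legitimate
# value of 999 in some other column would not be overwritten by accident.
for (v in c("income", "age")) {
  survey[[v]][survey[[v]] %in% 999] <- NA
}

# Create a transformed variable (here, log income) if it better suits
# the research question.
survey$log_income <- log(survey$income)

# Flag potential outliers using the common 1.5 * IQR rule of thumb.
q <- unname(quantile(survey$income, c(0.25, 0.75), na.rm = TRUE))
fence <- 1.5 * (q[2] - q[1])
survey$income_outlier <- survey$income < q[1] - fence |
  survey$income > q[2] + fence
```

Flagging an observation is only a prompt to look at it more closely; whether to keep, correct, or exclude it is a separate decision that should be justified in your report.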

Formatting Guidelines

When you submit your exploratory data analysis document, please adhere to the following best formatting practices:

  • Please use section headers to create three clear, labeled subheadings: one for Descriptive Statistics and Univariate Summaries, one for Bivariate Relationships, and one for Pressing Data Cleaning Issues. See SDS 100: Lab 8 for detail on formatting in Quarto.
  • Do not include extraneous R code in your final PDF file. Showing the code that you ran to generate your histogram or find your summary statistics is fine (strongly encouraged!), but please avoid excessively long output (e.g., printing the entire dataset to screen), unrelated R code (e.g., summary statistics for a variable you do not include in your Table 1), or error messages.
  • Please round all of your summary statistics to a reasonable number of decimal places (at most three decimal places; one or two decimal places strongly encouraged).
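As a small illustration of the last two guidelines, the sketch below rounds summary statistics before reporting them and previews only the first few rows of a dataset rather than printing all of it. It assumes the built-in mtcars dataset, and the object names are hypothetical.

```r
# Hypothetical example variable: miles per gallon from the built-in mtcars data.
x <- mtcars$mpg

# Compute summary statistics, then round to two decimal places for reporting.
stats <- c(mean = mean(x), sd = sd(x))
stats_rounded <- round(stats, 2)

# Preview a handful of rows instead of printing the whole dataset.
preview <- head(mtcars, 5)
```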

Submission

Please have one member of your group turn in a PDF to Moodle by the due date.