Writing for data science

Please follow the following guidelines for all drafts of your final paper:

  • All papers will be written in R Markdown.
  • All papers should use the rticles::mdpi_article template.
  • Use BibTeX for citations. Use Google Scholar to gather reference information in BibTeX format (adjust manually where necessary). Include doi and/or url field in BibTeX entries. Insert citations into the body of the paper as appropriate in R Markdown. Cite R packages with information from citation(). You can start a .bib file with knitr::write_bib(). The `BibTeX database for this page looks like this.
  • Include link to GitHub repository (where necessary) as a citation (even if it’s private)!
  • Figures should be in a figure environment (in LaTeX). If the figure is generated as a plot in R, name the chunk and use the fig.cap chunk option. If the figure is an external image, use knitr::include_graphics(). All Figures must be referenced explicitly in the text by number. Figures should have expansive captions that tell the reader what they are supposed to see in the figure. All figures should have readable axis labels and explicit units. Figures should be mostly self-explanatory! Spend more time gussying up the figures – not less! Recall the SDS 192 standards for Aesthetics, Context, and Markdown.
  • Tables should be in a table environment (in LaTeX). If the table is generated from data in R, use the kable() function. If the table is input manually, just put it in a data.frame and use kable() anyway. Tables should also have expansive captions and must be referenced in the text by number. Simple summary tables are easy to generate and really helpful to readers!
  • Use Markdown formatting, including: sections and subsections, lists, links, block quotes, chunks, and code, where appropriate.
  • Put any pieces of code in backticks (`). That includes names of packages, functions, variable names, and even certain strings. For bits of actual code, use a chunk. Note that R Markdown supports chunks in many languages, including Python, SQL, and bash.
  • Use wordcountaddin::text_stats() to count your words.
wordcountaddin::text_stats()
  • Limit your use of adverbs.

Audience

Your final deliverable will be to write up your project as an academic paper. For your convenience, the format should conform to the Undergraduate Research Project Competition (USRESP) guidelines. If possible, you are strongly encouraged to submit your paper to the competition, but that decision will not affect your grade in this class. The paper should be 12–15 pages long, plus references, and an appendix (of arbitrary length) in which you can show more graphics, tables, etc.

  • Use your GitHub repository to write the paper. Put the paper contents into a subdirectory called paper.
  • Use GitHub issues to delegate certain tasks to certain people, and to keep track of your progress.
  • Use the rticles packages to write the paper in R Markdown.
  • Use the MDPI format.
  • Submit the 1st draft of your paper by 11:55 pm on Friday, Oct 15
  • Submit the 2nd draft of your paper by 11:55 pm on Friday, Nov 19
  • Submit your final paper by 11:55 pm on Wednesday, Dec 8
  • One submission per group.

The structure of a scientific paper

Most commonly, research papers will have five sections like:

  1. Introduction and Motivation: What is the problem you are trying to solve? Why is this problem interesting? What has been tried before? What have been the shortcoming of those approaches that necessitate your efforts? Often, this section will conclude with a subsection (or paragraph) outlining “our contributions.” What is the new knowledge that this paper contributes?
  2. Data: Where did it come from? What are some basic summary statistics, variable definitions, and/or visualizations that help the reader understand the data you are working with?
  3. Methods: What did you actually do? What techniques or methods did you employ? What were the specifications for any statistical models you used? What software or packages did you use or develop?
  4. Results: What did you learn about the problem you identified in Section 1? This is where you put the tables, figures, and analytics by-products of your work.
  5. Conclusion: What are the limitations of your work? What are some next steps that someone (either you or another research group) should consider in attempting to further your work? Remind us one last time about what you did.
  • References: Every reference (except for books) needs a URL. Use Google Scholar to help with BibTeX. Every book needs publisher location. 🙄

Optionally, if you have more stuff:

  • Appendix: Include supplementary graphs and tables in an appendix. The reader may or may not read the appendix.

Feedback from previous drafts

The following common feedback applies to all groups. Please read carefully – these are mandatory recommendations.

General feedback:

  • Use an academic writing style. Use the present tense (in almost every case), even when writing about previously published papers or work that you did two months ago. Use “we” to describe your efforts. Avoid the passive voice!
  • Use examples, screenshots, or peeks at data to illustrate what you are talking about.
  • Don’t be afraid to get technical – this is a scientific paper. However, if you are going to invoke technical terms or concepts, you have to either explain them or give a citation to a reference that does.
  • Focus on problems and solutions. What were the problems that your client had? How did you solve them? Don’t just narrate about what you did – connect the work that you did to the problems that existed before you started. If you find this hard to do, perhaps what you are writing about doesn’t need to be in the paper.
  • Revise, then revise again.

Ethical statement

Your final paper must include an ethical statement. The ethical statement may be relegated to the appendix at your discretion.

  • Drawing on what you have learned about data science ethics in this class, discuss any ethical considerations in your project. For some projects, this statement could be quite short (one paragraph may suffice). For other projects, more detail may be needed (no more than 2 pages).
  • Be expansive and creative in your thinking about possible ethical considerations. One way to do this assignment poorly would be to write a short statement asserting that there are no ethical considerations, only to have me think of several fairly obvious ones.

Drafts and rubrics

First draft

Your first draft focuses on the mechanics of writing a journal paper in R Markdown. First, you have to get the template to compile. Second, you can fill in easy details about yourselves. Third, write a complete introductory section. This section should be complete and include a compreshensible description of your problem, and why the previous approaches to solving it were insufficient. Fourth, write a complete section about your data. What data do you have? How many rows and columns? What format? What are some basic summary statistics? [This is a good place to include a table.] Fifth, set up BibTeX and cite at least one reference in your introduction.

Your submission should include:

  • Title
  • Authors and affiliations
  • Abstract
  • Introduction (Section 1)
  • Data (Section 2)
  • References (using BibTeX, with at least one citation)
Final paper first draft rubric
Criteria Four Five Six
Formatting Multiple formatting errors. References do not use BiBTeX. Formatting works, but does not help to structure the paper. Section titles do not clearly indicate what section is about. Figures/images are too big or too small. Section titles provide clear structure. Figures, tables, and images fit nicely into the narrative. References use BiBTeX and are complete.
Attention to detail Title does is not descriptive. References are mostly incomplete. Author information and/or affiliations are incomplete. References do not include URLs. Title is informative and catchy. Author information is complete. References include complete information, including URLs.
Introduction Introduction does not describe the problem. Introduction does not discuss previous approaches or cite previous work. Introduction describes the problem, but uses language that is either overly technical (unexplained jargon) or not technical enough (imprecise). Description of previous work is vague. Introduction clearly describes the problem using appropriate language. Introduction discusses previous work and includes citations. Introduction is written in present tense.
Data Description of data is vague, rife with jargon, or inaccurate. Key variables are not defined. Units are not included. Description of data is not comprehensive. Key variables are summarized but not well-defined. Units are implied or omitted. Tables and/or figures are unclear or barely readable. Description of data is clear and comprehensive. Basic summary statistics are included in neat, readable tables. Key variables are well-defined and summarized in tables or figures, with units made explicit.

Second draft

Second draft criteria include everything in the first draft, plus:

  • Methods (Section 3)
  • Results (Section 4) – can be preliminary
Final paper second draft rubric
Criteria Ten Eleven Twelve
General Corrections from first draft were fully implemented. Methods section is imprecise and/or incomplete. Results section is sparse. Methods section contains too much narrative about what was tried, rather than focusing on what is important for the reader to know. Results section is largely incomplete and fails to drive home message. Methods section explains in appropriate scientific language what is important for the reader to know and how it was done. Results section drives home the message of the paper using fully-explained examples, figures, tables, etc.

Final draft

Final draft criteria include everything in the first and second drafts, plus:

  • Results (Section 4) – finalized
  • Conclusion (Section 5)
    • Ethical statement (can be in appendix)
    • Limitations
    • Future work
    • Final thoughts
  • Acknowledgements
Final paper final draft rubric
Criteria Sixteen Eighteen Twenty
Formatting Section titles do not clearly indicate what section is about. Figures/images are too big or too small. References do not use BiBTeX. Section titles provide clear structure. Figures, tables, and images fit nicely into the narrative. References use BiBTeX and are complete.
Narrative Paper tells the story of what you did, including excessive details about data wrangling, and/or failed approaches that are not informative. Introduction explains key concepts and illustrates why research was necessary. Team’s contribution is clearly communicated. Outside research is extensive and appropriately sourced. Paper focuses on the information that is most important to readers.
Completeness Results section is still largely empty. Methods section is thin on details or largely incomplete. Figures, tables, and references are not working properly. Results section is substantially complete. Methods section is fully sketched out, if not complete. Figures, tables, and references are working, if not complete.