Papers
Writing for data science
Please follow the following guidelines for all drafts of your final paper:
- All papers will be written in Quarto as Manuscripts
- Use BibTeX for citations. Use Google Scholar to gather reference information in BibTeX format (adjust manually where necessary). Include a `doi` (preferred) and/or `url` field in all BibTeX entries. Insert citations into the body of the paper as appropriate. Cite R packages with information from `citation()`. You can start a `.bib` file with `knitr::write_bib()`. The BibTeX database for this page looks like this.
- Include a link to your GitHub repository (where necessary) as a citation (even if it’s private)!
- All Figures must be referenced explicitly in the text by number. Figures should have expansive captions that tell the reader what they are supposed to see in the figure. All figures should have readable axis labels and explicit units. Figures should be mostly self-explanatory! Spend more time gussying up the figures – not less!
- Tables should also have expansive captions and must be referenced in the text by number. Simple summary tables are easy to generate and really helpful to readers!
- Use Markdown formatting, including: sections and subsections, lists, links, block quotes, chunks, and code, where appropriate.
- Put any pieces of code in backticks (`). That includes names of packages, functions, variable names, and even certain strings. For bits of actual code, use a chunk. Note that Quarto supports chunks in many languages, including Python, SQL, and bash.
- Use `wordcountaddin::text_stats()` to count your words, if necessary.
- Limit your use of adverbs.
So much of revision is realizing how useless 90% of adverbs are.
— Clint Smith (@ClintSmithIII) August 4, 2020
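As a minimal sketch of the citation guideline above, a BibTeX entry carrying both a `doi` and a `url` field might look like the following. Every value here is a hypothetical placeholder (the citation key, authors, journal, DOI, and URL are all invented for illustration), so replace them with the information you gather from Google Scholar.

```bibtex
% Hypothetical placeholder entry: none of these values refer to a real paper.
@article{author2020placeholder,
  author  = {Author, Ann and Writer, Bo},
  title   = {A Placeholder Title for Illustration},
  journal = {Journal of Placeholder Studies},
  year    = {2020},
  volume  = {1},
  number  = {1},
  pages   = {1--10},
  doi     = {10.0000/placeholder},
  url     = {https://example.org/placeholder}
}
```

Assuming this entry lives in the `.bib` file that your manuscript loads through its `bibliography` setting, writing `[@author2020placeholder]` in the body of the paper inserts the citation.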
Audience
Your final deliverable will be to write up your project as an academic paper. For your convenience, the format should conform to the Undergraduate Statistics Research Project Competition (USRESP) guidelines. If possible, you are strongly encouraged to submit your paper to the competition, but that decision will not affect your grade in this class. The paper should be 12–15 pages long, plus references and an appendix (of arbitrary length) in which you can show more graphics, tables, etc.
- Use your GitHub repository to write the paper. Put the paper contents into a subdirectory called `quarto-manuscript`.
- Use GitHub issues to delegate certain tasks to certain people, and to keep track of your progress.
- Use the Manuscripts feature to write the paper in Quarto.
- One submission per group.
The structure of a scientific paper
Most commonly, research papers have five sections, followed by a list of references:
- Introduction and Motivation: What is the problem you are trying to solve? Why is this problem interesting? What has been tried before? What are the shortcomings of those approaches that necessitate your efforts? Often, this section will conclude with a subsection (or paragraph) outlining “our contributions.” What is the new knowledge that this paper contributes?
- Data: Where did it come from? What are some basic summary statistics, variable definitions, and/or visualizations that help the reader understand the data you are working with? Tables and figures are strongly encouraged here.
- Methods: What did you actually do? What techniques or methods did you employ? What were the specifications for any statistical models you used? What software or packages did you use or develop?
- Results: What did you learn about the problem you identified in Section 1? This is where you put the tables, figures, and analytics by-products of your work.
- Conclusion: What are the limitations of your work? What are some next steps that someone (either you or another research group) should consider in attempting to further your work? Remind us one last time about what you did.
- References: Every reference (except for books) needs a URL. Use Google Scholar to help with BibTeX. Every book needs a publisher location. 🙄 The format is `Publisher City: Publisher Name`.
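For the book requirement in the References item above, one possible way to record the publisher location is the standard BibTeX `address` field alongside `publisher`. This is only a sketch: all values below are hypothetical placeholders, and whether the rendered reference comes out as `Publisher City: Publisher Name` depends on the citation style your manuscript uses.

```bibtex
% Hypothetical placeholder entry: this is not a real book.
@book{author2019placeholder,
  author    = {Author, Ann},
  title     = {A Placeholder Book Title},
  publisher = {Placeholder Press},
  address   = {Placeholder City},
  year      = {2019}
}
```

Check the compiled reference list after adding a book, and adjust the entry manually if the city does not appear.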
Optionally, if you have more stuff:
- Appendix: Include supplementary graphs and tables in an appendix. The reader may or may not read the appendix.
Feedback from previous drafts
The following common feedback applies to all groups. Please read carefully – these are mandatory recommendations.
General feedback:
- Use an academic writing style. Use the present tense (in almost every case), even when writing about previously published papers or work that you did two months ago. Use “we” to describe your efforts. Avoid the passive voice!
- Use examples, screenshots, or peeks at data to illustrate what you are talking about.
- Don’t be afraid to get technical – this is a scientific paper. However, if you are going to invoke technical terms or concepts, you have to either explain them or give a citation to a reference that does.
- Focus on problems and solutions. What were the problems that your client had? How did you solve them? Don’t just narrate about what you did – connect the work that you did to the problems that existed before you started. If you find this hard to do, perhaps what you are writing about doesn’t need to be in the paper.
- Revise, then revise again.
Ethical statement
Your final paper must include an ethical statement. The ethical statement may be relegated to the appendix at your discretion.
- Drawing on what you have learned about data science ethics in this class, discuss any ethical considerations in your project. For some projects, this statement could be quite short (one paragraph may suffice). For other projects, more detail may be needed (no more than 2 pages).
- Be expansive and creative in your thinking about possible ethical considerations. One way to do this assignment poorly would be to write a short statement asserting that there are no ethical considerations, only to have me think of several fairly obvious ones.
Drafts and rubrics
First draft
Your first draft focuses on the mechanics of writing a manuscript in Quarto. First, you have to get the template to compile. Second, you can fill in easy details about yourselves. Third, write a complete introductory section. This section should include a comprehensible description of your problem and explain why the previous approaches to solving it were insufficient. Fourth, write a complete section about your data. What data do you have? How many rows and columns? What format? What are some basic summary statistics? [This is a good place to include a table.] Fifth, set up BibTeX and cite at least one reference in your introduction.
Your submission should include:
- Title
- Authors and affiliations
- Abstract
- Introduction (Section 1)
- Data (Section 2)
- References (using BibTeX, with at least one citation)
Criteria | Four | Five | Six |
---|---|---|---|
Formatting | Multiple formatting errors. References do not use BibTeX. | Formatting works, but does not help to structure the paper. Section titles do not clearly indicate what section is about. Figures/images are too big or too small. | Section titles provide clear structure. Figures, tables, and images fit nicely into the narrative. References use BibTeX and are complete. |
Attention to detail | Title is not descriptive. References are mostly incomplete. | Author information and/or affiliations are incomplete. References do not include URLs. | Title is informative and catchy. Author information is complete. References include complete information, including URLs. |
Introduction | Introduction does not describe the problem. Introduction does not discuss previous approaches or cite previous work. | Introduction describes the problem, but uses language that is either overly technical (unexplained jargon) or not technical enough (imprecise). Description of previous work is vague. | Introduction clearly describes the problem using appropriate language. Introduction discusses previous work and includes citations. Introduction is written in present tense. |
Data | Description of data is vague, rife with jargon, or inaccurate. Key variables are not defined. Units are not included. | Description of data is not comprehensive. Key variables are summarized but not well-defined. Units are implied or omitted. Tables and/or figures are unclear or barely readable. | Description of data is clear and comprehensive. Basic summary statistics are included in neat, readable tables. Key variables are well-defined and summarized in tables or figures, with units made explicit. |
Second draft
Second draft criteria include everything in the first draft, plus:
- Methods (Section 3)
- Results (Section 4) – can be preliminary
Incorporate my feedback from the first draft and revise accordingly!
Criteria | Ten | Eleven | Twelve |
---|---|---|---|
General | Corrections from first draft were not fully implemented. Methods section is imprecise and/or incomplete. Results section is sparse. | Methods section contains too much narrative about what was tried, rather than focusing on what is important for the reader to know. Results section is largely incomplete and fails to drive home message. | Methods section explains in appropriate scientific language what is important for the reader to know and how it was done. Results section drives home the message of the paper using fully-explained examples, figures, tables, etc. |
Final draft
Final draft criteria include everything in the first and second drafts, plus:
- Results (Section 4) – finalized
- Conclusion (Section 5)
- Ethical statement (can be in appendix)
- Limitations
- Future work
- Final thoughts
- Acknowledgements
Incorporate my feedback from the second draft and revise accordingly!
Criteria | Sixteen | Eighteen | Twenty |
---|---|---|---|
Formatting | Section titles do not clearly indicate what section is about. Figures/images are too big or too small. References do not use BibTeX. | Section titles provide clear structure. Figures, tables, and images fit nicely into the narrative. References use BibTeX and are complete. | |
Narrative | Paper tells the story of what you did, including excessive details about data wrangling, and/or failed approaches that are not informative. | Introduction explains key concepts and illustrates why research was necessary. Team’s contribution is clearly communicated. Outside research is extensive and appropriately sourced. Paper focuses on the information that is most important to readers. | |
Completeness | Results section is still largely empty. Methods section is thin on details or largely incomplete. Figures, tables, and references are not working properly. | Results section is substantially complete. Methods section is fully sketched out, if not complete. Figures, tables, and references are working, if not complete. | |