Introduction

This document is intended to describe how to access data from a MySQL database using R. It utilizes a database of genetic information that has made available for analysis.

Ensembl

The website at https://www.ensembl.org states:

Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotate genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.

The website provides a list of what one can do with Ensembl:

  • View genes along with other annotation along the chromosome
  • View alternative transcripts (including splice variants) for a gene
  • Explore homologues and phylogenetic trees across more than 40 species for any gene
  • Compare whole genome alignments and conserved regions across species
  • View microarray sequences that match to Ensembl genes
  • View ESTs, clones, mRNA and proteins for any chromosomal region
  • Examine single nucleotide polymorphisms (SNPs) for a gene or chromosomal region
  • View SNPs across strains (rat, mouse), populations (human), or even breeds (dog)
  • View positions and sequence of mRNA and protein that align with an Ensembl gene
  • Upload your own data
  • Use BLAST, or BLAT, a similar sequence alignment search tool, against any Ensembl genome
  • Export sequence, or create a table of gene information with BioMart

Background on databases in R

A relevant paper on the topic of data management and databases in R can be found at http://chance.amstat.org/2015/04/setting-the-stage.

Accessing data from a database using SQL commands

First we demonstrate how to access data using SQL (structured query language) commands and the dbGetQuery() function. We begin by setting up a connection to the database.

library(tidyverse)
library(RMySQL)
con <- dbConnect(MySQL(), host = "ensembldb.ensembl.org",
                user = "anonymous", password = "", 
                port = 3306)

Next a series of SQL queries can be sent to the database. These return R dataframes. Let’s start by seeing what databases are available.

ds <- dbGetQuery(con, "SHOW DATABASES")
dim(ds)
## [1] 8282    1
grep("scrofa", ds$Database, value = TRUE)
##   [1] "sus_scrofa_core_56_9"            "sus_scrofa_core_57_9a"          
##   [3] "sus_scrofa_core_58_9b"           "sus_scrofa_core_59_9c"          
##   [5] "sus_scrofa_core_60_9d"           "sus_scrofa_core_61_9e"          
##   [7] "sus_scrofa_core_62_9f"           "sus_scrofa_core_63_9"           
##   [9] "sus_scrofa_core_64_9"            "sus_scrofa_core_65_9"           
##  [11] "sus_scrofa_core_66_9"            "sus_scrofa_core_67_102"         
##  [13] "sus_scrofa_core_68_102"          "sus_scrofa_core_69_102"         
##  [15] "sus_scrofa_core_70_102"          "sus_scrofa_core_71_102"         
##  [17] "sus_scrofa_core_72_102"          "sus_scrofa_core_73_102"         
##  [19] "sus_scrofa_core_74_102"          "sus_scrofa_core_75_102"         
##  [21] "sus_scrofa_core_76_102"          "sus_scrofa_core_77_102"         
##  [23] "sus_scrofa_core_78_102"          "sus_scrofa_core_79_102"         
##  [25] "sus_scrofa_core_80_102"          "sus_scrofa_core_81_102"         
##  [27] "sus_scrofa_core_82_102"          "sus_scrofa_core_83_102"         
##  [29] "sus_scrofa_core_84_102"          "sus_scrofa_core_85_102"         
##  [31] "sus_scrofa_core_86_102"          "sus_scrofa_core_87_102"         
##  [33] "sus_scrofa_core_88_102"          "sus_scrofa_core_89_102"         
##  [35] "sus_scrofa_core_90_111"          "sus_scrofa_core_91_111"         
##  [37] "sus_scrofa_core_92_111"          "sus_scrofa_core_93_111"         
##  [39] "sus_scrofa_core_94_111"          "sus_scrofa_funcgen_57_9a"       
##  [41] "sus_scrofa_funcgen_58_9b"        "sus_scrofa_funcgen_59_9c"       
##  [43] "sus_scrofa_funcgen_60_9d"        "sus_scrofa_funcgen_61_9e"       
##  [45] "sus_scrofa_funcgen_62_9f"        "sus_scrofa_funcgen_63_9"        
##  [47] "sus_scrofa_funcgen_64_9"         "sus_scrofa_funcgen_65_9"        
##  [49] "sus_scrofa_funcgen_66_9"         "sus_scrofa_funcgen_67_102"      
##  [51] "sus_scrofa_funcgen_68_102"       "sus_scrofa_funcgen_69_102"      
##  [53] "sus_scrofa_funcgen_70_102"       "sus_scrofa_funcgen_71_102"      
##  [55] "sus_scrofa_funcgen_72_102"       "sus_scrofa_funcgen_73_102"      
##  [57] "sus_scrofa_funcgen_74_102"       "sus_scrofa_funcgen_75_102"      
##  [59] "sus_scrofa_funcgen_76_102"       "sus_scrofa_funcgen_77_102"      
##  [61] "sus_scrofa_funcgen_78_102"       "sus_scrofa_funcgen_79_102"      
##  [63] "sus_scrofa_funcgen_80_102"       "sus_scrofa_funcgen_81_102"      
##  [65] "sus_scrofa_funcgen_82_102"       "sus_scrofa_funcgen_83_102"      
##  [67] "sus_scrofa_funcgen_84_102"       "sus_scrofa_funcgen_85_102"      
##  [69] "sus_scrofa_funcgen_86_102"       "sus_scrofa_funcgen_87_102"      
##  [71] "sus_scrofa_funcgen_88_102"       "sus_scrofa_funcgen_89_102"      
##  [73] "sus_scrofa_funcgen_90_111"       "sus_scrofa_funcgen_91_111"      
##  [75] "sus_scrofa_funcgen_92_111"       "sus_scrofa_funcgen_93_111"      
##  [77] "sus_scrofa_funcgen_94_111"       "sus_scrofa_otherfeatures_56_9"  
##  [79] "sus_scrofa_otherfeatures_57_9a"  "sus_scrofa_otherfeatures_58_9b" 
##  [81] "sus_scrofa_otherfeatures_59_9c"  "sus_scrofa_otherfeatures_60_9d" 
##  [83] "sus_scrofa_otherfeatures_61_9e"  "sus_scrofa_otherfeatures_62_9f" 
##  [85] "sus_scrofa_otherfeatures_63_9"   "sus_scrofa_otherfeatures_64_9"  
##  [87] "sus_scrofa_otherfeatures_65_9"   "sus_scrofa_otherfeatures_66_9"  
##  [89] "sus_scrofa_otherfeatures_67_102" "sus_scrofa_otherfeatures_68_102"
##  [91] "sus_scrofa_otherfeatures_69_102" "sus_scrofa_otherfeatures_70_102"
##  [93] "sus_scrofa_otherfeatures_71_102" "sus_scrofa_otherfeatures_72_102"
##  [95] "sus_scrofa_otherfeatures_73_102" "sus_scrofa_otherfeatures_74_102"
##  [97] "sus_scrofa_otherfeatures_75_102" "sus_scrofa_otherfeatures_76_102"
##  [99] "sus_scrofa_otherfeatures_77_102" "sus_scrofa_otherfeatures_78_102"
## [101] "sus_scrofa_otherfeatures_79_102" "sus_scrofa_otherfeatures_80_102"
## [103] "sus_scrofa_otherfeatures_81_102" "sus_scrofa_otherfeatures_82_102"
## [105] "sus_scrofa_otherfeatures_83_102" "sus_scrofa_otherfeatures_84_102"
## [107] "sus_scrofa_otherfeatures_85_102" "sus_scrofa_otherfeatures_86_102"
## [109] "sus_scrofa_otherfeatures_87_102" "sus_scrofa_otherfeatures_88_102"
## [111] "sus_scrofa_otherfeatures_89_102" "sus_scrofa_otherfeatures_90_111"
## [113] "sus_scrofa_otherfeatures_91_111" "sus_scrofa_otherfeatures_92_111"
## [115] "sus_scrofa_otherfeatures_93_111" "sus_scrofa_otherfeatures_94_111"
## [117] "sus_scrofa_rnaseq_67_102"        "sus_scrofa_rnaseq_68_102"       
## [119] "sus_scrofa_rnaseq_69_102"        "sus_scrofa_rnaseq_70_102"       
## [121] "sus_scrofa_rnaseq_71_102"        "sus_scrofa_rnaseq_72_102"       
## [123] "sus_scrofa_rnaseq_73_102"        "sus_scrofa_rnaseq_74_102"       
## [125] "sus_scrofa_rnaseq_75_102"        "sus_scrofa_rnaseq_76_102"       
## [127] "sus_scrofa_rnaseq_77_102"        "sus_scrofa_rnaseq_78_102"       
## [129] "sus_scrofa_rnaseq_79_102"        "sus_scrofa_rnaseq_80_102"       
## [131] "sus_scrofa_rnaseq_81_102"        "sus_scrofa_rnaseq_82_102"       
## [133] "sus_scrofa_rnaseq_83_102"        "sus_scrofa_rnaseq_84_102"       
## [135] "sus_scrofa_rnaseq_85_102"        "sus_scrofa_rnaseq_86_102"       
## [137] "sus_scrofa_rnaseq_87_102"        "sus_scrofa_rnaseq_88_102"       
## [139] "sus_scrofa_rnaseq_89_102"        "sus_scrofa_rnaseq_90_111"       
## [141] "sus_scrofa_rnaseq_91_111"        "sus_scrofa_rnaseq_92_111"       
## [143] "sus_scrofa_rnaseq_93_111"        "sus_scrofa_rnaseq_94_111"       
## [145] "sus_scrofa_variation_57_9a"      "sus_scrofa_variation_58_9b"     
## [147] "sus_scrofa_variation_59_9c"      "sus_scrofa_variation_60_9d"     
## [149] "sus_scrofa_variation_61_9e"      "sus_scrofa_variation_62_9f"     
## [151] "sus_scrofa_variation_63_9"       "sus_scrofa_variation_64_9"      
## [153] "sus_scrofa_variation_65_9"       "sus_scrofa_variation_66_9"      
## [155] "sus_scrofa_variation_67_102"     "sus_scrofa_variation_68_102"    
## [157] "sus_scrofa_variation_69_102"     "sus_scrofa_variation_70_102"    
## [159] "sus_scrofa_variation_71_102"     "sus_scrofa_variation_72_102"    
## [161] "sus_scrofa_variation_73_102"     "sus_scrofa_variation_74_102"    
## [163] "sus_scrofa_variation_75_102"     "sus_scrofa_variation_76_102"    
## [165] "sus_scrofa_variation_77_102"     "sus_scrofa_variation_78_102"    
## [167] "sus_scrofa_variation_79_102"     "sus_scrofa_variation_80_102"    
## [169] "sus_scrofa_variation_81_102"     "sus_scrofa_variation_82_102"    
## [171] "sus_scrofa_variation_83_102"     "sus_scrofa_variation_84_102"    
## [173] "sus_scrofa_variation_85_102"     "sus_scrofa_variation_86_102"    
## [175] "sus_scrofa_variation_87_102"     "sus_scrofa_variation_88_102"    
## [177] "sus_scrofa_variation_89_102"     "sus_scrofa_variation_90_111"    
## [179] "sus_scrofa_variation_91_111"     "sus_scrofa_variation_92_111"    
## [181] "sus_scrofa_variation_93_111"     "sus_scrofa_variation_94_111"    
## [183] "sus_scrofa_vega_69_102"          "sus_scrofa_vega_70_102"         
## [185] "sus_scrofa_vega_71_102"          "sus_scrofa_vega_72_102"         
## [187] "sus_scrofa_vega_73_102"          "sus_scrofa_vega_74_102"         
## [189] "sus_scrofa_vega_75_102"          "sus_scrofa_vega_76_102"         
## [191] "sus_scrofa_vega_77_102"          "sus_scrofa_vega_78_102"         
## [193] "sus_scrofa_vega_79_102"          "sus_scrofa_vega_80_102"         
## [195] "sus_scrofa_vega_81_102"          "sus_scrofa_vega_82_102"         
## [197] "sus_scrofa_vega_83_102"          "sus_scrofa_vega_84_102"         
## [199] "sus_scrofa_vega_85_102"          "sus_scrofa_vega_86_102"         
## [201] "sus_scrofa_vega_87_102"          "sus_scrofa_vega_88_102"

Let’s focus on the sus_scrofa_variation_79_102 database.

dbGetQuery(con, "USE sus_scrofa_variation_79_102;")
## data frame with 0 columns and 0 rows
dbGetQuery(con, "SHOW TABLES")
##    Tables_in_sus_scrofa_variation_79_102
## 1                                 allele
## 2                            allele_code
## 3                        associate_study
## 4                                 attrib
## 5                             attrib_set
## 6                            attrib_type
## 7             compressed_genotype_region
## 8                compressed_genotype_var
## 9                           coord_system
## 10                         display_group
## 11                         failed_allele
## 12                    failed_description
## 13           failed_structural_variation
## 14                      failed_variation
## 15                         genotype_code
## 16                            individual
## 17       individual_genotype_multiple_bp
## 18                 individual_population
## 19                    individual_synonym
## 20                       individual_type
## 21                                  meta
## 22                            meta_coord
## 23               motif_feature_variation
## 24                             phenotype
## 25                     phenotype_feature
## 26              phenotype_feature_attrib
## 27                            population
## 28                   population_genotype
## 29                  population_structure
## 30                    population_synonym
## 31          protein_function_predictions
## 32   protein_function_predictions_attrib
## 33                           publication
## 34                         read_coverage
## 35          regulatory_feature_variation
## 36                            seq_region
## 37                                source
## 38                     strain_gtype_poly
## 39                  structural_variation
## 40      structural_variation_association
## 41          structural_variation_feature
## 42           structural_variation_sample
## 43                                 study
## 44                      submitter_handle
## 45                         subsnp_handle
## 46                            subsnp_map
## 47              tagged_variation_feature
## 48                  transcript_variation
## 49                       translation_md5
## 50                             variation
## 51                      variation_attrib
## 52                    variation_citation
## 53                     variation_feature
## 54                    variation_genename
## 55                        variation_hgvs
## 56                         variation_set
## 57    variation_set_structural_variation
## 58               variation_set_structure
## 59               variation_set_variation
## 60                     variation_synonym
dbGetQuery(con, "SELECT * FROM publication")
##   publication_id
## 1              1
## 2              2
## 3              3
## 4              4
##                                                                                                                                      title
## 1                                         A combination of two variants in PRKAG3 is needed for a positive effect on meat quality in pigs.
## 2 RNA deep sequencing reveals novel candidate genes and polymorphisms in boar testis and liver tissues with divergent androstenone levels.
## 3                                                 TLR4 single nucleotide polymorphisms (SNPs) associated with Salmonella shedding in pigs.
## 4                                                  Population history and genomic signatures for high-altitude adaptation in Tibetan pigs.
##                                                                                                                                    authors
## 1                                                                                                                     Uimari P, Sironen A.
## 2 Gunawan A, Sahadevan S, Neuhoff C, Große-Brinkhaus C, Gad A, Frieden L, Tesfaye D, Tholen E, Looft C, Uddin MJ, Schellander K, Cinar MU.
## 3                                                             Kich JD, Uthe JJ, Benavides MV, Cantão ME, Zanella R, Tuggle CK, Bearson SM.
## 4                                                                                                Ai H, Yang B, Li J, Xie X, Chen H, Ren J.
##       pmid      pmcid year                          doi ucsc_id
## 1 24580963 PMC3943410 2014      10.1186/1471-2156-15-29    <NA>
## 2 23696805 PMC3655983 2013 10.1371/journal.pone.0063259    <NA>
## 3 24566961 PMC3990860 2014    10.1007/s13353-014-0199-8    <NA>
## 4 25270331 PMC4197311 2014     10.1186/1471-2164-15-834    <NA>
dbGetQuery(con, "SELECT * FROM source")
##   source_id               name  version
## 1         1              dbSNP      140
## 2         2      Archive dbSNP      140
## 3         3 Pig SNP Consortium        0
## 4         4               DGVa   201405
## 5         5       Animal_QTLdb 20140825
##                                                                                                                                                         description
## 1                                                                                                          Variants (including SNPs and indels) imported from dbSNP
## 2                                                                                                                         Former variants names imported from dbSNP
## 3                                                                                                                                             PorcineSNP60 BeadChip
## 4                                                                                                                              Database of Genomic Variants Archive
## 5 The Animal Quantitative Trait Loci (QTL) database (Animal QTLdb) is designed to house all publicly available QTL and association data on livestock animal species
##                                               url type somatic_status
## 1       http://www.ncbi.nlm.nih.gov/projects/SNP/ <NA>          mixed
## 2       http://www.ncbi.nlm.nih.gov/projects/SNP/ <NA>          mixed
## 3                                            <NA> <NA>       germline
## 4                      http://www.ebi.ac.uk/dgva/ <NA>       germline
## 5 http://www.animalgenome.org/cgi-bin/QTLdb/index <NA>       germline
##                   data_types
## 1                  variation
## 2          variation_synonym
## 3          variation_synonym
## 4 structural_variation,study
## 5    phenotype_feature,study
ds <- dbGetQuery(con, "SELECT * from individual LIMIT 1000")
glimpse(ds)
## Observations: 96
## Variables: 10
## $ individual_id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ name                 <chr> "B0177", "D0067", "Y0081", "D0059", "B013...
## $ description          <chr> "IND:NIAS-AGP|AGP-PanelM84|B0177|9823|M|O...
## $ gender               <chr> "Unknown", "Unknown", "Unknown", "Unknown...
## $ father_individual_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ mother_individual_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ individual_type_id   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ display              <chr> "UNDISPLAYABLE", "UNDISPLAYABLE", "UNDISP...
## $ has_coverage         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ variation_set_id     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
ds %>%
  group_by(description) %>%
  summarize(N = n()) %>%
  arrange(desc(N))
## # A tibble: 85 x 2
##    description                                                           N
##    <chr>                                                             <int>
##  1 Subject from the DGVa study nstd24                                   12
##  2 IND:NIAS-AGP|AGP-PanelM84|B0075|9823|M|O|Western, SOURCE:submitt…     1
##  3 IND:NIAS-AGP|AGP-PanelM84|B0076|9823|M|O|Western, SOURCE:submitt…     1
##  4 IND:NIAS-AGP|AGP-PanelM84|B0077|9823|M|O|Western, SOURCE:submitt…     1
##  5 IND:NIAS-AGP|AGP-PanelM84|B0079|9823|M|O|Western, SOURCE:submitt…     1
##  6 IND:NIAS-AGP|AGP-PanelM84|B0126|9823|M|O|Western, SOURCE:submitt…     1
##  7 IND:NIAS-AGP|AGP-PanelM84|B0134|9823|M|O|Western, SOURCE:submitt…     1
##  8 IND:NIAS-AGP|AGP-PanelM84|B0138|9823|M|O|Western, SOURCE:submitt…     1
##  9 IND:NIAS-AGP|AGP-PanelM84|B0140|9823|M|O|Western, SOURCE:submitt…     1
## 10 IND:NIAS-AGP|AGP-PanelM84|B0157|9823|M|O|Western, SOURCE:submitt…     1
## # ... with 75 more rows

We can track down the publication which is associated with these data.

https://www.ncbi.nlm.nih.gov/pubmed/25662601

East Balkan Swine (EBS) Sus scrofa is the only aboriginal domesticated pig breed in Bulgaria and is distributed on the western coast of the Black Sea in Bulgaria. To reveal the breed’s genetic characteristics, we analysed mitochondrial DNA (mtDNA) and Y chromosomal DNA sequences of EBS in Bulgaria. Nucleotide diversity (πn ) of the mtDNA control region, including two newly found haplotypes, in 54 EBS was higher (0.014 ± 0.007) compared with that of European (0.005 ± 0.003) and Asian (0.006 ± 0.003) domestic pigs and wild boar. The median-joining network based on the mtDNA control region showed that the EBS and wild boar in Bulgaria comprised mainly two major mtDNA clades, European clade E1 (61.3%) and Asian clade A (38.7%). The coexistence of two mtDNA clades in EBS in Bulgaria may be the relict of historical pig translocation. Among the Bulgarian EBS colonies, the geographical differences in distribution of two mtDNA clades (E1 and A) could be attributed to the source pig populations and/or historical crossbreeding with imported pigs. In addition, analysis of the Y chromosomal DNA sequences for the EBS revealed that all of the EBS had haplotype HY1, which is dominant in European domestic pigs.

Accessing a database using dplyr commands

Alternatively, a connection can be made to the server by creating a dplyr table objects for any give SQL table.

db <- src_mysql(dbname = "sus_scrofa_variation_79_102", 
                host = "ensembldb.ensembl.org", 
                user = "anonymous",
                port = 3306,
                password = "")
Allele <- tbl(db, "allele")

Let’s explore this table

Allele %>% 
  head()
## # Source:   lazy query [?? x 8]
## # Database: mysql 5.6.33
## #   [anonymous@ensembldb.ensembl.org:/sus_scrofa_variation_79_102]
##   allele_id variation_id subsnp_id allele_code_id population_id frequency
##       <int>        <dbl>     <dbl>          <dbl>         <dbl>     <dbl>
## 1         1         1169  16338075              3            15        NA
## 2         2         1169  16338075              2            15        NA
## 3         3         1169 722145465              3            NA        NA
## 4         4         1169 722145465              2            NA        NA
## 5         5         1588  23132408              3            15        NA
## 6         6         1588  23132408              4            15        NA
## # ... with 2 more variables: count <dbl>, frequency_submitter_handle <int>
Allele %>%
  group_by(allele_code_id) %>%
  summarize(meanfreq = mean(frequency), count = n())
## # Source:   lazy query [?? x 3]
## # Database: mysql 5.6.33
## #   [anonymous@ensembldb.ensembl.org:/sus_scrofa_variation_79_102]
##    allele_code_id meanfreq    count
##             <dbl>    <dbl>    <dbl>
##  1              1    0.382 14503376
##  2              2    0.386 14500218
##  3              3    0.652 14686160
##  4              4    0.593 14690458
##  5              5    0.964     3276
##  6              6   NA            5
##  7              7   NA           35
##  8              8   NA            1
##  9              9   NA            1
## 10             10   NA           27
## # ... with more rows

Want to explore more? See the tutorials at https://www.ensembl.org/info/website/tutorials/index.html