This document is intended to describe how to access data from a MySQL database using R. It utilizes a database of genetic information that has made available for analysis.
The website at https://www.ensembl.org states:
Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotate genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.
The website provides a list of what one can do with Ensembl:
A relevant paper on the topic of data management and databases in R can be found at http://chance.amstat.org/2015/04/setting-the-stage.
First we demonstrate how to access data using SQL (structured query language) commands and the dbGetQuery()
function. We begin by setting up a connection to the database.
library(tidyverse)
library(RMySQL)
con <- dbConnect(MySQL(), host = "ensembldb.ensembl.org",
user = "anonymous", password = "",
port = 3306)
Next a series of SQL queries can be sent to the database. These return R dataframes. Let’s start by seeing what databases are available.
ds <- dbGetQuery(con, "SHOW DATABASES")
dim(ds)
## [1] 8282 1
grep("scrofa", ds$Database, value = TRUE)
## [1] "sus_scrofa_core_56_9" "sus_scrofa_core_57_9a"
## [3] "sus_scrofa_core_58_9b" "sus_scrofa_core_59_9c"
## [5] "sus_scrofa_core_60_9d" "sus_scrofa_core_61_9e"
## [7] "sus_scrofa_core_62_9f" "sus_scrofa_core_63_9"
## [9] "sus_scrofa_core_64_9" "sus_scrofa_core_65_9"
## [11] "sus_scrofa_core_66_9" "sus_scrofa_core_67_102"
## [13] "sus_scrofa_core_68_102" "sus_scrofa_core_69_102"
## [15] "sus_scrofa_core_70_102" "sus_scrofa_core_71_102"
## [17] "sus_scrofa_core_72_102" "sus_scrofa_core_73_102"
## [19] "sus_scrofa_core_74_102" "sus_scrofa_core_75_102"
## [21] "sus_scrofa_core_76_102" "sus_scrofa_core_77_102"
## [23] "sus_scrofa_core_78_102" "sus_scrofa_core_79_102"
## [25] "sus_scrofa_core_80_102" "sus_scrofa_core_81_102"
## [27] "sus_scrofa_core_82_102" "sus_scrofa_core_83_102"
## [29] "sus_scrofa_core_84_102" "sus_scrofa_core_85_102"
## [31] "sus_scrofa_core_86_102" "sus_scrofa_core_87_102"
## [33] "sus_scrofa_core_88_102" "sus_scrofa_core_89_102"
## [35] "sus_scrofa_core_90_111" "sus_scrofa_core_91_111"
## [37] "sus_scrofa_core_92_111" "sus_scrofa_core_93_111"
## [39] "sus_scrofa_core_94_111" "sus_scrofa_funcgen_57_9a"
## [41] "sus_scrofa_funcgen_58_9b" "sus_scrofa_funcgen_59_9c"
## [43] "sus_scrofa_funcgen_60_9d" "sus_scrofa_funcgen_61_9e"
## [45] "sus_scrofa_funcgen_62_9f" "sus_scrofa_funcgen_63_9"
## [47] "sus_scrofa_funcgen_64_9" "sus_scrofa_funcgen_65_9"
## [49] "sus_scrofa_funcgen_66_9" "sus_scrofa_funcgen_67_102"
## [51] "sus_scrofa_funcgen_68_102" "sus_scrofa_funcgen_69_102"
## [53] "sus_scrofa_funcgen_70_102" "sus_scrofa_funcgen_71_102"
## [55] "sus_scrofa_funcgen_72_102" "sus_scrofa_funcgen_73_102"
## [57] "sus_scrofa_funcgen_74_102" "sus_scrofa_funcgen_75_102"
## [59] "sus_scrofa_funcgen_76_102" "sus_scrofa_funcgen_77_102"
## [61] "sus_scrofa_funcgen_78_102" "sus_scrofa_funcgen_79_102"
## [63] "sus_scrofa_funcgen_80_102" "sus_scrofa_funcgen_81_102"
## [65] "sus_scrofa_funcgen_82_102" "sus_scrofa_funcgen_83_102"
## [67] "sus_scrofa_funcgen_84_102" "sus_scrofa_funcgen_85_102"
## [69] "sus_scrofa_funcgen_86_102" "sus_scrofa_funcgen_87_102"
## [71] "sus_scrofa_funcgen_88_102" "sus_scrofa_funcgen_89_102"
## [73] "sus_scrofa_funcgen_90_111" "sus_scrofa_funcgen_91_111"
## [75] "sus_scrofa_funcgen_92_111" "sus_scrofa_funcgen_93_111"
## [77] "sus_scrofa_funcgen_94_111" "sus_scrofa_otherfeatures_56_9"
## [79] "sus_scrofa_otherfeatures_57_9a" "sus_scrofa_otherfeatures_58_9b"
## [81] "sus_scrofa_otherfeatures_59_9c" "sus_scrofa_otherfeatures_60_9d"
## [83] "sus_scrofa_otherfeatures_61_9e" "sus_scrofa_otherfeatures_62_9f"
## [85] "sus_scrofa_otherfeatures_63_9" "sus_scrofa_otherfeatures_64_9"
## [87] "sus_scrofa_otherfeatures_65_9" "sus_scrofa_otherfeatures_66_9"
## [89] "sus_scrofa_otherfeatures_67_102" "sus_scrofa_otherfeatures_68_102"
## [91] "sus_scrofa_otherfeatures_69_102" "sus_scrofa_otherfeatures_70_102"
## [93] "sus_scrofa_otherfeatures_71_102" "sus_scrofa_otherfeatures_72_102"
## [95] "sus_scrofa_otherfeatures_73_102" "sus_scrofa_otherfeatures_74_102"
## [97] "sus_scrofa_otherfeatures_75_102" "sus_scrofa_otherfeatures_76_102"
## [99] "sus_scrofa_otherfeatures_77_102" "sus_scrofa_otherfeatures_78_102"
## [101] "sus_scrofa_otherfeatures_79_102" "sus_scrofa_otherfeatures_80_102"
## [103] "sus_scrofa_otherfeatures_81_102" "sus_scrofa_otherfeatures_82_102"
## [105] "sus_scrofa_otherfeatures_83_102" "sus_scrofa_otherfeatures_84_102"
## [107] "sus_scrofa_otherfeatures_85_102" "sus_scrofa_otherfeatures_86_102"
## [109] "sus_scrofa_otherfeatures_87_102" "sus_scrofa_otherfeatures_88_102"
## [111] "sus_scrofa_otherfeatures_89_102" "sus_scrofa_otherfeatures_90_111"
## [113] "sus_scrofa_otherfeatures_91_111" "sus_scrofa_otherfeatures_92_111"
## [115] "sus_scrofa_otherfeatures_93_111" "sus_scrofa_otherfeatures_94_111"
## [117] "sus_scrofa_rnaseq_67_102" "sus_scrofa_rnaseq_68_102"
## [119] "sus_scrofa_rnaseq_69_102" "sus_scrofa_rnaseq_70_102"
## [121] "sus_scrofa_rnaseq_71_102" "sus_scrofa_rnaseq_72_102"
## [123] "sus_scrofa_rnaseq_73_102" "sus_scrofa_rnaseq_74_102"
## [125] "sus_scrofa_rnaseq_75_102" "sus_scrofa_rnaseq_76_102"
## [127] "sus_scrofa_rnaseq_77_102" "sus_scrofa_rnaseq_78_102"
## [129] "sus_scrofa_rnaseq_79_102" "sus_scrofa_rnaseq_80_102"
## [131] "sus_scrofa_rnaseq_81_102" "sus_scrofa_rnaseq_82_102"
## [133] "sus_scrofa_rnaseq_83_102" "sus_scrofa_rnaseq_84_102"
## [135] "sus_scrofa_rnaseq_85_102" "sus_scrofa_rnaseq_86_102"
## [137] "sus_scrofa_rnaseq_87_102" "sus_scrofa_rnaseq_88_102"
## [139] "sus_scrofa_rnaseq_89_102" "sus_scrofa_rnaseq_90_111"
## [141] "sus_scrofa_rnaseq_91_111" "sus_scrofa_rnaseq_92_111"
## [143] "sus_scrofa_rnaseq_93_111" "sus_scrofa_rnaseq_94_111"
## [145] "sus_scrofa_variation_57_9a" "sus_scrofa_variation_58_9b"
## [147] "sus_scrofa_variation_59_9c" "sus_scrofa_variation_60_9d"
## [149] "sus_scrofa_variation_61_9e" "sus_scrofa_variation_62_9f"
## [151] "sus_scrofa_variation_63_9" "sus_scrofa_variation_64_9"
## [153] "sus_scrofa_variation_65_9" "sus_scrofa_variation_66_9"
## [155] "sus_scrofa_variation_67_102" "sus_scrofa_variation_68_102"
## [157] "sus_scrofa_variation_69_102" "sus_scrofa_variation_70_102"
## [159] "sus_scrofa_variation_71_102" "sus_scrofa_variation_72_102"
## [161] "sus_scrofa_variation_73_102" "sus_scrofa_variation_74_102"
## [163] "sus_scrofa_variation_75_102" "sus_scrofa_variation_76_102"
## [165] "sus_scrofa_variation_77_102" "sus_scrofa_variation_78_102"
## [167] "sus_scrofa_variation_79_102" "sus_scrofa_variation_80_102"
## [169] "sus_scrofa_variation_81_102" "sus_scrofa_variation_82_102"
## [171] "sus_scrofa_variation_83_102" "sus_scrofa_variation_84_102"
## [173] "sus_scrofa_variation_85_102" "sus_scrofa_variation_86_102"
## [175] "sus_scrofa_variation_87_102" "sus_scrofa_variation_88_102"
## [177] "sus_scrofa_variation_89_102" "sus_scrofa_variation_90_111"
## [179] "sus_scrofa_variation_91_111" "sus_scrofa_variation_92_111"
## [181] "sus_scrofa_variation_93_111" "sus_scrofa_variation_94_111"
## [183] "sus_scrofa_vega_69_102" "sus_scrofa_vega_70_102"
## [185] "sus_scrofa_vega_71_102" "sus_scrofa_vega_72_102"
## [187] "sus_scrofa_vega_73_102" "sus_scrofa_vega_74_102"
## [189] "sus_scrofa_vega_75_102" "sus_scrofa_vega_76_102"
## [191] "sus_scrofa_vega_77_102" "sus_scrofa_vega_78_102"
## [193] "sus_scrofa_vega_79_102" "sus_scrofa_vega_80_102"
## [195] "sus_scrofa_vega_81_102" "sus_scrofa_vega_82_102"
## [197] "sus_scrofa_vega_83_102" "sus_scrofa_vega_84_102"
## [199] "sus_scrofa_vega_85_102" "sus_scrofa_vega_86_102"
## [201] "sus_scrofa_vega_87_102" "sus_scrofa_vega_88_102"
Let’s focus on the sus_scrofa_variation_79_102
database.
dbGetQuery(con, "USE sus_scrofa_variation_79_102;")
## data frame with 0 columns and 0 rows
dbGetQuery(con, "SHOW TABLES")
## Tables_in_sus_scrofa_variation_79_102
## 1 allele
## 2 allele_code
## 3 associate_study
## 4 attrib
## 5 attrib_set
## 6 attrib_type
## 7 compressed_genotype_region
## 8 compressed_genotype_var
## 9 coord_system
## 10 display_group
## 11 failed_allele
## 12 failed_description
## 13 failed_structural_variation
## 14 failed_variation
## 15 genotype_code
## 16 individual
## 17 individual_genotype_multiple_bp
## 18 individual_population
## 19 individual_synonym
## 20 individual_type
## 21 meta
## 22 meta_coord
## 23 motif_feature_variation
## 24 phenotype
## 25 phenotype_feature
## 26 phenotype_feature_attrib
## 27 population
## 28 population_genotype
## 29 population_structure
## 30 population_synonym
## 31 protein_function_predictions
## 32 protein_function_predictions_attrib
## 33 publication
## 34 read_coverage
## 35 regulatory_feature_variation
## 36 seq_region
## 37 source
## 38 strain_gtype_poly
## 39 structural_variation
## 40 structural_variation_association
## 41 structural_variation_feature
## 42 structural_variation_sample
## 43 study
## 44 submitter_handle
## 45 subsnp_handle
## 46 subsnp_map
## 47 tagged_variation_feature
## 48 transcript_variation
## 49 translation_md5
## 50 variation
## 51 variation_attrib
## 52 variation_citation
## 53 variation_feature
## 54 variation_genename
## 55 variation_hgvs
## 56 variation_set
## 57 variation_set_structural_variation
## 58 variation_set_structure
## 59 variation_set_variation
## 60 variation_synonym
dbGetQuery(con, "SELECT * FROM publication")
## publication_id
## 1 1
## 2 2
## 3 3
## 4 4
## title
## 1 A combination of two variants in PRKAG3 is needed for a positive effect on meat quality in pigs.
## 2 RNA deep sequencing reveals novel candidate genes and polymorphisms in boar testis and liver tissues with divergent androstenone levels.
## 3 TLR4 single nucleotide polymorphisms (SNPs) associated with Salmonella shedding in pigs.
## 4 Population history and genomic signatures for high-altitude adaptation in Tibetan pigs.
## authors
## 1 Uimari P, Sironen A.
## 2 Gunawan A, Sahadevan S, Neuhoff C, Große-Brinkhaus C, Gad A, Frieden L, Tesfaye D, Tholen E, Looft C, Uddin MJ, Schellander K, Cinar MU.
## 3 Kich JD, Uthe JJ, Benavides MV, Cantão ME, Zanella R, Tuggle CK, Bearson SM.
## 4 Ai H, Yang B, Li J, Xie X, Chen H, Ren J.
## pmid pmcid year doi ucsc_id
## 1 24580963 PMC3943410 2014 10.1186/1471-2156-15-29 <NA>
## 2 23696805 PMC3655983 2013 10.1371/journal.pone.0063259 <NA>
## 3 24566961 PMC3990860 2014 10.1007/s13353-014-0199-8 <NA>
## 4 25270331 PMC4197311 2014 10.1186/1471-2164-15-834 <NA>
dbGetQuery(con, "SELECT * FROM source")
## source_id name version
## 1 1 dbSNP 140
## 2 2 Archive dbSNP 140
## 3 3 Pig SNP Consortium 0
## 4 4 DGVa 201405
## 5 5 Animal_QTLdb 20140825
## description
## 1 Variants (including SNPs and indels) imported from dbSNP
## 2 Former variants names imported from dbSNP
## 3 PorcineSNP60 BeadChip
## 4 Database of Genomic Variants Archive
## 5 The Animal Quantitative Trait Loci (QTL) database (Animal QTLdb) is designed to house all publicly available QTL and association data on livestock animal species
## url type somatic_status
## 1 http://www.ncbi.nlm.nih.gov/projects/SNP/ <NA> mixed
## 2 http://www.ncbi.nlm.nih.gov/projects/SNP/ <NA> mixed
## 3 <NA> <NA> germline
## 4 http://www.ebi.ac.uk/dgva/ <NA> germline
## 5 http://www.animalgenome.org/cgi-bin/QTLdb/index <NA> germline
## data_types
## 1 variation
## 2 variation_synonym
## 3 variation_synonym
## 4 structural_variation,study
## 5 phenotype_feature,study
ds <- dbGetQuery(con, "SELECT * from individual LIMIT 1000")
glimpse(ds)
## Observations: 96
## Variables: 10
## $ individual_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ name <chr> "B0177", "D0067", "Y0081", "D0059", "B013...
## $ description <chr> "IND:NIAS-AGP|AGP-PanelM84|B0177|9823|M|O...
## $ gender <chr> "Unknown", "Unknown", "Unknown", "Unknown...
## $ father_individual_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ mother_individual_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ individual_type_id <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ display <chr> "UNDISPLAYABLE", "UNDISPLAYABLE", "UNDISP...
## $ has_coverage <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ variation_set_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
ds %>%
group_by(description) %>%
summarize(N = n()) %>%
arrange(desc(N))
## # A tibble: 85 x 2
## description N
## <chr> <int>
## 1 Subject from the DGVa study nstd24 12
## 2 IND:NIAS-AGP|AGP-PanelM84|B0075|9823|M|O|Western, SOURCE:submitt… 1
## 3 IND:NIAS-AGP|AGP-PanelM84|B0076|9823|M|O|Western, SOURCE:submitt… 1
## 4 IND:NIAS-AGP|AGP-PanelM84|B0077|9823|M|O|Western, SOURCE:submitt… 1
## 5 IND:NIAS-AGP|AGP-PanelM84|B0079|9823|M|O|Western, SOURCE:submitt… 1
## 6 IND:NIAS-AGP|AGP-PanelM84|B0126|9823|M|O|Western, SOURCE:submitt… 1
## 7 IND:NIAS-AGP|AGP-PanelM84|B0134|9823|M|O|Western, SOURCE:submitt… 1
## 8 IND:NIAS-AGP|AGP-PanelM84|B0138|9823|M|O|Western, SOURCE:submitt… 1
## 9 IND:NIAS-AGP|AGP-PanelM84|B0140|9823|M|O|Western, SOURCE:submitt… 1
## 10 IND:NIAS-AGP|AGP-PanelM84|B0157|9823|M|O|Western, SOURCE:submitt… 1
## # ... with 75 more rows
We can track down the publication which is associated with these data.
https://www.ncbi.nlm.nih.gov/pubmed/25662601
East Balkan Swine (EBS) Sus scrofa is the only aboriginal domesticated pig breed in Bulgaria and is distributed on the western coast of the Black Sea in Bulgaria. To reveal the breed’s genetic characteristics, we analysed mitochondrial DNA (mtDNA) and Y chromosomal DNA sequences of EBS in Bulgaria. Nucleotide diversity (πn ) of the mtDNA control region, including two newly found haplotypes, in 54 EBS was higher (0.014 ± 0.007) compared with that of European (0.005 ± 0.003) and Asian (0.006 ± 0.003) domestic pigs and wild boar. The median-joining network based on the mtDNA control region showed that the EBS and wild boar in Bulgaria comprised mainly two major mtDNA clades, European clade E1 (61.3%) and Asian clade A (38.7%). The coexistence of two mtDNA clades in EBS in Bulgaria may be the relict of historical pig translocation. Among the Bulgarian EBS colonies, the geographical differences in distribution of two mtDNA clades (E1 and A) could be attributed to the source pig populations and/or historical crossbreeding with imported pigs. In addition, analysis of the Y chromosomal DNA sequences for the EBS revealed that all of the EBS had haplotype HY1, which is dominant in European domestic pigs.
Alternatively, a connection can be made to the server by creating a dplyr
table objects for any give SQL table.
db <- src_mysql(dbname = "sus_scrofa_variation_79_102",
host = "ensembldb.ensembl.org",
user = "anonymous",
port = 3306,
password = "")
Allele <- tbl(db, "allele")
Allele %>%
head()
## # Source: lazy query [?? x 8]
## # Database: mysql 5.6.33
## # [anonymous@ensembldb.ensembl.org:/sus_scrofa_variation_79_102]
## allele_id variation_id subsnp_id allele_code_id population_id frequency
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1169 16338075 3 15 NA
## 2 2 1169 16338075 2 15 NA
## 3 3 1169 722145465 3 NA NA
## 4 4 1169 722145465 2 NA NA
## 5 5 1588 23132408 3 15 NA
## 6 6 1588 23132408 4 15 NA
## # ... with 2 more variables: count <dbl>, frequency_submitter_handle <int>
Allele %>%
group_by(allele_code_id) %>%
summarize(meanfreq = mean(frequency), count = n())
## # Source: lazy query [?? x 3]
## # Database: mysql 5.6.33
## # [anonymous@ensembldb.ensembl.org:/sus_scrofa_variation_79_102]
## allele_code_id meanfreq count
## <dbl> <dbl> <dbl>
## 1 1 0.382 14503376
## 2 2 0.386 14500218
## 3 3 0.652 14686160
## 4 4 0.593 14690458
## 5 5 0.964 3276
## 6 6 NA 5
## 7 7 NA 35
## 8 8 NA 1
## 9 9 NA 1
## 10 10 NA 27
## # ... with more rows
Want to explore more? See the tutorials at https://www.ensembl.org/info/website/tutorials/index.html