| UCSCTableQuery-class {rtracklayer} | R Documentation |
Querying UCSC Tables
Description
The UCSC genome browser is backed by a large database,
which is exposed by the Table Browser web interface. Tracks are
stored as tables, so this is also the mechanism for retrieving tracks. The
UCSCTableQuery class represents a query against the Table
Browser. Storing the query fields in a formal class facilitates
incremental construction and adjustment of a query.
Details
There are six supported fields for a table query:
- provider
The provider should be a session, a genome identifier, or a TrackHub URI.
session: The UCSCSession instance from which the tables are retrieved. Although all sessions are based on the same database, the set of user-uploaded tracks, which are represented as tables, is generally not the same.
- tableName
The name of the specific table to retrieve. May be NULL, in which case the behavior depends on how the query is executed; see below.
- range
A genome identifier, a GRanges or an IntegerRangesList indicating the portion of the table to retrieve, in genome coordinates. Simply specifying the genome string is the easiest way to download data for the entire genome, and GRangesForUCSCGenome facilitates downloading data for e.g. an entire chromosome.
- hubUrl
The URI of the specific TrackHub.
- genome
The genome identifier within the specific TrackHub. It only needs to be provided if the provider is a TrackHub URI.
- names
Names/accessions of the desired features.
A common workflow for querying the UCSC database is to create an
instance of UCSCTableQuery using the ucscTableQuery
constructor, invoke tableNames to list the available tables for
a track, and finally to retrieve the desired table either as a
data.frame via getTable or as a track
via track. See the examples.
The reason for a formal query class is to facilitate multiple queries
when the differences between the queries are small. For example, one
might want to query multiple tables within the same track and/or
genomic region, or query the same table for multiple regions. The
UCSCTableQuery instance can be incrementally adjusted for each
new query. Some caching is also performed, which enhances performance.
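As a minimal sketch of this incremental style, the snippet below builds one query and changes only its range for each region. It reuses the session, table and first region from the Examples section; the second region's coordinates are arbitrary and chosen purely for illustration. It needs network access to UCSC.

```r
library(rtracklayer)

# Session provider, as in the Examples section
session <- browserSession()
genome(session) <- "mm9"

# One query object; only its range will change between requests
query <- ucscTableQuery(session, table = "multiz30waySummary")

# The second region is an arbitrary illustration, not taken from the manual
regions <- list(
  GRangesForUCSCGenome("mm9", "chr12", IRanges(57795963, 57815592)),
  GRangesForUCSCGenome("mm9", "chr12", IRanges(57900000, 57910000))
)

# Adjust the range on a copy of the query and fetch the table each time
results <- lapply(regions, function(region) {
  range(query) <- region
  getTable(query)
})
```

The same pattern applies to querying several tables over one region: keep the range fixed and assign a new name with `tableName(query) <- value` before each retrieval.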
Constructor
- ucscTableQuery(x, range = seqinfo(x), table = NULL, names = NULL, hubUrl = NULL, genome = NULL): Creates a UCSCTableQuery with the UCSCSession, genome identifier or TrackHub URI given as x and the table name given by the single string table. range should be a genome string identifier, a GRanges instance or an IntegerRangesList instance, and it effectively defaults to genome(x). If the genome is missing, it is taken from the provider. Feature names, such as gene identifiers, may be passed via names as a character vector.
Executing Queries
Below, object is a UCSCTableQuery instance.
- track(object): Retrieves the indicated table as a track, i.e. a GRanges object. Note that not all tables are available as tracks.
- getTable(object): Retrieves the indicated table as a data.frame. Note that not all tables are output in parseable form, and that UCSC will truncate responses if they exceed certain limits (usually around 100,000 records). The safest (and most efficient) bet for large queries is to download the file via FTP and query it locally.
- tableNames(object): Gets the names of the tables available for the provider, table and range specified by the query.
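Because truncation is silent, it can help to check the size of a result before relying on it. The sketch below is only a heuristic: looksTruncated is a hypothetical helper, not part of rtracklayer, and its default threshold is just the approximate 100,000-record figure mentioned above. Stand-in data frames are used so that the snippet runs without a network connection.

```r
# Hypothetical helper (not part of rtracklayer): flags a result whose row
# count reaches the approximate UCSC limit. The limit is an assumption.
looksTruncated <- function(tbl, limit = 100000L) {
  stopifnot(is.data.frame(tbl))
  nrow(tbl) >= limit
}

# Stand-in data frames in place of real getTable() results
small <- data.frame(id = seq_len(10))
large <- data.frame(id = seq_len(100000))

looksTruncated(small)
looksTruncated(large)
```

In practice the helper would be applied to the data.frame returned by getTable(object); a positive result suggests narrowing the range or downloading the file and querying it locally.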
Accessor methods
In the code snippets below, x/object is a
UCSCTableQuery object.
- genome(x), genome(x) <- value: Gets or sets the genome identifier (e.g. “hg18”) of the object.
- hubUrl(x), hubUrl(x) <- value: Gets or sets the TrackHub URI.
- tableName(x), tableName(x) <- value: Get or set the single string indicating the name of the table to retrieve. May be NULL, in which case the table is automatically determined.
- range(x), range(x) <- value: Get or set the GRanges indicating the portion of the table to retrieve in genomic coordinates. Any missing information, such as the genome identifier, is filled in using range(browserSession(x)). It is also possible to set the genome identifier string or an IntegerRangesList.
- names(x), names(x) <- value: Get or set the names of the features to retrieve. If NULL, this filter is disabled.
- ucscSchema(x): Get the UCSCSchema object describing the selected table.
- ucscTables(genome, track): Get the list of tables for the specified track (e.g. “Assembly”) and genome identifier (e.g. “hg19”). Here genome and track must each be a single non-NA string.
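The accessors can be combined to inspect and adjust an existing query, roughly as sketched below. The table and feature names are taken from the Examples section; printed values are not shown because they depend on the UCSC service, which the snippet needs network access to reach.

```r
library(rtracklayer)

# Genome identifier provider, as in the Examples section
query <- ucscTableQuery("hg18", table = "snp129",
                        names = c("rs10003974", "rs10087355"))

# Inspect the current fields
genome(query)
tableName(query)
names(query)

# Narrow the feature filter to a single accession before retrieving
names(query) <- "rs10075230"
snp <- getTable(query)
```

Setting names(query) <- NULL instead would disable the feature filter, so the next retrieval would cover everything within the query's range.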
Author(s)
Michael Lawrence
Examples
## Not run:
# query using `session` provider
session <- browserSession()
genome(session) <- "mm9"
## choose the phastCons30way table for a portion of mm9 chr12
query <- ucscTableQuery(session, table = "phastCons30way",
                        range = GRangesForUCSCGenome("mm9", "chr12",
                                                     IRanges(57795963, 57815592)))
## list the table names
tableNames(query)
## retrieve the track data
track(query) # a GRanges object
## get the multiz30waySummary track
tableName(query) <- "multiz30waySummary"
## get a data.frame summarizing the multiple alignment
getTable(query)
# query using `genome identifier` provider
query <- ucscTableQuery("hg18", table = "snp129",
                        names = c("rs10003974", "rs10087355", "rs10075230"))
ucscSchema(query)
getTable(query)
# query using `TrackHub URI` provider
query <- ucscTableQuery("https://ftp.ncbi.nlm.nih.gov/snp/population_frequency/TrackHub/20200227123210/",
                        genome = "hg19", table = "ALFA_GLB")
getTable(query)
# get the list of tables for 'Assembly' track and 'hg19' genome identifier
ucscTables("hg19", "Assembly")
## End(Not run)