The Genomic Data Commons (GDC) is a research program of the National Cancer Institute (NCI).

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

In Brief

  • Download Data with different methods

  • Data Wrangling

Example 1 Standardized Download

  • Click on Repository and choose Cases to set up the data of interest: TCGA-LUNG.

    caseset

  • Choose Files, select the files of interest, and then click on Manifest to download data1.

    filesset1

  • Next, change Data Type to Isoform Expression Quantification and click on Manifest to download data2.

    filesset2

  • Keep the Cases selection and clear all previous selections on the Files page.

  • Select clinical under Data Category and bcr xml under Data Format, then download data3.

    filesset3

  • Count the number of lines in each downloaded manifest.


$ wc -l gdc_manifest.2020-09-27-*

Output
     568 gdc_manifest.2020-09-27-LUNG-miRNA-isoform.txt
     568 gdc_manifest.2020-09-27-LUNG-miRNA-seq.txt
     523 gdc_manifest.2020-09-27-clinical.txt
    1659 total
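The line counts above include one header row per manifest. The following is a minimal sketch for counting only the file entries, assuming the manifest is a tab-separated file whose first line is a header (id, filename, md5, size, state). The function name manifest_entries is hypothetical and not part of any GDC tool.

```shell
# Hypothetical helper: count the file entries in a GDC manifest.
# Assumption: the first line is a header row and every other
# non-empty line describes exactly one file.
manifest_entries() {
  awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$1"
}

# Example call (file name taken from the listing above):
# manifest_entries gdc_manifest.2020-09-27-clinical.txt
```

If the header assumption holds, 568 lines correspond to 567 file entries and 523 lines to 522, which agrees with the directory counts reported later in this walkthrough.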
  • Download and unzip gdc-client (the macOS build is shown here).

wget https://gdc.cancer.gov/files/public/file/gdc-client_v1.6.0_OSX_x64_1.zip

unzip gdc-client_v1.6.0_OSX_x64_1.zip
  • Check the usage of gdc-client.

$ ./gdc-client --help

output

commands:
  {download,upload,settings}
                        for more information, specify -h after a command
    download            download data from the GDC
    upload              upload data to the GDC
  • Check usage of gdc-client download.
$ ./gdc-client download --help

output
usage: gdc-client download [-h] [--debug] [--log-file LOG_FILE] [--color_off]
                           [-t TOKEN_FILE] [-d DIR] [-s server]
                           [--no-segment-md5sums] [--no-file-md5sum]
                           [-n N_PROCESSES]
                           [--http-chunk-size HTTP_CHUNK_SIZE]
                           [--save-interval SAVE_INTERVAL] [-k]
                           [--no-related-files] [--no-annotations]
                           [--no-auto-retry] [--retry-amount RETRY_AMOUNT]
                           [--wait-time WAIT_TIME] [--latest] [--config FILE]
                           [-m MANIFEST]
                           [file_id [file_id ...]]

positional arguments:
  file_id               The GDC UUID of the file(s) to download

optional arguments:
  -h, --help            show this help message and exit
  --debug               Enable debug logging. If a failure occurs, the program
                        will stop.
  --log-file LOG_FILE   Save logs to file. Amount logged affected by --debug
  --color_off           Disable colored output
  -t TOKEN_FILE, --token-file TOKEN_FILE
                        GDC API auth token file
  -d DIR, --dir DIR     Directory to download files to. Defaults to current
                        directory
  -s server, --server server
                        The TCP server address server[:port]
  --no-segment-md5sums  Do not calculate inbound segment md5sums and/or do not
                        verify md5sums on restart
  --no-file-md5sum      Do not verify file md5sum after download
  -n N_PROCESSES, --n-processes N_PROCESSES
                        Number of client connections.
  --http-chunk-size HTTP_CHUNK_SIZE, -c HTTP_CHUNK_SIZE
                        Size in bytes of standard HTTP block size.
  --save-interval SAVE_INTERVAL
                        The number of chunks after which to flush state file.
                        A lower save interval will result in more frequent
                        printout but lower performance.
  -k, --no-verify       Perform insecure SSL connection and transfer
  --no-related-files    Do not download related files.
  --no-annotations      Do not download annotations.
  --no-auto-retry       Ask before retrying to download a file
  --retry-amount RETRY_AMOUNT
                        Number of times to retry a download
  --wait-time WAIT_TIME
                        Amount of seconds to wait before retrying
  --latest              Download latest version of a file if it exists
  --config FILE         Path to INI-type config file
  -m MANIFEST, --manifest MANIFEST
                        GDC download manifest file
  • Download Manifest files.

./gdc-client download -m gdc_manifest.2020-09-27-clinical.txt -d clinical/

./gdc-client download -m gdc_manifest.2020-09-27-LUNG-miRNA-isoform.txt -d isoform/

./gdc-client download -m gdc_manifest.2020-09-27-LUNG-miRNA-seq.txt -d miRNAseq/
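After a long download it can help to check that nothing was skipped. The sketch below lists manifest entries that have no matching sub-directory in the download directory. It assumes that column 1 of the tab-separated manifest holds the file UUID and that gdc-client stores each file in a sub-directory named after that UUID. The function name missing_downloads is hypothetical.

```shell
# Hypothetical helper: print every manifest id that has no
# sub-directory of the same name under the download directory.
# Assumptions: the manifest is tab-separated with a header row,
# and column 1 is the file UUID used as the directory name.
missing_downloads() {
  local manifest=$1
  local outdir=$2
  tail -n +2 "$manifest" | cut -f1 | while IFS= read -r id; do
    [ -n "$id" ] || continue                    # skip blank lines
    [ -d "$outdir/$id" ] || printf '%s\n' "$id" # report missing UUIDs
  done
}

# Example call:
# missing_downloads gdc_manifest.2020-09-27-LUNG-miRNA-seq.txt miRNAseq
```

An empty result suggests that every entry has a directory. It does not verify file contents, which gdc-client checks separately through its md5sum options.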
  • Check the information of files downloaded.
$ cd clinical

# count total number of files downloaded
$ ls|wc
output
    522     522   19314

# count the vital_status lines recorded as Alive
$ grep -i vital_status */*xml|grep Alive|wc
output
    840     840   10920

# count the vital_status lines that do not contain Alive
$ grep -i vital_status */*xml|grep -v Alive|wc
output
     298    3580  104444

# get the sample names/IDs of the Alive entries:
$ grep -i vital_status */*xml|grep Alive |cut -d"." -f 3
output
TCGA-J2-8192
TCGA-J2-8192
TCGA-91-8499
TCGA-91-8499
TCGA-55-6986
TCGA-55-6986
TCGA-NJ-A4YG
TCGA-NJ-A4YG
......

# count the unique samples recorded as Alive
$ grep -i vital_status */*xml|grep Alive |cut -d"." -f 3|sort -u|wc
output
    395     395    5135
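The grep pipelines above count matching lines, so a sample that carries several vital_status elements is counted more than once. As a complementary view, the sketch below tallies the text of every vital_status element by value. It assumes each element sits on a single line with its value between the opening and closing tags. The function name vital_status_tally is hypothetical.

```shell
# Hypothetical helper: tally the values of all vital_status elements
# found in <dir>/<uuid>/*.xml.
# Assumption: each element is written on one line, e.g.
#   <ns:vital_status ...>Alive</ns:vital_status>
# Elements without text are reported as "(empty)".
vital_status_tally() {
  grep -hi 'vital_status' "$1"/*/*.xml |
    awk '{
      v = ""
      # Take the text between the first ">" and the following "</".
      if (match($0, />[^<]*<\//)) v = substr($0, RSTART + 1, RLENGTH - 3)
      if (v == "") v = "(empty)"
      count[v]++
    }
    END { for (v in count) print count[v], v }' |
    LC_ALL=C sort -k2
}

# Example call:
# vital_status_tally clinical
```

A tally like this shows which values are actually behind the lines matched by grep -v Alive.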

Data Wrangling

Overall Information

# count the number of files for clinical, isoform and miRNAseq
$ ls clinical|wc
output
     522     522   19314

$ ls isoform|wc
output
     567     567   20979

$ ls miRNAseq|wc
output
     567     567   20979

R Scripts For Single Sample

  • Pick one clinical file at random to inspect the format of a sample.
$ ls clinical/57d733f6-2f8e-4f0c-b391-48d3cf7c1f87
output
nationwidechildrens.org_clinical.TCGA-75-7030.xml
  • Follow the usage shown in R - XML Files.

    • Open R.

      R reads the XML file with the function xmlParse(), which returns the parsed document as an XML document object.

    • Reading XML File.

      
      # Load the package required to read XML files.
      library("XML")
      
      # Also load the other required package.
      library("methods")
      
      # Give the input file name to the function.
      result <- xmlParse(file = "nationwidechildrens.org_clinical.TCGA-75-7030.xml")
      
      # Print the result.
      print(result)
      
    • Get Number of Nodes Present in XML File.

      
      # Extract the root node from the XML file.
      rootnode <- xmlRoot(result)
      
      # Find number of nodes in the root.
      rootsize <- xmlSize(rootnode)
      
      # Print the result.
      print(rootsize)
      
      output
      [1] 2
      
    • Details of the First and Second Nodes.

      
      # Extract the root node from the XML file.
      rootnode <- xmlRoot(result)
      
      # Print the result.
      print(rootnode[1])
      print(rootnode[2])
      
    • XML to Data Frame.

      
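      # Convert the second child node of the root to a data frame.
      xmldataframe <- xmlToDataFrame(rootnode[2])
      print(xmldataframe)

      # Transpose so that each field becomes a row, then save the result to the file 'tmp'.
      t(xmldataframe)
      write.table(t(xmldataframe),'tmp')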
      

Run R Scripts For All Files

  • Extend the single-sample script so that it processes every file.

  • Run Scripts.

    RScript1
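One way to apply a per-file script to the whole download is a small shell wrapper. This is a sketch under stated assumptions, not the script shown in the screenshot: it assumes every clinical file ends in .xml and lives somewhere under the download directory. The function name each_clinical_xml and the script name parse_clinical.R are hypothetical.

```shell
# Hypothetical helper: run a command once per XML file under a
# directory, passing the file path as the last argument.
each_clinical_xml() {
  local dir=$1
  shift
  find "$dir" -type f -name '*.xml' | LC_ALL=C sort |
    while IFS= read -r xml; do
      # Redirect stdin so the command cannot consume the file list.
      "$@" "$xml" </dev/null
    done
}

# Example call (parse_clinical.R is a hypothetical script that
# takes the XML path as its only argument):
# each_clinical_xml clinical Rscript parse_clinical.R
```

Sorting the paths keeps the processing order stable between runs, which makes it easier to compare outputs.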

Example 2 Fast Download

  • Search Google for the keywords: tcga gdc lusc.

    google

  • Click on the first result, and download the file data.

    downloadfile

  • Choose and download the clinical data of interest.

    set2

  • Choose and download the transcript data of interest.

    transcript1


    transcript2

Run R Scripts

In summary

  • For further details, see the official GDC Documentation online.