edits to earthdata tutorials
JessicaS11 committed Dec 4, 2023
1 parent 4ea5410 commit c9f8ffc
Showing 3 changed files with 56 additions and 53 deletions.
@@ -18,7 +18,7 @@ This guide was adapted from the following tutorials:

## 1. Modes of Data Access

In the past, most of our scientific data analysis workflows have started with searching for data and then downloading that data to a local machine, whether that is the hard drive of your laptop or workstation, or some shared storage device hosted by your institution or research group. This can be a time-consuming process if the volume of data is large, even with fast internet. It also requires that you have sufficient disk space and that you update your copy every time an updated version of the data is released. If you want to work with data from different geoscience domains, you may have to download data from several data centers.

<figure>
<center>
@@ -29,37 +29,37 @@ In the past, most of our scientific data analysis workflows have started with se
Figure credit: Alexey Shiklomanov, NASA ESDIS, from [The future of NASA Earth Science in the commercial cloud:
Challenges and opportunities](https://docs.google.com/presentation/d/12mh_8WU9lsrPviBO_MBv2blbjRufXoQmqCB4XGyxQ90/edit?pli=1)

However, change is afoot. New modes of data access are becoming available. Driven by the growth in the volume of data from future satellite missions, the archiving and distribution of NASA data is in a [state of transition](https://www.earthdata.nasa.gov/eosdis/cloud-evolution). Over the next few years, all NASA data will be migrated to the NASA Earthdata Cloud, a cloud-hosted data store that will have all NASA datasets in one place. This not only offers new modes of accessing NASA data but also offers new ways of working with this data. As with Google Docs or Sheets, data in these "files" is not just stored in the cloud; compute resources offered by cloud providers also allow you to process and analyze the data in the cloud. When you edit your Google Doc or Sheet, you are working in the cloud, not on your computer. All you need is a web browser; you can work with these files on your laptop, tablet or even your phone. If you choose to share these documents with others, they can actively collaborate with you on the same document, also in the cloud. For large geoscience datasets, this means you can _skip the download_ and take your _analysis to the data_.

## 2. NASA Earthdata Cloud

NASA's cloud-hosted storage is known as the Earthdata Cloud; all NASA datasets are being migrated to be available in the cloud. During this transition period, data will remain freely available for download directly from the DAACs (Distributed Active Archive Centers), which have archived and distributed NASA data for over 20 years.

<figure>
<center>
<img src='images/NSIDC-DAAC.png' alt='NSIDC DAAC Intro'/>
</center>
</figure>

The NSIDC DAAC now offers all [ICESat-2](https://nsidc.org/data/icesat-2) and [ICESat/GLAS](https://nsidc.org/data/icesat) data products via Earthdata Cloud. A listing of all NSIDC DAAC cloud-hosted data can be found [here](https://nsidc.org/data/earthdata-cloud/data). More details on ICESat-2 below.

### Earthdata Cloud Computing Basics

"The Cloud" is a somewhat nebulous term (pun intended). In general, the cloud is a network of remote servers that run software and services that are accessed over the internet. There is a growing number of commercial cloud providers (Google Cloud Platform, Amazon Web Services, Microsoft Azure). NASA has contracted with Amazon Web Services (AWS) to host data using the AWS Simple Storage Service (S3). AWS offers a large number of services in addition to S3 storage. A key service is Amazon Elastic Compute Cloud (Amazon EC2). This is the service that runs _under the hood_ of the CryoCloud JupyterHub you are using during today's workshop. When you start a JupyterHub, an EC2 _instance_ is started. You can think of an EC2 _instance_ as a remote computer.

AWS has the concept of a region, which is a cluster of data centers. These data centers house the servers that run S3 and EC2 instances. NASA Earthdata Cloud is hosted in the `us-west-2` region. This is important because if your EC2 instance is in the same region as the Earthdata Cloud S3 storage, you can access data in S3 directly in a way that is analogous to accessing a file on your laptop's or workstation's hard drive. This is one of the key advantages of working in the cloud; you can do analysis where the data is stored without having to download or move the data to another machine.
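
The in-region idea can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not how CryoCloud itself is configured: it assumes the third-party `s3fs` package is installed and that you already hold temporary S3 credentials in a dictionary with the keys `accessKeyId`, `secretAccessKey` and `sessionToken`. The helper names `is_in_region` and `open_s3_object` are hypothetical.

```python
EARTHDATA_CLOUD_REGION = "us-west-2"  # region named in the paragraph above


def is_in_region(compute_region: str, data_region: str = EARTHDATA_CLOUD_REGION) -> bool:
    """Return True when the compute instance and the data share an AWS region."""
    return compute_region.strip().lower() == data_region


def open_s3_object(s3_url: str, credentials: dict):
    """Open an S3 object as a file-like object, much like opening a local file.

    Assumption: `credentials` holds temporary keys under the names
    accessKeyId, secretAccessKey and sessionToken.
    """
    import s3fs  # third-party; imported here so the region check needs no extras

    fs = s3fs.S3FileSystem(
        key=credentials["accessKeyId"],
        secret=credentials["secretAccessKey"],
        token=credentials["sessionToken"],
    )
    return fs.open(s3_url, mode="rb")
```

Calling `is_in_region("us-west-2")` returns `True`, which is the situation in which `open_s3_object` can read data in place instead of copying it to another machine first.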

### Cost Considerations

The notion of _analysis in place_, or the concept of bringing your compute, or processing, to the data, provides several advantages over the more traditional download method: You no longer need to move data from its archived location, and you only pay for the compute needed to do your analysis. A few key points about cost:

* Cost to access: As long as you process the data in the same region where they are stored in Earthdata Cloud, accessing the data is free of charge. CryoCloud runs in the same `us-west-2` region where the NASA Earthdata Cloud data are stored.
* Cost to compute: The compute resources needed to run your analyses do cost money. The difference is between an upfront cost, like purchasing a laptop with a certain amount of CPU and memory to process data locally, and a cost you pay as you go. There is a cost associated with the EC2 instance mentioned above, which is paid for by CryoCloud.
* Cost to store: With _analysis in place_, the data are streamed directly from their native location in the cloud, so storage is not needed. However, you may wish to store analysis outputs or other data using your own S3 bucket, which does incur a cost.

### "When To Cloud"

Migrating to a cloud-based data analysis workflow can often have a steep learning curve and feel overwhelming. There are times when the cloud is effective and times when the download model may still be more appropriate. Here are a few key questions to ask yourself:

* What is the data volume?
* How long will it take to download?
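
The first two questions can be turned into a rough estimate. The sketch below is simple arithmetic, assuming a sustained transfer rate and decimal units (1 GB = 8,000 megabits); the function name and the example numbers are illustrative only, and real downloads are usually slower than the nominal bandwidth.

```python
def download_time_hours(volume_gb: float, bandwidth_mbps: float) -> float:
    """Hours needed to transfer `volume_gb` gigabytes at `bandwidth_mbps` megabits per second."""
    if volume_gb < 0 or bandwidth_mbps <= 0:
        raise ValueError("volume must be >= 0 and bandwidth must be > 0")
    megabits = volume_gb * 8 * 1000  # gigabytes -> megabits (decimal units)
    seconds = megabits / bandwidth_mbps
    return seconds / 3600


# Illustrative values: 500 GB over a 100 Mbit/s connection takes about 11 hours.
hours = download_time_hours(500, 100)
```
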
@@ -73,7 +73,7 @@ Migrating to a cloud-based data analysis workflow can often have a steep learnin

![IS2](https://icesat-2.gsfc.nasa.gov/sites/default/files/MissionLogo_0.png)

ICESat-2 carries a satellite lidar instrument, ATLAS. Lidar is an active remote sensing technique in which pulses of light are emitted and the return time is used to measure distance, in this case the height of something on the Earth's surface. The available ICESat-2 data products range from sea ice freeboard to land elevation to cloud backscatter characteristics. A list of available products can be found [here](https://icesat-2.gsfc.nasa.gov/science/data-products).

![IS2-Product-Tree](https://nsidc.org/sites/default/files/styles/article_image/public/images/Other/icesat2_graphic_2023_update_final.png.webp)

@@ -82,7 +82,7 @@ More key features of ICESat-2:
* Height determined using round-trip travel time of laser light (photon counting lidar)
* 10,000 laser pulses released per second, split into 3 weak/strong beam pairs at a wavelength of 532 nanometers (bright green on the visible spectrum).
* Measurements taken every 70 cm along the satellite’s ground track, roughly 11 m wide footprint.
* The number of photons that return to the telescope depends on surface reflectivity and cloud cover (which obscures ATLAS’s view of Earth). As such, the spatial resolution of signal photons varies.
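
The round-trip timing principle in the list above can be written as a short worked example. This is a sketch of the basic geometry only (range = speed of light × round-trip time / 2); it ignores atmospheric delay and pointing corrections, and the function names are hypothetical. The second function simply multiplies the 70 cm spacing by the 10,000 pulses per second quoted above.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0  # exact by definition of the metre


def range_from_round_trip(round_trip_s: float) -> float:
    """One-way distance in metres for a photon round-trip time in seconds."""
    return SPEED_OF_LIGHT_M_S * round_trip_s / 2.0


def implied_ground_speed(spacing_m: float, pulses_per_s: float) -> float:
    """Along-track speed in m/s implied by the pulse spacing and pulse rate."""
    return spacing_m * pulses_per_s


# One nanosecond of round-trip time corresponds to about 0.15 m of range,
# which is why photon timing has to be so precise.
range_per_ns = range_from_round_trip(1e-9)

# 0.7 m between pulses at 10,000 pulses per second implies roughly 7 km/s.
ground_speed = implied_ground_speed(0.7, 10_000)
```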

### Data Collection

@@ -123,9 +123,11 @@ This table provides an overview of the capabilities across the tools and services
| Preview data | x | x | | x | x | x |
| Download data from DAAC | x | x | | x | x | x |
| Access cloud-hosted data | x | x | x | | x | |
| All ICESat-2 data products accessible | x | x | | | x | x |
| Subset (spatially, temporally, by variable) | x | | x | x | _x_ | |
| Load data by direct-access | x | x | x | | | |
| Process and analyze data | | | x | | | |
| Plot data with built-in methods | x | | x | x | | |
```

***Jessica note*** what's the difference between Access cloud-hosted data and Load data by direct-access? And what's the italic x for?
@@ -14,26 +14,26 @@ In this tutorial you will learn how to:

## Overview

NASA Earthdata Search is a web-based tool to discover, filter, visualize and access all of NASA's Earth science data, both in Earthdata Cloud and archived at the NASA DAACs. It is a useful first step in data discovery, especially if you are not sure what data is available for your research problem.

This tutorial is based on the NSIDC [NASA Earthdata Cloud Access Guide](https://nsidc.org/data/user-resources/help-center/nasa-earthdata-cloud-data-access-guide). Take a look at this access guide if you want more information and also to learn how to use command line tools to download cloud-hosted data from an S3 bucket.


## Searching for data and S3 links using Earthdata Search

### Search for Data

Step 1. Go to https://search.earthdata.nasa.gov and log in using your Earthdata Login credentials by clicking on the Earthdata Login button in the top-right corner.

Step 2. Check the **Available in Earthdata Cloud** box in the **Filter Collections** side-bar on the left of the page (Box 1 on the screenshot below). The Matching Collections will appear in the results box. All datasets in Earthdata Cloud have a badge showing a cloud symbol and "Earthdata Cloud" next to them. To narrow the search, we will filter by datasets supported by NSIDC by typing "NSIDC" in the search box (Box 2 on the screenshot below). If you want, you can narrow the search further using spatial and temporal filters or any of the other filters in the Filter Collections box.

<figure>
<center>
<img src='images/Screenshot_EDSC_Searching_Cloud_Datasets.png' alt='Screenshot of Search for Cloud Datasets in Earthdata Search'/>
</center>
</figure>

Step 3. You can now select the dataset you want by clicking on that dataset. The Search Results box now contains granules that match your search. The location of these granules is shown on the map. The search can be refined using spatial and temporal filters, or you can select individual granules using the "+" symbol on each granule search result. Once you have the data you want, click **Download All** (Box 1 in the screenshot below). In the sidebar that appears, select **Direct Download** (Box 2 in the screenshot below). Then select **Download Data**.

<figure>
<center>
@@ -43,7 +43,7 @@ Step 3. You can now select the dataset you want by clicking on that dataset. Th

### Getting S3 links and AWS S3 Credentials

Step 4. A Download Status window will appear (this may take a short amount of time) similar to the one shown below. You will see a tab for **AWS S3 Access** (Box 1 in the screenshot below). Select this tab. A list of S3 links (URLs) starting with `s3://` will be in the box below. You can save them to a text file or copy them to your clipboard using the **Save** and **Copy** buttons (Box 2 in the screenshot below). Or you can copy each link separately by hovering over a link and clicking the clipboard icon (Box 3).
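
If you save the links to a text file, each `s3://` link can be split into a bucket name and an object key. The sketch below uses only the Python standard library; the function names and the example bucket and file names are hypothetical, not real Earthdata Cloud locations.

```python
from urllib.parse import urlparse


def split_s3_link(link: str) -> tuple:
    """Split 's3://bucket/path/to/file' into (bucket, key)."""
    parsed = urlparse(link.strip())
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a complete S3 link: {link!r}")
    return parsed.netloc, key


def read_s3_links(text: str) -> list:
    """Parse the contents of a saved links file, one link per line, skipping blank lines."""
    return [split_s3_link(line) for line in text.splitlines() if line.strip()]


# Hypothetical contents of a saved links file.
saved = "s3://example-bucket/ATLAS/granule_a.h5\n\ns3://example-bucket/ATLAS/granule_b.h5\n"
links = read_s3_links(saved)
```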

Step 5. To access data in Earthdata Cloud, you need AWS S3 credentials: “accessKeyId”, “secretAccessKey”, and “sessionToken”. These are temporary credentials that last for one hour. To get them, click **Get AWS S3 Credentials** (Box 4 in the screenshot below). This will open a new page that contains the three credentials.
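
One way to use the three credentials is to map them onto the standard AWS environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`) that AWS tools read. The sketch below assumes you have copied the credentials into a dictionary keyed by the three names given in Step 5; the helper names are hypothetical, and the expiry check relies only on the one-hour lifetime stated above.

```python
from datetime import datetime, timedelta, timezone

CREDENTIAL_LIFETIME = timedelta(hours=1)  # lifetime stated in Step 5


def credentials_to_env(creds: dict) -> dict:
    """Map the three Earthdata S3 credentials onto standard AWS environment variables."""
    return {
        "AWS_ACCESS_KEY_ID": creds["accessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["secretAccessKey"],
        "AWS_SESSION_TOKEN": creds["sessionToken"],
    }


def credentials_expired(issued_at: datetime, now: datetime = None) -> bool:
    """Return True once the credentials are at least one hour old."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - issued_at >= CREDENTIAL_LIFETIME
```

Recording the time at which you requested the credentials lets you call `credentials_expired` before a long job and request fresh credentials when needed.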

@@ -58,4 +58,4 @@ Step 5. To access data in Earthdata Cloud, you need AWS S3 credentials; “acce
Now that you have the S3 links and credentials, you can download the data using the [AWS Command Line Interface (CLI)](https://aws.amazon.com/cli/). This is shown in the [NASA Earthdata Cloud Access Guide](https://nsidc.org/data/user-resources/help-center/nasa-earthdata-cloud-data-access-guide).
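
As a rough illustration of what such a download looks like, the sketch below builds an `aws s3 cp` command for one S3 link without running it. It assumes the AWS CLI is installed on the machine where you eventually run the command and that the temporary credentials are available as AWS environment variables; the function name and the example link are hypothetical. The access guide linked above remains the reference for the exact procedure.

```python
import shlex


def build_aws_cp_command(s3_link: str, destination_dir: str = ".", region: str = "us-west-2") -> list:
    """Build (but do not run) an `aws s3 cp` command that copies one object to a local directory."""
    if not s3_link.startswith("s3://"):
        raise ValueError(f"expected an s3:// link, got {s3_link!r}")
    # A trailing slash tells the CLI to treat the destination as a directory.
    destination = destination_dir.rstrip("/") + "/"
    return ["aws", "s3", "cp", s3_link, destination, "--region", region]


# Hypothetical link; a real one comes from the AWS S3 Access tab in Step 4.
command = build_aws_cp_command("s3://example-bucket/ATLAS/granule_a.h5", "data")
command_text = shlex.join(command)  # a string you could paste into a terminal
```

Building the command as a list keeps it safe to pass to `subprocess.run(command, check=True)` later, once the credentials are set in the environment.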

[!IMPORTANT]
As of writing, the AWS CLI is not installed on CryoCloud. The other modules in this tutorial will show how to search and access data all within a single Jupyter Notebook.
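
For orientation, a notebook-based search might look like the sketch below. It assumes the third-party `earthaccess` library is installed and that you have an Earthdata Login account; `ATL06` is used only as an example ICESat-2 product short name, the bounding box and dates are arbitrary, and the helper names are hypothetical. It is not necessarily the approach taken in the other modules.

```python
def build_granule_query(short_name: str, bounding_box: tuple, temporal: tuple, count: int = 10) -> dict:
    """Collect search parameters for cloud-hosted granules into one dictionary.

    bounding_box is (west, south, east, north) in degrees;
    temporal is a (start, end) pair of date strings.
    """
    if len(bounding_box) != 4 or len(temporal) != 2:
        raise ValueError("bounding_box needs 4 values and temporal needs 2")
    return {
        "short_name": short_name,
        "cloud_hosted": True,
        "bounding_box": tuple(bounding_box),
        "temporal": tuple(temporal),
        "count": count,
    }


def search_s3_links(query: dict) -> list:
    """Log in, search for granules, and return their direct-access (s3://) links."""
    import earthaccess  # third-party; needs network access and Earthdata Login

    earthaccess.login()
    granules = earthaccess.search_data(**query)
    return [link for granule in granules for link in granule.data_links(access="direct")]


# Arbitrary example parameters; nothing is sent until search_s3_links(query) is called.
query = build_granule_query("ATL06", (-50.0, 60.0, -40.0, 70.0), ("2019-06-01", "2019-06-30"))
```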