Mission Overview

Hubble Image Similarity Project (HISP)

 

Primary Investigator: Richard L. White

HLSP Authors: Richard L. White, Joshua E. G. Peek

Released: 2025-04-25

Updated: 2025-04-25

Primary Reference(s): White & Peek 2025

DOI: 10.17909/0q3g-by85

Citations:  See ADS Statistics

Read Me

Source Data:

 

Sample HST Images used in HISP

This figure shows a grid of images, with 3 rows and 5 columns, displaying 15 total example cutout images used in the HISP project. Each image is a square showing a nebula, stars, or galaxies in black and white.

Figure 1. Sample of HST images used in HISP comparison tests.  The titles indicate the object name, camera, and filter used for the observation.  Each image is a 448x448 pixel region from the HST (HLA) image.  There is a great variety in the morphology and texture of the images, which was the goal of the image selection process.  The entire image set includes a total of 2,098 images similar to these.

NGC 1073 Cutouts

The left half of this figure shows a colorful image of the galaxy NGC 1073. The right side of this figure shows a series of square black-and-white images, arranged in an "S" shape, tracing out the spiral arms of this galaxy. This figure is an example of how the cutout images for the HISP project (the squares) were selected for objects with a wide field of view, like this galaxy.

Figure 2. Example showing HISP image cutout selections within the elliptical region defined for NGC 1073.  The left panel shows the NGC catalog ellipse (cyan), the footprint for the HLA observation (red), and the regions for the selected cutout images (green).  The background image is from the Pan-STARRS1 gri filters.  The right panel shows the HST cutout images.  The removal of cutouts having low entropy leaves only images in regions with relatively high contrast.  For this object the images trace the galaxy's spiral arms.

Overview

The Hubble Image Similarity Project (HISP) is a large collection of similarity information between sub-regions of Hubble Space Telescope images as described in White & Peek 2025. These data can be used to assess the accuracy of image search algorithms based on computer vision methods. The images were compared by humans in a citizen science project, where they were asked to select similar images from a comparison sample. Nearly 850,000 comparison measurements have been analyzed to construct a similarity distance matrix between all the pairs of images. The results are very impressive: the data capture similarity between images based on morphology, texture, and other details that are sometimes difficult even to describe in words (e.g., dusty absorption bands with sharp edges). The collective visual wisdom of the citizen scientists matches the accuracy of the trained eye, with even subtle differences among images faithfully reflected in the distances.

Data Products

The filenames for all HISP data products follow the convention:

hlsp_hisp_hst_acs-wfc3-wfpc2_all_multi_v1_<file-type>.<ext>

where:

  • <file-type> is the type of file, described in the table below 
  • <ext> is the file extension, either "fits" or "zip"
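
As a small illustration of the naming convention, the sketch below assembles a product filename from its two variable parts. The helper name hisp_filename is hypothetical and is not part of the HISP distribution.

```python
# Fixed part of every HISP product filename, taken from the convention above.
PREFIX = "hlsp_hisp_hst_acs-wfc3-wfpc2_all_multi_v1_"


def hisp_filename(file_type: str, ext: str = "fits") -> str:
    """Return the full filename for a HISP product (hypothetical helper)."""
    if ext not in ("fits", "zip"):
        raise ValueError(f"unexpected extension: {ext!r}")
    return f"{PREFIX}{file_type}.{ext}"


print(hisp_filename("cutouts"))
print(hisp_filename("cutouts-selected", "zip"))
```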

Data file types:

objects.fits

Table of 6,527 NGC catalog objects that were searched for Hubble observations (only 1,071 of these actually have Hubble data).

  • Primary index is objectid
observations.fits

Table of 4,465 Hubble observations that overlap the objects. Not all Hubble observations were used in the HISP sample, and a single Hubble image can be the source of multiple cutout images in the project.

  • Primary index is observationid
  • Joins to objects.fits via objectid
cutouts.fits

Table of 19,916 cutout images in the HISP parent sample of images.

The "selected" column indicates images that are actually used in the HISP selected sample. The parent sample includes 19,916 cutouts from 2,960 HST observations of 788 NGC objects. The selected sample used for HISP includes 2,098 cutouts from 666 HST observations of 666 NGC objects (there is a single HST observation selected for each object).

  • Primary index is cutoutid (integer)
  • Joins to observations.fits via observationid
  • Joins to objects.fits via objectid
phase<P>-samples.fits

Table of the pages presented to users for comparison during each phase <P>, where <P> is either 1, 2, or 3. In Phase 1, users were asked to select all comparison images that are similar to the reference image, so they could make from 0 to 15 selections. Note that no rotations were used for the Phase 1 sample, so the rotation values are all zero.

 

Columns:

  • sampleid is the primary index (the page number) 
  • cutoutids is an array of 16 image IDs, with the first being the reference image ID. It joins to the <prefix>cutouts.fits table. 
  • rotation is an array of 16 rotation values. The rotation value ranges from 0 to 3, where 0 means the original unrotated cutout image is shown, and 1-3 mean successive rotations by 90 degrees compared with the original. 
  • golden is an integer where a non-zero value marks this as a "Golden" sample, where one of the included images is the same as the reference image. Golden samples can be used to confirm user performance. When non-zero, the value of golden is the number of the cutout (1-15) that matches the reference.
phase<P>-results.fits

Table of the results for the Phase <P> pages. Each sample was reviewed by one user in Phase 1,  and three users for Phase 2 and Phase 3.

 

Columns:

  • resultid is the primary index 
  • workerid is an integer indicating the reviewer for this sample 
  • sampleid identifies the page number (joins to <prefix>phase<P>-samples.fits) 
  • worktime is the time in seconds used for the review. Note this can be long if the user walked away after a sample was displayed in their browser. 
  • submittime is the actual time when the review was submitted. This is an ASCII string in the format 'Sat Jun 06 12:47:16 PDT 2020'. 
  • answers is a 16-element Boolean array indicating which images were selected. The reference image is never selected, so answers[0] is always False. It is possible for no similar image to be identified, in which case all answers for a sample are False.
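
The golden column of a samples table and the answers column of the matching results table can be combined to check whether a reviewer found the duplicated image. The sketch below uses plain Python lists in place of table rows. The function name is hypothetical, and it assumes that position k in answers refers to the same image as position k in cutoutids, so that a golden value of k points at answers[k].

```python
from typing import Optional, Sequence


def found_golden(golden: int, answers: Sequence[bool]) -> Optional[bool]:
    """Report whether a reviewer selected the golden image on one page.

    Returns None for pages that are not golden samples (golden == 0).
    Assumes answers[k] refers to the same image as cutoutids[k].
    """
    if golden == 0:
        return None
    return bool(answers[golden])


# Synthetic page: the reviewer selected only comparison image number 7.
answers = [False] * 16
answers[7] = True
print(found_golden(7, answers))  # golden image is number 7
print(found_golden(3, answers))  # golden image is number 3
print(found_golden(0, answers))  # not a golden page
```
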
img-simdist-phs<P>.fits

These three files contain the similarity distances (computed as described in the paper) using only the Phase 1 results, the combination of the Phase 1 and Phase 2 results, and the combined results from all three phases.

cutouts-selected.zip

Zip file with the 2,098 selected JPEG image cutouts used for the reviews. These are in a nested directory structure with a top-level directory named "cutouts" and lower-level directories named for each NGC object. An object can have more than one cutout, depending on its size and structure.

cutouts-all.zip

Zip file with the 19,916 JPEG image cutouts that constitute the parent sample for the reviews. Note this file includes all the selected images as well.
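
Because the cutouts are nested under one directory per NGC object, the archive members can be grouped by object without unpacking anything. The following sketch uses the standard-library zipfile module. The function name is hypothetical, and the layout cutouts/<object>/<file>.jpg is assumed from the description of cutouts-selected.zip above.

```python
import zipfile
from collections import defaultdict


def cutouts_by_object(zip_path):
    """Map each object directory name to the list of JPEG members under it.

    zip_path may be a filename or an open binary file-like object.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            parts = name.split("/")
            # Expect cutouts/<object>/<file>.jpg; skip directory entries
            # and anything that does not match the assumed layout.
            if (len(parts) == 3 and parts[0] == "cutouts"
                    and parts[2].lower().endswith(".jpg")):
                groups[parts[1]].append(name)
    return dict(groups)
```

Calling cutouts_by_object on the downloaded cutouts-selected.zip file would then return a dictionary keyed by object name, from which the number of cutouts per object follows directly.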

Data Access

Code Examples

The authors of HISP have provided a few code examples for working with the data in Python, shown below.

Reading from the Results

The phase3-results.fits file contains a table of the 323,829 results for the Phase 3 pages. Each page was reviewed by 3 users during this phase. The citizen scientists were asked to select which image out of 3 comparison images was most similar to the reference image. Each row of the answers column has 4 elements: the first represents the reference image and the other three represent the possible comparison images. Exactly one of the three comparison images is selected in each row.

Here is a code snippet that reads the phase 3 results and confirms that the reference image is never selected and that exactly one comparison image is selected for each page:

>>> from astropy.table import Table
>>> import numpy as np

# Open Phase 3 results file
>>> prefix = "hlsp_hisp_hst_acs-wfc3-wfpc2_all_multi_v1_"
>>> tab = Table.read(prefix + "phase3-results.fits")

# Print Shape
>>> tab["answers"].shape
(323829, 4)

# Confirm that the reference image is never selected 
>>> print(tab["answers"].sum(axis=0))
[     0 135561 101458  86810]

# Confirm that exactly one comparison image is selected for each page:
>>> (tab["answers"].sum(axis=1) == 1).all()
True
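
The submittime strings in the results tables can be converted to datetime objects if a timing analysis is needed. The sketch below is one possible approach, not the authors' method. It drops the time-zone abbreviation before parsing, because datetime.strptime handles %Z abbreviations such as PDT inconsistently across platforms, so the result is a naive wall-clock time. The function name is hypothetical.

```python
from datetime import datetime


def parse_submittime(text: str) -> datetime:
    """Parse a string like 'Sat Jun 06 12:47:16 PDT 2020' into a naive datetime."""
    fields = text.split()
    if len(fields) != 6:
        raise ValueError(f"unexpected submittime format: {text!r}")
    # Remove the fifth field (the time-zone abbreviation) and parse the rest.
    without_tz = " ".join(fields[:4] + fields[5:])
    return datetime.strptime(without_tz, "%a %b %d %H:%M:%S %Y")


print(parse_submittime("Sat Jun 06 12:47:16 PDT 2020"))
```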

Reading the Similarity Distance Array

The three img-simdist-phs<P>.fits files contain the similarity distances (computed as described in the paper) using only the Phase 1 results, the combination of the Phase 1 and Phase 2 results, and the combined results from all three phases. 

The data are stored as a 1-dimensional condensed distance array, as defined in the Python scipy.spatial.distance.pdist function. The similarity distance is symmetrical, and there is no need to store the distance of an image from itself, so only the half of the matrix below the diagonal is stored. There are N=2098 images, so the array in the file has N*(N-1)/2 = 2,199,753 elements.

If desired, the array can be expanded into a full 2098x2098 square array using the scipy.spatial.distance.squareform function.
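
For a handful of lookups, expanding to the full square array is not necessary. The sketch below computes the position of a pair (i, j) in the condensed array using the index formula given in the SciPy documentation for pdist. The function name is hypothetical.

```python
def condensed_index(i: int, j: int, n: int) -> int:
    """Index of the pair (i, j) in a condensed distance array for n items."""
    if i == j:
        raise ValueError("the distance of an item from itself is not stored")
    if not (0 <= i < n and 0 <= j < n):
        raise IndexError("i and j must be in the range 0..n-1")
    if i > j:
        i, j = j, i  # the distance is symmetric, so the order does not matter
    return n * i + j - ((i + 2) * (i + 1)) // 2


n = 2098
print(n * (n - 1) // 2)                  # length of the condensed array
print(condensed_index(n - 2, n - 1, n))  # index of the last stored pair
```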

In the full array, the rows and columns match the entries in the <prefix>cutouts.fits table that have selected=True. Here is Python code that identifies the least similar pair of images in the selected sample: 

>>> from astropy.table import Table
>>> from astropy.io import fits
>>> from scipy.spatial.distance import squareform
>>> import numpy as np

# Open cutouts table
>>> prefix = "hlsp_hisp_hst_acs-wfc3-wfpc2_all_multi_v1_"
>>> tab = Table.read(prefix + "cutouts.fits")
>>> stab = tab[tab["selected"]]
>>> print(len(stab))
2098

# Open Similarity Distance Array
>>> simdist = squareform(fits.open(prefix + "img-simdist-phs3.fits")[0].data)
>>> simdist.shape
(2098, 2098)

# Find the index of maximum distance (the least similar images)
>>> i, j = np.unravel_index(np.argmax(simdist), simdist.shape)
>>> print(i, j, simdist[i,j])
1056 1309 0.8228424088670983

# Print the corresponding cutout file names
>>> print(stab["outfile"][[i,j]])
                           outfile                            
--------------------------------------------------------------
cutouts/NGC3603/cutout_44_hst_11360_a1_wfc3_uvis_f656n_drz.jpg
   cutouts/NGC4452/cutout_01_hst_9401_44_acs_wfc_f475w_drz.jpg
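
The same square array can be read in the other direction, to list the images most similar to a chosen one. The sketch below is an illustration on a small synthetic matrix rather than the HISP data. The function name and the toy distances are made up for the example.

```python
import numpy as np


def most_similar(simdist: np.ndarray, index: int, k: int = 5) -> np.ndarray:
    """Return the indices of the k images closest to `index`, nearest first."""
    order = np.argsort(simdist[index], kind="stable")
    order = order[order != index]  # drop the image itself (zero on the diagonal)
    return order[:k]


# Toy symmetric distance matrix for four images.
toy = np.array([
    [0.0, 0.2, 0.9, 0.4],
    [0.2, 0.0, 0.7, 0.3],
    [0.9, 0.7, 0.0, 0.8],
    [0.4, 0.3, 0.8, 0.0],
])
print(most_similar(toy, 0, k=2))
```

With the arrays from the example above, indexing stab["outfile"] with the result of most_similar(simdist, i) should list the corresponding cutout file names.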

Accessing the WCS Coordinates of the Cutouts

The cutout image sample is provided in two zip archives linked above, cutouts-selected.zip and cutouts-all.zip. For each cutout image, the JPEG image comment section includes WCS info and astronomical coordinates for the cutout. Here is a Python code snippet showing how to access the WCS for an example image from NGC1389:

>>> from PIL import Image

# Load in image
>>> im = Image.open("cutouts/NGC1389/cutout_01_hst_10217_11_acs_wfc_f475w_drz.jpg")

# Print WCS info from the JPEG comment
>>> print(im.info["comment"].decode())
CTYPE1  = 'RA---TAN'
CTYPE2  = 'DEC--TAN'
CRPIX1  = -180.000000000086
CRPIX2  = 57.2499999999145
CRVAL1  = 54.3042739376784
CRVAL2  = -35.7477127807115
CD1_1   = -2.7777777777776e-05
CD1_2   = 9.7424971796151e-35
CD2_1   = 0
CD2_2   = 2.77777777777792e-05
COMMENT Created by fitscut 1.4.2 (William Jon McCann)
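
The comment text is a short list of FITS-style keyword = value lines, so it can be turned into a dictionary with a few lines of Python. The sketch below is a hypothetical helper, not part of the HISP distribution. It works on a hard-coded copy of part of the example output, and the pixel-scale estimate assumes the CD matrix is essentially diagonal, as it is here.

```python
def parse_wcs_comment(comment: str) -> dict:
    """Parse 'KEY = value' lines from a cutout's JPEG comment into a dict."""
    wcs = {}
    for line in comment.splitlines():
        if "=" not in line:
            continue  # skip free-text lines such as COMMENT
        key, value = (part.strip() for part in line.split("=", 1))
        if value.startswith("'") and value.endswith("'"):
            wcs[key] = value.strip("'").strip()  # quoted string value
        else:
            try:
                wcs[key] = float(value)          # numeric value
            except ValueError:
                wcs[key] = value                 # keep anything else as text
    return wcs


example = """CTYPE1  = 'RA---TAN'
CTYPE2  = 'DEC--TAN'
CRVAL1  = 54.3042739376784
CRVAL2  = -35.7477127807115
CD2_2   = 2.77777777777792e-05
COMMENT Created by fitscut 1.4.2 (William Jon McCann)"""

wcs = parse_wcs_comment(example)
print(wcs["CTYPE1"], wcs["CRVAL1"], wcs["CRVAL2"])
print(abs(wcs["CD2_2"]) * 3600.0)  # approximate pixel scale in arcsec per pixel
```

For a real cutout, the string passed to parse_wcs_comment would be im.info["comment"].decode() from the snippet above.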

Citations

Please remember to cite the appropriate paper(s) below and the DOI 10.17909/0q3g-by85 if you use these data in a published work. 

Note: These HLSP data products are licensed for use under CC BY 4.0.

References