Mission Overview
Hubble Image Similarity Project (HISP)
Primary Investigator: Richard L. White
HLSP Authors: Richard L. White, Joshua E. G. Peek
Released: 2025-04-25
Updated: 2025-04-25
Primary Reference(s): White & Peek 2025
DOI: 10.17909/0q3g-by85
Citations: See ADS Statistics
Source Data:
- Source Data DOI: 10.17909/3gmn-c234
Overview
The Hubble Image Similarity Project (HISP) is a large collection of similarity information between sub-regions of Hubble Space Telescope images, as described in White & Peek 2025. These data can be used to assess the accuracy of image search algorithms based on computer vision methods. The images were compared by humans in a citizen science project, in which participants were asked to select similar images from a comparison sample. Nearly 850,000 comparison measurements were analyzed to construct a similarity distance matrix between all pairs of images. The results are very impressive: the data capture similarity between images based on morphology, texture, and other details that are sometimes difficult even to describe in words (e.g., dusty absorption bands with sharp edges). The collective visual wisdom of the citizen scientists matches the accuracy of the trained eye, with even subtle differences among images faithfully reflected in the distances.
Data Products
The filenames for all HISP data products follow the convention:
hlsp_hisp_hst_acs-wfc3-wfpc2_all_multi_v1_<file-type>.<ext>
where:
- <file-type> is the type of file, described in the table below
- <ext> is the file extension, either "fits" or "zip"
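As a minimal sketch of the convention above, the helper below assembles a product filename from a file type and an extension. The function name `hisp_filename` is hypothetical and is not part of any HISP or MAST tooling; the prefix string is the one used in the authors' code examples on this page.

```python
# Hypothetical helper (not part of HISP): build a product filename
# from the naming convention described above.
PREFIX = "hlsp_hisp_hst_acs-wfc3-wfpc2_all_multi_v1_"


def hisp_filename(file_type: str, ext: str = "fits") -> str:
    """Return the HISP filename for a given file type and extension."""
    if ext not in ("fits", "zip"):
        raise ValueError(f"unexpected extension: {ext!r}")
    return f"{PREFIX}{file_type}.{ext}"


print(hisp_filename("cutouts"))
print(hisp_filename("phase3-results"))
print(hisp_filename("cutouts-selected", ext="zip"))
```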
Data file types:
objects.fits | Table of 6,527 NGC catalog objects that were searched for Hubble observations (only 1,071 of these actually have Hubble data). |
observations.fits | Table of 4,465 Hubble observations that overlap the objects. Not all Hubble observations were used in the HISP sample, and a single Hubble image can be the source of multiple cutout images in the project. |
cutouts.fits | Table of 19,916 cutout images in the HISP parent sample of images. The "selected" column indicates images that are actually used in the HISP selected sample. The parent sample includes 19,916 cutouts from 2,960 HST observations of 788 NGC objects. The selected sample used for HISP includes 2,098 cutouts from 666 HST observations of 666 NGC objects (there is a single HST observation selected for each object). |
phase<P>-samples.fits | Table of the pages presented to users for comparison during each phase <P>, where <P> is either 1, 2, or 3. In Phase 1, users were asked to select all comparison images that are similar to the reference image, so they could make from 0 to 15 selections. Note that no rotations were used for the Phase 1 sample, so the rotation values are all zero. Columns: |
phase<P>-results.fits | Table of the results for the Phase <P> pages. Each sample was reviewed by one user in Phase 1, and by three users in Phase 2 and Phase 3. Columns: |
img-simdist-phs<P>.fits | These three files contain the similarity distances (computed as described in the paper) using only the Phase 1 results, the combination of the Phase 1 and Phase 2 results, and the combined results from all three phases. |
cutouts-selected.zip | Zip file with the 2,098 selected JPEG image cutouts used for the reviews. These are in a nested directory structure with a top-level directory named "cutouts" and lower-level directories named for each NGC object. An object can have more than one cutout, depending on its size and structure. |
cutouts-all.zip | Zip file with the 19,916 JPEG image cutouts that constitute the parent sample for the reviews. Note this file includes all the selected images as well. |
Data Access
Direct Download
The data for this HLSP are available for direct download using the links in the table below:
Code Examples
The authors of HISP have provided a few code examples for working with the data in Python, shown below.
Reading from the Results
The phase3-results.fits file contains a table of the 323,829 results for the Phase 3 pages. Each page was reviewed by 3 users during this phase. The citizen scientists were asked to select which of 3 comparison images was most similar to the reference image. Each entry in the answers column has 4 elements: the first corresponds to the reference image and the other three to the comparison images. Exactly one of the three comparison elements is set in each row; the reference element is never set.
Here is a code snippet that reads the phase 3 results and confirms that the reference image is never selected and that exactly one comparison image is selected for each page:
>>> from astropy.table import Table
>>> import numpy as np
# Open Phase 3 results file
>>> prefix = "hlsp_hisp_hst_acs-wfc3-wfpc2_all_multi_v1_"
>>> tab = Table.read(prefix + "phase3-results.fits")
# Print Shape
>>> tab["answers"].shape
(323829, 4)
# Confirm that the reference image is never selected
>>> print(tab["answers"].sum(axis=0))
[ 0 135561 101458 86810]
# Confirm that exactly one comparison image is selected for each page:
>>> (tab["answers"].sum(axis=1) == 1).all()
True
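Because each row of answers has exactly one element set, the chosen comparison slot can be recovered with argmax. The sketch below uses a small synthetic array as a stand-in for tab["answers"], so the values are illustrative and are not taken from the HISP files.

```python
import numpy as np

# Synthetic stand-in for tab["answers"]: column 0 is the reference image,
# columns 1-3 are the comparison images; exactly one of them is set per row.
answers = np.array([
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
])

# Index (1, 2, or 3) of the comparison image chosen on each page
choice = answers.argmax(axis=1)
print(choice)

# Number of times each slot was chosen (slot 0 is the reference image)
counts = np.bincount(choice, minlength=4)
print(counts)
```

The per-slot counts computed this way should agree with the column sums printed in the authors' example.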
Reading the Similarity Distance Array
The three img-simdist-phs<P>.fits files contain the similarity distances (computed as described in the paper) using only the Phase 1 results, the combination of the Phase 1 and Phase 2 results, and the combined results from all three phases.
The data are stored as a 1-dimensional condensed distance array, as defined in the Python scipy.spatial.distance.pdist function. The similarity distance is symmetrical, and there is no need to compute the distance of an image from itself, so only one triangle of the matrix, excluding the diagonal, is stored. There are N = 2098 selected cutout images, so the array in the file has N*(N-1)/2 = 2,199,753 elements.
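To look up a single pair without expanding the array, the position of pair (i, j) in the condensed array can be computed directly. The sketch below follows the index formula given in the scipy.spatial.distance.pdist documentation; the function name `condensed_index` is hypothetical.

```python
def condensed_index(i: int, j: int, n: int) -> int:
    """Position of the (i, j) distance in a condensed array for n items."""
    if i == j:
        raise ValueError("the diagonal (i == j) is not stored")
    if i > j:
        i, j = j, i  # distances are symmetric, so order does not matter
    return n * i + j - ((i + 2) * (i + 1)) // 2


n = 2098
print(n * (n - 1) // 2)                   # total number of stored pairs
print(condensed_index(0, 1, n))           # position of the first stored pair
print(condensed_index(n - 2, n - 1, n))   # position of the last stored pair
```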
If desired, the array can be expanded into a full 2098x2098 square array using the scipy.spatial.distance.squareform function.
In the full array, the rows and columns match the entries in the <prefix>cutouts.fits table that have selected=True. Here is Python code that identifies the least similar pair of images in the selected sample:
>>> from astropy.table import Table
>>> from astropy.io import fits
>>> from scipy.spatial.distance import squareform
>>> import numpy as np
# Open cutouts table
>>> prefix = "hlsp_hisp_hst_acs-wfc3-wfpc2_all_multi_v1_"
>>> tab = Table.read(prefix + "cutouts.fits")
>>> stab = tab[tab["selected"]]
>>> print(len(stab))
2098
# Open Similarity Distance Array
>>> simdist = squareform(fits.open(prefix + "img-simdist-phs3.fits")[0].data)
>>> simdist.shape
(2098, 2098)
# Find the index of maximum distance (the least similar images)
>>> i, j = np.unravel_index(np.argmax(simdist), simdist.shape)
>>> print(i, j, simdist[i,j])
1056 1309 0.8228424088670983
# Print the corresponding cutout file names
>>> print(stab["outfile"][[i,j]])
outfile
--------------------------------------------------------------
cutouts/NGC3603/cutout_44_hst_11360_a1_wfc3_uvis_f656n_drz.jpg
cutouts/NGC4452/cutout_01_hst_9401_44_acs_wfc_f475w_drz.jpg
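For assessing an image search algorithm, a common use of the square array is to rank all other images by distance from a query image. The sketch below does this on a small synthetic distance matrix; with the real data, the simdist array from the example above would take its place. The helper name `nearest_neighbors` is hypothetical.

```python
import numpy as np


def nearest_neighbors(dist: np.ndarray, query: int, k: int = 3) -> np.ndarray:
    """Return the indices of the k images closest to `query`, excluding itself."""
    order = np.argsort(dist[query], kind="stable")
    order = order[order != query]  # drop the query image itself
    return order[:k]


# Synthetic symmetric distance matrix with a zero diagonal (illustrative values)
dist = np.array([
    [0.0, 0.2, 0.9, 0.4],
    [0.2, 0.0, 0.7, 0.3],
    [0.9, 0.7, 0.0, 0.5],
    [0.4, 0.3, 0.5, 0.0],
])

print(nearest_neighbors(dist, query=0, k=2))
```

With the HISP data, indexing stab["outfile"] with the returned indices should list the file names of the cutouts closest to the query row.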
Accessing the WCS Coordinates of the Cutouts
The cutout image sample is provided in two zip archives linked above, cutouts-selected.zip and cutouts-all.zip. For each cutout image, the JPEG image comment section includes WCS info and astronomical coordinates for the cutout. Here is a Python code snippet showing how to access the WCS for an example image from NGC1389:
>>> from PIL import Image
# Load in image
>>> im = Image.open("cutouts/NGC1389/cutout_01_hst_10217_11_acs_wfc_f475w_drz.jpg")
# Print WCS info from the JPEG comment
>>> print(im.info["comment"].decode())
CTYPE1 = 'RA---TAN'
CTYPE2 = 'DEC--TAN'
CRPIX1 = -180.000000000086
CRPIX2 = 57.2499999999145
CRVAL1 = 54.3042739376784
CRVAL2 = -35.7477127807115
CD1_1 = -2.7777777777776e-05
CD1_2 = 9.7424971796151e-35
CD2_1 = 0
CD2_2 = 2.77777777777792e-05
COMMENT Created by fitscut 1.4.2 (William Jon McCann)
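The comment text is a set of FITS-style "keyword = value" lines, so it can be parsed with a few lines of standard-library Python. The sketch below is one possible approach and not the authors' method; it works on a shortened copy of the text shown above, and the function name `parse_wcs_comment` is hypothetical. It assumes each value is either a quoted string or a plain number, as in the example output.

```python
def parse_wcs_comment(text: str) -> dict:
    """Parse 'KEY = value' lines into a dict; lines without '=' are skipped."""
    wcs = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # e.g. the COMMENT line, which has no '='
        value = value.strip()
        if value.startswith("'"):
            wcs[key.strip()] = value.strip("'").strip()  # quoted string value
        else:
            wcs[key.strip()] = float(value)              # numeric value
    return wcs


# Shortened copy of the JPEG comment shown above
comment = """CTYPE1 = 'RA---TAN'
CTYPE2 = 'DEC--TAN'
CRPIX1 = -180.000000000086
CRVAL1 = 54.3042739376784
CD2_1 = 0
COMMENT Created by fitscut 1.4.2 (William Jon McCann)"""

wcs = parse_wcs_comment(comment)
print(wcs["CTYPE1"], wcs["CRVAL1"])
```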
Citations
Please remember to cite the appropriate paper(s) below and the DOI 10.17909/0q3g-by85 if you use these data in a published work.
Note: These HLSP data products are licensed for use under CC BY 4.0.