Demo of the crossmatch function

The crossmatch function calculates indexing arrays between two catalogs that share a common object ID, handling cases of both repeated and absent IDs. In this demo, we’ll create a couple of dummy catalogs to demonstrate basic usage.

The typical use case of crossmatch is as follows. Suppose you have two catalogs of data, Cat_A and Cat_B, and each catalog has an integer column storing the ID of each object. The crossmatch function calculates two indexing arrays that provide the correspondence between the entries of the two catalogs that refer to the same object. The conditions that crossmatch assumes are:

  • Cat_A is permitted to contain repeated entries of the same ID

  • Cat_A is permitted to contain entries of IDs that do not appear in Cat_B

  • Cat_B is NOT permitted to contain repeated entries of the same ID
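
Before calling crossmatch, it can be useful to check these conditions on your own ID columns. The following is a minimal sketch using only NumPy; the helper name `check_crossmatch_inputs` and the keys of the dictionary it returns are hypothetical and are not part of galsampler.

```python
import numpy as np


def check_crossmatch_inputs(objid_a, objid_b):
    """Hypothetical helper (not part of galsampler): summarize how two
    ID arrays relate to the three conditions listed above."""
    objid_a = np.asarray(objid_a)
    objid_b = np.asarray(objid_b)

    # Cat_B must not contain repeated IDs
    unique_b, counts_b = np.unique(objid_b, return_counts=True)
    repeated_in_b = unique_b[counts_b > 1]

    # Cat_A may contain repeated IDs and IDs that are absent from Cat_B
    n_repeats_a = objid_a.size - np.unique(objid_a).size
    n_absent_from_b = np.count_nonzero(~np.isin(objid_a, objid_b))

    return dict(
        b_is_unique=repeated_in_b.size == 0,
        repeated_in_b=repeated_in_b,
        n_repeats_a=n_repeats_a,
        n_absent_from_b=n_absent_from_b,
    )


report = check_crossmatch_inputs([3, 3, 7, -1], [3, 5, 7])
print(report["b_is_unique"], report["n_repeats_a"], report["n_absent_from_b"])
```

Only a `b_is_unique` value of False indicates a violated condition; the other two counts are informational, since repeats and unmatched IDs are permitted in Cat_A.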

Let’s get started by setting up a couple of catalogs storing some dummy data for demonstration purposes:

[1]:
import numpy as np

n_a = 2500
n_b = 400

cat_b_objid = np.arange(n_b).astype(int)
cat_b_mass = np.random.uniform(0, 10, n_b)
cat_b_spin = 10**np.random.uniform(-2, 0, n_b)
cat_b = dict(objid=cat_b_objid, mass=cat_b_mass, spin=cat_b_spin)

cat_a_objid = np.random.choice(cat_b_objid, size=n_a)

cat_a = dict(objid=cat_a_objid)

Note that cat_A has been set up so that every one of its entries has exactly one matching entry in cat_B, and that while cat_A has numerous repeats, there are no repeated IDs in cat_B. These two catalogs therefore meet the assumptions required by the crossmatch function. A later example explores a case where some of the entries in cat_A do not appear in cat_B, but in this first example everything has a match.

Now let’s use crossmatch to calculate the indexing arrays providing the correspondence between common objects:

[2]:
from galsampler.crossmatch import crossmatch

idxA, idxB = crossmatch(cat_a['objid'], cat_b['objid'])

First, note that both returned indexing arrays have the same length as cat_A. The crossmatch function returns an index into cat_B for every object in cat_A that has a match; since every object in cat_A has a match here, idxA and idxB each have one entry per object in cat_A:

[3]:
print(len(idxA), len(idxB))
2500 2500
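
To make the meaning of the two indexing arrays concrete, here is a minimal NumPy-only sketch that produces arrays with the same matching property. The function `crossmatch_sketch` is a hypothetical stand-in written for illustration; it is not galsampler's implementation, and the real function may return its entries in a different order.

```python
import numpy as np


def crossmatch_sketch(x, y):
    """Hypothetical stand-in for crossmatch (assumption, not the library code).

    Returns idx_x, idx_y such that x[idx_x] == y[idx_y] elementwise.
    x may contain repeats and values absent from y; y must be unique.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    if y.size == 0:
        empty = np.array([], dtype=int)
        return empty, empty

    # Sort y once so that each x value can be located by binary search
    order = np.argsort(y)
    y_sorted = y[order]
    if np.any(y_sorted[1:] == y_sorted[:-1]):
        raise ValueError("y must not contain repeated IDs")

    # Candidate position of each x value within the sorted y array
    pos = np.clip(np.searchsorted(y_sorted, x), 0, y.size - 1)

    # Keep only the x entries whose candidate position holds an equal value
    has_match = y_sorted[pos] == x
    idx_x = np.flatnonzero(has_match)
    idx_y = order[pos[has_match]]
    return idx_x, idx_y


x = np.array([5, 3, 5, 9, 1])  # repeated 5, and 9 is absent from y
y = np.array([1, 3, 5, 7])
idx_x, idx_y = crossmatch_sketch(x, y)
assert np.array_equal(x[idx_x], y[idx_y])
```

In this small case the sketch keeps four of the five entries of `x`, dropping the unmatched value 9, and the repeated value 5 points at the same row of `y` twice.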

Now let’s verify that the indexing arrays do indeed provide a matching correspondence:

[4]:
# The IDs are integers, so compare them exactly rather than within a tolerance
assert np.array_equal(cat_a['objid'][idxA], cat_b['objid'][idxB])

Finally, let’s augment cat_A with the properties of mass and spin whose values are stored in cat_b. This is a two-step process:

  1. Initialize an array (here filled with zeros) where we will store the new data from the cross-matching

  2. Use the indexing arrays to map the values from cat_B into cat_A

[5]:
cat_a['mass'] = np.zeros(n_a)
cat_a['spin'] = np.zeros(n_a)

cat_a['mass'][idxA] = cat_b['mass'][idxB]
cat_a['spin'][idxA] = cat_b['spin'][idxB]
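
The same two-step transfer can be traced by hand on a catalog small enough to read at a glance. In the sketch below, the array names and values are made up for illustration, and the index arrays are written out manually rather than computed by crossmatch.

```python
import numpy as np

# Tiny illustrative catalogs (values are arbitrary)
objid_a = np.array([2, 0, 2, 1])        # ID 2 appears twice
objid_b = np.array([0, 1, 2])           # no repeated IDs
mass_b = np.array([10.0, 11.0, 12.0])

# Index arrays written out by hand: row idx_a[i] of the first catalog
# matches row idx_b[i] of the second
idx_a = np.array([0, 1, 2, 3])
idx_b = np.array([2, 0, 2, 1])
assert np.array_equal(objid_a[idx_a], objid_b[idx_b])

# Step 1: initialize the destination array
mass_a = np.zeros(objid_a.size)

# Step 2: map the values across using the index arrays
mass_a[idx_a] = mass_b[idx_b]
print(mass_a)  # [12. 10. 12. 11.]
```

Both rows with ID 2 receive the same mass, which is how repeated entries in the first catalog are handled.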

Let’s do one more example in which some of the objects in cat_A have no matching counterpart in cat_B:

[6]:
n_unmatched = 20
cat_a_objid[:n_unmatched] = np.random.randint(-5, 0, n_unmatched)

cat_a = dict(objid=cat_a_objid)

[7]:
idxA, idxB = crossmatch(cat_a['objid'], cat_b['objid'])

We have set up this example so that the first 20 entries of cat_A have no match in cat_B. Let’s check that the lengths of the returned indexing arrays reflect this:

[8]:
print(len(idxA), len(idxB))
2480 2480

Now let’s again transfer the properties in cat_B into cat_A. This time, we’ll initialize our arrays with fill values so that it’s easy to verify that unmatched objects in cat_A still have their initial values after the cross-matching:

[9]:
cat_a['mass'] = np.full(n_a, np.nan)
cat_a['spin'] = np.full(n_a, np.nan)

cat_a['mass'][idxA] = cat_b['mass'][idxB]
cat_a['spin'][idxA] = cat_b['spin'][idxB]

Next we’ll define a simple boolean array, mask_has_match, storing whether or not each object in cat_A has a match, and we’ll verify that only the negatively-valued IDs go unmatched, which is the way we set up this toy example.

[10]:
mask_has_match = np.zeros(n_a, dtype=bool)
mask_has_match[idxA] = True

assert not np.any(np.isnan(cat_a['mass'][mask_has_match]))
assert np.all(np.isnan(cat_a['mass'][~mask_has_match]))

assert np.all(cat_a['objid'][mask_has_match]>=0)
assert np.all(cat_a['objid'][~mask_has_match]<0)

As we can see above, the only NaN values in our cross-matched catalog come from objects without a match in cat_B, all of which pertain to objects in cat_A with negative IDs.
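
As an independent cross-check, a has-match mask can also be built directly from the two ID columns without using the indexing arrays. The sketch below uses NumPy's `np.isin` on dummy IDs generated the same way as in this demo; the variable names and the fixed seed are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

n_a, n_b, n_unmatched = 2500, 400, 20
objid_b = np.arange(n_b)
objid_a = rng.choice(objid_b, size=n_a)

# Overwrite the first 20 entries with negative IDs that cannot appear in objid_b
objid_a[:n_unmatched] = rng.integers(-5, 0, size=n_unmatched)

# True wherever an ID in objid_a also appears somewhere in objid_b
has_match = np.isin(objid_a, objid_b)
n_matched = np.count_nonzero(has_match)

assert n_matched == n_a - n_unmatched
assert np.all(objid_a[~has_match] < 0)
```

A mask built this way should agree with the one derived from the indexing arrays, so comparing the two is a quick sanity test on real catalogs.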