Restructuring the Marshall Database Scheme

The transientBucket table has become a little unruly and can be difficult to manage and navigate. I’ve therefore sliced this table up in to various sub-tables. Incidentally this has helped me generalise the marshall so that the same core-tables can be used by multiple survey webapps, each survey having its own set of book-keeping tables much like the current pesstoObjects table.

The new core tables are:

  • marshall_sources - one row for each unique source in the marshall with RA, DEC and Sherlock Classification
  • marshall_sources_discoveries - discovery information for each transient from each of the surveys that have ‘discovered’ it
  • marshall_sources_photometry - all survey photometry info for each transient, including non-detection limits
  • marshall_sources_spectra - all spectral info for each transient
  • marshall_sources_related_files - a list of related files associated with each transient

Table of Contents

Refactoring and Cleaning

Before slicing the table up it was important to remove any duplicate entries from the transientBucket table. Here’s the query I have been using to find duplicates:

SELECT DISTINCT
    transientBucketId, survey, name, spectralType
FROM
    transientBucket
WHERE
    name IN (SELECT 
            name
        FROM
            (SELECT 
                COUNT(*) AS count, transientBucketId, name, survey
            FROM
                (SELECT DISTINCT
                transientBucketId, name, survey
            FROM
                transientBucket
            WHERE
                survey NOT IN ('atel-coords' , 'atel-names', 'bright sn list', 'atel', 'pessto', 'TNS', 'ATLAS FP')
                    AND name NOT LIKE 'AT2%'
                    AND name NOT LIKE 'SN20%'
                    AND name NOT LIKE 'SN20%'
                    AND name NOT LIKE 'SN21%'
                    AND name NOT LIKE 'SN22%'
                    and spectralType is null
            ORDER BY name) AS c
            GROUP BY name) d
        WHERE
            count > 1)
        AND survey NOT IN ('atel-coords' , 'atel-names',
        'bright sn list',
        'atel',
        'pessto',
        'TNS',
        'ATLAS FP')
        and spectralType is null
ORDER BY name , dateCreated;

The table is now free of duplicates.

I’ve also made an attempt at homogenising survey names across the marshall as a survey name entered by a user or from other external sources often does not match the survey names used by the automated import scripts. Here’s the current survey cleaning script I run before syncing the transientBucket with the new core-tables:

update transientBucket set survey = "CRTS-SSS" where survey = "CRTS" and name like "SSS%";
update transientBucket set survey = "CRTS-MLS" where survey = "CRTS" and name like "MLS%";
update transientBucket set survey = "CRTS-CSS" where survey = "CRTS" and name like "CSS%";
update transientBucket set survey = "CRTS-SSS" where survey = "CRTS-SSS" and name like "SSS%";
update transientBucket set survey = "CRTS-MLS" where survey = "CRTS-MLS" and name like "MLS%";
update transientBucket set survey = "CRTS-CSS" where survey = "CRTS-CSS" and name like "CSS%";
update transientBucket set survey = "CBET" where survey = "" or survey = "unknown" and name like "SN%";
update transientBucket set survey = "Gaia" where survey = "" or survey = "unknown" and name like "Gaia%";
update transientBucket set survey = "ASASSN" where survey = "" or survey = "unknown" and name like "ASASSN%";
update transientBucket set survey = "TOCP" where survey = "" or survey = "unknown" and name like "PSN%";
update transientBucket set survey = "MASTER" where survey = "" or survey = "unknown" and name like "MASTER%";
update transientBucket set survey = "LSQ" where survey = "" or survey = "unknown" and name like "LSQ%";
update transientBucket set survey = "PS1" where survey = "" or survey = "unknown" and name like "PS1%";
update transientBucket set survey = "DECam" where name in ("EMILIA", "GRETA", "GRETA_2", "PAMELA");
update transientBucket set survey = "amateur" where survey like  "amateur%";
update transientBucket set survey = "Berto Monard" where survey like  "%Monard%";
update transientBucket set survey = "ASASSN" where survey like  "ASAS%";
update transientBucket set survey = "CBET" where survey  like "cbat%";
update transientBucket set survey = "CBET" where survey  like "CBET%";
update transientBucket set survey = "Gaia" where survey  = "GaiaAlerts";
update transientBucket set survey = "MASTER" where survey  = "master";
update transientBucket set survey = "TOCP" where survey  = "TOPC";
update transientBucket set survey = "TNTS" where survey  like  "TNTS%";
update transientBucket set survey = "CHASE" where survey  = "chase";
update transientBucket set survey = "OGLE" where survey  = "OGLE-IV";
update transientBucket set survey = "Pan-STARRS1" where survey  = "PanSTARRS";
update transientBucket set survey = "TOCP" where survey  = "tocp";
update transientBucket set survey = "Pan-STARRS1" where survey  = "PSST";
update transientBucket set survey = "Pan-STARRS1" where survey  like "PS1%";
update transientBucket set survey = "CRTS" where name like  "MLS%" and spectralType is null;
update transientBucket set survey = "MASTER" where name like  "MASTER%" and spectralType is null;
update transientBucket set survey = "TOCP" where name like  "PSN%" and spectralType is null;

Table Creation and Sync

Here is the sync_transientbucket_to_core_tables.sql script and a summary of what’s going on:

marshall_sources

Here’s the command to sync transientBucket > marshall_sources

INSERT IGNORE into marshall_sources (marshallId,raDeg,decDeg,dateCreated,sherlockClassification) 
SELECT transientBucketId,raDeg,decDeg,dateCreated,sherlockClassification from transientBucket t where masterIdFlag = 1
ON DUPLICATE KEY UPDATE marshallId=t.transientBucketId,raDeg=t.raDeg,decDeg=t.decDeg,dateCreated=t.dateCreated,sherlockClassification=t.sherlockClassification;

Note, on first import of this table there are ~32,000 unique sources

marshall_sources_discoveries

The marshall_sources_discoveries table hosts the survey specific data for a transient such as URL, survey name etc. The survey name of the transient must be unique in this table.

Here’s the command to export data from the transientBucket into the marshall_sources_discoveries table. Note this command tries to always update the column value with non-null entries from the transientBucket.

INSERT IGNORE into marshall_sources_discoveries (marshallId,surveyId,survey,discoveryPhase,surveyObjectUrl,transientTypePrediction,transientTypePredicationSource,hostRedshift,hostRedshiftType,referenceImageUrl,targetImageUrl,subtractedImageUrl,tripletImageUrl,finderImageUrl,lightcurveURL,dateCreated,masterId,raDeg,decDeg)
select transientBucketId,name,survey,discoveryPhase,surveyObjectUrl,transientTypePrediction,transientTypePredicationSource,hostRedshift,hostRedshiftType,referenceImageUrl,targetImageUrl,subtractedImageUrl,tripletImageUrl,finderImageUrl,lightcurveURL,dateCreated,masterIdFlag,raDeg,decDeg from transientBucket t
where survey not in ("atel","atel-coords","atel-names","ATLAS FP","bright sn list","TNS","LSQ discoveries","non-pessto atel")
ON DUPLICATE KEY UPDATE
  marshallId = IF(t.transientBucketId is not null, VALUES(marshallId), marshall_sources_discoveries.marshallId),
  surveyId = IF(t.name is not null, VALUES(surveyId), marshall_sources_discoveries.surveyId),
  survey = IF(t.survey is not null, VALUES(survey), marshall_sources_discoveries.survey),
  discoveryPhase = IF(t.discoveryPhase is not null, VALUES(discoveryPhase), marshall_sources_discoveries.discoveryPhase),
  surveyObjectUrl = IF(t.surveyObjectUrl is not null, VALUES(surveyObjectUrl), marshall_sources_discoveries.surveyObjectUrl),
  transientTypePrediction = IF(t.transientTypePrediction is not null, VALUES(transientTypePrediction), marshall_sources_discoveries.transientTypePrediction),
  transientTypePredicationSource = IF(t.transientTypePredicationSource is not null, VALUES(transientTypePredicationSource), marshall_sources_discoveries.transientTypePredicationSource),
  hostRedshift = IF(t.hostRedshift is not null, VALUES(hostRedshift), marshall_sources_discoveries.hostRedshift),
  hostRedshiftType = IF(t.hostRedshiftType is not null, VALUES(hostRedshiftType), marshall_sources_discoveries.hostRedshiftType),
  referenceImageUrl = IF(t.referenceImageUrl is not null, VALUES(referenceImageUrl), marshall_sources_discoveries.referenceImageUrl),
  targetImageUrl = IF(t.targetImageUrl is not null, VALUES(targetImageUrl), marshall_sources_discoveries.targetImageUrl),
  subtractedImageUrl = IF(t.subtractedImageUrl is not null, VALUES(subtractedImageUrl), marshall_sources_discoveries.subtractedImageUrl),
  tripletImageUrl = IF(t.tripletImageUrl is not null, VALUES(tripletImageUrl), marshall_sources_discoveries.tripletImageUrl),
  finderImageUrl = IF(t.finderImageUrl is not null, VALUES(finderImageUrl), marshall_sources_discoveries.finderImageUrl),
  lightcurveURL = IF(t.lightcurveURL is not null, VALUES(lightcurveURL), marshall_sources_discoveries.lightcurveURL),
  dateCreated = IF(t.dateCreated is not null, VALUES(dateCreated), marshall_sources_discoveries.dateCreated),
  masterId = IF(t.masterIdFlag is not null, VALUES(masterId), marshall_sources_discoveries.masterId),
  raDeg = IF(t.raDeg is not null, VALUES(raDeg), marshall_sources_discoveries.raDeg),
  decDeg = IF(t.decDeg is not null, VALUES(decDeg), marshall_sources_discoveries.decDeg);

The first export resulted in ~30,000 new rows. The ~2,000 difference from the marshall_sources and marshall_sources_discoveries is mainly due to old sources from the TNS not having an internal survey name.

marshall_sources_photometry

Here’s the command to export from the transientBucket to marshall_sources_photometry:

INSERT IGNORE INTO marshall_sources_photometry (marshallId,surveyId,observationDate,observationMJD,magnitude,magnitudeError,filter,telescope,instrument,dateDeleted,limitingMag,dateCreated)
select transientBucketId,name,observationDate,observationMJD,magnitude,magnitudeError,filter,telescope,instrument,dateDeleted,limitingMag,dateCreated from transientBucket t where survey not in ("atel","atel-coords","atel-names","ATLAS FP","bright sn list","TNS","LSQ discoveries","non-pessto atel") and magnitude is not null
ON DUPLICATE KEY UPDATE
marshallId = IF(t.transientBucketId is not null, VALUES(marshallId), marshall_sources_photometry.marshallId),
observationDate = IF(t.observationDate is not null, VALUES(observationDate), marshall_sources_photometry.observationDate),
magnitudeError = IF(t.magnitudeError is not null, VALUES(magnitudeError), marshall_sources_photometry.magnitudeError),
filter = IF(t.filter is not null, VALUES(filter), marshall_sources_photometry.filter),
telescope = IF(t.telescope is not null, VALUES(telescope), marshall_sources_photometry.telescope),
instrument = IF(t.instrument is not null, VALUES(instrument), marshall_sources_photometry.instrument),
dateDeleted = IF(t.dateDeleted is not null, VALUES(dateDeleted), marshall_sources_photometry.dateDeleted),
limitingMag = IF(t.limitingMag is not null, VALUES(limitingMag), marshall_sources_photometry.limitingMag),
dateCreated = IF(t.dateCreated is not null, VALUES(dateCreated), marshall_sources_photometry.dateCreated);

The first export resulted in ~365,000 new rows.

marshall_sources_spectra

Here’s the spectra export command:

INSERT IGNORE INTO marshall_sources_spectra (marshallId,surveyId,survey,observationDate,observationMJD,transientRedshift,transientRedshiftNotes,spectralType,telescope,instrument,reducer,dateCreated,classificationWRTMax,classificationPhase) 
select transientBucketId,name,survey,observationDate,observationMJD,transientRedshift,transientRedshiftNotes,spectralType,telescope,instrument,reducer,dateCreated,classificationWRTMax,classificationPhase from transientBucket t where spectralType is not null
ON DUPLICATE KEY UPDATE
marshallId = IF(t.transientBucketId is not null, VALUES(marshallId), marshall_sources_spectra.marshallId),
survey = IF(t.survey is not null, VALUES(survey), marshall_sources_spectra.survey),
observationDate = IF(t.observationDate is not null, VALUES(observationDate), marshall_sources_spectra.observationDate),
transientRedshift = IF(t.transientRedshift is not null, VALUES(transientRedshift), marshall_sources_spectra.transientRedshift),
transientRedshiftNotes = IF(t.transientRedshiftNotes is not null, VALUES(transientRedshiftNotes), marshall_sources_spectra.transientRedshiftNotes),
spectralType = IF(t.spectralType is not null, VALUES(spectralType), marshall_sources_spectra.spectralType),
telescope = IF(t.telescope is not null, VALUES(telescope), marshall_sources_spectra.telescope),
instrument = IF(t.instrument is not null, VALUES(instrument), marshall_sources_spectra.instrument),
reducer = IF(t.reducer is not null, VALUES(reducer), marshall_sources_spectra.reducer),
dateCreated = IF(t.dateCreated is not null, VALUES(dateCreated), marshall_sources_spectra.dateCreated),
classificationWRTMax = IF(t.classificationWRTMax is not null, VALUES(classificationWRTMax), marshall_sources_spectra.classificationWRTMax),
classificationPhase = IF(t.classificationPhase is not null, VALUES(classificationPhase), marshall_sources_spectra.classificationPhase);

The first export resulted in ~11,000 new rows.

Not yet implemented but I will use it to keep track of source related files (spectra, finder charts, image stamps, lightcurve pngs etc) from around the web.