The 9 Mistakes That Are Ruining Your Data

9/5/2017 | SAMANTHA FOX

The 9 Mistakes That Are Ruining Your Data
Samantha Fox

Samantha Fox | Senior Product and Data Analyst | Bio

If you’re reading this, you have some interest in data. Welcome to my world! Though long seen as a geeky subject, data is becoming more embedded in all our worlds, and facility with it is a skill more and more people are expected to have. Whether you personally provide data to others or consume it or both, please read on; I hope you pick up a useful tidbit or two.

In my past few years of being a water data wrangler, I’ve seen the good, the bad and the ugly. I decided to make a list of data don’ts that I hope will be helpful to data providers, as well as all the professionals who use their data sets, as the water community moves transparency forward.

#1: You Can’t Get There From Here

Location data means everything in water. You need to know, as exactly as possible, where a well or a point of diversion or an effluent outfall really is in order to make meaningful analyses and decisions.

What data providers can do: If you have to place something at a centroid because you can only approximate its location, at least note that in the data. Make sure your data storage constraints or conversion from legacy systems haven’t rounded your lat/longs to too few decimal places. If you carry data in DMS (degree/minute/second) notation or UTMs rather than decimal degrees, consider providing locations in other notations to prevent conversion errors by those using your data. Make sure your lat/longs are of the correct sign and in the correct columns, not transposed.

What data wranglers can do: Plot locations on a map along with PLSS, land parcels or other boundaries to detect placement at centroids or lat/longs that have been rounded too much (they’ll appear in a checkerboard pattern, or will have dozens of items stacked on one spot). Check your map projections if things look “off” or not aligned with other boundaries like streams or county lines. Always check that latitudes are positive and longitudes are negative (in the contiguous US).

#2: Terrible, Horrible, No Good, Very Bad Dates

Nobody likes bad dates, but they seem to be everywhere. Don’t be responsible for one, and know what to do if it happens to you…

What data providers can do: Check for badly formed dates before releasing your data: years like 1000 or 4084, or days that don’t exist (November 34th or 12/00/1931). Ideally, don’t let these bad dates in the door in the first place – enforce constraints in your data entry processes, and spend time doing cleanup of data before releasing it.

What data wranglers can do: Check for bad dates at the outset. Throw the date values into Excel and filter the column; this is a quick and easy way to see if there are any values that don’t fall into the date hierarchy that Excel automatically creates.

#3: Lonely Codes

Cryptic codes without lookups hinder understanding, and their significance may also be overlooked by some users of the data. Use type codes, for example, left as “A, B, C, D and U,” may be secrets to important indicators of how land is used and taxed, but single-letter abbreviations are easily overlooked by users of the data who may not guess what is buried there.

What data providers can do: When you create data exports, include the spelled out values along with the codes themselves, or include a lookup table with the data you provide.

What data wranglers can do: Look for fields containing single or short alphanumeric values and make sure you understand what they mean before you hand the data over for analysis.

#4: Mixing Apples and Oranges

Letters and numbers commingled in fields that should be one or the other can result in incorrect conversions, double counting, or lost data during import processes.

What data providers can do: Data like “17.3 gpm” belongs in two separate columns – one for the number and one for the text – not bunched together in one. And try not to put data like “17.3 gpm” and “30 AF” in the same column when they’re not measurements of the same thing.

What data wranglers can do: Examine data fields named like Flow, Volume, Storage, or Diversion for hidden unit descriptions; visually inspect the first 100 rows of a dataset for concatenations. Always inspect varchar or alphanumeric fields for mixed data types. When you import data, try to enforce data types and capture any errors that tell you when data doesn’t fit an expected format.

#5: You Say Tomato, I Say Tomahto

Duplicates based on wrong spellings cause confusion and look unprofessional. For example, water beneficial use may be listed as “Irrigation” as well as “Irrigaiton” within the same dataset.

What data wranglers can do: Ideally, only allow correct spellings in data entry processes; if that’s not possible, run a process that cleans up duplicate values before delivering data into the world.

What data wranglers can do: Always select the distinct values out of any text field to look for the variations. Create your own lookups to collate and correct values for your own use.

#6: Memory Game

This is a fun game we play when we can’t tell how to join two tables together. Is it these two fields that match? Is it those other two? It becomes a guessing game. Land parcels often exhibit this problem when the parcel shapes come from one agency, and the ownership data comes from another: agency A calls the identifier the APN, and agency B calls it Parcel. Similarly, a well gage may have a Station ID in one dataset and a State Well Number in another.

What data providers can do: If you’re able to name your columns in a way that indicates what their relationship is to columns in other tables, try to do so. If that’s not possible, a database diagram showing table relationships is really helpful, or a readme document that details how to understand relationships.

What data wranglers can do: Try joining on similarly named fields. Make phone calls to the data purveyors. Inquiries can result in improved documentation for others and agencies usually do appreciate people looking at their data with care.

#7: The X Files

Outliers show up in most datasets. These are values way outside the expected norms, such as wells with a 70,000 foot depth or a water right of 70 million acre-feet.

What data providers can do: As the sole authoritative source and steward of data, data providers must be the ones to weed these out. If outliers and errors can’t or shouldn’t be deleted, they can at least be flagged. Error codes attached to data help users understand the types of errors that occur and why, and users can then make informed decisions about which data to include in their analysis.

What data wranglers can do: Plotting data in an Excel chart or calculating the minimum, maximum and standard deviation are simple ways to understand the range and variance of a dataset and spot outliers and obvious data entry errors.

#8: Existential Angst

Is a zero the same as a null? It may seem like an esoteric question, but to data people, there is a difference. Blanks, nulls, and zero values may need to be handled differently. A zero may indicate a verified value of 0, while a blank may mean the value was not measured or is not applicable; depending on whether this is a flow or a priority date, the differences can be quite meaningful.

What data providers can do: Ensure that your important data fields don’t have a combination of these values without providing a guide to what each value means.

What data wranglers can do: Don’t assume on this one. When in doubt, call the data steward.

#9: You Can’t Download This

The worst problem with data is having no data at all. Providing data to the world puts eyes on it, which can improve quality if the data steward is willing to respond to inquiries and suggestions on increasing its usability.

What data providers can do: Post data, post often…and document where you can, to save time on support calls. If you can’t post it, consider posting details on how to get in touch with you to request it. And thank you in advance for sharing your data.

What data wranglers can do: Thank your data stewards and promote a spirit of cooperation and appreciation when you get in touch. Be responsible for understanding what the data means and its limitations, and then go do good things with it.

I hope folks everywhere will benefit from this list, but especially California, which has been on my mind lately. After the recent passage of the Open and Transparent Water Data Act (AB 1755), more of California’s data will be coming online and into the hands of users. This guide contains some simple solutions to identify and transform data, save time, and enable better, faster decision-making for water districts, engineers, analysts, consultants and other data consumers.