Home
The following are my thoughts about implementing a distributed computing system to build and maintain a centralized database of information provided by insurance companies as required by the United States government.
Most people in the U.S. are at least familiar with parts of how medical insurance works, but much goes on behind the scenes that isn't exposed to patients. There are four key parts to how medical insurance billing in the U.S. works. I will refer to these as the four P's: plan/payor, provider, procedure, and payment. These four labels are not entirely accurate, but they're easy to remember, and I will explain them next.
- Plan/Payor: This is the insurance company that makes Payments to the Provider for the Procedures performed. The individuals receiving care are often responsible for portions of the bill as co-pays or deductibles, but the Payor here is the insurance company. Payors may have a thousand different plans within their network.
- Provider: This is the doctor, group of doctors, advanced practitioners such as nurse practitioners or physician assistants, or the hospital that performs and bills for the Procedure.
- Procedure: This label is not entirely accurate because, in addition to actual procedures performed, it includes all the provider's direct expenses associated with the care provided. If you've ever seen a medical bill for even a simple outpatient surgery, you'll know that there will be dozens of items listed, everything from the anesthesiologist's time and the anesthesia to the pain medication given after the surgery. For every Procedure (i.e., procedures and consumables), there is a designated code. There are thousands of these codes, covering everything from a syringe to a heart transplant.
- Payment: Insurance companies negotiate payment rates for each Provider for each Procedure. This is the part of the system that is now required by U.S. law to be made available to the public. Thus, patients can now theoretically make an informed decision about their choice of provider, which includes the cost to the insurance company. I will use "negotiated rates" and Payments interchangeably.
Unfortunately, I feel this data will be of very little use to most patients, who care primarily about what their deductibles and co-pays are. That said, if deductibles are a major concern, patients may be able to save money by finding a provider that has negotiated a lower rate with the insurer. The people most likely to make use of this information are the providers themselves, who can now readily see how much other providers in their area are being paid. Ultimately, I see this giving leverage to the providers to negotiate higher rates, which will then increase the cost of insurance. Hopefully, I'm wrong.
One issue with the system of negotiated rates is that most providers have negotiated rates for most procedures, even if they do not perform the procedure. This is because once a person graduates from medical school, they are legally allowed to perform any medical procedure. It is the Hippocratic oath, hospital administration, available facilities, and common sense that keep doctors from performing procedures beyond their area of expertise. However, in an emergency, it may be necessary to perform a procedure well outside of what a doctor routinely does.
Thus, someone browsing for the cost of, say, a hip replacement may find a provider that would charge nothing for the procedure. However, there is no chance that provider would ever do a hip replacement, which is why the negotiated rate happened to be zero. While such entries do appear in the data, they are not common, because most contracts have standard rates for most procedures, and providers only negotiate the rates they might conceivably bill. Finding the best price for a procedure therefore requires that the patient know which providers actually perform it. For example, one finds over 50 million(!) negotiated rates for a heart transplant in the United Healthcare data alone. Nearly every provider in their data has multiple negotiated rates for a heart transplant (because they have different contracts with different plans), yet, of course, only a very small fraction of those rates are for providers who would ever perform one.
Another issue is that, as mentioned above, what patients view as a single procedure is actually billed as tens or hundreds of sub-procedures, each with its own negotiated payment. Take the negotiated rates for a heart transplant: most are around $2,500, which appears exceptionally cheap until one remembers that an actual heart transplant will carry hundreds of other charges.
Even if patients knew who in their vicinity performs a procedure and, of course, what insurance plan they have, the U.S. government only requires the insurance companies to make the data "machine-readable." There is no requirement that the data be "human readable" or easily searchable. When you multiply thousands of Payor/Plans with tens of thousands of Providers and tens of thousands of Procedures, you wind up with a huge amount of data, tens of terabytes per insurance company. Thus, the plan here is to make the data more readily available and searchable by people who are not data scientists.
Let's first understand the data involved by looking at some from Humana. Humana has provided its data in a simple yet inefficient format that lists every negotiated rate on a separate line/row. Each row contains the following information (I have included sample values for each category).
Information | Example 1 | Example 2 |
---|---|---|
Reporting Entity Name | Humana Inc | Humana Inc |
Reporting Entity Type | Health Insurance Issuer | Health Insurance Issuer |
Plan | Acme HMO | Acme HMO |
Last Updated On | 2022-09-21 | 2022-09-21 |
Version | 1.0.0 | 1.0.0 |
NPI | 1407402209,1518592237 | 1982715538,1306820253 |
TIN | 616001218 | 020574738 |
Type | ein | ein |
Negotiation Arrangement | ffs | ffs |
Name | Obturator Prosthesis Modification | Palatal Augmentation Prosthesis |
Billing Code Type | CDT | CDT |
Billing Code Type Version | 2022 | 2022 |
Billing Code | D5933 | D5954 |
Description | Obturator Prosthesis Modification | Palatal Augmentation Prosthesis |
Negotiated Type | negotiated | negotiated |
Negotiated Rate | 180.10 | 5073.81 |
Expiration Date | 9999-12-31 | 9999-12-31 |
Service Code | - | - |
Billing Class | professional | professional |
Billing Code Modifier | - | - |
Additional Info | - | - |
Bundled Billing Code Type | - | - |
Bundled Billing Code Version | - | - |
Bundled Billing Code | - | - |
Bundled Description | - | - |
Most of these data labels are self-explanatory; however, a few may not be. The NPI is the "national provider identifier" that all providers--individuals and groups--have. The NPI for an individual never changes, but provider group NPIs can change if the group changes ownership. The TIN is the "taxpayer identification number" that all businesses have. Most of the time, this will be an EIN, or "employer identification number," which is like a social security number for businesses. The TIN uniquely identifies provider groups such as hospitals, clinics, and group practices that might not have an NPI. In the two examples given, there are two practitioners in each provider group. You can look up information about a provider at the NPI registry (https://npiregistry.cms.hhs.gov/search). The first NPI in Example 1 (1407402209) belongs to Elleanor Gray Vogt, a nurse practitioner in Lexington, KY, who focuses on pediatrics. The second member of that group is Lindsay Paige Jenkins, a nurse practitioner focused on family medicine. The acronym "ffs" stands for "fee-for-service," not the more common off-color meaning.
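To make the record layout concrete, here is Example 1 represented as a Python dict. This is only an illustration: the field names are my own shorthand rather than an official schema, and the empty bundled-billing fields are omitted.

```python
# Example 1 as a flat record. Field names are my shorthand, not Humana's
# official labels; empty ("-") fields are left out.
record = {
    "reporting_entity_name": "Humana Inc",
    "reporting_entity_type": "Health Insurance Issuer",
    "plan": "Acme HMO",
    "last_updated_on": "2022-09-21",
    "version": "1.0.0",
    "npi": ["1407402209", "1518592237"],   # two practitioners in the group
    "tin": "616001218",
    "tin_type": "ein",
    "negotiation_arrangement": "ffs",      # fee-for-service
    "name": "Obturator Prosthesis Modification",
    "billing_code_type": "CDT",
    "billing_code_type_version": "2022",
    "billing_code": "D5933",
    "description": "Obturator Prosthesis Modification",
    "negotiated_type": "negotiated",
    "negotiated_rate": 180.10,
    "expiration_date": "9999-12-31",
    "billing_class": "professional",
}
```

Every one of the roughly 600 billion lines repeats a record of this shape, which is what makes the redundancy discussed below so costly.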
Humana has split its 50 terabytes of compressed data into roughly 500,000 files of approximately 100 MB each. Each of these files contains about 1,200,000 lines like the two examples above, meaning there are about 600 billion(!) negotiated rates.
It should be clear from the two examples why this is an inefficient way to exchange data: there is a lot of redundant information (most other Payors' data is about ten terabytes). For example, the first three items in each data line are identical. Why not replace them with a single column containing a short key, say "HA", plus a separate lookup table whose "HA" row has three columns: "Humana Inc" in the first, "Health Insurance Issuer" in the second, and "Acme HMO" in the third?
As a slightly more complicated example of how one can reduce the amount of data without reducing the information contained, consider the Name, Billing Code Type, Billing Code Type Version, Billing Code, and Description columns. Not only will these often be repeated--essentially once for each provider group--but they are all tied together. Thus, we can collapse all these columns and replace them with a table with identifying keys. We can do the same with the NPI, TIN, and Type columns, since these are also tied together and often repeated, and likewise for the bundled billing columns.
The reduced data would then include the following tables.
Payment Information | Example 1 | Example 2 |
---|---|---|
Payor ID | HA | HA |
Last Updated On | 2022-09-21 | 2022-09-21 |
Provider ID | 12345 | 67890 |
Negotiation Arrangement | ffs | ffs |
Procedure ID | 23456 | 78901 |
Negotiated Type | negotiated | negotiated |
Negotiated Rate | 180.10 | 5073.81 |
Expiration Date | 9999-12-31 | 9999-12-31 |
Service Code | - | - |
Additional Info | - | - |
Bundled Procedure ID | 0 | 0 |
Version | 1.0.0 | 1.0.0 |
Payor Information | Example |
---|---|
Payor ID | HA |
Reporting Entity Name | Humana Inc |
Reporting Entity Type | Health Insurance Issuer |
Plan | Acme HMO |
Provider Information | Example 1 | Example 2 |
---|---|---|
Provider ID | 12345 | 67890 |
NPI | 1407402209,1518592237 | 1982715538,1306820253 |
TIN | 616001218 | 020574738 |
Type | ein | ein |
Billing Information | Example 1 | Example 2 |
---|---|---|
Billing ID | 23456 | 78901 |
Negotiation Arrangement | ffs | ffs |
Name | Obturator Prosthesis Modification | Palatal Augmentation Prosthesis |
Billing Code Type | CDT | CDT |
Billing Code Type Version | 2022 | 2022 |
Billing Code | D5933 | D5954 |
Description | Obturator Prosthesis Modification | Palatal Augmentation Prosthesis |
Billing Class | professional | professional |
Billing Code Modifier | - | - |
Bundled Billing Information | Example |
---|---|
Bundled Billing ID | 0 |
Bundled Billing Code Type | - |
Bundled Billing Code Version | - |
Bundled Billing Code | - |
Bundled Description | - |
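The collapse described above can be sketched in a few lines of Python. This is a minimal illustration, assuming flat records shaped like the dict fields in the first table; IDs are simple sequential integers assigned in order of first appearance (the ID-assignment question is revisited below).

```python
# Minimal sketch: normalize flat negotiated-rate records into key tables
# plus a compact payments table. Field names are my own shorthand.
def normalize(records):
    payors, providers, billings = {}, {}, {}
    payments = []
    for r in records:
        # setdefault reuses an existing key if this payor/provider/billing
        # has been seen before; otherwise it assigns the next sequential ID.
        payor_key = (r["reporting_entity_name"], r["reporting_entity_type"], r["plan"])
        payor_id = payors.setdefault(payor_key, len(payors))
        provider_key = (tuple(r["npi"]), r["tin"], r["tin_type"])
        provider_id = providers.setdefault(provider_key, len(providers))
        billing_key = (r["billing_code_type"], r["billing_code_type_version"], r["billing_code"])
        billing_id = billings.setdefault(billing_key, len(billings))
        # The payment row now carries three small IDs instead of many strings.
        payments.append((payor_id, provider_id, billing_id, r["negotiated_rate"]))
    return payors, providers, billings, payments
```

Running this over the two examples produces one payor row, two provider rows, two billing rows, and two compact payment rows, mirroring the tables above.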
With only two negotiated rates, as here, this restructuring can actually produce more data. The savings when you have billions of negotiated rates is illustrated by the two tables that now have only one row: across billions of negotiated rates, the billing and provider information would otherwise be repeated thousands of times. Thus, we can estimate--at least for Humana--the size of each table. There will still be 600 billion entries in the payments table, but there will only be 12 instead of 25 columns of data.
The payor table will be filled with all the payor's plans. Let's say Humana has 1,000 unique plans; the table will then be 1,000 rows and four columns. The provider table will only contain the number of unique providers in the data set. Assuming one million unique providers, the table will be one million rows and four columns. There are roughly 50,000 billing codes, so the billing table will be 50,000 rows and nine columns. If there are 5,000 bundled billing codes, that table will be 5,000 rows and five columns. We can then tally the number of individual pieces of data, or items, in a table by multiplying the rows and columns. The following table summarizes this for the Humana estimates before and after breaking the single large table they provide into smaller ones.
Method | Payment Table | Payor Table | Provider Table | Billing Table | Bundled Billing Table | Total |
---|---|---|---|---|---|---|
Old | (600 billion) * 25 | 0 | 0 | 0 | 0 | 15 trillion |
New | (600 billion) * 12 | 1000 * 4 | (1 million) * 4 | 50,000 * 9 | 5,000 * 5 | 7.2 trillion |
Note that in the new method, the payor, provider, billing, and bundled billing tables contain about 4.5 million items combined and contribute almost nothing to the total. The savings appear to be about 50%, but they will be much larger than that because the items in the new payment table are "smaller" than those in the old one; the old method repeated a lot of "wordy" information. We can replace many other columns with key tables to reduce storage requirements further. Take, for example, the Negotiated Type column. There are only a handful of types in the data, so we could use a three- or four-bit key to identify them. If the strings are encoded in UTF-8, each character takes eight bits of data, so the word "negotiated" takes 80 bits to store. Replacing it with a key reduces the storage requirements for that information by a factor of 20. In practice, the new files are roughly eight times smaller than the old ones.
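As a small illustration of the Negotiated Type arithmetic (the list of type values below is illustrative, not necessarily Humana's exact set):

```python
# Dictionary-encode a low-cardinality string column. With only a handful of
# distinct values, a few bits replace an 80-bit UTF-8 string.
NEGOTIATED_TYPES = ["negotiated", "derived", "fee schedule", "percentage", "per diem"]
TYPE_TO_KEY = {t: i for i, t in enumerate(NEGOTIATED_TYPES)}

string_bits = len("negotiated".encode("utf-8")) * 8  # 10 chars * 8 bits = 80 bits
key_bits = len(NEGOTIATED_TYPES).bit_length()        # 5 values fit in 3 bits
```

With 80 bits per string versus a 3- or 4-bit key, the factor-of-20 savings quoted above follows directly.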
Because there is so much data and insurance companies are required to update it every month, it makes sense to distribute the effort of downloading and manipulating it into a more searchable form.
The graphic below gives a broad outline of how this distribution will work.
There are three layers. The top layer is the database layer that we want to create. This will ultimately have an API and a user-friendly interface for programmers and the general public. The middle layer contains the computers (nodes) that will collect the information from the bottom layer, which consists of the insurance companies.
The primary challenge here is developing a system for communication between each layer. The nodes will talk to the insurance companies simply via HTTP because the insurance companies have made the data available from their websites. That leaves the system used for the nodes to talk to the database, and this is where the challenge lies.
The core of the effort here is to match a combination of the payor, provider, and procedure with a negotiated price. It would be great if we could pick a Payor ID, a Provider ID, and a (Bundled) Billing ID and then simply look up the negotiated rate. And this is indeed what we would ultimately like to do. However, the data is not provided by the insurance companies in that way.
A big issue with doing this in a distributed way is assigning consistent payor, provider, and procedure IDs. If we allow the nodes to create the IDs, then we have to make sure each node creates the same ID that any other node would produce given the same information. For example, if one node sees the provider information given in Example 1, and a second node sees the same information in some other negotiated rate, we need them to produce the same ID, or we lose the savings of having a separate provider table. We also need to worry about ID collisions: say one node encounters Example 1 in the data and a second node encounters Example 2; we need to ensure they don't happen to generate the same ID.
One way to do this is to have an algorithm, known to each node, that will produce the same ID given the same information. A trivial way of doing this is simply concatenating all the information into a single string. The IDs would then be just as long and contain the same information as the data, so there are no savings; we might as well not have created an ID. Another tool computer scientists have is what is called a hash. This is an algorithm that consumes some data of arbitrary size and produces a very large number. The space of possible values is so astronomically large that it is very unlikely two different data sets will produce the same hash. A common choice is the MD5 algorithm, which produces a 128-bit hash. There are 2^128 ~= 10^38, or 1 followed by 38 zeros, possible values in 128 bits of data. You may encounter numbers like this around the internet as crypto-currency addresses, though they are written as alphanumeric strings to reduce the apparent number of digits.
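A sketch of how a node might derive a deterministic MD5-based ID from provider fields (the choice of fields, the sorting, and the separator are my own assumptions):

```python
import hashlib

# Derive a deterministic 128-bit provisional ID from provider fields.
# Any node hashing the same fields gets the same ID; sorting the NPIs
# makes the result independent of the order they appear in the data.
def provider_hash(npis, tin, tin_type):
    payload = "|".join([",".join(sorted(npis)), tin, tin_type])
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

h1 = provider_hash(["1407402209", "1518592237"], "616001218", "ein")
h2 = provider_hash(["1518592237", "1407402209"], "616001218", "ein")
# h1 == h2: same provider, same ID regardless of NPI order.
# len(h1) == 32: 128 bits rendered as 32 hex characters (16 bytes raw).
```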
Hashes are great and have made much of modern computing possible, but they are necessarily still quite large. Recalling that 128 bits are equal to 16 bytes, each hash will consume 16 bytes of memory or storage space. While there will not be many hashes in the ID tables, trillions will be in the payment information table, and every trillion hashes consume 16 terabytes of space. We could use smaller hashes but then the chances of getting the same hash from two different sets of data grow.
If we could simply use sequential numbers for each ID, the savings in space would be tremendous. Let's say we expect there to be no more than ten million different provider IDs necessary. Using 24 bits will provide us with over 16 million unique IDs. If we expect no more than 100,000 different billing IDs, 17 bits will provide 131,000 unique ones. As an overestimate, let's say we use 24-bit (3-byte) numbers for each ID; then every trillion IDs in the payment information table will take three terabytes of space or less than one-quarter the space of the MD5 hashes.
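The bit-count arithmetic above is just the ceiling of a base-2 logarithm:

```python
import math

# How many bits are needed to give each of n items a unique sequential ID?
def bits_needed(n_ids):
    return math.ceil(math.log2(n_ids))

# Ten million provider IDs fit in 24 bits (2**24 = 16,777,216 values);
# 100,000 billing IDs fit in 17 bits (2**17 = 131,072 values).
provider_bits = bits_needed(10_000_000)  # 24
billing_bits = bits_needed(100_000)      # 17

# A trillion 3-byte (24-bit) IDs take 3 TB versus 16 TB for MD5 hashes.
terabytes_per_trillion = 10**12 * 3 / 10**12  # 3.0
```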
The challenge is to determine a way for the nodes to create unique IDs that will be the same given the same data. The simple answer is to not have the nodes generate the IDs at all. The payors/plans, providers, and procedures are all relatively static, changing infrequently. These tables can also readily be produced and nearly fully populated before nodes start processing the payment information. The payor/plan table can be created simply by parsing the filenames the payors provide for download. Procedure code tables are readily available. The only challenge is the provider table; we are currently extracting all the providers from the currently posted data. This will be complete soon, long before any code is ready to be pushed to nodes to process the data.
When a node first requests work from the server, it will download these key tables to use as it parses data. Should it find something for which there is no key, a protocol outlined below will be used. Periodically, likely after a batch of payments has been submitted to the server, nodes will ask for an update to the key tables. The payor/plan and procedure key tables will likely be small enough to keep in memory as dictionaries, so nodes can rapidly find the key associated with a payor/plan or procedure. The provider table will be orders of magnitude bigger, and it will likely best be stored in and referenced from a small local database; the other key tables will also be stored in a local database when the node is not processing data.
Unfortunately, these tables will occasionally need data appended. Let's assume a node discovers a new provider (the protocol outlined here works for payors/plans and procedures too). There needs to be a protocol for the node that discovers this new provider to relay that information to the server. There are many ways to do this, but, as in the discussion of hashes above, we must be careful about multiple nodes encountering the same new information and generating distinct IDs for it. Using hashes as temporary IDs seems to be a good way to proceed: a node encountering a provider that is not documented will create a hash of the data and use it as the ID. When that hash-based ID is transmitted to the centralized server with the other processed data, the server will recognize the hash ID and assign it a unique numeric ID. The server will maintain a small table relating these hash IDs to the newly generated numeric IDs. Thus, if another node encounters this same new provider before its local database has been updated with the new numeric ID, the server will recognize that there is already a numeric ID for the hash ID the second node transmitted. This table will be pruned periodically and should never be larger than about 100 entries.
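A minimal sketch of the server-side bookkeeping just described (the class and method names are hypothetical, not an existing implementation):

```python
# Server-side registry mapping provisional hash IDs to permanent numeric IDs.
# Multiple nodes reporting the same new provider get the same numeric ID.
class IDRegistry:
    def __init__(self, next_id=0):
        self.next_id = next_id   # next unassigned sequential numeric ID
        self.pending = {}        # hash ID -> numeric ID; pruned periodically

    def resolve(self, hash_id):
        """Return the numeric ID for a hash ID, assigning a new one if unseen."""
        if hash_id not in self.pending:
            self.pending[hash_id] = self.next_id
            self.next_id += 1
        return self.pending[hash_id]

registry = IDRegistry(next_id=1_000_000)
a = registry.resolve("9b74c9897bac770ffc029102a200c5de")  # first node reports it
b = registry.resolve("9b74c9897bac770ffc029102a200c5de")  # second node, same hash
# a == b: both nodes end up with the same numeric ID for the new provider.
```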
First, I'm almost certain there is no good database mapping provider taxonomy to procedure codes. Why would there be? At best, it would be very fuzzy, and until now, who would want it? Besides the insurance companies, of course, who I'm sure use something like it (or a history of charges) to check for coding errors. I've been working under this assumption for the last month, since I couldn't find anything and convinced myself there's no reason for it to exist. Without this information, the data is essentially useless for its intended purpose. The only use I can see is for providers to negotiate higher rates with insurance companies; it's useless to consumers as is.
So, one option is to try to create this taxonomy-to-procedure-code database ourselves, either by reverse engineering it from the data or by building something that at least performs some sanity checks and refining it from there.
The alternative is to find out who was the driving force behind the executive order that started this and make them understand that their vision for the information--that it "...will provide additional access to pricing information and enhance consumers' ability to shop for the health care that best meet their needs"--will never be met without more information from the insurance companies about who is billing for what.