New features #2

kuatroka · 2021-10-17T17:06:17Z

Hi Brian. I need to wrap my head around a proper explanation of the new features and, more importantly, explain the why of them. I'll try to do it tomorrow. Meanwhile, I'd like to comment on what you have already mentioned in the issue's thread.
You said the code treats all the downloaded filing files in the same way: by parsing through the XML structure.
The big problem is that not all files have an XML structure. In fact, most of them don't. Somewhere between 1999 and 2012 the filings are just .txt files with no helpful tags to extract tables at all. XML only began to be applied sometime in 2012/2013. Unfortunately, I don't know exactly when XML became a mandatory requirement by the SEC.
You can check it by downloading/looking at any 13F file from 1999 or 2000.

This is one of the big hurdles (in my opinion).
Another complexity is that even these non-XML files differ among themselves depending on the company filing them.

For now, I think a good idea could be to identify which files are XML and which are non-XML, so your parsing code gets applied only to the XML files and doesn't error out when it sees the old format, and instead sends a message or a log entry: "old file" or something like that.
It's a big topic and I'll share more ideas on it and on why it's important (imho) to somehow parse the old data too.
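
A minimal sketch of that idea, assuming the XML-era filings can be recognised by an `<?xml` declaration somewhere in the file. That marker, the logger name and the helper names `is_xml_filing` / `parse_if_xml` are all assumptions for illustration, not something taken from the project's code:

```python
import logging
from pathlib import Path

logger = logging.getLogger("filing_format")

# Assumption: XML-era filings contain this marker somewhere in the file.
XML_MARKER = "<?xml"


def is_xml_filing(text, marker=XML_MARKER):
    """Return True if the filing text contains the (assumed) XML marker."""
    return marker.lower() in text.lower()


def parse_if_xml(path, xml_parser):
    """Run xml_parser on XML filings; log and skip old plain-text filings."""
    text = Path(path).read_text(errors="replace")
    if not is_xml_filing(text):
        logger.warning("old file, skipping: %s", path)
        return None
    return xml_parser(text)
```

Here `xml_parser` stands for whatever function already parses the XML filings; the old files simply produce a log entry and `None` instead of an exception.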

briancaffey · 2021-10-17T17:55:31Z

Maintainer

Thanks @kuatroka this is a good point. Thanks for pointing out that some of the files are of a different format that doesn't contain XML.

I agree, for the time being the file processing can be skipped for files that are not in XML form. I have mostly been testing out data from recent quarters. Interested to hear why you think this is important.

3 replies

kuatroka Oct 18, 2021
Author

Hi Brian. I was trying to put my ideas on "paper", so to speak, and it turned out to be not such a simple exercise, so instead I created a small presentation and recorded a narration for it. I'll try to explain what I'm trying to achieve on a very high level. Please check it out and let me know what you think. If it's of interest, we can get deeper into it and start identifying the priority features and working on them together.

In the recording, I explain why the widest range of data is important. Btw, I think the easiest way to tell whether a file is XML or not is to just scan the file for the explicit tag .

At the end of the recording, I'm sharing my screen so you can see that I'm still facing local installation issues. I think something is wrong with my local ports/IPs, because everything seems to work independently; it's just that the frontend doesn't talk to the backend.

Here are the links to the recording and the presentation. I added your gmail email to authorise the viewing of both of them.

briancaffey Oct 19, 2021
Maintainer

Thanks for making the recording. I had a listen and there are some great ideas.

I have thought about how the position exits / new positions need to be calculated; this would be a good item to start with. For some of the other ideas, we can maybe open new discussions or create tickets. Prioritizing efforts will be important. I have made some progress, but it may be easier to rewrite some of what I already have in order to make the backend more flexible.
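
As a rough starting point for that calculation, here is a sketch that compares the holdings of one filer across two consecutive quarters. It assumes each quarter has already been reduced to a dict mapping CUSIP to shares held; the function name `position_changes` and that data shape are hypothetical, not the project's actual models:

```python
def position_changes(previous, current):
    """Compare two quarters of holdings (dicts of CUSIP -> shares held).

    Returns new positions, exited positions and the share delta for
    positions present in both quarters whose size changed.
    """
    prev_ids, curr_ids = set(previous), set(current)
    return {
        "new": sorted(curr_ids - prev_ids),
        "exited": sorted(prev_ids - curr_ids),
        "changed": {
            cusip: current[cusip] - previous[cusip]
            for cusip in sorted(prev_ids & curr_ids)
            if current[cusip] != previous[cusip]
        },
    }
```

This only looks at share counts, so it does not need any price data; valuing the changes would be a separate step.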

FYI, it looks like you might be running an IIS Server that is occupying port 80 on your host machine. Check out this thread to try to stop it: https://superuser.com/questions/1377068/how-do-i-disable-the-iis-server-on-windows-10-and-free-up-port-80

Alternatively, you could try changing the port that the nginx service in the docker-compose file is using. For example, if you wanted to use port 8089 for nginx, you could change the ports mapping like this:

  nginx:
    container_name: nginx
    build:
      context: ./nginx
      dockerfile: dev/Dockerfile
    ports:
      - "8089:80"
    [...]

Then you could access the application on http://localhost:8089. You would be able to access the /api or /admin routes as well as the frontend application, all from the same port.
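
To confirm that something on the host is already listening on a port (for example port 80) before changing the mapping, a quick check could look like the sketch below. The helper name `port_in_use` is made up for illustration; it only relies on Python's standard `socket` module:

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return s.connect_ex((host, port)) == 0
```

For example, `port_in_use(80)` returning `True` while the containers are stopped would suggest another program, such as IIS, is holding the port.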

That's great that you have been able to build some of the data, and I agree that how data is added could be made simpler. I was thinking of two options:

1 - Add the Quarter and Year only, save the filing list and then process it. This will download the file from the SEC and process it.
2 - Add the Quarter and Year and a custom file. This could be useful for debugging. For example, you could upload a filing list that only has a few filings (there are usually ~6000+ filings per filing list / master.idx file). This option would not download the file from the SEC; instead it would use the file that is uploaded through the Django admin UI.
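
To make the debugging option more concrete, here is a sketch of how a filing list could be cut down to a handful of entries. It assumes the usual pipe-delimited master.idx layout (CIK|Company Name|Form Type|Date Filed|Filename); the function name `parse_master_idx`, the default `13F-HR` filter and the `limit` parameter are assumptions for illustration:

```python
def parse_master_idx(text, form_types=("13F-HR",), limit=None):
    """Parse master.idx text into a list of dicts, keeping selected form types.

    Header, separator and blank lines are skipped because they do not
    consist of exactly five pipe-delimited fields starting with a numeric CIK.
    """
    rows = []
    for line in text.splitlines():
        parts = line.strip().split("|")
        if len(parts) != 5 or not parts[0].isdigit():
            continue
        cik, company, form_type, date_filed, filename = parts
        if form_type not in form_types:
            continue
        rows.append(
            {
                "cik": cik,
                "company": company,
                "form_type": form_type,
                "date_filed": date_filed,
                "filename": filename,
            }
        )
        if limit is not None and len(rows) >= limit:
            break
    return rows
```

Calling it with something like `limit=5` would give a small list that could be written back out and uploaded as the custom file.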

kuatroka Oct 19, 2021
Author

Glad it was of value to you!

Thanks for the ideas on how to solve the frontend/ports issue. The IIS server fix didn't work for some reason, but the idea of changing the port worked well, so I'm finally 100% in!

Re features. You are right, the effort/feature prioritisation will be tough, but I think if we keep the end/conceptual goal in mind, we'll deal with it :)

"..I have thought about how the position exits / new positions need to be calculated, this would be a good item to start with.."
It's up to you of course, but I'd propose to wait with this feature for now because it means dealing with a new Data Source. A Stock Price Data Source to be exact, and a new DS adds complexity while we haven't finished battling the SEC-13F DS yet. Besides, even if we have the new DS with stock prices, we have no way of connecting them to the SEC data, as 13F filings don't have Company Tickers in them. We still need to do the CIK and CUSIP or "Name of Issuer" mapping to Stock Ticker.
I would propose to first work on solving every single issue we have with this DS (13F) and then go further with integrating other DSs and connecting them. So, the first things that I would suggest focusing on are:

1 - Correct the Date - meaning we move from using the "Date Filed"/"FILED AS OF DATE" to the "CONFORMED PERIOD OF REPORT". We could call it "Filing Date" and then create a new column, "Filing Quarter", that would be derived from the "Filing Date". This new date should be used to download the filings and to put them into the correct quarter buckets. See the gif below that shows how the current date doesn't represent the real filing quarter.
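
A sketch of how the period and the quarter could be derived from a filing's header text. It assumes the header line looks like `CONFORMED PERIOD OF REPORT:` followed by a date in YYYYMMDD form; the regex and the names `filing_period` and `quarter_of` are assumptions for illustration:

```python
import re
from datetime import datetime

# Assumption: the header carries the period as an 8-digit YYYYMMDD value.
PERIOD_RE = re.compile(r"CONFORMED PERIOD OF REPORT:\s*(\d{8})")


def filing_period(header_text):
    """Return the CONFORMED PERIOD OF REPORT as a date, or None if absent."""
    match = PERIOD_RE.search(header_text)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d").date()


def quarter_of(d):
    """Return (year, quarter) for a date, with quarters numbered 1 to 4."""
    return d.year, (d.month - 1) // 3 + 1
```

With this, a filing whose period is 31 December but which was filed in February would land in Q4 of the earlier year rather than in Q1 of the filing year.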

2 - Filing Data Load - I didn't quite get your 2nd option, but I understand it would be a good option for debugging purposes and I love it. For the normal data load, though, I would propose that maximum simplicity should be our goal. What do I mean...
I think it would be great to abstract away as much of the complexity as possible. I think the "Filing Lists" option could ideally be removed from the Django tab and it should all be part of one process in the "Filings" option "Add". On the "Add Filing" screen (the standard one, not the debug one) we could have only two options: A - Load All Data (one button that loads all filings since inception without the need to input any dates; good for real-world use of the app). B - Select Date Range to Load (with two calendars where we select the From_Date and To_Date, and behind the scenes we derive which quarter each date belongs to).
I would remove CIK, Filename and Datafile as confusing, but I would keep them under a separate tab "debugging" as per your suggestion. I would definitely keep "Form Type" as it will help us later in bringing in different filings. I would rename it, though, to keep the terminology consistent. I'd call it "Filing Type", because otherwise it feels like there are two different things, Form and Filing.
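
For option B, the "behind the scenes" step of turning a From_Date / To_Date pair into quarters could look roughly like this. The function name `quarters_between` is hypothetical:

```python
from datetime import date


def quarters_between(from_date, to_date):
    """Return every (year, quarter) touched by the inclusive date range."""
    if from_date > to_date:
        raise ValueError("from_date must not be after to_date")
    year, quarter = from_date.year, (from_date.month - 1) // 3 + 1
    last = (to_date.year, (to_date.month - 1) // 3 + 1)
    quarters = []
    while (year, quarter) <= last:
        quarters.append((year, quarter))
        quarter += 1
        if quarter > 4:
            year, quarter = year + 1, 1
    return quarters
```

Each returned pair could then be used to fetch the filing list for that quarter.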

3 - Split into XML and non-XML - This one could be important for step 2 (above): if we select a date where the filings are still non-XML, a pop-up would appear on the loading page warning the user to change the date and suggesting the correct "From Date".

4 - Parsing the non-XML files - That's a biggy and I have some code that deals with some of Berkshire Hathaway's old filings. I'll send you the code later.

If you want, I can create separate issues for these 4 topics, but of course only if you think they are worth dedicating your time to. I will also be helping of course.

P.S. There are a couple of open source libraries that "sort of" deal with the same issue of downloading the SEC's data, but they have shortcomings. Maybe you know of them already, but if not, take a look because maybe it's worth reusing them if needed.

sec-edgar-downloader - quite simple, but good, and it accepts both CIK and Stock Ticker, so it has solved the CIK to Ticker mapping, which we could reuse. The problem is that it needs the list of CIKs to download anything.
sec-edgar - This one could have been a very good option for downloading everything we need by just adding a range of dates. The problem is that it's very unreliable. Maybe it's just my case, but when I launch a download it constantly stops mid-process or can't find the module, VSCode IntelliSense constantly highlights the module as not present, and once, after a perfect download of a year's worth of data, when I simply changed the range to a different year, it broke with many cryptic errors. Anyway, this is just in case you want to reuse some of it.

kuatroka · 2021-10-19T17:42:27Z

Author

Hi @briancaffey. Here is an example of code that parses a non-XML 13F filing. This code only works for most of the txt-based filings from Berkshire Hathaway. Unfortunately, other companies have different variations and that's the problem. Somehow these variations have to be accounted for. Just remember that I'm not a good dev, so this code is veeeeery childish, but some of it works :)

Put this code in a root folder and create another folder in it called "data". Download this filing into the "data" folder.
Run the code and it will generate a .json file with the content of the 13F table from the .txt file. I don't bring in all the fields. Copy the .json code and paste it here to view it as a table.
Feel free to try other files downloaded manually from here to see different variations of .txt files and the output .json tables. Just remember to change the name of the .txt file in line 203

def main():
    fname = (
        DATA_DIR
        / r"0001054420-99-000019.txt"
    )

and rename the output .json for each new .txt file in line 212 so the outputs don't get overwritten

    tables_idx = get_tables_idx(data)
    tables = build_tables(data, tables_idx)
    json_obj = tables_to_json(tables)
    # print(pd.read_json(json_obj))
    save_json(json_obj, DATA_DIR / "output_99-000019.json")

Code

# System imports
from pathlib import Path
import json
import re
from re import search

# Third party imports
import numpy as np
import pandas as pd

# Matches table headings such as "FORM 13F   ______"
pattern = re.compile(r"(FORM 13F\s+_+)")

HOME_DIR = Path("./")
DATA_DIR = HOME_DIR / "data"
TABLE_TO_LOOK_FOR = "Form 13F Information Table"
# TABLE_TO_LOOK_FOR2 = "FORM 13F   ______________________"
# TABLE_TO_LOOK_FOR3 = "FORM 13F        ____________________________"


TABLE_OPEN = "<TABLE>"
TABLE_CLOSE = "</TABLE>"
START_COLUMN_MARK = "<S>"

COLUMN_NAMES = [
    "Table #",
    "Name of Issuer",
    "Title of Class",
    "CUSIP",
    "Market Value x$1000",
    "Shares or Principal Amount",
    "Investment Discretion A",
    "Investment Discretion B",
    "Investment Discretion C",
    "Other Managers",
    "Voting Authority A",
    "Voting Authority B",
    "Voting Authority C",
]

COLUMN_NAMES_SHORT = [
    "Table #",
    "Name of Issuer",
    "Title of Class",
    "CUSIP",
    "Market Value x$1000",
    "Shares or Principal Amount",
    "Investment Discretion A",
    "Investment Discretion B",
    "Investment Discretion C"
]

def get_tables_idx(lines):
    """Return (start, end) line indexes for every 13F table found in lines."""
    tables_idx = []
    n = len(lines)

    for i, l in enumerate(lines):
        if l.strip() != TABLE_TO_LOOK_FOR and not pattern.search(l):
            continue

        # find the start of the table: the first line beginning with "<S>"
        k = i + 1
        while k < n and not lines[k].startswith(START_COLUMN_MARK):
            k += 1

        # find the end of the table: the first line containing "---" or "="
        j = k + 1
        while j < n and "---" not in lines[j] and "=" not in lines[j]:
            j += 1

        # skip headings that are never followed by a column-marker line
        if k < n:
            tables_idx.append((k, j))

    return tables_idx


def build_tables(lines, tables_idx):
    tables = []
    for start, end in tables_idx:
        clients = []
        first_line = lines[start]

        # Column boundaries come from the positions of the "<S>"/"<C>" markers
        start_of_col = [i for i, v in enumerate(first_line) if v == "<"] + [
            len(first_line)
        ]
        # Lists (not tuples) so the boundaries can be adjusted below
        start_end = [[v, start_of_col[i + 1]] for i, v in enumerate(start_of_col[:-1])]

        # Move the boundary between the 4th and 5th columns one character left
        start_end[3][1] -= 1
        start_end[4][0] = start_end[3][1]

        client = ""
        for line in lines[start + 1 : end]:
            # A line shorter than the first column holds only (part of) a name
            if len(line) < start_end[0][1]:
                client += " " + line.strip()
            else:
                values = [line[a:b].strip() for a, b in start_end]
                if client:
                    values[0] = f"{client.strip()} {values[0]}"
                    client = ""
                elif not values[0] and clients:
                    # Empty name: reuse the name from the previous row
                    values[0] = clients[-1][0]
                clients.append(values)

        tables.append(clients)

    # Post-process once, after all tables have been collected
    tables2 = []
    for table_num, table in enumerate(tables, start=1):
        if not table:
            tables2.append([])
            continue
        df = pd.DataFrame(table)[[0, 1, 2, 3, 4]].copy()

        # Case where the first line of the name sits above the line with the numbers
        empty = df[(df[1] == "") & (df[2] == "") & (df[3] == "") & (df[4] == "")]
        if not empty.empty:
            del_index = empty.index[0]
            if del_index + 1 in df.index:
                # Prefix the name from the empty line onto the following data line
                df.loc[del_index + 1, 0] = f"{df.loc[del_index, 0]} {df.loc[del_index + 1, 0]}"

        # Drop rows without market value / shares, then fill title of class and CUSIP downwards
        df[[3, 4]] = df[[3, 4]].replace("", np.nan)
        df = df.dropna(subset=[3, 4]).copy()
        df[[1, 2]] = df[[1, 2]].replace("", np.nan).ffill()

        # Prefix every row with its table number
        tables2.append([[f"table {table_num}", *row] for row in df.values.tolist()])

    return tables2




def save_json(obj_, fname):
    with open(fname, "w") as f:
        json.dump(obj_, f, indent=4)


def tables_to_json(tables):
    as_json = {}
    print(len(tables))
    for i, t in enumerate(tables):

        entries = []
        for entry in t:
            entry_obj = {}
            for k, v in enumerate(entry):
                entry_obj[COLUMN_NAMES[k]] = v
            entries.append(entry_obj)
            # as_json.update(entries)

        as_json[f"table_{str(i+1).zfill(2)}"] = entries
        

    return as_json


def main():
    fname = (
        DATA_DIR
        / r"0001054420-99-000019.txt"
    )
    with open(fname, mode="r") as f:
        data = f.readlines()

    tables_idx = get_tables_idx(data)
    tables = build_tables(data, tables_idx)
    json_obj = tables_to_json(tables)
    # print(pd.read_json(json_obj))
    save_json(json_obj, DATA_DIR / "output_99-000019.json")


if __name__ == "__main__":
    main()


# TODO: problems with 0000950134-08-002742.txt

# The name of the company is not always parsed correctly. There are too many variations
# and dependencies on other columns. One possible solution is to deal with it
# after the main data is extracted: check the CUSIP numbers against the Name of Issuer and
# assign the correct ones somehow based on statistics.
# An example of a Name not being parsed correctly is 0000950134-08-002742.txt - it merges two names

# For Berkshire Hathaway, XML started to appear from Q2 2013, from filing date 2013-08-05
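
As an alternative to pasting the .json into an online viewer, the output could be flattened into CSV text and opened in a spreadsheet. This is only a sketch that assumes the output shape produced by `tables_to_json` above (a dict of table name to list of row dicts); the helper name `tables_json_to_csv` is made up:

```python
import csv
import io


def tables_json_to_csv(json_obj):
    """Flatten {"table_01": [row_dict, ...], ...} into one CSV string."""
    rows = [
        {"table": name, **entry}
        for name, entries in json_obj.items()
        for entry in entries
    ]
    # Collect column names in first-seen order so the CSV header is stable
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

The returned string can be written to a .csv file next to the .json output.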

0 replies
