Outline of Project Milestones

This project is in a very early stage at the time of writing. Below is a brief outline of the necessary steps, not necessarily in a fixed order:

  • Develop a test framework with basic features necessary to facilitate the desired API calls
    • Focus on Twitter API, but with intention of eventually expanding into other domains
    • Object-oriented approach (see the class sketch below)
      • "API Client" class instance should have a single function per API endpoint
        • Differences in usage (e.g. which input parameters are provided) should be handled through optional arguments
      • "User" class as an abstraction of web accounts
      • "Data Manager" class for redundancy checking, etc.
        • Owns the "User" instances
        • This is effectively the main server class instance
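
A minimal sketch of how these three classes might relate (all names, fields, and method signatures here are illustrative assumptions, not a finalized design):

```python
class APIClient:
    """One method per Twitter API endpoint; differences in usage
    are handled through optional arguments."""

    def __init__(self, bearer_token):
        self.bearer_token = bearer_token

    def get_followers(self, user_id, cursor=None, count=5000):
        # Placeholder: would call GET followers/ids with these options.
        raise NotImplementedError


class User:
    """Abstraction of a single web account (initially a Twitter account)."""

    def __init__(self, user_id, handle=None):
        self.user_id = user_id
        self.handle = handle
        self.followers = set()


class DataManager:
    """Owns all User instances and performs redundancy checking;
    effectively the main server-side class."""

    def __init__(self):
        self.users = {}  # user_id -> User

    def add_user(self, user_id, handle=None):
        # Redundancy check: only create Users we have not seen before.
        if user_id not in self.users:
            self.users[user_id] = User(user_id, handle)
        return self.users[user_id]
```
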
  • Perform preliminary review of data collection scope and limitations
    • Study API rate limits in conjunction with the specific requests planned (see the feasibility sketch below)
    • Estimate size of network to explore
      • Currently, Binance has the most followers (6.7M) of any known crypto account
      • One possible approach is to begin with the followers of a single account
      • Alternatively, the search could begin with the combined followers of many large accounts (Binance, Coinbase, CoinMarketCap, Coingecko, VitalikButerin, CZ, Saylor, APompliano, Bitcoin, Ethereum)
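
As a rough feasibility check: under Twitter's standard v1.1 rate limits for GET followers/ids (up to 5,000 IDs per request, 15 requests per 15-minute window), a single API user needs roughly a day to enumerate the followers of an account of Binance's size:

```python
FOLLOWERS = 6_700_000
IDS_PER_REQUEST = 5_000    # max IDs per GET followers/ids call
REQUESTS_PER_WINDOW = 15   # standard rate limit per 15-minute window
WINDOW_MINUTES = 15

requests_needed = -(-FOLLOWERS // IDS_PER_REQUEST)           # ceiling division
windows_needed = -(-requests_needed // REQUESTS_PER_WINDOW)
hours = windows_needed * WINDOW_MINUTES / 60
print(f"{requests_needed} requests over ~{hours:.1f} hours")
# -> 1340 requests over ~22.5 hours
```
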
  • Decide upon factors that influence the nature of the data collection itself
    • These quantities can be modified and expanded over time
    • The initial focus should be limited to those decisions necessary for a minimum viable prototype
    • Examples of (potentially) important decisions include (a sample configuration object is sketched after this list):
      • Search order and heuristics
      • Metrics for decisions about whether to include a given entity within the "Crypto User" network
      • Priority direction (e.g. followers vs. following)
      • Minimum amount of data to collect in every case
      • Scenarios in which to collect additional data beyond minimum (if ever)
      • Scenarios in which to collect repeat data over time, instead of ignoring repeated queries (if ever)
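
One way to keep these decisions explicit and easy to revise over time is a single configuration object. The fields below are hypothetical examples mirroring the list above, not settled choices:

```python
from dataclasses import dataclass

@dataclass
class CollectionPolicy:
    """Tunable knobs for the crawl; every value below is a placeholder."""
    search_order: str = "breadth-first"    # search order / heuristic
    min_known_overlap: float = 0.05        # inclusion metric: fraction of followed
                                           # accounts already in the "Crypto User" network
    priority_direction: str = "followers"  # "followers" vs. "following"
    min_fields: tuple = ("id", "handle", "follower_count")  # minimum data per account
    extra_data_when: str = "never"         # scenarios for collecting beyond the minimum
    recheck_interval_days: int = 0         # 0 = never repeat a completed query
```
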
  • Finalize details of client-server approach
    • Division of tasks and data
    • Representation of data object instances as files, database tables, etc.
    • Pseudocode for each individual system and the communications between them (an example task message follows below)
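
As a starting point for that pseudocode, the messages exchanged between server and client might look like the following (all field names and values are assumptions for illustration):

```python
import json

# Hypothetical task message, server -> client.
task = {
    "task_id": "t-000001",          # unique ID, also useful for parity checks later
    "endpoint": "followers/ids",    # which APIClient method to invoke
    "params": {"user_id": 123456, "cursor": -1},
}

# Hypothetical result message, client -> server.
result = {
    "task_id": task["task_id"],
    "user_id": 123456,              # account whose followers were fetched
    "data": [111, 222, 333],        # follower IDs returned by the API call
    "next_cursor": 0,               # 0 = pagination finished
}
print(json.dumps(task), json.dumps(result), sep="\n")
```
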
  • Develop client application for (standard rate limited) single-API-user requests, managed through a queue
    • The queue will be self-managed to begin with; eventually all queue additions will be assigned by the server (see the client-loop sketch below)
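
A minimal sketch of the self-managed client loop, assuming a fixed requests-per-window rate limit and a hypothetical handle_task dispatcher:

```python
import time
from collections import deque

def run_client(api_client, tasks, handle_task,
               window_seconds=15 * 60, requests_per_window=15):
    """Drain a self-managed task queue while respecting a fixed rate-limit window.

    handle_task is a hypothetical dispatcher mapping a task onto the matching
    APIClient method; eventually `tasks` would be assigned by the server.
    """
    queue = deque(tasks)
    while queue:
        for _ in range(min(requests_per_window, len(queue))):
            handle_task(api_client, queue.popleft())
        if queue:
            time.sleep(window_seconds)  # wait out the rate-limit window
```
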
  • Develop a separate local service (eventually a server application) for:
    • Processing data provided by a single client (results of API requests) into a single persistent "database" spanning disparate client sessions (see the persistence sketch below)
    • Decisions about future searches and expansion of the network, based on client-provided data and the previously defined heuristics
    • Assignment of additional requests to the (single) client queue
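
A sketch of the persistence step, assuming results carry the fields from the example message above and using SQLite's INSERT OR IGNORE as a simple redundancy check:

```python
import sqlite3

def merge_result(db_path, result):
    """Insert follower edges from one client result, skipping rows already held."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS edges ("
        "follower_id INTEGER, followed_id INTEGER, "
        "PRIMARY KEY (follower_id, followed_id))"
    )
    con.executemany(
        "INSERT OR IGNORE INTO edges VALUES (?, ?)",
        [(fid, result["user_id"]) for fid in result["data"]],
    )
    con.commit()
    con.close()
```
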
  • Expand local service to function on an external server, separate from the client
  • Expand server application to manage multiple disparate clients running simultaneously
  • Develop additional software related to "post-collection" distribution of data
    • It's possible that an additional "data server" may be developed, instead of using the "control server" for this purpose
      • Clients would receive their instructions from the "control server" but would send their resulting data to the "data server" (see the sketch below)
    • The fundamental question is how to grant each "client user" access to the pool of data
    • Data distribution is likely to contribute heavily to the operating costs of this project
      • Some sort of peer-to-peer approach, possibly even a private torrent network, might be the most realistic solution to this
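
From a client's perspective, the control/data split would look roughly like this (URLs and routes are placeholders, not a defined protocol):

```python
import json
import urllib.request

CONTROL_SERVER = "https://control.example.org"  # assigns work
DATA_SERVER = "https://data.example.org"        # receives results

def fetch_task():
    """Ask the control server for the next assigned task."""
    with urllib.request.urlopen(f"{CONTROL_SERVER}/next-task") as resp:
        return json.load(resp)

def submit_result(result):
    """Send a completed task's data to the data server."""
    req = urllib.request.Request(
        f"{DATA_SERVER}/results",
        data=json.dumps(result).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```
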
  • Revisit the metrics and heuristics defined earlier, going beyond "minimum viable prototype" into more advanced coordination of multiple clients
  • Decide upon factors relating not to the data itself, but to the operation of this network as an organization
    • Error prevention
      • Through parity checking of purposefully redundant queries (see the parity sketch below)
        • This may require significant thought to balance time-efficiency with security
        • We can probably get away with assumptions of altruism, but this becomes harder as the organization scales
      • Through other means?
    • Incentivization of user contributions
      • Client runtime
      • Software development (not a requirement for organization membership)
    • Ensuring access to data by contributor users (those who reach some minimum client runtime threshold)
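
Parity checking could start as simply as hashing each client's copy of a result and comparing, sketched below under the assumption that results are JSON-serializable. Note that exact-match hashing also assumes the underlying data did not change between the redundant queries, which is part of the time-efficiency trade-off noted above:

```python
import hashlib
import json

def result_fingerprint(result):
    """Canonical hash of a task result, for comparing redundant copies."""
    canonical = json.dumps(result, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def parity_check(copies):
    """True if all independently collected copies of a task's result agree."""
    return len({result_fingerprint(c) for c in copies}) == 1
```
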
  • Additional routes of potential optimization not mentioned above:
    • Integration of automatic LZMA(2?) compression wherever beneficial, especially with respect to the "data distribution" step (see the sketch below)
    • Expansion of data collection timeline: not only single snapshots, but change-data over time
      • As the number of simultaneously-live clients grows larger, certain snapshot-based data collection tasks become trivial
      • Should consider the priority of change-data in various contexts, in order to balance available resources
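
Python's standard lzma module already provides this (its default .xz container uses the LZMA2 filter), so a first integration could be as small as:

```python
import json
import lzma

def compress_payload(obj):
    """Serialize and LZMA-compress a payload before distribution."""
    return lzma.compress(json.dumps(obj).encode())

def decompress_payload(blob):
    return json.loads(lzma.decompress(blob))
```
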
  • Integrate blockchain data (Ethereum and possibly others) directly into search decisions made by the server
    • Example: ENS domain text records (com.twitter, com.github, etc.; see the sketch below)
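
A sketch using web3.py, which exposes an ENS helper with a get_text method in recent versions (the RPC endpoint is a placeholder, and the name is chosen only for illustration):

```python
from web3 import Web3

# Placeholder RPC endpoint; any Ethereum mainnet provider works.
w3 = Web3(Web3.HTTPProvider("https://rpc.example.org"))

# Read the com.twitter text record of an ENS name.
print(w3.ens.get_text("vitalik.eth", "com.twitter"))
```
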
  • Expand client and server software to include additional APIs
    • Examples: Google Search, GitHub, Reddit, StackExchange

The above outline is an extreme simplification, but it highlights a variety of important milestones and a general direction for development.
