Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RefC Backend + makeString memory problems #2455

Open
vfrinken opened this issue May 4, 2022 · 6 comments
Open

RefC Backend + makeString memory problems #2455

vfrinken opened this issue May 4, 2022 · 6 comments
Labels
backend: refc C with reference counting backend implem: string performance

Comments

@vfrinken
Copy link
Contributor

vfrinken commented May 4, 2022

The way strings are handled as (char *) poses problems. I don't see an easy way out. Here is the deal:

Problem

  1. The function Value* makeString(char*) creates a copy of the input (including malloc) to create the string object.
  2. Functions, such as fastConcat also allocate new memory (which is not freed anywhere but should be)
  3. Functions, such as idris2_strerror(int) return a pointer to an error message (which should not/can not be freed)
  4. The RefC backend indiscriminately creates the following code
{
     // ffi call to _charPointer_returning_function_
    char * retVal = _charPointer_returning_function(...);
    return (Value*)makeString(retVal);
}

Thus, invoking functions such as fastPack, or fastConcat, but also idris2_currentDirectory leak memory.

As an example, I compiled Idris itself with the RefC backend (with minor changes along the way to make it compilable), and started it. Just displaying the REFL and immediately exiting, without any actual computation leaks about 1MB already.

Potential solutions

I can see three different approaches, I don't know which ones are feasible. I probably also miss an obvious solution, though.

1. changing the datatype of String from char* to something like a struct with freeing information

typedef struct {
    char* c;
    int memory_managed_externally;
}Idris2_CString;

Functions like fastConcat would then return a struct with the flag set to 0, so makeString can free the pointer, whereas other functions may set it to 1.

2. Create a convention that all FFI functions that return char* have to allocate new memory.

This new memory is then consistently freed by makeString (resp. the garbage collector whenever the String object goes out of scope).

Both of these options might break existing code (resp. FFI calls) that interacts with Idris.

3. Use different primitives in the Idris code to indicate whether or not the char* should be kept or discarded.

This, however, seems to break the code the most. CFTypes would need to distinguish between StringFreeable and StringNonFreeable. Consider:

%foreign "C:idris2_getString, libidris2_support, idris_support.h"
         "javascript:lambda:x=>x"
export
prim__getString : Ptr String -> String

used in functions such as fGetLine:

fGetLine : HasIO io => (h : File) -> io (Either FileError String)
fGetLine (FHandle f)
    = do res <- primIO (prim__readLine f)
         if prim__nullPtr res /= 0
            then returnError
            else ok (prim__getString res)

now prim__readLine returns a char * (which is then as pointer, not as string) passed to prim__getString, which converts it into a char* (very nice), but then calls makeString which allocates new memory (bad) => memory leak. This would need to change to

%foreign "C:idris2_getString, libidris2_support, idris_support.h"
         "javascript:lambda:x=>x"
export
prim__getStringFreeable : Ptr String -> StringFreeable

Very ugly, and maybe not even possible.

Personally, I prefer the second option, but I don't have a strong opinion.There are not (yet) a lot of C libraries that interact with Idris2, and the changes seem minimal.

@j-nava
Copy link
Contributor

j-nava commented May 4, 2022

FWIW, I agree the second option looks the most natural. First option might be more flexible, but maybe it could lead to unnecessary complexity in the future?

@ohad
Copy link
Collaborator

ohad commented May 5, 2022

I guess the main point that @vfrinken makes is this: the current code is quite substantially broken, so someone does need to go through all the C FFI functions (in this code base, but also in external code bases) one by one and implement whichever solution we choose.

So the decision is not really whether to make any breaking changes, but how to make the function-by-function fixes straightforward, possibly by following a simple decision procedure.

@mattpolzin
Copy link
Collaborator

I think (2) describes roughly what the FFI documentation (undoubtedly written mostly or entirely with the Scheme backend in mind) describes currently, if I am understanding correctly.

Snippet from the Idris documentation:

A char* returned by a C function will be copied to the Idris heap, and the Idris run time immediately calls free with the returned char*.

@vfrinken
Copy link
Contributor Author

vfrinken commented May 5, 2022

Yes, @ohad, that is exactly the point I was trying to make. I went through the CFFI functions used in the Idris code itself and made the changes (only the direct call to getenv(...) needed a workaround) and was able to get a code base that was free of memory leaks in one afternoon/evening, so that is at least proof-of-concept that it is doable. (https://github.com/vfrinken/Idris2-RefC-Coronado) That code also implements the PR about the GC change.

The Idris documentation is quite clear, but the Idris code itself doesn't follow the guideline. :-(

Anyway, I'm going to argue in favor of 1. now, because I like it, too:

  • Having a struct with meta-information about a string is more flexible, particularly if we want to support different encodings in the future. The struct could have some information how the input is encoded.
  • A fundamental break in how strings are imported/exported forces a user to explicitly think about the memory issue, instead of some implicit convention. To write bug-free code, maybe this is necessary. Also, the compiler would complain if you overlooked a function without changing

I won't have time to code anything in the next few days, maybe even until after the Developer Meeting. If by that time the consensus is on 2. I'm happy to turn my code into a PR.

@iacore
Copy link
Contributor

iacore commented Jun 7, 2022

A char* returned by a C function will be copied to the Idris heap, and the Idris run time immediately calls free with the returned char*.

This is exactly what's problematic. You can't return char* allocated by another allocator. Maybe the data is on the stack? Maybe the data is in .data or .text?

@dunhamsteve
Copy link
Collaborator

dunhamsteve commented Jun 22, 2022

I didn't see this until after I filed #2552, but the scheme backends have the same problem. They appear to be leaking a lot of memory (e.g. 100k leaks from fGetLine when loading parts of the compiler source in the REPL / LSP).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend: refc C with reference counting backend implem: string performance
Projects
None yet
Development

No branches or pull requests

7 participants