Ability to know how many characters were parsed #8
Comments
I'm not going to have time to work on this, but I'd happily entertain a PR if you want to take a crack at it. I'm not sure what the API would look like. It might be cleanest to introduce another function that processes the data and simply returns the length consumed rather than the payload. It would mean parsing twice, but that might not be a deal-killer.
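For comparison, the standard library's `json.JSONDecoder.raw_decode` already exposes this shape of API: it returns both the parsed value and the index where parsing stopped. The sketch below uses it to illustrate the "length consumed" idea being discussed (it lacks dirtyjson's tolerance for malformed input, so it only demonstrates the API shape, not a replacement):

```python
import json

def iter_json_objects(text: str):
    """Yield (obj, chars_consumed) for each JSON object or array found in
    text, skipping any non-JSON characters in between."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Scan for the next plausible start of a JSON value.
        candidates = [i for i in (text.find('{', pos), text.find('[', pos)) if i != -1]
        if not candidates:
            return
        start = min(candidates)
        try:
            # raw_decode returns (value, end_index), so the caller knows
            # exactly how many characters were eaten -- the feature this
            # issue is asking for.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # false start (e.g. a brace inside plain text)
            continue
        yield obj, end - start
        pos = end
```

Usage: `list(iter_json_objects('hello {"a": 1} world [1, 2]'))` yields the two parsed values along with the character counts consumed by each.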
Unfortunately, the approach I ended up going with probably isn't going to work too well for this library, since my case had extra context I was able to use: I was parsing markdown output, so I could look for opening "```json" and closing "```" tags. But yeah, it basically came down to scanning for possible starts and closings of the json object, and then trying to parse just that substring. Suffice it to say I personally don't need this feature anymore. But if anyone else wants to try to extract some value from my work, here's what I ended up writing:

````python
from typing import Generator

import dirtyjson

# `Edit` in the original signature was a project-specific type; a dict with
# 'code', 'start', and 'end' keys serves the same role here.
Edit = dict


def json_block_iter(message: str) -> Generator[str | Edit, None, None]:
    """
    Iterator to extract text and json objects from the LLM message.
    """
    original_message = message  # for debugging
    message = message.lstrip()
    while len(message) > 0:
        try:
            i = message.index('```json')
        except ValueError:
            # no more json blocks; yield any remaining text and stop
            message = message.lstrip()
            if message:
                yield message
            return
        if i != 0:
            yield message[:i]
            message = message[i:]
        message = message[7:].lstrip()  # strip the '```json' opener
        if not message.startswith('{') and not message.startswith('['):
            raise ValueError(f"Expected json block to start with {{ or [ but found {message}")
        # find candidate end indices
        delimiter = '}' if message.startswith('{') else ']'
        end_indices = [i for i, c in enumerate(message) if c == delimiter]
        # find the first end index that is valid json
        for end_index in end_indices:
            try:
                parsed_block = dirtyjson.loads(message[:end_index + 1])
                break
            except ValueError:
                continue
        else:
            raise ValueError(f"Failed to parse json block: {message}")
        # yield the block if single block, or sequentially yield each item in the list of blocks
        if isinstance(parsed_block, list):
            for item in parsed_block:
                assert 'code' in item and 'start' in item and 'end' in item, \
                    f"INTERNAL ERROR: Expected json block to have keys 'code', 'start', and 'end', but found {parsed_block}"
                yield dict(item)
        elif isinstance(parsed_block, dict):
            assert 'code' in parsed_block and 'start' in parsed_block and 'end' in parsed_block, \
                f"INTERNAL ERROR: Expected json block to have keys 'code', 'start', and 'end', but found {parsed_block}"
            yield dict(parsed_block)
        else:
            raise ValueError(f"INTERNAL ERROR: Expected json block to be a dict or list, but found {parsed_block}")
        # update message to be the remaining text
        message = message[end_index + 1:].lstrip()
        assert message.startswith('```'), f"INTERNAL ERROR: Expected json block to end with ``` but found {message}"
        message = message[3:].lstrip()
````

This yields in sequence each of the non-json parts and each of the json parts. I think adapting it for this library might be tricky, since in general there isn't a good indicator for whether a section is a valid json object or not. Conceivably there could be text containing…
I'm wondering if it would be possible to add an attribute to the result for how many characters were parsed. My use case has me parsing a string input that has random text interspersed with multiple json objects. So I want to do something like this:
But right now it's really not feasible to do this, because there's no way to measure how many characters were eaten while parsing the current object. I guess technically it would be possible to use the row/column annotation of the last element in the object/list and then find the closing delimiter, but that's super cumbersome. Having the number of characters eaten would be very useful.
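Concretely, the loop described above might look something like this (pseudocode; `chars_consumed` is the hypothetical attribute this issue is requesting, not an existing dirtyjson feature):

```python
# Pseudocode sketch of the requested usage. `chars_consumed` is the
# hypothetical attribute being asked for; it does not exist in dirtyjson.
def extract_all_objects(text):
    objects = []
    while text:
        try:
            result = dirtyjson.loads(text)
        except ValueError:
            text = text[1:]  # skip a non-JSON character and retry
            continue
        objects.append(result)
        text = text[result.chars_consumed:]  # <- the requested feature
    return objects
```

With that attribute available, the caller could advance through mixed text/JSON input without re-scanning for delimiters by hand.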