Since starting my deep dives, I have come across the phrase “Prompt Engineering” repeatedly. The premise, I thought, was that if you give an AI a perfectly crafted prompt, it will do exactly what you want.
That seemed counter to my actual goals. I wanted the AI to make me faster. More efficient. If I had to sit and figure out the exact right phrasing and all of the techniques that could cajole this AI into doing what I wanted, that seemed like MORE work, not less.
Especially when I compared that to the benefit I was getting from GitHub Copilot. Using GitHub Copilot has been seamless. Once you add it to your workflow it’s a real accelerant. It actually feels like I have a pretty good intern who knows exactly the domain language I am working in at every moment. You have to pay attention so it doesn’t get off track, and you need to redirect it frequently, but it definitely speeds things up.
I have had Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4 open as a tab for at least 3 weeks now. But I figured I would start with OpenAI’s Prompt Engineering Guide, and it slowly dawned on me what the point was.
Prompt Engineering is not really for users. It’s for people building PRODUCTS based on LLMs. So, if you’re building your own custom GPT that you plan to use repeatedly or provide to other users, then that investment in time can absolutely compound.
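Here’s roughly what that compounding looks like in a product. This is just a minimal sketch using the OpenAI Python SDK; the model name, the prompt text, and the summarize_commits function are placeholders I made up, not anything from the guide. The point is that the engineered prompt gets written once and then reused on every call.

```python
# Minimal sketch: a product bakes its prompt-engineering effort into a
# reusable system prompt, so every user request benefits from the
# one-time tuning. Model name and prompt text are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The "engineered" part: persona, constraints, and output format,
# written once and shipped with the product.
SYSTEM_PROMPT = """You are a release-notes assistant for a web app.
Summarize the provided commit messages for non-technical users.
Rules:
- Use plain language, no jargon.
- Output 3-5 bullet points, each under 20 words.
- Never mention internal ticket numbers."""

def summarize_commits(commit_messages: list[str]) -> str:
    """Every request reuses the same tuned system prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "\n".join(commit_messages)},
        ],
    )
    return response.choices[0].message.content

print(summarize_commits([
    "fix: login redirect loop (JIRA-123)",
    "feat: add dark mode toggle",
]))
```

The time spent tuning that system prompt pays off on every single call the product makes, which is a very different economics than tuning a prompt for one question.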
But if you’re aiming for a single question and answer, then it’s just fine to fire off a quick question to ChatGPT, and if it’s a little confused, you can just iterate until you get what you’re looking for. There’s certainly still an articulation barrier: if you can’t describe what you want, you’re unlikely to get a decent answer, but Prompt Engineering wasn’t going to help you with that anyway.
I’m still pretty turned off by the whole concept of Prompt Engineering, due to the fuzziness of its repeatability. I’m used to writing unit tests to ensure that an algorithm behaves EXACTLY as we expect. I can quantify its correctness, accuracy, and performance, and I can tell when it’s “good enough” for my application. There are strategies for comparing outputs against “Gold Standard” answers inside an evaluation framework, but it still feels like it lacks objectivity.
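To make that concrete, a “Gold Standard” comparison is roughly a tiny eval harness: run the prompt against cases with known-good answers and report a score. Everything below is a hypothetical sketch; the ask() function stands in for whatever model call is being tested, and the cases are made up.

```python
# A tiny, hypothetical eval harness: run the prompt against cases with
# known-good ("gold standard") answers and score the checkable part.
# ask() is a stand-in for the real LLM call under test.

def ask(prompt: str) -> str:
    """Placeholder for the model call being evaluated."""
    raise NotImplementedError

GOLD_CASES = [
    {"prompt": "What is the capital of France?", "expected": "paris"},
    {"prompt": "How many days are in a leap year?", "expected": "366"},
]

def run_eval(cases: list[dict]) -> float:
    """Return the fraction of cases whose answer contains the expected string."""
    passed = 0
    for case in cases:
        answer = ask(case["prompt"]).lower()
        if case["expected"] in answer:
            passed += 1
    return passed / len(cases)
```

Even with a harness like that, a score of 0.9 isn’t “passing” or “failing” the way a unit test is; you still have to decide what’s good enough, and substring matching says nothing about whether the writing itself is any good.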
That lack of objectivity, though, is inherent to the problems that LLMs solve. My sympathy goes out to all of those English teachers out there who have to evaluate writing on both accuracy and “quality”. Counting how many of the correct answers show up in a response is a pretty good way of evaluating accuracy. But evaluating the quality of the writing is hard, and evaluating the quality of the writing given a framing (explain it to me like I’m five, or in haiku) is another skill entirely.
But if I were to do all of that tuning and tweaking to get a particular prompt behaving pretty damn well, and the underlying model changed to the next version, how much of that work would carry forward? Hopefully it’s a better model, so it would follow my Prompt Engineering even more precisely? I.e., if I set up a whole scaffold to help a sub-par LLM produce more consistent results, shouldn’t that same scaffolding make a more powerful LLM even more capable?
I think my biggest concern is that the technique simply isn’t evergreen, because I can’t be certain I can make it work the NEXT time I want to implement something. I’m not a fan of investing in skills that might not pay off. But the counterargument would be that figuring out the Facebook, Google, or LinkedIn algorithm is a good skill to learn, because it can be leveraged to provide something valuable in the moment. Knowing how to beat the algorithm is still useful even if it doesn’t work in every scenario and can change in the future.
I’m still trying to figure out where Prompt Engineering fits in the toolkit and if it’s a tool I want to practice with. If anyone has used it to great effect for a problem, I’d like to hear about it!